Running Your SOC Playbooks as Code: Use Cases, a.k.a. Don’t Start With Phishing

February 21, 2024

This article is part of an edited transcript of Cosive Managing Director Kayne Naughton’s keynote presentation at AusCERT2019, a large cyber security conference in Australia. You can watch the recorded presentation in full here. (Cover photo by Hunter Brumels.)

This is part 2 of a series on security orchestration. You can read part 1 here.

The problem(s) with phishing

The first thing that everyone wants to do when they get their brand new SOAR out of the shrinkwrap is solve phishing. I hate to be the one to break it to you, but if we were going to solve phishing, there wouldn’t be six or so anti-phishing vendors out there right now. (Technically malware was the first computer security problem that we struck, with the Morris worm, but in terms of things that face regular users, phishing is the first problem. Paul Graham started applying Bayesian analytics and machine learning to this stuff back in the early 2000s, and we still haven’t solved it yet.)

I know phishing is a big resource drain for most companies, because it’s very fiddly and there’s a lot going on. It’s very context sensitive. It’s very subjective, e.g., is this really an email from DHL? Is this really an email from PayPal that came from paypal-survey.com? (It is, unfortunately.) There are domains, IPs, files, files with files, zips, zips with passwords… it’s all very context-sensitive. Computers aren’t very good at nuance and context. I wouldn’t start with phishing by any means. That’s really starting on expert-mode. Start on easy-mode.

Use Case: ChatOps

I think a good place to start with orchestration is ChatOps. If you use internal chat stuff (I think almost everyone does… even the most stodgy of government departments), it’s a really natural way for humans to communicate and look things up without alt-tabbing all the time.

ChatOps also means that you can work in public. If you’ve got junior people in your team, and you’ve got an incident channel, and you’re there and looking stuff up and communicating what you’re doing, you’re effectively documenting it in front of other people, so they can see what it is that you did. Once you’ve done a few of those things using ChatOps, you can say, well what is it that we do during these incidents? Let’s pull the logs off these channels. That’s a starting point. If you’re sort of narrating for your coworkers as you’re dealing with an incident in a ChatOps channel, you can take that and formalise it as an orchestration playbook.
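As a rough illustration of the "look things up without alt-tabbing" idea, here is a minimal sketch of a chat command dispatcher. It is not tied to any particular chat platform; the `!lookup` command, the `command` decorator and the stub handler are all hypothetical names, and a real handler would query your SIEM or threat intel source.

```python
from typing import Callable, Dict, Optional

# Registry of chat commands. The command names and handlers are hypothetical.
COMMANDS: Dict[str, Callable[[str], str]] = {}


def command(name: str):
    """Register a handler for a chat command such as '!lookup'."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        COMMANDS[name] = func
        return func
    return register


@command("!lookup")
def lookup_ip(arg: str) -> str:
    # Stub: a real handler would query a SIEM or threat intel platform here.
    return f"No records for {arg} (stub lookup)"


def handle_message(text: str) -> Optional[str]:
    """Return a reply for a chat message, or None if it is not a known command."""
    parts = text.strip().split(maxsplit=1)
    if not parts or parts[0] not in COMMANDS:
        return None
    arg = parts[1] if len(parts) > 1 else ""
    return COMMANDS[parts[0]](arg)
```

Because every command and reply lands in the channel, the channel log doubles as the raw material for a later playbook.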

Use Case: Offboarding/Staff Travel

A use case for orchestration that I think is really high value, and really high pain, yet a lot of people don’t notice it, is the offboarding of staff or, for that matter, handling when people are traveling.

Almost every HR system has hooks in it. So you can go: “When has a staff member been offboarded?” Or if they resigned suddenly, effective this afternoon, or were fired, you need to treat them as high risk. If you’ve got someone who is on a month’s leave, and they shouldn’t have any duties, should they have their account active? Should they be able to log into payroll and finance systems, or whatever else? If you do have people who are suddenly terminated or leave the organisation, you probably want to lodge tickets. You might want to do something like pulling the logs from their workstation before it’s re-imaged, if you don’t already pull them back centrally. A bunch of this stuff is really easy to do.

For that matter, when someone leaves, disable their account. Turn their account off. It means that you stop paying your IT provider for supporting a user who is not really a user. If they haven’t logged in for three years because you bounced them a while ago, they shouldn’t have an active account.

It’s a really good way to tie your security into the things that matter to your organisation. And there’s probably a person whose job it is to do it by hand, and it probably really sucks, and you can give them more room to do better stuff.
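A minimal sketch of how an HR hook might be turned into a list of response actions. The event shape (`type`, `last_day`, `leave_days`) and the action names are assumptions made for illustration, not the schema of any real HR system, and the 28-day leave threshold is arbitrary.

```python
from datetime import date
from typing import Dict, List


def offboarding_actions(event: Dict, today: date) -> List[str]:
    """Map a (hypothetical) HR event to the actions an orchestrator should queue."""
    actions: List[str] = []
    if event["type"] in ("terminated", "resigned"):
        actions.append("lodge_ticket")
        if event["last_day"] <= today:
            # Sudden departure: treat as high risk and act immediately.
            actions += ["disable_account", "collect_workstation_logs"]
        else:
            # Departure is in the future: disable the account on the last day.
            actions.append("schedule_account_disable")
    elif event["type"] == "extended_leave" and event["leave_days"] >= 28:
        # Long leave with no duties: remove access to sensitive systems.
        actions.append("suspend_finance_access")
    return actions
```

Returning action names rather than performing them keeps the decision logic easy to test; the orchestration platform can then map each name to an integration.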

With travellers, I would restrict their groups. If you’ve got sysadmins who are going on a holiday, or going on a working trip to Beijing (or somewhere like that), and they don’t have sysadmin tasks, take away their sysadmin privileges. If they have access to financial systems and they are not currently doing finance stuff while travelling, turn it all off. Make them use 2FA for everything if you have the knobs and levers to let you do that.

Also, alert on any out of country log-in attempts that aren’t from people who are supposed to be travelling. If you can put all your people in a ‘Travelling’ or ‘Not travelling’ group, you can look at log-in attempts. You can discover when you get a log-in attempt from a country that someone visited two months ago. Hm… Isn’t that interesting? Maybe we need to all panic and run around the room for a while.
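The travelling/not-travelling check can be sketched in a few lines. This assumes log-in records have already been enriched with a country code, and that the "travelling" group is available as a mapping from user to the countries they are currently expected in; both structures are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def suspicious_logins(
    logins: Iterable[Dict[str, str]],
    home_country: str,
    travelling: Dict[str, Set[str]],
) -> List[Dict[str, str]]:
    """Return log-ins from a country the user is not expected to be in."""
    alerts = []
    for login in logins:
        # Expected locations: home, plus wherever the user is currently travelling.
        expected = {home_country} | travelling.get(login["user"], set())
        if login["country"] not in expected:
            alerts.append(login)
    return alerts
```

Removing a user from the travelling mapping when they return is what makes the "country they visited two months ago" case show up as an alert.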

Use Case: Detonating Malware

If you are an organisation that deals in a hands-on way with the malware that you see, this is probably a really good thing to orchestrate, rather than manually handling it. I’m sure people accidentally run malware when trying to drag it from the inbox into the security tool quite often. I’ve never seen it, but I’m sure it happens a lot. That’s a really good case for orchestration.

Pull a sample.
Run it through a sandbox. Based on what the sandbox says, and the rules on the sandbox, you can go and do a bunch of things as a result. I think getting context about the malware is particularly useful, or even, for that matter, if Antivirus fires. What type of malware was it? What does that mean? Do we need to revoke all the certificates that person has on their machine? What certificates do they have? What passwords do they have? What shared passwords do you have in your organisation? (Which are a bit naughty, but that’s a whole other thing). What else is there if you want to go into hunting mode? I’d never say you can automate hunting. But if you find something weird, and you can find some things that are similar to the ‘weird’, then that’s close enough to being hunting. And you can get that little badge in your security book that you’re now a hunter.
Based on rules/heuristics you can:
Force password resets.
Rebuild machines.
Revoke certificates.
Block domain on proxy.
Sinkhole domain.
Report to Google Safe Browsing.
Sweep for similar files on the network and pull SIEM logs for how it started running.
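The "based on rules/heuristics" step above can be sketched as a small rule table. The sandbox report fields (`score`, `behaviours`, `domains`), the thresholds and the action names are all assumptions for illustration; a real sandbox will have its own report format.

```python
from typing import Callable, Dict, List, Tuple

# Each rule pairs a predicate over the sandbox report with the actions it triggers.
# Field names, thresholds and action names are hypothetical.
Rule = Tuple[Callable[[Dict], bool], List[str]]

RULES: List[Rule] = [
    (lambda r: "credential_theft" in r["behaviours"],
     ["force_password_reset", "revoke_certificates"]),
    (lambda r: bool(r["domains"]),
     ["block_domain_on_proxy", "sinkhole_domain"]),
    (lambda r: r["score"] >= 8,
     ["rebuild_machine", "sweep_for_similar_files"]),
]


def plan_response(report: Dict) -> List[str]:
    """Return the de-duplicated, ordered list of actions the rules call for."""
    actions: List[str] = []
    for matches, rule_actions in RULES:
        if matches(report):
            for action in rule_actions:
                if action not in actions:
                    actions.append(action)
    return actions
```

The output is only a plan; destructive actions such as rebuilding machines are the kind of thing you would still route past a human before executing.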

Use Case: “Agentless” malware mitigation

One example of this might be working with Sysmon on Windows (if you don’t have it, it is free, and made by a guy who has worked at Microsoft for 15 years). Sysmon can tell you when processes execute, when they make network connections, when they make registry keys, when they do anything on your system. You can log them into your event log in Windows. You have a collector that goes and runs it into your SIEM platform. It can send notifications when something weird or bad happens. You can take those notifications and run them into your SOAR platform, and then do something about it.

You could do something as easy as “This thing tries to talk to an address that we don’t like. That’s pretty weird.” You can kill the process, lock the file, grab a copy of the file, and drop it in a quarantine folder so your analysts can deal with it. Otherwise, it might be that you get an alert, and by the time you get to it, your AV has already killed it. Having this kind of workflow where you can reach out and touch your endpoints (if you have a tested process) is particularly handy.
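A minimal sketch of the "talks to an address we don't like" check, assuming the Sysmon network connection event (Event ID 3) has already been parsed into a dictionary by your collector. The `EventID`, `DestinationIp`, `ProcessId` and `Image` keys mirror Sysmon's field names, but the dictionary shape, the blocked range and the action names are assumptions.

```python
import ipaddress
from typing import Dict, List, Tuple

# Networks we do not want endpoints talking to. 203.0.113.0/24 is a
# documentation range used here purely as a placeholder.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def triage_network_event(event: Dict) -> List[Tuple[str, object]]:
    """Return (action, argument) pairs for a parsed Sysmon network event."""
    if event.get("EventID") != 3:  # 3 = network connection detected
        return []
    destination = ipaddress.ip_address(event["DestinationIp"])
    if any(destination in network for network in BLOCKED_NETWORKS):
        return [
            ("kill_process", event["ProcessId"]),
            ("quarantine_file", event["Image"]),
        ]
    return []
```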

Say you have an outbreak. I wouldn’t want to be relying on your platinum-level support from your antivirus provider to give you the immediate response that you need for WannaCry 2.0 or NotPetya. If you know what the hash is that’s on your network, and you can run a job that finds everything with that hash, disables the file and then kills the process, you’ll save the day. You don’t want to do this all the time, but sometimes you’re going to need to do it. And you don’t want to be building these things and running shell scripts by hand when the badness happens. You want them to already be in place.
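As a sketch of the "find everything with that hash" job, the following walks a directory tree and returns files whose SHA-256 matches a known-bad value. It only finds files; disabling them and killing processes is left out, and in practice you would run something like this through your endpoint tooling rather than as a standalone script.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_files_with_hash(root: str, wanted_sha256: str) -> List[Path]:
    """Return every file under root whose SHA-256 equals wanted_sha256."""
    wanted = wanted_sha256.lower()
    matches = []
    for path in Path(root).rglob("*"):
        try:
            if path.is_file() and sha256_of(path) == wanted:
                matches.append(path)
        except OSError:
            # Unreadable or vanished files are skipped rather than aborting the sweep.
            continue
    return sorted(matches)
```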

When I say this is agentless, you need the logging agent thing, but it’s pretty agentless. There’s not really a whole lot of load in terms of running Sysmon on a box, and you’re not running a bunch of proprietary software. You can do all this for free.

Use Case: Managing Incidents

GitHub have gone extreme when it comes to ChatOps and the way they orchestrate. They were getting a 1.35Tbps DDoS attack and (I can’t find the story again but I’ll recount it as I recall it) the engineer who was on call was on his way to the airport to go to a conference or something. He heard they were getting DDoSed, flicked over to Slack and typed !shieldsup, which called a ChatOps action that changed their BGP routing to send their traffic via their DDoS protection provider.

GitHub do all sorts of stuff around organising incidents. Particularly when you’re a large software company, being able to automate the process of organising around an incident is really valuable. You probably have a really stressed out person who is booking meetings, clearing calendars, paging people, standing up calls, alerting people, getting certain people on shift. Automate and orchestrate all that stuff. It also means that your processes are always followed. You don’t realise a day later that you forgot to tell the CISO. Definitely worth doing with your orchestration platform. I don’t know whether I’d change my BGP routes via ChatOps, but if you are super brave and/or super smart, you go for it.

Ideas

In terms of ideas, if you are orchestrating, Phantom used to have cash prizes for the best entries in competitions. I think it ran every month back when they first started out, for the best integrations, the best playbooks, the best whatever. They’ve got a massive catalogue of them. The Phantom ones are also quite easy to read, so even if you’re not a Phantom customer, you can see the logical flows and integrations that people use for certain problems, and then you can leverage that knowledge to give you ideas on what you want to do next.

As I said, the Demisto playbooks are all MIT licensed, so you can literally port them. They’re a bit harder to read, but they’re all detailed, and they integrate with a lot of different things.

Data flows

I notice that when people are frustrated with their existing tools and they get a new tool, they want to use the new tool to do everything. I wouldn’t do that. You want to use your data lake to do your data lake stuff. You want to use your SIEM or your alerting engine to do your alerting. You probably want to send those alerts to your SOAR platform to do something with them, but if you can avoid it, you probably don’t want to be actively sweeping your network for badness using your SOAR platform. They’re not generally designed for that. The licensing model isn’t usually great for that. You’re better off using the right tool for the right job and using SOAR to join everything together.

The other advantage to doing it this way is when you have a component that you don’t want anymore, for example, if you’re using AlienVault and you decide to go Graylog, or drop Splunk and go to a Hadoop cluster. If you have all of your response actions in your SOAR, you’re only swapping out one component, and the webhooks go to the same place. You don’t have to reengineer everything from scratch. By the same token, if you swap SOAR platforms, that should be pretty easy.
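One way to picture the "only swapping out one component" point is a thin adapter per alert source that maps each vendor's webhook payload into one common shape, with the response logic written against that shape only. The two payload formats, the field names and the severity threshold below are invented for illustration and do not correspond to any real product.

```python
from typing import Callable, Dict

_LEVELS = {"low": 1, "medium": 5, "high": 9}


def from_siem_a(payload: Dict) -> Dict:
    # Hypothetical flat payload with a numeric severity sent as a string.
    return {"source": "siem_a", "host": payload["hostname"],
            "rule": payload["rule_name"], "severity": int(payload["sev"])}


def from_siem_b(payload: Dict) -> Dict:
    # Hypothetical nested payload with a named severity level.
    return {"source": "siem_b", "host": payload["asset"]["name"],
            "rule": payload["detection"], "severity": _LEVELS[payload["level"]]}


ADAPTERS: Dict[str, Callable[[Dict], Dict]] = {
    "siem_a": from_siem_a,
    "siem_b": from_siem_b,
}


def normalise(source: str, payload: Dict) -> Dict:
    """Convert a vendor-specific webhook payload into the common alert shape."""
    return ADAPTERS[source](payload)


def respond(alert: Dict) -> str:
    """Response logic depends only on the common shape, not on the vendor."""
    return "page_on_call" if alert["severity"] >= 8 else "open_ticket"
```

Replacing one alert source then means writing one new adapter; `respond` and everything downstream stay as they are.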

Portability/Interoperation

COPS (Collaborative Open Playbook Standard) by Demisto. It’s definitely open. It is a standard. But as far as I’m aware, they are the only implementer of said standard, and the owner of said standard. That said, it’s a pretty sensible standard. It’s also very similar in format to OpenStack-related and StackStorm-related products. They’re not quite interoperable, but they’re not that far off. It’s really cool that they wrote a standard, documented it, and allow anyone to use it. I really applaud any effort to make it so that you don’t get vendor lock-in.
OpenC2. It’s an OASIS standard on what you should do in order to mitigate something. It’s a standard for response. OASIS also controls STIX, which is the main standard for threat intel. OpenC2 is their attempt at having a standard for what you do about threat intel and how you apply that to your tooling.
Most platforms use Python code (some have JS too).
Almost all use YAML configs (or JSON, which a YAML parser can also read).

Considerations: Engineering overheads

If you’re thinking: “This is awesome! I’m going to do this as soon as I get back to work,” this will make you go… “Oh.” There are always engineering overheads. The platforms are pretty clean and easy to run, but a lot of your integrations won’t necessarily spit out data in the form that the next tool you want to feed it to expects. You might have mail filtering stuff that defangs the URLs, for example, replacing http with ‘hxxp’. If you go and pass that into your malware sandbox, it won’t consider it a valid URL. You might need stuff to adjust things back to the way they should be. Maybe something emits a list, but it should be a single IP address, or you take a URL but you need to pass a domain name to something. There’s always that little bit of engineering in the middle that means you probably want some software development/engineering in-house.

Even someone who’s done a two-day Python course probably knows enough to get you there. You don’t need to be writing big software. You just need basic scripting stuff.
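The kind of glue code in question really is small. Here is a sketch of the three examples mentioned: re-fanging a defanged URL, pulling the domain out of a URL, and reducing a list to a single IP address. The `hxxp` convention comes from the example above; handling `[.]` and `[:]` as well is an assumption based on other common defanging styles, and the function names are made up.

```python
import ipaddress
import re
from typing import Optional, Sequence, Union
from urllib.parse import urlsplit


def refang(text: str) -> str:
    """Undo common defanging: hxxp -> http, [.] -> ., [:] -> :"""
    text = re.sub(r"hxxp", "http", text, flags=re.IGNORECASE)
    return text.replace("[.]", ".").replace("[:]", ":")


def domain_of(url: str) -> Optional[str]:
    """Return the lower-cased host part of a (possibly defanged) URL."""
    return urlsplit(refang(url)).hostname


def single_ip(value: Union[str, Sequence[str]]) -> str:
    """Accept one IP or a list of IPs and return the first, validated."""
    candidate = value if isinstance(value, str) else next(iter(value), None)
    if candidate is None:
        raise ValueError("no IP address supplied")
    # ip_address raises ValueError if the candidate is not a valid IP.
    return str(ipaddress.ip_address(candidate))
```

Taking the first element of a list is a simplification; depending on the integration you may prefer to fan out and run the downstream step once per address.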

Testing

Testing is hard. When you mess up on a single machine, you mess up just that machine. When you mess up on a thousand machines at once… it’s different.

You’ve got to test your software, you’ve got to keep testing your software. You might apply a Windows patch or something that maybe makes things not work the way they used to. Firstly, you need actual testing for your integrations. But you also need to test them regularly and make sure they’re still working the way they are supposed to if you apply patches, for example. Testing is super important because you could easily, I don’t know… delete Excel from your fleet. I have worked at a place where Antivirus flagged Excel one morning and everyone lost Excel. No-one could do any spreadsheets for the day. I don’t know how we didn’t all go ‘Lord of the Flies’ and start stabbing each other with spears, but we got there.

You’ve got to be careful. You don’t want to be the one that wiped out the fleet, and it’s very easy to do with orchestration unless you have humans in the loop to say “Hey, is this a good idea? Do you want to do this?”

Same with threat intel. I would never action threat intel, or a serious alert, without having a human make the final decision to approve or reject the response.
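A human-in-the-loop gate can be sketched as a wrapper that refuses to execute an action unless an approval callback says yes. The function and parameter names here are hypothetical; in a real platform the `ask_human` step would typically be a chat prompt or ticket approval rather than a Python callable.

```python
from typing import Callable, List, Sequence


def run_action(
    action: str,
    targets: Sequence[str],
    execute: Callable[[str, str], None],
    ask_human: Callable[[str, Sequence[str]], bool],
    max_unattended: int = 0,
) -> List[str]:
    """Execute action on each target, asking a human first when the number of
    targets exceeds max_unattended. Returns the targets actually actioned."""
    if len(targets) > max_unattended and not ask_human(action, targets):
        return []  # Rejected: nothing is touched.
    completed = []
    for target in targets:
        execute(action, target)
        completed.append(target)
    return completed
```

With the default `max_unattended=0`, every action needs approval; raising it is a deliberate decision about how large a blast radius you are willing to accept without a person looking.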

Making it actually happen

Stand up a project to protect the time/resources from operational tasks. You’re probably already at 100% utilisation, so there’s no room to give yourself more room. You need to set aside 20%, or whatever, of your time so that you can get rid of the other 80% so that you get more time to do this stuff. I’m aware of one team where… I wouldn’t say they play foosball all day, but they’ve almost reached Nirvana. They’re looking for more stuff to automate, and it’s diminishing returns. They sit there, the machines go ‘boop’ at them, and they go ‘Yeah’ and accept a response. You can get there. It’s good.
Start small. Find something that is achievable. Show everyone how you achieved it, and that it’s great, and then work on the hard stuff. (Don’t start with phishing).
Find actual cost savings to show initial ROI. Not intangibles: if you can actually save money, that’s great.
Get metrics.
Put in revision/release controls early.
Think about confidentiality/creepiness. It’s easy to get invasive when gathering telemetry, so think about that.

If you’ve read this far, you might be wondering how you can get started with orchestration. This is the core of what we do at Cosive, and we can work with you to figure out what your organisation should orchestrate, and how. Get in touch.
