Before jointly founding Cosive with Kayne Naughton and Terry MacDonald, Chris Horsley (Cosive’s CTO) spent many years working in national CSIRTs in both Australia and Japan, as well as doing freelance secure software development for operations teams.
In this interview, Chris talks about the challenges of building software and doing development in SecOps teams.
Why is it so challenging to build production-worthy software in security operations teams?
In cyber security operations we're often dealing with the same things over and over again.
The work gets quite noisy. It gets quite repetitive. It can also be error-prone because there can be many steps required to perform certain operations.
For example, let's say I receive a report of a phishing URL. There are about 10 steps I’ll want to take next to investigate, like asking that the URL gets taken down and is removed at the hosting provider. Take all of those steps and multiply that by the 30 phishing URLs I get per day. That’s a lot of repetitive work.
Cyber security analysts want to solve interesting problems but instead they’re stuck with what I like to think of as trench-digging activities. That kind of work is repetitive, it’s not exciting, and you just want to make it go away. You want to spend your time looking at the one interesting phishing URL out of the thirty that you get in a day, the one that has something unique and novel about it. Naturally, you’ll start thinking of ways to automate the steps involved in handling the non-interesting phishing URLs. That's just one specific use case for software and automation, and there are many, many more in a SecOps team.
What are the drawbacks of that approach, where someone who knows basic programming creates tools and scripts for the team to use? What can go wrong there?
I should start by saying I've been this person. Everything I'm about to say is in no way derogatory, accusatory, or anything like that. It's just a fact of life: that’s how these things start.
Here’s my own personal experience. I was a person doing triage and incident response. I saw all those opportunities for automation. On a Friday afternoon, when it was quiet, I would jump on the chance to bang together some scripts to solve the worst bits of pain that I was seeing. I’d call what I’d done a proof of concept, a prototype. As most software developers would tell you, the proof of concept usually ends up in production. We were often still relying on the rough and ready prototype three months or a year later. We’d increasingly rely on it to run correctly, but because I knocked it all together on a Friday afternoon, things like logging, error handling, testing, all of those… they were luxuries as far as this thing I built goes.
You end up with Operations relying on a bunch of undocumented scripts that are running every day on a box that everybody's forgotten about. One day one of these scripts will typically stop running, and you hope people notice that it has, because unpredictable things can happen when a script starts going a bit haywire.
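As a minimal sketch of the logging and error handling that a Friday-afternoon script tends to skip, a small wrapper can make every run leave a trace and every failure visible. The names `run_job` and `example_job` are hypothetical, and the job body is a placeholder rather than real triage logic.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("secops_job")


def run_job(job, name="secops_job"):
    """Run `job`, log the outcome, and return a process-style exit code."""
    log.info("%s starting", name)
    try:
        result = job()
    except Exception:
        # Log the full traceback so a failure is recorded, not silent.
        log.exception("%s failed", name)
        return 1
    log.info("%s finished: %r", name, result)
    return 0


def example_job():
    # Hypothetical stand-in for the real work, e.g. processing reported URLs.
    return {"processed": 30}
```

A scheduled entry point could pass the return value of `run_job(example_job)` to `sys.exit`, so that the scheduler also sees a non-zero exit code when the job fails.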
If a team finds themselves in that position where they've got some scripts that are doing critical work but they don’t have the time or maybe the software development experience to “productionise” the scripts, what are some of the steps they can take to make their automations more robust?
I'm an adherent to the belief that in software development, the last 20% of the work takes about 80% of the time. That last 20% is normally documentation, making sure the software has robust test coverage, making sure it has logging and auditing, making sure there's a consistent way to deploy it onto systems rather than somebody copying and pasting code and then just dropping it somewhere and hoping it works. All of those things take time. When you've got someone who's trying to develop software inside an operational team those are all the first things to be put on the backburner indefinitely.
Fundamentally, the problem of developing software in a security team is that software development is a very focus-driven discipline and security operations is a very interrupt-driven discipline. Those two things just don't work together at all. But there are a few strategies you can employ to improve upon that.
One approach I’ve seen is that you guarantee time for software development within the SecOps team. Let's say you've got a pool of five analysts and they, by necessity, must do triage and interrupt-driven work. Could you get one of those analysts and guarantee them a day, a week, or two weeks solely to work on this piece of software and get it up to snuff so that it's robust, it's tested, it's got logging and all those other things we need to make it production-worthy?
Even that amount of time is probably insufficient, but at least it gives the analyst some focused time and hopefully management can guarantee that they're not going to get dragged into incident handling and things like that. But it's never a total guarantee because if you have a really critical vulnerability or a really critical incident, you're going to pull all of your incident handlers into triage, into analysis, and have them drop everything else. That is absolutely essential to handling that emergency. So that's the major drawback of having your analysts who might have scripting skills or software development skills build out software. They’re always going to be interrupted at some point.
The second approach I’ve seen is management assigning a software developer to build automations for the ops team. They’ll get the analysts to develop a specification for how the software is going to work. Then they’ll get the software developer to go and build it. The software developer is an expert on writing test suites and creating build pipelines and everything needed to create high-quality maintainable software. That can definitely work. Where it can fall apart is when the software developer doesn't have the cybersecurity domain knowledge required. Cyber security analysts have a lot of assumed knowledge, and it takes time to get a software developer up to speed with all of those implicit assumptions and knowledge.
Let's go back to the example of phishing sites. Let's assume for argument's sake I’m working at a bank and the phishers are targeting the bank’s customers. Generally a phishing crew will use block lists. If I'm trying to browse the phishing site from a known network range belonging to that bank, the phishing site will be configured to serve up a 404. The phishing crew will make it look like the phish is down for any traffic coming from an IP on the block list. For all the victims of the phish the site is still very much active.
This means there are a bunch of steps I need to take as a software developer to make sure my software approximates a victim. It has to use what looks like a legitimate web browser, not a Python script connecting and using the generic Python user agent. You can tell a mile off that's a script connecting to this phishing site, not a real user. So I need to build some stuff into my automated phishing handling so I look like a victim with a realistic IP address, realistic browser fingerprint, and so on (this is such a common need for analysts that we built Smokeproxy to take care of this). This takes work and it takes some knowledge of the tactics phishers use to hide their activity from analysts.
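To illustrate just the user agent part of this, here is a minimal sketch using Python's standard `urllib`. The `BROWSER_UA` string and the function names are assumptions chosen for illustration. Headers alone do nothing about IP block lists or deeper browser fingerprinting, and this is not a description of how Smokeproxy works.

```python
import urllib.error
import urllib.request

# Example desktop browser User-Agent string; an assumption for illustration.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_victim_like_request(url):
    """Build a GET request whose headers resemble a normal browser's."""
    return urllib.request.Request(
        url,
        headers={
            # Replaces the default "Python-urllib/x.y" agent.
            "User-Agent": BROWSER_UA,
            "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-AU,en;q=0.9",
        },
    )


def fetch_status(url, timeout=10):
    """Return the HTTP status code the site serves to this request."""
    req = build_victim_like_request(url)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # A 404 here may mean "taken down" or "we are on the block list".
        return err.code
```

A 404 returned to a request like this is ambiguous on its own, which is why the source IP address matters as much as the headers.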
All of this is stuff I need to convey to a software developer. There's a lot of art in there, a lot of methodology that I would need to tell the developer about. Avoiding detection is just one piece. There’s also what to do when you’re storing these artefacts. Are they malicious? Can I just put them anywhere on a file system? Do I have to take precautions about encrypting these payloads and making sure nobody can accidentally run them? There's a level of hygiene required here and it’s hard to teach. I probably won't dig much further than that, but that's the second challenge.
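One way to picture that hygiene is a small sketch that never writes a captured payload to disk in runnable form. The function names, the `.quarantined` suffix, and the single-byte XOR key are all assumptions. XOR here is only obfuscation to prevent accidental execution; it is not encryption, and a real deployment would need to decide on proper encryption and access controls separately.

```python
import hashlib
import os
from pathlib import Path

XOR_KEY = 0x5A  # arbitrary single-byte key; obfuscation, NOT encryption


def xor_bytes(data, key=XOR_KEY):
    """XOR every byte with `key`; applying it twice restores the input."""
    return bytes(b ^ key for b in data)


def quarantine(payload, store_dir):
    """Store `payload` in a non-runnable form, named by its SHA-256."""
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(payload).hexdigest()
    path = store / f"{digest}.quarantined"
    if path.exists():
        # Same content already stored; the name is the content hash.
        return path
    path.write_bytes(xor_bytes(payload))
    os.chmod(path, 0o400)  # owner read-only, no execute bit
    return path


def restore(path):
    """Return the original bytes, for analysis in a controlled environment."""
    return xor_bytes(Path(path).read_bytes())
```

Naming files by their hash also means the same payload reported thirty times is stored once.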
Then there’s the third approach, which is one of the reasons we formed Cosive: bring in an external software developer, or team of software developers, who combine the time to focus with the necessary domain knowledge. They understand the needs of a cyber security team because they’ve worked in the area before, but they also have the focus and the skills to do software development and build something that's really robust and has all the logging, auditing, and testing that you need with production-ready software.
What’s the experience like when ops teams work with software developers who do have the security and ops background required to do secure software development?
Firstly, they're hard to find. Those people are unicorns. I know that because Cosive is always trying to find them! And when we find someone who has that skill set and mindset, we’re very happy.
If you do have someone on your team like that, or you get a consultant to come in and fill that role, typically they should start by understanding the analysts’ workflow. Often that means actually looking over their shoulder and observing how they do their operations.
Every organisation is a bit different. They integrate with a raft of different tools and everybody uses slightly different tools. They’re definitely going to have some bespoke stuff that they, and only they, run. That’s why initial discovery work is really important. If you've got that “unicorn” in-house they can sit with you and learn. Even better if they were once a part of the ops team, they already know how that team works. They know the mindset, they have the domain knowledge. That’s perfect.
Not everyone is lucky enough to have that, so the next step might be to look externally. If someone works with Cosive, we'd sit down with them. We’d understand the immediate pain point they're trying to solve. We’d roll out a solution in phases, working out what we can deliver at the end of each sprint to solve more and more of those pain points. We’d do that while operating with the prerequisite knowledge about, for a phishing site analysis example, evading detection and preserving hygiene. For us, that’s already assumed knowledge for how to build a system like this.
Working with an external partner on this means the operations team can go and do all that interrupt-driven work, which is very unpredictable in terms of schedule. The software development team can just have their head down and maintain focus. They can do a predictable sprint and know that it will be two weeks and they're going to have 80 hours per person of effort available to them, for example. They're going to be able to churn out software at a much more predictable rate than somebody trying to squeeze it into an operational workflow or schedule.
What are some signs that it might be time to bring in some outside help?
I’d say the first sign is when you start continually hitting production issues. The classic example would be a critical script that runs on a schedule at 11:00 PM every night to do some batch processing. Only one person in the team has any idea how it works because there's no documentation. If someone else came along and tried to maintain that code it’d be very hard.
There are no automated test cases, so if a new person makes a change to the code, they really can't be sure if they're changing fundamental and assumed behaviour of the system that’s going to cause it to break in the future.
Maybe there's no consistent way of deploying this thing. A classic scenario you see is that there's no deployment script. Somebody just copies the script and drops it on a system. One day somebody copies that file as the wrong user or with the wrong permissions and then that 11:00 PM batch job starts breaking every night. Maybe nobody even notices for two months because it has always run silently, so there was never any visible sign of whether it was working.
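A silent failure like this can be made visible with a simple heartbeat check. The sketch below assumes the job records a timestamp only after a successful run, and a separate check flags the job when that timestamp gets too old. The function names and the heartbeat file are hypothetical.

```python
import time
from pathlib import Path


def record_success(heartbeat_path, now=None):
    """Called by the job as its last step, only after a successful run."""
    ts = time.time() if now is None else now
    Path(heartbeat_path).write_text(str(ts))


def is_stale(heartbeat_path, max_age_seconds, now=None):
    """True if the job has never succeeded or last succeeded too long ago."""
    path = Path(heartbeat_path)
    if not path.exists():
        return True
    now = time.time() if now is None else now
    last_success = float(path.read_text())
    return (now - last_success) > max_age_seconds
```

For a nightly job, a threshold a little over 24 hours (say 25 hours) would raise the alarm after one missed run rather than after two months.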
All of those things are sort of “smells” that tell you that you’ve become really dependent on this software and that you need to do that last 20% of the work, the part that takes 80% of the time, to make sure that it’s documented and has automated testing.
I've got to give a shout out at this point to having an automated test suite. I mean, in the olden days, I'm talking 20 years ago, there was usually a spreadsheet with a bunch of checklists on it and people would manually run this, add another input, change this input, and then they would give it a pass or fail on that basis. But you don't always have those testers available or that test run sheet gets out of sync with the actual software. Developing the software and the test cases in parallel is the only way to go. It means that when the next developer comes along and makes a change, they're confident that it passed all of those checks. In fact, they can't deploy the software without it going through those tests.
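As a small sketch of tests living alongside the code, here is a hypothetical helper, `defang`, with a `unittest` suite next to it. The helper and its behaviour are assumptions chosen for illustration, not part of any tool mentioned here.

```python
import unittest


def defang(url):
    """Rewrite a URL so it cannot be clicked or auto-linked by accident."""
    return url.replace("http", "hxxp", 1).replace(".", "[.]")


class DefangTests(unittest.TestCase):
    def test_scheme_is_rewritten(self):
        self.assertTrue(defang("http://example.com/login").startswith("hxxp://"))

    def test_dots_are_bracketed(self):
        self.assertEqual(
            defang("https://a.example.com/x"), "hxxps://a[.]example[.]com/x"
        )

    def test_no_live_scheme_remains(self):
        self.assertNotIn("http", defang("http://example.com"))


def run_tests():
    """Run the suite and return the result object for a deploy gate to check."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DefangTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A deployment step could call `run_tests()` and refuse to continue unless `wasSuccessful()` returns true, which is one way to ensure a change cannot ship without passing the checks.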
Of course, writing tests takes time and it slows you down. And within an ops team, it rarely happens because you don't have the time for that. But at the same time, you do want a hardened, tested, production-ready script. That's really the point you want to look at getting some other help on this thing.
When organisations have someone working on building production-ready software for ops teams (whether internal or external), how can they help them be most effective?
The biggest thing is to provide unbroken focus time.
In my travels, I’ve found that the management of an operations team is often used to very interrupt-driven work and might not have a full appreciation for how much focus is required for software development.
When you’re programming, you’re holding so many things in your head at once that as soon as someone comes along and says “Hey, can you have a look at this?” it all leaves your head. There’s a great comic about this. I’ve heard it can take about 20 minutes to get back to the same point after you’ve been interrupted when you’re doing software development.
That’s why you want to reduce the number of interruptions and unscheduled meetings and all of those things and be very respectful of the time of a software developer. Sometimes even the knowledge that you could be interrupted is nearly as bad as actually being interrupted. If someone can walk up behind you at any time and tap you on the shoulder it’s almost as if your brain becomes resistant to getting into the groove because you know it’s not going to last long.
So you really want to say to a developer: I'm guaranteeing you these three hours so that you can go away and have optimal focus time. And some people use a benchmark of around three to four hours a day as being about as much 100% effort as you can expect from a software developer, just because of the cognitive load required to hold all that context and to problem-solve continually. So that's the ideal. You want to be able to give your developers that unbroken time every day so they can do their best work.
Can scheduling necessary interruptions like meetings in clusters or on the same day help with providing teams with more time to focus?
Yes. And that's very much what we do at Cosive. We try to have Tuesday as our meeting day, which happens to be why we’re doing this chat today (on a Tuesday). We break the day into two halves, a morning and an afternoon, and try to leave at least one or the other uninterrupted so people can get into the right mindset to focus.
I know I personally also try to give people lots of notice when I want to schedule a chat. Rather than saying I want to chat to you right now, I try to give people time and predictability. I’ll say, could we chat in two hours, in three hours, tomorrow, to give people time to finish what they're doing and know that they're not going to be interrupted.
We hope the above discussion helped to demonstrate some of the challenges involved with building production-ready software and scripts in SecOps teams. If you’d like some advice on how to do this better in your own organisation, or would like to explore whether it makes sense to bring in an external partner to help, please reach out to us.