It’s a Company Transformation Issue, Not a Technology Issue
From Webinar GoogleNext ’17 Presented by
Thomas Martin, Founder, NephōSec | Brian Johnson, Founder, DivvyCloud
As we adopted cloud, we started treating it like a technology problem, but that actually wasn’t the case. It was a culture problem. What we realized was that the shift to self-service was incredibly important for our ability to compete. All of our competitors were out there building products, doing it much faster and getting those products to market. How could we continue to do that? How could we continue to innovate? We couldn’t do that if IT continued to get in the way. So we needed to find a way to allow our engineering organizations to access the cloud, to deploy applications and to innovate through cloud infrastructure without IT and security getting in the way. And through that process, we discovered it wasn’t just a technology problem. It was three things that came together to create an issue for security.
These three things combined led to this incredibly difficult problem. How do we deal with this scale? How could we possibly, as an organization, understand everything that was changing and react to it in a reasonable amount of time, given that the traditional security and IT processes we had in place were super slow? That is the problem that we see out there. This is not a technology issue, it’s a Company Transformation issue. How do you get ahead of this and how do you deal with scale?
So what does this mean from a practical aspect?
In my past experience, somewhere between 100 and 200 applications the whole structure starts to fall down. You really have to begin to think not only about the CI/CD process but also about all the configurations, not only in real time upon deployment but on an ongoing basis going forward.
So what does this lead to?
The other thing we recognized is that when we started to move toward a more cloud-native approach, we thought we would just do alerting. We’d get alerts every time there was something we needed to pay attention to. That really quickly got out of control. It just became “Whack-A-Mole”. There was no way to keep up with it, and it caused alert fatigue: you are getting those Slack messages or emails, and how do you know which one to pay attention to and which areas are important? In reality, of the 20,000 changes an hour that you are dealing with, one might be really important, and you might have a hard time identifying which one it is.
Traditional IT and Security Processes Ineffective
The traditional IT perimeter and processes that we have always used and relied upon are ineffective. You just can’t handle those kinds of changes at scale. They are still important. I’m not minimizing the importance of perimeter control. But how do we filter out the noise? For those of you who are working for large enterprise firms, think about the IT procurement process. The development teams knew in their heads to build probably an extra 60 to 90 days into the committed schedule. By the time things get through procurement and the backlog of servers makes it to the data center, and by the time they rack the servers, put up the operating system and get them networked, we are looking at somewhere between 90 and 120 days. I’ve seen it take up to 180 days to get procured servers into the data center. With those kinds of processes, by the time you stand something up to detect a problem and resolve it, that resource may have already been built and gone. You have to be able to move more quickly.
When you are going through this process, working with engineering, trying to talk about the problems that they are going to face when they start to adopt cloud and you’re going through that transformation, sometimes people don’t understand the scale of the attack surface. What I mean is the offensive nature of what is going on out there. We used to have an exercise we would do with our engineers when they came on board. We’d have them deploy a server into a secured environment where Port 22 is open to the world and completely publicly accessible, and set both the login and the password to “root”. We’d just have them time it. How long before that box gets popped? Sometimes when you go through that exercise, it opens your eyes to the amount of things that are out there scanning and looking, trying to find ways in. Ten or twelve years ago, there was an increase in the number of sophisticated exploits being developed. That has actually started tapering off a little bit, primarily because it’s not necessary anymore. People are out there opening up S3 buckets or leaving databases open to the world, so it’s not necessary to spend time, energy and resources on those really complex exploits when you can just scan and find a way in. That is part of this equation: not only security professionals understanding what scale looks like internally, but also training the engineering organization about what’s important in security, how they need to deploy and how they need to think about people trying to get in. When you can teach them as they go through this process, everyone’s going to get better, everyone’s going to get faster and more innovative.
So, you first have to harvest that data back in. Then you want to unify it so that it’s consistent across all of your individual accounts and all of your VPCs, with all those resources normalized into a single data plane. Then you want to drive analysis against it. Think back to the compliance and security policies you established to define what it means for your organization to be compliant. That’s the analysis that gets done in real time against those resources. Then comes being able to take action. What do I want to have happen when this occurs? It’s that if-then scenario. If Port 22 is open to the world, what do I want to do? Who do I want to wake up? What immediate action do I want to take, not only to protect the company, but also from a forensics perspective as well as to learn? Was it the team who inadvertently did it to resolve an issue? Or were we actually breached? So all that data is captured and dumped off for analytics.
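The harvest, unify, analyze and act flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields, the input dictionary keys and the action names (`close_port`, `page_on_call`) are hypothetical and do not come from any cloud provider’s API or from the webinar.

```python
from dataclasses import dataclass


@dataclass
class FirewallRule:
    """Hypothetical normalized record; field names are illustrative only."""
    cloud: str
    account: str
    resource_id: str
    port: int
    source_cidr: str


def unify(raw_rules):
    """Normalize harvested, provider-shaped dicts into one record type."""
    unified = []
    for raw in raw_rules:
        unified.append(FirewallRule(
            cloud=raw["cloud"],
            account=raw["account"],
            resource_id=raw["id"],
            port=int(raw["port"]),
            # Assumed: different providers name the source range differently.
            source_cidr=raw.get("cidr") or raw.get("source_range"),
        ))
    return unified


def ssh_open_to_world(rule):
    """The 'if' half of the if-then: Port 22 reachable from anywhere."""
    return rule.port == 22 and rule.source_cidr == "0.0.0.0/0"


def analyze_and_act(rules, audit_log):
    """Return the actions to take and record each finding for later analytics."""
    actions = []
    for rule in rules:
        if ssh_open_to_world(rule):
            actions.append(("close_port", rule.resource_id))
            actions.append(("page_on_call", rule.account))
            audit_log.append({
                "cloud": rule.cloud,
                "resource": rule.resource_id,
                "finding": "ssh_open_to_world",
            })
    return actions
```

Keeping the audit log separate from the actions reflects the point made above: the immediate response protects the company, while the captured data supports forensics and learning afterwards.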
Several years ago, Google made an announcement about the multicloud push. This is absolutely the right way to go. They are doing it on top of Kubernetes. Kubernetes is going to be the element that breaks down the barriers and commoditizes infrastructure. It’s going to be really important, from the enterprise organization perspective, that when you’re looking at the infrastructure layer you can have a unified model, because you’re going to have engineers using Azure, engineers using GCP and engineers using Amazon. You can’t build policies that live only in each of those worlds, because you’re going to forget about them, or they are going to die on the vine or carry on in different ways. As we know in security, it doesn’t matter if you have 95% coverage. That 5% is the one that’s going to get you. So you need to make sure you create a holistic strategy and a holistic policy as you move forward.
So how do you do that?
There are lots of different ways to think about dealing with this using remediation. In a development environment, your remediation might be slightly different than in your production environment. Say you want to do some latency testing. When I was working for a big bank, we were not allowed to have servers outside of the United States, but developers still wanted to do latency testing. So in the development environment, you can spin up an instance in Asia Pacific for the next two hours. Two hours later, the system is automatically going to come back and clean it up and make sure everything is OK. But when you move that same application to staging, you may not have that ability. You might leverage faster remediation: a server comes up in Asia Pacific, and it’s killed instantly. You still want to allow them to try new services and do more things. You don’t want to lock them down using preventative controls because you need them to go in there and try new things, you need them to innovate. If you block them at the top layer, they’re just going to go around you. They’ll go create an account and try it themselves. That’s the worst place you can be in. Being compromised is bad. Being compromised and not knowing it is way worse. Embrace this. Help them innovate. Help them learn. Go through that process with them and leverage remediation in real time to provide flexibility in how they do that. Then when you get to production, this is where you may want to leverage some preventative controls. The cloud providers today provide different ways to do preventative controls and lock down certain services from being used. As you are taking your engineers through this journey, you want them, at each stage, to understand that it’s a little bit more stringent, a little bit tighter, a little bit harder to do what they want to do if it doesn’t fit inside the parameters of what we’ve approved.
That way, when something just doesn’t work in production, they are not surprised, because the whole way through the journey you’ve been teaching them. What’s more important is that it’s not just about enforcing a policy and then running away. It’s about engaging them. It’s about bringing them into the conversation and asking: what is it you are trying to accomplish? What are you trying to do? Let me help you find a secure way of doing that. Help them innovate and help them along that journey.
Think about it as this funnel of restriction. You’re providing guardrails that are much wider in that early stage to generate innovation, but by the time you are at stage three, it’s least privilege. In many cases, it’s going to be machine-only privileges that are enabled in production to run those services.
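One way to picture the funnel of restriction is as a small policy table keyed by environment. The sketch below is a hedged illustration, not a real product configuration: the region names, the `POLICY` table and the `decide` function are all hypothetical, and the two-hour grace period is taken from the latency-testing example above.

```python
from datetime import datetime, timedelta

# Hypothetical region names standing in for "inside the United States".
APPROVED_REGIONS = {"us-east", "us-west"}

# Assumed mapping of the three stages to progressively tighter controls.
POLICY = {
    "development": {"mode": "remediate", "grace": timedelta(hours=2)},
    "staging": {"mode": "remediate", "grace": timedelta(0)},
    "production": {"mode": "prevent", "grace": None},
}


def decide(environment, region, launched_at, now):
    """Decide what happens to an instance in the given environment and region."""
    if region in APPROVED_REGIONS:
        return "allow"
    policy = POLICY[environment]
    if policy["mode"] == "prevent":
        # Preventative control: the request is refused before anything is built.
        return "deny"
    if now - launched_at >= policy["grace"]:
        # Remediation: the instance exists but is cleaned up once the grace period ends.
        return "terminate"
    return "allow_temporarily"
```

For example, an Asia Pacific instance launched in development 30 minutes ago yields `"allow_temporarily"`, the same instance in staging yields `"terminate"` immediately, and in production the request yields `"deny"`.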
Layers of Security Mindset for Self-Service Adoption
When you are in protect mode, this is where you are leveraging real-time remediation to go in and clean up after things when something is down, clean up security groups, identify databases that have not been connected to in a long time, whatever it might be. With all those different elements, you’re going in and cleaning up and fixing that kind of stuff. You are protecting your environment on a regular basis.
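As a small illustration of that kind of cleanup check, here is a sketch that flags databases nobody has connected to in a long time. The record shape and the 90-day threshold are assumptions made for the example; the source does not say what counts as "a long time".

```python
from datetime import datetime, timedelta


def find_stale_databases(databases, now, max_idle=timedelta(days=90)):
    """Return names of databases whose last connection is older than max_idle.

    `databases` is a list of dicts with hypothetical keys
    "name" and "last_connection" (a datetime).
    """
    return [
        db["name"]
        for db in databases
        if now - db["last_connection"] > max_idle
    ]
```

The result is only a list of candidates; whether to snapshot, stop or delete them is a separate remediation decision.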
Then you have your inflight checks. This is your ability to take things like Terraform or CloudFormation templates, or maybe a Helm chart, or anything else you use to deploy and provision into your environment, and have the engineers integrate with a tool that checks those things as they are doing it. So when they are going through the CI/CD process, it checks over the system and goes “Hey, I’m about to build these ten resources and this is what it looks like. Is this OK?” Have the CI/CD process then either pass it and say “Yes, you are allowed to do this in development, but we are going to tell you that this is a problem,” or have it straight fail the build. You want to integrate and bring security into their world, not the other way around. If you try to pull them into yours, they are just going to find it annoying and go around you. Those inflight checks are really important.
Finally, there are those landing spaces: the idea that as you provision accounts, you do it in an automated fashion. When you do that for projects or teams or whatever it might be, you go in and start putting controls in place at provision time. This might be a mixture of remediation and preventative measures. You might force deployments through a CI/CD pipeline where everything is checked as it goes. All of those controls tighten as you move into production accounts with preventative controls.
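An inflight check of this kind can be sketched as a function the pipeline calls with the list of resources it is about to build. This is a minimal illustration, not the API of any real tool: the resource dictionary shape and the two rules are hypothetical, and the choice to warn in development but fail everywhere else is an assumption layered on the pass-with-warning versus fail-the-build options described above.

```python
def check_plan(planned_resources, environment):
    """Return (verdict, findings) for resources a pipeline is about to create.

    Each resource is a dict with hypothetical keys such as
    "type", "name", "port", "source" and "public".
    """
    findings = []
    for res in planned_resources:
        if (res.get("type") == "firewall_rule"
                and res.get("port") == 22
                and res.get("source") == "0.0.0.0/0"):
            findings.append(f"{res['name']}: port 22 open to the world")
        if res.get("type") == "bucket" and res.get("public", False):
            findings.append(f"{res['name']}: storage bucket is public")

    if not findings:
        return "pass", findings
    if environment == "development":
        # Let the build continue, but tell the engineer about the problem.
        return "warn", findings
    # Assumed: anything stricter than development fails the build outright.
    return "fail", findings
```

A pipeline step could print the findings and exit non-zero only when the verdict is `"fail"`, so engineers see the same messages in every environment before the check ever blocks them.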
The Importance of Having a CloudOps Team
When it comes to filtering out the noise, we can’t rely on traditional perimeter security controls and just providing notification. OK, it’s great to know that there is a theft happening in aisle 5, but have we filtered out the noise enough that we can pinpoint exactly where it is happening and how to resolve it, and then take action to remediate it at cloud speed? The companies that have the most success and get moving the quickest have an established CloudOps team. When we talk about security, there is this desire to think about traditional infrastructure security. That is about analyzing network traffic, identifying the external threats and coming up with preventative measures to react to those threats. But the security problem that we are facing right now is different from what we have seen before. Having previously done professional exploit development, I can say that from the offensive side you think about things slightly differently. You’re thinking about how to get into a black box. From the security side, when you’re defending against that, you’re looking at traffic to try to figure out what people are throwing at you, what they know about you that you don’t know and so on. When you are dealing with the CloudOps side of things and helping the engineering teams grow, it’s much more about understanding what they are doing and what their needs are, and making sure they don’t make mistakes. It’s an internal threat and it’s very different because, it turns out, you and they are on the same side. You’re not fighting with one another. You have to find a way to embrace that. What we’ve found is that establishing a cloud center of excellence, a CloudOps team that’s focused on security from a cloud perspective and what that means to the internal organization, means you get a lot more innovation a lot quicker.
Having that CloudOps team also helps as an accelerant, not only from an adoption perspective, but also from an educational and cultural perspective across the entire organization as people begin to transition out of that data center mindset. In many cases, organizations of that size are always going to be hybrid. They are going to have their data center with their large ERP systems and others that will remain on-prem, but to manage that mindset across the board, it helps to have that CloudOps team.