It’s a Company Transformation Issue, Not a Technology Issue
From Webinar GoogleNext ’17 Presented by
Thomas Martin, Founder, NephōSec | Brian Johnson, Founder, DivvyCloud
As we adopted cloud, we started treating it like a technology problem, but that actually wasn’t the case. It was a culture problem. What we realized was that the shift to self-service was incredibly important for our ability to compete. All of our competitors were out there building products, doing it much faster and getting those products to market. How could we keep up with that? How could we continue to innovate? We couldn’t if IT continued to get in the way. So we needed to find a way to allow our engineering organizations to access the cloud, deploy applications, and innovate through cloud infrastructure without IT and security getting in the way. And through that process, we discovered it wasn’t just a technology problem. Three things came together to create an issue for security:
- The number of resources managed
- The number of people touching the infrastructure
- How often those resources are changing
These three things combined led to this incredibly difficult problem. How do we deal with this scale? How could we possibly, as an organization, understand everything that was changing and react to it in a reasonable amount of time when the traditional security and IT processes we had in place were super slow? That is the problem we see out there. This is not a technology issue, it’s a company transformation issue. How do you get ahead of this and how do you deal with scale?
So what does this mean from a practical aspect?
This is a simple 3-tier architecture. The opportunities for misconfiguration seem quite small: you’ve got a load balancer, a couple of compute instances spanning a couple of availability zones, cloud storage, and Cloud SQL. If you are managing this at a small scale with a single team, it’s not all that hard. But when you start looking at it, there are 20+ opportunities for misconfiguration just in this simple 3-tier architecture. Think about that when you begin to migrate 5, 7, 10 thousand applications across an enterprise. Trying to manage this at any kind of scale becomes unwieldy.
What I have found in my past experience is that somewhere between 100 and 200 applications, the whole structure starts to fall down. You really have to think not only about the CI/CD process but about all the configurations, not only in real time upon deployment but on an ongoing basis going forward.
So what does this lead to?
In our case, it led to a couple of different things. It led to loss of control. We are letting engineers deploy. That’s a great thing. You want that innovation. You need that innovation. In order to survive, the company has to find a way to compete through innovation. It’s important, but you lose control. It used to be that any change made in the infrastructure went through us first, so we would know about a problem. We would see a mistake and be able to stop it. That is not necessarily the case anymore. All things considered, this happened really quickly. You went from having an IT organization with processes in place to catch these issues through its controls and gateways, to engineers doing all sorts of things all over the place. And the problem is that nobody sat the engineering organization down and explained to them the 20 years of security issues that we hit. It’s not like IT learned that stuff the easy way. We got compromised, we had problems, we had issues. We learned and built processes. Unfortunately, those processes really slowed us down.
The other thing we recognized is that when we started to move toward a more cloud-native approach, we thought we would just do alerting. We’d get alerts every time there was something we needed to pay attention to. That really quickly got out of control. It just became “Whack-A-Mole”. There was no way to keep up with it, and it caused alert fatigue. You are getting those Slack messages or emails, and how do you know which one to pay attention to and what the important areas are? In reality, of the 20,000 changes an hour that you are dealing with, one might be really important, and you might have a hard time identifying which one it is.
Noise + Signal
This is really about a signal and noise problem. With all these things going on, how do you reduce all of the noise so you can focus on the signal? You have to leverage automation to do that. There is just no way that we can leverage traditional IT processes to use a run book, to correct problems, to contact the person to talk to them about making the change. By the time that has occurred, the application has been torn down and redeployed three times. So you need to be able to get rid of the noise, leveraging automation, so that your IT staff, your security staff, your SecOps, your CloudOps have the ability to focus on that 10% that they need to be dealing with on a more manual and active basis.
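To make the idea of automated noise reduction concrete, here is a minimal sketch in Python. It is not any particular product’s logic: the `ChangeEvent` fields, the event kinds, and the two classification sets are all hypothetical and chosen only for illustration. The point is that known-harmless changes are dropped, known-fixable ones are handed to automation, and only the remainder reaches a person.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical change event; the field names are illustrative only.
@dataclass(frozen=True)
class ChangeEvent:
    resource_id: str
    kind: str         # e.g. "tag_update", "public_bucket"
    environment: str  # e.g. "dev", "staging", "prod"


# Assumed classification of event kinds (not taken from any real tool).
IGNORABLE = {"tag_update", "autoscale_event"}
AUTO_REMEDIABLE = {"open_ssh", "public_bucket", "untagged_resource"}


def triage(event: ChangeEvent) -> str:
    """Return 'ignore', 'auto_remediate', or 'escalate' for one event."""
    if event.kind in IGNORABLE:
        return "ignore"
    if event.kind in AUTO_REMEDIABLE:
        return "auto_remediate"
    # Anything automation does not recognize goes to a human.
    return "escalate"


def summarize(events) -> Counter:
    """Count how many events fall into each triage bucket."""
    return Counter(triage(e) for e in events)


if __name__ == "__main__":
    sample = [
        ChangeEvent("i-1", "autoscale_event", "prod"),
        ChangeEvent("b-1", "public_bucket", "dev"),
        ChangeEvent("r-1", "iam_policy_change", "prod"),
    ]
    print(summarize(sample))
```

Defaulting unknown kinds to `escalate` is a deliberate choice in this sketch: it is safer for a new kind of change to reach a person than to be silently ignored.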
Traditional IT and Security Processes Ineffective
The traditional IT perimeter and processes that we have always relied upon are ineffective. You just can’t handle those kinds of changes at scale. They are still important. I’m not minimizing the importance of perimeter control. But how do we filter out the noise? For those of you who work for large enterprise firms, think about the IT procurement process. The development teams knew in their heads to build probably an extra 60 to 90 days into the schedule. By the time things get through procurement and the backlog of servers making it to the data center, and by the time they rack it, put up the operating system and get it networked, we are looking at somewhere between 90 and 120 days. I’ve seen it take up to 180 days to get procured servers into the data center. With those kinds of processes, by the time you stand something up to detect and resolve a problem, that resource may have already been built and gone. You have to be able to move more quickly.
When you are going through this process, working with engineering, trying to talk about the problems that they are going to face when they start to adopt cloud, and you’re going through that transformation, sometimes people don’t understand the scale of the attack surface. What I mean is the offensive nature of what is going on out there. We used to have an exercise we would do with our engineers when they came on board. We’d have them deploy a server into a secured environment where port 22 is open to the world and completely publicly accessible, and set root/root as the login and password. We’d just have them time it. How long before that box gets popped? Sometimes when you go through that exercise, it opens your eyes to the number of things that are out there scanning and looking, trying to find ways in. Ten or twelve years ago, there was an increase in the number of sophisticated exploits being developed. That has actually started tapering off a little bit, primarily because it’s not necessary anymore. People are out there opening up S3 buckets or leaving databases open to the world, so it’s not necessary to spend time, energy and resources on those really complex exploits when you can just scan and find a way in. That is part of this equation: not only security professionals understanding what scale looks like internally, but also training the engineering organization about what’s important in security, how they need to deploy and how they need to think about people trying to get in. When you can teach them as they go through this process, everyone’s going to get better, everyone’s going to get faster and more innovative.
What would be the SLA from a typical event in a data center to when you respond? How much data could be lost with that cloud storage open to the world? Think about it from a remediation standpoint. First, it needs to be near real time. It really starts at a harvesting point: utilizing all the access points, those APIs across all those resources, and harvesting the data back in real time, not only upon creation but also upon change. That day-two drift is something to think about as well. Things might have been great as you deployed out of the CI/CD tool chain. But what happened after that point with that engineer? I don’t believe people intentionally make a lot of the configuration mistakes they make. But it’s that middle-of-the-night, day-two ops moment when something is wrong: “I’ll change that back as soon as it’s resolved.” And then it doesn’t get flipped back.
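Day-two drift can be illustrated with a small Python sketch. It assumes that a resource’s configuration can be captured as a flat dictionary at deploy time and again on each harvest; the setting names in the example are made up. The function simply reports every setting whose current value differs from the deployed baseline.

```python
def detect_drift(baseline: dict, current: dict) -> dict:
    """Return {setting: (expected, actual)} for every setting that differs.

    A setting missing from one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(baseline.keys() | current.keys()):
        expected = baseline.get(key)
        actual = current.get(key)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


if __name__ == "__main__":
    # Hypothetical snapshot taken when the CI/CD pipeline deployed the resource.
    deployed = {"public_access": False, "ssh_cidr": "10.0.0.0/8"}
    # Hypothetical snapshot harvested later, after a middle-of-the-night fix.
    harvested = {"public_access": True, "ssh_cidr": "10.0.0.0/8"}
    print(detect_drift(deployed, harvested))
```

Running the comparison on every change event, rather than on a nightly schedule, is what keeps the "I’ll change that back" case from lingering unnoticed.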
So, you first have to harvest that data back in. Then you want to unify it so that it’s consistent across all of your individual accounts and all of your VPCs, with all those resources normalized into a single data plane. Then you want to drive analysis against it. You have thought about establishing those compliance and security policies, about what it means for your organization to be compliant. That’s the analysis that gets done in real time against those resources, followed by being able to take action. What do I want to have happen when this occurs? It’s that if-then scenario. If port 22 is open to the world, what do I want to do? Who do I want to wake up? What immediate action do I want to take, not only to protect the company, but also from a forensics perspective and to learn? Was it the team who inadvertently did it to resolve an issue? Or were we actually breached? So all that data is captured and dumped off for analytics.
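The harvest, unify, analyze and act steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real tool: the raw record shape, the `FirewallRule` model, and the action names `close_port` and `notify_owner` are all hypothetical. Harvesting itself (calling each provider’s APIs) is left out, so the sketch starts from records that have already been collected.

```python
from dataclasses import dataclass

# CIDR blocks that mean "any address" for IPv4 and IPv6.
ANYWHERE = {"0.0.0.0/0", "::/0"}


# Hypothetical normalized model for one inbound firewall rule.
@dataclass(frozen=True)
class FirewallRule:
    resource_id: str
    account: str
    port: int
    source_cidr: str


def unify(raw_records):
    """Normalize harvested records (assumed dict shape) into one model."""
    return [
        FirewallRule(r["id"], r["account"], int(r["port"]), r["cidr"])
        for r in raw_records
    ]


def ssh_open_to_world(rule: FirewallRule) -> bool:
    """The 'if' half of the if-then rule."""
    return rule.port == 22 and rule.source_cidr in ANYWHERE


def analyze_and_act(rules, audit_log):
    """Evaluate every rule; return the actions to take (the 'then' half)."""
    actions = []
    for rule in rules:
        if ssh_open_to_world(rule):
            # Keep a record for forensics before changing anything.
            audit_log.append(rule)
            actions.append(("close_port", rule.resource_id))
            actions.append(("notify_owner", rule.account))
    return actions


if __name__ == "__main__":
    harvested = [
        {"id": "sg-1", "account": "team-a", "port": "22", "cidr": "0.0.0.0/0"},
        {"id": "sg-2", "account": "team-b", "port": 443, "cidr": "0.0.0.0/0"},
    ]
    log = []
    print(analyze_and_act(unify(harvested), log))
```

Returning a list of actions instead of executing them directly keeps the decision separate from the enforcement, which makes it easier to answer later whether a team made the change or whether it was a breach.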
Several years ago, Google made an announcement about the multicloud push. This is absolutely the right way to go. They are doing it on top of Kubernetes. Kubernetes is going to be the element that breaks down the barriers and commoditizes infrastructure. From the enterprise perspective, it’s going to be really important that when you’re looking at the infrastructure layer, you have a unified model, because you’re going to have engineers using Azure, engineers using GCP, and engineers using Amazon. You can’t build policies that live only in each of those worlds, because you’re going to forget about them, or they are going to die on the vine or go their own different ways. As we know in security, it doesn’t matter if you have 95% coverage. That 5% is the one that’s going to get you. So you need to make sure you create a holistic strategy and a holistic policy as you move forward.
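One way to picture a unified model is a single policy evaluated against resources from every provider, with anything the model cannot yet understand reported rather than skipped. The Python sketch below assumes invented record shapes for each provider (they are not the providers’ real API fields) and a hypothetical `ADAPTERS` table that maps each shape onto a common form.

```python
# Hypothetical adapters: each maps a provider-specific storage record
# (invented shapes, not real API responses) onto one common form.
ADAPTERS = {
    "aws": lambda r: {"id": r["name"], "public": r["is_public"]},
    "gcp": lambda r: {"id": r["bucket"], "public": r["public_access"]},
}


def evaluate(resources):
    """Apply one 'no public storage' policy across providers.

    `resources` is an iterable of (provider, raw_record) pairs.
    Returns (violations, uncovered), where `uncovered` lists records
    from providers that have no adapter yet.
    """
    violations, uncovered = [], []
    for provider, raw in resources:
        adapter = ADAPTERS.get(provider)
        if adapter is None:
            # Surface the coverage gap instead of silently passing it.
            uncovered.append((provider, raw))
            continue
        normalized = adapter(raw)
        if normalized["public"]:
            violations.append((provider, normalized["id"]))
    return violations, uncovered


if __name__ == "__main__":
    found, gaps = evaluate([
        ("aws", {"name": "reports", "is_public": True}),
        ("gcp", {"bucket": "logs", "public_access": False}),
        ("azure", {"container": "backups"}),
    ])
    print(found)
    print(gaps)
```

The `uncovered` list is the part that speaks to the 5%: a provider without an adapter shows up as an explicit gap rather than as a clean report.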
So how do you do that?
There are lots of different ways to think about dealing with this using remediation. In a development environment, your remediation might be slightly different than in your production environment. In your development environment, you may want to do some latency testing. When I was working for a big bank, we were not allowed to have servers outside of the United States. But in the development environment, you can spin up an instance in Asia Pacific for the next two hours. Two hours later, the system is automatically going to come back, clean it up and make sure everything is OK. But when you move that same application to staging, you may not have the ability to do that. You might leverage faster remediation: a server comes up in Asia Pacific, and it’s killed instantly. You still want to allow them to try new services and do more things. You don’t want to lock them down using preventative controls, because you need them to go in there and try new things. You need them to innovate. If you block them at the top layer, they’re just going to go around you. They’ll go create an account and try it themselves. That’s the worst place you can be in. Being compromised is bad. Being compromised and not knowing it is way worse. Embrace this. Help them innovate. Help them learn. Go through that process with them and leverage remediation in real time to provide flexibility in how they do that. Then when you get to production, this is where you may want to leverage some preventative controls. The cloud providers today provide different ways to do preventative controls and lock down certain services from being used. As you are taking your engineers through this journey, you want them, at each stage, to understand that it’s a little bit more stringent, a little bit tighter, and a little bit harder to do what they want to do if it doesn’t fit inside the parameters of what we’ve approved. That way, when they get to production and it just doesn’t work, they are not surprised, because the whole way through the journey you’ve been teaching them.
What’s more important is that it’s not just about enforcing a policy and then running away. It’s about engaging them. It’s about bringing them into the conversation and asking: what is it you are trying to accomplish? What are you trying to do? Let me help you find a secure way of doing that. Help them innovate and help them along that journey.
Think about it as this funnel of restriction. You’re providing guardrails that are much wider in that early stage to generate innovation, but by the time you are at stage three, it’s least privilege. In many cases, it’s going to be machine-only privileges that are enabled in production to be able to run those services.
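The funnel can be sketched as one decision function whose answer tightens by environment, following the out-of-region latency-test example above. This is an illustrative Python sketch only: the region names, the environment labels, and the returned action strings are hypothetical, and the two-hour grace period simply mirrors the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical approved regions and development grace period.
ALLOWED_REGIONS = {"us-east", "us-west"}
DEV_GRACE = timedelta(hours=2)


def decide(environment: str, region: str,
           launched_at: datetime, now: datetime) -> str:
    """Return 'allow', 'terminate', or 'block' for an instance.

    dev      -> out-of-region instances are tolerated for a grace period
    staging  -> out-of-region instances are terminated immediately
    others   -> treated like production: blocked by a preventative control
    """
    if region in ALLOWED_REGIONS:
        return "allow"
    if environment == "dev":
        return "allow" if now - launched_at < DEV_GRACE else "terminate"
    if environment == "staging":
        return "terminate"
    return "block"


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for env in ("dev", "staging", "prod"):
        print(env, decide(env, "asia-pacific", start, start + timedelta(hours=1)))
```

Treating any unrecognized environment like production is an assumption of this sketch; it keeps a mislabeled account on the strictest end of the funnel.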
Layers of Security Mindset for Self-Service Adoption
You can think about it as super fine grain up to coarse grain.
When you are in protect mode, this is where you are leveraging real-time remediation to go in and clean up after things when something is down, clean up security groups, identify databases that have not been connected to in a long time, whatever it might be. With all those different elements, you’re going in and cleaning up and fixing that kind of stuff. You are protecting your environment on a regular basis.
Then you have your inflight checks. This is your ability to take things like Terraform or CloudFormation templates, maybe a Helm chart, or anything else you use to provision and deploy into your environment, and have the engineers integrate with a tool that checks those things as they go. So when they are going through the CI/CD process, it checks over the system and goes, “Hey, I’m about to build these ten resources and this is what it looks like. Is this OK?” Have the CI/CD process then either pass it and say, “Yes, you are allowed to do this in development, but we are going to tell you that this is a problem,” or have it straight fail the build. You want to integrate and bring security into their world, not the other way around. If you try to pull them into yours, they are just going to find it annoying and go around you. Those inflight checks are really important.
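An inflight check of this kind can be sketched as a function that a CI/CD step calls with the list of resources about to be built. The Python below is a simplified illustration: the planned-resource dictionaries are an invented shape (not the actual Terraform plan or CloudFormation format), and the two checks are only examples. In development it warns and passes; anywhere else it fails the build.

```python
def check_plan(planned_resources, environment: str):
    """Return (passed, findings) for a list of planned resources.

    Each resource is a dict in a hypothetical, simplified shape.
    """
    findings = []
    for res in planned_resources:
        if (res.get("type") == "firewall_rule"
                and res.get("port") == 22
                and res.get("cidr") == "0.0.0.0/0"):
            findings.append(f"{res['name']}: SSH is open to the world")
        if res.get("type") == "bucket" and res.get("public"):
            findings.append(f"{res['name']}: storage bucket is public")
    # Warn but pass in development; fail the build everywhere else.
    passed = not findings or environment == "dev"
    return passed, findings


if __name__ == "__main__":
    plan = [
        {"type": "firewall_rule", "name": "ssh-in", "port": 22, "cidr": "0.0.0.0/0"},
        {"type": "bucket", "name": "logs", "public": False},
    ]
    for env in ("dev", "prod"):
        ok, problems = check_plan(plan, env)
        print(env, "PASS" if ok else "FAIL", problems)
```

A pipeline step could exit non-zero when `passed` is false, so the engineer sees the findings in the same place as any other build failure.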
Finally, there are those landing spaces: the idea that as you provision accounts, you do it in an automated fashion. When you do that for projects or teams or whatever it might be, you go in and start slapping controls around them at provision time. This might be a mixture of remediation and preventative measures. You might force some sort of path through a CI/CD pipeline where it is getting checked as it goes. All of those tighten things up as you go into production and preventative accounts.
Those layers really go in combination. Those coarse-grain, big-mindset controls say these are never to be violated. Then down to that mid grain, where you may put a warning in that dev cycle but you’re not going to shut it down immediately, because you are also facing that cultural shift, so you also want to educate engineering as to why we are going in that direction. Then down to those fine-grain controls that take care of things not only upon launch but also the drift that can occur on day 2 through day 30. Those combine with some of the things we talked about previously around the cycle aspect, and give us, as leaders, the ability to become the department of “Yes” that drives innovation for your company, rather than the department of “No”.
The Importance of Having a CloudOps Team
When it comes to filtering out the noise, we can’t rely on traditional perimeter security controls and just providing notification. OK, it’s great to know that there is a theft happening in aisle 5, but have we filtered out the noise enough that we can pinpoint exactly where and what is happening and how to resolve it, and then take action to remediate it at cloud speed? The companies that have the most success and get moving the quickest have an established CloudOps team. When we talk about security, there is this desire to think about traditional infrastructure security. That is about analyzing network traffic, identifying the external threats and coming up with preventative measures to react to those threats. But the security problem that we are facing right now is different from what we have seen before. Having previously done professional exploit development, I know that from the offensive side you’re thinking about things slightly differently. You’re thinking about how to get into a black box. From the security side, when you’re defending against that, you’re looking at traffic to try to figure out what people are throwing at you, what they know about you that you don’t know, and so on. When you are dealing with the CloudOps side of things and helping the engineering teams grow, it’s much more about understanding what they are doing and what their needs are, and making sure they don’t make mistakes. It’s an internal threat, and it’s very different because, it turns out, you and they are on the same side. You’re not fighting with one another. You have to find a way to embrace that. What we’ve found is that establishing a cloud center of excellence, a CloudOps team focused on security from a cloud perspective and what that means to the internal organization, means you get a lot more innovation a lot quicker.
Having that CloudOps team also acts as an accelerant, not only from an adoption perspective, but also from an educational and cultural perspective across the entire organization as people begin to transition out of that data center mindset. In many cases, organizations of that size are always going to be hybrid. They are going to have their data center with their large ERP systems and others that will remain on-prem, but to be able to manage that mindset across the board, it helps to have that CloudOps team.
You need to find a strategy for the organization. This takes us back to the beginning, when we started talking about the fact that this is not just a technology problem. This is how businesses are going to transform. The introduction of cloud doesn’t just change how you deploy applications. It changes what applications you build, what you take to market, and what products you stop development on, because you’re able to do it faster. It’s a huge business transformation. So as you go through this process, you decide how security, IT and CloudOps are going to address this. It’s important to think about this as a holistic strategy. Those layers we talked about, all the way from development to production, need to be taken into consideration, along with how you engage your engineering staff and teach them as they go.