The Complete Cloud Operations Security Blueprint


These days, cybersecurity is a paramount discussion at every modern company, and where it’s not being discussed, it should be. We live in a time when disastrous security breaches seem to be weekly occurrences. Cyber threats cost companies dearly: data breaches cost $3.86 million each on average globally, cost people their jobs, and cause severe and lasting reputational damage to what were formerly some of the most trusted brands in the world.

More and more, cloud security is at the forefront of the conversation – more specifically, the security of production systems and applications running in the cloud. And as data is every company’s most valuable asset, that puts what are now surely the most important pieces of security on the shoulders of DevOps engineers.

But in today’s constantly changing and highly distributed world, it can be a big challenge to keep up and cover all your bases. Where do you start? And how can you establish best practice security without slowing the pace of development and deployment?

Securing from within and without

There’s a lot to consider in trying to effectively secure production applications. Most of us naturally start with security at the perimeter – protecting against malicious actors and activities that would infiltrate production environments, and monitoring to ensure the tools and processes in place are doing their jobs. Perimeter defenses will be reviewed in this blueprint (most environments these days are indeed hybrid and not strictly cloud or on-prem). But while they are important, the cloud has changed the idea of a perimeter: it can no longer be defined by or limited to a single network or boundary. That’s why we now try to limit the attack surface itself in several ways, including controlling and monitoring access to production systems and user-facing applications.

Too often, even the most rigorous of these practices break down and allow unwanted activity in. That’s why securing production systems from within is equally, if not more, important for a modern application. This starts with vulnerability management at the application layer itself, but also needs to include protecting endpoints, securing the network and flow of data, and, if all else fails, developing a plan for disaster recovery.

It’s a lot to install, control and monitor, and most teams find some amount of orchestration is necessary to manage it all. And while there are many solutions for each of these pieces to the security puzzle, it’s impractical to put something in place for every facet. The important areas to address are Access, Audit, Network, and Endpoint security. Strong security teams are able to unify the tools they do use with something that gives visibility into all of them. The real power is in correlating data from many tools to be able to have a broad and deep view of the security posture, to detect anomalies and potential security incidents, and to be proactive about preventing them.

Let’s take a deeper look at the most important measures to take, data to monitor, and tools to use for an effective Cloud Security Operations strategy.

Security at the “Perimeter”

It should surprise few that the first thing often considered is controlling and monitoring the traffic flowing in and out of your networks, building a “perimeter fence” to keep malicious activity out. In the cloud, the idea of a perimeter is much harder to define because systems and software are distributed. That’s why best practices recommend limiting the attack surface itself and implementing a layered approach of complementary defense mechanisms.


Firewalls

The baseline (and perhaps oldest) technology for this is a firewall, which filters out traffic that could potentially result in threats. Firewalls filter based on a set of rules that define allowable types and sources of traffic on the network. Traditional firewalls offer capabilities to protect enterprise networks and their users. These are important to implement, but developments in the cloud have given rise to a new set of next-generation and web application firewalls (WAFs) that focus on securing distributed production systems. These firewalls protect cloud servers not only against malicious traffic and attacks, but also from other servers. Nearly every major firewall brand – F5 Networks, Check Point, Cisco, Cloudflare, Zscaler – offers great next-generation firewalls. Of course, most cloud providers themselves provide firewall services with fairly robust capabilities: AWS WAF, Azure WAF, and Google Cloud Armor will serve you well.

Intrusion Detection/Prevention Systems (IDS/IPS)

In addition to firewalls, some organizations will deploy intrusion detection/prevention systems (IDS/IPS), which detect suspicious traffic that has made it through the firewall. Intrusion detection monitors for events in the network, scanning and inspecting packets for potential threats or violations of security policy. An IDS will then alert whoever is responsible for taking action on this information. Intrusion prevention takes this a step further by actually blocking attacks as they develop, controlling access to your network. It also takes steps to prevent attacks proactively by, for example, writing firewall rules against future attacks that may look similar.

IDS/IPS systems come in several flavors. Host-based IDS (HIDS), like the popular open source OSSEC, look at the events occurring on the machines on your network rather than the traffic on the network itself. Network-based IDS (NIDS) analyze the packets themselves to understand what is happening on the network. Snort, Suricata, and Zeek (formerly Bro) are some of the most popular and powerful, and all are open source tools.

Bot Protection

Bot abuse is a growing problem and deserves mention here. Malicious bots are often used to scrape content, take over accounts, commit click and checkout fraud, and even steal credit card information. In fact, Incapsula, one of the leading bot protection vendors, estimates that malicious bots comprise up to 29% of all internet traffic. Bot protection tools like Incapsula, PerimeterX, and Cloudflare apply rules based on IP reputation, behavior fingerprinting, and known bad signatures to block bad bot traffic.

Some enterprise tools, like Sophos UTM, group all or most of these together to provide an all-in-one solution that includes firewall protection, IPS, advanced threat detection, and more in a single platform. While they can be more expensive, these can be good options for companies that don’t have the internal bandwidth to establish perimeter security standards and implement the requisite tooling, and that want an easy, out-of-the-box solution. The downside is the lack of flexibility and customization for your environment.

Perimeter Monitoring

Implementing “perimeter” defenses is only the first step. In a zero-trust security model, we cannot simply depend on these systems, but must also monitor the activity. Think of your WAF as a data source here. Firewalls, in addition to web server, database, CDN, and network logs, can provide an important source of information to detect, prevent, and troubleshoot malicious activity. Here are some common areas where log analysis can help:

  • Traffic flows: Look for unexpected activity or abnormal traffic patterns in your server logs. These can be a sign of attacks in progress or of systems that are not properly configured.
  • Rule violations: Analyzing traffic that has been denied can uncover attempts at malicious activity, or simply misconfigured systems. Either way, it’s good to keep an eye on.
  • CDN traffic patterns: Identifying normal traffic patterns will help you detect and prevent DDoS attacks by configuring scrubbing filters that stop large amounts of fake traffic from creating problems.
  • Rule trigger frequency: Sometimes firewall rules become stale and are no longer needed. Unused rules cause unnecessary complexity and should be cleaned up or updated.
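
As a rough illustration of the rule-violation analysis above, the sketch below counts denied connections per source address and flags sources that cross a threshold. The log layout (`timestamp action src dst port`), the sample entries, and the threshold are assumptions made for this example and do not reflect the output of any particular firewall.

```python
from collections import Counter

# Hypothetical, simplified firewall log lines:
# "<timestamp> <action> <src_ip> <dst_ip> <dst_port>"
SAMPLE_LOG = """\
2024-01-01T00:00:01Z DENY 203.0.113.7 10.0.0.5 22
2024-01-01T00:00:02Z ALLOW 198.51.100.4 10.0.0.5 443
2024-01-01T00:00:03Z DENY 203.0.113.7 10.0.0.5 23
2024-01-01T00:00:04Z DENY 203.0.113.7 10.0.0.6 3389
2024-01-01T00:00:05Z DENY 192.0.2.10 10.0.0.5 22
"""


def denied_by_source(log_text):
    """Count DENY entries per source IP, skipping malformed lines."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        _timestamp, action, src_ip, _dst_ip, _dst_port = fields
        if action == "DENY":
            counts[src_ip] += 1
    return counts


def noisy_sources(counts, threshold):
    """Return source IPs with at least `threshold` denied connections, sorted."""
    return sorted(ip for ip, n in counts.items() if n >= threshold)


counts = denied_by_source(SAMPLE_LOG)
print(noisy_sources(counts, threshold=3))
```

In practice the same aggregation would usually run inside a log analytics platform rather than a script, but the logic (group denied traffic by source, then alert on outliers) is the same.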

Separation and VPC Flow Logs

When developing and deploying in the cloud, separation is a good general rule to follow. You want to separate your production environment from development and test environments, either physically or logically. That includes limiting developer access to production environments to troubleshooting activities only, and monitoring that access and activity. Probably the most common way to do that is by establishing a separate virtual private cloud (Amazon VPC, Azure Virtual Network, Google Cloud VPC) per environment. Separating production environments decreases the risk that someone intentionally or inadvertently makes unauthorized modifications that could compromise the system’s integrity or availability. It also allows you to apply different security standards, which should be much higher in production.

When deploying within VPC environments, it’s important to monitor the IP traffic flowing to and from your network interfaces, subnets, and VPCs. Monitoring VPC flow logs is the best way to keep an eye on traffic reaching and leaving your infrastructure resources, such as databases. For applications built on a microservices architecture, the high degree of internal communication between services makes this an extremely important activity. AWS makes things easy with its CloudWatch service, to which you can publish flow logs. As an added benefit, keeping track of flow logs can help identify latency and other performance issues within your applications.
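
To make flow log monitoring a little more concrete, here is a minimal sketch that parses records in the default (version 2) AWS VPC flow log layout and picks out rejected traffic. The sample records, the interface ID, and the helper names are made up for illustration; custom flow log formats use a different field order and would need a different field list.

```python
from collections import namedtuple

# Field order of the default (version 2) AWS VPC flow log record.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
FlowRecord = namedtuple("FlowRecord", FIELDS)

# Made-up sample records for illustration only.
SAMPLE = [
    "2 123456789010 eni-0abc 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-0abc 203.0.113.12 172.31.16.21 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
    "2 123456789010 eni-0abc - - - - - - - 1431280876 1431280934 - NODATA",
]


def parse_record(line):
    """Parse one space-separated record; return None if the field count is wrong."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return FlowRecord(*parts)


def rejected_flows(lines):
    """Return the parsed records whose action is REJECT."""
    records = (parse_record(line) for line in lines)
    return [r for r in records if r is not None and r.action == "REJECT"]


for record in rejected_flows(SAMPLE):
    print(record.srcaddr, "->", record.dstaddr, "port", record.dstport)
```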

Access Control & Monitoring

Securing and monitoring at the perimeter is a good first step to securing your cloud environment, but without control over access to your systems, you are putting your data and company at risk. Most organizations will have some amount of technology, policies, and procedures that limit access to networks, systems, applications, and sensitive data. Implementing and enforcing these can be complicated, especially in the ever-changing world of cloud services. As a best practice, follow the concept of “least privilege”, which restricts users’ access to only the resources they require to do their jobs.

User Management

From a security standpoint, the human element is often the weakest point of a security operation. To address that, teams should start by controlling who is allowed access to various systems – otherwise known as authorization. Many companies implement some sort of Identity and Access Management (IAM) tool, which limits access to systems and tools (and sometimes physical access) based on an employee’s rights as defined by an Access Control List (ACL). IAM tools utilizing ACLs are generally available for services from every major cloud provider.

Microsoft’s Active Directory (AD) has been a standard for decades. More recently, there’s been a lot of focus on streamlining this process by implementing role-based access controls (RBAC) that limit access to systems based on a user’s role within the organization, removing the need to define rights per user.
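
The idea behind RBAC can be sketched in a few lines. The roles, permissions, and user names below are hypothetical and exist only to show the shape of the check: permissions attach to roles, users attach to roles, and anything not explicitly granted is denied.

```python
# Hypothetical roles and the permissions each one grants.
ROLE_PERMISSIONS = {
    "developer": {"staging:deploy", "logs:read"},
    "sre": {"staging:deploy", "production:deploy", "logs:read"},
    "auditor": {"logs:read"},
}

# Hypothetical user-to-role assignments.
USER_ROLES = {
    "alice": {"developer"},
    "bob": {"sre"},
    "carol": {"auditor"},
}


def is_allowed(user, permission):
    """Least privilege: deny unless one of the user's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(is_allowed("alice", "production:deploy"))
print(is_allowed("bob", "production:deploy"))
```

Changing what a whole group of people can do then means editing one role rather than every individual user.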

In addition to authorization, the second consideration for user management is authentication: verifying the identity of users. As a basic step, companies require user IDs and passwords with varying degrees of complexity. Today, multi-factor authentication (MFA), offered by solutions like Okta or OneLogin, is considered best practice as well when dealing with human users. In cases where third-party applications need access to your systems, protocols such as OAuth, OpenID, SAML, or UAF are the preferred authentication methods. Check out OWASP’s authentication cheat sheet for best practices.
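
The text does not say which second factor is used, so the following sketch assumes one common choice: time-based one-time passwords (TOTP, RFC 6238) built on HOTP (RFC 4226). It uses only the Python standard library; the function names and the drift window are choices made for this example, and a production system would rely on a vetted MFA provider rather than hand-rolled code.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret, counter, digits=6):
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA-1."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset from the last nibble
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret, at_time=None, step=30, digits=6):
    """Time-based one-time password (RFC 6238): HOTP over the current time step."""
    if at_time is None:
        at_time = time.time()
    return hotp(secret, int(at_time) // step, digits)


def verify_totp(secret, submitted, at_time=None, step=30, window=1):
    """Accept codes from the current step and `window` steps on either side."""
    if at_time is None:
        at_time = time.time()
    counter = int(at_time) // step
    return any(
        hmac.compare_digest(hotp(secret, counter + delta), submitted)
        for delta in range(-window, window + 1)
        if counter + delta >= 0
    )


shared_secret = b"12345678901234567890"  # test secret used in the RFCs, not for real use
print(totp(shared_secret))
```

The small acceptance window tolerates clock drift between the server and the user’s device, at the cost of slightly extending how long a code stays valid.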

Key Management

In addition to managing human access to cloud systems, you need to manage the encryption keys used by your cloud applications and services for securing data at rest and in motion (e.g. SSL/TLS). Managing the generation, storage, use, and deletion of cryptographic keys will protect them from loss, corruption, or unauthorized access. While there are many vendors that provide key management options, the most common key management services are offered by the cloud providers themselves, such as AWS Key Management Service (KMS), Azure Key Vault, and Google Cloud Key Management Service (Cloud KMS).

Access Monitoring

Logging and monitoring all these services will help you detect failures, attempts, and attacks in real time. AWS CloudTrail or Azure AD can report the activities of users on your systems, such as logins and server lifecycle commands (e.g. who provisioned a given server). The best solutions will allow you to analyze and monitor the logs or metrics of each of the user and key management tools discussed above. You’ll want to make sure you’re tracking at least:

  • User logins
  • Applications logging in to and accessing your systems
  • Database access
  • Key management activity
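
As one example of tracking user logins, the sketch below counts failed console sign-ins per source address from records shaped like AWS CloudTrail `ConsoleLogin` events. The records are trimmed and invented for illustration (real events carry many more fields), and the function name is an assumption of this example.

```python
import json
from collections import Counter

# Trimmed, made-up records shaped like CloudTrail ConsoleLogin events.
SAMPLE_EVENTS = json.loads("""
{"Records": [
  {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.7",
   "responseElements": {"ConsoleLogin": "Failure"}},
  {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.7",
   "responseElements": {"ConsoleLogin": "Failure"}},
  {"eventName": "ConsoleLogin", "sourceIPAddress": "198.51.100.4",
   "responseElements": {"ConsoleLogin": "Success"}},
  {"eventName": "RunInstances", "sourceIPAddress": "198.51.100.4",
   "responseElements": null}
]}
""")


def failed_console_logins(events):
    """Count failed ConsoleLogin events per source IP address."""
    counts = Counter()
    for record in events.get("Records", []):
        if record.get("eventName") != "ConsoleLogin":
            continue
        response = record.get("responseElements") or {}
        if response.get("ConsoleLogin") == "Failure":
            counts[record.get("sourceIPAddress", "unknown")] += 1
    return counts


failures = failed_console_logins(SAMPLE_EVENTS)
print(failures.most_common())
```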

Endpoint Protection

In today’s distributed world, people and applications need access to your systems from all over the world. Controlling the perimeter and user access will only get you so far in this regard, because users and devices are only secured while they are on your network. Endpoint protection extends perimeter security, user access management, and log analytics and monitoring to account for the reality of distributed systems, devices, and users.

Virtual private networks (VPNs) enable secure remote connections to a private network, encrypting all network traffic between the local machine and the private network. There are hundreds of VPN services out there, but when it comes to cloud security, each of the cloud providers has its own go-to solution.

Of course, your distributed production servers must also be protected themselves. Making sure the right anti-virus and malware detection software is installed will protect production systems from being infected with malware and should be a no-brainer for any endpoint security program. Vendors such as Symantec, Sophos, Trend Micro, and Check Point Software offer a variety of endpoint protection (EPP) options.

Others such as Cybereason, Carbon Black, and CrowdStrike have taken a different approach to endpoint protection with endpoint detection and response (EDR) solutions. Instead of focusing on prevention, these platforms focus on detecting, investigating, and responding to suspicious activities and potential threats on hosts and other endpoints. They often provide far better capabilities in threat detection and response than traditional EPP solutions.

Vulnerability Management

No matter how much you try to strengthen your perimeter defenses, you simply cannot be certain that those steps have prevented all unauthorized or malicious traffic from getting through. You must also take steps to protect your applications themselves from attack. Otherwise, you put your data, and so your entire organization, at risk. In fact, perhaps the most common cause of data breaches, including some of the most high-profile in recent years, is attacks exploiting well-known and documented security vulnerabilities in software or its underlying infrastructure. Vulnerability management solutions like Rapid7 or Qualys are already core to modern security practices, and not taking steps toward vulnerability management is like installing a sophisticated alarm system at the bank but leaving the vault unlocked. Generally, you can think about vulnerability management in three steps: detection, prioritization, and remediation.


Detection

There are several ways to discover vulnerabilities, depending on the type. Traditional vulnerability scanners analyze your systems for vulnerabilities in the code you’ve built – security loopholes like accidental password leaks, deviations from best practice like open ports, or insecure software configurations – which can make it easy for malicious actors to access data. The open source Wazuh security platform includes configuration assessment as well as vulnerability detection, which matches installed software against CVEs (Common Vulnerabilities and Exposures). Amazon Inspector functions in much the same way and is a good option for applications deployed on AWS.

For code that you don’t write yourself, you’ll need to look elsewhere. While the commercial software you may use in your applications releases regular updates and patches for found issues, the prevalence of open source in modern applications means you cannot rely on vendors to notify you of vulnerabilities. Open source component scanning, or Software Composition Analysis (SCA), is now an integral component of any software security program. While the National Vulnerability Database (NVD) tracks most known open source vulnerabilities, matching those by hand with instances in your code is generally impractical. Vendors such as Snyk or WhiteSource scan and automatically identify vulnerabilities in open source components and operating systems (Linux is open source after all, and prone to vulnerabilities like anything else), and can usually be run as part of a continuous integration pipeline, making detection much more efficient.
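
At its core, component scanning compares the versions you have installed against a list of known-vulnerable versions. The sketch below shows that comparison in its simplest form. The package names, advisory IDs, and "fixed in" versions are placeholders invented for this example, and the version parser only handles plain dotted numbers; real SCA tools handle version ranges, pre-releases, and transitive dependencies.

```python
def parse_version(text):
    """Turn a plain dotted version like '2.3.9' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical advisory feed: package -> list of (advisory id, first fixed version).
ADVISORIES = {
    "examplelib": [("EXAMPLE-ADVISORY-1", "2.4.1")],
    "otherlib": [("EXAMPLE-ADVISORY-2", "1.0.5")],
}

# Hypothetical pinned dependencies of an application.
DEPENDENCIES = {"examplelib": "2.3.9", "otherlib": "1.2.0", "thirdlib": "0.9.0"}


def find_vulnerable(dependencies, advisories):
    """Return (package, installed, advisory id, fixed version) for affected packages."""
    findings = []
    for name, installed in sorted(dependencies.items()):
        for advisory_id, fixed_in in advisories.get(name, []):
            if parse_version(installed) < parse_version(fixed_in):
                findings.append((name, installed, advisory_id, fixed_in))
    return findings


for finding in find_vulnerable(DEPENDENCIES, ADVISORIES):
    print(finding)
```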

The rise of containers in application architectures has blurred the lines between the application and infrastructure layers. It has become incumbent upon DevOps teams to scan container images for known vulnerabilities. Tools like Aqua and Twistlock will scan images in your registry or during the build stage to provide risk factors to help you be smart about which containers can be deployed to clusters and what needs remediation. Kubernetes and other container orchestrators have added another layer of complexity but also additional layers of security for ensuring containers are protected at runtime. It’s important to monitor Kubernetes and each pod for suspicious behaviors, including cluster communication, traffic, and activity inside containers, that could signal a potential exploit.

Kubernetes security is a growing concern and deserves a closer exploration. To dive deeper into securing Kubernetes environments, read our guide to best practices for Kubernetes security.


Prioritization

With potentially long and growing lists of vulnerabilities uncovered in your applications, prioritization becomes a critical process for successful vulnerability management. Certain problems need to be remediated, or fixed, immediately, but some may carry a level of risk that is considered acceptable to the application or organization. You can measure these risks and prioritize based on standardized risk metrics like CVSS (Common Vulnerability Scoring System) scores, but it’s also worth understanding the context of that risk for your application. If, for example, a component carries a critical risk (CVSS 9.0-10.0) for SQL injection, but the application does not connect to a database with customer or other sensitive data, that risk may be prioritized lower.
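
A minimal sketch of this kind of context-aware prioritization follows. The severity bands are the standard CVSS v3.x qualitative ratings; the `exposed` flag, the finding names, and the sort order are assumptions of this example, standing in for whatever context your team uses to decide whether a vulnerability is actually reachable.

```python
def severity(score):
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


def prioritize(findings):
    """Order findings: exposed ones first, then by descending CVSS score."""
    return sorted(findings, key=lambda f: (not f["exposed"], -f["cvss"]))


# Hypothetical findings; "exposed" records whether the flaw is reachable in this app.
FINDINGS = [
    {"id": "sql-injection-in-unused-component", "cvss": 9.8, "exposed": False},
    {"id": "xss-in-login-form", "cvss": 6.1, "exposed": True},
    {"id": "rce-in-api-gateway", "cvss": 8.8, "exposed": True},
]

for finding in prioritize(FINDINGS):
    print(finding["id"], severity(finding["cvss"]))
```

Here the critical SQL injection finding drops to the bottom of the queue because, as in the example above, the affected component is not exposed in this application.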


Remediation

Vulnerabilities that are prioritized as unacceptable risk can be addressed by remediation or mitigation. Remediation is simply fixing the problem in the code. In most cases, installing a missing patch will correct the issue. If no patches are available or remediation proves too cumbersome, you can mitigate by figuring out an appropriate workaround to the issue or simply taking the affected part of the system offline.

As a best practice, vulnerability management should be automated as part of your build and deployment process. Most traditional vulnerability scanning, SCA, and container security products can be easily integrated to kick off scans as part of CI/CD pipelines or release automation with tools like Jenkins, GitLab CI, Electric Cloud, and GoCD. Monitoring these scans will help you assess the health of your pipelines, make sure they are successful, and troubleshoot failed builds.
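
One simple way to wire a scan into a pipeline is a gate step that turns scan results into a pass or fail exit code. The sketch below assumes a hypothetical report structure (a `findings` list with `id` and `cvss` keys) and a hypothetical threshold; real scanners each have their own report formats and usually ship their own gating options.

```python
def gate(report, fail_at=7.0):
    """Return (exit_code, blocking): exit_code is 1 if any finding meets the threshold."""
    blocking = [
        finding for finding in report.get("findings", [])
        if finding.get("cvss", 0.0) >= fail_at
    ]
    return (1 if blocking else 0), blocking


# Hypothetical scan report; IDs and scores are placeholders.
SAMPLE_REPORT = {
    "findings": [
        {"id": "EXAMPLE-1", "cvss": 9.1},
        {"id": "EXAMPLE-2", "cvss": 4.3},
    ]
}

exit_code, blocking = gate(SAMPLE_REPORT)
print(exit_code, [finding["id"] for finding in blocking])
```

In a pipeline, a small wrapper would load the scanner’s report from disk and pass the returned code to `sys.exit`, so that a non-zero result fails the build.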

Change Control & Monitoring

Configuration Management

Automation helps to ensure the implementation of consistent security controls for all deployed applications. Standardized, automated, and repeatable architectures can be used for common use cases, simplifying the process of auditing and accrediting deployed applications. Configuration management tools like Puppet, Chef, and Ansible, and infrastructure-as-code tools like Terraform, allow teams to set base configurations for components such as IAM and VPC controls to ensure workload owners are deploying architectures that meet foundational requirements and best practices for security.

By analyzing the logs created by these configuration management systems, you can be notified continuously of any configuration changes and have full visibility into the events that contributed to each change. You’ll want to be able to audit details such as “Who made this change?” and “From what IP address?” to ensure auditability and accountability in the case of a security failure.
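
A base configuration is easiest to enforce when it can be checked automatically. The following sketch checks a set of simplified ingress rules against one assumed baseline requirement: administrative ports must not be open to the whole internet. The rule structure, rule names, and the choice of ports are hypothetical; real infrastructure-as-code tools and cloud APIs use richer formats.

```python
import ipaddress

# Assumed baseline: SSH (22) and RDP (3389) must never be open to the world.
SENSITIVE_PORTS = {22, 3389}

# Hypothetical, simplified ingress rules.
RULES = [
    {"name": "web", "cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    {"name": "ssh-anywhere", "cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22},
    {"name": "ssh-office", "cidr": "198.51.100.0/24", "from_port": 22, "to_port": 22},
]


def violations(rules):
    """Return names of rules that expose a sensitive port to every address."""
    bad = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"])
        open_to_world = network.prefixlen == 0  # matches 0.0.0.0/0 and ::/0
        ports = range(rule["from_port"], rule["to_port"] + 1)
        if open_to_world and any(port in ports for port in SENSITIVE_PORTS):
            bad.append(rule["name"])
    return bad


print(violations(RULES))
```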

File Integrity Monitoring

Changes are a nearly constant reality in modern IT environments. File integrity monitoring (FIM) is a process that examines files for changes including who made the changes and how they were modified. That usually involves setting a known good baseline for certain important files, monitoring changes against that baseline, alerting for unauthorized or inappropriate changes, and reporting results of the monitoring for audit and compliance purposes.
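
The baseline-and-compare part of that process can be sketched with nothing more than file hashes. The example below records SHA-256 digests for a set of files and later reports which were modified, removed, or added. The file name and contents are made up, and the sketch covers only what changed; identifying who made a change requires operating system audit data, which dedicated FIM tools collect.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each existing file path to its current digest."""
    return {path: file_digest(path) for path in paths if os.path.isfile(path)}


def compare(baseline, current):
    """Report paths that were modified, removed, or added since the baseline."""
    return {
        "modified": sorted(p for p in baseline if p in current and baseline[p] != current[p]),
        "removed": sorted(p for p in baseline if p not in current),
        "added": sorted(p for p in current if p not in baseline),
    }


# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as workdir:
    config = os.path.join(workdir, "app.conf")
    with open(config, "w") as handle:
        handle.write("debug = false\n")
    baseline = snapshot([config])          # known good state
    with open(config, "w") as handle:
        handle.write("debug = true\n")     # simulated unauthorized change
    report = compare(baseline, snapshot([config]))

print(report)
```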

Multiple compliance standards require file integrity monitoring, including PCI-DSS, SOX, NERC CIP, FISMA, HIPAA, and SANS Control 3. There are many options for organizations looking to implement file integrity monitoring, including Tripwire and Wazuh. Auditbeat supports monitoring file and directory changes and is lightweight and configurable for your custom needs.

How Effective is Your Cloud Security?

There’s a lot that goes into securing your cloud operations. We analyze security posture in the cloud around four key areas: Access, Audit, Network, and Endpoint. To ensure you are doing the right things to protect your systems, it’s important to have solutions in place to address each area and to unify and correlate data from those solutions for complete visibility.