DevOps is not a framework or a workflow. It's a culture that is overtaking the business world. DevOps ensures collaboration and communication between software engineers (Dev) and IT operations (Ops). With DevOps, changes make it to production faster. Resources are easier to share. And large-scale systems are easier to manage and maintain.
DevOps replaces the model where you have one team that writes the code, another team to test it, yet another team to deploy it, and still another team to operate it.
Key roles in a DevOps team:
- DevOps Evangelist - the DevOps leader who is responsible for the success of all the DevOps processes and people.
- Code Release Manager – essentially a project manager that understands the agile methodology. They are responsible for overall progress by measuring metrics on all tasks.
- Automation Expert – responsible for finding the proper tools and implementing the processes that can automate any manual tasks.
- Quality Assurance or Experience Assurance – not to be confused with someone who just finds and reports bugs. Responsible for the user experience and ensures that the final product has all the features in the original specifications.
- Software Developer/Tester – the builder and tester of code that ensures each line of code meets the original business requirements.
- Security Engineer - with all the nefarious operators out there you need someone to keep the corporation safe and in compliance. This person needs to work closely with everyone to ensure the integrity of corporate data.
What does DevOps do for you and why would you want to practice it?
Well, the first reason is that it's been shown to be effective in improving both IT and business outcomes. Puppet Labs' 2015 State of DevOps survey indicated that teams using DevOps practices deployed changes 30 times more frequently with 200 times shorter lead times. And instead of that resulting in quality issues, they had 60 times fewer failures and recovered from issues 168 times faster than other organizations.
The second reason is that it makes your daily life easier. High tech is a very interrupt-driven, high-pressure exercise in firefighting that can often lead to personal and professional burnout. We've found that the DevOps approach reduces unplanned work, increases friendly relationships between coworkers, and reduces stress on the job. Collaboration among everyone participating in delivering software is a key DevOps tenet.
DevOps core values: CAMS
The CAMS model was created by DevOps pioneers John Willis and Damon Edwards. It stands for Culture, Automation, Measurement, and Sharing.
CAMS has become the model set of values used by many DevOps practitioners. Patrick Debois is often referred to as the godfather of DevOps, since he coined the term, but he likes to say that DevOps is a human problem.
What is culture? Culture's a lot more than ping pong tables in the office, or free food in the company cafeteria.
Culture is driven by behavior. Culture exists among people with a mutual understanding of each other and where they're coming from. Early on in IT organizations, we split teams into two major groups: Development, charged with creating features, and Operations, charged with maintaining stability. Walls formed around these silos because of their differing goals. Today, after this pattern has had a long time to metastasize, these groups don't speak the same language, and they don't have mutual understanding. Changing these underlying behaviors and assumptions is how you drive change in your company's culture.
This brings us to the 'A' in CAMS: Automation. The first thing that people usually think about when they think of DevOps is automation. In the early days of DevOps, some people applied the term to anybody who was using Chef or Puppet or CFEngine. But part of the point of CAMS is to bring back balance into how we think about it. DevOps is not just about automated tooling.
People and process, they've got to come first. Damon Edwards expressed this as "people over process over tools". All of that said, Automation is a critical part of our DevOps journey. Once you begin to understand your culture, you can create a fabric of automation that allows you to control your systems and your applications. Automation is that accelerator that's going to get you all the other benefits of DevOps. You really want to make sure you prioritize Automation as your primary approach to the problem.
This brings us to the 'M' in CAMS. That stands for 'Measurement'. One of the keys to a rational approach to our systems is the ability to measure them. Metrics tell you what's happening and whether the changes we've made have improved anything. There are two major pitfalls in metrics. First, sometimes we choose the wrong metrics to watch. And second, sometimes we fail to incentivize them properly. Because of this, DevOps strongly advises you to measure key metrics across the organization. Look for things like MTTR, the mean time to recovery, or cycle time. Look for costs, revenue, even something like employee satisfaction. All of these are part of generating a holistic insight across your system. These metrics help engage the team in the overall goal. It's common to see them shared internally, or even exposed externally to customers.
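For instance, here is a minimal sketch of how a team might compute MTTR from incident records. The timestamps and record layout are hypothetical; a real team would usually pull this data from its incident tracker.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, recovered_at)
incidents = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 9, 42)),
    (datetime(2023, 5, 7, 14, 10), datetime(2023, 5, 7, 14, 25)),
    (datetime(2023, 5, 19, 2, 30), datetime(2023, 5, 19, 4, 0)),
]

def mean_time_to_recovery(records):
    """Average time between detection and recovery across incidents."""
    total = sum(((end - start) for start, end in records), timedelta())
    return total / len(records)

print(mean_time_to_recovery(incidents))  # 0:49:00 for this sample data
```

A number like this, trended over time on a shared dashboard, is exactly the kind of metric teams expose internally or even to customers.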
Speaking of sharing, that brings us to the 'S' in CAMS. Sharing ideas and problems is the heart of collaboration, and it's also really at the heart of DevOps. In DevOps, expect to see a high premium placed on openness and transparency. This drives kaizen, a Japanese term for continuous improvement.
CAMS: Culture, Automation, Measurement, and Sharing. These are the four fundamental and mutually reinforcing values to bring to a DevOps implementation. They're the "why" behind many of the more specific techniques that we're going to cover later in this course. Take these values to heart, because the rest of your DevOps journey is going to be about trying to realize them in your organization.
DevOps principles: The three ways
The most respected set of principles is called The Three Ways.
This model was developed by Gene Kim, author of "Visible Ops" and "The Phoenix Project," and Mike Orzen, author of "Lean IT." The three ways they propose are systems thinking, amplifying feedback loops, and a culture of continuous experimentation and learning.
The first way, systems thinking, tells us that we should focus on the overall outcome of the entire pipeline in our value chain. It's easy to make the mistake of optimizing one part of that chain at the expense of overall results. When you're trying to optimize performance in an application, for example, increasing performance or system resources in one area often just moves the bottleneck, sometimes to an unexpected place.
Adding more application servers, for example, can overwhelm a database server with connections and bring it down. You have to understand the whole system to optimize it well. The same principle applies to IT organizations. A deployment team might establish processes to make their own work go smoothly and their productivity numbers look good, but those same changes could compromise the development process and reduce the organization's overall ability to deliver software. This overall flow is often called "from concept to cash." If you write all the software in the world but you can't deliver it to a customer in a way that they can use it, you lose. The split between development and operations has often been the place where the flow from concept to cash goes wrong. Use systems thinking as guidance when defining success metrics and evaluating the outcome of changes.
The second way, amplifying feedback loops, is all about creating, shortening, and amplifying feedback loops between the parts of the organization that are in the flow of that value chain. A feedback loop is simply a process that takes its own output into consideration when deciding what to do next. The term originally comes from engineering control systems. Short, effective feedback loops are the key to productive product development, software development, and operations. Effective feedback is what drives any control loop designed to improve a system. Use amplifying feedback loops to help you when you're creating multi-team processes, visualizing metrics, and designing delivery flows.
The third way reminds us to create a work culture that allows for both continuous experimentation and learning. You and your team should be open to learning new things, and the best route to that is actively trying them out to see what works and what doesn't work, instead of falling into analysis paralysis. But it's not just about learning new things; it also means engaging in the continuous practice required to master the skills and tools that are already part of your portfolio. The focus here is on doing. You master your skills by the repetition of practice. And you find new skills by picking them up and trying them. Encourage sharing and trying new ideas.
Practices and tools aren't good or bad on their own; it's how you use them that matters most. As you continue your DevOps journey, it's important to stay grounded in an understanding of what exact problem a given practice or tool solves for you. The Three Ways provide a practical framework for taking the core DevOps values and effectively implementing specific processes and tools in alignment with them.
5 Key Methodologies of DevOps
One of the first methodologies was coined by Alex Honor and it's called people over process over tools. In short, it recommends identifying who's responsible for a job function first, then defining the process that needs to happen around them, and then selecting and implementing the tool to perform that process. It seems somewhat obvious, but engineers, and sometimes over-zealous tech managers swayed by a salesperson, are usually awfully tempted to do the reverse: buy a tool first and work back up the chain from there.
The second methodology is continuous delivery. It's such a common methodology that some people even wrongly equate it with DevOps. In short, it's the practice of coding, testing, and releasing software frequently, in really small batches, so that you can improve the overall quality and velocity.
Third up is lean management. It consists of using small batches of work, work-in-progress limits, feedback loops, and visualization. The State of DevOps research showed that lean management practices led to both better organizational outputs, including system throughput and stability, and to less burnout and greater employee satisfaction at the personal level.
The fourth methodology is change control. In 2004, the book Visible Ops came out. Its research demonstrated a direct correlation between operational success and control over changes in your environment. But there are a lot of old-school, heavyweight change control processes out there that do more harm than good. That's what was really great about Visible Ops: it describes a light and practical approach to change control, emphasizing eliminating fragile artifacts, creating a repeatable build process, managing dependencies, and creating an environment of continual improvement.
Fifth and final methodology, infrastructure as code. One of the major realizations of modern operations is that systems can and should be treated like code. System specifications should be checked into source control and go through code review, an automated build, and automated tests, and then we can automatically create real systems from the spec and manage them programmatically.
With this kind of programmatic system, we can compile and run and kill and run systems again, instead of creating hand-crafted permanent fixtures that we maintain manually over time. We end up treating servers like cattle, not pets. These five key methodologies can help you start in on your tangible implementation of DevOps.
10 practices for DevOps success:
None of them are universally good or required to do DevOps, but here are 10 that we've both seen used and they should at least get you thinking.
Practice number 10, incident command system. Bad things happen to our services. In IT, we call these things incidents. There are a lot of old-school incident management processes that seem to only apply to really large-scale incidents. But we've learned that real life is full of a mix of small incidents with only an occasional large one.
One of my favorite presentations I ever saw at a conference was Brent Chapman's Incident Command for IT: What We Can Learn From the Fire Department. It explained how incident command works in the real world for emergency services, and how the same process can work for IT, for incidents both small and large. I've used ICS for incident response in a variety of shops to good effect. It's one of those rare processes that helps the practitioner, instead of inflicting more pain on them while they're already trying to fix a bad situation.
Practice number nine, developers on call. Most IT organizations have approached applications with the philosophy of, let's make something, and then someone else will be responsible for making sure it works. Needless to say, this hasn't worked out so well. Teams have begun putting developers on call for the service they created. This creates a very fast feedback loop. Logging and deployment are rapidly improved, and core application problems get resolved quickly, instead of lingering for years while making some network operations center person restart the servers as a work-around.
All right, practice number eight, status pages. Services go down, they have problems. It's a fact of life. The only thing that's been shown to increase customer satisfaction and retain trust during these outages is communication. Lenny Rachitsky, on his blog Transparent Uptime, was a tireless advocate for creating public status pages and communicating promptly and clearly with service users when an issue arises. Since then, every service I've run, public or private, has a status page that gets updated when there's an issue, so that users can be notified of problems, understand what's being done, and hear what you've learned from the problem afterwards.
All right, this brings us to practice number seven, blameless postmortems. Decades of research on industrial safety has disproven the idea that there's a single root cause for an incident, or that we can use human error as an acceptable reason for a failure. John Allspaw, CTO at Etsy, wrote an article called Blameless Postmortems and a Just Culture on how to examine these failures and learn from them while avoiding logical fallacies or relying on scapegoating to make ourselves feel better while making our real situation worse.
This brings us to practice number six, embedded teams. One of the classic DevOps starter problems is that the Dev team wants to ship new code and the Ops team wants to keep the service up, and inside of that there's an inherent conflict of interest. One way around this is to take the proverbial doctor's advice of "don't do that." Some teams re-organize to embed an operations engineer on each development team and make the team responsible for all its own work, instead of throwing requests over the wall into some queue for those other people to do. This allows both disciplines to closely coordinate with one goal: the success of the service.
Practice number 5, the cloud. The DevOps love of automation and desire for infrastructure as code has met a really powerful ally in the cloud. The most compelling reason to use cloud technologies is not cost optimization; it's that cloud solutions give you an entirely API-driven way to create and control infrastructure. This allows you to treat your systems infrastructure exactly as if it were any other program component. As soon as you can conceive of a new deployment strategy or disaster recovery plan or the like, you can try it out without waiting on anyone. The cloud approach to infrastructure can make your other DevOps changes move along at high velocity.
Alright, let's move on to practice number 4, Andon cords. Frequently in a DevOps environment you're releasing quickly. Ideally, you have automated testing that catches most issues, but tests aren't perfect. Enter the Andon cord. This is an innovation originally used by Toyota on its production line: a physical cord, like the stop request cord on a bus, that anyone on the line is empowered to pull to stop ship because they saw some problem. It forms a fundamental part of their quality control system to this day. You can have the same thing in your software delivery pipeline; that way you can halt an upgrade or deployment to stop a bug from propagating downstream. We recently added an Andon cord to our build system after a developer released a bug to production that he knew about but didn't have a test to catch. Now, everyone can stop ship if they know something's not right.
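Mechanically, an Andon cord can be very simple. Here's a hypothetical sketch, not any particular CI product's feature: a shared stop-ship flag that any team member can set and that every pipeline stage checks before promoting a build.

```python
# Minimal Andon cord sketch: a shared "stop ship" flag that anyone can pull
# and that every pipeline stage must check before promoting a build.
import json
import pathlib

CORD_FILE = pathlib.Path("andon_cord.json")  # placeholder shared location

def pull_cord(reason, who):
    """Anyone on the team can halt promotions by pulling the cord."""
    CORD_FILE.write_text(json.dumps({"pulled": True, "reason": reason, "by": who}))

def release_cord():
    CORD_FILE.write_text(json.dumps({"pulled": False}))

def promote(build_id, stage):
    """Refuse to promote a build while the cord is pulled."""
    state = json.loads(CORD_FILE.read_text()) if CORD_FILE.exists() else {"pulled": False}
    if state.get("pulled"):
        raise RuntimeError(f"Stop ship: {state['reason']} (pulled by {state['by']})")
    print(f"Promoting build {build_id} to {stage}")

# Example: a developer spots a bad release in staging and pulls the cord.
pull_cord("login bug seen in staging", "dev-on-call")
try:
    promote("build-1234", "production")
except RuntimeError as err:
    print(err)  # Stop ship: login bug seen in staging (pulled by dev-on-call)
release_cord()
```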
Alright, let's move to practice number 3, dependency injection. In a modern application, connections to its external services, like databases or REST services, are the source of most of the run-time issues. There's a software design pattern called dependency injection, sometimes called inversion of control, that focuses on loosely coupled dependencies. In this pattern, the application shouldn't know anything about its external dependencies; instead, they're passed into the application at run time. This is very important for a well-behaved application in an infrastructure-as-code environment. Other patterns, like service discovery, can be used to reach the same goal.
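A minimal Python sketch of the pattern, with hypothetical class and method names: the application is handed its data store at run time instead of constructing it itself, which also makes it trivial to swap in a test double.

```python
# Dependency injection sketch: the app receives its external dependencies
# at run time rather than knowing how to construct them itself.

class PostgresOrders:
    def fetch(self, order_id):
        return {"id": order_id, "source": "postgres"}  # stand-in for a real query

class InMemoryOrders:
    def fetch(self, order_id):
        return {"id": order_id, "source": "test-double"}

class OrderService:
    def __init__(self, order_store):   # the dependency is injected here
        self._orders = order_store

    def describe(self, order_id):
        return self._orders.fetch(order_id)

# Production wiring (driven by config or service discovery):
svc = OrderService(PostgresOrders())
# Test wiring, no database needed:
test_svc = OrderService(InMemoryOrders())
print(test_svc.describe(42))
```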
Alright, let's move on to practice number 2, blue/green deployment. Software deployment works only one way, right? Traditionally, you take down the software on a server, upgrade it, and bring it back up, and you might even do this in a rolling manner so you can maintain system uptime. One alternate deployment pattern is called the blue/green deployment. Instead of testing a release in a staging environment, deploying it to a production environment, and hoping it works, you have two identical systems, blue and green. One is live and the other isn't. To perform an upgrade, you upgrade the offline system, test it, and then shift production traffic over to it. If there's a problem, you shift back. This minimizes both downtime from the change itself and the risk that the change won't work when it's deployed to production.
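A toy sketch of the cut-over logic, with hypothetical environment names and a stand-in smoke test: only the router's notion of "live" changes, and the old environment stays around as the rollback target.

```python
# Blue/green deployment sketch: two identical environments; only the router
# decides which one receives production traffic.

environments = {
    "blue":  {"version": "1.4.0", "healthy": True},
    "green": {"version": "1.5.0", "healthy": True},
}
live = "blue"  # environment currently serving production

def smoke_test(env_name):
    return environments[env_name]["healthy"]  # stand-in for real acceptance tests

def deploy_and_cut_over(new_version):
    global live
    idle = "green" if live == "blue" else "blue"
    environments[idle]["version"] = new_version   # upgrade the offline side only
    if not smoke_test(idle):
        print(f"Keeping traffic on {live}; {idle} failed its checks")
        return
    live = idle                                   # shift traffic; old side is the rollback target
    print(f"Traffic now on {live} ({new_version})")

deploy_and_cut_over("1.6.0")
```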
Alright, let's move to our last practice, practice number one, the chaos monkey. Old-style systems development theories stressed making each component of a system as highly available as possible, in order to achieve the highest possible uptime. But this doesn't work. A transaction that relies on a series of five 99%-available components will only be about 95% available, because math. Instead, you need to focus on making the overall system highly reliable, even in the face of unreliable components. Netflix is one of the leading companies in new-style technology management, and to ensure they were doing reliability correctly, they invented a piece of software called the Chaos Monkey. Chaos Monkey watches the Netflix system that runs in the Amazon cloud, and occasionally reaches out and trashes a server. Just kills it. This forces the developers and operators creating the systems to engineer resiliency into their services, instead of being lulled into the mistake of thinking that their infrastructure is always on. Well, that's our top 10 list of individual practices from across the various DevOps practice areas.
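The arithmetic behind "because math": availabilities of components in series multiply, so a request that must traverse all five components succeeds only if all five are up at once.

```python
# Series availability: five components at 99% each.
availability = 0.99 ** 5
print(f"{availability:.3f}")  # ~0.951, i.e. roughly 95% available end to end
```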
Kaizen emphasizes going to look at the actual place where the value is created or where the problem is: not reports about it, not metrics about it, not processes discussing it, not documentation about it, but actually going to look at it. In IT, that's where people are doing the work. In some cases it might even mean going to the code or the systems themselves to see what they are really doing. You may have heard the term management by walking around; this is actually an interpretation of Gemba.
The Kaizen process is simple. It's a cycle of plan, do, check, act. First you define what you intend to do and what you expect the results to be. Then you execute on that. Then you analyze the results and make any alterations needed. If the results of your newest plan are better than the previous baseline, that becomes the new baseline. And in any event, it might suggest subsequent plan, do, check, act cycles.
The simple process of plan, do, check, act does more than just generate improvements. It's also about teaching people critical thinking skills.
Another Kaizen tool used to get to the root of a problem is called the Five Whys. The idea behind it is simple. When there's a problem, you ask why it happened. And when you get an answer, you ask why that happened. You can repeat this as much as necessary, but five times is generally enough to exhaust the chain down to a root cause. When using the five whys, there are four things to keep in mind.
One is to focus on underlying causes, not symptoms.
Another is to not accept answers like "not enough time." We always work under constraints; we need to know what caused us to exceed those constraints.
Third, there will usually be forks in your five whys as multiple causes contribute to one element. A diagram called a fishbone diagram can be used to track all of these.
Fourth and finally, do not accept human error as a root cause. That always points to a process failure, or the lack of a process with sufficient safeguards.
A quote used in five whys activities is "people don't fail, processes do." And that's Kaizen.
So is DevOps exactly the same thing as Agile?
No. You can practice DevOps without Agile and vice versa, but it can, and frankly probably should, be implemented as an extension of Agile, since DevOps has such strong roots there. A DevOps manifesto, made with only very slight edits to the Agile Manifesto, captures the heart of it.
Replace "software" with "systems" and add operations to the list of stakeholders, and the result is a solid foundation to guide you in your DevOps journey.
Lean has become an important part of DevOps, especially of successful DevOps implementations. Lean, a systematic process for eliminating waste, was originally devised in the manufacturing world.
Mary and Tom Poppendieck identified Seven Principles of Lean that apply to software. Similar to the just-in-time credo of Lean manufacturing, and aligned with the Agile idea of being flexible, you try to move fast, but delay decisions, and you enhance feedback loops and group context. Building integrity in is an important precept and will inform the approaches to continuous integration and testing you'll hear about later in this course. So let's talk about waste.
The fundamental philosophy of Lean is about recognizing which activities you and your organization perform that add value to the product or service you produce, and which do not. Activities that don't add value are called waste. Lean recognizes three major types of waste, and they all have Japanese names: Muda, Muri, and Mura. Muda is the major form of waste, and it comes in two types. Type one is technically waste, but necessary for some reason, like compliance.
And type two, which is just plain wasteful. The Poppendiecks also define seven primary wastes that are endemic in software development. This includes bugs and delays, but it also includes spending effort on features that aren't needed.
A core Lean tool is Value Stream Mapping, where you analyze the entire pathway of value creation and understand exactly what value is added where, how long it takes, and where waste resides in that pathway. In Lean product development, that value stream is referred to as Concept to Cash: the entire pathway from the idea to its realization, including all the production and distribution required to get it to customers.
DevOps stands on the shoulders of giants, and there are a lot of concepts from the various ITSM and SDLC frameworks and maturity models that are worth learning. IT service management is a realization that service delivery is an integral part of the overall software development lifecycle, which should properly be managed from design to development to deployment to maintenance to retirement. ITSM is clearly one of DevOps's ancestors. ITIL was the first ITSM framework.
ITSM stands for Information Technology Service Management.
ITIL stands for IT Infrastructure Library; it originated as a UK government standard.
ITIL v3 recognizes five primary phases of the service lifecycle: service strategy, design, transition, operation, and continual service improvement. It has guidance for just about every kind of IT process you've ever heard of, from incident management to portfolio management to capacity management to service catalogs.
Infrastructure as code
It's a completely programmatic approach to infrastructure that allows us to leverage development practices for our systems. The heart of infrastructure automation, and the area best served with tools, is configuration management.
First, provisioning is the process of making a server ready for operation, including hardware, OS, system services, and network connectivity.
Deployment is the process of automatically deploying and upgrading applications on a server.
And then orchestration is the act of performing coordinated operations across multiple systems.
Configuration management itself is an overarching term dealing with change control of system configuration after initial provisioning. But it's also often applied to maintaining and upgrading applications and application dependencies. There are also a couple of important terms describing how tools approach configuration management.
Imperative, also known as procedural. This is an approach where commands desired to produce a state are defined and then executed.
And then there's declarative, also known as functional. This is an approach where you define the desired state, and the tool converges the existing system on the model.
Idempotent: this is the ability to execute the CM procedure repeatedly and end up in the same state each time.
And finally, self service, is the ability for an end user to kick off one of these processes without having to go through other people.
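To make declarative and idempotent concrete, here's a toy sketch, not any real tool's syntax: a resource describing desired state, and a converge function that is idempotent because re-running it changes nothing.

```python
# Toy declarative, idempotent configuration step: describe the desired state
# and converge toward it; running it again is a no-op.
import os

desired = {"path": "/tmp/app.conf", "content": "port=8080\n"}  # placeholder resource

def converge(resource):
    """Make reality match the declared state, doing nothing if it already does."""
    path, content = resource["path"], resource["content"]
    current = open(path).read() if os.path.exists(path) else None
    if current == content:
        print(f"{path}: already in desired state")
        return
    with open(path, "w") as f:
        f.write(content)
    print(f"{path}: converged")

converge(desired)   # first run writes the file
converge(desired)   # second run changes nothing: idempotent
```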
Ubuntu's Juju is an open source example of modeling and controlling not only the infrastructure but also the services running on it, together. You may need to dynamically configure machines running in their environment. Using tools like Chef, Puppet, Ansible, Salt, or CFEngine can help you accomplish this.
These configuration management tools allow you to specify recipes for how a system should get built. This includes the OS dependencies, system configuration, accounts, SSL certs, and most importantly, your application. These tools have been on the market for some time, and many have become full-featured development environments. With Chef, you often use the generic Ruby linter, RuboCop, and the Chef-specific linter, Foodcritic, to get code hygiene coverage. Unit testing is also possible with a tool like ChefSpec, and full integration testing is done with Test Kitchen, which runs a full converge with test harnesses and hooks into ServerSpec test suites.
Chef, Puppet, and their peers can also take the place of a searchable CMDB. To do that, Chef runs a piece of software called Ohai on the systems; Ohai profiles the system and stores all the metadata about it in the Chef server. This works, but then your CMDB is only as up to date as the latest convergence, which most people run either hourly or on demand. It doesn't help as much with dynamic workloads, where state changes in seconds or minutes. etcd, ZooKeeper, and Consul are a few common tools for performing service discovery and state tracking across your infrastructure.
Containers generally don't use Chef or Puppet, as their configuration is generally handled with a simple text file called a Dockerfile. Containers have just enough software and OS to make the container work. A lot of the functionality of Chef and Puppet isn't relevant to containers and is more relevant to long-running systems. Docker Swarm, Google's Kubernetes, and Mesos are three popular platforms that do orchestration of containers. They allow you to bring in multiple hosts and run your container workload across all of them.
They handle the deployment, the orchestration, and the scaling. Since the container is the application, these solutions get to the fully automated level that Juju does for service solutions. Some hosted container services, like Rancher, Google Cloud Platform, or Amazon's ECS, take care of running hosts for your containers so you can focus just on your application.
Habitat is by the people that make Chef, and it bills itself as application automation. While Chef is more about configuring infrastructure, Habitat extends down into the application build cycle, and then into the application deploy cycle, bookending and complementing Chef.
Continuous Delivery
In continuous delivery, you have an application that is built automatically on every code commit. Unit tests are run, and the application is deployed into a production-like environment. Automated acceptance tests are also run, and the change either passes or fails testing minutes after it's checked in. With continuous delivery, code is always in a working state.
Continuous Integration is the practice of automatically building and unit testing the entire application frequently. Ideally on every source code check in.
Continuous Delivery is the additional practice of deploying every change to a production like environment, and performing automated integration and acceptance testing.
Continuous Deployment extends this to where every change goes through thorough enough automated testing that it's deployed automatically to production.
Six practices that we think are critical for getting continuous integration right:
- The first practice: all builds should pass the coffee test. The build should take less time than it takes to get a cup of coffee. The longer a build takes, the more people naturally wait until they have a larger batch of changes, which increases your work in progress.
- Second practice: commit really small bits. Seek to commit the smallest amount of code per commit. Small changes are much easier for everyone on the team to reason about. It also makes isolating failures way easier.
- Practice number three: don't leave the build broken. When you leave the build broken, you block delivery. I often suggest the team get together and make a pact: delay meetings or stop all other work until the build is fixed. This is purely a cultural practice, and how you handle broken builds sets the tone for the rest of your delivery culture.
- Practice number four: use a trunk-based development flow. It helps keep work down to a limited amount in progress, ensures the code is reviewed and checked in frequently, and reduces wasteful and error-prone rework, especially when you're trying to merge branches. There are two main practices people use when developing: branch-based and trunk-based. Trunk-based means that there are no long-running branches, and the trunk, also called master, is always integrated across all of the developers by its very nature. All developers work off trunk and commit back to trunk multiple times a day. Instead of keeping separate code branches, developers branch in code and use feature flags (see the sketch after this list).
- Practice number five: don't allow flaky tests; fix them. Otherwise you have no way to know whether you can really trust the build.
- Practice number six: the build should return a status, a log, and an artifact. The status should be a simple pass/fail or red/green. The build log is a record of all the tests run and all the results. The artifact should be uploaded and tagged with a build number. This adds trust and assures auditability and immutability of the artifacts.
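Here's what branching in code with a feature flag can look like: a minimal sketch with hypothetical flag and function names, where unfinished work merges to trunk but stays dark until the flag is flipped.

```python
# Feature flag sketch: unfinished work ships to trunk behind a flag instead
# of living on a long-running branch. Flag and function names are made up.
FLAGS = {"new_checkout_flow": False}   # typically loaded from config or a flag service

def legacy_checkout(cart):
    return f"legacy checkout for {len(cart)} items"

def new_checkout(cart):
    return f"new checkout for {len(cart)} items"

def checkout(cart):
    if FLAGS.get("new_checkout_flow"):
        return new_checkout(cart)      # merged to trunk but dark until the flag flips
    return legacy_checkout(cart)

print(checkout(["book", "mug"]))       # exercises the legacy path while the flag is off
```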
Five Best Practices for Continuous Delivery
In continuous integration, we discussed the artifacts that are created upon the successful completion of each build. - These artifacts shouldn't be rebuilt for staging, testing, and production environments. - Yeah, they should be built once and then used in all the environments. This way, you know that your testing steps are valid since they all use the same artifact. - Your artifacts also shouldn't be allowed to change along the way. They need to be stored and have permissions set in such a way that they're immutable. - In the continuous delivery pipeline that I built at my job, I set the permissions so that the CI system can only write the artifact to the artifact repository, and the deployment system that we call Deployer only has read access to the artifact.
- We want artifacts to be built once and immutable for two reasons. - First, it's going to create trust between the teams. When they're debugging an issue, you want the dev and the ops and the QA, all the teams to have confidence that the underlying bits didn't change underneath them between the different stages. - Yeah, and then a quick checksum can prove that you're all looking at the exact same artifact version. - Yeah, and the second reason is auditability. One of the great parts about building the continuous delivery pipeline is that you can trace specific code versions in source control to successful build artifacts to a running system.
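As a concrete example of that quick checksum comparison, here's a minimal sketch; the artifact file name is hypothetical.

```python
# Quick artifact checksum: every team can verify they are debugging the
# exact same build by comparing digests of the immutable artifact.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# print(sha256_of("myapp-1.4.0.tar.gz"))  # hypothetical artifact name
```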
Rebuilding or changing an artifact along the way would break your auditability. - Code is checked into a version control system. That commit triggers a build in your CI system. Once the build finishes, the resulting artifacts are published to a central repository. Next we have a deployment workflow to deploy those artifacts to a live environment that's as much of a copy of production as possible. You may call this environment CI, staging, test, or pre-prod. One of the reasons we move code to this environment is to do all the acceptance testing, smoke tests, and integration tests that are difficult to fully simulate on dev desktops or build servers.
This brings up another crucial point: your system needs to stop the pipeline if there's breakage at any point. - A human should be able to lock the CD pipeline using an Andon cord.
But even more importantly, the CD pipeline shouldn't allow progression from stage to stage without assurance that the last stage ran successfully. We mainly have two checks implemented. First, if there's any failure encountered in the deployment system, it locks up and notifies the whole team in chat. Second, each stage of the deployment audits the previous stage and checks not only that no errors occurred, but also that the system is in the expected state.
Deployments should also be idempotent; in other words, redeploying should leave your system in the same state. - Yeah, you can accomplish this by using an immutable packaging mechanism like Docker containers, or through a configuration management tool like Puppet or Chef. But this is another area where trust and confidence factor into your pipeline.
Trace a single code change through it and answer these two questions:
Are you able to audit a single change and trace it through the whole system? This is your auditability.
And how fast can you move that single change into production? This is your overall cycle time.
I encourage you to start recording metrics off your pipeline. Focus on cycle time, the measure of how long it takes a code check-in to pass through each of the steps in the process, all the way to production.
You know, another thing I like to do is to understand team flow. And you can do that by keeping a pulse on the team through tracking the frequency of deploys as they're happening. - That's right, and one way to improve those metrics is in how you perform QA.
Role of QA
Let's kick off with unit testing. These are the tests done at the lowest level of the language or framework that it supports. Let's say you have a calculator application. A function called add would take two numbers and, big surprise here, add them together. In unit testing, we would write a test inside the codebase to validate that function. The hallmarks of unit testing: it's usually the fastest testing available, it stubs out external dependencies with fake values, and it runs easily on the developer's machine.
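In Python, for example, that test might look like this: a minimal pytest-style sketch of the hypothetical add function, with no external dependencies so it runs instantly on a developer's machine.

```python
# calculator.py -- the code under test
def add(a, b):
    return a + b

# test_calculator.py -- a unit test at the language level
def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
```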
Code hygiene is the sum of the best practices from your development community for the particular language or framework that you're using. It is usually accomplished through linters, formatters, and you can also check for banned functions in your code.
Next is integration testing. It's technical testing similar to unit testing, but performed with all the app's components and dependencies operating in a test environment.
Next, we're lumping together a few: test-driven development (TDD), behavior-driven development (BDD), and acceptance-test-driven development (ATDD). They're all related movements in software development that focus on testing from an outside-in perspective. Each one varies slightly in its implementation, so even though we're lumping them together in the same category of tests, let's briefly cover what they mean.
TDD, test-driven development, is a development practice that starts with writing tests before you write any code. The flow with TDD is that you start with the desired outcome written as a test. You then write code to pass the test, and then you rinse, wash, and repeat. This flow encourages high feedback, and a bonus is that while the application is being written, a comprehensive test suite is also being developed.
Behavior-driven development, also called BDD, encourages the developer to work with the business stakeholder to describe the desired business functionality of the application, and expresses the tests in a DSL that is pretty close to the English language.
Infrastructure testing. This is the next test category and sometimes it can be a slow-running test. It involves starting up a host and running the configuration management code, running all the tests, and then turning it all off.
Another test category is performance testing. Basic performance testing should be part of everything that you do, but there are also dedicated performance tests that you want to have: load tests, stress tests, soak tests, and spike tests. All of these are really good candidates for out-of-band nightly runs, since they're usually long-running and consume a lot of resources.
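As a rough illustration of what a load test measures, here's a toy sketch; the URL and request count are placeholders, and real load tests use dedicated tools such as JMeter or ApacheBench.

```python
# Toy load test: hit an endpoint repeatedly and report latency percentiles.
import statistics
import time
import urllib.request

URL = "http://localhost:8080/health"   # hypothetical endpoint under test
timings = []

for _ in range(100):
    start = time.perf_counter()
    urllib.request.urlopen(URL).read()
    timings.append(time.perf_counter() - start)

timings.sort()
print(f"median: {statistics.median(timings) * 1000:.1f} ms")
print(f"p95:    {timings[int(len(timings) * 0.95)] * 1000:.1f} ms")
```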
The last category is security testing. It might be useful to think about this as a simulated attack test. Gauntlt lets you use BDD language when testing for security attacks from the outside in.
Testing is critical to get right if you want to be able to set up a continuous delivery pipeline. It's the only way you can trust that the changes you made won't break the system while keeping up a high rate of speed.
The six key phases of continuous delivery, and the tooling associated with them, are version control, CI systems, build, test, artifact repository, and deployment.
Version control is where we commit code changes and can view the entire history of all changes ever made.
It allows the developers to stay in sync with each other by treating each change as an independent layer in the code. - You know, most organizations today opt to use Git in the form of GitHub or Bitbucket, either as SaaS or sometimes as an on-prem enterprise version. These add additional collaboration and sharing benefits, often found in social media. You can kind of think of it as Facebook meets version control. Alright, next up is continuous integration. - Jenkins, being open source, is popular in many organizations. Its UI is a little bit difficult to navigate at times, but it has tons of community support and almost every provider integrates with it. There's also a commercial offering of it from CloudBees. Other options include GoCD, Bamboo, and TeamCity. - There's also been a good amount of adoption of continuous integration as a service from companies like Travis CI or CircleCI. Alright, now let's talk about build tools. Build tools are very language dependent. You might be going simply with Make, or Rake if you're using Ruby, but these just execute a consistent set of steps every time.
- Or you can take a workflow approach with Maven, which can allow you to run reproducible builds and tests from your developer desktop all the way to your CI system. - If you're doing testing on the front-end code, it's really popular to use Gulp. And, if you're building out infrastructure, something like Packer from HashiCorp. - Most development languages have unit testing built in, or there's a strong recommendation by the community for what to use, like JUnit for Java. - The same goes for code hygiene with linters and formatters. Like for Go, there's golint or gofmt, or for Ruby, there's RuboCop. - Integration testing is usually performed with test-driven frameworks or by using in-house scripts. - Testing frameworks and tools in this area include Robot, Protractor, and Cucumber. The cool thing about these is they let you express an outside-in approach to the code. - They can hook into Selenium for UI testing, or you can use Selenium on its own. If you end up doing a lot of acceptance testing for the front end, there's a great SaaS offering called Sauce Labs you can use.
- Let's say you're doing infrastructure testing. You'll probably be using tools like Kitchen CI for Chef, which actually creates new instances and runs a full convergence of the code to test. - Tools like ApacheBench or JMeter can help you do performance testing. - And you've got to add some security testing in there as well. Gauntlt and Mittn are two open-source outside-in testing tools. There are also tools that do code inspection, like the open-source tool Brakeman, and paid offerings from companies like Veracode. - One of the most important attributes of these tests is your ability to run them on the dev desktop prior to check-in, and not just rely on your CI pipeline.
Tools like Vagrant, Otto, and Docker Compose let you deploy and run your whole app locally, so you can run not just your unit tests, but also integration and acceptance tests, at any time. - Okay, once your code has been built and tested, the artifacts have to go somewhere. - Popular solutions like Artifactory, or its open source equivalent, Nexus, manage lots of different artifact formats. Or a specific output, like a Docker image, can be sent to Docker Hub or your internal Docker registry.
So finally, we get to deployment. Rundeck is a nice workflow-driven deployment option. It lets you define a job, put access permissions around it, and then automate a workflow across your systems. Deployment is a popular workflow for people to automate with it.
- Some people use their configuration management tooling for doing application deployment, and other people just write their own custom tooling. Some use commercial tools from folks like UrbanCode and ThoughtWorks. There's an open-source offering from Etsy called Deployinator. It provides a dashboard that lets you run your own deployment workflows. - Tools in this space change a lot. And like we said earlier, nobody has the exact same continuous delivery pipeline, and that's okay. - You know, when building out pipelines for different places I've been at, I try to think about what's the easiest thing I could build here. I try to really focus on the question: what's the minimum viable product for this portion of the pipeline? - That's right.
Reliability engineering
You may have heard the term Site Reliability Engineering. That's a term that Google popularized for this approach. Google has product teams support their own services until they reach a certain level of traffic and maturity. And even then, they have the development team handle 5% of the operational workload ongoing. - This keeps a healthy feedback loop in place that continually improves the product's operational abilities. There's even a new O'Reilly book called Site Reliability Engineering by some Google engineers that has a lot of good insights on this topic, especially for large web shops.
Site Reliability Engineering
Edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy
https://landing.google.com/sre/book.html
APM (application performance management) tools do distributed lightweight profiling across a whole architecture, and let you bring together timings and metrics to identify bottlenecks and slowdowns. You can run APM tools in production, and may have to if you can't reproduce a problem in staging. I've also found them to be even more useful in the development process: find the problems before you roll them out. In a distributed system, performance issues are often worse than straight-up failures.
Keeping a handle on performance issues, including running baselines on every single build in your CI pipeline, is critical to your service's health. There's way more to say about these topics. The general approach however, is make sure you have operational expertise incorporated into the development phase of your product, and that you design in performance and availability from the beginning.
Finally, you want to implement things to make maintenance easier. Our approach to reliability engineering is to complete the operations feedback loop back to development. This works best in a Lean fashion.
The main areas of monitoring are service performance and uptime, software component metrics, system metrics, app metrics, performance, and finally, security. Service performance and uptime monitoring is implemented at the very highest level of a service set or application. These are often referred to as synthetic checks, and they're synthetic because they're not real customers or real traffic. It's the simplest form of monitoring, answering the question: is it working?
The next area of monitoring is software component metrics. This is monitoring done on ports or processes, usually located on the host. This moves down a layer: instead of answering "is my service working," it's asking "is this particular host working."
The next area is a layer deeper: system metrics. These can be anything like CPU or memory usage. They are time-series metrics, and they get stored and graphed so you can look at them and answer the question: is this service or host or process functioning normally? Alright, next we get into application metrics.
Application metrics are telemetry from your application that gives you a sense of what your application is actually doing. A couple of examples: emitting how long a certain function call is taking, the number of logins in the last hour, or a count of all the error events that have happened.
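Here's a minimal sketch of emitting that kind of telemetry. It assumes a statsd-compatible collector listening on the default UDP port 8125; the metric names and the login function are hypothetical.

```python
# Emitting application telemetry: a timing for a function call and counters
# for logins and error events, sent over UDP in plain-text statsd format.
import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD = ("localhost", 8125)            # default statsd port; adjust for your setup

def emit(metric):
    sock.sendto(metric.encode(), STATSD)

def handle_login(user):
    start = time.perf_counter()
    try:
        ...                             # real login logic would go here
        emit("app.logins:1|c")          # count successful logins
    except Exception:
        emit("app.login_errors:1|c")    # count error events
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        emit(f"app.login_duration:{elapsed_ms:.1f}|ms")

handle_login("alice")
```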
Real user monitoring, also called RUM, it usually uses front-end instrumentation, for example, like a JavaScript page tag. It captures the performance observed by the users of the actual system. It's able to tell you what your customers are actually experiencing.
The last area is security monitoring. Attackers don't hack systems magically and emit a special packet that just takes everything down. It's a process, and there's enough digital exhaust created as an attack progresses that monitoring is possible, though sadly, it's often rare. Security monitoring includes several key areas. System: think of things like bad TLS/SSL settings, open ports and services, or other system configuration problems.
Application security: this is like knowing when XSS or SQL injection attempts are hitting your site. And custom events in the application: things like password resets, invalid logins, or new account creations. Whatever the source, each security event or log entry should capture:
- what happened,
- when it happened,
- where it happened,
- who was involved, and
- where the entity came from.
Logging can be used for lots of purposes, ranging from audit to forensics to troubleshooting, resource management, intrusion detection, and user experience.
One of the first goals in any environment is centralized logging. Sending all the logs, via syslog or store-and-forward agents, to a centralized location is key, but how you do it is important. With this in mind, there are five principles I'd like to cover.
First, do not collect log data that you're never planning to use. Second, retain log data for as long as it's conceivable that it could be used, or longer if prescribed by regulations; keeping too much drives up resource and maintenance costs and inhibits overall growth. Third, log all you can, but alert only on what you must respond to. Fourth, don't try to make your logging more available or more secure than your production stack. This is a very Lean approach: if you overbuild capacity, or in this case defense and availability, you have misallocated resources in your system. Logging should meet business needs, not exceed them. Fifth, logs change, as in their format or their messages; new versions of software bring changes. Creating a feedback loop that encourages everyone to take ownership of their logs in the centralized logging system is important.
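As an illustration of shipping logs to a central location via syslog, here's a minimal sketch using only the Python standard library; the logger name, messages, host, and port are placeholders for your own setup.

```python
# Forward application logs to a central syslog collector.
import logging
import logging.handlers

# Point the handler at your central syslog collector (host and port are
# placeholders; 514/UDP is the traditional syslog port).
handler = logging.handlers.SysLogHandler(address=("localhost", 514))
handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

log = logging.getLogger("myapp")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order 4815 shipped")         # collected centrally, but nobody is paged
log.error("payment gateway timeout")   # the kind of event worth alerting on
```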
Monitoring, metrics, and logging are three feedback loops that bring operations back into design. Remember the other processes that can create feedback loops too, such as the incident command system, blameless postmortems, and transparent uptime.
DevOps Monitoring Tools
Well, first, there's the rise of SaaS. Many new monitoring offerings are provided as a service. From simple endpoint monitoring like Pingdom, to system and metric monitoring like Datadog, Netuitive, Ruxit, and Librato, to full application performance management tools like New Relic and AppDynamics. These provide extremely fast onboarding and come well provided with many integrations for many modern technologies.
- And there's a whole category of open source tools, like statsd, Ganglia, Graphite, and Grafana. You can use these to collect large-scale distributed custom metrics. - That's right, and you can pull those and put them into a time-series database like InfluxDB or OpenTSDB to process them. - There are application libraries specifically designed to emit metrics into these, like the excellent metrics library from Coda Hale. - There are plenty of new open source monitoring solutions designed with more dynamic architectures in mind.
Icinga and Sensu are two solutions somewhat similar to Nagios in concept, and they can use the large existing set of Nagios plugins, but they have more modern UIs and are easier to update in an ephemeral infrastructure world. - You know, and containers have brought their own set of monitoring tools with them, like the open source tools Prometheus and Sysdig. - Log management has become a first-order part of the monitoring landscape. This started in earnest with Splunk, the first log management tool anyone ever wanted to use.
Yeah, and then it moved to SaaS with Sumo Logic, Logentries, and similar offerings, but it's come back around full circle as an excellent open source log management system has emerged, composed of Elasticsearch, Logstash, and Kibana; it's often referred to as the ELK stack. - PagerDuty and VictorOps are two sterling examples of SaaS incident management tools that help you holistically manage your alerting and on-call burden. - This means you don't have to rely on the scheduling and routing functionality built into the monitoring tools themselves.
Well, there's even an open source project called Flapjack, at flapjack.io, that can help you do that yourself, if you wish. - Statuspage.io provides status pages as a service. You may have seen some of these in use by some of your SaaS providers. In accordance with transparent uptime principles, services can publish their status and metrics to external or internal customers and allow them to subscribe to updates off these pages.
And a command dispatcher like Rundeck, SaltStack, or Ansible is a good part of your operational environment for purposes of runbook automation. This means running canned procedures across systems for convenience and a reduction in manual error.
The 10 best DevOps books you need to read
Number 10, Visible Ops. Visible Ops by Gene Kim is one of the bestselling IT books of all time. It boils down ITIL into four key practices that his research shows to bring high value to organizations through a Lean implementation of change control principles.
Number nine, Continuous Delivery. Continuous Delivery is the book on continuous delivery. It was written by David Farley and Jez Humble. This book is so chock full of practices and principles, along with common antipatterns, that it's really useful all along that journey.
Number eight, Release It! With an exclamation point. This book's premise is to design and deploy production-ready software, with an emphasis on production-ready. Release It! has given much of the industry a new vocabulary. Author Michael Nygard provides his design patterns for stability, security, and transparency. It won the Dr. Dobb's Jolt Award for productivity in 2008.
Alright, book number seven, Effective DevOps. It was written by Jennifer Davis and Katherine Daniels. This features lots of practical advice for organizational alignment in DevOps, and it makes sure to fit the cultural aspects alongside the tooling. I especially like the focus on culture with all the interesting case studies they did.
Number six, Lean Software Development: An Agile Toolkit. Mary and Tom Poppendieck authored this seminal work on bringing Lean concepts into software development, and exploring the benefit of value stream mapping and waste reduction. They explain the seven Lean principles applicable to software and cover a wide variety of conceptual tools, along with plenty of examples. This book is the single best introduction to the topic of Lean software.
Book five, Web Operations. This book is edited by John Allspaw, who gave the groundbreaking 10 Deploys a Day presentation at Velocity back in 2009. This book is a collection of essays from practitioners, ranging from monitoring to handling post-mortems to dealing with stability with databases. It also contains medical doctor Richard Cook's amazing paper How Complex Systems Fail.
Book four, The Practice of Cloud System Administration. This book, written by Tom Limoncelli, is a textbook on system administration topics that continues to be updated.
It has an entire section on DevOps, and if I was pinned down to recommend just one book to a sysadmin or ops engineer, this is probably the book I would choose. If your role is hands-on, read this book.
Alright, book three, The DevOps Handbook, subtitled "How to Create World-Class Agility, Reliability, and Security in Technology Organizations." This book is by Gene Kim, Jez Humble, Patrick Debois, and John Willis. It was under development for five years by these leaders of the DevOps movement, and it's the standard reference on DevOps.
Book two, Leading The Transformation. One book that deserves special mention for enterprises is Gary Gruver and Tommy Mouser's book, Leading The Transformation,
Applying Agile and DevOps Principles at Scale. This is a book for directors, VPs, CTOs, and anyone in charge of leading IT organizational change of any size. Gruver describes leading DevOps transformations at HP in the firmware division for printers, and at the retailer Macy's, both with incredible success.
Alright, our top book recommendation is The Phoenix Project. This is the bestselling book by Gene Kim, George Spafford, and Kevin Behr. It's a modern retelling of Goldratt's The Goal, in a novel format, and it walks you through one company's problems and their transformation using Lean and DevOps principles.