Saturday, September 29, 2018

Cardiff through my lens


Full Album below

https://www.flickr.com/photos/annmj17/albums/72157701620468564

Wednesday, September 5, 2018

Random acts of kindness


Evidence shows that being kind to friends, family and strangers really does improve your mental and physical wellbeing. The Mental Health Foundation has put together some suggestions that you may wish to try throughout September.

At home and in your community
  • Call a friend that you haven’t spoken to for a while
  • Send a letter to your nan and grandad
  • Send flowers to a friend out of the blue
  • Offer to pick up some groceries for your elderly neighbour
  • Help a friend pack for a move
  • Send someone a handwritten thank you note
  • Offer to babysit for a friend
  • Walk your friend’s dog
  • Tell your family members how much you love and appreciate them
  • Help out at home with household chores
  • Check on someone you know who is going through a tough time
  • Help a friend get active
At work
  • Make a cup of tea for your colleagues
  • Get to know the new staff member
  • Lend your ear - listen to your colleague who is having a bad day
  • Say good morning
  • Bake a cake or healthy treat for your colleagues
  • Give praise to a colleague for something they’ve done well
In public places
  • Give up your seat to an elderly, disabled or pregnant person
  • Take a minute to help a tourist who is lost even though you are in a rush
  • Have a conversation with a homeless person
  • Help a mother carrying her pushchair down the stairs or hold the door for her
  • Let a fellow driver merge into your lane
  • Pick up some rubbish lying around in the street
  • Smile and say hello to people you may pass every day, but have never spoken to before

Thanks 
Ann

Wednesday, July 18, 2018

Cloud Types and Service Models


Some of the characteristics that define cloud computing include metered usage, where we pay only for those IT resources that we use in the cloud.

Another characteristic is resource pooling, where the cloud provider pools together all of the physical resources, like server hardware, storage and network equipment, and makes them available to cloud subscribers, otherwise known as tenants.

Another characteristic is that we should be able to access our cloud IT resources over a network, and in the case of a public cloud that means access from anywhere over the Internet.

Rapid elasticity is another characteristic so that we can quickly provision resources and deprovision them as required, and this is often done through a self-provisioning web portal.


A public cloud is one whose services are potentially accessible to all Internet users. We say potentially because there might be a requirement to sign up for an account or pay a subscription fee, but potentially it is available. A public cloud has worldwide geographic locations, and that's definitely the case with Amazon Web Services. The cloud provider is responsible for acquiring all of the hardware and making sure it's available for the IT services that they sell as cloud services to their customers.

A private cloud, on the other hand, is accessible only to a single organization and not to everybody over the Internet, because it runs on hardware that the organization owns and maintains. However, a private cloud still adheres to the same cloud characteristics as a public cloud: self-provisioned, rapidly elastic, pooled IT resources. It's private simply because it runs on hardware owned by the organization. The purpose of a private cloud is most apparent in larger government agencies and enterprises, where usage of IT resources can be tracked and then used for departmental chargeback.

A hybrid cloud is the best of both worlds. The two worlds we're talking about are the on-premises IT computing environment and the cloud computing environment. We have to consider that the migration of on-premises systems and data could potentially take a long time. So, for example, we might have data stored on-premises and in the cloud at the same time. And this is possible, for example, using the Amazon Web Services Storage Gateway, where we've got a cached copy of data available locally on the Gateway appliance on our on-premises network, but it's also replicating that data into the cloud. We might also, as another example, have a hardware VPN that links our on-premises environment to an Amazon Web Services Virtual Private Cloud, essentially a virtual network running in the cloud.

A community cloud serves needs that are shared across multiple tenants. For example, Amazon Web Services has a government cloud in the United States, which deals with things like sensitive data requirements and regulatory compliance. It's managed by US personnel and it's also FedRAMP compliant. FedRAMP, of course, is the Federal Risk and Authorization Management Program. Having these specific types of clouds available to a particular community of tenants, in this case government, is what's referred to as a community cloud.


Cloud computing service models.

 So what is a service model anyway? Well, as it applies to cloud computing, it really correlates to the type of cloud service that we would subscribe to. So let's think about IT components like virtual machines and databases and websites and storage. Each of these examples correlates to a specific type of cloud computing service model.

Let's start with Infrastructure as a Service, otherwise called IaaS. This includes things in Amazon Web Services like EC2 virtual machines, S3 cloud storage, and virtual networks, which are called VPCs (Virtual Private Clouds). That's core IT infrastructure, and so it's considered Infrastructure as a Service.
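To make the IaaS idea a bit more concrete, here is a minimal sketch using boto3, the AWS SDK for Python. It creates an S3 bucket and uploads an object; the bucket name and region are placeholders I've chosen for illustration, and AWS credentials are assumed to already be configured in your environment.

```python
import boto3

# Provision basic IaaS storage - no servers of our own to rack, power or patch.
s3 = boto3.client("s3", region_name="eu-west-1")

s3.create_bucket(
    Bucket="example-iaas-demo-bucket",  # hypothetical name; bucket names must be globally unique
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

s3.put_object(
    Bucket="example-iaas-demo-bucket",
    Key="hello.txt",
    Body=b"Stored on provider infrastructure, paid for only while we use it.",
)
```

Everything here is self-service and metered: the bucket exists seconds after the call, and we stop paying for the storage as soon as we delete it.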

Another type of cloud computing model is Platform as a Service, otherwise called PaaS. This deals with things like databases or even things like searching, such as the Amazon CloudSearch capability.

Software as a Service is called SaaS, and this is how we would deal with things like websites, or use Amazon Web Services WorkDocs, where we can work with office productivity documents like Excel and Word files in the cloud.

Security as a Service is called SECaaS. This is security delivered by a provider, so we're essentially transferring that risk out to a hosted solution. It comes in many forms: it could be spam or malware scanning done for email in the cloud, or an offering like AWS Shield in Amazon Web Services, whose purpose is protection against distributed denial of service (DDoS) attacks.


A DDoS occurs when an attacker has control of compromised machines, otherwise called zombies; a collection of these on a network is called a botnet. The attacker can issue commands to those machines so that they attack a victim host or an entire network, for example by flooding it with traffic and preventing legitimate traffic from reaching a legitimate website. In many cases these botnets are rented out by malicious users to the highest bidder, so for a fee an attacker could pay to use a botnet to bring down a network or a host. Luckily, with Amazon Web Services this can be mitigated using AWS Shield. DDoS protection mechanisms will often do things like analysing irregular traffic flows and blocking certain IP addresses.


Tuesday, July 10, 2018

Basic Agile Scrum Interview QA


AGILE

Agile software development refers to a group of software development methodologies based on iterative development, where requirements and solutions evolve through collaboration between self-organizing cross-functional teams. Agile methods or Agile processes generally promote a disciplined project management process that encourages frequent inspection and adaptation, a leadership philosophy that encourages teamwork, self-organization and accountability, a set of engineering best practices intended to allow for rapid delivery of high-quality software, and a business approach that aligns development with customer needs and company goals. Agile development refers to any development process that is aligned with the concepts of the Agile Manifesto. The Manifesto was developed by a group of seventeen leading figures in the software industry, and reflects their experience of what approaches do and do not work for software development. Read more about the Agile Manifesto.

SCRUM

Scrum is a subset of Agile. It is a lightweight process framework for agile development, and the most widely-used one.
  • A “process framework” is a particular set of practices that must be followed in order for a process to be consistent with the framework. (For example, the Scrum process framework requires the use of development cycles called Sprints, the XP framework requires pair programming, and so forth.)
  • “Lightweight” means that the overhead of the process is kept as small as possible, to maximize the amount of productive time available for getting useful work done.
The Scrum process is distinguished from other agile processes by specific concepts and practices, divided into the three categories of Roles, Artifacts, and Time Boxes. These and other terms used in Scrum are defined below. Scrum is most often used to manage complex software and product development, using iterative and incremental practices. Scrum significantly increases productivity and reduces time to benefits relative to classic “waterfall” processes. Scrum processes enable organizations to adjust smoothly to rapidly-changing requirements, and produce a product that meets evolving business goals. An agile Scrum process benefits the organization by helping it to

  • Increase the quality of the deliverables
  • Cope better with change (and expect the changes)
  • Provide better estimates while spending less time creating them
  • Be more in control of the project schedule and state

1. What is the duration of a scrum sprint?

Answer: Generally, the duration of a scrum sprint (scrum cycle) depends upon the size of the project and the team working on it. The team size may vary from 3 to 9 members. In general, a scrum sprint completes in 3-4 weeks. Thus, on average, the duration of a scrum sprint (scrum cycle) is 4 weeks. This type of sprint-based Agile scrum interview question is very common in an agile or scrum master interview.

2. What is Velocity?

Answer: The velocity question is generally posed to understand whether you have done some real work and are familiar with the term. Its definition, “Velocity is the rate at which a team progresses sprint by sprint,” should be enough. You can also add an important feature of velocity: it can't be compared between two different scrum teams.
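As a simple illustration of the arithmetic, here is a small Python sketch that averages velocity over a few completed sprints. The story-point numbers are entirely hypothetical.

```python
# Hypothetical sprint history: story points completed in each finished sprint.
completed_points = [21, 18, 24, 20]

velocity = sum(completed_points) / len(completed_points)
print(f"Average velocity: {velocity:.1f} story points per sprint")

# A plan for the next sprint would commit to roughly this many points -
# but only for this team; velocity should never be compared across teams.
```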

3. What do you know about impediments in Scrum? Give some examples of impediments.

Answer: Impediments are the obstacles or issues faced by a scrum team which slow down their speed of work. If something blocks the scrum team from getting work “Done”, it is an impediment. Impediments can come in any form. Some examples are:
  • Resource missing or sick team member
  • Technical, operational, organizational problems
  • Lack of a supportive management system
  • Business problems
  • External issues such as weather, war etc
  • Lack of skill or knowledge
While answering impediment-related agile scrum interview questions, remember that you may be asked how you would remove any of the impediments mentioned.

4. What is the difference and similarity between Agile and Scrum?

Answer: Difference between Agile and Scrum – Agile is a broad spectrum: it is a methodology used for project management, while Scrum is just one form of Agile that describes the process and its steps more concisely. Agile is a practice, whereas Scrum is a procedure for pursuing that practice.
The similarity between Agile and Scrum – Agile involves completing projects in steps, or incrementally, and the Agile methodology is considered iterative in nature. Being a form of Agile, Scrum is the same in this respect: it is also incremental and iterative.

5. What is increment? Explain.

Answer: This is one of the commonly asked agile scrum interview questions, and a quick answer can be given this way: an increment is the total of all the product backlog items completed during a sprint. Each increment includes the increments of all previous sprints, as it is cumulative. It must be available in a usable, releasable state, as it is a step towards your goal.

6. What is the “build-breaker”?

Answer: A build-breaker is a situation that arises when there is a bug in the software. Due to this sudden, unexpected bug, compilation stops, execution fails, or a warning is generated. The tester's responsibility is then to get the software back to a normal working state by removing the bug.

7. What do you understand by Daily Stand-Up?

Answer: You may well get an interview question about the daily stand-up, so what should the answer be? The daily stand-up is an everyday meeting (preferably held in the morning) in which the whole team meets for about 15 minutes to answer the following three questions –
  • What was done yesterday?
  • What is your plan for today?
  • Is there any impediment or block that restricts you from completing your task?
The daily stand-up is an effective way to motivate the team and make them set a goal for the day.

8. What do you know about Scrum ban?

Answer: Scrum-ban is a model for software development based on Scrum and Kanban. This model is specifically used for projects that need continuous maintenance, have frequent programming errors, or face sudden changes. It promotes completing each piece of work, whether a bug fix or a user story, in the minimum time.

Sunday, July 8, 2018

Potatoes, Eggs, and Coffee Beans

Once upon a time a daughter complained to her father that her life was miserable and that she didn’t know how she was going to make it. She was tired of fighting and struggling all the time. It seemed just as one problem was solved, another one soon followed.
Her father, a chef, took her to the kitchen. He filled three pots with water and placed each on a high fire. Once the three pots began to boil, he placed potatoes in one pot, eggs in the second pot, and ground coffee beans in the third pot.

He then let them sit and boil, without saying a word to his daughter. The daughter moaned and waited impatiently, wondering what he was doing.

After twenty minutes he turned off the burners. He took the potatoes out of the pot and placed them in a bowl. He pulled the boiled eggs out and placed them in a bowl.
He then ladled the coffee out and placed it in a cup. Turning to her, he asked, “Daughter, what do you see?”



“Potatoes, eggs, and coffee,” she hastily replied.

“Look closer,” he said, “and touch the potatoes.” She did and noted that they were soft. He then asked her to take an egg and break it. After pulling off the shell, she observed the hard-boiled egg. 

Finally, he asked her to sip the coffee. Its rich aroma brought a smile to her face.
“Father, what does this mean?” she asked.

He then explained that the potatoes, the eggs and the coffee beans had each faced the same adversity: the boiling water.

However, each one reacted differently.

The potato went in strong, hard, and unrelenting, but in boiling water, it became soft and weak.

The egg was fragile, with the thin outer shell protecting its liquid interior until it was put in the boiling water. Then the inside of the egg became hard.

However, the ground coffee beans were unique. After they were exposed to the boiling water, they changed the water and created something new.

“Which are you?” he asked his daughter. “When adversity knocks on your door, how do you respond? Are you a potato, an egg, or a coffee bean?”

Moral: In life, things happen around us, things happen to us, but the only thing that truly matters is what happens within us.

Which one are you?

Friday, July 6, 2018

Cloud Computing


Cloud computing is the on-demand delivery of compute power, database storage, applications, and other IT resources through a cloud services platform via the internet with pay-as-you-go pricing.

The best way to start is to compare it to traditional IT computing, where on-premises, on our own networks, we would at some point have a capital investment in hardware. Think of things like having a server room constructed, getting racks, and then populating those racks with equipment: telecom gear, routers, switches, servers, storage arrays, and so on. Then we have to account for powering that equipment. We also have to think about HVAC (heating, ventilation and air conditioning) to make sure we've got optimal environmental conditions that maximize the lifetime of our equipment. Then there's licensing: we have to license our software, install it, configure it and maintain it over time, including updates. So with traditional IT computing there is quite a large need for an IT staff to take care of all of our on-premises IT systems.

But with cloud computing, at least with public cloud computing, we are talking about hosted IT services. Things like servers and related storage, databases, and web apps can all be run on provider equipment that we don't have to purchase or maintain. In other words, we only pay for the services that are used. Another part of the cloud is self-provisioning, where, on demand, we can provision, for example, additional virtual machines or storage. We can also scale back, and that way we're saving money because we're only paying for what we use. With cloud computing, all of these self-provisioned services need to be available over a network.
In the case of public clouds, that network is the Internet.

But something to watch out for is vendor lock-in. When we start looking at cloud computing providers, we want to make sure that we've got a provider that won't lock us into a proprietary file format for instance. If we're creating documents using some kind of cloud-based software, we want to make sure that data is portable and that we can move it back on-premises or even to another provider should that need arise.

Then there is responsibility. This gets split between the cloud provider and the cloud consumer or subscriber, otherwise called a tenant. The degree of responsibility depends on the specific cloud service we're talking about, but bear in mind that with cloud computing services, more control means more responsibility. So if we need to be able to control the underlying virtual machines, that's fine, but then it's up to us to manage those virtual machines and make sure they're updated.

The hardware is the provider's responsibility. Things like power, physical data center facilities in which equipment is housed, servers, all that stuff. The software, depending on what we're talking about, could be split between the provider's responsibility and the subscriber's responsibility. For example, the provider might make a cloud-based email app available, but the subscriber configures it and adds user accounts, and determines things like how data is stored related to that mail service. Users and groups would be the subscriber's responsibility when it comes to identity and access management.

Working with data and, for example, determining if that data is encrypted when stored in the cloud, that would be the subscriber's responsibility. Things like data center security would be the provider's responsibility. Whereas, as we've mentioned, data security would be the subscriber's responsibility when it comes to things like data encryption. The network connection however is the subscriber's responsibility, and it's always a good idea with cloud computing, at least with public cloud computing, to make sure you've got not one, but at least two network paths to that cloud provider.

Amazon Web Services (https://aws.amazon.com/free/) manages its own data center facilities and is responsible for their security, as well as for physical hardware security like locked server racks. It's responsible for the configuration of the network infrastructure, as well as the virtualization infrastructure that hosts virtual machines.

The subscriber would be responsible for things like AMIs. An AMI, an Amazon Machine Image, is essentially a blueprint from which we create virtual machine instances; we choose the AMI when we build a new virtual machine. We, as a subscriber, would also be responsible for the applications we run in virtual machines, the configuration of those virtual machines, setting up credentials to authenticate to the virtual machines, and also for our data stores and our data at rest and in transit.
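A minimal sketch of that subscriber side, using boto3: launching an instance from an AMI and later deprovisioning it. The AMI ID, key pair name and region are placeholders; credentials are assumed to be configured already.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Launch one virtual machine instance from an AMI (the "blueprint").
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID chosen by the subscriber
    InstanceType="t2.micro",
    KeyName="my-keypair",              # placeholder key pair - subscriber-managed credentials
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id}")

# Later, deprovision it so we stop paying for it (rapid elasticity in action).
ec2.terminate_instances(InstanceIds=[instance_id])
```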

So what is managed by AWS customers? Data, applications, the operating system running in a virtual machine, firewall configurations and encryption, depending on what we're configuring. What's managed by Amazon Web Services are the underlying foundation services: the compute servers and the hypervisor servers that virtual machines run on. The cloud also has a number of characteristics. Just because you're running virtual machines, for instance, doesn't mean that you have a cloud computing environment.

A cloud is defined by resource pooling. So, we've got all this IT infrastructure pooled together that can be allocated as needed. Rapid elasticity means that we can quickly provision or de-provision resources as we need. And that's done through an on-demand self-provisioned portal, usually web-based. Broad network access means that we've got connectivity available to our cloud services. It's always available. And measured service means that it's metered, much like a utility, in that we only pay for those resources that we've actually used. So, now we've talked about some of the basic characteristics of the cloud and defined what cloud computing is.



Sunday, July 1, 2018

The Elephant Rope

As a man was passing the elephants, he suddenly stopped, confused by the fact that these huge creatures were being held by only a small rope tied to their front leg. No chains, no cages. It was obvious that the elephants could, at any time, break away from their bonds, but for some reason they did not.

He saw a trainer nearby and asked why these animals just stood there and made no attempt to get away. “Well,” the trainer said, “when they are very young and much smaller we use the same size rope to tie them and, at that age, it’s enough to hold them. As they grow up, they are conditioned to believe they cannot break away. They believe the rope can still hold them, so they never try to break free.”

The man was amazed. These animals could at any time break free from their bonds but because they believed they couldn’t, they were stuck right where they were.
Like the elephants, how many of us go through life hanging onto a belief that we cannot do something, simply because we failed at it once before?

Failure is part of learning; we should never give up the struggle in life.

Tuesday, June 19, 2018

Devops

What is DevOps?

DevOps is not a framework or a workflow. It's a culture that is overtaking the business world. DevOps ensures collaboration and communication between software engineers (Dev) and IT operations (Ops). With DevOps, changes make it to production faster. Resources are easier to share. And large-scale systems are easier to manage and maintain.




DevOps replaces the model where you have one team that writes the code, another team to test it, yet another team to deploy it, and still another team to operate it.






A few roles in a DevOps team:


  • DevOps Evangelist - the DevOps leader who is responsible for the success of all the DevOps processes and people.
  • Code Release Manager – essentially a project manager that understands the agile methodology. They are responsible for overall progress by measuring metrics on all tasks.
  • Automation Expert – responsible for finding the proper tools and implementing the processes that can automate any manual tasks.
  • Quality Assurance or Experience Assurance – not to be confused with someone who just finds and reports bugs. Responsible for the user experience, and ensures that the final product has all the features in the original specifications.
  • Software Developer/Tester – the builder and tester of code that ensures each line of code meets the original business requirements.
  • Security Engineer -  with all the nefarious operators out there you need someone to keep the corporation safe and in compliance. This person needs to work closely with everyone to ensure the integrity of corporate data.

What does DevOps do for you and why would you want to practice it? 

Well, the first reason is that it's been shown to be effective in improving both IT and business outcomes. Puppet Labs' 2015 State of DevOps survey indicated that teams using DevOps practices deployed changes 30 times more frequently, with 200 times shorter lead times. And instead of that resulting in quality issues, they had 60 times fewer failures and recovered from issues 168 times faster than other organizations.

The second reason is that it makes your daily life easier. High tech is a very interrupt-driven, high-pressure exercise in firefighting that can often lead to personal and professional burnout. We've found that the DevOps approach reduces unplanned work, increases friendly relationships between coworkers, and reduces stress on the job. Collaboration among everyone participating in delivering software is a key DevOps tenet.


DevOps core values: CAMS


The CAMS model was created by DevOps pioneers John Willis and Damon Edwards. It stands for Culture, Automation, Measurement, and Sharing.
CAMS has become the model set of values used by many DevOps practitioners. Patrick Debois is often referred to as the godfather of DevOps, since he coined the term, but he likes to say that DevOps is a human problem.

What is culture? Culture's a lot more than ping pong tables in the office, or free food in the company cafeteria.

Culture is driven by behavior. Culture exists among people with a mutual understanding of each other and where they're coming from. Early on in IT organizations, we split teams into two major groups: Development, who were charged with creating features, and Operations, who were charged with maintaining stability. Walls formed around these silos due to their differing goals. Today, after this pattern has had a long time to metastasize, these groups don't speak the same language and they don't have mutual understanding. Changing these underlying behaviors and assumptions is how you can drive change in your company's culture.

This brings us to the 'A' in CAMS: 'Automation'. The first thing people usually think about when they think of DevOps is automation. In the early days of DevOps, some people applied the term to anybody who was using Chef or Puppet or CFEngine. But part of the point of CAMS is to bring balance back into how we think about it. DevOps is not just about automated tooling.



People and process have got to come first. Damon Edwards expressed this as "people over process over tools". All of that said, automation is a critical part of the DevOps journey. Once you begin to understand your culture, you can create a fabric of automation that allows you to control your systems and your applications. Automation is the accelerator that gets you all the other benefits of DevOps, so you really want to prioritize automation as your primary approach to the problem.

This brings us to the 'M' in CAMS, which stands for 'Measurement'. One of the keys to a rational approach to our systems is the ability to measure them. Metrics tell you what's happening and whether the changes we've made have improved anything. There are two major pitfalls with metrics: first, sometimes we choose the wrong metrics to watch, and second, sometimes we fail to incentivize them properly. Because of this, DevOps strongly advises you to measure key metrics across the organization. Look for things like MTTR (mean time to recovery) or cycle time. Look at costs, revenue, even something like employee satisfaction. All of these are part of generating holistic insight across your system. These metrics help engage the team in the overall goal, and it's common to see them shared internally, or even exposed externally to customers.
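As a small sketch of what measuring something like MTTR can look like, here is a Python snippet that averages recovery time over a set of hypothetical incident records. The timestamps are invented for illustration; in practice they would come from your monitoring or ticketing system.

```python
from datetime import datetime

# Hypothetical incident records: (detected, recovered) timestamps.
incidents = [
    (datetime(2018, 7, 2, 9, 15), datetime(2018, 7, 2, 9, 47)),
    (datetime(2018, 7, 9, 14, 0), datetime(2018, 7, 9, 16, 30)),
    (datetime(2018, 7, 15, 22, 5), datetime(2018, 7, 15, 22, 20)),
]

# Minutes from detection to recovery for each incident.
recovery_minutes = [(end - start).total_seconds() / 60 for start, end in incidents]

mttr = sum(recovery_minutes) / len(recovery_minutes)
print(f"MTTR: {mttr:.0f} minutes across {len(incidents)} incidents")
```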

Speaking of sharing, that brings us to the 'S' in CAMS. Sharing ideas and problems is the heart of collaboration, and it's also really at the heart of DevOps. In DevOps, expect to see a high premium placed on openness and transparency. This drives kaizen, the Japanese practice of continuous improvement.

CAMS that's 'Culture, Automation, Measurement, and Sharing'. They're the four fundamental and mutually reinforcing values to bring to a DevOps implementation. They're the "why" behind many of the more-specific techniques that we're going to cover later in this course. We really want to take these values to heart because the rest of your DevOps journey is going to be about trying to realize them in your organization.

DevOps principles: The three ways

The most respected set of principles is called The Three Ways.
This model was developed by Gene Kim, author of "Visible Ops" and "The Phoenix Project," and Mike Orzen, author of "Lean IT." The three ways they propose are systems thinking, amplifying feedback loops, and a culture of continuous experimentation and learning.

The first way, systems thinking, tells us to focus on the overall outcome of the entire pipeline in our value chain. It's easy to make the mistake of optimizing one part of that chain at the expense of overall results. When you're trying to optimize performance in an application, for example, increasing performance or system resources in one area can cause the bottleneck to move, sometimes to an unexpected place.

Adding more application servers, for example, can overwhelm a database server with connections and bring it down. You have to understand the whole system to optimize it well. The same principle applies to IT organizations. A deployment team might establish processes to make their own work go smoothly and their productivity numbers look good, but those same changes could compromise the development process and reduce the organization's overall ability to deliver software. This overall flow is often called "concept to cash". If you write all the software in the world but you can't deliver it to a customer in a way they can use, you lose. The split between development and operations has often been the place where the flow from concept to cash goes wrong. Use systems thinking as guidance when defining success metrics and evaluating the outcome of changes.
The second way, amplifying feedback loops, is all about creating, shortening, and amplifying feedback loops between the parts of the organization that are in the flow of that value chain. A feedback loop is simply a process that takes its own output into consideration when deciding what to do next. The term originally comes from engineering control systems. Short, effective feedback loops are the key to productive product development, software development, and operations. Effective feedback is what drives any control loop designed to improve a system. Use amplifying feedback loops to guide you when you're creating multi-team processes, visualizing metrics, and designing delivery flows.


The third way reminds us to create a work culture that allows for both continuous experimentation and learning. You and your team should be open to learning new things, and the best route to that is actively trying them out to see what works and what doesn't work, instead of falling into analysis paralysis. But it's not just about learning new things; it also means engaging in the continuous practice required to master the skills and tools that are already part of your portfolio. The focus here is on doing: you master your skills by the repetition of practice, and you find new skills by picking them up and trying them. Encourage sharing and trying new ideas.

It's how you use them that matters most. As you continue your DevOps journey, it's important to stay grounded in an understanding of what exact problem a given practice or tool solves for you. The Three Ways provide a practical framework to take the core DevOps values and effectively implement specific processes and tools in alignment with them. 

5 Key Methodologies of DevOps are,



One of the first methodologies was coined by Alex Honor and it's called "people over process over tools". In short, it recommends identifying who's responsible for a job function first, then defining the process that needs to happen around them, and then selecting and implementing the tool to perform that process. It seems somewhat obvious, but engineers, and sometimes over-zealous tech managers under the sway of a salesperson, are usually awfully tempted to do the reverse: buy a tool first and work back up the chain from there.


The second methodology is continuous delivery. It's such a common methodology that some people even wrongly equate it with DevOps. In short, it's the practice of coding, testing, and releasing software frequently, in really small batches, so that you can improve the overall quality and velocity.

Third up is lean management. It consists of using small batches of work, work-in-progress limits, feedback loops and visualization. Studies have shown that lean management practices lead to better organizational outputs, including system throughput and stability, as well as less burnout and greater employee satisfaction at the personal level.

The fourth methodology is change control. In 2004, the book Visible Ops came out. Its research demonstrated a direct correlation between operational success and control over changes in your environment. There are a lot of old-school, heavyweight change control processes out there that do more harm than good, and that's what was really great about Visible Ops: it describes a light and practical approach to change control, focused on eliminating fragile artifacts, creating a repeatable build process, managing dependencies and creating an environment of continual improvement.
The fifth and final methodology is infrastructure as code. One of the major realizations of modern operations is that systems can and should be treated like code. System specifications should be checked into source control and go through a code review, then a build and automated tests, and then we can automatically create real systems from the spec and manage them programmatically.

With this kind of programmatic system, we can compile, run, kill, and run systems again, instead of creating hand-crafted permanent fixtures that we maintain manually over time. We end up treating servers like cattle, not pets. These five key methodologies can help you start on your tangible implementation of DevOps.

10 practices for DevOps success:

None of them are universally good or required to do DevOps, but here are 10 that we've both seen used and they should at least get you thinking. 

Practice number 10: the incident command system. Bad things happen to our services; in IT, we call these incidents. There are a lot of old-school incident management processes that seem to only apply to really large-scale incidents, but real life is full of a mix of small incidents with only an occasional large one.
One of my favorite conference presentations was Brent Chapman's Incident Command for IT: What We Can Learn From the Fire Department. It explained how incident command works in the real world for emergency services, and how the same process can work for IT, for incidents both small and large. I've used ICS for incident response in a variety of shops to good effect. It's one of those rare processes that helps the practitioner, instead of inflicting more pain on them while they're already trying to fix a bad situation.

Practice number nine: developers on call. Most IT organizations have approached applications with the philosophy of "let's make something, and then someone else will be responsible for making sure it works". Needless to say, this hasn't worked out so well. Teams have begun putting developers on call for the service they created. This creates a very fast feedback loop: logging and deployment are rapidly improved, and core application problems get resolved quickly, instead of lingering for years while some network operations center person restarts the servers as a workaround.

All right, practice number eight: status pages. Services go down, they have problems; it's a fact of life. The only thing that's been shown to increase customer satisfaction and retain trust during these outages is communication. Lenny Rachitsky's blog, Transparent Uptime, tirelessly advocated creating public status pages and communicating promptly and clearly with service users when an issue arises. Since then, every service I've run, public or private, has had a status page that gets updated when there's an issue, so that users can be notified of problems, understand what's being done, and hear what was learned from the problem afterwards.

All right, this brings us to practice number seven: blameless postmortems. Decades of research on industrial safety has disproven the idea that there's a single root cause for an incident, or that we can use human error as an acceptable reason for a failure. John Allspaw, CTO at Etsy, wrote an article called Blameless PostMortems and a Just Culture on how to examine these failures and learn from them, while avoiding logical fallacies or relying on scapegoating to make ourselves feel better while making our real situation worse.

This brings us to practice number six: embedded teams. One of the classic DevOps starter problems is that the dev team wants to ship new code and the ops team wants to keep the service up, and inside of that there's an inherent conflict of interest. One way around this is to take the proverbial doctor's advice: don't do that. Some teams reorganize to embed an operations engineer on each development team and make the team responsible for all its own work, instead of throwing requests over the wall into some queue for other people to handle. This allows both disciplines to closely coordinate around one goal: the success of the service.

Practice number five: the cloud. The DevOps love of automation and the desire for infrastructure as code have met a really powerful ally in the cloud. The most compelling reason to use cloud technologies is not cost optimization; it's that cloud solutions give you an entirely API-driven way to create and control infrastructure. This allows you to treat your systems infrastructure exactly as if it were any other program component. As soon as you can conceive of a new deployment strategy, or a disaster recovery plan, or the like, you can try it out without waiting on anyone. The cloud approach to infrastructure can make your other DevOps changes move along at high velocity.

All right, let's move on to practice number four: Andon cords. Frequently in a DevOps environment you're releasing quickly. Ideally you have automated testing that catches most issues, but tests aren't perfect. Enter the Andon cord. This is an innovation originally used by Toyota on its production line: a physical cord, like the stop-request cord on a bus, that anyone on the line can pull to stop the production line because they saw a problem. It forms a fundamental part of Toyota's quality control system to this day. You can have the same thing in your software delivery pipeline, so that you can halt an upgrade or deployment and stop a bug from propagating downstream. We recently added an Andon cord to our build system, after a developer released a bug to production that he knew about but didn't have a test to catch. Now everyone can stop ship if they know something's not right.

All right, let's move to practice number three: dependency injection. In a modern application, connections to its external services, like databases or REST services, are the source of most run-time issues. There's a software design pattern called dependency injection, sometimes called inversion of control, which focuses on loosely coupled dependencies. In this pattern the application shouldn't know anything about its external dependencies; instead, they're passed into the application at run time. This is very important for a well-behaved application in an infrastructure-as-code environment. Other patterns, like service discovery, can be used to reach the same goal.

All right, let's move on to practice number two: blue/green deployment. Traditionally, software deployment works only one way: you take down the software on a server, upgrade it, and bring it back up, perhaps in a rolling manner so that you can maintain system uptime. One alternative deployment pattern is called blue/green deployment. Instead of testing a release in a staging environment, deploying it to production, and hoping it works, you have two identical systems, blue and green. One is live and the other isn't. To perform an upgrade, you upgrade the offline system, test it, and then shift production traffic over to it. If there's a problem, you shift back. This minimizes both downtime from the change itself and the risk that the change won't work when it's deployed to production.
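A minimal sketch of what the traffic shift can look like on AWS, using boto3 to repoint an Application Load Balancer listener at the green stack. The listener and target group ARNs are placeholders, and in practice you would run smoke tests against the green environment before flipping.

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")

# Placeholder ARNs for the load balancer listener and the green target group.
LISTENER_ARN = "arn:aws:elasticloadbalancing:region:account:listener/app/example/placeholder"
GREEN_TG_ARN = "arn:aws:elasticloadbalancing:region:account:targetgroup/green/placeholder"

# Point production traffic at the freshly upgraded (and already tested) green stack.
elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": GREEN_TG_ARN}],
)

# Rolling back is the same call with the blue target group's ARN.
```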

All right, let's move to our last practice, practice number one: the chaos monkey. Old-style systems development theories stressed making each component of a system as highly available as possible, in order to achieve the highest possible uptime. But this doesn't work: a transaction that relies on a series of five 99%-available components will only be about 95% available, because the availabilities multiply (0.99^5 ≈ 0.95). Instead, you need to focus on making the overall system highly reliable, even in the face of unreliable components. Netflix is one of the leading companies in new-style technology management, and to ensure they were doing reliability correctly, they invented a piece of software called the Chaos Monkey. Chaos Monkey watches the Netflix system that runs in the Amazon cloud and occasionally reaches out and trashes a server. Just kills it. This forces the developers and operators creating the systems to engineer resiliency into their services, instead of being lulled into the mistake of thinking that their infrastructure is always on. That's our top-10 list of individual practices from across the various DevOps practice areas.
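To make the idea tangible, here is a toy chaos-monkey sketch in Python with boto3 that terminates one randomly chosen instance from an opted-in pool. The tag name is hypothetical, and this is only an illustration of the principle; never point anything like this at infrastructure you are not prepared to lose.

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Find running instances that have explicitly opted in to the chaos experiment.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},        # hypothetical tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"Chaos monkey terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```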

Building blocks of DevOps


Kaizen
Kaizen emphasizes going to look at the actual place where the value is created or where the problem is: not reports about it, not metrics about it, not processes discussing it, not documentation about it, but actually going to look at it. In IT, that's where people are doing the work. In some cases it might even mean going to the code or the systems themselves to see what they are really doing. You may have heard the term "management by walking around"; this is actually an interpretation of Gemba.



The Kaizen process is simple: a cycle of plan, do, check, act. First you define what you intend to do and what you expect the results to be. Then you execute on that. Then you analyze the results and make any alterations needed. If the results of your newest plan are better than the previous baseline, it becomes the new baseline, and in any event it might suggest a subsequent plan-do-check-act cycle.


The simple process of plan, do, check, act does more than just deliver value and generate improvements; it also teaches people critical-thinking skills.


Another Kaizen tool used to get to the root of a problem is called the Five Whys. The idea behind it is simple: when there's a problem, you ask why it happened, and when you get an answer, you ask why that happened. You can repeat this as much as necessary, but five times is generally enough to exhaust the chain down to a root cause. When using the Five Whys, there are four things to keep in mind.

One is to focus on underlying causes, not symptoms.
Another is to not accept answers like "not enough time". We always work under constraints; we need to know what caused us to exceed those constraints.
Third, there will usually be forks in your five whys, as multiple causes contribute to one element. A diagram called a fishbone diagram can be used to track all of these.
Fourth and finally, do not accept human error as a root cause. That always points to a process failure, or to the lack of a process with sufficient safeguards.


A quote used in Five Whys activities is "people don't fail, processes do". And that's Kaizen.


So is DevOps exactly the same thing as Agile?

No. You can practice DevOps without Agile and vice versa, but it can, and frankly probably should, be implemented as an extension of Agile, since DevOps has such strong roots in Agile. A DevOps manifesto, arrived at by making very slight edits to the Agile Manifesto, captures the heart of it.

Replace "software" with "systems" and add operations to the list of stakeholders, and the result is a solid foundation to guide you on your DevOps journey.


Lean has become an important part of DevOps, especially of successful DevOps implementations. Lean, a systematic process for eliminating waste, was originally devised in the manufacturing world.


Mary and Tom Poppendieck identified Seven Principles of Lean that apply to software. Similar to the just-in-time credo of Lean manufacturing, and aligned with the Agile idea of being flexible, you try to move fast but delay decisions, and you enhance feedback loops and group context. Building integrity in is an important precept, and it informs the approach to continuous integration and testing you'll hear about later in this course. So let's talk about waste.

The fundamental philosophy of Lean is about recognizing which activities you and your organization perform add value to the product or service you produce, and which do not. Activities that don't add value are called waste. Lean recognizes three major types of waste, and they all have Japanese names: Muda, Muri, and Mura. Muda is the major form of waste and it comes in two types. Type one is technically waste, but necessary for some reason, like compliance.

Type two is just plain wasteful. The Poppendiecks also define seven primary wastes that are endemic in software development. These include bugs and delays, but also effort spent on features that aren't needed.

This is where Value Stream Mapping comes in: you analyze the entire pathway of value creation and understand exactly what value is added where, how long it takes, and where waste resides in that pathway. In Lean product development, that value stream is referred to as concept to cash: the entire pathway from the idea to its realization, including all the production and distribution required to get it to customers.

DevOps stands on the shoulders of giants, and there are a lot of concepts from the various ITSM and SDLC frameworks and maturity models that are worth learning. IT service management is the realization that service delivery is a part of the overall software development lifecycle, which should properly be managed from design to development to deployment to maintenance to retirement. ITSM is clearly one of DevOps's ancestors. ITIL was the first ITSM framework.

ITSM stands for Information Technology Service Management.
ITIL stands for IT Infrastructure Library; ITIL originated as a UK government standard.

ITIL v3 recognizes four primary phases of the service lifecycle: service strategy, design, transition and operation. It has guidance for just about every kind of IT process you've ever heard of, from incident management to portfolio management to capacity management to service catalogs.



Infrastructure as code

It's a completely programmatic approach to infrastructure that allows us to leverage development practices for our systems. The heart of infrastructure automation, and the area best served by tools, is configuration management.


First, provisioning is the process of making a server ready for operation, including hardware, OS, system services, and network connectivity.

Deployment is the process of automatically deploying and upgrading applications on a server. 
And then orchestration is the act of performing coordinated operations across multiple systems.
Configuration management itself is an overarching term dealing with change control of system configuration after initial provisioning, but it's also often applied to maintaining and upgrading applications and application dependencies. There are also a couple of important terms describing how tools approach configuration management.
Imperative, also known as procedural, is an approach where the commands desired to produce a state are defined and then executed.
Declarative, also known as functional, is an approach where you define the desired state and the tool converges the existing system on that model.

Idempotent means the ability to execute the CM procedure repeatedly and end up in the same state each time.
And finally, self-service is the ability for an end user to kick off one of these processes without having to go through other people.
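The idempotent/declarative idea can be illustrated in plain Python, independent of any particular configuration management tool. This is only a sketch; the path is arbitrary and the "desired state" here is just a directory existing with certain permissions.

```python
import os

# Imperative and NOT idempotent: running this twice fails, because the
# directory already exists on the second run.
def create_app_dir_imperative():
    os.mkdir("/tmp/myapp")          # raises FileExistsError the second time

# Declarative in spirit, and idempotent: we describe the desired state
# ("this directory exists with these permissions") and converge on it,
# so running it once or fifty times ends in the same state.
def ensure_app_dir():
    os.makedirs("/tmp/myapp", mode=0o755, exist_ok=True)

if __name__ == "__main__":
    ensure_app_dir()
    ensure_app_dir()                # safe to repeat - same end state
```

Real CM tools apply the same principle to packages, services, users and files, converging a whole system on a declared model on every run.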



Ubuntu's Juju is an open source example of this approach, where not only the infrastructure but also the services running on it are modeled and controlled together. You may need to dynamically configure machines running in that environment; tools like Chef, Puppet, Ansible, Salt, or CFEngine can help you accomplish this.



These configuration management tools allow you to specify recipes for how a system should be built. This includes the OS dependencies, system configuration, accounts, SSL certs, and most importantly, your application. These tools have been on the market for some time, and many have become full-featured development environments. With Chef, you often use the generic Ruby linter, RuboCop, and the Chef-specific linter, Foodcritic, to get code hygiene coverage. Unit testing is also possible with a tool like ChefSpec, and full integration testing is done with KitchenCI, which runs a full converge with test harnesses, with hooks to Serverspec and test suites.

Chef, Puppet, and their peers can also take the place of a searchable CMDB. To do that, Chef runs a piece of software called Ohai on the systems; Ohai profiles the system and stores all the metadata about it in a Chef server. This works, but then your CMDB is only as up to date as the latest convergence, which most people run either hourly or on demand. It doesn't help as much with dynamic workloads, where state changes in seconds or minutes. etcd, ZooKeeper and Consul are a few common tools used to perform service discovery and state tracking across your infrastructure.


Containers generally don't use Chef or Puppet, as their configuration is usually handled with a simple text file called a Dockerfile. Containers have just enough software and OS to make the container work. A lot of the functionality of Chef and Puppet isn't relevant to containers; it's more relevant to long-running systems. Docker Swarm, Google's Kubernetes, and Mesos are three popular platforms that do orchestration of containers. They allow you to bring in multiple hosts and run your container workload across all of them.


They handle the deployment, the orchestration and the scaling. Since the container is the application, these solutions get to the fully automated level that Juju does for service solutions. Some container services, like Rancher, Google's Cloud Platform, or Amazon's ECS, take care of running hosts for your containers so you can focus just on your application.


Habitat is made by the people behind Chef, and it bills itself as application automation. While Chef is more about configuring infrastructure, Habitat extends down into the application build cycle and then into the application deploy cycle, bookending and complementing Chef.


Continuous Delivery 

In continuous delivery, you have an application that is built automatically on every code commit. Unit tests are run, and the application is deployed into a production-like environment. Automated acceptance tests are also run, and the change either passes or fails testing minutes after it's checked in. With continuous delivery, code is always in a working state.

Continuous Integration is the practice of automatically building and unit testing the entire application frequently, ideally on every source code check-in.



Continuous Delivery is the additional practice of deploying every change to a production like environment, and performing automated integration and acceptance testing.

Continuous Deployment extends this to where every change that passes the full suite of automated tests is deployed automatically to production.


Six practices that we think are critical for getting to continuous integration are:

  1. The first practice: all builds should pass the coffee test. The build should take less time than it takes to get a cup of coffee. The longer a build takes, the more people naturally wait until they have a larger batch of changes, which increases your work in progress.
  2. Second practice: commit really small bits. Seek to commit the smallest amount of code per commit. Small changes are much easier for everyone on the team to reason about, and they also make isolating failures much easier.
  3. Practice number three: don't leave the build broken. When you leave the build broken, you block delivery. I often suggest the team get together and make a pact to delay meetings or stop all other work until the build is fixed. This is purely a cultural problem, and how you handle broken builds sets the tone for the rest of your delivery culture.
  4. Use a trunk-based development flow. It helps keep work down to a limited amount of work in progress, ensures the code is reviewed and checked frequently, and reduces wasteful and error-prone rework, especially when you're trying to merge branches. There are two main practices people use when developing: branch-based and trunk-based. Trunk-based means that there are no long-running branches, and the trunk, also called master, is always integrated across all of the developers by its very nature. All developers work off trunk and commit back to trunk multiple times a day. Instead of keeping separate code branches, developers branch in code using feature flags.
  5. Don't allow flaky tests; fix them. Otherwise you have no way to know if you can really trust the build.
  6. The build should return a status, a log, and an artifact (see the sketch below). The status should be a simple pass/fail, or red/green. The build log is a record of all the tests run and all the results of the run. The artifact should be uploaded and tagged with a build number. This adds trust and assures the auditability and immutability of the artifacts.
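Here is a minimal sketch of a build step that returns those three things: a pass/fail status, a log file, and a numbered artifact. The paths, the build number, the use of pytest, and the existence of a src/ directory are all assumptions made for the sake of the example.

```python
import subprocess
import sys
import tarfile
from pathlib import Path

BUILD_NUMBER = "42"                                        # hypothetical build number

def build() -> int:
    # Run the unit tests and keep everything they printed as the build log.
    log_path = Path(f"build-{BUILD_NUMBER}.log")
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    log_path.write_text(result.stdout + result.stderr)

    if result.returncode != 0:                             # simple pass/fail status
        print(f"Build: FAIL (see {log_path})")
        return result.returncode

    # Package a build-numbered artifact only when the tests pass.
    artifact = Path(f"myapp-{BUILD_NUMBER}.tar.gz")
    with tarfile.open(artifact, "w:gz") as tar:
        tar.add("src", arcname="src")                      # assumes a src/ directory exists
    print(f"Build: PASS, artifact: {artifact}")
    return 0

if __name__ == "__main__":
    sys.exit(build())
```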

Five Best Practices for Continuous Delivery are


In continuous integration, we discussed the artifacts that are created upon the successful completion of each build. These artifacts shouldn't be rebuilt for staging, testing, and production environments; they should be built once and then used in all the environments. This way, you know that your testing steps are valid, since they all use the same artifact. Your artifacts also shouldn't be allowed to change along the way. They need to be stored, and have permissions set, in such a way that they're immutable. In the continuous delivery pipeline that I built at my job, I set the permissions so that the CI system can only write the artifact into the artifact repository, and the deployment system that we call Deployer only has read access to the artifact.
We want artifacts to be built once and immutable for two reasons. First, it creates trust between the teams. When they're debugging an issue, you want dev, ops and QA, all the teams, to have confidence that the underlying bits didn't change underneath them between the different stages, and a quick checksum can prove that everyone is looking at the exact same artifact version. The second reason is auditability. One of the great parts about building a continuous delivery pipeline is that you can trace specific code versions in source control to successful build artifacts to a running system.
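That quick checksum can be as simple as the sketch below: if dev, QA and ops all compute the same digest, they know they are looking at the exact same artifact. The filename is a placeholder.

```python
import hashlib
from pathlib import Path

def sha256_of(path: str) -> str:
    """Return the SHA-256 digest of a file, read in chunks to handle large artifacts."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of("myapp-42.tar.gz"))   # placeholder artifact name
```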

Rebuilding or changing an artifact along the way would break your auditability. The flow looks like this: code is checked into a version control system. That commit triggers a build in your CI system. Once the build finishes, the resulting artifacts are published to a central repository. Next, a deployment workflow deploys those artifacts to a live environment that is as close a copy of production as possible; you may call this environment CI, staging, test, or pre-prod. One of the reasons we move code to this environment is to do all the acceptance testing, smoke tests, and integration tests that are difficult to fully simulate on dev desktops or build servers. This brings up another crucial point: your system needs to stop the pipeline if there's breakage at any point, and a human should be able to lock the CD pipeline using an Andon cord.

But even more importantly, the CD pipeline shouldn't allow progression from stage to stage without assurance that the last stage ran successfully. We mainly implement two checks. First, if any failure is encountered in the deployment system, it locks up and notifies the whole team in chat. Second, each stage of the deployment audits the previous stage, checking not only that no errors occurred but also that the system is in the state it should be.
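One hedged way to picture that gate: each stage records its result, and the next stage refuses to start unless the previous one passed. The stage names and the in-memory results store are invented for the sketch; a real pipeline would keep this state in the CI/CD system itself.

    results = {}  # stage name -> passed?

    def record(stage: str, passed: bool) -> None:
        results[stage] = passed

    def require(previous_stage: str) -> None:
        # Refuse to progress unless the previous stage ran and passed.
        if not results.get(previous_stage, False):
            raise RuntimeError("Pipeline locked: " + previous_stage + " did not pass")

    record("build", True)
    require("build")        # ok, the staging deploy may proceed
    record("staging", False)
    # require("staging")    # would raise and lock the pipeline before production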

In other words, redeploying should leave your system in the same state. You can accomplish this by using an immutable packaging mechanism like Docker containers, or through a configuration management tool like Puppet or Chef. This is another area where trust and confidence factor into your pipeline.

Trace a single code change through the pipeline and answer these two questions:

Are you able to audit that single change and trace it through the whole system?
And how fast can you move that single change into production? That speed is your overall cycle time.

I encourage you to start recording metrics off your pipeline. Focus on cycle time: the measure of how long it takes a code check-in to pass through each step of the process, all the way to production.
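Cycle time is just the elapsed time from check-in to production deploy, so the arithmetic is trivial; the timestamps below are made up.

    from datetime import datetime

    commit_time = datetime.fromisoformat("2018-07-18 09:12:00")
    deploy_time = datetime.fromisoformat("2018-07-18 11:47:00")

    cycle_time = deploy_time - commit_time
    print(cycle_time)                          # 2:35:00
    print(cycle_time.total_seconds() / 3600)   # roughly 2.6 hours

Recording this for every change, and watching the trend over time, is what turns the pipeline into something you can improve deliberately.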

You know, another thing I like to do is understand team flow, and you can do that by keeping a pulse on the team through tracking the frequency of deploys as they happen. One way to improve those metrics is in how you perform QA.

Role of QA



Let's kick off with unit testing. These are tests done at the lowest level that the language or framework supports. Let's say you have a calculator application: a function called add takes two numbers and, big surprise here, adds them together. In unit testing, we write a test inside the codebase to validate that function. The hallmarks of unit testing: it's usually the fastest testing available, it stubs out external dependencies with fake values, and it runs easily on the developer's machine.
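For that calculator example, a unit test might look like this; the sketch uses Python's built-in unittest module, and the add function is defined inline just to keep it self-contained.

    import unittest

    def add(a, b):
        # The function under test: the smallest unit the language supports.
        return a + b

    class TestAdd(unittest.TestCase):
        def test_adds_two_numbers(self):
            self.assertEqual(add(2, 3), 5)

        def test_handles_negatives(self):
            self.assertEqual(add(-1, 1), 0)

    if __name__ == "__main__":
        unittest.main()  # fast, no external dependencies, runs on the dev machine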

Code hygiene is the sum of the best practices from your development community for the particular language or framework you're using. It's usually enforced through linters and formatters, and you can also check for banned functions in your code.

Next is integration testing. It's technical testing similar to unit testing, but performed with all the app's components and dependencies operating in a test environment.

Next, we're lumping a few together: test-driven development (TDD), behavior-driven development (BDD), and acceptance-test-driven development (ATDD). They're all related movements in software development that focus on testing from an outside-in perspective. Each one varies slightly in its implementation, so even though we're lumping them into the same category of tests, let's briefly cover what they mean.

TDD, test-driven development, is a development practice that starts with writing tests before you write any code. The flow is that you start with the desired outcome written as a test, you then write code to pass the test, and then you rinse, wash, and repeat. This flow encourages fast feedback, and a bonus is that while the application is being written, a comprehensive test suite is also being developed.

Behavior-driven development (BDD) encourages the developer to work with the business stakeholders to describe the desired business functionality of the application, and expresses the tests in a DSL that is pretty close to plain English.
Next, we have acceptance-test-driven development (ATDD). It builds on TDD and BDD and involves finding scenarios from the end users' perspective, writing automated tests with examples of these use cases, and then running them repeatedly while the code is being developed.
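Tools like Cucumber express those scenarios in a near-English DSL; as a rough stand-in, the same given/when/then shape can be sketched in plain Python. The lockout scenario and the helper below are invented purely for illustration.

    def attempt_login(account, password):
        # Minimal behaviour, just enough to make the scenario executable.
        if password != account["password"]:
            account["failed"] += 1
            if account["failed"] >= 3:
                account["locked"] = True

    def test_account_locks_after_three_failed_logins():
        # Given a registered user
        account = {"password": "secret", "failed": 0, "locked": False}

        # When they enter the wrong password three times
        for _ in range(3):
            attempt_login(account, "wrong-password")

        # Then the account is locked
        assert account["locked"] is True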


Infrastructure testing. 
This is the next test category and sometimes it can be a slow-running test. It involves starting up a host and running the configuration management code, running all the tests, and then turning it all off.

Another test category is performance testing. Basic performance testing should be part of everything you do, but there are also dedicated performance tests you'll want to have: load tests, stress tests, soak tests, and spike tests. All of these are great for out-of-band nightly runs; they're usually long-running and they consume a lot of resources.
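Dedicated tools are the usual choice here, but the core idea is just timing repeated requests and looking at the latency distribution. A standard-library-only sketch, where the URL and sample count are placeholders and far smaller than any real load test would use:

    import statistics
    import time
    import urllib.request

    URL = "https://example.com/health"   # placeholder endpoint
    SAMPLES = 50                         # a real load test would use far more

    timings = []
    for _ in range(SAMPLES):
        start = time.perf_counter()
        urllib.request.urlopen(URL, timeout=10).read()
        timings.append((time.perf_counter() - start) * 1000)  # milliseconds

    timings.sort()
    print("median ms:", statistics.median(timings))
    print("~p95 ms:", timings[int(len(timings) * 0.95) - 1])  # rough 95th percentile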

The last category is security testing. It might be useful to think of this as simulated-attack testing. A tool like Gauntlt lets you use BDD-style language when testing for security attacks from the outside in.

Testing is critical to get right if you want to be able to set up a continuous delivery pipeline. It's the only way you can trust that the changes you make won't break the system while you keep up a high rate of speed.

The six key phases of continuous delivery, and the tooling associated with them, are: version control, CI systems, build, test, an artifact repository, and deployment.
Version control is where we commit code changes and can view the entire history of all changes ever made.
It allows the developers to stay in sync with each other by treating each change as an independent layer in the code. Most organizations today opt to use Git in the form of GitHub or Bitbucket, either as SaaS or sometimes as an on-prem enterprise version. These add collaboration and sharing features often found in social media; you can kind of think of it as Facebook meets version control. Next up is continuous integration. Jenkins, being open source, is popular in many organizations. Its UI can be a little difficult to navigate at times, but it has tons of community support and almost every provider integrates with it. There's also a commercial offering of it from CloudBees. Other options include GoCD, Bamboo, and TeamCity, and there's been a good amount of adoption of continuous integration as a service from companies like Travis CI and CircleCI. Now let's talk about build tools. Build tools are very language dependent. You might go simply with Make, or Rake if you're using Ruby; these just execute a consistent set of steps every time.
Or you can take a workflow approach with Maven, which lets you run reproducible builds and tests from your developer desktop all the way to your CI system. If you're testing front-end code, Gulp is really popular, and if you're building out infrastructure, something like Packer from HashiCorp. Most development languages have unit testing built in, or there's a strong recommendation by the community for what to use, like JUnit for Java. The same goes for code hygiene with linters and formatters: for Go there's golint and gofmt, or if you're using Ruby, there's RuboCop. Integration testing is usually performed with test-driven frameworks or with in-house scripts. Testing frameworks and tools in this area include Robot, Protractor, and Cucumber; the cool thing about these is that they let you express an outside-in approach to the code. They can hook into Selenium for UI testing, or you can use Selenium on its own. If you end up doing a lot of acceptance testing for the front end, there's a great SaaS offering called Sauce Labs you can use.

Let's say you're doing infrastructure testing. You'll probably use tools like Kitchen CI for Chef, which actually creates new instances and runs a full convergence of the code to test. Tools like ApacheBench or JMeter can help you do performance testing. And you'll want to add some security testing in there as well: Gauntlt and Mittn are two open-source outside-in testing tools, there are tools that do code inspection like the open-source Brakeman, and there are also paid offerings from companies like Veracode. One of the most important attributes of all these tests is your ability to run them on the dev desktop prior to check-in, and not just rely on your CI pipeline.

Tools like Vagrant, Otto, and Docker Compose let you deploy and run your whole app locally, so you can run not just your unit tests but also integration and acceptance tests at any time. Once your code has been built and tested, the artifacts have to go somewhere. Popular solutions like Artifactory, or its open-source equivalent Nexus, manage lots of different artifact formats. Or a specific output, like a Docker image, can be sent to Docker Hub or your internal Docker registry.

So finally, we get to deployment. Rundeck is a nice workflow-driven deployment option. It lets you define a job, put permissions around it, and then automate a workflow across your systems; deployment is a popular workflow people use it to automate.

Some people use their configuration management tooling for application deployment, while others write their own custom tooling. Some use commercial tools from folks like UrbanCode and ThoughtWorks. There's also an open-source offering from Etsy called Deployinator; it provides a dashboard that lets you run your own deployment workflows. Tools in this space change a lot, and like we said earlier, nobody has the exact same continuous delivery pipeline, and that's okay. When building out pipelines at the different places I've been, I try to think about the easiest thing I could build, and really focus on the question: what's the minimum viable product for this portion of the pipeline?

Reliability engineering
In engineering, reliability describes the ability of a system or component to function under stated conditions for a specified period of time. In IT this includes availability, performance, security, and all the other factors that allow your service to actually deliver its capabilities to the users.


You may have heard the term Site Reliability Engineering; that's the term Google popularized for this approach. Google has product teams support their own services until they reach a certain level of traffic and maturity, and even then the development team handles 5% of the operational workload on an ongoing basis. This keeps a healthy feedback loop in place that continually improves the product's operational abilities. There's even an O'Reilly book called Site Reliability Engineering, written by Google engineers, that has a lot of good insights on this topic, especially for large web shops.
Site Reliability Engineering
Edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy
https://landing.google.com/sre/book.html 
APM (application performance management) tools do distributed lightweight profiling across a whole architecture, and let you bring together timings and metrics to identify bottlenecks and slowdowns. You can run APM tools in production, and may have to if you can't reproduce a problem in staging, but I've found them even more useful in the development process: find the problems before you roll them out. In a distributed system, performance issues are often worse than straight-up failures.


Keeping a handle on performance issues, including running baselines on every single build in your CI pipeline, is critical to your service's health. There's much more to say about these topics, but the general approach is to make sure you have operational expertise incorporated into the development phase of your product, and that you design in performance and availability from the beginning.

Finally, you want to implement things that make maintenance easier. Our approach to reliability engineering is to complete the operations feedback loop back to development, and this works best in a Lean fashion.

Let's take a look at the six areas of monitoring that we suggest measuring.


Service performance and uptime, software component metrics, system metrics, application metrics, performance, and finally, security. Service performance and uptime monitoring is implemented at the very highest level of a service or application. These are often referred to as synthetic checks, and they're synthetic because they're not real customers or real traffic. It's the simplest form of monitoring, answering the question: is it working?
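A synthetic check can be as small as a scripted probe hitting the service's public endpoint from outside. A minimal Python sketch, where the URL is a placeholder:

    import urllib.request

    def is_it_working(url: str) -> bool:
        # A synthetic check: not a real customer, just a scripted probe
        # answering the simplest monitoring question.
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.status == 200
        except OSError:
            return False

    print(is_it_working("https://example.com/"))  # placeholder URL

Run something like this on a schedule from outside your own network and alert when it returns False.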
The next area of monitoring is software component metrics. This is monitoring done on ports or processes, usually located on the host; it moves down a layer, so instead of answering "is my service working", it asks "is this particular component working".
The next area is a layer deeper: system metrics. These can be anything like CPU or memory usage. They're time-series metrics that get stored and graphed so you can look at them and answer the question: is this service, host, or process functioning normally? Next we get into application metrics.
Application metrics are telemetry from your application that give you a sense of what it's actually doing: for example, emitting how long a certain function call takes, the number of logins in the last hour, or a count of all the error events that have happened.
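These are usually emitted through a client library, but the statsd wire format is simple enough to sketch with a raw UDP socket; the host, port, and metric names below are assumptions for the example.

    import socket

    STATSD_HOST, STATSD_PORT = "localhost", 8125  # assumed statsd address

    def emit(metric: str) -> None:
        # Fire-and-forget UDP, which is why metric emission is cheap enough
        # to sprinkle throughout application code.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(metric.encode(), (STATSD_HOST, STATSD_PORT))
        sock.close()

    emit("logins:1|c")                 # counter: one login happened
    emit("checkout.duration:212|ms")   # timer: a call took 212 milliseconds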

Real user monitoring, also called RUM, usually uses front-end instrumentation, for example a JavaScript page tag. It captures the performance observed by actual users of the system, so it can tell you what your customers are really experiencing.
The last area is security monitoring. Attackers don't hack systems magically by emitting a special packet that just takes everything down; it's a process, and there's enough digital exhaust created along the attack progression that monitoring is possible, though sadly it's often rare. Security monitoring includes four key areas. System: think of things like bad TLS/SSL settings, open ports and services, or other system configuration problems.

Application security: this is knowing when things like XSS or SQL injection attempts are happening on your site. Custom events in the application: things like password resets, invalid logins, or new account creations.

Logs are a great DevOps monitoring tool because while they can be consumed by an operations team, they can also be fed back to developers to provide meaningful feedback. Before getting into it, remember the 5 Ws of logging.
  1. What happened, 
  2. when it happened, 
  3. where it happened, 
  4. who was involved, and 
  5. where the entity came from. 
Logging can be used for lots of purposes, ranging from audit to forensics to troubleshooting, resource management, intrusion detection, and user experience.

One of the first goals in any environment is centralized logging. Sending all the logs, via syslog or store-and-forward agents, to a centralized location is key, but how you do it is important. With this in mind, there are five principles I'd like to cover.

First, do not collect log data that you're never planning to use. Second, retain log data for as long as it's conceivable that it could be used, or longer if prescribed by regulations; keeping too much drives up cost for resources and maintenance and inhibits overall growth. Third, log all you can, but alert only on what you must respond to. Fourth, don't try to make your logging more available or more secure than your production stack. This is a very Lean approach: if you overbuild capacity, or in this case defense and availability, you have misallocated resources in your system. Logging should meet business needs, not exceed them. Fifth, logs change, as in their format or their messages; new versions of software bring changes, so create a feedback loop that encourages everyone to take ownership of their logs in the centralized logging system.
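In Python, for instance, shipping application logs to a central syslog collector is only a few lines with the standard logging module; the collector hostname and the log message are placeholders.

    import logging
    import logging.handlers

    logger = logging.getLogger("myapp")
    logger.setLevel(logging.INFO)

    # Forward records over the network to a central collector instead of
    # leaving them scattered across individual hosts.
    handler = logging.handlers.SysLogHandler(address=("logs.internal.example", 514))
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

    # Try to cover the 5 Ws in the message itself.
    logger.info("password_reset requested user=ann source_ip=203.0.113.7")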

Monitoring, metrics, and logging are three feedback loops that bring operations back into design. Remember the other practices that can create feedback loops too, such as the incident command system, blameless postmortems, and transparent uptime.

DevOps Monitoring Tools

Well, first, there's the rise of SaaS. Many new monitoring offerings are provided as a service, from simple endpoint monitoring like Pingdom, to system and metric monitoring like Datadog, Netuitive, Ruxit, and Librato, to full application performance management tools like New Relic and AppDynamics. These provide extremely fast onboarding and come with integrations for many modern technologies.

And there's a whole category of open-source tools, like statsd, Ganglia, Graphite, and Grafana, that you can use to collect large-scale distributed custom metrics. You can pull those metrics into a time-series database like InfluxDB or OpenTSDB to process them, and there are application libraries specifically designed to emit metrics into these, like the excellent metrics library from Coda Hale. There are also plenty of newer open-source monitoring solutions designed with more dynamic architectures in mind.

Icinga and Sensu are two solutions somewhat similar to Nagios in concept, and they can use the large existing set of Nagios plugins, but they have more modern UIs and are easier to update in an ephemeral infrastructure world. Containers have brought their own set of monitoring tools with them, like the open-source tools Prometheus and Sysdig. Log management has also become a first-order part of the monitoring landscape. This started in earnest with Splunk, the first log management tool anyone ever wanted to use.
Then it moved to SaaS with Sumo Logic, Logentries, and similar offerings, but it's come back around full circle as an excellent open-source log management system has emerged, composed of Elasticsearch, Logstash, and Kibana and often referred to as the ELK stack. PagerDuty and VictorOps are two sterling examples of SaaS incident management tools that help you holistically manage your alerting and on-call burden. This means you don't have to rely on the scheduling and routing functionality built into the monitoring tools themselves.

There's even an open-source project called Flapjack, at flapjack.io, that can help you do that yourself, if you wish. Statuspage.io provides status pages as a service; you may have seen some of these in use by your SaaS providers. In accordance with transparent uptime principles, services can expose their status and metrics to external or internal customers and allow them to subscribe to updates from these pages.

A command dispatcher like Rundeck, SaltStack, or Ansible is also a good part of your operational environment for runbook automation. That means running canned procedures across systems for convenience and to reduce manual error.

The 10 best DevOps books you need to read

Number 10, Visible Ops. Visible Ops by Gene Kim is one of the bestselling IT books of all time. It boils down ITIL into four key practices that his research shows to bring high value to organizations through a Lean implementation of change control principles.

Number nine, Continuous Delivery. Continuous Delivery is the book on continuous delivery, written by David Farley and Jez Humble. It's so chock full of practices and principles, along with common antipatterns, that it's genuinely useful all along the journey.

Number eight, Release It! (with an exclamation point). This book's premise is to design and deploy production-ready software, with an emphasis on production-ready. Release It! has given much of the industry a new vocabulary; author Michael Nygard provides his design patterns for stability, security, and transparency. It won a Dr. Dobb's Jolt Productivity Award in 2008.

Alright, book number seven, Effective DevOps, written by Jennifer Davis and Katherine Daniels. It features lots of practical advice for organizational alignment in DevOps and makes sure to fit the cultural aspects alongside the tooling. I especially like the focus on culture and all the interesting case studies they did.

Number six, Lean Software Development: An Agile Toolkit. Mary and Tom Poppendieck authored this seminal work on bringing Lean concepts into software development and exploring the benefits of value stream mapping and waste reduction. They explain the seven Lean principles applicable to software and cover a wide variety of conceptual tools, along with plenty of examples. This book is the single best introduction to the topic of Lean software.

Book five, Web Operations. This book is edited by John Allspaw, who gave the groundbreaking 10 Deploys a Day presentation at Velocity back in 2009. It's a collection of essays from practitioners, ranging from monitoring to handling post-mortems to dealing with database stability. It also contains medical doctor Richard Cook's amazing paper, How Complex Systems Fail.

Book four, The Practice of Cloud System Administration. This book, written by Tom Limoncelli, is a textbook on system administration topics that continues to be updated.
It has an entire section on DevOps, and if I were pinned down to recommend just one book to a sysadmin or ops engineer, this is probably the book I would choose. If your role is hands-on, read this book.

Alright, book three, The DevOps Handbook, subtitled "How to Create World-Class Agility, Reliability, and Security in Technology Organizations." This book is by Gene Kim, Jez Humble, Patrick Debois, and John Willis. It was under development for five years by these leaders of the DevOps movement, and it's the standard reference on DevOps.

Book two, Leading the Transformation. One book that deserves special mention for enterprises is Gary Gruver and Tommy Mouser's Leading the Transformation: Applying Agile and DevOps Principles at Scale. This is a book for directors, VPs, CTOs, and anyone in charge of leading IT organizational change of any size. Gruver describes leading DevOps transformations at HP, in the firmware division for printers, and at the retailer Macy's, both with incredible success.

Alright, our top book recommendation is The Phoenix Project, the bestselling book by Gene Kim, George Spafford, and Kevin Behr. It's a modern retelling of Goldratt's The Goal, in novel format, and it walks you through one company's problems and their transformation to Lean and DevOps principles.