Site Reliability Engineering

AKA SRE. Posts related to optimization of services, reduction of downtime, best DevOps practices, etc.

ansible-exec: ansible-playbook wrapper for executing playbooks

August 26th, 2014 | Blog

Ansible is a great automation tool. We use it for server provisioning, application deployments, and running maintenance scripts. One problem it does have, however, is how inconvenient it is to run playbooks compared with regular shell scripts. Write and run enough Ansible playbooks, and eventually you’ll get tired of the repetitive typing.

Naught: Zero Downtime for Node.js Applications

March 22nd, 2014 | Blog

Service downtime is harmful to most technology businesses, especially those that require their services to be constantly available. Downtime has many causes, such as hardware failures and network issues. In today’s web-scale world, application deployment is one of the main causes. This is particularly common in organizations practicing Continuous Delivery, in which developers deploy their code at an unprecedented speed. Since there is always a good chance that new code contains errors, frequent application changes carry a high risk of service malfunction.

Easy Modeling of Distributed Production with Vagrant & Ansible

July 14th, 2014 | Blog

Modeling your production environment correctly is very important for development. Developers need to be able to run and test their code locally for the development process to be efficient, and this often requires setting up infrastructure that exists in production on their local machines. The basic solution is a simple Vagrant box containing all your infrastructure and application code, like the one we mentioned in our Devbox post.

Top 3 Takeaways from SREcon16

April 12th, 2016 | Blog

SREcon16 is a wrap, and our team had a blast at this year’s event! Both days were non-stop action: demos, discussions, and - of course - handing out our fair share of panda swag. Between the buzz on the floor and in the sessions, what topics were top of mind at this year’s show? Here are our three key takeaways:

Sam Kendall’s noisy alert problem

February 23rd, 2016 | Blog

Sam’s a father of two boys living in the bucolic LA suburb of West Covina. He’s a family-first guy who paints model military cargo planes for fun, makes award-winning paella, hates his commute, and loathes his phone between the hours of midnight and 4:00 AM.

Sam was a kid when he joined News Corp as a help desk analyst in 2000. More than 15 years later, he’s Sr. Director of IT, managing a growing team of 30 NOC engineers, sys admins, and DBAs. Over the years, he has received more promotions than Trump on his own Twitter feed by delivering results and never wavering from two core beliefs that influence everything he does: