Tuesday, December 6, 2016

Techpost - The Test-Maintain Loop saved me during a Jenkins Server infrastructure upgrade

Technical Post (just a warning for the non-technical follower).

As some of you know, earlier this year I created a first version of an Ansible Playbook Test framework and put it on Ansible Galaxy.  

Yesterday, this tool saved me from creating a disaster on my Production Jenkins CI instance on AWS (Amazon Web Services). 

I'd like to share that story as an example for others of what is possible if you include testing as part of your Infrastructure as Code strategy.

How I was saved yesterday:

(some details removed for security and simplicity reasons).

Execute the Playbook...
ansible-playbook -i Inventory/CASPAR/staging/ CASPAR_setup_jenkinsservers.yml -u ubuntu --ask-vault-pass --ask-sudo-pass
The result is a new AWS instance with Java 8 loaded, the appropriate users and groups set up, and a base Jenkins image loaded. The key here is SETUP (the minimum needed to get a server into MAINTAIN status). At this point the server is in "Staging".

Then the repetitive playbook is executed. This one runs on a regular basis to keep servers up-to-date in all environments. In my case, I execute it for Dev, Staging and Production (the same playbook is applied to all 3 environments).

ansible-playbook -i Inventory/CASPAR/staging/ CASPAR_maintain_jenkinsservers.yml -u ubuntu --ask-vault-pass --ask-sudo-pass

Note: The only difference in the name is the word "maintain". 

Note: To run the same playbook in Prod or Dev, I simply run the same playbook against /CASPAR/prod/  ( Ansible's dynamic Inventory auto-finds the appropriate machines based on tags )
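
As a rough illustration of the note above, the loop below prints (rather than runs) the maintain command for each environment. It is only a sketch: the maintain_command helper is hypothetical, the dev and prod directory names are assumed to sit beside staging under Inventory/CASPAR/, and the flags are copied from the commands shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: builds the maintain command for one environment.
# Only the inventory directory changes; the playbook itself stays the same.
maintain_command() {
    env_name="$1"
    echo "ansible-playbook -i Inventory/CASPAR/${env_name}/ CASPAR_maintain_jenkinsservers.yml -u ubuntu --ask-vault-pass --ask-sudo-pass"
}

# Dry run: print one command per environment (directory names are assumptions).
for env_name in dev staging prod; do
    maintain_command "$env_name"
done
```

Printing the commands first makes it easy to confirm that the inventory path is the only thing that differs between environments.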

Then, while the machine is still in Staging, the following test playbook is executed....
ansible-playbook -i Inventory/CASPAR/staging/ CASPAR_test_jenkinsservers.yml -u ubuntu --ask-vault-pass --ask-sudo-pass

The "test" playbook executes all predefined tests to ensure the server is in good shape. 
If there are no errors, all that needs to happen is for the machine to be re-tagged in AWS from Staging to Prod. The next time the maintain playbook is executed, it applies any Production-specific changes (different Gateway addresses, different database connection string, etc.) using the SAME playbook as before.

Then of course, the TEST playbook is executed again (one final test).
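
The re-tagging step can be done in the AWS console or scripted. A minimal sketch using the AWS CLI follows. The retag_command helper, the tag key Environment, the value prod and the instance ID in the usage note are all placeholders, since this post does not say which tag the dynamic inventory matches on. The helper only prints the command so that it can be reviewed before anything is changed.

```shell
#!/bin/sh
# Hypothetical helper: prints the AWS CLI call that would re-tag an instance.
# The tag key "Environment" is an assumption; use the key your inventory matches on.
retag_command() {
    instance_id="$1"
    new_env="$2"
    echo "aws ec2 create-tags --resources ${instance_id} --tags Key=Environment,Value=${new_env}"
}
```

For example, retag_command i-0123456789abcdef0 prod prints the create-tags call for that (made-up) instance ID. Because create-tags overwrites the value of an existing key, the same call moves a machine from Staging to Prod, after which the maintain and test playbooks are run against the prod inventory.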

The Test playbook can now also serve as a Governance check playbook, executed by the same team or externally where needed. It provides a means for safer, more comfortable changes, with a built-in governance component if needed.

Yesterday, when I ran the tests in Staging I received an error about missing packages.

ansible-playbook -i Inventory/CASPAR/staging/ CASPAR_test_jenkinsservers.yml -u ubuntu --ask-vault-pass --ask-sudo-pass > test.log
grep "TEST_PASSED" test.log
grep "TEST_FAILED" test.log

I received the message...

 "msg": "TEST_FAILED: package xxxxxx expected present "

(The real package name is replaced with xxxxxx in this post for security reasons.)

If I had converted the host to Production, it would have caused big problems in my production environment.
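
The two grep calls above can be folded into a small gate so that a failed test blocks promotion automatically. This is a sketch, not part of the published framework: check_test_log is a hypothetical helper, and it assumes the test playbook writes the TEST_PASSED and TEST_FAILED markers shown above into the log.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the log is readable
# and contains no TEST_FAILED marker.
check_test_log() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # grep -c prints the number of matching lines (0 when there are none);
    # "|| true" keeps a zero-match result from being treated as an error.
    passed=$(grep -c "TEST_PASSED" "$log_file" || true)
    failed=$(grep -c "TEST_FAILED" "$log_file" || true)
    echo "passed=${passed} failed=${failed}"
    [ "$failed" -eq 0 ]
}
```

A possible use is check_test_log test.log && echo "safe to re-tag as Prod", so the re-tag step is only reached when no test has failed.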

After doing some research, I found that I had previously requested a newer version of an AMI (an Amazon Machine Image).

Although the entire "setup" and "maintain" playbooks ran flawlessly (with no errors), what I did not know was the newer AMI was missing a critical Operating System package that my environment needed.

I modified the "maintain" playbook to include the missing package, re-ran the "maintain" playbook and then re-ran the Test playbook. Everything passed. Now I know the Staging and Prod machines will always have this package when the "maintain" playbook runs its continuous loop.

The new Jenkins Server was tagged as "Prod" and then the previous server was deleted from AWS. The transition was painless.

By taking this approach and adding each new check to the test playbook as soon as a problem becomes evident, I ensure that nothing reaches production without first being checked against every problem I already know about.

I will no longer have this issue, or a related one caused by this missing package, again. If an image already contains the package, no problem: the task will simply pass. Ansible does not re-install a package that is already present (unless the package state is set to "latest").

Brief History of the Test/Maintain/Govern Loop

The purpose of creating the Test/Maintain/Govern Loop for Playbooks was to show a Test-First approach to infrastructure delivery, and in doing so make the transition to Infrastructure as Code easier to get accustomed to.

The approach uses knowledge taken from years of insight from the software development world in delivery of complicated environments and applies it to the Infrastructure as Code domain.  

Technical Notes:

Jenkins CI server running in Production on AWS.  

Ansible Playbooks are used to Setup, Maintain and Test server(s) in both Staging and Production (this is how my environment works for build servers TODAY).

In AWS, tags are used to determine if a machine is "in production" or "in staging". They are both live in AWS in the same VPC (A VPC is like a private IP range within AWS for my hosts to reach each other).

Playbooks are written in YAML (a human-readable data format), with Dev/Staging/Prod handled in the same playbook.

A unique matching approach allows the same Playbook to run many times in Dev and Staging. This helps to ensure that by the time the Playbook runs on the Production machine, it has already executed many times (and been confirmed correct).

An often-missed feature of playbooks is that "if" statements (Ansible's "when" conditions) can be used to determine whether parts of a playbook are executed. For example, a task can be set to run only IF the host is in a certain environment.

When I want to upgrade my Jenkins server or configure a new one, my approach is: "Build a new one, run setup, maintain and test on it, and IF everything is OK, move it into the Production tag and then disable the older server." This allows me to ensure all is well before activating a new production change.

Think of the saying ....  

"All Servers Are Temporary"

If you feel that your organization could benefit from learning about a Test First approach to Infrastructure, please feel free to reach out to me.  I provide 1/2 day or full day sessions in the Toronto area or full-day sessions plus expenses anywhere else worldwide. 

A link to the original presentation can be found here.....

If you are so inclined, here's a link to the root repository... 

A sample "test" (also used for governance) playbook is located here...