Graphite threshold alerting with Sensu

Instrumenting your code to report application-level metrics is one of the most powerful monitoring tasks you can accomplish, and it is incredibly satisfying to get working the first time.  Being able to see how your application performs at a granular level helps identify potential issues and bottlenecks, and it also gives you a broader understanding of how people are actually interacting with the application.  Everybody loves having these kinds of metrics to talk about their apps and products, so this style of monitoring is a great win for the whole team.

I don’t want to dive into the specifics of WHAT you should monitor here; that will be unique to every environment.  Instead of covering the what and how of instrumenting code to report specific metrics, I will run through an example of what the process might look like for building a check and alarm for monitoring and alerting purposes at the operations level.  I am not a developer, so I don’t spend a lot of time thinking about which things are important to collect metrics on.  My job is usually to figure out how to monitor and alert effectively based on the metrics the developers come up with.

Sensu has a great plugin for checking Graphite thresholds in its plugin repo.  If you haven’t looked already, take a minute to glance over the options and see how the plugin works.  It is a pretty simple plugin, but it has been able to do everything I need it to.

One common monitoring task is to check how long requests are taking.  So in this example, we are querying the Graphite server and reporting a critical status (exit status 2) if the average request response time is more than 7 seconds.

Here is the command you would run manually to check this threshold.  Make sure to download the script if you haven’t already; you can copy the code directly or clone the repo if you are doing this manually.  If you are using Sensu you can use the sensu_plugin LWRP to grab the script (more on that below).

./check-data.rb -s <servername:port> -t '<graphite query>' -c 7000 -u user -p password
./check-data.rb -s graphite.example.com -t "alias(stats.timers.server.response_time.mean, 'Mean')" -c 7000 -u myuser -p awesomepassword

There are a few things to note.  The -s flag specifies which Graphite server or endpoint to hit, -t specifies the target (the Graphite query to run the check against), -c sets the critical threshold, and -u and -p are used if your Graphite server requires authentication.  If your Graphite instance is publicly reachable it should probably sit behind auth; if it is internal only, that is less important.  Obviously these are just dummy values, included to give you a better idea of what a real command looks like, so use your own values in their place.

The query we’re running is against a statsd metric for the mean response time of a request, which gets recorded from the code (this is the "developer instrumenting their code" part I mentioned).  This check is specific to my environment, so you will need to modify your queries to make sure you alert on a useful metric and threshold in your own environment.

Here’s an example of what the Graphite graph (rendered in Grafana) looks like.

[Figure: Graphite graph of mean response time, rendered in Grafana]

Obviously this is just a sample but it should give you the general idea of what to look for.

If you examine the script, there are a few Ruby gem requirements to get it to run, which you will need to install if you haven’t already: sensu-plugin, json, json-uri and openssl.  You don’t need sensu-plugin if you are just running the check manually, but you WILL need it installed on the Sensu client that will be running the scheduled check.  That can be done manually or with the Sensu Chef recipe (specifically by turning on Sensu’s embedded Ruby and Ruby gems), which I recommend using anyway if you plan on doing any type of deployment at scale with Sensu.
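If you do end up installing the gem by hand on a client, a minimal sketch looks like the following.  It assumes the standard omnibus location for Sensu’s embedded Ruby; adjust the path if your install differs.

# Path to the embedded Ruby is an assumption based on the default Sensu omnibus package
sudo /opt/sensu/embedded/bin/gem install sensu-plugin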

Here is what the Chef code looks like if you use Sensu to deploy this check automatically.

sensu_check "check_request_time" do 
  command "#{node['sensu']['plugindir']}/check-data.rb -s graphite.example.com -t \"alias(stats.timers.server.facedetection.response_time.mean, 'Mean')\" -c 7000 -a 360 -u myuser -p awesomepassword"
  handlers ["pagerduty", "slack"] 
  subscribers ["core"] 
  interval 60 
  standalone true 
  additional(:notification => "Request time above threshold", :occurrences => 5)
end

This should look familiar if you have any background using the Sensu Chef cookbook.  We are using the sensu_check LWRP to execute the script with the parameters we want and to send the results through the pagerduty and slack handlers, which are just fancy ways to pipe out the results of the check.  We are also saying we want to run this on a 60 second interval as a standalone check, which means the check is defined and scheduled on the client node itself rather than published from the Sensu server.  Finally, we are saying that after 5 failed checks we want to append a message to the handler that says exactly what is going wrong.

You can stick this logic into an existing recipe or create a new one that handles your metric threshold checks.  I’m not sure what the best practice is for where to put the check, but I have a recipe that runs standalone threshold checks and adding this logic there has worked fine.  Once the logic has been added, run chef-client on the node for the new check to get picked up.


Patching CVE-2014-6271 (Shellshock) with Chef

If you haven’t heard the news yet, a recently disclosed vulnerability exploits the handling of environment variables in bash.  This has some far-reaching implications because bash is so widespread and runs on many different types of devices: network gear, routers, switches, firewalls, etc.  If that doesn’t scare you then you probably don’t need to finish reading this article.  For more information you can check out this article that helped to break the story.

I have been seeing a lot of “OMG the world is on fire, patch patch patch!” posts and sentiment surrounding this recently disclosed vulnerability, but I have basically not seen anybody take the time to explain how to patch and fix the issue.  It is not a difficult fix, but it might not be obvious to the more casual user or to those who do not have a sysadmin or security background.

Debian/Ubuntu:

Use the following commands to check which bash package release you have installed.  You can check the Ubuntu USN for the patched versions.

dpkg -l bash | grep '^ii'
dpkg-query --show bash

If you are on Ubuntu 12.04 you will need to update to the following version:

bash    4.2-2ubuntu2.3

If you are on Ubuntu 13.10 and have this package (or an older one), you are vulnerable.  Update to 14.04!

bash 4.2-5ubuntu3

If you are on Ubuntu 14.04, be sure to update to the most recent patched version:

bash 4.3-7ubuntu1.3

Luckily, the update process is pretty straightforward.

apt-get update
apt-get --only-upgrade install bash
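Once the upgrade finishes, a quick sanity check is the widely circulated Shellshock test one-liner.  On a patched bash it should print only “this is a test”; on a vulnerable bash it will also print “vulnerable”.

env x='() { :;}; echo vulnerable' bash -c "echo this is a test"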

If you have the luxury of managing your environment with some sort of automation or configuration management tool (get this in place if you don’t have it already!) then this process can be managed quite efficiently.

It is easy to check if a server that is being managed by Chef has the vulnerability by using knife search:

knife search node 'bash_shellshock_vulnerable:true'

From here you could create a recipe to patch the servers or fix each one by hand.  Another cool trick is that you can blast out the update to Debian and RHEL based servers with the following command:

knife ssh 'platform_family:debian' 'sudo apt-get update; sudo apt-get install -y bash'; knife ssh 'platform_family:redhat' 'sudo yum -y install bash'

This will iterate over every server in your Chef environment that is in the Debian family (including Ubuntu) or the RHEL family (including CentOS), refresh the package lists, and update bash so that the latest patched version gets pulled down and installed.

You may need to tweak the syntax a little: -x overrides the ssh user and -i feeds in an identity file.  This is so much faster than manually installing the update on all your servers, or even fiddling around with a tool like Fabric, which is still better than nothing.
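For example, a run against the Debian family with an explicit SSH user and key might look something like this (the user name and key path are placeholders, so swap in your own):

# "ubuntu" and ~/.ssh/id_rsa are placeholder values
knife ssh 'platform_family:debian' 'sudo apt-get update; sudo apt-get install -y bash' -x ubuntu -i ~/.ssh/id_rsa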

One caveat to note:  if you are not on an LTS version of Ubuntu, you will need to upgrade your server(s) to an LTS release first, either 12.04 or 14.04, to qualify for this patch.  Ubuntu 13.10 went out of support in August, about a month before the time of this writing, so you will want to get your OS up to date.

One more thing:  the early patches for this vulnerability did not entirely fix the issue, so make sure you have the correct patch installed.  If you patched right away there is a good chance you are still vulnerable, so simply rerun the knife ssh command to reapply the newest patch now that the dust is beginning to settle.

Outside of this vulnerability, it is a good idea to get onto an Ubuntu LTS release anyway, so that you continue receiving critical software updates and security patches for a longer duration than the normal six-month release cycle of the distribution.



Test Kitchen Tricks

I have been working a lot with Chef and Test Kitchen lately and have picked up a few interesting tricks along the way.  Test Kitchen is one of my favorite tools for working with Chef configuration management because it is very easy to use and has a number of powerful features that make testing things in Chef simple.

Test Kitchen itself sits on top of Vagrant and VirtualBox by default, so for the most basic usage of Test Kitchen you will need to have Vagrant installed along with a few other items.

Then install Test Kitchen:

gem install test-kitchen

That’s pretty much it.  The official docs have some pretty detailed usage and in fact I have learned many of the tricks that I will be writing about today from the docs.

Once you are comfortable with Test Kitchen you can begin leveraging some of the more powerful features, which is what the remainder of this post will cover.  There is a great talk from this year’s ChefConf by the creator of Test Kitchen about some of the lessons learned and cool things you can do with the tool.  If you haven’t already seen it, it is worth a watch.

Anyway, let’s get started.

1) Fuzzy matching

This one is great for the lazy people out there.  Test Kitchen matches instance names with regular expressions, so instead of typing out a full instance name you can type a unique fragment of it and Test Kitchen will figure out which instance (or instances) you mean.  Since it is real regular expression matching, this can be a very powerful feature.
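For example, assuming an instance named default-ubuntu-1404 (the name here is just for illustration), any of the following would target it:

# Any unique fragment or regular expression matches the instance name
kitchen converge 1404
kitchen test ubuntu
kitchen login default-ubuntu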

2) Custom drivers

One reason that Test Kitchen is so flexible is that it can leverage many different plugins and drivers.  And, since it is open source, if there is functionality missing from a driver you can simply write your own.  Currently there is an awesome list of drivers available for Test Kitchen to use, with a wide variety of options to suit most testing scenarios.

Of course, there are plenty of drivers beyond the ones I have personally tried and can vouch for.  There is even support for testing alternate configuration management tools, which can be handy for those who are not using Chef specifically; for example, there is a Salt driver available.
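Driver plugins are distributed as Ruby gems, so trying one out is usually a one-line install; kitchen-docker is used here only as an example of a driver gem:

# kitchen-docker is just one example of an installable driver plugin
gem install kitchen-docker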

3) .kitchen.local.yml

This is a handy little feature that is often overlooked but allows a nice amount of control by overriding the default .kitchen.yml configuration with machine-specific options.  For example, if you are using the ec2 driver in your main configuration but need to test locally with Vagrant, you can simply drop a .kitchen.local.yml on your dev machine and override the driver (and any other settings you need to change).  I created the following .kitchen.local.yml for testing on local Vagrant boxes using 32-bit Ubuntu to highlight the override capabilities of Test Kitchen.

driver:
  name: vagrant

platforms:
  - name: ubuntu-1310-i386
  - name: ubuntu-1404-i386

4) Kitchen diagnose

An awesome tool for diagnosing issues with Test Kitchen.  Running the diagnose will give you lots of juicy info about what your test machines are doing (or should be doing) and a ton of configuration information about them.  Basically, if something is misbehaving this is the first place you should look for clues.

If you want to dump the info and settings for all of your configurations, just run the following:

kitchen diagnose
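The output can get pretty long, so you can also narrow it down with the same fuzzy matching described above, or ask for everything including loader and plugin details; the instance name below is hypothetical:

# Diagnose a single (hypothetical) instance, or include all diagnostic detail
kitchen diagnose 1404
kitchen diagnose --all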

5) Concurrency

If you have a large number of systems to run tests on, running your Test Kitchen tests in parallel is a great way to reduce total testing time.  Turning on concurrency is straightforward: just add the “-c” flag, optionally followed by the number of instances to run at once (the default is 9999).

kitchen converge -c 5

6) Verbose logging

This one can be helpful if your kitchen run is failing with no real clues or helpful information from the diagnose command.  It seems obvious, but getting it to work gave me some trouble initially.  To turn on verbosity, simply add the debug log level to your Test Kitchen command.

So for example, if you want to converge a node with verbosity turned on, you would use this command.

kitchen converge -l debug
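Depending on your version of Test Kitchen, the log level can also be set through the KITCHEN_LOG environment variable, which is handy in CI jobs where you don’t control the flags directly (treat this as an assumption and check the docs for your version):

# Assumes KITCHEN_LOG is honored by your Test Kitchen version
KITCHEN_LOG=debug kitchen converge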

I recommend taking a look at some or all of these tricks to help improve your integration testing with Test Kitchen.  Of course, as I stated, all of this is pretty well documented.  Even if you are already familiar with this tool, sometimes it just helps to have a refresher to remind you of a great tool and jog your memory.  Let me know if you have any other handy tricks and I will be sure to post them here.


Chef data bags with Test Kitchen

As a step towards integrating your Chef cookbooks with Jenkins CI and your testing/release pipeline, it is important to make sure that local changes pass unit and integration tests before being accepted and committed into version control.  For example, when running Test Kitchen you need to fully simulate the data bags and encrypted data bags on the local box for many tests to pass correctly.  So today I would like to focus on a stumbling block towards Jenkins and integration testing that I ran into recently.  There are a few lessons I learned along the way that I would like to share, because there wasn’t much good information out there on how to do this.

First, I need to give credit where it is due.  This post was a great resource in my journey to find a solution to my test kitchen data bag issue.

The largest roadblock I found along the way was that the version of Test Kitchen I was using shipped with chef-solo as the primary provisioner.  There has been a lot of discussion around this topic lately, and (from what I understand) the general consensus within the Chef community is that chef-solo should be replaced by chef-zero.  There are a number of advantages to using chef-zero instead of chef-solo, including one I learned the hard way: chef-zero can act as a standalone, in-memory Chef server, which unlocks the ability to serve data bags and encrypted data bags without any sort of wacky hacking to get Chef to compile and converge correctly.
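As an aside, if you want to poke at chef-zero outside of Test Kitchen, the gem can be run on its own as a throwaway in-memory Chef server; a quick sketch (the port is just an example):

# Run a local in-memory Chef server on an example port
gem install chef-zero
chef-zero --port 8889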

There was a good post written recently that expounds on the benefits of using chef-zero instead of chef-solo.  It is here, and is definitely worth the read if you are interested in learning more.

So with that knowledge in mind, here is what a newly updated sample .kitchen.yml file might look like:

---
driver:
  name: vagrant

provisioner:
  name: chef_zero

platforms:
  - name: ubuntu-13.10-i386
  - name: centos-6.4-i386

suites:
  - name: default
    data_bags_path: "test/integration/data_bags"
    run_list:
      - recipe[recipe-to-test]
    attributes:

It’s a pretty straightforward config.  The biggest change you will notice is that the provisioner has been changed from chef-solo to chef-zero, and I now know that it makes all the difference in the world.  The next change to observe is the data_bags_path in the suites section.  This bit of configuration tells the provisioner to look at the specified path when chef-zero spins up and to serve from it the data bags, encrypted data bags and other information that would normally live on the Chef server that clients talk to.

So inside the test/integration/data_bags directory I have a subdirectory for the data bag and a JSON file for the item I am interested in, sensu/ssl.json.  This file contains essentially the same information that is stored on the Chef server about the SSL certificates used for live hosts in the production environment, just mirrored into a sandbox/integration testing environment.
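Creating that layout from the cookbook root is just a couple of commands (the sensu/ssl.json names come straight from the example above; substitute your own data bag and item names):

# Data bag directory and item file used in this example
mkdir -p test/integration/data_bags/sensu
touch test/integration/data_bags/sensu/ssl.json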

If you’re interested, here is a sample of what the ssl.json file might look like:

{
  "id": "ssl",
  "server": {
    "key": "-----BEGIN RSA PRIVATE KEY-----\n...",
    "cert": "-----BEGIN CERTIFICATE-----\n...",
    "cacert": "-----BEGIN CERTIFICATE-----\n..."
  },
  "client": {
    "key": "-----BEGIN RSA PRIVATE KEY-----\n...",
    "cert": "-----BEGIN CERTIFICATE-----\n..."
  }
}

Note that the “id” is “ssl”.  As far as I know, the file name must match the id when you are creating this JSON file.
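For comparison, this is the same kind of file you would load onto a real Chef server with knife; a sketch, assuming a data bag named sensu has already been created on the server:

# Assumes the data bag "sensu" already exists on the Chef server
knife data bag from file sensu test/integration/data_bags/sensu/ssl.json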

Now you should be able to create and converge your test recipe with Test Kitchen:

kitchen create ubuntu
kitchen converge ubuntu

If you have any difficulty, let me know.  I tried to be thorough in this write up but could have accidentally skipped something important.  The main takeaways, though, are 1) use chef-zero wherever possible and 2) make sure your data bag paths and files are created correctly and referenced correctly in your .kitchen.yml file.  Finally, if you are still having issues, triple check the spelling and JSON syntax of your paths and configs.
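One quick way to rule out JSON syntax problems is to run the item file through a parser before digging any deeper, for example:

# Prints the parsed JSON, or an error pointing at the broken line
python -m json.tool test/integration/data_bags/sensu/ssl.json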


Leveraging Nagios Plugins with Chef and Sensu

Setting up Nagios plugins to run in a Sensu and Chef managed environment is straightforward. For example, I have recently been interested in monitoring SSL certificate expiration dates, and it just so happens that the Nagios check_http plugin does exactly what I’m looking for.

The integration between Sensu and the Nagios plugins is very nice.  For convenience in our Sensu environment, we like to put the Nagios plugins on all of the systems we monitor, because the footprint is negligible and it gives us the flexibility to monitor additional services should something get added to a server in the future that we hadn’t anticipated.  For the small amount of effort it takes to get the checks onto a server and working, adding the Nagios plugins is totally worth it.

The first step is to add the Nagios plugins to your Chef recipe.  I am using a generic Chef recipe for my Sensu clients that takes care of some of the more tedious tasks, including downloading the appropriate scripts and checks for the clients to run, as well as some other dependencies and items that Sensu likes to have.  Luckily there is a public Debian package for the Nagios plugins, so it is easy to add them.  Just add this snippet to your Chef recipe for Sensu clients:

apt_package "nagios-plugins" do 
 action :install 
end

After you run your next chef-client job you will have access to a variety of checks provided by the Nagios plugins package as illustrated below.

[Figure: listing of the Nagios plugin checks installed on a client]
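If you want to see that list on a client yourself, the Debian package drops its checks into /usr/lib/nagios/plugins (the same path used in the commands below), so a simple listing shows everything that was installed:

ls /usr/lib/nagios/plugins/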

There are a number of examples available but to run the check_http for cert expiration by hand you can run this command:

/usr/lib/nagios/plugins/check_http -H <sitename> -C 30,10

Where <sitename> is the hostname of the site whose certificate you would like to check, and -C 30,10 warns when the certificate expires within 30 days and goes critical within 10 days.  Now that we are able to run this check manually, go ahead and roll it into your Chef recipe for Sensu.  An example might look similar to the following:

sensu_check "check_web" do 
  command "/usr/lib/nagios/plugins/check_http -H localhost -C 30,10" 
  handlers ["pagerduty", "slack"] 
  subscribers ["core"] 
  interval 60
  standalone true
  additional(:notification => "Certificate will expire soon", :occurrences => 5) 
end

You may not want to run this check on every host, so it may be a good idea to run it as a standalone check.  It is simple enough to add the snippet to any recipe and tack the “standalone true” attribute onto the sensu_check resource, as shown in the example above.

Adding the Nagios plugins gives you a very nice set of extra tools for your monitoring arsenal for not much effort.  You never know when one of them might come in handy, so I suggest taking a look and trying them out.
