Composing a Graphite server with Docker

May 11, 2015August 31, 2015 Josh Reichardt 2 Comments

Recently our Graphite server needed to be overhauled, which I was not looking forward to. Luckily Docker makes the process of building identical and reproducible images for configuring a new server much easier and painless than other methods.

Introduction

If you don’t know what Graphite is you can check out the documentation for more info. Basically it is a tool to collect and aggregate metrics of pretty much any kind, in to a central location. It is a great complement to something like statsd for metric collection and aggregation, which I will go over later.

The setup I will be describing today leverages a handful of components to work. The first and most important part is Graphite. This includes all of the parts that make up Graphite, including the carbon aggregator and carbon cache for the collection and processing of metrics as well as the whisper db for storing metrics.

There are several other alternative backends but I don’t have any experience with them so won’t be posting any details. If you are interested, InfluxDB and OpenTSDB both look like interesting alternative backends to whisper for storing metrics.

The Problem

Graphite is known to be notoriously difficult to install and configure properly. If you haven’t tried to set up Graphite before, give it a try.

Another argument that I hear quite a bit is that the Graphite workload doesn’t really fit in with the Docker model. In a distributed or highly available architecture that might be the case but in the example I cove here, we are taking a different approach.

The design and implementation separates data on to an EBS volume which is a durable storage resource, so it doesn’t matter if the server were to have problems. With our approach and process we can reprovision the server and have everything up and running in less than 5 minutes.

The benefit of doing it this way is obvious. Another benefit of our approach is that we are levering the graphite-api package so that we can have access to all of the Graphite goodness without having to run all of the other bloats and then proxying it through ngingx/wsgi which helps with performance. I will go over this set up in a little bit. No Graphite server would be complete if it didn’t leverage Grafana, which turns out to be stupidly easy using the Docker approach.

If we were ever to try to expand this architecture I think a distributed model using EFS (currently in preview) along with some type of load balancer in front to distribute requests evenly may be a possibility. If you have experience running Graphite across many nodes I would love to hear what you are doing.

The Solution

There are a few components to our architecture. The first is a tool I have been writing about recently called Terraform. We use this with some custom scripting to build the server, configure it and attach our Graphite data volume to the server.

Here is what a sample terraform config might look like to provision the server with the tools we want. This server is provisioned to an AWS environment and leverages a number of variables. You can check the docs on how variables work or if there is too much confusion I can post an example.

provider "aws" {
  access_key = "${var.access_key}"
  secret_key = "${var.secret_key}"
  region = "${var.region}"
}

resource "aws_instance" "graphite" {
  ami = "${lookup(var.amis, var.region)}"
  availability_zone = "us-east-1e"
  instance_type = "c3.xlarge"
  subnet_id = "${var.public-1e}"
  security_groups = ["${var.graphite}"]
  key_name = "XXX"
  user_data = "${file("../cloud-config/graphite.yml")}"

  root_block_device = {
    volume_type = "gp2"
    volume_size = "20"
  }

  connection {
    user = "username"
    key_file = "${var.key_path}"
  }

 # mount EBS
  provisioner "local-exec" {
     command = "aws ec2 attach-volume --region=us-east-1 --volume-id=${var.graphite_data_vol} --instance-id=${aws_instance.graphite.id} --device=/dev/xvdf"
  }

  provisioner "remote-exec" {
    inline = [
    "while [ ! -e /dev/xvdf ]; do sleep 1; done",
    "echo '/dev/xvdf /data ext4 defaults 0 0' | sudo tee -a /etc/fstab",
    "sudo mkdir /data && sudo mount -t ext4 /dev/xvdf /data"
  ]
 }

}

And optionally if you have an Elastic IP to use you can tack that on to your config

resource "aws_eip" "graphite" {
  instance = "${aws_instance.graphite.id}"
  vpc = true
}

The graphite server uses a mostly standard config and installs a few of the components that we need to run the server, docker, python, pip, docker-compose, etc. Here is what a sample cloud config for the Graphite server might look like.

#cloud-config

# Make sure OS is up to date
apt_update: true
apt_upgrade: true
disable_root: true

# Connect to private repo
write_files:
 - path: /home/<user>/.dockercfg
 owner: user:group
 permissions: 0755
 content: |
 {
   "https://index.docker.io/v1/": {
   "auth": "XXX",
   "email": "email"
 }
 }

# Capture all subprocess output for troubleshooting cloud-init issues
output: {all: '| tee -a /var/log/cloud-init-output.log'}

packages:
 - python-dev
 - python-pip

# Install latest Docker version
runcmd:
 - apt-get -y install linux-image-extra-$(uname -r)
 - curl -sSL https://get.docker.com/ubuntu/ | sudo sh
 - usermod -a -G docker <user>
 - sg docker
 - sudo pip install -U docker-compose

# Reboot for changes to take
power_state:
 mode: reboot
 delay: "+1"

ssh_authorized_keys:
 - <put your ssh public key here>

Docker

This is where most of the magi happens. As noted above, we are using Docker and a few of its tools to get everything working. All the logic to get Graphite running is contained in the Dockerfile, which will require some customizing but is similar to the following.

# Building from Ubuntu base
FROM ubuntu:14.04.2

# This suppresses a bunch of annoying warnings from debconf
ENV DEBIAN_FRONTEND noninteractive

# Install all system dependencies
RUN \
 apt-get -qq install -y software-properties-common && \
 add-apt-repository -y ppa:chris-lea/node.js && \
 apt-get -qq update -y && \
 apt-get -qq install -y build-essential curl \
 # Graphite dependencies
 python-dev libcairo2-dev libffi-dev python-pip \
 # Supervisor
 supervisor \
 # nginx + uWSGI
 nginx uwsgi-plugin-python \
 # StatsD
 nodejs

# Install StatsD
RUN \
 mkdir -p /opt && \
 cd /opt && \
 curl -sLo statsd.tar.gz https://github.com/etsy/statsd/archive/v0.7.2.tar.gz && \
 tar -xzf statsd.tar.gz && \
 mv statsd-0.7.2 statsd

# Install Python packages for Graphite
RUN pip install graphite-api[sentry] whisper carbon

# Optional install graphite-api caching
# http://graphite-api.readthedocs.org/en/latest/installation.html#extra-dependencies
# RUN pip install -y graphite-api[cache]

# Configuration
# Graphite configs
ADD carbon.conf /opt/graphite/conf/carbon.conf
ADD storage-schemas.conf /opt/graphite/conf/storage-schemas.conf
ADD storage-aggregation.conf /opt/graphite/conf/storage-aggregation.conf
# Supervisord
ADD supervisord.conf /etc/supervisor/conf.d/supervisord.conf
# StatsD
ADD statsd_config.js /etc/statsd/config.js
# Graphite API
ADD graphite-api.yaml /etc/graphite-api.yaml
# uwsgi
ADD uwsgi.conf /etc/uwsgi.conf
# nginx
ADD nginx.conf /etc/nginx/nginx.conf
ADD basic_auth /etc/nginx/basic_auth

# nginx
EXPOSE 80 \
# graphite-api
8080 \
# Carbon line receiver
2003 \
# Carbon pickle receiver
2004 \
# Carbon cache query
7002 \
# StatsD UDP
8125 \
# StatsD Admin
8126

# Launch stack
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/supervisord.conf"]

The other component we need is Grafana, which we don’t actually build but pull from the Dockerhub registry and inject our custom volume to. This is all captured in our docker-compose.yml file listed below.

graphite:
  build: ./docker-graphite
  restart: always
  ports:
    - "8080:80"
    - "8125:8125/udp"
    - "8126:8126"
    - "2003:2003"
    - "2004:2004"
  volumes:
    - "/data/graphite:/opt/graphite/storage/whisper"

grafana:
  image: grafana/grafana
  restart: always
  ports:
    - "80:3000"
  volumes:
    - "/data/grafana:/var/lib/grafana"
  links:
    - graphite
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=password123

We have open sourced our configuration and placed it on github so you can take a look at it to get a better idea of the configs and how everything is working with some working examples. The github repo is a quick way to try out the stack without having to provision and build an environment to run this on. If you are just interested in kicking the tires I suggest starting with the github repo.

The build directive above corresponds to the repo on github.

The last components is actually running the Docker containers. As you can see we use docker-compose but we also need a way to start the containers automatically after a disruption like a reboot or something. That is actually pretty easy. On an Ubuntu (or system using upstart) you can create an init script to start up docker-compose or restart it automatically if it has problems. Here I have created a file called /etc/init/graphite.conf with the following configuraiton.

description "Graphite"
start on filesystem and started docker
stop on runlevel [!2345]
respawn
chdir /home/user
exec docker-compose up

A systemd service would achieve a similar goal but the version of Ubuntu used here doesn’t leverage systemd.

After everything has been dropped in place and configured you can check your work by testing out Grafana by hitting the public IP address of your server. If you hit the Grafana splash page everything should be working!

Conclusion

There are many pieces to this puzzle and honestly we don’t have the requirement of having Graphite be 100% available and redundant so we can get away with a single server for our needs. A separate EBS volume and Terraform allow us to rebuild the server quickly and automatically if something were to happen to the server. Also, the way we have designed Graphite to run will be able to handle a substantial workload without falling over. But if you are doing anything cool with Graphite HA or resiliency I would like to hear how you are doing it, there is always room for improvement.

If you are just interested in trying out the Graphite stack I highly suggest going over to the github repo and running the container stack to play around with the components, especially if you are interested in learning about how statsd and graphite collect metrics. The Grafana interface give you a nice way to tap in to the metrics that get pumped in to Graphite.

Uchiwa dashboard for Sensu

August 12, 2014August 31, 2015 Josh Reichardt 9 Comments

Recently the new Uchiwa dashboard redesign for Sensu was released, and it is awesome. It’s hard to describe how much of a leap forward this most recent release is, but it finally feels like Sensu is as “complete” and polished product as other open source and commercial products that exist. And if you haven’t heard of Sensu yet you are missing out. As described on the website sensuapp.org. Sensu is an open source monitoring framework. Instead of the traditional monolithic type of monitoring solutions (cough Nagios cough) that typically come to mind, the design of Sensu allows for a more more scalable and distributed approach to monitoring which hasn’t really been done before and offers a number of benefits, including a variety of dashboards to choose from.

Sensu touts itself as a “monitoring router”, which is a much more intuitive approach to monitoring once your wrap your head around the concept and leave the monolithic idea alone. For example, you can plug in different components to your monitoring solution very easily with Sensu, and you aren’t tied to one solution. If you need graphing and analytics you can choose from any number of existing solutions, Graphite, hosted Graphite, DataDog, NewRelic, etc. and more importantly, if something isn’t working as well as you’d like you can simply rip it out the component that isn’t working in favor of something that fits your needs better. Meaning it adds flexibility. no more hammering square blocks in to round holes. Sensu also offers nice scalability features, since all of the pieces are loosely coupled you don’t need to worry about scaling the entire beast, you can pick and choose which pieces to scale and when. Sensu itself is also scalable. Since the backbone of Sensu relies on RabbitMQ (soon to be opened up to other message queueing services), the busier it gets, simply cluster or add nodes to your RabbitMQ cluster. Granted, RabbitMQ isn’t exactly the easiest thing to scale, but it is possible.

With its distributed nature, Sensu by default is just a monitor. In the beginning, that meant either writing your own dashboard to communicate with Sensu server or using the default dashboard. As the ecosystem has evolved, the default dashboard has not been able to keep up with the evolution of Sensu and the needs of those using it.

Traditionally in the monitoring world, if you are not familiar, design and usability have not exactly been high priorities with regards to dashboards, graphics and GUI’s in the majority of tools that exist. Although that fact is changing somewhat with some of the newer cloud tools like DataDog and NewRelic, the only problem is that those solution are commercial and can become expensive. The bane of the open source solutions, at least for me, is how ugly the dashboards and user experiences have been (the Sensu default dashboard was an exception). But, the latest release of Uchiwa for Sensu has really changed the game in my opinion. It is much more modern and elegant.

We have gone from this:

To this:

Which one would you rather use? It is much easier to use and is much more elegant. The main dashboard (pictured above) gives a nice 1,000 ft view of what is going on in your environment. It is easy to quickly check the dashboard for any issues going on in your environment.

In addition to the home view, there is a nice checks view to get a glimpse of pretty much everything that’s going on in your environment. Sometimes with a large number of checks it is very easy to forget what exactly is happening so this is a nice way to double check.

As well, there is another similar view for checking clients. One small but very nice piece of info here is that it will display the Sensu client version for each host. If there are any issues with a host it is easy to tell from here.

You can also drill down in to any of these hosts to get a better picture of what exactly is going on. It will show you exactly which checks are being run for the host as well as some other very hand information.

From this page you can even select an individual check and see exactly how it is set up and behaving. It is easy to silence a single alert of all alerts for a client. Just click on the sound icon in any context to silence or unsilence an alert or an entire client. This has been handy for minimizing alert spam when doing maintenance on specific hosts.

One last handy feature is the info page. From here you can check out some of the Sensu server info as well as Uchiwa settings. This is also good for troubleshooting.

That pretty much covers the highlights of the new UI. As I have said, I am very excited for this release because this is an awesome GUI and there are going to be some really interesting improvements and additions in the future for Uchiwa which will make it an even stronger and more compelling reason to make the switch to Sensu and Uchiwa if you haven’t already.

If you have direct questions about the post, you comment here. Otherwise, the best place to get help with most of this stuff is probably the #sensu channel on IRC. That’s where the majority of the project contributors hang out. You can check out the Uchiwa code as well if you’d like over on Github. If you ever have issues with the dashboard that is the place to go, I would suggest browsing through the issues and if you can’t find a solution then create a new issue. Don’t hesitate to jump in to any of the discussions either. The author is very friendly and helpful and is very quick to respond to issues. One final helpful resource is the Sensu docs. Make sure you are looking at the correct version of Sensu according to the documentation, there are still enough changes occurring that the docs still have some differences between them and can get new users fumbled up.

Chef data bags with Test Kitchen

June 30, 2014August 31, 2015 Josh Reichardt 8 Comments

As a step towards integrating your Chef cookbooks with Jenkins CI and your testing/release pipeline it is important to make sure that local changes pass unit and integration tests before being accepted and committed into version control. For example, when running test kitchen it is important to fully simulate what data bags and encrypted data bags are doing on a local box for many tests to pass correctly. So, today I would like to focus on a stumbling block towards Jenkins and integration testing that I ran in to recently. There are a few lessons that I learned along the way that I would like to share to help clarify things a little bit because there wasn’t much good info out there on how to do this.

First, I need to give credit where it is due. This post was a great resource in my journey to find a solution to my test kitchen data bag issue.

The largest roadblock I found along the way was that the version of test kitchen I was using was being shipped with chef-solo as the primary driver. There has been a lot of discussion around this topic lately and (from what I understand) has pretty much become the general consensus within the Chef community that chef-solo should be replaced by chef-zero. There are a number of advantages to using chef-zero instead of chef-solo, including a lesson I learned the hard way, which is that chef-zero has the ability to act as a stand alone Chef server – unlocking the ability to store data bags and encrypted data bags without having to do any sort of wacky hacking to get Chef to compile and converge correctly.

There was a good post written recently that expounds more on the benefits of using chef-zero instead of chef-solo. It is here, and is definitely worth the read if you are interested in learning more about the benefits of chef-zero.

So with that knowledge in mind, here is what a newly updated sample .kitchen.yml file might look like:

--- 
driver: 
 name: vagrant 
 
provisioner: 
 name: chef_zero 
 
platforms: 
 - name: ubuntu-13.10-i386 
 - name: centos-6.4-i386 
 
suites: 
 - name: default 
 data_bags_path: "test/integration/data_bags" 
 run_list: 
 - recipe[recipe-to-test] 
 attributes:

It’s a pretty straight forward config. The biggest change that you will notice in this config is that instead of using chef-solo as the provisioner it has been changed to chef-zero – I now know that it makes all the difference in the world. The next big change to observe is the data_bags_path in the suites section. This bit of configuration basically tells the Chef provisioner to go look at the specified file path when chef-zero spins up and use that to store data bag, encrypted data bag or other information that potentially would live on the Chef server that client’s would use.

So in the test/integration/data_bags directory I have a directory and json file inside that directory for the specific data I am interested in, called sensu/ssl.json. This file essentially contains the same information that is stored on the Chef server about the ssl certificates used for live hosts in the production environment, just mirrored into a sandbox/integration testing environment.

If you’re interested, here is a sample of what the ssl.json file might look like:

{ 
 "id": "ssl", 
 "server": { 
 "key": "-----BEGIN RSA PRIVATE KEY-----gM
 "cert": "-----BEGIN CERTIFICATE-----gM
 "cacert": "-----BEGIN CERTIFICATE-----gM
 }, 
 "client": { 
 "key": "-----BEGIN RSA PRIVATE KEY-----gM
 "cert": "-----BEGIN CERTIFICATE-----gM
 } 
}

Note that the “id” is “ssl”. As far as I know the file name must match up to the id when you are creating this json file.

Now you should be able to create and converge your test recipe with test kitchen:

kitchen create ubuntu
kitchen converge ubuntu

If you have any difficulty, let me know. I tried to be thorough in this write up but could have accidentally skipped important information. The main keys or takeaways though should be 1) use chef-zero wherever possible and 2) make sure you have your data bag paths and files created correctly and referenced correctly in your .kitchen.yml file. Finally, if you are still having issues, make sure you have triple checked the spelling and json syntax of your paths and configs.

Leveraging Nagios Plugins with Chef and Sensu

June 20, 2014August 31, 2015 Josh Reichardt 7 Comments

Setting up Nagios plugins to run in a Sensu and Chef managed environment is straightforward and uncomplicated. For example, I recently have been interested in monitoring SSL certificate date expiration and it just so happens that the Nagios check_http plugin does exactly what I’m looking for.

The integration between Sensu and the Nagios plugins is very nice. For convenience in our Sensu environment, we like to put the additional Nagios plugins on to all of the systems we monitor because the footprint is negligible and it allows for some nice flexibility of services and checks to monitor should an additional service get added to a server in the future that we hadn’t anticipated. For the amount of effort it takes to get the checks onto the server and to get working, adding the Nagios plugins is totally worth the effort.

The first step is to add the Nagios plugins to your Chef recipe. I am using a generic Chef recipe for my Sensu clients that takes care of some of the more tedious tasks including downloading the appropriate scripts and checks for the clients to run as well as some other dependencies and items that Sensu likes to have. Luckily there is a public Debian package available for installing the Nagios plugins so it easy to add them. Just add this snippet into your Chef recipe for Sensu clients:

apt_package "nagios-plugins" do 
 action :install 
end

After you run your next chef-client job you will have access to a variety of checks provided by the Nagios plugins package as illustrated below.

There are a number of examples available but to run the check_http for cert expiration by hand you can run this command:

/usr/lib/nagios/plugins/check_http -H <sitename> -C 30,10

Where <sitename> is the URL of the website you would like to check. Now that we are able to run this check manually, go ahead and roll that in to your Chef recipe for Sensu. An example of this might look similar to the following:

sensu_check "check_web" do 
  command "/usr/lib/nagios/plugins/check_http -H localhost -C 30,10" 
  handlers ["pagerduty", "slack"] 
  subscribers ["core"] 
  interval 60
  standalone true
  additional(:notification => "Certificate will expire soon", :occurrences => 5) 
end

You may not want to run this check on every host so it may be a good idea to run this check as a stand alone check. It is simple enough to add this snippet in to any recipe and tack on the “standalone true” attribute to the sensu_check resource. I have an example of what this standalone attribute looks like in the example above for reference.

Adding in Nagios plugins gives you a very nice set of additional tools to add to your monitoring arsenal for not that much effort. You never know when something from the Nagios plugins might come in handy so I suggest you try them out. There are many other uses for the Nagios plugins so I suggest taking a look.

Introduction to Grafana

June 4, 2014August 31, 2015 Josh Reichardt 9 Comments

Grafana is a beautiful dashboard for displaying various Graphite metrics through a web browser. Grafana is nice because it is simple to set up and maintain and is easy to use and displays metrics in a very nice Kibana like display style. I would like to walk readers through the basics of this tool because although it is a very new project (beginning of 2014) it has an enormous amount of potential and I honestly believe that there is enough functionality to put it into production.

The guys over at Hosted Graphite have just recently released hosted Grafana as part of their standard plan. If you are interested you can check it out over at Hosted Graphite.

One very nice feature of this product offering is that you do not need to worry about any of the behind the scenes details or intricacies of how all of the Graphite components work together. You just click a few buttons and you are ready to go. Highly recommended if you just need something that works, go check it out.

Anyway, Grafana gives you the ability to bolt on all kinds of bells and whistles to your graphs. For example, there is a nice little node.js tool called statsd written by the guys over at etsy that sort of expands the capabilities of Graphite. And since it hooks right in with Graphite, it makes it very simple to represent and output the various metrics into Grafana.

If you do choose to roll your own Grafana solution, there are a few gotchas that I was not aware of that I’d like to cover. The first, which should seem easy enough is that you need to run but Graphite and Grafana simultaneously. So that means you either need to create two virtual hosts depending on which web server you choose to use or you need to to create to different port bindings. The Grafana documentation has a few examples of how to create new port binding, which uses Apache as the web server but it is pretty easy to find examples of configs on Github and other sites if you are interested in using nginx as your web server instead. The important thing to remember is that you will want to have two sites listed in your sites-enabled directory inside of the apache directory. One for Graphite (which I moved to port 8080 for simplicity) and one for Grafana (which I stuck on port 80).

If you choose to use authentication there are a few extra headers that need to be written as well so that Grafana is accessible correclty. This is easy to add to either your nginx or you apache config.

Here is the adapted Apache configuration necessary to add in authentication, where website name is the globally reachable DNS name of your webserver:

Header set Access-Control-Allow-Origin "http://WEBSITENAME"
Header set Access-Control-Allow-Methods "GET, OPTIONS"
Header set Access-Control-Allow-Headers "origin, authorization, accept"
Header set Access-Control-Allow-Credentials true

<Location />
  AuthName "graphs restricted"
  AuthType Basic
  AuthUserFile /etc/apache2/htpasswd
  <LimitExcept OPTIONS>
    require valid-user
  </LimitExcept>
</Location>

That is almost the entire configuration that I have for the Grafana webserver. You will need to download and install an Apache lib tool to generate a hash to use for you AuthUserFile, directions can be pieced together from this. The Graphite configuration is only a little bit more complex and was created by Chef, so I will go ahead and post that here as well:

<VirtualHost *:8080>
  ServerName SERVER
  ServerAdmin [email protected]
  DocumentRoot "/opt/graphite/webapp"
  ErrorLog /opt/graphite/storage/log/webapp/error.log
  CustomLog /opt/graphite/storage/log/webapp/access.log common

  Header set Access-Control-Allow-Origin "*"
  Header set Access-Control-Allow-Methods "GET, OPTIONS"
  Header set Access-Control-Allow-Headers "origin, authorization, accept"

  WSGIScriptAlias / /opt/graphite/conf/wsgi.py

  <Directory /opt/graphite>
  <Files wsgi.py>
    Require all granted
  </Files>
  </Directory>

  <Location "/content/">
    SetHandler None
  </Location>

  <Location "/media/">
    SetHandler None
  </Location>

 # NOTE: In order for the django admin site media to work you
 # must change @DJANGO_ROOT@ to be the path to your django
 # installation, which is probably something like:
 # /usr/lib/python2.6/site-packages/django
 Alias /media/ "@DJANGO_ROOT@/contrib/admin/media/"

</VirtualHost>

Again pretty straight forward. If you choose to add authentication, it is done in a similar way.

The only other step is to add your metrics into Graphite. I am using Sensu, which I have written a bit about and plan to write more about in the future. Reference some of those writings to get an idea for metric gather and if you have any question let me know. I will be writing a post in the near future about gathering and aggregating metrics with Sensu. For now I will just assume that you already have a way to collect metrics.

Once you have everything set up you should be able to start playing around with the Grafana GUI. After you have your configurations ironed out pretty much everything else is done through the GUI which is nice.

To illustrate the power of Grafana, here are a few example dashboards I have built recently:

If you want to check out the Grafana project you can find more information either on their Github page or their website. The docs page on the main website is a great resource as well as the IRC channel. In fact, the IRC channel is probably the most preferable place to go for help because it isn’t overcrowded and the creator of Grafana is in there quite a bit.