Add OpenStack instance meta-data info in your salt grains

During a work session on my salt-states, I wanted to be able to query the OpenStack cluster meta-data so that I could adjust my salt configuration more efficiently.

What are grains? Grains are structured data that describe what a minion has, such as which version of GNU/Linux it’s running, what its network adapters are, etc.

The following is a Python script that adds data to Salt Stack’s internal database, called grains.

I have to confess that I didn’t write the script; I adapted it to work within an OpenStack cluster, more precisely DreamHost’s DreamCompute cluster. The original script came from saltstack/salt-contrib, and the original file was meant to read data from EC2.

The original script wasn’t getting any data in the cluster, most likely because of API changes and because the EC2 API exposes dynamic meta-data that the DreamCompute/OpenStack cluster doesn’t.

In the end, I edited the file to make it work on DreamCompute and also truncated some data that the grains subsystem already has.

My original objective was to get a list of security-groups the VM was assigned. Unfortunately the API doesn’t give that information yet. Hopefully I’ll find a way to get that information some day.

Get OpenStack instance detail using Salt


salt-call grains.get dreamcompute:uuid

Or for another machine

salt app1 grains.get dreamcompute:uuid

What size did we create a particular VM?

salt app1 grains.get dreamcompute:instance_type

What data you can get

Here is a sample of the grain data that will be added to every salt minion you manage.

You might notice that some data will be repeated such as the ‘hostname’, but the rest can be very useful if you want to use the data within your configuration management.

            ssh-rsa rsa public key... [email protected]

What does the script do?

The script basically scrapes the OpenStack meta-data service and serializes the data it gets into the Salt Stack grains system.

OpenStack’s meta-data service is similar to what you’d get from AWS, but doesn’t expose exactly the same data. This is why I had to adapt the original script.

To get data from an instance you simply (really!) need to make an HTTP call to an internal IP address that OpenStack nova answers.
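
As a rough illustration of that scraping, here is a minimal Python 3 sketch. It is not the gist’s code (which uses Python 2’s httplib). It assumes the listing convention visible in the log further down: an entry ending with / is a sub-listing, anything else is a value. The names fetch and walk_metadata are made up for this example, and the meta-data host is left as a parameter you have to supply.

```python
import http.client


def fetch(host, path, port=80, timeout=1):
    """GET one path from the meta-data service; return the body text or None."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        if response.status != 200:
            return None
        return response.read().decode("utf-8")
    finally:
        conn.close()


def walk_metadata(path, get):
    """Recursively turn a meta-data listing into a nested dict.

    `get` is any callable that takes a path and returns text (or None),
    which keeps the walk testable without a real meta-data service.
    """
    listing = get(path)
    result = {}
    if listing is None:
        return result
    for entry in listing.splitlines():
        entry = entry.strip()
        if not entry:
            continue
        if entry.endswith("/"):
            # Assumed convention: a trailing slash marks a sub-listing.
            result[entry.rstrip("/")] = walk_metadata(path + entry, get)
        else:
            result[entry] = get(path + entry)
    return result


# On a real instance (METADATA_HOST is a placeholder you must define):
# data = walk_metadata("/latest/meta-data/", lambda p: fetch(METADATA_HOST, p))
```

Passing the fetch function in as an argument is only a convenience for this sketch: it lets you try the walk against a plain dict before pointing it at nova.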

For example, from an AWS/OpenStack VM, you can know the instance hostname by doing


To see what the script calls, you can add a line in the _call_aws(url) function like so

diff --git a/_grains/ b/_grains/
index 682235d..c3af659 100644
--- a/_grains/
+++ b/_grains/
@@ -25,6 +25,7 @@ def _call_aws(url):

     conn = httplib.HTTPConnection("", 80, timeout=1)
+    log.info('API call to ' + url)
     conn.request('GET', url)
     return conn.getresponse()

When you run saltutil.sync_all (i.e. refresh grains and other data), the log will tell you which endpoints it queried.

In my case they were:

[INFO    ] API call to /openstack/2012-08-10/meta_data.json
[INFO    ] API call to /latest/meta-data/
[INFO    ] API call to /latest/meta-data/block-device-mapping/
[INFO    ] API call to /latest/meta-data/block-device-mapping/ami
[INFO    ] API call to /latest/meta-data/block-device-mapping/ebs0
[INFO    ] API call to /latest/meta-data/block-device-mapping/ebs1
[INFO    ] API call to /latest/meta-data/block-device-mapping/root
[INFO    ] API call to /latest/meta-data/hostname
[INFO    ] API call to /latest/meta-data/instance-action
[INFO    ] API call to /latest/meta-data/instance-id
[INFO    ] API call to /latest/meta-data/instance-type
[INFO    ] API call to /latest/meta-data/local-ipv4
[INFO    ] API call to /latest/meta-data/placement/
[INFO    ] API call to /latest/meta-data/placement/availability-zone
[INFO    ] API call to /latest/meta-data/public-ipv4
[INFO    ] API call to /latest/meta-data/ramdisk-id
[INFO    ] API call to /latest/meta-data/reservation-id
[INFO    ] API call to /latest/meta-data/security-groups
[INFO    ] API call to /openstack/2012-08-10/meta_data.json
[INFO    ] API call to /latest/meta-data/
[INFO    ] API call to /latest/meta-data/block-device-mapping/
[INFO    ] API call to /latest/meta-data/block-device-mapping/ami
[INFO    ] API call to /latest/meta-data/block-device-mapping/ebs0
[INFO    ] API call to /latest/meta-data/block-device-mapping/ebs1
[INFO    ] API call to /latest/meta-data/block-device-mapping/root
[INFO    ] API call to /latest/meta-data/hostname
[INFO    ] API call to /latest/meta-data/instance-action
[INFO    ] API call to /latest/meta-data/instance-id
[INFO    ] API call to /latest/meta-data/instance-type
[INFO    ] API call to /latest/meta-data/local-ipv4
[INFO    ] API call to /latest/meta-data/placement/
[INFO    ] API call to /latest/meta-data/placement/availability-zone
[INFO    ] API call to /latest/meta-data/public-ipv4
[INFO    ] API call to /latest/meta-data/ramdisk-id
[INFO    ] API call to /latest/meta-data/reservation-id
[INFO    ] API call to /latest/meta-data/security-groups

It’s quite heavy.

Hopefully the script respects HTTP caching headers and doesn’t bypass 304 Not Modified responses. Otherwise it’ll add load to nova. Maybe I should check that (note-to-self).


You can add this feature by adding a file to the _grains/ folder of your salt states repository. The file can have any name ending in .py.

You can grab the grain python code in this gist.


Converting a dynamic site into static HTML documents

Twice now, I’ve been asked to take a website that was running on a CMS and make it static.

This is a useful practice if you want to keep the site content for posterity without having to maintain the underlying CMS. It also makes it easier to migrate sites, since a site you know you won’t add content to anymore simply becomes a bunch of HTML files in a folder.

My end goal was to make an EXACT copy of what the site looks like when generated by the CMS, BUT stored as simple HTML files. When I say EXACT, I mean it, down to keeping documents at their original location in the new static files. It means that every link in each HTML document had to keep its original value, BUT a file had to exist so that the web server could find it. For example, if a link points to /foo, the link in the page remains as-is even though the document is now a static file at /foo.html, and the web server serves /foo.html anyway.
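
To make that lookup rule concrete, here is a small Python sketch of the kind of fallback a web server would have to apply. It is only an illustration under my own assumptions: the function name, the candidate order (exact path, then path.html, then path/index.html) and the sample file names are hypothetical, and your web server configuration may differ.

```python
def resolve_static(request_path, existing_files):
    """Return the static file that should answer request_path, or None."""
    base = request_path.rstrip("/")
    candidates = [request_path]
    if base:
        # /foo is answered by /foo.html when no file is literally named /foo
        candidates.append(base + ".html")
    candidates.append(base + "/index.html")
    for candidate in candidates:
        if candidate in existing_files:
            return candidate
    return None


files = {"/index.html", "/foo.html", "/css/site.css"}
print(resolve_static("/foo", files))
```

The point is that the links inside the pages never change; only the lookup on the server side learns about the .html suffix.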

Here are the steps I took to achieve just that. Your mileage may vary; I’ve done those steps and they worked for me. I’ve done it once for a WordPress blog and once for the [email protected] website that was running on ExpressionEngine.


1. Browse and get all pages you think could be lost in scraping

We want a simple file with one web page per line with its full address.
This will help the crawler to not forget pages.

  • Use a web browser developer tool Network inspector, keep it open with “preserve log”.
  • Once you browsed the site a bit, from the network inspector tool, list all documents and then export using the “Save as HAR” feature.
  • Extract URLs from the HAR file using underscore-cli

    npm install underscore-cli
    cat site.har | underscore select '.entries .request .url' > workfile.txt

  • Remove the first and last lines (it’s a JSON array and we want one document per line)

  • Remove the leading hostname from each line (i.e. each line should start with /path), in vim you can do %s/http:\/\/www\
  • Remove the " and ", from each line, in vim you can do %s/",$//g
  • At the last line, make sure the " is removed too because the last regex missed it
  • Remove duplicate lines, in vim you can do :sort u
  • Save this file as list.txt for the next step.
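
If you would rather skip the manual vim cleanup, the same list can be produced with a short Python script. This is a sketch under assumptions: it relies on the standard HAR layout (log, then entries, then request.url), the function names are made up, and site.har and list.txt are simply the file names used in the steps above.

```python
import json
from urllib.parse import urlsplit


def har_to_paths(har):
    """Return the sorted, de-duplicated paths requested in a HAR document."""
    paths = set()
    for entry in har["log"]["entries"]:
        parts = urlsplit(entry["request"]["url"])
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Only the path is kept, so the hostname is dropped from every line.
        paths.add(path)
    return sorted(paths)


def write_list(har_file="site.har", list_file="list.txt"):
    """Read a HAR export and write one path per line."""
    with open(har_file, encoding="utf-8") as handle:
        har = json.load(handle)
    with open(list_file, "w", encoding="utf-8") as handle:
        handle.write("\n".join(har_to_paths(har)) + "\n")
```

Calling write_list() next to your HAR export should leave you with the same kind of list.txt as the manual steps. Note that a HAR export also records assets and third-party requests, so you may still want to review the result by hand.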

2. Let’s scrape it all

We’ll do two scrapes. First one is to get all assets it can get, then we’ll go again with different options.

The following are the commands I ran on the last successful attempt to replicate the site I was working on.
This is not a statement that this method is the most efficient technique.
Please feel free to improve the document as you see fit.

First a quick TL;DR of wget options

  • -m is the same as --mirror
  • -k is the same as --convert-links
  • -K is the same as --backup-converted which creates .orig files
  • -p is the same as --page-requisites, which makes wget fetch ALL the requirements of a page
  • -nc ensures we don’t download the same file twice and end up with duplicates (e.g. file.html AND file.1.html)
  • --cut-dirs would prevent creating directories and mix things around, do not use.

Notice that we’re sending headers as if we were a web browser. It’s up to you.

wget -i list.txt -nc --random-wait --mirror -e robots=off --no-cache -k -E --page-requisites \
     --user-agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36' \
     --header='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2' \
     --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \

Then, another pass

wget -i list.txt --mirror -e robots=off -k -K -E --no-cache --no-parent \
     --user-agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36' \
     --header='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2' \
     --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \

3. Do some cleanup on the fetched files

Here are a few commands I ran to clean the files a bit

  • Remove stray carriage returns (which can show up as empty lines) from every .orig file. Those are the files we’ll use in the end, after all

    find . -type f -regextype posix-egrep -regex '.*\.orig$' -exec sed -i 's/\r//' {} \;
  • Rename the .orig file into html

    find . -name '*.orig' | sed -e "p;s/\.orig$/.html/" | xargs -n2 mv
    find . -type f -name '*\.html\.html' | sed -e "p;s/\.html//" | xargs -n2 mv
  • Many folders might contain only an index.html file. Let’s turn each of them into a single file, without the directory

    find . -mindepth 2 -type f -name 'index.html' | sed -e "p;s/\/index\.html$/.html/" | xargs -n2 mv
  • Remove files that have a .1 (or any other number) in their name, they are most likely duplicates anyway

    find . -type f -name '*\.1\.*' -exec rm -rf {} \;
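
The renaming rules above can be hard to follow as find and sed pipelines, so here they are again as a small Python sketch that only computes the plan without touching any file. The function names are made up for this example, and it deliberately leaves the top-level index.html alone.

```python
import posixpath


def static_name(path):
    """Return the name a fetched file should end up with."""
    if path.endswith(".orig"):
        # foo.orig and foo.html.orig both become foo.html
        path = path[: -len(".orig")]
        if not path.endswith(".html"):
            path += ".html"
    directory, filename = posixpath.split(path)
    if filename == "index.html" and directory not in ("", ".", "/"):
        # some/dir/index.html becomes some/dir.html
        path = directory + ".html"
    return path


def plan_renames(paths):
    """List the (source, destination) pairs that actually need a rename."""
    plan = []
    for path in paths:
        target = static_name(path)
        if target != path:
            plan.append((path, target))
    return plan


print(plan_renames(["./about/index.html", "./blog/post.html.orig", "./index.html"]))
```

Printing the plan first and reviewing it before running any mv is a cheap way to catch surprises, such as two files that would end up with the same name.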

Setting up Discourse with Fastly as a CDN provider and SSL

The following is a copy of what I published in a question on about “Enable a CDN for your Discourse while working on

Setup detail

Our setup uses Fastly and leverages their SSL feature. Note that in order to use SSL too, you’ll have to contact them to have it enabled on your account.

SEE ALSO this post about Make Discourse “long polling” work behind Fastly. This step is required and is a logical next step to this procedure.

In summary:

  • SSL between users and Fastly
  • SSL between Fastly and “frontend” servers. (That’s the IP we put into Fastly hosts configuration, and are also referred to as “origins” or “backends” in CDN-speak)
  • Docker Discourse instance (“upstream”) which listens only on private network and port (e.g.
  • More than two publicly exposed web servers (“frontend”), with SSL, that we use as “backends” in Fastly
  • frontend server running NGINX with an upstream block proxying internal upstream web servers that the Discourse Docker provides.
  • We use NGINX’s keepalive HTTP header in the frontend to make sure we minimize connections

Using this method, if we need to scale, we only need to add more internal Discourse Docker instances and the matching NGINX upstream entries.

Note that I recommend using direct private IP addresses instead of internal names. It removes complexity and the need to rewrite Host: HTTP headers.


Everything is the same as basic Fastly configuration, refer to setup your domain.

Here are the differences:

  1. Setup your domain name with the CNAME Fastly will provide you (you will have to contact them for your account though), ours is like that ;  IN  CNAME
  2. In the Fastly panel at Configure -> Hosts, we list the publicly available frontend IPs

    Notice we use port 443, so SSL is between Fastly and our frontends. Also, you can setup Shielding (which is how you activate the CDN behavior within Fastly) by enabling it on only one. I typically set it on the one I call “first”.

    Fastly service configuration, at Hosts tab

  3. In the Fastly panel at Configure -> Settings -> Request Settings, we make sure we forward the X-Forwarded-For header. You DON’T need this; you can remove it.

    Fastly service configuration, at Settings tab

  4. Frontend NGINX server has a block similar to this.

    In our case, we use Salt Stack as the configuration management system; it basically generates the virtual hosts for us using the Salt reactor system. Every time a Docker instance becomes available, the configuration is rewritten using this template.

    • {{ upstream_port }} would be at 8000 in this example

    • {{ upstreams }} would be an array of current internal Docker instances, e.g. ['','']

    • {{ tld }} would be in production, but can be anything else we need in other deployment, it gives great flexibility.
    • Notice the use of discoursepolling alongside the discourse subdomain name. Refer to this post about Make Discourse “long polling” work behind Fastly to understand its purpose

      upstream upstream_discourse {
      {%- for b in upstreams %}
          server    {{ b }}:{{ upstream_port }};
      {%- endfor %}
          keepalive 16;
      }

      server {
          listen      443 ssl;
          server_name discoursepolling.{{ tld }} discourse.{{ tld }};
          root    /var/www/html;
          include common_params;
          include ssl_params;
          ssl                 on;
          ssl_certificate     /etc/ssl/2015/discuss.pem;
          ssl_certificate_key /etc/ssl/2015/201503.key;
          # Use internal Docker runner instance exposed port
          location / {
              proxy_pass             http://upstream_discourse;
              include                proxy_params;
              proxy_intercept_errors on;
              # Backend keepalive
              # ref:
              proxy_http_version 1.1;
              proxy_set_header Connection "";
          }
      }

    Note that I removed the include proxy_params; line. If you have lines similar to proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;, you don’t need them (!)
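
To see what the template loop expands to, here is a small Python sketch that builds the upstream block from a list of instances. It only stands in for the Jinja rendering Salt performs; the function name is made up, and the two private addresses in the example are hypothetical placeholders, not our real ones.

```python
def render_upstream(upstreams, upstream_port, keepalive=16):
    """Build the NGINX upstream block for the given internal instances."""
    lines = ["upstream upstream_discourse {"]
    for address in upstreams:
        # One server line per internal Discourse Docker instance
        lines.append("    server    {}:{};".format(address, upstream_port))
    lines.append("    keepalive {};".format(keepalive))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical private addresses, for illustration only
print(render_upstream(["10.0.0.11", "10.0.0.12"], 8000))
```

Scaling out then amounts to adding an address to the list and letting the configuration be regenerated.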

A few code snippets to automate deployment


This post is just a simple “link dump” to help me find things among several scattered notes. I intend to eventually publish all of my work as public projects on GitHub once the loop is complete. All of it without the private data, of course.

Making the jump to automation takes a lot of preparation, and I’m taking the time to publish here a few code snippets I wrote to get the job done.

In the end, my project will make it possible to deploy a site that relies on a MariaDB cluster, Memcached, a LAMP stack (“prefork”) when there is no other choice, and a stack of [HHVM/php5-fpm, Python, nodejs] app servers for the rest, all served by an NGINX frontend. My scripts will deploy a series of web applications with all the dependencies that adapt them, managed in their own parent “git repo”. In my case, that will be: WordPress, MediaWiki, Discourse, and a few others.


  • Instantiation from nova commands in the terminal creates a new, up-to-date VM whose name defines its role in the internal network
  • VMs are only reachable through a jump box (i.e. internal network only)
  • A system watches whether a git clone directory has had changes on the “master” branch and fires an event if so
  • Every machine is built from a minimal VM. In this case: Ubuntu 14.04 LTS
  • The system must make sure that ALL updates are applied regularly
  • The system must make sure that its internal services are working
  • When a VM reaches the critical OOM threshold, the VM restarts automatically
  • The VM’s name describes its role, and the install scripts install the requirements assigned to that role
  • Configurations use the details (e.g. private and public IP addresses) of each pool (e.g. redis, memcache, mariadb) and automatically adjust the configuration in each application
  • … etc.

Code snippets

Inspiring posts on the subject