python – Electricmonk.nl weblog

An Ansible safeguard

admin — Mon, 18 Jul 2022 08:44:05 +0000

At my work, we use ansible to provision all kinds of things, from servers to monitoring. Ansible is very powerful, but with great power comes great responsibility. One downside of automating many things with ansible is that you could also accidentally destroy a lot of things with a single wrong command.

In a perfect world where everything is managed by ansible, this wouldn’t be much of a problem. However, we rarely, if ever, live in a perfect world, and real deployments can drift from what is configured in ansible. The platforms we manage are not always fully under our control and can get pretty heterogeneous.

So we were looking for something like a safeguard to protect us from ourselves; to prevent us from accidentally invoking ansible in a destructive way. This article describes our solution to this problem.

Our Ansible setup

First, a little bit about how our ansible is set up. The safeguard in this article may not work for setups that differ from ours.

We have a single, overarching, site.yml playbook that determines which tasks to run for which machines, tags, etc. I’m not going to show the whole thing, but the gist of it looks like this:

- hosts: all
  roles:
    - role: common
      tags: ["common"]

    - role: firewall
      tags: ["firewall"]

    - role: certificate
      tags: ["certificate", "webserver"]
      when: "'certificate' in group_names"

Basically this says to apply the common and firewall roles to all machines, and to apply the certificate role only to machines in the certificate group. We then have a hosts file that looks a little like this:

db1.example.com
web1.example.com

[certificate]
web1.example.com

So db1 and web1 will get the common and firewall roles and web1 will get the certificate role in addition.

We then execute ansible like so:

$ ansible-playbook -b -K site.yml -t certificate -l web1.example.com

I’m not entirely sure, but I think this is a pretty common setup.

The potentially dangerous problems

We have a few roles that are potentially dangerous. For example, the webserver role will deploy a webserver. However, as time progresses, the actually deployed configuration of the webserver in production can drift from what is configured in ansible. Yes, this is something that, in theory, should never happen. Unfortunately, we live in the real world where things are not always perfect for various reasons. In our situation, we can’t always be fully in control, and it’s something we just have to be pragmatic about.

There are also roles and tags that are just inherently dangerous. For instance, we have a few tags that always require restarts of services, which may cause disruptions if done during office hours.

Then there’s the problem of overly broad host specifications. For example, if we accidentally forget to specify a host limit or a tag, or we make a typo, we may inadvertently roll out way too many changes.

We wanted a way to prevent us from accidentally making these mistakes, but still allow us to overrule any safeguards if we were sure it was the right thing to do.

Our solution

What we came up with is a special task at the top of site.yml that always runs, regardless of what tags or limits you specify:

- name: Safeguard
  # Always run regardless of what tags or limits the user specifies.
  hosts: all

  connection: local
  become: no
  gather_facts: false

  tasks:
    # Call a local script in the repo that will perform some safety checks.
    - name: Check hosts and tags
      ansible.builtin.shell:
        cmd: tools/safeguard.py
      delegate_to: localhost
      run_once: true

      # Pass some information about this ansible run to the script via the
      # environment.
      environment:
        safeguard_limit: "{{ ansible_limit|default('') }}"
        safeguard_hosts: "{{ ansible_play_hosts }}"
        safeguard_tags: "{{ ansible_run_tags }}"

        # The user can override safety guards by setting these variables
        # using '-e sg_nolimit=yes'
        sg_nolimit: "{{ sg_nolimit|default('BREAKBAD') }}"
        sg_notag: "{{ sg_notag|default('BREAKBAD') }}"
        sg_dangertag: "{{ sg_dangertag|default('BREAKBAD') }}"
        sg_manyhosts: "{{ sg_manyhosts|default('BREAKBAD') }}"

      # The 'always' tag is special in ansible and will always match regardless
      # of which tags you specify (including none at all).
      tags:
        - always

      changed_when: False

I’m not going to explain this task in detail; the comments in it should be enough to fully understand it. Basically, it passes some of the current ansible run information, such as the user-specified tags and limits, to a script, which will check for potentially dangerous things, such as not specifying a limit. The user can override these checks by setting various variables using -e sg_XXXX.

So, for example, the user must specify a host limit using -l. Otherwise, the playbook will execute on all hosts that match it, and that may not be what you intended. You can override this safeguard like so:

$ ansible-playbook -b -K site.yml -e sg_nolimit=yes -t certificate

This will probably also trigger the “manyhosts” safeguard, which checks that you’re not specifying too many hosts at the same time. So you’d also have to override that safeguard:

$ ansible-playbook -b -K site.yml -e sg_nolimit=yes -e sg_manyhosts=yes -t certificate
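
On the receiving end, everything the play exports arrives in the script as plain strings. The sketch below shows how such values could be turned back into Python objects with ast.literal_eval(). The values in the env dict are made up for illustration, and the sketch assumes ansible renders lists in Python literal form (which is what the script’s use of ast.literal_eval() suggests). parse_list() is a hypothetical helper, not part of the real script.

```python
import ast

# Hypothetical values, shaped like what the play above is assumed to export.
env = {
    "safeguard_limit": "web1.example.com",
    "safeguard_hosts": "['web1.example.com']",
    "safeguard_tags": "['certificate']",
    "sg_nolimit": "BREAKBAD",  # default when the user passes no override
}


def parse_list(value):
    """Turn a stringified list back into a real Python list."""
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, (list, tuple)):
        raise ValueError("expected a list, got {!r}".format(parsed))
    return list(parsed)


hosts = parse_list(env["safeguard_hosts"])
tags = parse_list(env["safeguard_tags"])
limit_given = env["safeguard_limit"] != ""
override_nolimit = env["sg_nolimit"] == "yes"

print(hosts, tags, limit_given, override_nolimit)
```

ast.literal_eval() only accepts literals, so a malformed or unexpected value raises an exception instead of being executed.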

The safeguard script

The safeguard.py script looks like this:

#!/usr/bin/env python3

# Safeguard script, executed by the first task in site.yml.

import ast
import os
import sys


def check_constraint(cb, override, err_msg):
    """
    Wrapper function around constraint_ functions. Does some boilerplate such
    as checking for overrides.
    """
    if override in os.environ and os.environ[override] == 'yes':
        # User has overridden this constraint with an extra var.
        return

    # Call the callback. If it doesn't return True, abort.
    if cb() is not True:
        sys.stderr.write("{}. Override with '{}=yes'.\n".format(
            err_msg,
            override)
        )
        sys.exit(1)

def constraint_nolimit():
    """
    The user should specify a limit with '-l' or '--limit'. If not, this var
    will be empty.
    """
    # If this is not empty, it's fine
    if os.environ["safeguard_limit"] != "":
        return True

def constraint_notags():
    """
    The user should specify a tag. If not, the value here becomes 'all'. Stop
    if it is.
    """
    tags = ast.literal_eval(os.environ["safeguard_tags"])
    if len(tags) > 0 and "all" not in tags:
        return True

def constraint_dangertags():
    """
    Some tags are a bit dangerous
    """
    # FIXME: Hardcoded
    danger_tags = ["common", "webserver"]
    tags = ast.literal_eval(os.environ["safeguard_tags"])
    for tag in tags:
        if tag in danger_tags:
            return False

    return True

def constraint_manyhosts():
    """
    Executing stuff on many hosts may not be a good idea.
    """
    if os.environ["sg_nolimit"] == "yes":
        return True

    hosts = ast.literal_eval(os.environ["safeguard_hosts"])
    if len(hosts) < 4:
        return True

    return False


if __name__ == "__main__":
    check_constraint(constraint_nolimit, "sg_nolimit", "No limit specified")
    check_constraint(constraint_notags, "sg_notag", "No tag(s) specified")
    check_constraint(constraint_dangertags, "sg_dangertag", "Dangerous tags specified")
    check_constraint(constraint_manyhosts, "sg_manyhosts", "Too many hosts specified")

I've reduced the script a bit for clarity. Again, I'm not going to fully explain how it works. If you can read a little bit of Python, its workings should be self-evident. There's a bit of dynamic dispatch magic in it to call the various constraint_ methods. Not something I usually recommend as it can lead to unclear call stacks pretty quickly, but in such a small script it's not much of a problem.
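
The reduced script above passes its callbacks explicitly, so the name-based dispatch isn’t visible in it. As a rough sketch of what such dispatch might look like (an assumption on my part, not the real script): every function whose name starts with constraint_ is looked up by name, and the matching override is assumed to be called sg_ plus the rest of the function name. The env dict, find_constraints() and failed_constraints() are hypothetical names used only for this illustration.

```python
def constraint_nolimit(env):
    """Pass only if a host limit was given."""
    return env.get("safeguard_limit", "") != ""


def constraint_manyhosts(env):
    """Pass only if fewer than four hosts are targeted."""
    return len(env.get("safeguard_hosts", [])) < 4


def find_constraints(namespace):
    """Collect (override_name, function) pairs for all constraint_* callables."""
    prefix = "constraint_"
    return [
        ("sg_" + name[len(prefix):], obj)
        for name, obj in sorted(namespace.items())
        if name.startswith(prefix) and callable(obj)
    ]


# Look the functions up by name instead of listing them by hand.
CONSTRAINTS = find_constraints(globals())


def failed_constraints(env):
    """Return override names of constraints that fail and are not overridden."""
    return [
        override
        for override, func in CONSTRAINTS
        if env.get(override) != "yes" and func(env) is not True
    ]


env = {"safeguard_limit": "", "safeguard_hosts": ["web1.example.com"]}
print(failed_constraints(env))                          # ['sg_nolimit']
print(failed_constraints(dict(env, sg_nolimit="yes")))  # []
```

The upside is that adding a new check only requires writing a new constraint_ function. The downside is the one mentioned above: nothing in the code calls these functions by name, which makes the call flow harder to follow.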

Conclusion

This safeguard construction has been working well for us. While actually fixing the dangerous situations is always preferable, sometimes in real life things get messy and an extra hurdle can prevent accidental damage. This solution, coupled with --check and various protections in the roles themselves, has so far prevented us from creating accidental production disruptions.

SSL/TLS client certificate verification with Python v3.4+ SSLContext

admin — Sat, 02 Jun 2018 10:12:05 +0000

Normally, an SSL/TLS client verifies the server’s certificate. It’s also possible for the server to require a signed certificate from the client. These are called Client Certificates. This ensures that not only can the client trust the server, but the server can also trust the client.

Traditionally in Python, you’d pass the ca_certs parameter to the ssl.wrap_socket() function on the server to enable client certificates:

# Client
ssl.wrap_socket(s, ca_certs="ssl/server.crt", cert_reqs=ssl.CERT_REQUIRED,
                certfile="ssl/client.crt", keyfile="ssl/client.key")

# Server
ssl.wrap_socket(connection, server_side=True, certfile="ssl/server.crt",
                keyfile="ssl/server.key", ca_certs="ssl/client.crt")

Since Python v3.4, the more secure, and thus preferred method of wrapping a socket in the SSL/TLS layer is to create an SSLContext instance and call SSLContext.wrap_socket(). However, the SSLContext.wrap_socket() method does not have the ca_certs parameter. Neither is it directly obvious how to enable requirement of client certificates on the server-side.

The documentation for SSLContext.load_default_certs() does mention client certificates:

Purpose.CLIENT_AUTH loads CA certificates for client certificate verification on the server side.

But SSLContext.load_default_certs() loads the system’s default trusted Certificate Authority chains so that the client can verify the server’s certificates. You generally don’t want to use these for client certificates.

In the Verifying Certificates section, it mentions that you need to specify CERT_REQUIRED:

In server mode, if you want to authenticate your clients using the SSL layer (rather than using a higher-level authentication mechanism), you’ll also have to specify CERT_REQUIRED and similarly check the client certificate.

I didn’t spot how to specify CERT_REQUIRED in either the SSLContext constructor or the wrap_socket() method. Turns out you have to manually set a property on the SSLContext on the server to enable client certificate verification, like this:

context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
context.verify_mode = ssl.CERT_REQUIRED
context.load_cert_chain(certfile=server_cert, keyfile=server_key)
context.load_verify_locations(cafile=client_certs)

Here’s a full example of a client and server who both validate each other’s certificates:

For this example, we’ll create self-signed server and client certificates. Normally you’d use a server certificate from a Certificate Authority such as Let’s Encrypt, and would set up your own Certificate Authority so you can sign and revoke client certificates.

Create server certificate:

openssl req -new -newkey rsa:2048 -days 365 -nodes -x509 -keyout server.key -out server.crt

Make sure to enter ‘example.com’ for the Common Name.

Next, generate a client certificate:

openssl req -new -newkey rsa:2048 -days 365 -nodes -x509 -keyout client.key -out client.crt

The Common Name for the client certificate doesn’t really matter.

Client code:

#!/usr/bin/python3

import socket
import ssl

host_addr = '127.0.0.1'
host_port = 8082
server_sni_hostname = 'example.com'
server_cert = 'server.crt'
client_cert = 'client.crt'
client_key = 'client.key'

context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=server_cert)
context.load_cert_chain(certfile=client_cert, keyfile=client_key)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
conn = context.wrap_socket(s, server_side=False, server_hostname=server_sni_hostname)
conn.connect((host_addr, host_port))
print("SSL established. Peer: {}".format(conn.getpeercert()))
print("Sending: 'Hello, world!")
conn.send(b"Hello, world!")
print("Closing connection")
conn.close()

Server code:

#!/usr/bin/python3

import socket
from socket import AF_INET, SOCK_STREAM, SO_REUSEADDR, SOL_SOCKET, SHUT_RDWR
import ssl

listen_addr = '127.0.0.1'
listen_port = 8082
server_cert = 'server.crt'
server_key = 'server.key'
client_certs = 'client.crt'

context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
context.verify_mode = ssl.CERT_REQUIRED
context.load_cert_chain(certfile=server_cert, keyfile=server_key)
context.load_verify_locations(cafile=client_certs)

bindsocket = socket.socket()
bindsocket.bind((listen_addr, listen_port))
bindsocket.listen(5)

while True:
    print("Waiting for client")
    newsocket, fromaddr = bindsocket.accept()
    print("Client connected: {}:{}".format(fromaddr[0], fromaddr[1]))
    conn = context.wrap_socket(newsocket, server_side=True)
    print("SSL established. Peer: {}".format(conn.getpeercert()))
    buf = b''  # Buffer to hold received client data
    try:
        while True:
            data = conn.recv(4096)
            if data:
                # Client sent us data. Append to buffer
                buf += data
            else:
                # No more data from client. Show buffer and close connection.
                print("Received:", buf)
                break
    finally:
        print("Closing connection")
        conn.shutdown(socket.SHUT_RDWR)
        conn.close()

Output from the server looks like this:

$ python3 ./server.py 
Waiting for client
Client connected: 127.0.0.1:51372
SSL established. Peer: {'subject': ((('countryName', 'AU'),),
(('stateOrProvinceName', 'Some-State'),), (('organizationName', 'Internet
Widgits Pty Ltd'),), (('commonName', 'someclient'),)), 'issuer':
((('countryName', 'AU'),), (('stateOrProvinceName', 'Some-State'),),
(('organizationName', 'Internet Widgits Pty Ltd'),), (('commonName',
'someclient'),)), 'notBefore': 'Jun  1 08:05:39 2018 GMT', 'version': 3,
'serialNumber': 'A564F9767931F3BC', 'notAfter': 'Jun  1 08:05:39 2019 GMT'}
Received: b'Hello, world!'
Closing connection
Waiting for client

Output from the client:

$ python3 ./client.py 
SSL established. Peer: {'notBefore': 'May 30 20:47:38 2018 GMT', 'notAfter':
'May 30 20:47:38 2019 GMT', 'subject': ((('countryName', 'NL'),),
(('stateOrProvinceName', 'GLD'),), (('localityName', 'Ede'),),
(('organizationName', 'Electricmonk'),), (('commonName', 'example.com'),)),
'issuer': ((('countryName', 'NL'),), (('stateOrProvinceName', 'GLD'),),
(('localityName', 'Ede'),), (('organizationName', 'Electricmonk'),),
(('commonName', 'example.com'),)), 'version': 3, 'serialNumber':
'CAEC89334941FD9F'}
Sending: 'Hello, world!
Closing connection
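
The dictionary returned by getpeercert() is nested rather deeply, as the output above shows. If the server wants to know which client it is talking to, a small helper can pull out the Common Name. This is only a sketch: peer_common_name() is a hypothetical helper, and the example dictionary is a shortened version of the server output shown above.

```python
def peer_common_name(peercert):
    """Return the commonName from a dict shaped like SSLSocket.getpeercert()."""
    # 'subject' is a tuple of RDNs; each RDN is a tuple of (key, value) pairs.
    for rdn in peercert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


example = {
    "subject": (
        (("countryName", "AU"),),
        (("organizationName", "Internet Widgits Pty Ltd"),),
        (("commonName", "someclient"),),
    ),
}
print(peer_common_name(example))  # someclient
```

In the server code this could be called as peer_common_name(conn.getpeercert()) right after wrap_socket() returns.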

A few notes:

You can concatenate multiple client certificates into a single PEM file to authenticate different clients.
You can re-use the same cert and key on both the server and client. This way, you don’t need to generate a specific client certificate. However, any clients using that certificate will require the key, and will be able to impersonate the server. There’s also no way to distinguish between clients anymore.
You don’t need to set up your own Certificate Authority and sign client certificates. You can just generate them with the above-mentioned openssl command and add them to the trusted certificates file. If you no longer trust the client, just remove the certificate from the file.
I’m not sure if the server verifies the client certificate’s expiration date.
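
The first note can be sketched in a few lines of Python. The helper below concatenates several PEM certificates into a single file that can then be passed to load_verify_locations(cafile=...). The function name build_trust_file() and all file names are hypothetical, and the check for the PEM header is only a rough sanity check, not real certificate validation.

```python
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def build_trust_file(cert_paths, out_path):
    """Concatenate PEM certificates into one file and return how many were written."""
    chunks = []
    for path in cert_paths:
        text = Path(path).read_text().strip()
        if PEM_HEADER not in text:
            raise ValueError("{} does not look like a PEM certificate".format(path))
        chunks.append(text)
    # One certificate after the other, separated by newlines.
    Path(out_path).write_text("\n".join(chunks) + "\n")
    return len(chunks)
```

For example, build_trust_file(["client1.crt", "client2.crt"], "trusted_clients.pem") would produce a file the server can load as its list of trusted clients. Removing a client then means rebuilding the file without that client’s certificate.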

Lurch: a unixy launcher and auto-typer

admin — Sun, 04 Mar 2018 08:45:44 +0000

I cobbled together a unixy command / application launcher and auto-typer. I’ve dubbed it Lurch.

Features:

Fuzzy filtering as-you-type.
Execute commands.
Open new browser tabs.
Auto-type into currently focussed window
Auto-type TOTP / rfc6238 / two-factor / Google Authenticator codes.
Unixy and composable. Reads entries from stdin.

You can use and combine these features to do many things:

Auto-type passwords
Switch between currently opened windows by typing a part of its title (using wmctrl to list and switch to windows)
As a generic (and very customizable) application launcher by parsing .desktop entries or whatever.
Quickly cd to parts of your filesystem using auto-type.
Open browser tabs and search via google or specific search engines.
List all entries in your SSH configuration and quickly launch an ssh session to one of them.
Etc.

You’ll need a way to launch it when you press a keybinding. That’s usually the window manager’s job. For XFCE, you can add a keybinding under the Keyboard -> Application Shortcuts settings dialog.

Here’s what it looks like:

Unfortunately, due to time constraints, I cannot provide any support for this project:

NO SUPPORT: There is absolutely ZERO support on this project. Due to time constraints, I don’t take bug reports or feature requests and probably won’t accept your pull requests.

You can get it from the Github page.

Understanding Python’s logging module

admin — Sun, 06 Aug 2017 08:40:08 +0000

I’m slightly embarrassed to say that after almost two decades of programming Python, I still didn’t understand its logging module. Sure, I could get it to work, and reasonably well, but I’d often end up with unexplained situations such as double log lines or logging that I didn’t want.

>>> requests.get('https://www.electricmonk.nl')
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.electricmonk.nl
DEBUG:requests.packages.urllib3.connectionpool:https://www.electricmonk.nl:443 "GET / HTTP/1.1" 301 178

What? I never asked for that debugging information?

So I decided to finally, you know, read the documentation and experiment a bit to figure out how the logging module really works. In this article I want to bring attention to some of the misconceptions I had about the logging module. I’m going to assume you have a basic understanding of how it works and know about loggers, log levels and handlers.

Logger hierarchy

Loggers have a hierarchy. That is, you can create individual loggers and each logger has a parent. At the top of the hierarchy is the root logger. For instance, we could have the following loggers:

myapp
myapp.ui
myapp.ui.edit

These can be created by asking a parent logger for a new child logger:

>>> log_myapp = logging.getLogger("myapp")
>>> log_myapp_ui = log_myapp.getChild("ui")
>>> log_myapp_ui.name
'myapp.ui'
>>> log_myapp_ui.parent.name
'myapp'

Or you can use dot notation:

>>> log_myapp_ui = logging.getLogger("myapp.ui")
>>> log_myapp_ui.parent.name
'myapp'

Generally, you should use the dot notation.

One thing that’s not immediately clear is that the logger names don’t include the root logger. In actuality, the logger hierarchy looks like this:

root.myapp
root.myapp.ui
root.myapp.ui.edit
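
A quick way to see the root logger at the top of the hierarchy is to walk the parent attribute upwards. This is a small sketch; ancestry() is a hypothetical helper name.

```python
import logging

# Create the loggers top-down so every level of the hierarchy exists.
for name in ("myapp", "myapp.ui", "myapp.ui.edit"):
    logging.getLogger(name)


def ancestry(name):
    """Return the names of a logger and all of its parents, bottom-up."""
    names = []
    logger = logging.getLogger(name)
    while logger is not None:
        names.append(logger.name)
        logger = logger.parent
    return names


chain = ancestry("myapp.ui.edit")
print(chain)  # ['myapp.ui.edit', 'myapp.ui', 'myapp', 'root']
```

Note that a logger’s parent is the nearest ancestor that actually exists. If “myapp” and “myapp.ui” had never been requested with getLogger(), the parent of “myapp.ui.edit” would be the root logger directly.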

Log levels and message propagation

Each logger can have a log level. When you send a message to a logger, you specify the log level of the message. If the level matches, the message is then propagated up the hierarchy of loggers. One of the biggest misconceptions I had was that I thought each logger checked the level of the message and, if the level of the message was equal to or higher than that logger’s own level, the logger’s handlers would be invoked. This is not true!

What happens instead is that the level of the message is only checked by the logger you give the message to. If the message’s level is equal to or higher than that logger’s level, the message is propagated up the hierarchy, but none of the other loggers will check the level! They’ll simply invoke their handlers.

>>> log_myapp.setLevel(logging.ERROR)
>>> log_myapp_ui.setLevel(logging.DEBUG)
>>> log_myapp_ui.debug('test')
DEBUG:myapp.ui:test

In the example above, the root logger has a handler that prints the message. Even though the “log_myapp” logger has a level of ERROR, the DEBUG message is still propagated to the root logger. This image (found on this page) shows why:

As you can see, when giving a message to a logger, the logger checks the level. After that, the level on the loggers is no longer checked and all handlers in the entire chain are invoked, regardless of level. Note that you can set levels on handlers as well. This is useful if you want to, for example, create a debugging log file but only show warnings and errors on the console.
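
The behaviour can be verified without any console output by attaching a handler that simply collects messages. This is a sketch: ListHandler and the sketch_app logger names are made up for the demonstration, and propagation to the root logger is switched off only to keep the demo self-contained.

```python
import logging


class ListHandler(logging.Handler):
    """Handler that stores formatted messages in a list instead of printing."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


parent = logging.getLogger("sketch_app")
child = logging.getLogger("sketch_app.ui")
parent.propagate = False  # keep the demo away from the root logger's handlers

seen = ListHandler()
parent.addHandler(seen)

parent.setLevel(logging.ERROR)
child.setLevel(logging.DEBUG)

child.debug("via child")    # only checked against the child's DEBUG level
parent.debug("via parent")  # checked against the parent's ERROR level: dropped

print(seen.messages)  # ['via child']
```

The DEBUG message sent to the child reaches the parent’s handler even though the parent’s own level is ERROR. The same message sent directly to the parent is dropped.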

It’s also worth noting that by default, loggers have a level of 0 (NOTSET). This means they use the log level of the first parent logger that has an actual level set. This is determined at message-time, not when the logger is created.
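
The “determined at message-time” part can be made visible with getEffectiveLevel(), which performs the same lookup the logger does when a message comes in. The logger names in this sketch are made up.

```python
import logging

parent = logging.getLogger("sketch_levels")
child = logging.getLogger("sketch_levels.ui")

own_level = child.level  # 0: the child has no level of its own

parent.setLevel(logging.ERROR)
first = child.getEffectiveLevel()   # 40: inherited from the parent

parent.setLevel(logging.INFO)
second = child.getEffectiveLevel()  # 20: follows the parent's new level

print(own_level, first, second)  # 0 40 20
```

Changing the parent’s level after the child was created immediately changes what the child lets through.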

The root logger

The logging tutorial for Python explains that to configure logging, you can use basicConfig():

logging.basicConfig(filename='example.log',level=logging.DEBUG)

It’s not immediately obvious, but what this does is configure the root logger. Doing this may cause some counter-intuitive behaviour, because it causes debugging output for all loggers in your program, including every library that uses logging. This is why the requests module suddenly starts outputting debug information when you configure the root logger.

In general, your program or library shouldn’t log directly against the root logger. Instead configure a specific “main” logger for your program and put all the other loggers under that logger. This way, you can toggle logging for your specific program on and off by setting the level of the main logger. If you’re still interested in debugging information for all the libraries you’re using, feel free to configure the root logger. There is no convenient method such as basicConfig() to configure a main logger, so you’ll have to do it manually:

ch = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s %(levelname)8s %(name)s | %(message)s')
ch.setFormatter(formatter)

logger = logging.getLogger('myapp')
logger.addHandler(ch)
logger.setLevel(logging.WARNING)  # This toggles all the logging in your app

There are more pitfalls when it comes to the root logger. If you call any of the module-level logging methods, the root logger is automatically configured in the background for you. This goes completely against Python’s “explicit is better than implicit” rule:

#!/usr/bin/env python
import logging
logging.warning("uhoh")
# Output: WARNING:root:uhoh

In the example above, I never configured a handler. It was done automatically. And on the root logger no less. This will cause all kinds of logging output from libraries you might not want. So don’t use logging.warning(), logging.error() and the other module-level functions. Always log against a specific logger instance you got with logging.getLogger().

This has tripped me up many times, because I’ll often do some simple logging in the main part of my program with these. It becomes especially confusing when you do something like this:

#!/usr/bin/python
import logging
for x in foo:
    try:
        something()
    except ValueError as err:
        logging.exception(err)
        pass  # we don't care

Now there will be no logging until an error occurs, and then suddenly the root logger is configured and subsequent iterations of the loop may start logging messages.

The Python documentation also mentions the following caveat about using the module-level functions:

The above module-level convenience functions, which delegate to the root logger, call basicConfig() to ensure that at least one handler is available. Because of this, they should not be used in threads, in versions of Python earlier than 2.7.1 and 3.2, unless at least one handler has been added to the root logger before the threads are started. In earlier versions of Python, due to a thread safety shortcoming in basicConfig(), this can (under rare circumstances) lead to handlers being added multiple times to the root logger, which can in turn lead to multiple messages for the same event.

Debugging logging problems

When I run into weird logging problems such as no output, or double lines, I generally put the following debugging code at the point where I’m logging the message.

log_to_debug = logging.getLogger("myapp.ui.edit")
while log_to_debug is not None:
    print("level: %s, name: %s, handlers: %s" % (log_to_debug.level,
                                                 log_to_debug.name,
                                                 log_to_debug.handlers))
    log_to_debug = log_to_debug.parent

which outputs:

level: 0, name: myapp.ui.edit, handlers: []
level: 0, name: myapp.ui, handlers: []
level: 0, name: myapp, handlers: []
level: 30, name: root, handlers: []

From this output it becomes obvious that all loggers use a level of 30, since their log levels are 0, which means they look up the hierarchy for the first logger with a non-zero level. I’ve also not configured any handlers. If I was seeing double output, it’s probably because there is more than one handler configured.

Summary

When you log a message, the level is only checked at the logger you logged the message against. If it passes, every handler on every logger up the hierarchy is called, regardless of that logger’s level.
By default, loggers have a level of 0. This means they use the log level of the first parent logger that has an actual level set. This is determined at message-time, not when the logger is created.
Don’t log directly against the root logger. That means: no logging.basicConfig() and no usage of module-level functions such as logging.warning(), as they have unintended side-effects.
Create a uniquely named top-level logger for your application / library and put all child loggers under that logger. Configure a handler for output on the top-level logger for your application. Don’t configure a level on your loggers, so that you can set a level at any point in the hierarchy and get logging output at that level for all underlying loggers. Note that this is an appropriate strategy for how I usually structure my programs. It might not be for you.
The easiest way that’s usually correct is to use __name__ as the logger name: log = logging.getLogger(__name__). This uses the module hierarchy as the name, which is generally what you want.
Read the entire logging HOWTO and specifically the Advanced Logging Tutorial, because it really should be called “logging basics”.

Merging two Python dictionaries by deep-updating

admin — Sun, 07 May 2017 14:56:38 +0000

Say we have two Python dictionaries:

{
    'name': 'Ferry',
    'hobbies': ['programming', 'sci-fi']
}

and

{
    'hobbies': ['gaming']
}

What if we want to merge these two dictionaries such that “gaming” is added to the “hobbies” key of the first dictionary? I couldn’t find anything online that did this already, so I wrote the following function for it:

# Copyright Ferry Boender, released under the MIT license.
import copy


def deepupdate(target, src):
    """Deep update target dict with src
    For each k,v in src: if k doesn't exist in target, it is deep copied from
    src to target. Otherwise, if v is a list, target[k] is extended with
    src[k]. If v is a set, target[k] is updated with v. If v is a dict,
    recursively deep-update it.

    Examples:
    >>> t = {'name': 'Ferry', 'hobbies': ['programming', 'sci-fi']}
    >>> deepupdate(t, {'hobbies': ['gaming']})
    >>> print(t)
    {'name': 'Ferry', 'hobbies': ['programming', 'sci-fi', 'gaming']}
    """
    for k, v in src.items():
        if type(v) == list:
            if not k in target:
                target[k] = copy.deepcopy(v)
            else:
                target[k].extend(v)
        elif type(v) == dict:
            if not k in target:
                target[k] = copy.deepcopy(v)
            else:
                deepupdate(target[k], v)
        elif type(v) == set:
            if not k in target:
                target[k] = v.copy()
            else:
                target[k].update(v.copy())
        else:
            target[k] = copy.copy(v)

It uses a combination of deepcopy(), updating and self recursion to perform a complete merger of the two dictionaries.
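
To show the list, dict and set branches working together, here is a runnable sketch. It uses a condensed restatement of the function above (same behaviour for the inputs shown, but not the original code), and all the data in it is made up.

```python
import copy


def deepupdate(target, src):
    """Condensed restatement of deepupdate(), for a self-contained demo."""
    for k, v in src.items():
        if k not in target:
            target[k] = copy.deepcopy(v)  # new key: copy it over
        elif type(v) is list:
            target[k].extend(v)           # lists are extended
        elif type(v) is dict:
            deepupdate(target[k], v)      # dicts are merged recursively
        elif type(v) is set:
            target[k].update(v)           # sets are unioned
        else:
            target[k] = copy.copy(v)      # everything else is overwritten


person = {
    "name": "Ferry",
    "hobbies": ["programming", "sci-fi"],
    "contact": {"email": "ferry@example.com"},
    "tags": {"python"},
}
update = {
    "hobbies": ["gaming"],
    "contact": {"irc": "ferry"},
    "tags": {"ansible"},
    "employed": True,
}

deepupdate(person, update)
print(person["hobbies"])  # ['programming', 'sci-fi', 'gaming']
print(person["contact"])  # {'email': 'ferry@example.com', 'irc': 'ferry'}
```

Note that the target dictionary is modified in place and the function returns nothing, just like dict.update().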

As mentioned in the comment, the above function is released under the MIT license, so feel free to use it in any of your programs.

Reliable message delivery with Mosquitto (MQTT)

admin — Mon, 20 Feb 2017 19:10:46 +0000

I was looking for a message queue that could reliably handle messages in such a way that I was guaranteed never to miss one, even if the consumer is offline or crashes. Mosquitto (MQTT) comes very close to that goal. However, it wasn’t directly obvious how to configure it to be as reliable as possible. So this post describes how to use Mosquitto to ensure the most reliable delivery it can handle.

TL;DR: You can’t

If you want to do reliable message handling with Mosquitto, the short answer is: You can’t. For the long answer, read the rest of the article. Or if you’re lazy and stubborn, read the “Limitations” section further down. ;-)

Anyway, let’s get on with the show and see how close Mosquitto can get.

Quick overview of Mosquitto

Here’s a quick schematic of Mosquitto components:

+----------+     +--------+     +----------+
| producer |---->| broker |---->| consumer |
+----------+     +--------+     +----------+

The producer sends messages to a topic on the broker. The broker maintains an internal state of topics and which consumers are interested in which topics. It also maintains a queue of messages which still need to be sent to each consumer. How the broker decides what / when to send to which consumer depends on settings such as the QoS (Quality of Service) and what kind of session the consumer is opening.

Producer and consumer settings

Here’s a quick overview of settings that ensure the highest available quality of delivery of messages with Mosquitto. When creating a consumer or producer, ensure you set these settings properly:

quality-of-service must be 2.
The consumer must send a client_id.
clean_session on the consumer must be False.

These are the base requirements to ensure that each consumer will receive messages exactly once, even if they’ve been offline for a while. The quality-of-service setting of 2 ensures that the broker requires acknowledgement from the consumer that a message has been received properly. Only then does the broker update its internal state to advance the consumer to the next message in the queue. If the client crashes before acknowledging the message, it’ll be resent the next time.

The client_id gives the broker a unique name under which to store session state information such as the last message the consumer has properly acknowledged. Without a client_id, the broker cannot do this.

The clean_session setting lets the consumer inform the broker about whether it wants its session state remembered. Unless it is set to False, the broker assumes the consumer does not care about past messages and such. The consumer will then only receive new messages that are produced after it has connected to the broker.

Together these settings ensure that messages are reliably delivered from the producer to the broker and to the consumer, even if the consumer has been disconnected for a while or crashes while receiving the message.

Broker settings

The following settings are relevant configuration options on the broker. You can generally find these settings in /etc/mosquitto/mosquitto.conf.

The broker must have persistence set to True in the broker configuration.
You may want to set max_inflight_messages to 1 in the broker configuration to ensure correct ordering of messages.
Configure max_queued_messages to the maximum number of messages to retain in a queue.
Tweak autosave_interval to how often you want the broker to write the in-memory database to disk.

The persistence setting informs the broker that you’d like session state and message queues written to disk. If the broker goes down for some reason, the messages will (mostly) still be there.

You can ensure that messages are sent to consumers in the same order as they were sent to the broker by the producers by setting the max_inflight_messages setting to 1. This will probably severely limit the throughput speed of messages.

The max_queued_messages setting determines the maximum number of unconfirmed messages that are retained in queues. This should basically be the product of the maximum number of messages per second and the maximum time a consumer might be offline. Say we’re processing 1 message per second and we want the consumer to be able to be offline for 2 hours (= 7200 seconds), then the max_queued_messages setting should be 1 * 7200 = 7200.
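The sizing rule above is simple enough to capture in a few lines. This is only a sketch of the arithmetic; the function name is made up for illustration and is not part of Mosquitto or paho.

```python
import math

def max_queued_messages(messages_per_second, max_offline_seconds):
    """Upper bound on messages that pile up while a consumer is offline."""
    # Round up so fractional message rates never undersize the queue.
    return math.ceil(messages_per_second * max_offline_seconds)

# The example from the text: 1 message per second, 2 hours offline.
print(max_queued_messages(1, 2 * 60 * 60))
```

The result is the value you would put behind max_queued_messages in the broker configuration.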

The autosave_interval determines how often you want the broker to write the in-memory database to disk. I suspect that setting this to a very low level will cause severe Disk I/O activity.

Examples

Here’s an example of a producer and consumer:

producer.py:

import paho.mqtt.client as paho
import time

client = paho.Client(protocol=paho.MQTTv31)
client.connect("localhost", 1883)
client.loop_start()
client.publish("mytesttopic", str("foo"), qos=2)
time.sleep(1)  # Give the client loop time to process the message

consumer.py:

import paho.mqtt.client as paho

def on_message(client, userdata, msg):
    print(msg.topic+" "+str(msg.qos)+" "+str(msg.payload))

client = paho.Client("testclient", clean_session=False, protocol=paho.MQTTv31)
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("mytesttopic", qos=2)
client.loop_forever()

Pitfalls

There are a few pitfalls I ran into when using Mosquitto:

If the broker or one of the clients doesn’t support the MQTT 3.1.1 protocol (MQTTv311 in paho), things will fail silently. So I specify MQTTv31 manually.
The client loop needs some time to process the sending and receiving of messages. If you send a single message and exit your program right away, the loop doesn’t have time to actually send the message.
The subscriber must have already run once before the broker will start keeping messages for it. Otherwise, the broker has no idea that a consumer with QoS=2 is interested in messages (and would have to keep messages for ever). So register your consumer once by just running it, before the producer runs.

Limitations

Although the settings above make exchanging messages with Mosquitto more reliable, there are still some downsides:

Exchanging messages in this way is obviously slower than having no consistency checks in place.
Since the Mosquitto broker only writes the in-memory database to disk every X (where X is configurable) seconds, you may lose data if the broker crashes.
On the consumer side, it is the MQTT library that confirms the receipt of the message. However, as far as I can tell, there is no way to manually confirm the receipt of a message. So if your client crashes while handling a message, rather than while it is receiving a message, you may still lose the message. If you wish to handle this case, you can store the message on the client as soon as possible. This is, however, not much more reliable. The only other way is to implement some manual protocol via the exchange of messages where the original publisher retains a message and resends it unless it’s been acknowledged by the consumer.
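The "store the message on the client as soon as possible" idea could look roughly like the sketch below. It assumes a local SQLite file as the store; the table layout and all function names are hypothetical. It narrows the window in which a message can be lost, but a crash between receiving the message and committing it still loses it.

```python
import sqlite3

def open_inbox(path):
    """Open (or create) the local inbox database. The schema is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT NOT NULL,"
        " payload BLOB NOT NULL,"
        " handled INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db

def store_message(db, topic, payload):
    """Persist a message before doing any real work with it."""
    cur = db.execute(
        "INSERT INTO inbox (topic, payload) VALUES (?, ?)", (topic, payload)
    )
    db.commit()
    return cur.lastrowid

def unhandled_messages(db):
    """Return (id, topic, payload) rows that still need processing."""
    return db.execute(
        "SELECT id, topic, payload FROM inbox WHERE handled = 0 ORDER BY id"
    ).fetchall()

def mark_handled(db, message_id):
    """Flag a message as fully processed."""
    db.execute("UPDATE inbox SET handled = 1 WHERE id = ?", (message_id,))
    db.commit()
```

In the consumer, on_message would only call store_message(db, msg.topic, msg.payload). The actual work then happens in a separate step that walks unhandled_messages(db) and calls mark_handled() afterwards, which also picks up leftovers after a crash. Note that a sqlite3 connection should be used from the thread that created it, so open it in the thread that runs the callbacks.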

Conclusion

In other words, as far as I can see, you cannot do reliable message handling with Mosquitto. If your broker crashes or your client crashes, Mosquitto will lose your messages. Other than that, if all you require is reliable delivery of messages to the client, you’re good to go.

So what are the alternatives? At this point, I have to be honest and say: I don’t know yet. I’m personally looking for a lightweight solution, and it seems none of the lightweight Message Queues do reliable message handling (as opposed to reliable message delivery, which most do just fine).

When I find an answer, I’ll let you know here.

Exploring UPnP with Python

admin — Tue, 05 Jul 2016 19:56:07 +0000

UPnP stands for Universal Plug and Play. It’s a standard for discovering and interacting with services offered by various devices on a network. Common examples include:

Discovering, listing and streaming media from media servers
Controlling home network routers: e.g. automatic configuration of port forwarding to an internal device such as your PlayStation or Xbox.

In this article we’ll explore the client side (usually referred to as the Control Point side) of UPnP using Python. I’ll explain the different protocols used in UPnP and show how to write some basic Python code to discover and interact with devices. There’s lots of information on UPnP on the Internet, but a lot of it is fragmented, discusses only certain aspects of UPnP or is vague on whether we’re dealing with the client or a server. The UPnP standard itself is quite an easy read though.

Disclaimer: The code in this article is rather hacky and not particularly robust. Do not use it as a basis for any real projects.

Protocols

UPnP uses a variety of different protocols to accomplish its goals:

SSDP: Simple Service Discovery Protocol, for discovering UPnP devices on the local network.
SCPD: Service Control Point Definition, for defining the actions offered by the various services.
SOAP: Simple Object Access Protocol, for actually calling actions.

Here’s a schematic overview of the flow of a UPnP session and where the different protocols come into play.

The standard flow of operations in UPnP is to first use SSDP to discover which UPnP devices are available on the network. Those devices return the location of an XML file which defines the various services offered by each device. Next we use SCPD on each service to discover the various actions offered by each service. Essentially, SCPD is an XML-based protocol which describes SOAP APIs, much like WSDL. Finally we use SOAP calls to interact with the services.

SSDP: Service Discovery

Let’s take a closer look at SSDP, the Simple Service Discovery Protocol. SSDP operates over UDP rather than TCP. While TCP is a stateful protocol, meaning both end-points of the connection are aware of whom they’re talking to, UDP is stateless. This means we can just throw UDP packets over the line, and we don’t care much whether they are received properly or even received at all. UDP is often used in situations where missing a few packets is not a problem, such as streaming media.

SSDP uses HTTP over UDP (called HTTPU) in multicast mode. This allows all UPnP devices on the network to receive the requests regardless of whether we know where they are located. Here’s a very simple example of how to perform an HTTPU query using Python:

import socket

msg = \
    'M-SEARCH * HTTP/1.1\r\n' \
    'HOST:239.255.255.250:1900\r\n' \
    'ST:upnp:rootdevice\r\n' \
    'MX:2\r\n' \
    'MAN:"ssdp:discover"\r\n' \
    '\r\n'

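# Set up UDP socket
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
s.settimeout(2)
# Sockets send bytes, so encode the message first
s.sendto(msg.encode('ascii'), ('239.255.255.250', 1900))

try:
    # Keep reading replies until none arrive within the timeout
    while True:
        data, addr = s.recvfrom(65507)
        print(addr, data.decode('utf-8', 'replace'))
except socket.timeout:
    pass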

This little snippet of code creates an HTTP message using the M-SEARCH HTTP method, which is specific to UPnP. It then sets up a UDP socket, and sends out the HTTPU message to IP address 239.255.255.250, port 1900. That IP is a special multicast IP address. It is not actually tied to any specific server, like normal IPs. Port 1900 is the one which UPnP servers will listen on for these messages.

Next, we listen on the socket for any replies. The socket has a timeout of 2 seconds. This means that if no data arrives on the socket for two seconds, the s.recvfrom() call times out, which raises an exception. The exception is caught, and the program continues.

You will recall that we don’t know how many devices might be on the network. We also don’t know where they are nor do we have any idea how fast they will respond. This means we can’t be certain about the number of seconds we must wait for replies. This is the reason why so many UPnP control points (clients) are so slow when they scan for devices on the network.

In general all devices should be able to respond in less than 2 seconds. It seems that manufacturers would rather be on the safe side and sometimes wait up to 10 seconds for replies. A better approach would be to cache previously found devices and immediately check their availability upon startup. A full device search could then be done asynchronously in the background. Then again, many UPnP devices set the cache validity timeout extremely low, so clients (if they properly implement the standard) are forced to rediscover them every time.
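The caching idea could be sketched as follows. SSDP replies advertise their validity in a CACHE-CONTROL: max-age=... header; everything else here (the class, its method names and the fallback of 1800 seconds) is an assumption made for illustration. The cache is in-memory only, so persisting it between program runs is left out.

```python
import re
import time

def max_age_seconds(cache_control, default=1800):
    """Extract max-age from a CACHE-CONTROL value such as 'max-age=120'."""
    match = re.search(r"max-age\s*=\s*(\d+)", cache_control or "", re.IGNORECASE)
    return int(match.group(1)) if match else default

class DeviceCache:
    """Remember device locations until their advertised validity runs out."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # USN -> (location, expiry timestamp)

    def add(self, usn, location, cache_control=None):
        """Store or refresh a device seen in a discovery reply."""
        expires = self._clock() + max_age_seconds(cache_control)
        self._entries[usn] = (location, expires)

    def fresh_locations(self):
        """Return locations whose validity has not expired yet."""
        now = self._clock()
        return [loc for loc, expires in self._entries.values() if expires > now]
```

A control point could try the fresh locations first and only fall back to a full M-SEARCH for everything else. With devices that advertise a very low max-age, the cache will of course be empty most of the time.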

Anyway, here’s the output of the M-SEARCH on my home network. I’ve stripped some of the headers for brevity:

('192.168.0.1', 1900) HTTP/1.1 200 OK
USN: uuid:2b2561a3-a6c3-4506-a4ae-247efe0defec::upnp:rootdevice
SERVER: Linux/2.6.18_pro500 UPnP/1.0 MiniUPnPd/1.5
LOCATION: http://192.168.0.1:40833/rootDesc.xml

('192.168.0.2', 53375) HTTP/1.1 200 OK
LOCATION: http://192.168.0.2:1025/description.xml
SERVER: Linux/2.6.35-31-generic, UPnP/1.0, Free UPnP Entertainment Service/0.655
USN: uuid:60c251f1-51c6-46ae-93dd-0a3fb55a316d::upnp:rootdevice

Two devices responded to our M-SEARCH query within the specified number of seconds. One is a cable internet router, the other is Fuppes, a UPnP media server. The most interesting things in these replies are the LOCATION headers, which point us to an SCPD XML file: http://192.168.0.1:40833/rootDesc.xml.
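To get at the LOCATION header programmatically, the raw reply has to be split into headers first. A minimal sketch, assuming the reply is a plain HTTP-style header block as in the output above; parse_ssdp_response is a made-up helper name.

```python
def parse_ssdp_response(data):
    """Split a raw SSDP reply into its status line and a dict of headers."""
    text = data.decode("utf-8", errors="replace")
    lines = text.split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:
            # Header names are case-insensitive, so normalise them.
            headers[name.strip().upper()] = value.strip()
    return status_line, headers

reply = (
    b"HTTP/1.1 200 OK\r\n"
    b"USN: uuid:2b2561a3-a6c3-4506-a4ae-247efe0defec::upnp:rootdevice\r\n"
    b"SERVER: Linux/2.6.18_pro500 UPnP/1.0 MiniUPnPd/1.5\r\n"
    b"LOCATION: http://192.168.0.1:40833/rootDesc.xml\r\n"
    b"\r\n"
)
status, headers = parse_ssdp_response(reply)
print(headers["LOCATION"])
```

Only the first colon separates the name from the value, which matters because both the USN and the LOCATION values contain colons themselves.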

SCPD, Phase I: Fetching and parsing the root SCPD file

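The SCPD XML file (http://192.168.0.1:40833/rootDesc.xml) contains information on the UPnP server such as the manufacturer, the services offered by the device, etc. The XML file is rather big and complicated. You can see the full version, but here’s a greatly reduced one from my router:

<?xml version="1.0"?>
<root xmlns="urn:schemas-upnp-org:device-1-0">
  <device>
    <deviceType>urn:schemas-upnp-org:device:InternetGatewayDevice:1</deviceType>
    <friendlyName>Ubee EVW3226</friendlyName>
    <serviceList>
      <service>
        <serviceType>urn:schemas-upnp-org:service:Layer3Forwarding:1</serviceType>
        <controlURL>/ctl/L3F</controlURL>
        <eventSubURL>/evt/L3F</eventSubURL>
        <SCPDURL>/L3F.xml</SCPDURL>
      </service>
    </serviceList>
    <deviceList>
      <device>
        <deviceType>urn:schemas-upnp-org:device:WANDevice:1</deviceType>
        <friendlyName>WANDevice</friendlyName>
        <serviceList>
          <service>
            <serviceType>urn:schemas-upnp-org:service:WANCommonInterfaceConfig:1</serviceType>
            <serviceId>urn:upnp-org:serviceId:WANCommonIFC1</serviceId>
            <controlURL>/ctl/CmnIfCfg</controlURL>
            <eventSubURL>/evt/CmnIfCfg</eventSubURL>
            <SCPDURL>/WANCfg.xml</SCPDURL>
          </service>
        </serviceList>
        <deviceList>
          <device>
            <deviceType>urn:schemas-upnp-org:device:WANConnectionDevice:1</deviceType>
            <friendlyName>WANConnectionDevice</friendlyName>
            <serviceList>
              <service>
                <serviceType>urn:schemas-upnp-org:service:WANIPConnection:1</serviceType>
                <controlURL>/ctl/IPConn</controlURL>
                <eventSubURL>/evt/IPConn</eventSubURL>
                <SCPDURL>/WANIPCn.xml</SCPDURL>
              </service>
            </serviceList>
          </device>
        </deviceList>
      </device>
    </deviceList>
  </device>
</root>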

It consists of basically three important things:

The URLBase
Virtual Devices
Services

URLBase

Not all SCPD XML files contain an URLBase (the one above from my router doesn’t), but if they do, it looks like this:

<URLBase>http://192.168.1.254:80</URLBase>

This is the base URL for the SOAP requests. If the SCPD XML does not contain an URLBase element, the LOCATION header from the server’s discovery response may be used as the base URL. Any paths should be stripped off, leaving only the protocol, IP and port. In the case of my internet router that would be: http://192.168.0.1:40833/

Devices

The XML file then specifies devices, which are virtual devices that the physical device contains. These devices can contain a list of services in the <serviceList> tag. A list of sub-devices can be found in the <deviceList> tag. The devices in the <deviceList> can themselves contain a list of services and devices. Thus, devices can recursively contain sub-devices, as shown in the following diagram:

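As you can see, a virtual Device can contain a Device List, which can contain a virtual Device, etc. We are most interested in the <service> elements from the <serviceList>. They look like this:

<service>
  <serviceType>urn:schemas-upnp-org:service:WANCommonInterfaceConfig:1</serviceType>
  <serviceId>urn:upnp-org:serviceId:WANCommonIFC1</serviceId>
  <controlURL>/ctl/CmnIfCfg</controlURL>
  <eventSubURL>/evt/CmnIfCfg</eventSubURL>
  <SCPDURL>/WANCfg.xml</SCPDURL>
</service>
...
<service>
  <serviceType>urn:schemas-upnp-org:service:WANIPConnection:1</serviceType>
  <controlURL>/ctl/IPConn</controlURL>
  <eventSubURL>/evt/IPConn</eventSubURL>
  <SCPDURL>/WANIPCn.xml</SCPDURL>
</service>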

The <URLBase> in combination with the <controlURL> gives us the URL to the SOAP server where we can send our requests. The <URLBase> in combination with the <SCPDURL> points us to a SCPD (Service Control Point Definition) XML file which contains a description of the SOAP calls.

The following Python code extracts the URLBase, ControlURL and SCPDURL information:

import urllib.request
import urllib.parse
from xml.dom import minidom

def XMLGetNodeText(node):
    """
    Return text contents of an XML node.
    """
    text = []
    for childNode in node.childNodes:
        if childNode.nodeType == node.TEXT_NODE:
            text.append(childNode.data)
    return ''.join(text)

location = 'http://192.168.0.1:40833/rootDesc.xml'

# Fetch SCPD
response = urllib.request.urlopen(location)
root_xml = minidom.parseString(response.read())
response.close()

# Construct BaseURL
base_url_elem = root_xml.getElementsByTagName('URLBase')
if base_url_elem:
    base_url = XMLGetNodeText(base_url_elem[0]).rstrip('/')
else:
    # No URLBase: fall back to scheme, host and port of the LOCATION URL
    url = urllib.parse.urlparse(location)
    base_url = '%s://%s' % (url.scheme, url.netloc)

# Output Service info
for node in root_xml.getElementsByTagName('service'):
    service_type = XMLGetNodeText(node.getElementsByTagName('serviceType')[0])
    control_url = '%s%s' % (
        base_url,
        XMLGetNodeText(node.getElementsByTagName('controlURL')[0])
    )
    scpd_url = '%s%s' % (
        base_url,
        XMLGetNodeText(node.getElementsByTagName('SCPDURL')[0])
    )
    print('%s:\n  SCPD_URL: %s\n  CTRL_URL: %s\n' % (service_type,
                                                      scpd_url,
                                                      control_url))

Output:

urn:schemas-upnp-org:service:Layer3Forwarding:1:
  SCPD_URL: http://192.168.0.1:40833/L3F.xml
  CTRL_URL: http://192.168.0.1:40833/ctl/L3F

urn:schemas-upnp-org:service:WANCommonInterfaceConfig:1:
  SCPD_URL: http://192.168.0.1:40833/WANCfg.xml
  CTRL_URL: http://192.168.0.1:40833/ctl/CmnIfCfg

urn:schemas-upnp-org:service:WANIPConnection:1:
  SCPD_URL: http://192.168.0.1:40833/WANIPCn.xml
  CTRL_URL: http://192.168.0.1:40833/ctl/IPConn
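The code above simply collects every service element regardless of where it sits in the tree. If you also want to know which virtual device a service belongs to, the recursive structure can be walked explicitly. Here is a minimal sketch using ElementTree instead of minidom. It assumes the description uses the standard urn:schemas-upnp-org:device-1-0 namespace; walk_device and the trimmed-down sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"d": "urn:schemas-upnp-org:device-1-0"}

def walk_device(device, depth=0):
    """Yield (depth, device type, [service types]) for a device and its sub-devices."""
    device_type = device.findtext("d:deviceType", default="", namespaces=NS)
    services = [
        service.findtext("d:serviceType", default="", namespaces=NS)
        for service in device.findall("d:serviceList/d:service", NS)
    ]
    yield depth, device_type, services
    # Sub-devices live in <deviceList> and may contain further devices.
    for child in device.findall("d:deviceList/d:device", NS):
        yield from walk_device(child, depth + 1)

SAMPLE = """<?xml version="1.0"?>
<root xmlns="urn:schemas-upnp-org:device-1-0">
  <device>
    <deviceType>urn:schemas-upnp-org:device:InternetGatewayDevice:1</deviceType>
    <serviceList>
      <service><serviceType>urn:schemas-upnp-org:service:Layer3Forwarding:1</serviceType></service>
    </serviceList>
    <deviceList>
      <device>
        <deviceType>urn:schemas-upnp-org:device:WANDevice:1</deviceType>
        <deviceList>
          <device>
            <deviceType>urn:schemas-upnp-org:device:WANConnectionDevice:1</deviceType>
            <serviceList>
              <service><serviceType>urn:schemas-upnp-org:service:WANIPConnection:1</serviceType></service>
            </serviceList>
          </device>
        </deviceList>
      </device>
    </deviceList>
  </device>
</root>"""

root = ET.fromstring(SAMPLE)
for depth, device_type, services in walk_device(root.find("d:device", NS)):
    print("  " * depth + device_type, services)
```

The indentation of the output mirrors the nesting of the devices.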

SCPD, Phase II: Service SCPD files

Let’s look at the WANIPConnection service. We have an SCPD XML file for it at http://192.168.0.1:40833/WANIPCn.xml and a SOAP URL at http://192.168.0.1:40833/ctl/IPConn. We must find out which SOAP calls we can make, and which parameters they take. Normally SOAP would use a WSDL file to define its API. With UPnP however this information is contained in the SCPD XML file for the service. Here’s an example of the full version of the WANIPCn.xml file. There are two interesting things in the XML file:

The <actionList> element contains a list of actions understood by the SOAP server.
The <serviceStateTable> element contains metadata about the arguments we can send to SOAP actions, such as the type and allowed values.

ActionList

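The <actionList> tag contains a list of actions understood by the SOAP server. It looks like this:

<actionList>
  <action>
    <name>SetConnectionType</name>
    <argumentList>
      <argument>
        <name>NewConnectionType</name>
        <direction>in</direction>
        <relatedStateVariable>ConnectionType</relatedStateVariable>
      </argument>
    </argumentList>
  </action>
  <action>
    [... etc ...]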

In this example, we discover an action called SetConnectionType. It takes one incoming argument: NewConnectionType. The relatedStateVariable specifies which StateVariable this argument should adhere to.

serviceStateTable

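Looking at the <serviceStateTable> section later on in the XML file, we see:

<serviceStateTable>
  <stateVariable>
    <name>ConnectionType</name>
    <dataType>string</dataType>
  </stateVariable>
  <stateVariable>
  [... etc ...]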

From this we conclude that the SetConnectionType SOAP call takes an argument named “NewConnectionType”, whose value must be of type “string”, as defined by the ConnectionType state variable.
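Matching arguments to their state variables by hand gets tedious quickly. The lookup described above can be sketched like this, assuming the service description uses the standard urn:schemas-upnp-org:service-1-0 namespace. The describe_actions helper and the cut-down sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"s": "urn:schemas-upnp-org:service-1-0"}

def describe_actions(scpd_xml):
    """Map each action name to a list of (argument, direction, data type)."""
    root = ET.fromstring(scpd_xml)
    # Lookup table: state variable name -> data type.
    types = {
        var.findtext("s:name", namespaces=NS): var.findtext("s:dataType", namespaces=NS)
        for var in root.findall("s:serviceStateTable/s:stateVariable", NS)
    }
    actions = {}
    for action in root.findall("s:actionList/s:action", NS):
        args = []
        for arg in action.findall("s:argumentList/s:argument", NS):
            related = arg.findtext("s:relatedStateVariable", namespaces=NS)
            args.append((
                arg.findtext("s:name", namespaces=NS),
                arg.findtext("s:direction", namespaces=NS),
                types.get(related),  # None if the state variable is unknown
            ))
        actions[action.findtext("s:name", namespaces=NS)] = args
    return actions

SAMPLE = """<?xml version="1.0"?>
<scpd xmlns="urn:schemas-upnp-org:service-1-0">
  <actionList>
    <action>
      <name>SetConnectionType</name>
      <argumentList>
        <argument>
          <name>NewConnectionType</name>
          <direction>in</direction>
          <relatedStateVariable>ConnectionType</relatedStateVariable>
        </argument>
      </argumentList>
    </action>
  </actionList>
  <serviceStateTable>
    <stateVariable>
      <name>ConnectionType</name>
      <dataType>string</dataType>
    </stateVariable>
  </serviceStateTable>
</scpd>"""

print(describe_actions(SAMPLE))
```

Each action ends up with a list of its arguments, the direction of each argument and the data type taken from the related state variable.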

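Another example is the GetExternalIPAddress action. It takes no incoming arguments, but does return a value with the name “NewExternalIPAddress”. The action will return the external IP address of your router. That is, the IP address you use to connect to the internet.

<action>
  <name>GetExternalIPAddress</name>
  <argumentList>
    <argument>
      <name>NewExternalIPAddress</name>
      <direction>out</direction>
      <relatedStateVariable>ExternalIPAddress</relatedStateVariable>
    </argument>
  </argumentList>
</action>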

Let’s make a SOAP call to that action and find out what our external IP is.

SOAP: Calling an action

Normally we would use a SOAP library to create a call to a SOAP service. In this article I’m going to cheat a little and build a SOAP request from scratch.

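import urllib.request

soap_encoding = "http://schemas.xmlsoap.org/soap/encoding/"
soap_env = "http://schemas.xmlsoap.org/soap/envelope/"
service_ns = "urn:schemas-upnp-org:service:WANIPConnection:1"
soap_body = """<?xml version="1.0"?>
<s:Envelope xmlns:s="%s" s:encodingStyle="%s">
  <s:Body>
    <u:GetExternalIPAddress xmlns:u="%s">
    </u:GetExternalIPAddress>
  </s:Body>
</s:Envelope>
""" % (soap_env, soap_encoding, service_ns)
# urllib expects the request body as bytes
soap_body = soap_body.encode('utf-8')

soap_action = "urn:schemas-upnp-org:service:WANIPConnection:1#GetExternalIPAddress"
headers = {
    'SOAPAction': '"%s"' % (soap_action),
    'Host': '192.168.0.1:40833',
    'Content-Type': 'text/xml',
    'Content-Length': str(len(soap_body)),
}

ctrl_url = "http://192.168.0.1:40833/ctl/IPConn"

# Passing a body makes this a POST request
request = urllib.request.Request(ctrl_url, soap_body, headers)
response = urllib.request.urlopen(request)

print(response.read().decode('utf-8'))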

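The SOAP server returns a response with our external IP in it. I’ve pretty-printed it for your convenience and removed some XML namespaces for brevity:

<?xml version="1.0"?>
<s:Envelope>
  <s:Body>
    <u:GetExternalIPAddressResponse>
      <NewExternalIPAddress>212.100.28.66</NewExternalIPAddress>
    </u:GetExternalIPAddressResponse>
  </s:Body>
</s:Envelope>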

We can now put the response through an XML parser and combine it with the SCPD XML’s <actionList> and <serviceStateTable> to figure out which output parameters we can expect and what type they are. Doing this is beyond the scope of this article, since it’s rather straight-forward yet takes a reasonable amount of code. Suffice it to say that our external IP is 212.100.28.66.
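Just pulling the output values out of the response, without checking them against the SCPD, takes only a few lines. A minimal sketch, assuming the response is a regular SOAP 1.1 envelope whose body holds a single response element with un-namespaced output arguments. The parse_soap_response helper is made up, and the sample puts back the namespaces that were trimmed from the output above.

```python
import xml.etree.ElementTree as ET

SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def parse_soap_response(xml_text):
    """Return the output arguments of a UPnP SOAP response as a dict."""
    root = ET.fromstring(xml_text)
    body = root.find("{%s}Body" % SOAP_ENV_NS)
    if body is None or len(body) == 0:
        raise ValueError("no SOAP body found")
    response = body[0]  # e.g. <u:GetExternalIPAddressResponse>
    # Each child of the response element is one output argument.
    return {child.tag: (child.text or "") for child in response}

SAMPLE = """<?xml version="1.0"?>
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/"
            s:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <s:Body>
    <u:GetExternalIPAddressResponse xmlns:u="urn:schemas-upnp-org:service:WANIPConnection:1">
      <NewExternalIPAddress>212.100.28.66</NewExternalIPAddress>
    </u:GetExternalIPAddressResponse>
  </s:Body>
</s:Envelope>"""

print(parse_soap_response(SAMPLE)["NewExternalIPAddress"])
```

All values come back as strings. Converting them to the proper type is where the serviceStateTable from the SCPD would come in.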

Summary

To summarise, these are the steps we take to actually do something useful with a UPnP service:

Broadcast a HTTP-over-UDP (HTTPU) message to the network asking for UPnP devices to respond.
Listen for incoming UDP replies and extract the LOCATION header.
Send an HTTP GET request to fetch a SCPD XML file from the LOCATION.
Extract services and/or devices from the SCPD XML file.
1. For each service, extract the Control and SCPD URLs.
2. Combine the BaseURL (or if it was not present in the SCPD XML, use the LOCATION header) with the Control and SCPD URLs.
Send an HTTP GET request to fetch the service’s SCPD XML file that describes the actions it supports.
Send a SOAP POST request to the service’s Control URL to call one of the actions that it supports.
Receive and parse reply.
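The SOAP POST step from the list above can be generalised so it works for any action, not just GetExternalIPAddress. This is a rough sketch: build_soap_request is a made-up helper, the s: and u: prefixes are a common convention rather than a requirement, and argument values are simply converted to strings.

```python
from xml.sax.saxutils import escape

SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"
SOAP_ENC_NS = "http://schemas.xmlsoap.org/soap/encoding/"

def build_soap_request(service_type, action, arguments=None):
    """Return (headers, body) for calling `action` on a UPnP service."""
    # One child element per incoming argument, with the value XML-escaped.
    args_xml = "".join(
        "<%s>%s</%s>" % (name, escape(str(value)), name)
        for name, value in (arguments or {}).items()
    )
    body = (
        '<?xml version="1.0"?>'
        '<s:Envelope xmlns:s="%s" s:encodingStyle="%s">'
        '<s:Body><u:%s xmlns:u="%s">%s</u:%s></s:Body>'
        '</s:Envelope>'
    ) % (SOAP_ENV_NS, SOAP_ENC_NS, action, service_type, args_xml, action)
    headers = {
        "SOAPAction": '"%s#%s"' % (service_type, action),
        "Content-Type": 'text/xml; charset="utf-8"',
    }
    # The body is returned as bytes, ready to be sent in a POST request.
    return headers, body.encode("utf-8")

headers, body = build_soap_request(
    "urn:schemas-upnp-org:service:WANIPConnection:1", "GetExternalIPAddress"
)
print(headers["SOAPAction"])
```

The returned headers and body can be POSTed to the service’s Control URL with any HTTP client.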

An example with Requests on the left and Responses on the right. Like all other examples in this article, the XML has been heavily stripped of redundant or unimportant information:

Conclusion

I underwent this whole journey of UPnP because I wanted a way to transparently support connections from external networks to my locally-running application. While UPnP allows me to do that, I feel that UPnP is needlessly complicated. The standard, while readable, feels like it’s designed by committee. The indirectness of having to fetch multiple SCPD files, the use of non-standard protocols, the nestable virtual sub-devices… it all feels slightly unnecessary. Then again, it could be a lot worse. One only needs to take a quick look at SAML v2 to see that UPnP isn’t all that bad.

All in all, it let me do what I needed, and it didn’t take too long to figure out how it worked. As a kind of exercise I partially implemented a high-level, simple-to-use UPnP client for Python, which is available on Github. Take a look at the source for more insights on how to deal with UPnP.

Ansible-cmdb v1.14: Generate a host overview of Ansible facts.

admin — Tue, 26 Apr 2016 13:30:58 +0000

I’ve just released ansible-cmdb v1.14. Ansible-cmdb takes the output of Ansible’s fact gathering and converts it into a static HTML overview page containing system configuration information. It supports multiple templates and extending information gathered by Ansible with custom data.

This release includes the following bugfixes and feature improvements:

Look for ansible.cfg and use hostfile setting.
html_fancy: Properly sort vcpu and ram columns.
html_fancy: Remember which columns the user has toggled.
html_fancy: display groups and hostvars even if no host information was collected.
html_fancy: support for facter and custom facts.
html_fancy: Hide sections if there are no items for it.
html_fancy: Improvements in the rendering of custom variables.
Apply Dynamic Inventory vars to all hostnames in the group.
Many minor bugfixes.

As always, packages are available for Debian, Ubuntu, Red Hat, CentOS and other systems. Get the new release from the Github releases page.

mdpreview, a Markdown previewer to be used with an external editor

admin — Fri, 04 Mar 2016 15:43:30 +0000

There are many Markdown previewers out there, from the simplest commandline tool + webbrowser to full-fledged Markdown IDEs. I’ve tried quite a few, and I like none of them. I write my Markdown in an external editor (Vim), something very few Markdown previewers take into account. The ones that do are buggy. So I wrote mdpreview, a standalone Markdown previewer for Linux that works great with an external editor such as Vim. The main selling points:

Automatic reload when your Markdown file changes. Unlike many other previewers, it remembers your scroll position during reload and doesn’t put you back at the top.
Themes that closely resemble Github and Bitbucket, so you actually know what it’s going to look like when published. There are also some additional themes that are nice on the eyes (solarized).
An option to set Keep-on-top window hinting, so the previewer always stays on top of other windows.
Vi motion keys (j, k, G, g)
Append detection. If the end of the document is being viewed and new content is appended, mdpreview automatically scrolls to the bottom.
mdpreview remembers your window size and position. A very basic feature you’d think most previewers would support, but don’t.

A feature to automatically scroll to the last made change in the Markdown file is currently being implemented.

Here’s mdpreview running the Solarized theme:

The Github theme:

And the BitBucket theme:

More information and installation instructions are available on the Github page.

Manually scrolling a Python GTK Webview

admin — Wed, 02 Mar 2016 21:06:11 +0000

I was trying to manually scroll a (Python) GTK embedded Webview in order to position the webview back to where it was after setting new contents with webview.load_html_string(html, 'file:///'). I couldn’t get it to work, and Google wasn’t of much help either.

I could scroll the Webview just fine from a key-press-event handler on the main window like this:

def __init__(self):
    # -- Removed some code here for brevity --
    self.scroll_window = gtk.ScrolledWindow(None, None)
    self.scroll_window.add(self.webview)
    self.win_main.connect("key-press-event", self.ev_key_press_event)

def ev_key_press_event(self, widget, ev):
    if ev.keyval == gtk.keysyms.t:
        self.scroll_window.get_vadjustment().set_value(100)

But automatically scrolling when something happened to the webview (in my case new content being set via webview.load_html_string()) didn’t work.

It turns out that the webview is still handling events and won’t allow scrolling using scroll_window.get_vadjustment().set_value() until all the events are handled.

You can manually handle all the pending GTK events before starting scrolling like this:

def __init__(self):
    # -- Removed some code here for brevity --
    self.scroll_window = gtk.ScrolledWindow(None, None)
    self.scroll_window.add(self.webview)
    self.webview.connect('notify::load-status', self.ev_load_status)

def ev_load_status(self, webview, load_status):
    if self.webview.get_load_status() == webkit.LOAD_FINISHED:
        while gtk.events_pending():
            gtk.main_iteration_do()

        self.scroll_window.get_vadjustment().set_value(100)

The solution above works for me on both initial load of a document and subsequent changing of the webkit contents using webview.load_html_string().