URL library for Python

Posted on Fri 27 November 2015 in Code • Tagged with Python, URL, furl

Python has many batteries included, but a few things are still conspicuously missing.

One of them is a standardized and convenient approach to URL manipulation, akin to the URI class in Java. There are some functions in urllib, of course (or urllib.parse in Python 3), but much like their HTTP-related comrades, they prove rather verbose and somewhat clunky.

HTTP, however, is solved by the Requests package, so you may wonder if there is some analogous package for URL operations. The answer is affirmative, and the library in question is, quite whimsically, called furl.

URL in a wrap

The sole interesting part of the furl interface is the furl class. It represents a single URL, broken down into its constituents, with properties and methods for both reading them out and replacing them with new values.

Thanks to this handy (and quite obvious) abstraction, common URL operations become quite simple and self-documenting:

from furl import furl

def to_absolute(url, base):
    """If given ``url`` is a relative path,
    make it relative to the ``base``.
    furled = furl(url)
    if not furled.scheme:
        return furl(base).join(url).url
    return url

def is_same_origin(*urls):
    """Check whether URLs are from the same origin (host:port)."""
    origins = set(url.netloc for url in map(furl, urls))
    return len(origins) <= 1

def get_facebook_username(profile_url):
    """Get Facebook user ID from their profile URL."""
    furled = furl(profile_url)
    if not (furled.host == 'facebook.com' or
            furled.host.endswith('.facebook.com')):
        raise ValueError("not a Facebook URL: %s" % (profile_url,))
    return furled.path.segments[-1]

# etc.

The same goes for the extremely prevalent, yet very harmful, pattern of building URLs through string interpolation:

url = '%s?%s' % (BASE_URL, urlencode(query_params))

Besides looking unpythonically ugly, it’s also inflexible and error-prone. If BASE_URL gains some innate query string params ('http://example.com/?a=b'), this method will start producing completely invalid URLs (with two question marks, e.g. 'http://example.com/?a=b?foo=bar').

The equivalent in furl has none of these flaws:

url = furl(BASE_URL).add(query_params).url
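
If BASE_URL already carries a query string, furl will simply merge the new parameters in rather than producing a malformed URL. A quick illustrative sketch (exact parameter ordering may differ):

>>> from furl import furl
>>> furl('http://example.com/?a=b').add({'foo': 'bar'}).url
'http://example.com/?a=b&foo=bar'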

The full package

To see the full power of furl, I recommend having a look at its API documentation. It’s quite clear and should be very easy to use.

let: binding for Knockout

Posted on Wed 18 November 2015 in Code • Tagged with Knockout, JavaScript, web frontend, data binding

Knockout is a JavaScript “framework” that has always lurked in the shadows of other, more flashy ones. Even in the days of its relative novelty, the likes of Backbone or Ember had seemed to garner more widespread interest among frontend developers. Today, this hasn’t changed much, the difference being mainly that the spotlight is now taken by new actors (*cough* React *cough*).

But Knockout has a loyal following, and for good reasons. Possibly the most important one is why I’ve put the word “framework” in quotes. Knockout is, first and foremost, a data binding library: it doesn’t do much else besides tying DOM nodes to JavaScript objects.

This quality makes it both easy to learn and simple to integrate. In fact, it can very well live in just some small compartments of your web application, mingling easily with any server-side templating mechanism you might be using. It also interplays effortlessly with other JS libraries, and sticks very well to whatever duct tape you use to hold your frontend stack together.

Lastly, it’s also quite extensible. We can, for example, create our own bindings rather easily, extending the declarative language used to describe the relationship between data and UI.

In this post, I’m going to demonstrate this by implementing a very simple let: binding — a kind of “assignment” of an expression to a name.

From Knockout with: bluff

Out of the box, Knockout proffers the with: binding, a quite similar mechanism. The way it can be problematic is analogous to the widely discouraged with statement in JavaScript itself. Namely, it blends several namespaces together, making it harder to determine which object is being referred to. As a result, the code is more prone to errors.

On the other hand, freeing the developer from repeating long and complicated expressions is obviously valuable. Perhaps reducing them to nil is not the right approach, though, so how about we just shorten them to a more manageable length? Well, that’s exactly what the let: binding is meant to do:

<div data-bind="let: { addr: currentUser.personalInfo.address }">
  <p data-bind="text: addr.line1"></p>
  <!-- ko if: addr.line2 -->
    <p data-bind="text: addr.line2"></p>
  <!-- /ko -->
    <span data-bind="text: add.city"></span>,
    <span data-bind="text: add.region"></span>
  <p data-bind="text: add.country"></p>

Making it happen turns out to be pretty easy.

Binding contract

To define a Knockout binding, up to two things are needed. We have to specify what the library should do:

  • when the binding is first applied to a DOM node (the init method)
  • when any of the observed values changes (the update method)

Not every binding has to implement both methods. In our case, only the init is necessary, because all we have to do is modify the binding context.

What’s that? In short, a binding context is an object holding all the data you can potentially bind to your DOM nodes. Think of it as a namespace, or local scope: whatever’s in there can be used directly inside data-bind attributes.

let: it go

Therefore, all that the let: binding has to do is to extend the context with a mapping passed to it as an argument. Well, almost all:

ko.bindingHandlers['let'] = {
    init: function(element, valueAccessor, allBindings, viewModel, bindingContext) {
        var innerContext = bindingContext.extend(valueAccessor);
        ko.applyBindingsToDescendants(innerContext, element);
        return { controlsDescendantBindings: true };
    }
};

The resulting innerContext is a copy of the original bindingContext, augmented with additional properties from the argument of let: (those are available through valueAccessor). Once we have it, though, we need to handle it in a little special way.

Normally, Knockout processes all bindings recursively, passing down the same bindingContext (which ultimately comes from the root viewModel). But since we want to locally alter the context, we need to interrupt this regular descent and take care of the lower-level DOM nodes ourselves.

This is exactly what the overly-long ko.applyBindingsToDescendants function is doing. The only caveat is that Knockout has to be told explicitly about our intentions through the return value from init. Otherwise, it would try to apply the original bindingContext recursively, which in our case would amount to applying it twice. { controlsDescendantBindings: true } prevents Knockout from doing so erroneously.

Turn SQLAlchemy queries into literal SQL

Posted on Thu 12 November 2015 in Code • Tagged with SQLAlchemy, SQL, databases

Like I mentioned in the post about adding regular expression support, the nice thing about SQLAlchemy is the clean, layered structure of various abstractions it utilizes.

Regrettably, though, we know that software abstractions tend to leak. In particular, ORMs seem to be a poster child of this rule, especially in the popular opinion among developer crowds. By design, they’re intended to hide at least some details of the operations on a database, but those very details can be quite critical at times. There are situations when we simply want to know what exactly is going on, and how all those model classes and mappers and relationships translate to the actual SQL code.

To make it a little more concrete, let’s focus on the SQLAlchemy Query class. Given such a query, we’d like to get the final SQL representation of it: the one that’s ultimately sent to the database. It could be useful for any number of things, from logging1 to profiling, to displaying in the web page’s footer, or even just for prototyping in the Python REPL.

In other words, we want to turn this:

db_session.query(User.id).filter(User.email == some_email)

into something like this:

SELECT users.id FROM users WHERE users.email = :1

regardless of the complexity of the query, the number of model classes it spans, or the number of relationship-related JOINs it involves.

It’s a dialect

There is one aspect we cannot really universalize, though. It’s the specific database backend that SQLAlchemy should compile our query for. Shrugging off syntactical and semantical differences between database engines is one thing that using an ORM can potentially buy us, but if we want to get down to the SQL level, we need to be specific about it.

In SQLAlchemy’s parlance, any specific variant of the SQL language is called a dialect. Most of the time, you’ll be interested in the particular dialect your database of choice is using. This is easily obtainable from the database Session:

dialect = db_session.bind.dialect

The resulting Dialect object is little more than a container for small tidbits of information, used by SQLAlchemy to handle various quirks of the database backends it supports. For our purposes, though, it can be treated as a completely opaque token.

Compile & unwrap

With the Dialect in hand, we can invoke the query compiler to get the textual representation of our Query. Or, to be more precise, the compiled version of the query’s Select statement:

>>> query = db_session.query(User.id).filter(User.email == 'foo@example.com')
>>> print(query.statement.compile(dialect=db_session.bind.dialect))
SELECT users.id
FROM users
WHERE users.email = %(email_1)s

Perhaps unsurprisingly, even after compilation the result is still just another object: the Compiled one. As you can see, however, the actual SQL text is just its __str__ing representation, which we can print directly or obtain with str() or unicode().


But obviously, we don’t want to type the above incantations every time we need to take a peek at the generated SQL for an ORM query. Probably the best solution is to extend the Query class so that it offers this additional functionality under a new method:

from sqlalchemy.orm.query import Query as _Query

class Query(_Query):
    """Custom, enhanced SQLALchemy Query class."""

    def to_sql(self):
        """Return a literal SQL representation of the query."""
        dialect = self.session.bind.dialect
        return str(self.statement.compile(dialect=dialect))

This new Query class then needs to be passed as the query_cls argument to the constructor of Session. Details may vary a little bit depending on how exactly your application is set up, but in most cases it should be easy enough to figure out.
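
With a typical sessionmaker-based setup, for instance, the wiring could look roughly like this (just a sketch: the engine URL is a placeholder, and User is the model from the earlier example):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('postgresql:///mydb')
Session = sessionmaker(bind=engine, query_cls=Query)  # our enhanced Query class

db_session = Session()
query = db_session.query(User.id).filter(User.email == 'foo@example.com')
print(query.to_sql())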

  1. If you’re only interested in indiscriminate logging of all queries, setting the echo parameter in create_engine may be sufficient. Another alternative is to look directly at the logging configuration options for various parts of SQLAlchemy.

Celery task in a Flask request context

Posted on Tue 03 November 2015 in Code • Tagged with Celery, Flask, Python, queue, request context

Celery is an asynchronous task worker that’s frequently used for background processing in Python web apps. Rather than performing a time-consuming task within the request loop, we delegate it to a queue so that a worker process can pick it up when ready. The immediate benefit is much better latency for the end user. Pros also include easier scalability, since you can adjust the number of workers to your task load.

Examples of tasks that are usually best done in the background vary from fetching data through a third-party API to sending emails, and from pushing mobile notifications to pruning the database of stale records. Anything that may take more than a couple hundred milliseconds to complete — and isn’t absolutely essential to the current HTTP request — is typically a good candidate for asynchronous processing.

In Celery, workers are just regular Python processes, often running the exact same code that’s powering the app’s web frontend1. But unlike most of that code, they aren’t servicing any HTTP requests — they simply run some function with given arguments, both specified by whoever sent a task for execution. Indeed, those functions don’t even know who or what asked for them to be executed.

Neat, right? This is what we usually praise as decoupling, or separation of concerns. These are valuable qualities regardless of the UI and scaling benefits mentioned earlier.

Not quite web

But those qualities come with a trade-off. Task code is no longer web frontend code: it doesn’t run within the comfy environment of our web framework of choice. Losing it may be quite unnerving, actually, because in a typical web application there will be many things tied directly to the HTTP request pipeline. After all, this is what web applications do — respond to HTTP requests — so it often makes perfect sense e.g. to marry the request flow with database transactions, committing or rolling those back according to the HTTP status code that the app produced.

Tasks may also require a database, though, if only to assert the expected state of the world. The same goes for memcache, a Redis instance, or basically any resource used by the frontend code. Alas, it’s quite possible the very reason we delegate work to a task is to shift lengthy interactions with those external systems away from the UI. Obviously, we’re going to need them for that!

Fake a request

So one way or another, our tasks will most likely need some initialization and/or cleanup code. And since it’s probably the same code that most HTTP request handlers require and use already, why not just pretend we’re handling a request after all?

In Flask, we can pull that off rather easily. The test_request_context method is conveniently provided to allow for faking the request context — that is, an execution environment for HTTP handlers. Like the name suggests, it is used mostly for testing, but there is nothing stopping us from using it in tasks run by Celery.

We probably don’t want to call it directly, though. What would be better is to have Celery prepare the context first, and then run the task code as if it was an HTTP handler. For even better results, the context would preserve information extracted from the actual HTTP request, one that has sent the task for execution. Moving some work to the background would then be a trivial matter, for both the task and the original handler would operate within the same environment.

Convenient? I believe so. And as I’ll demonstrate next, it isn’t very complicated to implement either.

Wrap the decorator?

At least one piece of the solution should stand out as pretty obvious. Since our intention is to wrap the task’s code in some additional packaging — the request context — it seems fairly natural to write our own @task decorator:

import functools

from myapp import app, celery

def task(**kwargs):
    """Decorator function to apply to Celery tasks.
    Executes the actual task inside Flask's test request context.
    def decorator(func):
        """Actual decorator."""
        def wrapped(*args, **kwargs):
            with app.test_request_context():
                return func(*args, **kwargs)

        return wrapped

    return decorator

Here, app is the Flask application, and celery is the Celery object that’s often configured alongside it.

The Task class

While this technique will give us a request context inside the task code, it won’t be the request context from which the task has been sent for execution. To replicate that context correctly, we need additional support on the sender side.

Internally, Celery converts every function we annotate with the @task decorator into a subclass of celery.app.task.Task. This process can be customized, for example by providing an explicit base= parameter to @task which specifies a custom Task subclass to use in place of the default one. Within that subclass, we’re free to override any functionality that we need to.

In our case, we’ll scrape the current request context from Flask for all relevant information, and later use it to recreate the context in the task code.

But before we get to that, let’s just create the Task subclass and move the above execution logic to it. This way, users won’t have to use a completely new @task decorator which would needlessly couple them to a specific Celery app instance:

from celery import Task

class RequestContextTask(Task):
    """Base class for tasks that run inside a Flask request context."""
    abstract = True

    def __call__(self, *args, **kwargs):
        with app.test_request_context():
            return super(RequestContextTask, self).__call__(*args, **kwargs)

Instead, they can either set this class as base= for a specific task:

@celery.task(base=RequestContextTask)
def sync_user_data(user_id):
    # ...

or make it the new default for all their tasks:

celery = Celery(...)
celery.Task = RequestContextTask

Invocation patterns

When the frontend asks for a task to be executed, it most often uses the Task.delay method. It will package a payload containing the task arguments, and send it off through a broker — usually an AMQP-based queue, such as RabbitMQ — so that a Celery worker can pick it up and actually execute it.

But there are other means of task invocation. We can even run it “in place”, locally and synchronously, which is especially useful for various testing scenarios. Lastly, a task can also be retried from within its own code, terminating its current run and scheduling another attempt for some future date.

Obviously, for the RequestContextTask to be useful, it needs to behave correctly in every situation. Therefore we need to cover all the entry points I’ve mentioned — the asynchronous call, a synchronous invocation, and a task retry:

class RequestContextTask(Task):
    # ...

    def apply_async(self, args=None, kwargs=None, **rest):
        self._include_request_context(kwargs)
        return super(RequestContextTask, self) \
            .apply_async(args, kwargs, **rest)

    def apply(self, args=None, kwargs=None, **rest):
        self._include_request_context(kwargs)
        return super(RequestContextTask, self) \
            .apply(args, kwargs, **rest)

    def retry(self, args=None, kwargs=None, **rest):
        self._include_request_context(kwargs)
        return super(RequestContextTask, self) \
            .retry(args, kwargs, **rest)

Note that Task.apply_async is being called internally by Task.delay, so it’s only that first method that we have to override.
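
For illustration, dispatching the earlier sync_user_data task from within a Flask view could look like this (the route, response, and module layout are assumptions, not part of the actual solution):

from flask import jsonify

from myapp import app
from myapp.tasks import sync_user_data  # hypothetical module layout

@app.route('/users/<int:user_id>/sync', methods=['POST'])
def sync_user(user_id):
    # delay() goes through the overridden apply_async(), so the current
    # request context gets captured and shipped along with the task arguments
    sync_user_data.delay(user_id)
    return jsonify(queued=True), 202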

Context in a box

As you can deduce right away, the Flask-related magic is meant to go in the _include_request_context method. The idea is to prepare arguments for the eventual invocation of Flask.test_request_context, and pass them through an extra task parameter. Those arguments are relatively uncomplicated: they are just a medley of various pieces of information that we can easily obtain from Flask’s request object:

from flask import has_request_context, request

class RequestContextTask(Task):
    CONTEXT_ARG_NAME = '_flask_request_context'

    # ...

    def _include_request_context(self, kwargs):
        """Includes all the information about current HTTP request context
        as an additional argument to the task.
        """
        if not has_request_context():
            return

        context = {
            'path': request.path,
            'base_url': request.url_root,
            'method': request.method,
            'headers': dict(request.headers),
        }
        if '?' in request.url:
            context['query_string'] = request.url[(request.url.find('?') + 1):]

        kwargs[self.CONTEXT_ARG_NAME] = context

On the worker side, we simply unpack them and recreate the context:

from flask import make_response

class RequestContextTask(Task):
    # ...
    def __call__(self, *args, **kwargs):
        call = lambda: super(RequestContextTask, self).__call__(*args, **kwargs)

        context = kwargs.pop(self.CONTEXT_ARG_NAME, None)
        if context is None or has_request_context():
            return call()

        with app.test_request_context(**context):
            result = call()
            app.process_response(make_response(result or ''))

        return result

The only tricky part is calling Flask.process_response at the end. We need to do that for the @after_request hooks to execute correctly. This is quite crucial, because those hooks are where you’d normally put important cleanup code, like commit/rollback of the database transaction.
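
As a hypothetical illustration of such a hook (the db_session object and the commit policy are application-specific assumptions):

from myapp.database import db_session  # assumed SQLAlchemy session

@app.after_request
def finalize_transaction(response):
    """Commit or roll back the database transaction based on the response."""
    if response.status_code < 400:
        db_session.commit()
    else:
        db_session.rollback()
    return response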

Complete solution

To see how all those code snippets fit together, see this gist. You may also want to have a look at the article on Celery integration in Flask docs for some tips on how to integrate it with your own project.

  1. This isn’t strictly necessary, as Celery supports sending tasks for execution by explicit name. For that request to reach the worker, however, the task broker configuration must be correctly shared between sender and the worker. 

Actually, Python enums are pretty OK

Posted on Sun 25 October 2015 in Code • Tagged with Python, enums, serialization

A little over two years ago, I was writing about my skepticism towards the addition of enum types to Python. My stance has changed somewhat since then, and I now think there is some non-trivial value that enums can add to Python code. It becomes especially apparent in circumstances involving any kind of persistence, from pickling to databases.

I will elaborate on that through the rest of this post.


First, let’s recap (or perhaps introduce) the important facts about enums in Python. An enum, or an enumeration type, is a special class that inherits from enum.Enum1 and defines one or more constants:

from enum import Enum

class Cutlery(Enum):
    knife = 'knife'
    fork = 'fork'
    spoon = 'spoon'

Their values are then converted into singleton objects with distinct identity. They are available as attributes of the enumeration type:

>>> Cutlery.fork
<Cutlery.fork: 'fork'>

As a shorthand for when the values themselves are not important, the Enum class can be directly invoked with constant names as strings:

Cutlery = Enum('Cutlery', ['knife', 'fork', 'spoon'])

The resulting class offers the kind of useful API we’d expect from an enumeration type in any programming language:

>>> list(Cutlery)
[<Cutlery.knife: 'knife'>, <Cutlery.fork: 'fork'>, <Cutlery.spoon: 'spoon'>]
>>> [c.value for c in Cutlery]
['knife', 'fork', 'spoon']
>>> Cutlery.knife in Cutlery
True
>>> Cutlery('spoon')
<Cutlery.spoon: 'spoon'>
>>> Cutlery('spork')
Traceback (most recent call last):
# (...)
ValueError: 'spork' is not a valid Cutlery

How it was done previously

Of course, there is nothing really groundbreaking about assigning some values to “constants”. It’s been done like that since times immemorial, and quite often those constants have been grouped within finite sets.

Here’s an example of a comment model in some imaginary ORM, complete with a status “enumeration”:

class Comment(Model):
    # comment state constants
    APPROVED = 'approved'
    REJECTED = 'rejected'
    IN_REVIEW = 'in_review'
    STATES = frozenset([APPROVED, REJECTED, IN_REVIEW])

    author = StringProperty()
    text = TextProperty()
    state = StringProperty(choices=STATES)

# usage
comment = Comment(author="Xion", text="Boo")
comment.state = Comment.IN_REVIEW

Converting it to use an Enum is relatively straightforward:

class Comment(Model):
    class State(Enum):
        approved = 'approved'
        rejected = 'rejected'
        in_review = 'in_review'

    author = StringProperty()
    text = TextProperty()
    state = StringProperty(choices=[s.value for s in State])

comment = Comment(author="Xion", text="Meh.")
comment.state = Comment.State.approved.value

It is not apparent, though, if there is any immediate benefit of doing so. True, we no longer need to define the STATES set explicitly, but saving on this bit of boilerplate is balanced out by having to access the enum’s value when assigning it to a string property.

All in all, it seems like a wash — though at least we’ve established that enums are no worse than their alternatives :)

Enums are interoperable

Obviously, this doesn’t exactly sound like high praise. Thing is, we haven’t really replaced the previous solution completely. Remnants of it still linger in the way the state property is declared. Even though it’s supposed to hold enum constants, it is defined as string, which at this point is more of an implementation detail of how those constants are serialized.

What we really need here is a kind of EnumProperty that’d allow us to work solely with enum objects. Before the introduction of a standard Enum base, however, there was little incentive for ORMs and other similar libraries to provide such functionality. But now, it makes much more sense to support enums as first-class citizens, at least for data exchange and serialization, because users can be expected to already prefer the standard Enum in their own code.

Thus, the previous example changes into something along these lines:

class Comment(Model):
    # ...
    state = EnumProperty(State)

comment = Comment(author="Xion", text="Yay!")
comment.state = Comment.State.approved  # no .value!

Details of EnumProperty, or some equivalent construct, are of course specific to any given data management library. In SQLAlchemy, for example, a custom column type can handle the necessary serialization and deserialization between Python code and SQL queries, allowing you to define your models like this2:

class Comment(Model):
    class State(Enum):
        # ...

    author = Column(String(255))
    text = Column(Text)
    state = Column(EnumType(State, impl=String(32)))

# usage like above

In any case, the point is to have Python code operate solely on enum objects, while the persistence layer takes care of converting between them and their serializable values.
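
For the record, here is a minimal sketch of how such an EnumType could be implemented as a SQLAlchemy TypeDecorator. The name and signature mirror the example above, but are otherwise illustrative rather than any library's actual API:

from sqlalchemy.types import String, TypeDecorator

class EnumType(TypeDecorator):
    """Column type that persists Enum members by their values."""
    impl = String  # default underlying SQL type

    def __init__(self, enum_class, impl=None, *args, **kwargs):
        super(EnumType, self).__init__(*args, **kwargs)
        self.enum_class = enum_class
        if impl is not None:
            self.impl = impl  # e.g. String(32)

    def process_bind_param(self, value, dialect):
        # Python -> database: store the member's underlying value
        return value.value if value is not None else None

    def process_result_value(self, value, dialect):
        # database -> Python: recreate the Enum member from its value
        return self.enum_class(value) if value is not None else None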

Enums are extensible

The other advantage Enums have over loose collections of constants is that they are proper types. Like all user-defined types in Python, they can have additional methods and properties defined on their instances, or even on the type itself.

Although this capability is (by default) not as powerful as e.g. in Java — where each enum constant can override a method in its own way — it can nevertheless be quite convenient at times. Typical use cases include constant classification:

class Direction(Enum):
    left = 37
    up = 38
    right = 39
    down = 40

    def is_horizontal(self):
        return self in (Direction.left, Direction.right)

    def is_vertical(self):
        return self in (Direction.down, Direction.up)

and conversion:

    def as_vector(self):
        return {
            Direction.left: (-1, 0),
            Direction.up: (0, -1),
            Direction.right: (1, 0),
            Direction.down: (0, 1),
        }[self]

For the latter, it would be handy to have the Java’s ability to attach additional data to an enum constant. As it turns out, Python supports this feature natively in a very similar way. We simply have to override enum’s __new__ method to parse out any extra values from the initializer and turn them into attributes of the enum instance:

class Direction(Enum):
    left = 37, (-1, 0)
    up = 38, (0, -1)
    right = 39, (1, 0)
    down = 40, (0, 1)

    def __new__(cls, keycode, vector):
        obj = object.__new__(cls)
        obj._value_ = keycode
        obj.vector = vector
        return obj

It’s possible, in fact, to insert any arbitrary computation here that yields the final _value_ of an enum constant3. This trick can be used to, for example, construct enums that automatically number themselves.
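
The standard documentation has an AutoNumber example along these lines, where constants receive consecutive values based solely on their definition order:

class AutoNumber(Enum):
    def __new__(cls):
        obj = object.__new__(cls)
        # number the constants consecutively, starting from 1
        obj._value_ = len(cls.__members__) + 1
        return obj

class Suit(AutoNumber):
    spades = ()
    hearts = ()
    diamonds = ()
    clubs = ()

# Suit.hearts.value == 2, Suit.clubs.value == 4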

Finally, we can add static methods, class methods, or class properties to the Enum subclass, just like we would do with any other class:

class MyEnum(Enum):
    # ...

    @classmethod
    def __values__(cls):
        return [m.value for m in cls]

Enums just are

All these things are possible primarily because of the most notable aspect of Python enums: their existence as an explicit concept. A syntactically unorganized bunch of constants cannot offer half of the highlighted features because there is nothing to pin them on.

For that reason alone, using enums as an explicit — rather than implicit — pattern seems worthwhile. The one benefit we’re certain to reap is better code structure through separation of important concepts.

  1. The enum module has been part of the Python standard library since version 3.4, but a fully functional backport is available for Python 2.x as well. 

  2. It is even possible to instruct SQLAlchemy how to map Python enums to ENUM types in database engines that support it, but details are outside of the scope of this article. 

  3. If you’re fine with the enum’s value being the whole tuple (everything after the = sign), you can override __init__ instead of __new__ (cf. the planet example from standard docs). 

CSS class helper for Jinja

Posted on Thu 22 October 2015 in Code • Tagged with Jinja, CSS, Python, Flask

One of the great uses for a templating engine such as Jinja is reducing the clutter in our HTML source files. Even with the steady advances in the CSS1 standards, various div.wrapper, div.container, div.content, and other presentational elements are still a common fact of life. Unless you’re one of the cool kids who use Web Components with Polymer, your main option for abstracting this boilerplate away is defining some more general template macros.

As with any kind of abstraction, it’s crucial to balance broad applicability with a potential for finely-tuned customization. In the case of wrapping HTML snippets in Jinja macros, an easy way to maintain flexibility is to allow the caller to specify some crucial attributes of the root element:

{#
  Renders the markup for a <button> from Twitter Bootstrap.
#}
{% macro button(text=none, style='default', type='button', class='') %}
  <button type="{{ type }}"
          class="btn btn-{{ style }}{% if class %} {{ class }}{% endif %}"
          {{ kwargs|xmlattr }}>
    {{ text }}
  </button>
{% endmacro %}

An element id would be a good example, and in the above macro it’s handled implicitly thanks to the {{ kwargs|xmlattr }} stanza.

class, however, is not that simple, because a macro like that usually needs to supply some CSS classes of its own. The operation of combining them with additional ones, passed by the caller, is awfully textual and error-prone. It’s easy, for example, to forget about the crucial space and run two class names together.

As if CSS wasn’t difficult enough to debug already!

Let them have list

The root cause for any of those problems is working at a level that’s too low for the task. The value for a class attribute may be encoded as a string, but it’s fundamentally a list of tokens. In the modern DOM API, for example, it is represented as exactly that: a DOMTokenList.

I’ve found it helpful to replicate a similar mechanism in Jinja templates. The result is a ClassList wrapper whose code I quote here in full:

from collections import Iterable, MutableSet

class ClassList(MutableSet):
    """Data structure for holding, and ultimately returning as a single string,
    a set of identifiers that should be managed like CSS classes.
    def __init__(self, arg=None):
        :param arg: A single class name or an iterable thereof.
        if isinstance(arg, basestring):
            classes = arg.split()
        elif isinstance(arg, Iterable):
            classes = arg
        elif arg is not None:
            raise TypeError(
                "expected a string or string iterable, got %r" % type(arg))

        self.classes = set(filter(None, classes))

    def __contains__(self, class_):
        return class_ in self.classes

    def __iter__(self):
        return iter(self.classes)

    def __len__(self):
        return len(self.classes)

    def add(self, *classes):
        for class_ in classes:
            self.classes.add(class_)

    def discard(self, *classes):
        for class_ in classes:
            self.classes.discard(class_)

    def __str__(self):
        return " ".join(sorted(self.classes))

    def __html__(self):
        return 'class="%s"' % self if self else ""
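
Used directly from Python, the wrapper behaves as follows (a quick interactive sketch):

>>> classes = ClassList('btn')
>>> classes.add('btn-primary', 'disabled')
>>> str(classes)
'btn btn-primary disabled'
>>> classes.__html__()
'class="btn btn-primary disabled"'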

To make it work with Flask, adorn the class with the Flask.template_global decorator:

from myflaskapp import app

@app.template_global('classlist')
class ClassList(MutableSet):
    # ...

Otherwise, if you’re working with a raw Jinja Environment, simply add ClassList to its global namespace directly:

jinja_env.globals['classlist'] = ClassList

In either case, I recommend following the apparent Jinja convention of naming template symbols as lowercasewithoutspaces.

Becoming classy

Usage of this new classlist helper is relatively straightforward. Since it accepts both a space-separated string or an iterable of CSS class names, a macro can wrap anything the caller would pass it as a value for class attribute:

{% macro button(text=none, style='default', type='button', class='') %}
  {% set classes = classlist(class) %}
  {# ... #}

The classlist is capable of producing the complete class attribute syntax (i.e. class="..."), or omitting it entirely if it would be empty. All we need to do is evaluate it using the normal Jinja expression syntax:

  <button type="{{ type }}" {{ classes }} {{ kwargs|xmlattr }}>

Before that, though, we may want to include some additional classes that are specific to our macro. The button macro, for example, needs to add the Bootstrap-specific btn and btn-$STYLE classes to the <button> element it produces2:

  {% do classes.add('btn', 'btn-' + style) %}

After executing this statement, the final class attribute contains both the caller-provided classes, as well those two that were added explicitly.

  1. Believe it or not, we can finally center the content vertically without much of an issue! 

  2. The {% do %} block in Jinja allows executing statements (such as function calls) without evaluating the values they return. It is not exposed by default, but adding the standard jinja2.ext.do extension to the Environment makes it available. 

Regular expression filters in SQLAlchemy

Posted on Wed 14 October 2015 in Code • Tagged with regular expressions, SQLAlchemy, SQL, Postgres, SQLite, AST

The nice part about SQLAlchemy — an ORM library/framework for Python — is that it offers much more than just mapping of SQL relations to model classes. At its lower level, for example, there exists an additional layer between those classes and raw SQL queries: an abstraction for the SQL language itself.

Although it may sound byzantine at first, this extra layer serves two important purposes:

  • portability: compiling of the same query to potentially different SQL syntaxes, as required by different database backends
  • extensibility: allowing the user to introduce new elements: operators, functions, or even whole statements

I’m going to take advantage of both of these qualities here, and show how you can implement support for regular expression operators in query filters. On supported database backends, it will allow for powerful pattern matching on string columns.

Database side

We will use the Postgres syntax of POSIX regular expression matchers as a reference. It includes four operators: for case sensitive or insensitive matching, in regular or negated versions.

Since it’s a common practice to use an in-memory SQLite for running database-involved “unit” tests1, we will also add support for that backend. Regular expressions are not implemented there directly, but thanks to sqlite3’s custom function ability, it’ll be easy enough to provide the necessary support in Python.
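
To get a taste of that ability in isolation, here is a tiny standalone sketch using the sqlite3 module directly (the function name is made up for this example and unrelated to what we’ll register later):

import re
import sqlite3

conn = sqlite3.connect(':memory:')
# expose a two-argument SQL function backed by Python's re module
conn.create_function(
    'REGEXP_MATCH', 2, lambda value, pattern: bool(re.match(pattern, value)))

print(conn.execute("SELECT REGEXP_MATCH('foo123', 'foo\\d+')").fetchone()[0])  # 1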

Desired API

In essence, what we want to do is to enhance the String column type with additional comparators, which will then be usable as arguments for the Query.filter method.

As an example, consider the following model class with a string column:

class Person(Model):
    nick = Column(String(255))
    # ...

Once we’re done, it should be possible to query for e.g. all the people whose nicks contain numbers in a pretty straightforward fashion:

numerics = session.query(Person).filter(Person.nick.regexp(r'\d+')).all()

Because of the limitations of underlying databases backends, only literal regular expressions would be supported. It means that, from the SQL standpoint, they have to be constants: you cannot use the value of one column as part of a regular expression that another column is matched against.

But of course, you can still construct those literals in Python code:

def get_copycats(nick):
    """Return all the people who tried to parrot given nick
    but failed and had to append some numbers to it.
    """
    nick_regexp = '^' + re.escape(nick) + r'\d+$'
    return session.query(Person).filter(Person.nick.regexp(nick_regexp)).all()

Considering that regular expressions themselves are already pretty powerful, this really ought to be sufficient for all reasonable purposes.

How do comparators work?

So how do we add the regexp method to the String type? It may seem logical to simply extend the class and add it directly:

from sqlalchemy import String as _String

class String(_String):
    """Enhanced version of the standard string type."""

    def regexp(self, value):
        # hmm...

But this won’t actually work. In SQLAlchemy, objects like String or Integer do not represent columns of a certain type; they correspond to the types themselves2. Extending them with additional methods won’t do much good, because regular code rarely operates on column types.

On the other hand, though, we obviously don’t want to mess with the Column class itself. Our additions are only meaningful for string columns, but putting them there would expose them to all columns, regardless of type!

Thankfully, there is an intermediate mechanism which SQLAlchemy introduced precisely to address the need we’ve got here. Every column type can define a comparator_factory: a kind of mixin class whose methods are incorporated into columns of that type. By overriding this inner class, we can both modify the behavior of existing operators, as well as introduce completely new ones.

So in order to add regexp and other methods to all String columns, our new string type must define its own comparator_factory:

class String(_String):
    class comparator_factory(_String.comparator_factory):
        def regexp(self, other):
            # ...

We need to remember about deriving it from the original one, too. Otherwise, all the standard operators you’d want to use in queries (==, +, etc.) would cease to work, because the new comparator_factory wouldn’t include an implementation of any of the necessary magic methods (__eq__, __add__, etc.).

SQL, abstracted

Knowing where to put our new comparator methods is certainly desirable, but the more interesting question is how do we implement them?

Like I’ve mentioned in the beginning, SQLAlchemy employs an additional layer between ORM models and raw, textual SQL language. Basically, it’s an AST for backend-independent queries which includes almost all of the various SQL constructs, codified in a platform-agnostic way inside the sqlalchemy.sql package.

You may have used this layer directly if you ever wrote queries based on Table objects and ran them with Session.execute. But even those constructed using the more familiar Query class interface end up in this intermediate representation. Often there is little to no additional processing involved.

Arguments to the Query.filter method, for example, are already given as SQL clause objects. It just so happens that its creation is hidden behind a very neat, operator-based API.

Thus, if our regular expression filters are to cooperate, they also need to return pieces of the SQL syntax tree. Not very complicated pieces, mind you, since we only need to represent simple expressions: something like foo ~ 'bar', where ~ may optionally be replaced by one of the other three operators.

Creating the node

They are all binary operators, by the way (i.e. taking exactly two arguments), so it makes sense that the corresponding AST node class is called BinaryExpression. The node’s children are the left argument, the right argument, and the operator itself.

With a little help from a few more SQL syntax wrappers, the implementation of regexp and the other methods turns out to be quite straightforward:

from sqlalchemy.sql.expression import BinaryExpression, literal
from sqlalchemy.sql.operators import custom_op

# (inside String.comparator_factory)

def regexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('~'))

def iregexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('~*'))

def not_regexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('!~'))

def not_iregexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('!~*'))

The use of the literal function is dictated by the limitation that was mentioned earlier: any regular expression given in the query must be a SQL literal. If we now try to pass a column-like clause instead, we’ll get an exception right at the query definition, rather than a database error when we try to execute it.

The custom_op function, on the other hand, is simply the easiest way to create an “operator node” that’s required as a child of BinaryExpression. Since it’s a custom operator, it won’t be interpreted by SQLAlchemy in any way; it will simply be used verbatim in the final SQL string that’s sent to the database.


You may have noticed that this would pose a problem if said database doesn’t support ~ or the other operators, which happens to be the case for everything besides Postgres. Because we originally intended to support SQLite in addition to Postgres, this is evidently a problem.

It’s also where the portability of an intermediate SQL representation comes into play. Since our new operators return AST nodes and not just textual SQL, we can redefine the way those nodes are compiled into actual query fragments on various database backends.

To accomplish that, first we need to distinguish our regex filters from other BinaryExpressions:

class RegexMatchExpression(BinaryExpression):
    """Represents matching of a column againsts a regular expression."""

# (inside String.comparator_factory)

def regexp(self, other):
    return RegexMatchExpression(self.expr, literal(other), custom_op('~'))
# etc.

Once we’ve introduced such a distinction, it becomes possible to provide a different way for those filters to be turned into SQL. We can namely define a new compilation routine for them, and mark it as canonical for a specific SQL dialect:

from sqlalchemy.ext.compiler import compiles

@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    """Compile the SQL expression representing a regular expression match
    for the SQLite engine.
    """
    # ...

The function receives an AST element to process (here, the RegexMatchExpression), along with a special compiler object that controls the whole translation process. Armed with those tools, we are allowed to modify the process in arbitrary ways and output just the right SQL statement that’ll do the job in SQLite.

Regex support lite

What does such a statement look like, though? As I’ve remarked early on, SQLite is very easy to extend with your own functions, and the sqlite3 driver used by SQLAlchemy enables us to write those functions directly in Python. Obviously, this is fantastic news when you have something like the standard re module at your disposal.

Indeed, coding the four required functions is quite trivial:

import re

# Mapping from the regular expression matching operators
# to named Python functions that implement them for SQLite.
SQLITE_REGEX_FUNCTIONS = {
    '~': ('REGEXP',
          lambda value, regex: bool(re.match(regex, value))),
    '~*': ('IREGEXP',
           lambda value, regex: bool(re.match(regex, value, re.IGNORECASE))),
    '!~': ('NOT_REGEXP',
           lambda value, regex: not re.match(regex, value)),
    '!~*': ('NOT_IREGEXP',
            lambda value, regex: not re.match(regex, value, re.IGNORECASE)),
}

What’s less apparent is how — or rather, when — to instruct the SQLite database to use them. As per the API we have to use, custom SQLite functions are created on a per-connection basis. But SQLAlchemy, like any good database interface, takes care of connection management, lifecycle, and pooling. Nowhere in our application is there a connect call (nor should there be!) that we could just follow with a few create_function invocations.

Yet, there is a way of doing exactly what we require, and it involves utilizing the event subsystem included in SQLAlchemy. Anytime something interesting happens to any of its ORM or core objects — models, sessions, connection pools, etc. — SQLAlchemy publishes an event that our application code can listen (subscribe) to. It’s a classic PubSub system that introduces some serious potential for extensibility.

Our use of it will be pretty modest, though. All we’re interested in is the establishing of a connection to the SQLite database. This translates directly to a 'connect' event of the Engine object:

from sqlalchemy import event
from sqlalchemy.engine import Engine
import sqlite3

@event.listens_for(Engine, 'connect')
def sqlite_engine_connect(dbapi_connection, connection_record):
    """Listener for the event of establishing connection to a SQLite database.

    Creates the functions handling regular expression operators
    within SQLite engine, pointing them to their Python implementations above.
    """
    if not isinstance(dbapi_connection, sqlite3.Connection):
        return

    for name, function in SQLITE_REGEX_FUNCTIONS.values():
        dbapi_connection.create_function(name, 2, function)

Note that this will catch all the connection events, so we have to verify it’s really SQLite we’re talking to. Afterwards, the creation of REGEXP, IREGEXP, etc. functions is extremely straightforward.

Compilation, take two

This was quite a build-up, but now we’re very close to the finale. What remains is finishing the compilation routine:

@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    # ...

We know that element corresponds to an expression in the form of a ~ 'foo'. For SQLite, however, the compatible version is a function call: REGEXP(a, 'foo'). At first this may appear rather disconcerting, because it’s basically a completely different AST node to build.

But it’s actually not a problem at all. Inside compiler hooks, we are allowed to use much of the same API that’s available when drafting regular queries. This includes the func factory object which produces calls to arbitrary SQL functions. Rather than compiling the original binary expression, we’ll simply poach its operands and use them as arguments to one of the new functions:

from sqlalchemy import exc
from sqlalchemy.sql.expression import func

@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    # determine the name of a custom SQLite function to use for the operator
    operator = element.operator.opstring
    try:
        func_name, _ = SQLITE_REGEX_FUNCTIONS[operator]
    except (KeyError, ValueError), e:
        would_be_sql_string = ' '.join((compiler.process(element.left),
                                        operator,
                                        compiler.process(element.right)))
        raise exc.StatementError(
            "unknown regular expression match operator: %s" % operator,
            would_be_sql_string, None, e)

    # compile the expression as an invocation of the custom function
    regex_func = getattr(func, func_name)
    regex_func_call = regex_func(element.left, element.right)
    return compiler.process(regex_func_call)

Notice how compiler.process is used to obtain the final result. The compiler object doesn’t care that we use it on a totally different AST node; it will dutifully carry out its compilation into raw SQL. We can even use this capability to add a little bit of paranoid error handling: if we encounter an unknown operator, the resulting error message will include fully compiled, SQL versions of both arguments.


This concludes our efforts: the original query examples with Person.nick.regexp should now work in both Postgres and SQLite. For your convenience, I’ve bundled all the code in this gist.

If you feel like tinkering with it further, I would suggest you try to remove the superfluous NOT_* functions. They make little sense given that SQL has a perfectly adequate NOT keyword. A clean solution would probably prefer an additional reverse flag in RegexMatchExpression over looking for a '!' character in the operator string.

  1. It may or may not be a good practice, though. 

  2. Although it’s possible to write Column(Integer), it’s merely a syntactical convenience. SQLAlchemy interprets it readily as Column(Integer()). Parameterized types — like String — always require explicit instantiation. 

Don’t Copy & Paste. Retype.

Posted on Sat 03 October 2015 in Programming • Tagged with problem solving, Stack Overflow, typing

In this day and age, Google and Stack Overflow are quite essential tools for any developer. Lately, though, the latter seems to be getting a bad rap. On one side, it’s because of seemingly peculiar and sometimes alienating moderation policies. But more pertinently, it’s the apparent rise of a phenomenon that’s wittily dubbed “the full Stack Overflow developer”.

In a nutshell, individuals deserving to be called that are code slingers who throw software artifacts together mostly by copying and pasting code samples found in Stack Overflow answers. They may be getting something working pretty quickly, but they also lack understanding of problems they’re facing and solutions they’re using so cheerily.

Of course, not every instance of code Copy Pasta is to be scorned. I’m pretty sure most people reading this post (and certainly the person writing it!) are guilty of replicating at least a few snippets from Stack Overflow, verbatim, in their own codebase. Heck, we may have even done so with nigh zero interest as to why it has been written this way. Not every technology is intrinsically fascinating, after all, and deadlines are sometimes too close for comfort.

But if so, does it mean we are gradually turning into full Stack Overflow developers?… Yikes! Surely we don’t want that to happen!

Mitigation tactic

Before you shut off your Internet connection altogether while coding, consider employing the following technique whenever you feel like scraping a piece of code from Stack Overflow, and dumping it in your project source.

Don’t use the clipboard. Don’t copy and paste. Retype the code you’ve found instead.

It’s going to take more time, yes. It’s definitely more cumbersome than just hitting Ctrl+C/Ctrl+V. It may also seem to make little sense: if the end result is the same, why does it matter whether the code was transferred through the clipboard or not?


I’d argue, however, that it makes perfect sense. From the least to the most important, the reasons why I think so are the following:

  • The fact that retyping is slower than copy-pasting is what actually makes it better. If you vow not to use the clipboard, you’re much less likely to just pick whatever’s the first Stack Overflow result Google has given. You’ll weigh different solutions, and you’ll be rightfully biased towards shorter and simpler ones.

  • When you type something, you cannot do it completely thoughtlessly. Whether you want it or not, you’ll absorb some of the knowledge through sheer osmosis, because the code will flow through your eyes and fingers as it’s transferred from the browser to your editor or IDE. Your subconscious brain will latch onto the bits and pieces of information, and it will sort them out for you to use later. Even if you didn’t intend to, you will most likely learn something.

  • But most importantly, what you type almost certainly won’t be a perfect copy of the original snippet. As you progress through the code, you’ll inevitably deviate from it, if only to conform to a particular style guide your project is following.
    It’s quite likely, though, that you’ll make larger changes as well. You will replace familiar patterns with calls to utility functions. You’ll rearrange the code visually for better readability. You will add comments, or extract functions to make it more self-documenting. You might even enhance and customize it, so that you can abstract and reuse it multiple times.

Afterwards, what you’ve just typed won’t be just some code you have found on the Internet. It’ll be your code.