Actually, Python enums are pretty OK

Posted on Sun 25 October 2015 in Code • Tagged with Python, enums, serialization

Little over two years ago, I was writing about my skepticism towards the addition of enum types to Python. My stance has changed somewhat since then, and I now think there is some non-trivial value that enums can add to Python code. It becomes especially apparent in the circumstances involving any kind of persistence, from pickling to databases.

I will elaborate on that through the rest of this post.

Revision

First, let’s recap (or perhaps introduce) the important facts about enums in Python. An enum, or an enumeration type, is a special class that inherits from enum.Enum1 and defines one or more constants:

from enum import Enum

class Cutlery(Enum):
    knife = 'knife'
    fork = 'fork'
    spoon = 'spoon'

Their values are then converted into singleton objects with distinct identity. They are available as attributes of the enumeration type:

>>> Cutlery.fork
<Cutlery.fork: 'fork'>
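
To illustrate the “distinct identity” part: looking a member up by its value yields the very same singleton object, and a member does not compare equal to its underlying value:

>>> Cutlery('fork') is Cutlery.fork
True
>>> Cutlery.fork == 'fork'
False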

As a shorthand for when the values themselves are not important, the Enum class can be directly invoked with constant names as strings:

Cutlery = Enum('Cutlery', ['knife', 'fork', 'spoon'])

The resulting class offers some useful API that we’d expect from an enumeration type in any programming language:

>>> list(Cutlery)
[<Cutlery.knife: 'knife'>, <Cutlery.fork: 'fork'>, <Cutlery.spoon: 'spoon'>]
>>> [c.value for c in Cutlery]
['knife', 'fork', 'spoon']
>>> Cutlery.knife in Cutlery
True
>>> Cutlery('spoon')
<Cutlery.spoon: 'spoon'>
>>> Cutlery('spork')
Traceback (most recent call last):
# (...)
ValueError: 'spork' is not a valid Cutlery

How it was done previously

Of course, there is nothing really groundbreaking about assigning some values to “constants”. It’s been done like that since time immemorial, and quite often those constants have been grouped within finite sets.

Here’s an example of a comment model in some imaginary ORM, complete with a status “enumeration”:

class Comment(Model):
    APPROVED = 'approved'
    REJECTED = 'rejected'
    IN_REVIEW = 'in_review'
    STATES = frozenset([APPROVED, REJECTED, IN_REVIEW])

    author = StringProperty()
    text = TextProperty()
    state = StringProperty(choices=STATES)

# usage
comment = Comment(author="Xion", text="Boo")
comment.state = Comment.IN_REVIEW
comment.save()

Converting it to use an Enum is relatively straightforward:

class Comment(Model):
    class State(Enum):
        approved = 'approved'
        rejected = 'rejected'
        in_review = 'in_review'

    author = StringProperty()
    text = TextProperty()
    state = StringProperty(choices=[s.value for s in State])

comment = Comment(author="Xion", text="Meh.")
comment.state = Comment.State.approved.value
comment.save()

It is not apparent, though, whether there is any immediate benefit to doing so. True, we no longer need to define the STATES set explicitly, but saving on this bit of boilerplate is balanced out by having to access the enum’s value when assigning it to a string property.

All in all, it seems like a wash — though at least we’ve established that enums are no worse than their alternatives :)

Enums are interoperable

Obviously, this doesn’t exactly sound like high praise. Thing is, we haven’t really replaced the previous solution completely. Remnants of it still linger in the way the state property is declared. Even though it’s supposed to hold enum constants, it is defined as a string, which at this point is more of an implementation detail of how those constants are serialized.

What we really need here is a kind of EnumProperty that’d allow us to work solely with enum objects. Before the introduction of a standard Enum base, however, there was little incentive for ORMs and other similar libraries to provide such functionality. But now, it makes much more sense to support enums as first-class citizens, at least for data exchange and serialization, because users can be expected to already prefer the standard Enum in their own code.

Thus, the previous example changes into something along these lines:

class Comment(Model):
    # ...
    state = EnumProperty(State)

comment = Comment(author="Xion", text="Yay!")
comment.state = Comment.State.approved  # no .value!
comment.save()

Details of EnumProperty, or some equivalent construct, are of course specific to any given data management library. In SQLAlchemy, for example, a custom column type can handle the necessary serialization and deserialization between Python code and SQL queries, allowing you to define your models like this2:

class Comment(Model):
    class State(Enum):
        # ...

    author = Column(String(255))
    text = Column(Text)
    state = Column(EnumType(State, impl=String(32)))

# usage like above

In any case, the point is to have Python code operate solely on enum objects, while the persistence layer takes care of converting between them and their serializable values.
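
For illustration, here is a minimal sketch of what such an EnumType could look like when built on SQLAlchemy’s TypeDecorator (the impl= parameter from the example above is omitted for brevity; values are simply stored in a short string column):

from sqlalchemy.types import String, TypeDecorator


class EnumType(TypeDecorator):
    """Column type that persists Enum members as their values."""
    impl = String(32)

    def __init__(self, enum_class, *args, **kwargs):
        super(EnumType, self).__init__(*args, **kwargs)
        self.enum_class = enum_class

    def process_bind_param(self, value, dialect):
        # Python -> database: serialize the enum member to its value
        return None if value is None else value.value

    def process_result_value(self, value, dialect):
        # database -> Python: look the member up by its stored value
        return None if value is None else self.enum_class(value)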

Enums are extensible

The other advantage Enums have over loose collections of constants is that they are proper types. Like all user-defined types in Python, they can have additional methods and properties defined on their instances, or even on the type itself.

Although this capability is (by default) not as powerful as e.g. in Java — where each enum constant can override a method in its own way — it can nevertheless be quite convenient at times. Typical use cases include constant classification:

class Direction(Enum):
    left = 37
    up = 38
    right = 39
    down = 40

    @property
    def is_horizontal(self):
        return self in (Direction.left, Direction.right)

    @property
    def is_vertical(self):
        return self in (Direction.down, Direction.up)

and conversion:

    def as_vector(self):
        return {
            Direction.left: (-1, 0),
            Direction.up: (0, -1),
            Direction.right: (1, 0),
            Direction.down: (0, 1),
        }.get(self)

For the latter, it would be handy to have Java’s ability to attach additional data to an enum constant. As it turns out, Python supports this feature natively in a very similar way. We simply have to override the enum’s __new__ method to parse out any extra values from the initializer and turn them into attributes of the enum instance:

class Direction(Enum):
    left = 37, (-1, 0)
    up = 38, (0, -1)
    right = 39, (1, 0)
    down = 40, (0, 1)

    def __new__(cls, keycode, vector):
        obj = object.__new__(cls)
        obj._value_ = keycode
        obj.vector = vector
        return obj

It’s possible, in fact, to insert an arbitrary computation here that yields the final _value_ of an enum constant3. This trick can be used to, for example, construct enums that automatically number themselves.
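
For instance, a self-numbering base enum, adapted from the AutoNumber example in the standard library docs, could look like this:

class AutoNumber(Enum):
    def __new__(cls):
        # number members consecutively, starting from 1
        value = len(cls.__members__) + 1
        obj = object.__new__(cls)
        obj._value_ = value
        return obj

class Color(AutoNumber):
    red = ()
    green = ()
    blue = ()

>>> Color.green.value
2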

Finally, we can add static methods, class methods, or class properties to the Enum subclass, just like we would do with any other class:

class MyEnum(Enum):
    # ...

    @classproperty
    def __values__(cls):
        return [m.value for m in cls]
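
Note that classproperty is not a standard decorator; a minimal descriptor that would make the snippet above work might be sketched as:

class classproperty(object):
    # Minimal read-only class-level property (illustrative sketch).
    def __init__(self, getter):
        self.getter = getter

    def __get__(self, instance, owner):
        return self.getter(owner)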

Enums just are

All these things are possible primarily because of the most notable aspect of Python enums: their existence as an explicit concept. A syntactically unorganized bunch of constants cannot offer half of the highlighted features because there is nothing to pin them on.

For that reason alone, using enums as an explicit — rather than implicit — pattern seems worthwhile. The one benefit we’re certain to reap is better code structure through separation of important concepts.


  1. The enum module has been part of the Python standard library since version 3.4, but a fully functional backport is available for Python 2.x as well.

  2. It is even possible to instruct SQLAlchemy how to map Python enums to ENUM types in database engines that support it, but details are outside of the scope of this article. 

  3. If you’re fine with the enum’s value being the whole tuple (everything after the = sign), you can override __init__ instead of __new__ (cf. the planet example from standard docs). 


CSS class helper for Jinja

Posted on Thu 22 October 2015 in Code • Tagged with Jinja, CSS, Python, Flask

One of the great uses for a templating engine such as Jinja is reducing the clutter in our HTML source files. Even with the steady advances in the CSS1 standards, various div.wrapper, div.container, div.content, and other presentational elements are still a common fact of life. Unless you’re one of the cool kids who use Web Components with Polymer, your main option for abstracting this boilerplate away is defining some more general template macros.

As with any kind of abstraction, it’s crucial to balance broad applicability with a potential for finely-tuned customization. In the case of wrapping HTML snippets in Jinja macros, an easy way to maintain flexibility is to allow the caller to specify some crucial attributes of the root element:

{#
  Renders the markup for a <button> from Twitter Bootstrap.
#}
{% macro button(text=none, style='default', type='button', class='') %}
  <button type="{{ type }}"
        class="btn btn-{{ style }}{% if class %} {{ class }}{% endif %}"
        {{ kwargs|xmlattr }}>
    {{ text }}
  </button>
{% endmacro %}

An element id would be a good example, and in the above macro it’s handled implicitly thanks to the {{ kwargs|xmlattr }} stanza.

class, however, is not that simple, because a macro like that usually needs to supply some CSS classes of its own. The operation of combining them with additional ones, passed by the caller, is awfully textual and error-prone. It’s easy, for example, to forget about the crucial space and run two class names together.

As if CSS wasn’t difficult enough to debug already!

Let them have list

The root cause of those problems is working at a level that’s too low for the task. The value of a class attribute may be encoded as a string, but it’s fundamentally a list of tokens. In the modern DOM API, for example, it is represented as exactly that: a DOMTokenList.

I’ve found it helpful to replicate a similar mechanism in Jinja templates. The result is a ClassList wrapper whose code I quote here in full:

from collections import Iterable, MutableSet


class ClassList(MutableSet):
    """Data structure for holding, and ultimately returning as a single string,
    a set of identifiers that should be managed like CSS classes.
    """
    def __init__(self, arg=None):
        """Constructor.
        :param arg: A single class name or an iterable thereof.
        """
        if arg is None:
            classes = []
        elif isinstance(arg, basestring):
            classes = arg.split()
        elif isinstance(arg, Iterable):
            classes = arg
        else:
            raise TypeError(
                "expected a string or string iterable, got %r" % type(arg))

        self.classes = set(filter(None, classes))

    def __contains__(self, class_):
        return class_ in self.classes

    def __iter__(self):
        return iter(self.classes)

    def __len__(self):
        return len(self.classes)

    def add(self, *classes):
        for class_ in classes:
            self.classes.add(class_)

    def discard(self, *classes):
        for class_ in classes:
            self.classes.discard(class_)

    def __str__(self):
        return " ".join(sorted(self.classes))

    def __html__(self):
        return 'class="%s"' % self if self else ""

To make it work with Flask, adorn the class with the Flask.template_global decorator:

from myflaskapp import app

@app.template_global('classlist')
class ClassList(MutableSet):
    # ...

Otherwise, if you’re working with a raw Jinja Environment, simply add ClassList to its global namespace directly:

jinja_env.globals['classlist'] = ClassList

In either case, I recommend following the apparent Jinja convention of naming template symbols as lowercasewithoutspaces.

Becoming classy

Usage of this new classlist helper is relatively straightforward. Since it accepts either a space-separated string or an iterable of CSS class names, a macro can wrap anything the caller would pass it as a value for the class attribute:

{% macro button(text=none, style='default', type='button', class='') %}
  {% set classes = classlist(class) %}
  {# ... #}

The classlist is capable of producing the complete class attribute syntax (i.e. class="..."), or omitting it entirely if it would be empty. All we need to do is evaluate it using the normal Jinja expression syntax:

  <button type="{{ type }}" {{ classes }} {{ kwargs|xmlattr }}>

Before that, though, we may want to include some additional classes that are specific to our macro. The button macro, for example, needs to add the Bootstrap-specific btn and btn-$STYLE classes to the <button> element it produces2:

  {% do classes.add('btn', 'btn-' + style) %}

After executing this statement, the final class attribute contains both the caller-provided classes, as well as the two that were added explicitly.


  1. Believe it or not, we can finally center the content vertically without much of an issue! 

  2. The {% do %} block in Jinja allows executing statements (such as function calls) without evaluating the values they return. It is not exposed by default, but adding the standard jinja2.ext.do extension to the Environment makes it available.


Regular expression filters in SQLAlchemy

Posted on Wed 14 October 2015 in Code • Tagged with regular expressions, SQLAlchemy, SQL, Postgres, SQLite, AST

The nice part about SQLAlchemy — an ORM library/framework for Python — is that it offers much more than just mapping of SQL relations to model classes. At its lower level, for example, there exists an additional layer between those classes and raw SQL queries: an abstraction for the SQL language itself.

Although it may sound byzantine at first, this extra layer serves two important purposes:

  • portability: compiling of the same query to potentially different SQL syntaxes, as required by different database backends
  • extensibility: allowing the user to introduce new elements: operators, functions, or even whole statements

I’m going to take advantage of both of these qualities here, and show how you can implement support for regular expression operators in query filters. On supported database backends, it will allow for powerful pattern matching on string columns.

Database side

We will use the Postgres syntax of POSIX regular expression matchers as a reference. It includes four operators: ~ and ~* for case-sensitive and case-insensitive matching, plus their negated counterparts !~ and !~*.

Since it’s a common practice to use an in-memory SQLite for running database-involved “unit” tests1, we will also add support for that backend. Regular expressions are not implemented there directly, but thanks to sqlite3’s custom function ability, it’ll be easy enough to provide the necessary support in Python.

Desired API

In essence, what we want to do is to enhance the String column type with additional comparators, which will then be usable as arguments for the Query.filter method.

As an example, consider the following model class with a string column:

class Person(Model):
    nick = Column(String(255))
    # ...

Once we’re done, it should be possible to query for e.g. all the people whose nicks contain numbers in a pretty straightforward fashion:

numerics = session.query(Person).filter(Person.nick.regexp(r'\d+')).all()

Because of the limitations of the underlying database backends, only literal regular expressions will be supported. It means that, from the SQL standpoint, they have to be constants: you cannot use the value of one column as part of a regular expression that another column is matched against.

But of course, you can still construct those literals in Python code:

def get_copycats(nick):
    """Return all the people who tried to parrot given nick
    but failed and had to append some numbers to it.
    """
    nick_regexp = '^' + re.escape(nick) + r'\d+$'
    return session.query(Person).filter(Person.nick.regexp(nick_regexp)).all()

Considering that regular expressions themselves are already pretty powerful, this really ought to be sufficient for all reasonable purposes.

How do comparators work?

So how do we add the regexp method to the String type? It may seem logical to simply extend the class and add it directly:

from sqlalchemy import String as _String


class String(_String):
    """Enhanced version of the standard string type."""

    def regexp(self, value):
        # hmm...

But this won’t actually work. In SQLAlchemy, objects like String or Integer do not represent columns of a certain type; they correspond to the types themselves2. Extending them with additional methods won’t do much good, because regular code rarely operates on column types.

On the other hand, though, we obviously don’t want to mess with the Column class itself. Our additions are only meaningful for string columns, but putting them there would expose them to all columns, regardless of type!

Thankfully, there is an intermediate mechanism which SQLAlchemy introduced precisely to address the need we’ve got here. Every column type can define a comparator_factory: a kind of mixin class whose methods are incorporated into columns of that type. By overriding this inner class, we can both modify the behavior of existing operators, as well as introduce completely new ones.

So in order to add regexp and other methods to all String columns, our new string type must define its own comparator_factory:

class String(_String):
    class comparator_factory(_String.comparator_factory):
        def regexp(self, other):
            # ...

We need to remember to derive it from the original one, too. Otherwise, all the standard operators you’d want to use in queries (==, +, etc.) would cease to work, because the new comparator_factory wouldn’t include an implementation of any of the necessary magic methods (__eq__, __add__, etc.).

SQL, abstracted

Knowing where to put our new comparator methods is certainly desirable, but the more interesting question is how do we implement them?

Like I’ve mentioned in the beginning, SQLAlchemy employs an additional layer between ORM models and raw, textual SQL language. Basically, it’s an AST for backend-independent queries which includes almost all of the various SQL constructs, codified in a platform-agnostic way inside the sqlalchemy.sql package.

You may have used this layer directly if you ever wrote queries based on Table objects and ran them with Session.execute. But even those constructed using the more familiar Query class interface end up in this intermediate representation. Often there is little to no additional processing involved.

Arguments to the Query.filter method, for example, are already given as SQL clause objects. It just so happens that its creation is hidden behind a very neat, operator-based API.

Thus, if our regular expression filters are to cooperate, they also need to return pieces of the SQL syntax tree. Not very complicated pieces, mind you, since we only need to represent simple expressions: something like foo ~ 'bar', where ~ may optionally be replaced by one of the other three operators.

Creating the node

They are all binary operators, by the way (i.e. taking exactly two arguments), so it makes sense that the corresponding AST node class is called BinaryExpression. The node’s children are the left argument, the right argument, and the operator itself.

With a little help from a few more SQL syntax wrappers, the implementation of regexp and the other methods turns out to be quite straightforward:

from sqlalchemy.sql.expression import BinaryExpression, literal
from sqlalchemy.sql.operators import custom_op


# (inside String.comparator_factory)

def regexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('~'))

def iregexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('~*'))

def not_regexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('!~'))

def not_iregexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('!~*'))

The use of the literal function is dictated by the limitation that was mentioned earlier: any regular expression given in the query must be a SQL literal. If we now try to pass a column-like clause, we’ll get an exception right at the query definition, rather than a database error when we try to execute it.

The custom_op function, on the other hand, is simply the easiest way to create an “operator node” that’s required as a child of BinaryExpression. Since it’s a custom operator, it won’t be interpreted by SQLAlchemy in any way; it will simply be used verbatim in the final SQL string that’s sent to the database.

Compile!

You may have noticed that this would pose a problem if said database doesn’t support ~ or the other operators, which happens to be the case for everything besides Postgres. Because we originally intended to support SQLite in addition to Postgres, this is evidently a problem.

It’s also where the portability of an intermediate SQL representation comes into play. Since our new operators return AST nodes and not just textual SQL, we can redefine the way those nodes are compiled into actual query fragments on various database backends.

To accomplish that, first we need to distinguish our regex filters from other BinaryExpressions:

class RegexMatchExpression(BinaryExpression):
    """Represents matching of a column against a regular expression."""

# (inside String.comparator_factory)

def regexp(self, other):
    return RegexMatchExpression(self.expr, literal(other), custom_op('~'))
# etc.

Once we’ve introduced such a distinction, it becomes possible to provide a different way for those filters to be turned into SQL. We can namely define a new compilation routine for them, and mark it as canonical for a specific SQL dialect:

from sqlalchemy.ext.compiler import compiles


@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    """Compile the SQL expression representing a regular expression match
    for the SQLite engine.
    """
    # ...

The function receives an AST element to process (here, the RegexMatchExpression), along with a special compiler object that controls the whole translation process. Armed with those tools, we are allowed to modify the process in arbitrary ways and output just the right SQL statement that’ll do the job in SQLite.

Regex support lite

What does such a statement look like, though? As I’ve remarked early on, SQLite is very easy to extend with your own functions, and the sqlite3 driver used by SQLAlchemy enables us to write those functions directly in Python. Obviously, this is fantastic news when you have something like the standard re module at your disposal.

Indeed, coding the four required functions is quite trivial:

import re

# Mapping from the regular expression matching operators
# to named Python functions that implement them for SQLite.
SQLITE_REGEX_FUNCTIONS = {
    '~': ('REGEXP',
          lambda value, regex: bool(re.match(regex, value))),
    '~*': ('IREGEXP',
           lambda value, regex: bool(re.match(regex, value, re.IGNORECASE))),
    '!~': ('NOT_REGEXP',
           lambda value, regex: not re.match(regex, value)),
    '!~*': ('NOT_IREGEXP',
            lambda value, regex: not re.match(regex, value, re.IGNORECASE)),
}

What’s less apparent is how — or rather, when — to instruct the SQLite database to use them. As per the API we have to use, custom SQLite functions are created on a per-connection basis. But SQLAlchemy, like any good database interface, takes care of connection management, lifecycle, and pooling. Nowhere in our application is there a connect call (nor should there be!) that we could just follow with a few create_function invocations.

Yet, there is a way of doing exactly what we require, and it involves utilizing the event subsystem included in SQLAlchemy. Anytime something interesting happens to any of its ORM or core objects — models, sessions, connection pools, etc. — SQLAlchemy publishes an event that our application code can listen (subscribe) to. It’s a classic PubSub system that introduces some serious potential for extensibility.

Our use of it will be pretty modest, though. All we’re interested in is the establishing of a connection to the SQLite database. This translates directly to a 'connect' event of the Engine object:

from sqlalchemy import event
from sqlalchemy.engine import Engine
import sqlite3


@event.listens_for(Engine, 'connect')
def sqlite_engine_connect(dbapi_connection, connection_record):
    """Listener for the event of establishing connection to a SQLite database.

    Creates the functions handling regular expression operators
    within SQLite engine, pointing them to their Python implementations above.
    """
    if not isinstance(dbapi_connection, sqlite3.Connection):
        return

    for name, function in SQLITE_REGEX_FUNCTIONS.values():
        dbapi_connection.create_function(name, 2, function)

Note that this will catch all the connection events, so we have to verify it’s really SQLite we’re talking to. Afterwards, the creation of REGEXP, IREGEXP, etc. functions is extremely straightforward.

Compilation, take two

This was quite a build-up, but now we’re very close to the finale. What remains is finishing the compilation routine:

@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    pass

We know that element corresponds to an expression in the form of a ~ 'foo'. For SQLite, however, the compatible version is a function call: REGEXP(a, 'foo'). At first this may appear rather disconcerting, because it’s basically a completely different AST node to build.

But it’s actually not a problem at all. Inside compiler hooks, we are allowed to use much of the same API that’s available when drafting regular queries. This includes the func factory object which produces calls to arbitrary SQL functions. Rather than compiling the original binary expression, we’ll simply poach its operands and use them as arguments to one of the new functions:

from sqlalchemy import exc
from sqlalchemy.sql.expression import func


@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    # determine the name of a custom SQLite function to use for the operator
    operator = element.operator.opstring
    try:
        func_name, _ = SQLITE_REGEX_FUNCTIONS[operator]
    except (KeyError, ValueError) as e:
        would_be_sql_string = ' '.join((compiler.process(element.left),
                                        operator,
                                        compiler.process(element.right)))
        raise exc.StatementError(
            "unknown regular expression match operator: %s" % operator,
            would_be_sql_string, None, e)

    # compile the expression as an invocation of the custom function
    regex_func = getattr(func, func_name)
    regex_func_call = regex_func(element.left, element.right)
    return compiler.process(regex_func_call)

Notice how compiler.process is used to obtain the final result. The compiler object doesn’t care that we use it on a totally different AST node; it will dutifully carry out its compilation into raw SQL. We can even use this capability to add a little bit of paranoid error handling: if we encounter an unknown operator, the resulting error message will include fully compiled, SQL versions of both arguments.

Summary

This concludes our efforts: the original query examples with Person.nick.regexp should now work in both Postgres and SQLite. For your convenience, I’ve bundled all the code in this gist.

If you feel like tinkering with it further, I would suggest you try to remove the superfluous NOT_* functions. They make little sense given that SQL has a perfectly adequate NOT keyword. A clean solution would probably prefer an additional reverse flag in RegexMatchExpression over looking for a '!' character in the operator string.
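
Purely as an illustration of that direction, the hypothetical reverse flag might be threaded through roughly like this (an untested sketch that reuses the imports from earlier):

class RegexMatchExpression(BinaryExpression):
    """Represents matching of a column against a regular expression."""

    def __init__(self, left, right, operator, reverse=False):
        super(RegexMatchExpression, self).__init__(left, right, operator)
        self.reverse = reverse  # True for the negated (!~, !~*) variants


@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    # only two SQLite functions remain: REGEXP and IREGEXP
    func_name = 'IREGEXP' if element.operator.opstring.endswith('*') else 'REGEXP'
    sql = compiler.process(getattr(func, func_name)(element.left, element.right))
    return 'NOT %s' % sql if element.reverse else sql

The not_regexp and not_iregexp comparators would then pass reverse=True instead of relying on the '!'-prefixed operator strings.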


  1. It may or may not be a good practice, though. 

  2. Although it’s possible to write Column(Integer), it’s merely a syntactical convenience. SQLAlchemy interprets it readily as Column(Integer()). Parameterized types — like String — always require explicit instantiation. 


Optional loading of RequireJS modules

Posted on Tue 29 September 2015 in Code • Tagged with JavaScript, RequireJS, modules, Web Workers, DOM, AJAX

RequireJS is a module loader for JavaScript. Similar to its alternatives such as Browserify, it tries to solve an important problem on the web front(end): dividing JavaScript code into modules for better maintainability while still loading them correctly and efficiently without manual curation of the <script> tags.

Once it’s configured correctly (which can be rather non-trivial, though), modules in RequireJS are simply defined as functions that return arbitrary JavaScript objects:

define([
    'jquery',
    'lodash',

    'myapp/dep1',
    'myapp/dep2',
], function($, _, dep1, dep2) {
    // ... all of the module's code ...

    return {
        exportedSymbol1: ...,
        exportedSymbol2: ...,
    };
});

Before executing the function, RequireJS loads all the specified dependencies, repeating the process recursively and asynchronously. Return values from module functions are passed as parameters to the next module function, and thus the whole mechanism clicks, serving as a crafty workaround for the lack of proper import functionality1.

Relative failures

If, at some point in the chain, the desired module cannot be found or loaded, the entire process grinds to a halt with an error. Most of the time, this is perfectly acceptable (or even desirable) behavior, equivalent to an incorrect import statement, invalid #include directive, or similar mistake in other languages.

But there are situations when we’d like to proceed with a missing module, because the dependent code is prepared to handle it. The canonical example is Web Workers. Unlike traditional web application code, Web Worker scripts operate outside the context of any single page, having no access to the DOM tree (because which DOM tree would it be?). Correspondingly, they have neither document nor window objects in their global scope.

Unfortunately, some libraries (*cough* jQuery *cough*) require those objects as a hard (and usually implicit) dependency. This doesn’t exactly help if we’d like to use them in worker code for other features, not related to DOM. In case of jQuery, for example, it could be the API for making AJAX calls, which is still decidedly more pleasant than dealing with bare XMLHTTPRequest if we’re doing anything non-trivial.

Due to this hard dependency on DOM, however, Web Workers cannot require jQuery. No biggie, you may think: browsers supporting workers also offer an excellent, promise-based Fetch API that largely replaces the old AJAX, so we may just use it in worker code. Good thinking indeed, but it doesn’t solve the issue of sharing code between main (“UI”) part of the app and Web Workers.

Suppose you have a dependency graph in which both the main page script and the Web Worker script depend on a shared common module, which in turn requires jquery.

The common module has some logic that we’d want reused between regular <script>-based code and a Web Worker, but its dependency on jQuery makes it impossible. It would work, however, if this dependency was a soft one. If common could detect that jQuery is not available and fall back to other solutions (like the Fetch API), we would be able to require it in both execution environments.

The optional plugin

What we need, it seems, is an ability to say that some dependencies (like 'jquery') are optional. They can be loaded if they’re available but otherwise, they shouldn’t cause the whole dependency structure to crumble. RequireJS does not support this functionality by default, but it’s easy enough to add it via a plugin.

There are already several useful plugins available for RequireJS that offer some interesting features. As of this writing, however, optional module loading doesn’t seem to be among them. That’s not a big problem: rolling out our own2 plugin turns out to be relatively easy.

RequireJS plugins are themselves modules: you create them as separate JavaScript files having code wrapped in define call. They can also declare their own dependencies like any other module. The only requirement is that they export an object with certain API: at minimum, it has to include the load method. Since our optional plugin is very simple, load is in fact the only method we have to implement:

/* Skeleton of a simple RequireJS plugin module. */

define([], function() {

function load(moduleName, parentRequire, onload, config) {
    // ...
}

return {
    load: load,
};

});

As its name would hint, load carries out the actual module loading which a plugin is allowed to influence, modify, or even replace with something altogether different. In our case, we don’t want to be too invasive, but we need to detect failure in the original loading procedure and step in.

I mentioned previously that module loading is asynchronous, which in JavaScript often translates to “callbacks”. Here, load receives the onload callback which we eventually need to invoke. It also gets the mysterious parentRequire argument; this is simply the regular require function that’d normally be used if our plugin didn’t stand in the way.

Those two are the most important pieces of the puzzle, which overall has a pretty succinct solution:

/**
 * RequireJS plugin for optional module loading.
 */
define ([], function() {


/** Default value to return when a module failed to load. */
var DEFAULT = null;

function load(moduleName, parentRequire, onload) {
    parentRequire([moduleName], onload, function (err) {
        var failedModule = err.requireModules && err.requireModules[0];
        console.warn("Could not load optional module: " + failedModule);
        requirejs.undef(failedModule);

        define(failedModule, [], function() { return DEFAULT; });
        parentRequire([failedModule], onload);
    });
}

return {
    load: load,
};

});

The logic here is as follows:

  • First, try to load the module normally (via the outer parentRequire call).
  • If it succeeds, onload is called and there is nothing for us to do.
  • If it fails, we log the failedModule and clean up some internal RequireJS state with requirejs.undef.
  • Most importantly, we define the module as a trivial shim that returns some DEFAULT (here, null).
  • As a result, when we require it again (through the inner parentRequire call), we know it’ll be loaded successfully.

Usage

Plugins in RequireJS are invoked on a per-module basis. You can specify that a certain dependency 'bar' shall be loaded through a plugin 'foo' by putting 'foo!bar' on the dependency list:

define([ 'foo!bar'], function(bar) {
    // ...
});

Both 'foo' and 'bar' represent module paths here: the first one is the path to the plugin module, while the second one is the actual dependency. In a more realistic example — like when our optional loader is involved — both of them would most likely be multi-segment paths:

define([
    'myapp/ext/require/optional!myapp/common/buttons/awesome-button',
], function(AwesomeButtonController) {
    // ...
});

As you can see, they can get pretty unreadable rather quickly. It would be better if the plugin prefix consisted of just one segment (i.e. optional!) instead. We can make that happen by adding a mapping to the RequireJS config:

requirejs.config({
    // ...
    map: {
        '*': {
            'optional': 'myapp/ext/require/optional',
        }
    }
})

With this renaming in place, the loading of non-mandatory dependencies becomes quite a bit clearer:

define([
    'optional!myapp/common/buttons/awesome-button',
], function(AwesomeButtonController) {

// ...
if (!AwesomeButtonController) {
    // ... (some work around) ...
}

});

Of course, you still need to actually code around the potential lack of an optional dependency. The if statement above is just an illustrative example; you may find it more sensible to provide some shim instead:

AwesomeButtonController = AwesomeButtonController || function() {
    // ...
};

Either way, I recommend trying to keep the size of such conditional logic to a minimum. Ideally, it should be confined to a single place, or — better yet — abstracted behind a function.


  1. An actual import statement has made it into the ES6 (ECMAScript 2015) standard but, as of this writing, no browser implements it. 

  2. Most of the code for the plugin presented here is based on this StackOverflow answer


Reload Jinja macro files

Posted on Thu 24 September 2015 in Code • Tagged with Python, Jinja, Flask, Werkzeug, Flask-Script

The integration between Jinja templating engine and the Flask microframework for Python is quite seamless most of the time. It also facilitates rapid development: when we run our web application through Flask’s development server, any changes in Python code will get picked up immediately without restarting it. Similarly, Jinja’s default template loader will detect whether any template has been modified after its last compilation and recompile it if necessary.

One instance when it doesn’t seem to work that well, though, is template files that only contain Jinja macros. They apparently aren’t subject to the same caching rules that apply to regular templates. Modifications to those files alone may not be picked up by the Jinja Environment, causing render_template calls in Flask to (indirectly) use their stale versions.

I’ll be watching you

This problem can be alleviated by making the server watch those macro files explicitly, not unlike the Python sources it already monitors. When running a Flask server through the Flask.run method, you can pass additional keyword arguments which aren’t handled by Flask itself, but by the underlying WSGI scaffolding called Werkzeug. The automatic reloader is actually part of it and is quite configurable. Most importantly, it allows passing a list of extra_files= to watch:

import os

app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 5000)),
        extra_files=[os.path.join(app.root_path, app.template_folder, 'macros.html')])

But of course it’s tedious and error-prone to keep this list up to date manually, so let’s try to do better.

…all of you

I’m going to assume you’re keeping all your Jinja macro files inside a single directory. This makes sense, as macros are reusable pieces of template code that are imported by multiple regular templates, so they shouldn’t be scattered around the codebase without some order. The folder may be named macros, util, common, or similar; what’s important is that all the macros have a designated place in the project source tree.

Under this assumption, it’s not difficult to programmatically compute their complete list1:

from pathlib import Path

from myapplication import app  # Flask application object


#: Directories under app's template folder that contain files
#: with only Jinja macros.
JINJA_MACRO_DIRS = ('macros',)


def get_extra_files():
    """Returns an iterable of the extra files that the Flask server
    should monitor for changes and reload the app if any has been modified.
    """
    app_root = Path(app.root_path)

    for dirname in JINJA_MACRO_DIRS:
        macros_dir = app_root / app.template_folder / dirname
        for filepath in macros_dir.rglob('*.html'):
            yield str(filepath)

If you’re using Flask blueprints, in addition to the global application templates you’d also want to monitor any blueprint-specific macro files:

        for bp in (app.blueprints or {}).values():
            macros_dir = Path(bp.root_path) / bp.template_folder / dirname
            for filepath in macros_dir.rglob('*.html'):
                yield str(filepath)

A new start

How you’re going to use the get_extra_files function from the snippet above depends on how exactly you’re running the Flask development server. In the simplest case, when Flask.run is invoked at the top-level scope of some module, it’s pretty obvious:

app.run(..., extra_files=list(get_extra_files()))

More realistically, though, you may be using the Flask-Script extension for all management tasks, including starting the server. If so, calling Flask.run will be relegated to it, so you’ll need to override the Server command to inject additional parameters there:

import os

from flask.ext.script import Manager, Server as _Server

from myapplication import app


class Server(_Server):
    def __init__(self, *args, **kwargs):
        if kwargs.get('use_reloader', True):
            kwargs.setdefault('extra_files', []).extend(get_extra_files())
        super(Server, self).__init__(*args, **kwargs)


manager = Manager(app, with_default_commands=False)

server = Server(port=int(os.environ.get('PORT', 5000)))
manager.add_command('server', server)

This new server command should work just like the default one provided by Flask-Script out of the box, except that all your Jinja macro files will now be monitored for changes.


  1. Like before, I’m using the pathlib module, for which there is a backport if you’re using Python 3.3 or older. 


Purging local Memcached

Posted on Sat 19 September 2015 in Code • Tagged with memcache, Memcached, expect, caching, scripting, Tcl

Memcached is a ubiquitous caching solution, most commonly used to speed things up in web application backends. You deploy it as a separate binary and have your application servers talk to it before querying a database, calling a third party API over HTTP, or performing some other time-consuming I/O operation. Since it stores data in memory, as its name obviously suggests, it can be orders of magnitude faster than anything that may have to hit a spinning disk (like a database) or an unreliable, external network.

From the point of view of a developer, using Memcached is pretty simple. There exist numerous libraries that wrap its protocol in a neat, language-appropriate API. It does put another requirement on the development environment, of course: you need to have a working memcached daemon in the background if you want the local server to hit code paths where it retrieves data from memcache1. Thankfully, it’s an extremely popular piece of software, present in basically all package repositories, so having it up and running is just one apt-get install or brew install away.

How to flush

It’s a little dated and finicky piece of software, too. Once you have used its local instance for a while, there comes a time when you’d like to purge all its contents and have your server(s) fill it up again. Rather than restarting the daemon (which is a system-specific procedure that may require root privileges), you’d like to just issue the flush_all command to it.

This should be easy. Unlike Redis, however, memcached doesn’t come with a dedicated CLI client. In theory, it doesn’t have to, because you can just talk to it over telnet:

$ telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
flush_all
OK
^]

telnet> quit
Connection closed.

But going through this rigmarole every time we want to flush memcache gets a bit tedious after a while. An obvious attempt at automating it will, however, fail:

$ echo "flush_all\n" | telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

Turns out that you cannot just bombard memcached with flush_all (or any other command) while the connection is being established. You can of course try adding a sleep to allow it some time, but this is inherently unreliable, especially when connecting to remote servers.

The most successful approach I’ve managed to find is to automate not the raw data exchange, but the interactive telnet session itself. By that I mean invoking a telnet command, observing its output and reacting to it — just like a human would do, but in a scripted, automated way.

Nobody expects the Telnet instrumentation

There is a dedicated Unix program for scripting just this kind of CLI interactions. It’s decidedly more obscure than mere echo or pipes, but should nonetheless be available for any modern (or ancient) Unix system, from Linuxes to OSX. Its name is expect and — if installed2 — the usual place you can find it is /usr/bin or /usr/local/bin.

expect accepts commands and scripts written in Tcl — a scripting language whose syntax looks a bit like a mix of PHP and Objective-C. It’s a fully featured language with all the usual programming constructs you’d normally want but expect enhances it further with instructions dedicated to spawning external processes, reading their output, and sending input in response.

In fact, the main commands of the expect program are spawn, expect, and send. Mashing them together, we can easily nail the crucial part of our memcache-purging script:

#!/usr/bin/expect

spawn telnet localhost 11211;
expect "Connected to localhost";
send "flush_all\n";

This performs the actual flush but the telnet session/process is kept open afterwards. Once the expect program ends (after finishing our script), that process may get orphaned and hog system resources. To prevent that, we should behave like a good Unix citizen and wait for our child processes to terminate:

wait;

But obviously, this will never happen, because we’ve never instructed the telnet process to end!

Escape artist

Your first instinct may be to send ^C (SIGINT) or ^D (end-of-file) to the telnet process but this won’t work either. Everything we type inside an active Telnet session is sent to the remote server, and that includes those two key chords.

Authors of the telnet program were of course aware of this problem. As a countermeasure, they introduced the concept of an escape character that allows the user to control the Telnet connection itself — including, but not limited to, its termination. When encountered in the input stream, the escape character causes telnet to temporarily suspend its normal operation, “escaping” from the client-server session to a simple command shell of the telnet program itself. (This is indicated by the telnet> prompt).

The default escape character is ^], which can be easily typed on a keyboard using the key combination Ctrl+]. The expect program, however, only simulates typing real characters. Coupled with some syntactical quirks of the Tcl language, this makes the usage of ^] as an escape character rather cumbersome, to say the least.

Fortunately, this default can be changed rather easily with an -e flag to the telnet application. After setting it to something more common but still outside of the memcache protocol — such as the exclamation mark — we are now able to send it without an issue:

#!/usr/bin/expect

set escapechar !;

spawn telnet -e $escapechar localhost 11211;
expect "Connected to localhost";
send "flush_all\n";

send $escapechar;
expect "telnet>";
send "quit\n";
wait;

This time the script terminates even though it waits for the telnet process to finish, which means everything has been shut down gracefully.

Production-ready

As a final touch, it’d be nice to make our solution work with any memcache server and not just localhost. You can see it done in this gist: the script accepts two optional arguments (host & port) and uses them when spawning the telnet process.


  1. It probably goes without saying that any value in the memcache, as well as the entire memcached server (or servers), must be treated as potentially unavailable at any time. A properly written application server should still be able to service all requests — if only a little slower — without any (mem)caching at all. 

  2. On all distros, as well as in Homebrew for OSX, it should be available as a package with a totally uncreative name of expect


mailto: URLs in JavaScript

Posted on Tue 15 September 2015 in Code • Tagged with JavaScript, email, URL

Though not as popular as back in the days, mailto: URLs are sometimes still the best way — and most certainly the easiest — to enable users to send emails from a web application. Typically, they are used in plain regular links that are created with the <a> tag:

<a href="mailto:john.doe@example.com">Send email</a>

Depending on the operating system and browser settings, clicking on such a link could invoke a different mail client. In the past, this could cause much confusion if the user had never used their local email application (e.g. Outlook), but nowadays browsers are increasingly doing the right thing and presenting a choice of possible options. This includes the various web applications from popular email providers. The user can thus choose — say — Gmail, and have all subsequent mailto: links take them to a Compose Mail view there.

Given those developments, it’s possible that soon those URLs won’t be as quaint as they used to be ;) In any case, I think it’s worthwhile to have a few mailto: tricks up our sleeve for when they turn out to be an appropriate solution for the task at hand.

Triggering them in JavaScript

In the example above, I’ve used a mailto: URL directly in some HTML markup. For the most part, however, those URLs behave like any other and support a lot of the same functionality that http:// URLs do:

function sendEmail(address) {
    window.location.href = 'mailto:' + address;
}

Predictably, calling this sendEmail function is equivalent to clicking on a mailto: link. This additional level of indirection, however, may be enough to fool some web crawlers that look for email addresses to spam. Its effectiveness depends largely on how radically we’re going to obfuscate the address argument in the actual invocation.

More importantly, though, a function like that can be part of some more complicated logic. The only limitation is the same one that restricts where and when it’s possible to window.open another browser window. Namely, it has to be called in a direct response to a UI action triggered by the user (a button click would be a good example).

Customization

In their simplest form, mailto: URLs aren’t terribly interesting, since their only parameter is the address of a sole recipient. Fortunately, this isn’t where their capabilities end: it’s possible to customize other email fields as well, such as the subject or even body1.

Somewhat surprisingly, all this can be done in a manner that’s more typical of http:// URLs: with query string parameters. Right after the recipient’s address, we can drop a question mark and add some ampersand-separated key-value pairs:

<a href="mailto:john.doe@example.com?subject=Hi&amp;body=Hello%20John!">
  Say hello to John!
</a>

A more programmatic approach is of course also possible. One thing we need to remember, though, is the proper escaping of input values in order to make them safe to put inside a URL2. The encodeURIComponent function can be used for this purpose:

function getMailtoUrl(to, subject, body) {
    var args = [];
    if (typeof subject !== 'undefined') {
        args.push('subject=' + encodeURIComponent(subject));
    }
    if (typeof body !== 'undefined') {
        args.push('body=' + encodeURIComponent(body))
    }

    var url = 'mailto:' + encodeURIComponent(to);
    if (args.length > 0) {
        url += '?' + args.join('&');
    }
    return url;
}

In theory, almost any of the numerous email headers can be specified this way. However, support may vary across browsers, especially since some of the headers (like the Bcc: one) are subject to certain security or privacy considerations. Personally, I wouldn’t recommend relying on anything besides subject= and body=, with a possible addition of cc= for the Cc: header.


  1. By RFC 2368, only text/plain email body can be specified this way. 

  2. Note that this is unrelated to the HTML escaping of & character as &amp; entity in the previous example. 


Query string authentication in Requests

Posted on Fri 11 September 2015 in Code • Tagged with Python, Requests, HTTP, authentication

Requests is a widely used Python library that provides a nicer API for writing HTTP clients than the standard urllib2 module does. It deals with authentication in an especially concise way: through a simple auth= argument, rather than a separate password manager & authentication handler or other such nonsense.

There are several possible ways to authenticate an HTTP call with Requests, and it’s pretty easy to implement our own approach if the server requires it. All the built-in ways, however, as well as the examples of custom implementations, are heavily biased towards using HTTP headers to transmit the necessary credentials (such as username/password or some kind of opaque token).
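
For instance, the built-in HTTP basic auth support can be used in either of these equivalent ways (the URL here is made up):

import requests
from requests.auth import HTTPBasicAuth

requests.get('https://api.example.com/me', auth=HTTPBasicAuth('user', 'secret'))
requests.get('https://api.example.com/me', auth=('user', 'secret'))  # shorthand

Both end up attaching the same Authorization header to the request, which is exactly the header-centric approach described above.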

Non-standard auth

This is actually quite reasonable: the most popular authentication methods, including OAuth 1.0 & 2.0, use HTTP headers either primarily or exclusively.

Not every HTTP API follows this convention, though. Sometimes, credentials are put in other parts of the request, commonly the URL itself. It may seem like a bad idea at first but it can also be perfectly acceptable: credentials don’t have to expose secrets of any particular user of the remote system.

Steam API is a good example here. Calling any of its endpoints requires providing an API key but it grants no special rights to access data of any particular Steam user. All the information it returns is already visible on their public profile1.

For those special authentication schemes, Requests necessitates rolling out our own implementation. Thankfully, doing so is mostly pretty straightforward.

Simple example

All Requests’ authenticators are callable objects inheriting from requests.auth.AuthBase class. Writing your own is hence a matter of defining a subclass of AuthBase with at least a __call__ method:

class SillyAuth(AuthBase):
    def __call__(self, request):
        request.headers['X-ID'] = 'im valid so auth is yes'
        return request

# usage
requests.get('http://example.com', auth=SillyAuth())

The job of an authenticator is to modify the request so that it includes appropriate credentials in whatever form necessary to have them accepted by the remote server. Like I’ve mentioned before, HTTP headers are the most common option, but the request can be modified in other ways as well.

Query string parameters

One problem with modifying a query string, though, is that it’s part of the request URL. By the time it reaches authenticators, the Requests library has already merged any additional query params into it2. Including more params will thus require modifying the URL.

Though it may sound like a risky endeavour involving string manipulations that are fraught with security issues, it’s not really that bad at all. In fact, the Requests library provides an API to do exactly this:

class QueryStringAuth(AuthBase):
    """Authenticator that attaches a set of query string parameters
    (e.g. an API key) to the request.
    """
    def __init__(self, **params):
        self.params = params

    def __call__(self, request):
        if self.params:
            request.prepare_url(request.url, self.params)
        return request

Albeit scantily documented, the prepare_url method will take an existing URL and a dictionary of query string params, and outfit the request with a brand new URL that contains those params neatly encoded.

The full implementation of QueryStringAuth is a little more involved than the snippet above, because we’d like to replicate all the idiosyncrasies of how the regular Requests API handles query string params. Some of them — like allowing both strings and lists as param values — are taken care of by prepare_url itself, but the rest should be dealt with on our own.

Usage

To finish up, let’s use this authenticator to call Steam API and return a list of games that a given user owns but hasn’t played yet:

STEAM_API_KEY = 'a1b2c3d4e5f6g7h8i9j'  # not a real one, of course


def get_steam_backlog(steam_id):
    url = 'http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/'
    params = {
        'steamid': steam_id,
        'include_appinfo': 1,
    }

    response = requests.get(
        url, params=params, auth=QueryStringAuth(key=STEAM_API_KEY))
    games = response.json().get('response', {}).get('games', ())

    for game in games:
        if game.get('playtime_forever', 0):
            continue
        yield game['name']

We could’ve put STEAM_API_KEY directly in params, of course. Singling it out explicitly as an authentication detail, however, makes the code clearer and plays nicely with more advanced features of Requests, such as sessions.
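
As a quick illustration of the latter point, the same authenticator can also be attached once to a requests.Session, so that every request made through that session carries the API key (reusing url and params from above):

session = requests.Session()
session.auth = QueryStringAuth(key=STEAM_API_KEY)

# the API key is appended automatically to every request
response = session.get(url, params=params)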


  1. It can be said that only in this case we’re dealing with exclusively authentication, whereas the others also perform authorization. I wouldn’t quibble too much about those details. The fact that both terms are often shortened to “auth” doesn’t exactly help with distinguishing them anyway. 

  2. In fact, what AuthBase.__call__ receives is a special PreparedRequest object which contains the exact bytes that’ll be sent to the server. Most of the higher level abstractions offered by the Requests library (like form data or JSON request body) have been compiled to raw octets at this point. This is done to allow some authenticators (like OAuth) to analyze the full request payload and sign it cryptographically as a part of their flow.
