Regular expression filters in SQLAlchemy

Posted on Wed 14 October 2015 in Code • Tagged with regular expressions, SQLAlchemy, SQL, Postgres, SQLite, AST

The nice part about SQLAlchemy — an ORM library/framework for Python — is that it offers much more than just mapping of SQL relations to model classes. At its lower level, for example, there exists an additional layer between those classes and raw SQL queries: an abstraction for the SQL language itself.

Although it may sound byzantine at first, this extra layer serves two important purposes:

  • portability: compiling the same query into potentially different SQL syntaxes, as required by different database backends
  • extensibility: allowing the user to introduce new elements: operators, functions, or even whole statements

I’m going to take advantage of both of these qualities here, and show how you can implement support for regular expression operators in query filters. On supported database backends, it will allow for powerful pattern matching on string columns.

Database side

We will use the Postgres syntax of POSIX regular expression matchers as a reference. It includes four operators for case sensitive or insensitive matching, in regular or negated versions: ~, ~*, !~, and !~*.

Since it’s a common practice to use an in-memory SQLite for running database-involved “unit” tests1, we will also add support for that backend. Regular expressions are not implemented there directly, but thanks to sqlite3’s custom function facility, it’ll be easy enough to provide the necessary support in Python.

Desired API

In essence, what we want to do is to enhance the String column type with additional comparators, which will then be usable as arguments for the Query.filter method.

As an example, consider the following model class with a string column:

class Person(Model):
    nick = Column(String(255))
    # ...

Once we’re done, it should be possible to query for e.g. all the people whose nicks contain numbers in a pretty straightforward fashion:

numerics = session.query(Person).filter(Person.nick.regexp(r'\d+')).all()

Because of the limitations of the underlying database backends, only literal regular expressions will be supported. This means that, from the SQL standpoint, they have to be constants: you cannot use the value of one column as part of a regular expression that another column is matched against.

But of course, you can still construct those literals in Python code:

def get_copycats(nick):
    """Return all the people who tried to parrot given nick
    but failed and had to append some numbers to it.
    """
    nick_regexp = '^' + re.escape(nick) + r'\d+$'
    return session.query(Person).filter(Person.nick.regexp(nick_regexp)).all()

Considering that regular expressions themselves are already pretty powerful, this really ought to be sufficient for all reasonable purposes.

How do comparators work?

So how do we add the regexp method to the String type? It may seem logical to simply extend the class and add it directly:

from sqlalchemy import String as _String


class String(_String):
    """Enhanced version of the standard string type."""

    def regexp(self, value):
        # hmm...

But this won’t actually work. In SQLAlchemy, objects like String or Integer do not represent columns of a certain type; they correspond to the types themselves2. Extending them with additional methods won’t do much good, because regular code rarely operates on column types.

On the other hand, though, we obviously don’t want to mess with the Column class itself. Our additions are only meaningful for string columns, but putting them there would expose them to all columns, regardless of type!

Thankfully, there is an intermediate mechanism which SQLAlchemy introduced precisely to address the need we’ve got here. Every column type can define a comparator_factory: a kind of mixin class whose methods are incorporated into columns of that type. By overriding this inner class, we can both modify the behavior of existing operators and introduce completely new ones.

So in order to add regexp and other methods to all String columns, our new string type must define its own comparator_factory:

class String(_String):
    class comparator_factory(_String.comparator_factory):
        def regexp(self, other):
            # ...

We need to remember to derive it from the original one, too. Otherwise, all the standard operators you’d want to use in queries (==, +, etc.) would cease to work, because the new comparator_factory wouldn’t include an implementation of any of the necessary magic methods (__eq__, __add__, etc.).

SQL, abstracted

Knowing where to put our new comparator methods is certainly desirable, but the more interesting question is: how do we implement them?

Like I’ve mentioned in the beginning, SQLAlchemy employs an additional layer between ORM models and raw, textual SQL language. Basically, it’s an AST for backend-independent queries which includes almost all of the various SQL constructs, codified in a platform-agnostic way inside the sqlalchemy.sql package.

You may have used this layer directly if you ever wrote queries based on Table objects and ran them with Session.execute. But even those constructed using the more familiar Query class interface end up in this intermediate representation. Often there is little to no additional processing involved.

Arguments to the Query.filter method, for example, are already given as SQL clause objects. It just so happens that their creation is hidden behind a very neat, operator-based API.
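It’s easy to see this for yourself in an interactive session. A quick check (a sketch using the Person model from before) confirms that an innocuous comparison already yields a piece of the syntax tree rather than a boolean:

from sqlalchemy.sql.expression import BinaryExpression

# The == operator on a column produces an AST node, not a boolean.
expr = (Person.nick == 'John')
assert isinstance(expr, BinaryExpression)
print(expr)  # roughly: person.nick = :nick_1 (exact form depends on the table)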

Thus, if our regular expression filters are to cooperate, they also need to return pieces of the SQL syntax tree. Not very complicated pieces, mind you, since we only need to represent simple expressions: something like foo ~ 'bar', where ~ may optionally be replaced by one of the other three operators.

Creating the node

They are all binary operators, by the way (i.e. taking exactly two arguments), so it makes sense that the corresponding AST node class is called BinaryExpression. The node’s children are the left argument, the right argument, and the operator itself.

With a little help from a few more SQL syntax wrappers, the implementation of regexp and the other methods turns out to be quite straightforward:

from sqlalchemy.sql.expression import BinaryExpression, literal
from sqlalchemy.sql.operators import custom_op


# (inside String.comparator_factory)

def regexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('~'))

def iregexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('~*'))

def not_regexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('!~'))

def not_iregexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('!~*'))

The use of the literal function is dictated by the limitation that was mentioned earlier: any regular expression given in the query must be a SQL literal. If we now try to pass a column-like clause, we’ll get an exception right at the query definition, rather than a database error when we try to execute it.

The custom_op function, on the other hand, is simply the easiest way to create an “operator node” that’s required as a child of BinaryExpression. Since it’s a custom operator, it won’t be interpreted by SQLAlchemy in any way; it will simply be used verbatim in the final SQL string that’s sent to the database.
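One quick way to observe this verbatim treatment is to stringify the resulting expression, which compiles it against SQLAlchemy’s default dialect. Something along these lines:

print(Person.nick.regexp(r'\d+'))
# prints approximately: person.nick ~ :param_1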

Compile!

You may have noticed that this would pose a problem if said database doesn’t support ~ or the other operators, which happens to be the case for everything besides Postgres. Because we originally intended to support SQLite in addition to Postgres, this is evidently a problem.

It’s also where the portability of an intermediate SQL representation comes into play. Since our new operators return AST nodes and not just textual SQL, we can redefine the way those nodes are compiled into actual query fragments on various database backends.

To accomplish that, first we need to distinguish our regex filters from other BinaryExpressions:

class RegexMatchExpression(BinaryExpression):
    """Represents matching of a column againsts a regular expression."""

# (inside String.comparator_factory)

def regexp(self, other):
    return RegexMatchExpression(self.expr, literal(other), custom_op('~'))
# etc.

Once we’ve introduced such a distinction, it becomes possible to provide a different way for those filters to be turned into SQL. Namely, we can define a new compilation routine for them, and mark it as canonical for a specific SQL dialect:

from sqlalchemy.ext.compiler import compiles


@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    """Compile the SQL expression representing a regular expression match
    for the SQLite engine.
    """
    # ...

The function receives an AST element to process (here, the RegexMatchExpression), along with a special compiler object that controls the whole translation process. Armed with those tools, we are allowed to modify the process in arbitrary ways and output just the right SQL statement that’ll do the job in SQLite.

Regex support lite

What does such a statement look like, though? As I’ve remarked earlier, SQLite is very easy to extend with your own functions, and the sqlite3 driver used by SQLAlchemy enables us to write those functions directly in Python. Obviously, this is fantastic news when you have something like the standard re module at your disposal.

Indeed, coding the four required functions is quite trivial:

import re

# Mapping from the regular expression matching operators
# to named Python functions that implement them for SQLite.
SQLITE_REGEX_FUNCTIONS = {
    '~': ('REGEXP',
          lambda value, regex: bool(re.match(regex, value))),
    '~*': ('IREGEXP',
           lambda value, regex: bool(re.match(regex, value, re.IGNORECASE))),
    '!~': ('NOT_REGEXP',
           lambda value, regex: not re.match(regex, value)),
    '!~*': ('NOT_IREGEXP',
            lambda value, regex: not re.match(regex, value, re.IGNORECASE)),
}

What’s less apparent is how — or rather, when — to instruct the SQLite database to use them. As per the API we have to use, custom SQLite functions are created on a per-connection basis. But SQLAlchemy, like any good database interface, takes care of connection management, lifecycle, and pooling. Nowhere in our application is there a connect call (nor should there be!) that we could just follow with a few create_function invocations.

Yet, there is a way of doing exactly what we require, and it involves utilizing the event subsystem included in SQLAlchemy. Anytime something interesting happens to any of its ORM or core objects — models, sessions, connection pools, etc. — SQLAlchemy publishes an event that our application code can listen (subscribe) to. It’s a classic PubSub system that introduces some serious potential for extensibility.

Our use of it will be pretty modest, though. All we’re interested in is the establishment of a connection to the SQLite database. This translates directly to a 'connect' event of the Engine object:

from sqlalchemy import event
from sqlalchemy.engine import Engine
import sqlite3


@event.listens_for(Engine, 'connect')
def sqlite_engine_connect(dbapi_connection, connection_record):
    """Listener for the event of establishing connection to a SQLite database.

    Creates the functions handling regular expression operators
    within SQLite engine, pointing them to their Python implementations above.
    """
    if not isinstance(dbapi_connection, sqlite3.Connection):
        return

    for name, function in SQLITE_REGEX_FUNCTIONS.values():
        dbapi_connection.create_function(name, 2, function)

Note that this will catch all the connection events, so we have to verify it’s really SQLite we’re talking to. Afterwards, the creation of REGEXP, IREGEXP, etc. functions is extremely straightforward.
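To get a feel for what this gives us, here’s a minimal, standalone sqlite3 sketch (independent of SQLAlchemy) that registers one of those functions and invokes it in a query; the argument order follows the lambdas above:

import re
import sqlite3

conn = sqlite3.connect(':memory:')
conn.create_function(
    'REGEXP', 2, lambda value, regex: bool(re.match(regex, value)))

# The custom function is now callable in SQL, but only on this connection.
print(conn.execute(r"SELECT REGEXP('foo123', '\w+\d+')").fetchone())  # (1,)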

Compilation, take two

This was quite a build-up, but now we’re very close to the finale. What remains is finishing the compilation routine:

@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    pass

We know that element corresponds to an expression in the form of a ~ 'foo'. For SQLite, however, the compatible version is a function call: REGEXP(a, 'foo'). At first this may appear rather disconcerting, because it’s basically a completely different AST node to build.

But it’s actually not a problem at all. Inside compiler hooks, we are allowed to use much of the same API that’s available when drafting regular queries. This includes the func factory object which produces calls to arbitrary SQL functions. Rather than compiling the original binary expression, we’ll simply poach its operands and use them as arguments to one of the new functions:

from sqlalchemy import exc
from sqlalchemy.sql.expression import func


@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    # determine the name of a custom SQLite function to use for the operator
    operator = element.operator.opstring
    try:
        func_name, _ = SQLITE_REGEX_FUNCTIONS[operator]
    except (KeyError, ValueError) as e:
        would_be_sql_string = ' '.join((compiler.process(element.left),
                                        operator,
                                        compiler.process(element.right)))
        raise exc.StatementError(
            "unknown regular expression match operator: %s" % operator,
            would_be_sql_string, None, e)

    # compile the expression as an invocation of the custom function
    regex_func = getattr(func, func_name)
    regex_func_call = regex_func(element.left, element.right)
    return compiler.process(regex_func_call)

Notice how compiler.process is used to obtain the final result. The compiler object doesn’t care that we use it on a totally different AST node; it will dutifully carry out its compilation into raw SQL. We can even use this capability to add a little bit of paranoid error handling: if we encounter an unknown operator, the resulting error message will include fully compiled, SQL versions of both arguments.

Summary

This concludes our efforts: the original query examples with Person.nick.regexp should now work in both Postgres and SQLite. For your convenience, I’ve bundled all the code in this gist.

If you feel like tinkering with it further, I would suggest you try to remove the superfluous NOT_* functions. They make little sense given that SQL has a perfectly adequate NOT keyword. A clean solution would probably prefer an additional reverse flag in RegexMatchExpression over looking for a '!' character in the operator string.
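If you need a starting point, such a refactoring could begin roughly like this (an untested sketch; the reverse flag and the NOT wrapping are just one possible design, not part of the original code):

class RegexMatchExpression(BinaryExpression):
    """Represents matching of a column against a regular expression."""

    def __init__(self, left, right, operator, reverse=False):
        super(RegexMatchExpression, self).__init__(left, right, operator)
        self.reverse = reverse

# ...and inside the SQLite compilation routine:
#     sql = compiler.process(regex_func_call)
#     return 'NOT (%s)' % sql if element.reverse else sql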


  1. It may or may not be a good practice, though. 

  2. Although it’s possible to write Column(Integer), it’s merely a syntactical convenience. SQLAlchemy interprets it readily as Column(Integer()). Parameterized types — like String — always require explicit instantiation. 


Don’t Copy & Paste. Retype.

Posted on Sat 03 October 2015 in Programming • Tagged with problem solving, Stack Overflow, typing

In this day and age, Google and Stack Overflow are quite essential tools for any developer. Lately, though, the latter seems to be getting some bad rap. On one side, it’s because of seemingly peculiar and sometimes alienating moderation policies. But more pertinently, it’s the apparent rise of a phenomenon that’s wittily dubbed “the full Stack Overflow developer”.

In a nutshell, individuals deserving to be called that are code slingers who throw software artifacts together mostly by copying and pasting code samples found in Stack Overflow answers. They may be getting something working pretty quickly, but they also lack understanding of the problems they’re facing and the solutions they’re using so cheerily.

Of course, not every instance of code Copy Pasta is to be scorned. I’m pretty sure most people reading this post (and certainly the person writing it!) are guilty of replicating at least a few snippets from Stack Overflow, verbatim, in their own codebase. Heck, we may have even done so with nigh zero interest as to why it has been written this way. Not every technology is intrinsically fascinating, after all, and deadlines are sometimes too close for comfort.

But if so, does it mean we are gradually turning into full Stack Overflow developers?… Yikes! Surely we don’t want that to happen!

Mitigation tactic

Before you shut off your Internet connection altogether while coding, consider employing the following technique whenever you feel like scraping a piece of code from Stack Overflow, and dumping it in your project source.

Don’t use the clipboard. Don’t copy and paste. Retype the code you’ve found instead.

It’s going to take more time, yes. It’s definitely more cumbersome than just hitting Ctrl+C/Ctrl+V. It may also make little sense: if the end result is the same, why does it matter whether the code was transferred through the clipboard or not?

Rationale

I’d argue, however, that it makes perfect sense. From the least to the most important, the reasons why I think so are the following:

  • The fact that retyping is slower than copy-pasting is what actually makes it better. If you vow not to use the clipboard, you’re much less likely to just pick whatever’s the first Stack Overflow result Google has given. You’ll weigh different solutions, and you’ll be rightfully biased towards shorter and simpler ones.

  • When you type something, you cannot do it completely thoughtlessly. Whether you want it or not, you’ll absorb some of the knowledge through sheer osmosis, because the code will flow through your eyes and fingers as it’s transferred from the browser to your editor or IDE. Your subconscious brain will latch onto the bits and pieces of information, and it will sort them out for you to use later. Even if you didn’t intend to, you will most likely learn something.

  • But most importantly, what you type almost certainly won’t be a perfect copy of the original snippet. As you progress through the code, you’ll inevitably deviate from it, if only to conform to a particular style guide your project is following.
    It’s quite likely, though, that you’ll make larger changes as well. You will replace familiar patterns with calls to utility functions. You’ll rearrange the code visually for better readability. You will add comments, or extract functions to make it more self-documenting. You might even enhance and customize it, so that you can abstract and reuse it multiple times.

Afterwards, what you’ve just typed won’t be just some code you have found on the Internet. It’ll be your code.


Optional loading of RequireJS modules

Posted on Tue 29 September 2015 in Code • Tagged with JavaScript, RequireJS, modules, Web Workers, DOM, AJAX

RequireJS is a module loader for JavaScript. Similar to its alternatives such as Browserify, it tries to solve an important problem on the web front(end): dividing JavaScript code into modules for better maintainability while still loading them correctly and efficiently without manual curation of the <script> tags.

Once it’s configured correctly (which can be rather non-trivial, though), modules in RequireJS are simply defined as functions that return arbitrary JavaScript objects:

define([
    'jquery',
    'lodash',

    'myapp/dep1',
    'myapp/dep2',
], function($, _, dep1, dep2) {
    // ... all of the module's code ...

    return {
        exportedSymbol1: ...,
        exportedSymbol2: ...,
    };
});

Before executing the function, RequireJS loads all the specified dependencies, repeating the process recursively and asynchronously. Return values from module functions are passed as parameters to the next module function, and thus the whole mechanism clicks, serving as a crafty workaround for the lack of proper import functionality1.

Relative failures

If, at some point in the chain, the desired module cannot be found or loaded, the entire process grinds to a halt with an error. Most of the time, this is perfectly acceptable (or even desirable) behavior, equivalent to an incorrect import statement, invalid #include directive, or similar mistake in other languages.

But there are situations when we’d like to proceed with a missing module, because the dependent code is prepared to handle it. The canonical example is Web Workers. Unlike traditional web application code, Web Worker scripts operate outside of the context of any single page, having no access to the DOM tree (because which DOM tree would it be?). Correspondingly, they have no document nor window objects in their global scope.

Unfortunately, some libraries (*cough* jQuery *cough*) require those objects as a hard (and usually implicit) dependency. This doesn’t exactly help if we’d like to use them in worker code for other features that aren’t related to the DOM. In the case of jQuery, for example, it could be the API for making AJAX calls, which is still decidedly more pleasant than dealing with bare XMLHTTPRequest if we’re doing anything non-trivial.

Due to this hard dependency on DOM, however, Web Workers cannot require jQuery. No biggie, you may think: browsers supporting workers also offer an excellent, promise-based Fetch API that largely replaces the old AJAX, so we may just use it in worker code. Good thinking indeed, but it doesn’t solve the issue of sharing code between main (“UI”) part of the app and Web Workers.

Suppose you have the following dependency graph, where both the main application script and a Web Worker script pull in a shared common module that in turn requires jQuery:

main.js ──> myapp/common ──> jquery
worker.js ──> myapp/common

The common module has some logic that we’d want reused between regular <script>-based code and a Web Worker, but its dependency on jQuery makes it impossible. It would work, however, if this dependency was a soft one. If common could detect that jQuery is not available and fall back to other solutions (like the Fetch API), we would be able to require it in both execution environments.

The optional plugin

What we need, it seems, is an ability to say that some dependencies (like 'jquery') are optional. They can be loaded if they’re available but otherwise, they shouldn’t cause the whole dependency structure to crumble. RequireJS does not support this functionality by default, but it’s easy enough to add it via a plugin.

There are already several useful plugins available for RequireJS that offer some interesting features. As of this writing, however, optional module loading doesn’t seem to be among them. That’s not a big problem: rolling out our own2 plugin turns out to be relatively easy.

RequireJS plugins are themselves modules: you create them as separate JavaScript files with code wrapped in a define call. They can also declare their own dependencies like any other module. The only requirement is that they export an object with a certain API: at minimum, it has to include the load method. Since our optional plugin is very simple, load is in fact the only method we have to implement:

/* Skeleton of a simple RequireJS plugin module. */

define([], function() {

function load(moduleName, parentRequire, onload, config) {
    // ...
}

return {
    load: load,
};

});

As its name would hint, load carries out the actual module loading which a plugin is allowed to influence, modify, or even replace with something altogether different. In our case, we don’t want to be too invasive, but we need to detect failure in the original loading procedure and step in.

I mentioned previously that module loading is asynchronous, which in JavaScript often translates to “callbacks”. Here, load receives the onload callback which we eventually need to invoke. It also gets the mysterious parentRequire argument; this is simply a regular require function that’d normally be used if our plugin didn’t stand in the way.

Those two are the most important pieces of the puzzle, which overall has a pretty succinct solution:

/**
 * RequireJS plugin for optional module loading.
 */
define([], function() {


/** Default value to return when a module failed to load. */
var DEFAULT = null;

function load(moduleName, parentRequire, onload) {
    parentRequire([moduleName], onload, function (err) {
        var failedModule = err.requireModules && err.requireModules[0];
        console.warn("Could not load optional module: " + failedModule);
        requirejs.undef(failedModule);

        define(failedModule, [], function() { return DEFAULT; });
        parentRequire([failedModule], onload);
    });
}

return {
    load: load,
};

});

The logic here is as follows:

  • First, try to load the module normally (via the outer parentRequire call).
  • If it succeeds, onload is called and there is nothing for us to do.
  • If it fails, we log the failedModule and cleanup some internal RequireJS state with requirejs.undef.
  • Most importantly, we define the module as a trivial shim that returns some DEFAULT (here, null).
  • As a result, when we require it again (through the inner parentRequire call), we know it’ll be loaded successfully.

Usage

Plugins in RequireJS are invoked on a per-module basis. You can specify that a certain dependency 'bar' shall be loaded through a plugin 'foo' by putting 'foo!bar' on the dependency list:

define([ 'foo!bar'], function(bar) {
    // ...
});

Both 'foo' and 'bar' represent module paths here: the first one is the path to the plugin module, while the second one is the actual dependency. In a more realistic example — like when our optional loader is involved — both of them would most likely be multi-segment paths:

define([
    'myapp/ext/require/optional!myapp/common/buttons/awesome-button',
], function(AwesomeButtonController) {
    // ...
});

As you can see, they can get pretty unreadable rather quickly. It would be better if the plugin prefix consisted of just one segment (i.e. optional!) instead. We can make that happen by adding a mapping to the RequireJS config:

requirejs.config({
    // ...
    map: {
        '*': {
            'optional': 'myapp/ext/require/optional',
        }
    }
})

With this renaming in place, the loading of non-mandatory dependencies becomes quite a bit clearer:

define([
    'optional!myapp/common/buttons/awesome-button',
], function(AwesomeButtonController) {

// ...
if (!AwesomeButtonController) {
    // ... (some work around) ...
}

});

Of course, you still need to actually code around the potential lack of an optional dependency. The if statement above is just an illustrative example; you may find it more sensible to provide some shim instead:

AwesomeButtonController = AwesomeButtonController || function() {
    // ...
};

Either way, I recommend trying to keep the size of such conditional logic to a minimum. Ideally, it should be confined to a single place, or — better yet — abstracted behind a function.


  1. An actual import statement has made it into the ES6 (ECMAScript 2015) standard but, as of this writing, no browser implements it. 

  2. Most of the code for the plugin presented here is based on this StackOverflow answer. 


Reload Jinja macro files

Posted on Thu 24 September 2015 in Code • Tagged with Python, Jinja, Flask, Werkzeug, Flask-Script

The integration between Jinja templating engine and the Flask microframework for Python is quite seamless most of the time. It also facilitates rapid development: when we run our web application through Flask’s development server, any changes in Python code will get picked up immediately without restarting it. Similarly, Jinja’s default template loader will detect whether any template has been modified after its last compilation and recompile it if necessary.

One instance when it doesn’t seem to work that well, though, is template files that only contain Jinja macros. They apparently aren’t subject to the same caching rules that apply to regular templates. Modifications to those files alone may not be picked up by the Jinja Environment, causing render_template calls in Flask to (indirectly) use their stale versions.

I’ll be watching you

This problem can be alleviated by making the server watch those macro files explicitly, not unlike the Python sources it already monitors. When running a Flask server through the Flask.run method, you can pass additional keyword arguments which aren’t handled by Flask itself, but by the underlying WSGI scaffolding called Werkzeug. The automatic reloader is actually part of it and is quite configurable. Most importantly, it allows passing a list of extra_files= to watch:

import os

app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 5000)),
        extra_files=[os.path.join(app.root_path, app.template_folder, 'macros.html')])

But of course it’s tedious and error-prone to keep this list up to date manually, so let’s try to do better.

…all of you

I’m going to assume you’re keeping all your Jinja macro files inside a single directory. This makes sense, as macros are reusable pieces of template code that are imported by multiple regular templates, so they shouldn’t be scattered around the codebase without some order. The folder may be named macros, util, common, or similar; what’s important is that all the macros have a designated place in the project source tree.
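For instance (the names below are purely illustrative), the assumed layout could look like this:

templates/
    macros/
        forms.html
        nav.html
    index.html
    layout.html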

Under this assumption, it’s not difficult to programmatically compute their complete list1:

from pathlib import Path

from myapplication import app  # Flask application object


#: Directories under app's template folder that contain files
#: with only Jinja macros.
JINJA_MACRO_DIRS = ('macros',)


def get_extra_files():
    """Returns an iterable of the extra files that the Flask server
    should monitor for changes and reload the app if any has been modified.
    """
    app_root = Path(app.root_path)

    for dirname in JINJA_MACRO_DIRS:
        macros_dir = app_root / app.template_folder / dirname
        for filepath in macros_dir.rglob('*.html'):
            yield str(filepath)

If you’re using Flask blueprints, in addition to the global application templates you’d also want to monitor any blueprint-specific macro files:

        for bp in (app.blueprints or {}).values():
            macros_dir = Path(bp.root_path) / bp.template_folder / dirname
            for filepath in macros_dir.rglob('*.html'):
                yield str(filepath)

A new start

How you’re going to use the get_extra_files function from the snippet above depends on how exactly you’re running the Flask development server. In the simplest case, when Flask.run is invoked at the top-level scope of some module, it’s pretty obvious:

app.run(..., extra_files=list(get_extra_files()))

More realistically, though, you may be using the Flask-Script extension for all management tasks, including starting the server. If so, calling Flask.run will be relegated to it, so you’ll need to override the Server command to inject additional parameters there:

import os

from flask.ext.script import Manager, Server as _Server

from myapplication import app


class Server(_Server):
    def __init__(self, *args, **kwargs):
        if kwargs.get('use_reloader', True):
            kwargs.setdefault('extra_files', []).extend(get_extra_files())
        super(Server, self).__init__(*args, **kwargs)


manager = Manager(app, with_default_commands=False)

server = Server(port=int(os.environ.get('PORT', 5000)))
manager.add_command('server', server)

This new server command should work just like the default one provided by Flask-Script out of the box, except that all your Jinja macro files will now be monitored for changes.


  1. Like before, I’m using the pathlib module, for which there is a backport if you’re using Python 3.3 or older. 


Purging local Memcached

Posted on Sat 19 September 2015 in Code • Tagged with memcache, Memcached, expect, caching, scripting, Tcl

Memcached is a ubiquitous caching solution, most commonly used to speed things up in web application backends. You deploy it as a separate binary and have your application servers talk to it before querying a database, calling a third party API over HTTP, or performing some other time-consuming I/O operation. Since it stores data in memory, as its name obviously suggests, it can be orders of magnitude faster than anything that may have to hit a spinning disk (like a database) or an unreliable, external network.

From the point of view of a developer, using Memcached is pretty simple. There exist numerous libraries that wrap its protocol in a neat, language-appropriate API. It does put another requirement on the development environment, of course: you need to have a working memcached daemon in the background if you want the local server to hit code paths where it retrieves data from memcache1. Thankfully, it’s an extremely popular piece of software, present in basically all package repositories, so having it up and running is just one apt-get install or brew install away.

How to flush

It’s a little dated and finicky piece of software, too. Once you’ve used its local instance for a while, there comes a time when you’d like to purge all its contents and have your server(s) fill it up again. Rather than restarting the daemon (which is a system-specific procedure that may require root privileges), you’d like to just issue the flush_all command to it.

This should be easy. Unlike Redis, however, memcached doesn’t come with a dedicated CLI client. In theory, it doesn’t have to, because you can talk to it over just telnet:

$ telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
flush_all
OK
^]

telnet> quit
Connection closed.

But going through this rigmarole every time we want to flush memcache gets a bit tedious after a while. An obvious attempt at automating it will, however, fail:

$ echo "flush_all\n" | telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

Turns out that you cannot just bombard memcached with flush_all (or any other command) while the connection is being established. You can of course try adding a sleep to allow it some time, but this is inherently unreliable, especially when connecting to remote servers.

The most successful approach I’ve managed to find is to automate not the raw data exchange, but the interactive telnet session itself. By that I mean invoking a telnet command, observing its output and reacting to it — just like a human would do, but in a scripted, automated way.

Nobody expects the Telnet instrumentation

There is a dedicated Unix program for scripting just this kind of CLI interactions. It’s decidedly more obscure than mere echo or pipes, but should nonetheless be available for any modern (or ancient) Unix system, from Linuxes to OSX. Its name is expect and — if installed2 — the usual place you can find it is /usr/bin or /usr/local/bin.

expect accepts commands and scripts written in Tcl — a scripting language whose syntax looks a bit like a mix of PHP and Objective-C. It’s a fully featured language with all the usual programming constructs you’d normally want, but expect enhances it further with instructions dedicated to spawning external processes, reading their output, and sending input in response.

In fact, the main commands of the expect program are spawn, expect, and send. Mashing them together, we can easily nail the crucial part of our memcache-purging script:

#!/usr/bin/expect

spawn telnet localhost 11211;
expect "Connected to localhost";
send "flush_all\n";

This performs the actual flush but the telnet session/process is kept open afterwards. Once the expect program ends (after finishing our script), that process may get orphaned and hog system resources. To prevent that, we should behave like a good Unix citizen and wait for our child processes to terminate:

wait;

But obviously, this will never happen, because we’ve never instructed the telnet process to end!

Escape artist

Your first instinct may be to send ^C (SIGINT) or ^D (end-of-file) to the telnet process but this won’t work either. Everything we type inside an active Telnet session is sent to the remote server, and that includes those two key chords.

Authors of the telnet program were of course aware of this problem. As a countermeasure, they introduced the concept of an escape character that allows the user to control the Telnet connection itself — including, but not limited to, its termination. When encountered in the input stream, the escape character causes telnet to temporarily suspend its normal operation, “escaping” from the client-server session to a simple command shell of the telnet program itself. (This is indicated by the telnet> prompt).

The default escape character is ^], which can be easily typed on a keyboard using the key combination Ctrl+]. The expect program, however, only simulates typing real characters. Coupled with some syntactical quirks of the Tcl language, this makes the usage of ^] as an escape character rather cumbersome, to say the least.

Fortunately, this default can be changed rather easily with an -e flag to the telnet application. After setting it to something more common but still outside of the memcache protocol — such as the exclamation mark — we are now able to send it without an issue:

#!/usr/bin/expect

set escapechar !;

spawn telnet -e $escapechar localhost 11211;
expect "Connected to localhost";
send "flush_all\n";

send $escapechar;
expect "telnet>";
send "quit\n";
wait;

The script will also terminate despite waiting for the telnet process to finish, which means everything has been shut down gracefully.

Production-ready

As a final touch, it’d be nice to make our solution work with any memcache server and not just localhost. You can see it done in this gist: the script accepts two optional arguments (host & port) and uses them when spawning the telnet process.
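Assuming you save the script as, say, purge-memcache.expect and mark it executable (both the name and the address below are just examples), flushing a remote instance then becomes a one-liner:

$ ./purge-memcache.expect memcache.example.com 11211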


  1. It probably goes without saying that any value in the memcache, as well as the entire memcached server (or servers), must be treated as potentially unavailable at any time. A properly written application server should still be able to service all requests — if only a little slower — without any (mem)caching at all. 

  2. On all distros, as well as in Homebrew for OSX, it should be available as a package with a totally uncreative name of expect. 


mailto: URLs in JavaScript

Posted on Tue 15 September 2015 in Code • Tagged with JavaScript, email, URL

Though not as popular as back in the day, mailto: URLs are sometimes still the best way — and most certainly the easiest — to enable users to send emails from a web application. Typically, they are used in plain regular links that are created with the <a> tag:

<a href="mailto:john.doe@example.com">Send email</a>

Depending on the operating system and browser settings, clicking on such a link could invoke a different mail client. In the past, this could cause much confusion if the user had never used their local email application (e.g. Outlook), but nowadays browsers are increasingly doing the right thing and presenting a choice of possible options. This includes the various web applications from popular email providers. The user can thus choose — say — Gmail, and have all subsequent mailto: links take them to a Compose Mail view there.

Given those developments, it’s possible that soon those URLs won’t be as quaint as they used to be ;) In any case, I think it’s worthwhile to have a few mailto: tricks up our sleeve for when they turn out to be an appropriate solution for the task at hand.

Triggering them in JavaScript

In the example above, I’ve used a mailto: URL directly in some HTML markup. For the most part, however, those URLs behave like any other and support a lot of the same functionality that http:// URLs do:

function sendEmail(address) {
    window.location.href = 'mailto:' + address;
}

Predictably, calling this sendEmail function is equivalent to clicking on a mailto: link. This additional level of indirection, however, may be enough to fool some web crawlers that look for email addresses to spam. Its effectiveness depends largely on how radically we’re going to obfuscate the address argument in the actual invocation.

More importantly, though, a function like that can be part of some more complicated logic. The only limitation is the same one that restricts where and when it’s possible to window.open another browser window. Namely, it has to be called in a direct response to a UI action triggered by the user (a button click would be a good example).

Customization

In their simplest form, mailto: URLs aren’t terribly interesting, since their only parameter is the address of a sole recipient. Fortunately, this isn’t where their capabilities end: it’s possible to customize other email fields as well, such as the subject or even body1.

Somewhat surprisingly, all this can be done in a manner that’s more typical for http:// URLs: with query string parameters. Right after the recipient’s address, we can drop a question mark and add some ampersand-separated key-value pairs:

<a href="mailto:john.doe@example.com?subject=Hi&amp;body=Hello%20John!">
  Say hello to John!
</a>

A more programmatic approach is of course also possible. One thing we need to remember, though, is the proper escaping of input values in order to make them safe to put inside a URL2. The encodeURIComponent function can be used for this purpose:

function getMailtoUrl(to, subject, body) {
    var args = [];
    if (typeof subject !== 'undefined') {
        args.push('subject=' + encodeURIComponent(subject));
    }
    if (typeof body !== 'undefined') {
        args.push('body=' + encodeURIComponent(body));
    }

    var url = 'mailto:' + encodeURIComponent(to);
    if (args.length > 0) {
        url += '?' + args.join('&');
    }
    return url;
}

In theory, almost any of the numerous email headers can be specified this way. However, support may vary across browsers, especially since some of the headers (like the Bcc: one) are subject to certain security or privacy considerations. Personally, I wouldn’t recommend relying on anything besides subject= and body=, with a possible addition of cc= for the Cc: header.


  1. By RFC 2368, only text/plain email body can be specified this way. 

  2. Note that this is unrelated to the HTML escaping of & character as &amp; entity in the previous example. 


Query string authentication in Requests

Posted on Fri 11 September 2015 in Code • Tagged with Python, Requests, HTTP, authentication

Requests is a widely used Python library that provides a nicer API for writing HTTP clients than the standard urllib2 module does. It deals with authentication in an especially concise way: through a simple auth= argument, rather than a separate password manager & authentication handler or other such nonsense.

There are several possible ways to authenticate an HTTP call with Requests, and it’s pretty easy to implement our own approach if the server requires it. All the built-in ways, however, as well as the examples of custom implementations, are heavily biased towards using HTTP headers to transmit the necessary credentials (such as username/password or some kind of opaque token).

Non-standard auth

This is actually quite reasonable: the most popular authentication methods, including OAuth 1.0 & 2.0, use HTTP headers either primarily or exclusively.

Not every HTTP API follows this convention, though. Sometimes, credentials are put in other parts of the request, commonly the URL itself. It may seem like a bad idea at first but it can also be perfectly acceptable: credentials don’t have to expose secrets of any particular user of the remote system.

Steam API is a good example here. Calling any of its endpoints requires providing an API key but it grants no special rights to access data of any particular Steam user. All the information it returns is already visible on their public profile1.

For those special authentication schemes, Requests necessitates rolling out our own implementation. Thankfully, doing so is mostly pretty straightforward.

Simple example

All Requests’ authenticators are callable objects inheriting from requests.auth.AuthBase class. Writing your own is hence a matter of defining a subclass of AuthBase with at least a __call__ method:

import requests
from requests.auth import AuthBase


class SillyAuth(AuthBase):
    def __call__(self, request):
        request.headers['X-ID'] = 'im valid so auth is yes'
        return request

# usage
requests.get('http://example.com', auth=SillyAuth())

The job of an authenticator is to modify the request so that it includes appropriate credentials in whatever form necessary to have them accepted by the remote server. Like I’ve mentioned before, HTTP headers are the most common option, but the request can be modified in other ways as well.

Query string parameters

One problem with modifying a query string, though, is that it’s a part of the request URL. By the time it reaches authenticators, the Requests library has already merged any additional query params into it2. Including more params will thus require modifying the URL.

Though it may sound like a risky endeavour involving string manipulations that are fraught with security issues, it’s not really that bad at all. In fact, the Requests library provides an API to do exactly this:

class QueryStringAuth(AuthBase):
    """Authenticator that attaches a set of query string parameters
    (e.g. an API key) to the request.
    """
    def __init__(self, **params):
        self.params = params

    def __call__(self, request):
        if self.params:
            request.prepare_url(request.url, self.params)
        return request

Albeit scantily documented, the prepare_url method will take an existing URL and a dictionary of query string params, and outfit the request with a brand new URL that contains those params neatly encoded.

The full implementation of QueryStringAuth is a little more involved than the snippet above, because we’d like to replicate all the idiosyncrasies of how the regular Requests API handles query string params. Some of them — like allowing both strings and lists as param values — are taken care of by prepare_url itself, but the rest should be dealt with on our own.

Usage

To finish up, let’s use this authenticator to call the Steam API and return a list of games that a given user owns but hasn’t played yet:

STEAM_API_KEY = 'a1b2c3d4e5f6g7h8i9j'  # not a real one, of course


def get_steam_backlog(steam_id):
    url = 'http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/'
    params = {
        'steamid': steam_id,
        'include_appinfo': 1,
    }

    response = requests.get(
        url, params=params, auth=QueryStringAuth(key=STEAM_API_KEY))
    games = response.json().get('response', {}).get('games', ())

    for game in games:
        if game.get('playtime_forever', 0):
            continue
        yield game['name']

We could’ve put STEAM_API_KEY directly in params, of course. Singling it out explicitly as an authentication detail, however, makes the code clearer and plays nicely with more advanced features of Requests, such as sessions.
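A Session, for instance, can carry the authenticator so that every request made through it is authenticated automatically. A possible sketch, reusing url and params from the snippet above:

session = requests.Session()
session.auth = QueryStringAuth(key=STEAM_API_KEY)

# No auth= argument is needed anymore for requests made through this session.
response = session.get(url, params=params)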


  1. It can be said that only in this case we’re dealing with exclusively authentication, whereas the others also perform authorization. I wouldn’t quibble too much about those details. The fact that both terms are often shortened to “auth” doesn’t exactly help with distinguishing them anyway. 

  2. In fact, what AuthBase.__call__ receives is a special PreparedRequest object which contains the exact bytes that’ll be sent to the server. Most of the higher level abstractions offered by the Requests library (like form data or JSON request body) have been compiled to raw octets at this point. This is done to allow some authenticators (like OAuth) to analyze the full request payload and sign it cryptographically as a part of their flow. 


Automatic error pages in Flask

Posted on Tue 08 September 2015 in Code • Tagged with Flask, Python, HTTP, errors

In Flask, a popular web framework for Python, it’s pretty easy to implement custom handlers for specific HTTP errors. All you have to do is to write a function and annotate it with the @errorhandler decorator:

@app.errorhandler(404)
def not_found(error):
    return render_template('errors/404.html')

Albeit simple, the above example is actually very realistic. There’s rarely anything else to do in response to a serious request error than to send back some HTML with an appropriate message. If you have more handlers like that, though, you’ll notice they get pretty repetitive: for each erroneous HTTP status code (404, 403, 400, etc.), pick a corresponding template and simply render it.
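Spelled out, that repetition looks more or less like this:

@app.errorhandler(403)
def forbidden(error):
    return render_template('errors/403.html')


@app.errorhandler(400)
def bad_request(error):
    return render_template('errors/400.html')

# ...and so on, one nearly identical function per status code.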

Which is, of course, why we’d want to deal with all of this in a smarter way and with less boilerplate.

Just add a template

Ideally, we would like to avoid writing any Python code at all for each individual error handler. Since all we’re doing revolves around predefined templates, let’s just define the handlers automatically based on the HTML files themselves.

Assuming we store them in the same template directory — say, errors/ — and name their files after numeric status codes (e.g. 404.html), getting all those codes is quite simple1:

from pathlib import Path, PurePath

from mywebapp import app


#: Path to the directory where HTML templates for error pages are stored.
ERRORS_DIR = PurePath('errors')


def get_supported_error_codes():
    """Returns an iterable of HTTP status codes for which we have
    a custom error page templates defined.
    """
    error_templates_dir = Path(app.root_path, app.template_folder, ERRORS_DIR)

    potential_error_templates = (
        entry for entry in error_templates_dir.glob('*.html')
        if entry.is_file())
    for template in potential_error_templates:
        try:
            code = int(template.stem)  # e.g. 404.html
        except ValueError:
            pass  # could be some base.html template, or similar
        else:
            if code < 400:
                app.logger.warning(
                    "Found error template for non-error HTTP status %s", code)
                continue
            yield code

Once we have them, we can try wiring up the handlers programmatically and making them do the right thing.

One function to handle them all

Although I’ve used the plural here, in reality we only need a single handler, as it’ll be perfectly capable of dealing with any HTTP error we bind it to. To help distinguish between the different status codes, Flask by default invokes our error handlers with an HTTPException argument that has the code as an attribute:

from flask import render_template
from jinja2 import TemplateNotFound


def error_handler(error):
    """Universal handler for all HTTP errors.

    :param error: :class:`~werkzeug.exceptions.HTTPException`
                  representing the HTTP error.

    :return: HTTP response to be returned
    """
    code = getattr(error, 'code', None)
    if not code:
        app.logger.warning(
            "Got spurious argument to HTTP error handler: %r", error)
        return FATAL_ERROR_RESPONSE

    app.logger.debug("HTTP error %s, rendering error page", code)
    template = ERRORS_DIR / ('%s.html' % code)
    try:
        return render_template(str(template)), code
    except TemplateNotFound:
        # shouldn't happen if the handler has been wired up properly
        app.logger.fatal("Missing template for HTTP error page: %s", template)
        return FATAL_ERROR_RESPONSE

#: Response emitted when an error occurs in the error handler itself.
FATAL_ERROR_RESPONSE = ("Internal Server Error", 500)

Catching the TemplateNotFound exception is admittedly a little paranoid here, as it can occur pretty much exclusively due to a programming error elsewhere. Feel free to treat it as a failed assertion about the application’s internal state and e.g. convert it to AssertionError if desirable.

The setup

The final step is to actually set up the handler(s):

for code in get_supported_error_codes():
    app.errorhandler(code)(error_handler)

It may look a bit wonky, but it’s just a simple desugaring of the standard Python decorator syntax from the first example. There exists a more direct approach of putting the handler inside the app.error_handler_spec dictionary, but it is discouraged by the Flask documentation.

Where to put the above code, though? I posit it’s perfectly fine to place it in the module’s global scope, because the error handlers (and other request handlers) are traditionally defined at import time anyway.

Also notice that the default error handling we’ve defined here doesn’t preclude more specialized variants for select HTTP codes. All you have to do is ensure that your custom @app.errorhandler definition occurs after the above loop.
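As an example, a more specialized 404 handler defined after the loop would simply take precedence (a sketch; the logging is purely illustrative):

from flask import request


@app.errorhandler(404)
def not_found(error):
    # Registered after the loop, so it overrides the generic handler for 404.
    app.logger.info("Page not found: %s", request.url)
    return error_handler(error)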


  1. Note that I’m using the pathlib module here — and you should, too, since it’s all around awesome. However, it is only a part of the standard library since Python 3.4, so you will most likely need to get the backport first.