Source code of a Python lambda

Posted on Tue 19 April 2016 in Code • Tagged with Python, functions, AST, bytecode

…or: The Most Hideous Hack I’ve (Almost) Done

In callee, the argument matcher library for Python that I released recently, there is this lovely TODO note for a seemingly simple feature. When using the Matching construct with a simple lambda predicate:

mock_foo.assert_called_with(Matching(lambda x: x % 2 == 0))

it would be great to see its code in the error message if the assertion fails. Right now it’s just going to say something like <Matching <function <lambda> at 0x7f5d8a06eb18>>. Provided you don’t possess a supernatural ability of dereferencing pointers in your head, this won’t give you any immediate hint as to what went wrong. Wouldn’t it be nice if it read as, say, <Matching \x: x % 2 == 0> instead?1

So I thought: why not try and implement such a mechanism? This is Python, after all — a language where you can spawn completely new classes at runtime, walk the stack backwards (or even forward) and read the local variables, or change the behavior of the import system itself. Surely it would be possible — nay, easy — to get the source code of a short lambda function, right?

Boy, was I wrong.

Make no mistake, though: the task turned out to be absolutely doable, at least in the scope I wanted it done. But what would you think of a solution that involves not just the usual Python hackery, but also AST inspection, transformations of the source code as text, and bytecode shenanigans?…

The code, all the code, and… much more than the code

Let’s start from the beginning, though. Here’s a short lambda function, the kind of which we’d like to obtain the source code of:

is_even = lambda x: x % 2 == 0

If the documentation for the Python standard library is to be believed, this should be pretty easy. The inspect module has a function called, straightforwardly enough, getsource. For our purposes, however, getsourcelines is a little more convenient, because we can easily tell when the lambda is too long:

import inspect

def get_short_lambda_source(lambda_func):
    try:
        source_lines, _ = inspect.getsourcelines(lambda_func)
    except IOError:
        return None
    if len(source_lines) > 1:
        return None  # we only deal with single-line lambdas
    return source_lines[0].strip()

Of course, if you’ve programmed in Python for any longer period of time, you know very well that the standard docs are not to be trusted. And it’s not just that the except clause should also include TypeError, which gets thrown when you try to pass any of the Python builtins to getsourcelines.

More important is the ambiguity of what “source lines for an object” actually means. “Source lines containing the object definition” would be much more accurate, and this seemingly small distinction is rather crucial here. Passing a lambda function to either getsourcelines or getsource, we’ll get its source and everything else that the returned lines happen to include.

That’s right. Say hello to the complete is_even = assignment, and the entire assert_called_with invocation! And in case you are wondering: yes, the result will also include any end-of-line comments. No token left behind!
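To see this in action, here’s a small demonstration; the temporary module file (with a hypothetical name and an end-of-line comment thrown in) is needed because inspect can only read source that lives in an actual .py file:

```python
import importlib.util
import inspect
import tempfile

# inspect can only read source that exists in a real .py file,
# so we write a module containing our lambda to a temporary one.
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write("is_even = lambda x: x % 2 == 0  # including this comment!\n")
    path = f.name

spec = importlib.util.spec_from_file_location("lambda_demo", path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

source_lines, _ = inspect.getsourcelines(module.is_even)
print(source_lines[0].rstrip())
# -> is_even = lambda x: x % 2 == 0  # including this comment!
```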

Trim left

Clearly this is more than we’ve bargained for. Maybe there is a way to strip away the unnecessary cruft? Python does know how to parse itself, after all: the standard ast module is a manifestation of this knowledge. Perhaps we can use it to retrieve the lambda AST node in order to turn it — and just it — back into Python code?…

import ast

def get_short_lambda_ast_node(lambda_func):
    source_text = get_short_lambda_source(lambda_func)
    if source_text:
        source_ast = ast.parse(source_text)
        return next((node for node in ast.walk(source_ast)
                     if isinstance(node, ast.Lambda)), None)

But as it turns out, getting the source text back this way is only mostly possible.

See, every substantial AST node — which is either an expression (ast.expr) or a statement (ast.stmt) — has two common attributes: lineno and col_offset. When combined, they point to a place in the original source code where the node was parsed from. This is how we can find out where to look for the definition of our lambda function.
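A quick sketch of those attributes at work, on the assertion from the beginning of this post:

```python
import ast

call = "mock_foo.assert_called_with(Matching(lambda x: x % 2 == 0))"
tree = ast.parse(call)
node = next(n for n in ast.walk(tree) if isinstance(n, ast.Lambda))

# (lineno, col_offset) tell us where the lambda definition starts...
print(node.lineno, node.col_offset)  # -> 1 37

# ...but nothing tells us where it ends:
print(call[node.col_offset:])  # -> lambda x: x % 2 == 0))
```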

Looks promising, right? The only problem is we don’t know when to stop looking. That’s right: nodes created by ast.parse are annotated with their start offset, but not with their length or end offset. As a result, the best we can do when it comes to carving out the lambda source from the very first example is this:

lambda x: x % 2 == 0))

So close! Those hanging parentheses are evidently just taunting us, but how can we remove them? lambda is basically just a Python expression, so in principle it can be followed by almost anything. This is doubly true for lambdas inside the Matching construct, as they may be a part of some larger mock assertion:

mock_foo.assert_called_with(Matching(lambda x: x % 2 == 0), Integer() & GreaterThan(42))

Here, the extraneous suffix is the entirety of ), Integer() & GreaterThan(42)), quite a lot more than just )). And that’s of course nowhere near the limit of possibilities: for one, there may be more lambdas in there, too!

Back off, slowly

It seems, however, that there is one thing those troublesome tails have in common: they aren’t syntactically valid.

Intuitively, a lambda node nested within some other syntactical constructs will have their closing fragments (e.g. )) appear somewhere after its end. Without the corresponding openings (e.g. Matching(), those fragments won’t parse.

So here’s the crazy idea. What we have is invalid Python, but only because of some unspecified number of extra characters. How about we just try and remove them, one by one, until we get something that is syntactically correct? If we are not mistaken, this will finally be our lambda and nothing else.

Fortune favors the brave, so let’s go ahead and try it:

# ... continuing get_short_lambda_source() ...

source_text = source_lines[0].strip()
lambda_node = get_short_lambda_ast_node(lambda_func)

lambda_text = source_text[lambda_node.col_offset:]
min_length = len('lambda:_')  # shortest possible lambda expression
while len(lambda_text) > min_length:
    try:
        ast.parse(lambda_text)
        return lambda_text
    except SyntaxError:
        lambda_text = lambda_text[:-1]
return None

Considering that we’re basically taking lessons from the dusty old tomes in the Restricted Section of Hogwarts library, the magic here looks quite simple. As long as there is something that can pass for a lambda definition, we try to parse it and see if it succeeds. The line that says except SyntaxError: is obviously not something for the faint of heart, but at least we are specifying what exception we anticipate catching.

And the kicker? It works. By that I mean it doesn’t return garbage results for a few obvious and not so obvious test cases, which is already more than you would normally expect from hacks of this magnitude. All the lambdas defined until this paragraph, for example, can have their source code extracted without issue.

Just one more thing

So… victory? Not quite. Astute readers may recall my promise of some bytecode arcana, and now’s the time for it.

Despite the initial success of our gradual character-dropping approach, there are cases where it doesn’t produce the correct result. Consider, for example, a lambda definition that’s nestled within a tuple2:

>>> x = lambda _: True, 0
>>> get_short_lambda_source(x[0])
lambda _: True, 0

We would of course expect the result to be lambda _: True, without a comma or zero.

Unfortunately, here’s where our earlier assumption fails rather spectacularly. The line of code carved out with the AST offsets is syntactically valid even with the extra characters. As a result, ast.parse succeeds too early and returns an incorrect definition: it suggests a lambda that returns a tuple, when really it was a lambda contained within a tuple.
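It’s easy to verify that the extracted line already parses cleanly as a whole, namely as a tuple literal, so the character-trimming loop accepts it on the very first try:

```python
import ast

# The whole extracted line is valid Python on its own:
tree = ast.parse("lambda _: True, 0")
node = tree.body[0].value
print(type(node).__name__)  # -> Tuple
```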

You may say that this is the sharp end of a narrow edge case, and that anyone who defines functions like that deserves all the trouble they get. Sure, I wouldn’t mind if we just threw our hands in the air and told them we’re simply unable to retrieve the source here. But in my opinion, that doesn’t justify serving them obviously wrong results!

A halting problem

Not if we can help it, anyway. Have a look at the expected source code and the one we’ve extracted, side by side:

lambda _: True
lambda _: True, 0

The second line isn’t just longer: it is also doing more. It isn’t just defining a lambda; it defines it, conjures up a constant 0, and then packs them both into a tuple. That’s at least two additional steps compared to the original.

Those steps have a more precise name, too: they are the bytecode instructions. Every piece of Python source is compiled to a binary bytecode before it’s executed, because the interpreter can only work with this representation. Compilation typically happens when a Python module is first imported, producing a .pyc file corresponding to its .py file. Subsequent imports will simply reuse the cached bytecode.
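Those extra steps can actually be counted with the standard dis module. A small sketch, using the built-in compile function to turn both snippets into code objects (exact instruction counts vary between Python versions, hence no hard numbers):

```python
import dis

just_lambda = compile("lambda _: True", "<demo>", "eval")
with_tuple = compile("lambda _: True, 0", "<demo>", "eval")

# The tuple version needs extra instructions to load the constant 0
# and pack both values together:
print(len(list(dis.get_instructions(just_lambda))))
print(len(list(dis.get_instructions(with_tuple))))
```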

Moreover, any function or class object has its bytecode accessible (read-only) at runtime. There is even a dedicated data type to hold it — called simply code — with a buffer of raw bytes under one of its attributes.

Finally, the bytecode compiler itself is also available to Python programs as a built-in compile function. You don’t see it used as often as its counterparts eval and exec (which hopefully are a rare sight themselves!), but it taps into the same internal machinery of Python.
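A minimal illustration of compile at work, showing that it hands back the very same kind of code object that every function carries in its __code__ attribute:

```python
# compile() produces a code object with raw bytecode in co_code:
code = compile("x % 2 == 0", "<string>", "eval")
print(type(code).__name__)          # -> code
print(type(code.co_code).__name__)  # -> bytes

# The result can be executed later, e.g. with eval():
print(eval(code, {"x": 4}))  # -> True
```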

So how does it all add up? The idea is, basically, to cross-check the alleged source code of the lambda with its own bytecode. Any junk that’s still left to trim — even if syntactically valid — will surface as a divergence after compilation. Thus we can simply continue dropping characters until the bytecodes match:

lambda_text = source_text[lambda_node.col_offset:]
lambda_body_text = source_text[lambda_node.body.col_offset:]
min_length = len('lambda:_')  # shortest possible lambda expression
while len(lambda_text) > min_length:
    try:
        code = compile(lambda_body_text, '<unused filename>', 'eval')
        if len(code.co_code) == len(lambda_func.__code__.co_code):
            return lambda_text
    except SyntaxError:
        pass
    lambda_text = lambda_text[:-1]
    lambda_body_text = lambda_body_text[:-1]
return None

Okay, maybe not the exact bytes3, but stopping at the identical bytecode length is good enough a strategy. As an obvious bonus, compile will also take care of detecting syntax errors in the candidate source code, so we don’t need the ast parsing anymore.
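Here is a condensed, self-contained sketch of that loop applied to another lambda-in-a-tuple case, with the AST offset bookkeeping inlined by hand (the source strings are written out literally for brevity):

```python
# a lambda hiding inside a tuple, like the problematic example above
pair = lambda x: x + 1, 0

# what inspect and the AST offsets would hand us for this line:
lambda_text = "lambda x: x + 1, 0"
lambda_body_text = "x + 1, 0"

result = None
while len(lambda_text) > len('lambda:_'):
    try:
        code = compile(lambda_body_text, '<unused filename>', 'eval')
        # stop only once the candidate's bytecode length matches
        # that of the actual lambda object
        if len(code.co_code) == len(pair[0].__code__.co_code):
            result = lambda_text
            break
    except SyntaxError:
        pass
    lambda_text = lambda_text[:-1]
    lambda_body_text = lambda_body_text[:-1]

print(result)  # -> lambda x: x + 1
```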

That escalated quickly!

Believe it or not, there aren’t any more objections to this solution. You can view it in its glorious entirety by looking at this gist.

Does it mean it is also making its cameo in the callee library?…

No, I’m afraid not.

Normally, I’m not the one to shy away from, ahem, bold solutions to tough problems. But in this case, the magnitude of hackery required is just too great, the result not satisfactory enough, the feature’s priority isn’t really all that high, and the maintenance burden it’d introduce is most likely too large.

In the end, it was great fun figuring it out: yet another example of how you can fiddle with Python to do basically anything. Still, we must not get so preoccupied with whether or not we can that we forget to ask whether we should.

  1. Backslash (\) is how lambda functions are denoted in Haskell. We want to be short and sweet, so it feels like a natural choice. 

  2. This isn’t an actual snippet from a Python REPL, because inspect.getsourcelines requires the object to be defined in a .py file. 

  3. Why won’t we always get identical bytecode? The short answer is that some instructions may be swapped for their approximate equivalents.
    The long answer is that with compile, we aren’t able to replicate the exact closure environment of the original lambda. When a function refers to a free variable (like foo in lambda x: x + foo), it is its closure that the value for that variable comes from. For ad-hoc lambdas, this is typically the local scope of their outer function.
    Code produced by compile, however, isn’t associated with any such local scope. All free names are thus assumed to refer to global variables. Because Python uses different bytecode instructions for referencing local and global names (LOAD_FAST vs LOAD_GLOBAL), the result of compile may differ from a piece of bytecode produced in the regular manner.
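The distinction from the footnote above can be observed directly with the dis module (the surrounding LOAD instructions may vary slightly between Python versions):

```python
import dis

def outer():
    foo = 42
    return lambda x: x + foo

# foo is fetched from the closure of the original lambda:
original = outer()
loads = [i.opname for i in dis.get_instructions(original)
         if i.opname.startswith('LOAD')]
print(loads)  # e.g. ['LOAD_FAST', 'LOAD_DEREF']

# ...but a re-compiled copy assumes foo is a global:
recompiled = compile("lambda x: x + foo", '<string>', 'eval')
lambda_code = next(c for c in recompiled.co_consts
                   if hasattr(c, 'co_code'))
reloads = [i.opname for i in dis.get_instructions(lambda_code)
           if i.opname.startswith('LOAD')]
print(reloads)  # e.g. ['LOAD_FAST', 'LOAD_GLOBAL']
```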

Continue reading

Regular expression filters in SQLAlchemy

Posted on Wed 14 October 2015 in Code • Tagged with regular expressions, SQLAlchemy, SQL, Postgres, SQLite, AST

The nice part about SQLAlchemy — an ORM library/framework for Python — is that it offers much more than just mapping of SQL relations to model classes. At its lower level, for example, there exists an additional layer between those classes and raw SQL queries: an abstraction for the SQL language itself.

Although it may sound byzantine at first, this extra layer serves two important purposes:

  • portability: compiling of the same query to potentially different SQL syntaxes, as required by different database backends
  • extensibility: allowing the user to introduce new elements: operators, functions, or even whole statements

I’m going to take advantage of both of these qualities here, and show how you can implement support for regular expression operators in query filters. On supported database backends, it will allow for powerful pattern matching on string columns.

Database side

We will use the Postgres syntax of POSIX regular expression matchers as a reference. It includes four operators: for case sensitive or insensitive matching, in regular or negated versions.

Since it’s a common practice to use an in-memory SQLite database for running database-involved “unit” tests1, we will also add support for that backend. Regular expressions are not implemented there directly, but thanks to sqlite3’s ability to define custom functions, it’ll be easy enough to provide the necessary support in Python.

Desired API

In essence, what we want to do is to enhance the String column type with additional comparators, which will then be usable as arguments for the Query.filter method.

As an example, consider the following model class with a string column:

class Person(Model):
    nick = Column(String(255))
    # ...

Once we’re done, it should be possible to query for e.g. all the people whose nicks contain numbers in a pretty straightforward fashion:

numerics = session.query(Person).filter(Person.nick.regexp(r'\d+')).all()

Because of the limitations of the underlying database backends, only literal regular expressions will be supported. It means that, from the SQL standpoint, they have to be constants: you cannot use the value of one column as part of a regular expression that another column is matched against.

But of course, you can still construct those literals in Python code:

def get_copycats(nick):
    """Return all the people who tried to parrot given nick
    but failed and had to append some numbers to it.
    """
    nick_regexp = '^' + re.escape(nick) + r'\d+$'
    return session.query(Person).filter(Person.nick.regexp(nick_regexp)).all()

Considering that regular expressions themselves are already pretty powerful, this really ought to be sufficient for all reasonable purposes.

How do comparators work?

So how do we add the regexp method to the String type? It may seem logical to simply extend the class and add it directly:

from sqlalchemy import String as _String

class String(_String):
    """Enhanced version of the standard string type."""

    def regexp(self, value):
        # hmm...

But this won’t actually work. In SQLAlchemy, objects like String or Integer do not represent columns of certain type; they correspond to types themselves2. Extending them with additional methods won’t do much good, because regular code rarely operates on column types.

On the other hand, though, we obviously don’t want to mess with the Column class itself. Our additions are only meaningful for string columns, but putting them there would expose them to all columns, regardless of type!

Thankfully, there is an intermediate mechanism which SQLAlchemy introduced precisely to address the need we’ve got here. Every column type can define a comparator_factory: a kind of mixin class whose methods are incorporated into columns of that type. By overriding this inner class, we can both modify the behavior of existing operators, as well as introduce completely new ones.

So in order to add regexp and other methods to all String columns, our new string type must define its own comparator_factory:

class String(_String):
    class comparator_factory(_String.comparator_factory):
        def regexp(self, other):
            # ...

We need to remember about deriving it from the original one, too. Otherwise, all the standard operators you’d want to use in queries (==, +, etc.) would cease to work, because the new comparator_factory wouldn’t include an implementation of any of the necessary magic methods (__eq__, __add__, etc.).

SQL, abstracted

Knowing where to put our new comparator methods is certainly desirable, but the more interesting question is how do we implement them?

Like I’ve mentioned in the beginning, SQLAlchemy employs an additional layer between ORM models and raw, textual SQL language. Basically, it’s an AST for backend-independent queries which includes almost all of the various SQL constructs, codified in a platform-agnostic way inside the sqlalchemy.sql package.

You may have used this layer directly if you ever wrote queries based on Table objects and ran them with Session.execute. But even those constructed using the more familiar Query class interface end up in this intermediate representation. Often there is little to no additional processing involved.

Arguments to the Query.filter method, for example, are already given as SQL clause objects. It just so happens that their creation is hidden behind a very neat, operator-based API.

Thus, if our regular expression filters are to cooperate, they also need to return pieces of the SQL syntax tree. Not very complicated pieces, mind you, since we only need to represent simple expressions: something like foo ~ 'bar', where ~ may optionally be replaced by one of the other three operators.

Creating the node

They are all binary operators, by the way (i.e. taking exactly two arguments), so it makes sense that the corresponding AST node class is called BinaryExpression. The node’s children are the left argument, the right argument, and the operator itself.

With a little help from a few more SQL syntax wrappers, the implementation of regexp and the other methods turns out to be quite straightforward:

from sqlalchemy.sql.expression import BinaryExpression, literal
from sqlalchemy.sql.operators import custom_op

# (inside String.comparator_factory)

def regexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('~'))

def iregexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('~*'))

def not_regexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('!~'))

def not_iregexp(self, other):
    return BinaryExpression(self.expr, literal(other), custom_op('!~*'))

The use of the literal function is dictated by the limitation that was mentioned earlier: any regular expression given in the query must be a SQL literal. If we now try to pass a column-like clause instead, we’ll get an exception right at the query definition, rather than a database error when we try to execute it.

The custom_op function, on the other hand, is simply the easiest way to create an “operator node” that’s required as a child of BinaryExpression. Since it’s a custom operator, it won’t be interpreted by SQLAlchemy in any way; it will simply be used verbatim in the final SQL string that’s sent to the database.


You may have noticed that this would pose a problem if said database doesn’t support ~ or the other operators, which happens to be the case for everything besides Postgres. Because we originally intended to support SQLite in addition to Postgres, this is evidently a problem.

It’s also where the portability of an intermediate SQL representation comes into play. Since our new operators return AST nodes and not just textual SQL, we can redefine the way those nodes are compiled into actual query fragments on various database backends.

To accomplish that, first we need to distinguish our regex filters from other BinaryExpressions:

class RegexMatchExpression(BinaryExpression):
    """Represents matching of a column against a regular expression."""

# (inside String.comparator_factory)

def regexp(self, other):
    return RegexMatchExpression(self.expr, literal(other), custom_op('~'))
# etc.

Once we’ve introduced such a distinction, it becomes possible to provide a different way for those filters to be turned into SQL. We can namely define a new compilation routine for them, and mark it as canonical for a specific SQL dialect:

from sqlalchemy.ext.compiler import compiles

@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    """Compile the SQL expression representing a regular expression match
    for the SQLite engine.
    """
    # ...

The function receives an AST element to process (here, the RegexMatchExpression), along with a special compiler object that controls the whole translation process. Armed with those tools, we are allowed to modify the process in arbitrary ways and output just the right SQL statement that’ll do the job in SQLite.

Regex support lite

What does such a statement look like, though? As I’ve remarked early on, SQLite is very easy to extend with your own functions, and the sqlite3 driver used by SQLAlchemy enables us to write those functions directly in Python. Obviously, this is fantastic news when you have something like the standard re module at your disposal.

Indeed, coding the four required functions is quite trivial:

import re

# Mapping from the regular expression matching operators
# to named Python functions that implement them for SQLite.
SQLITE_REGEX_FUNCTIONS = {
    '~': ('REGEXP',
          lambda value, regex: bool(re.match(regex, value))),
    '~*': ('IREGEXP',
           lambda value, regex: bool(re.match(regex, value, re.IGNORECASE))),
    '!~': ('NOT_REGEXP',
           lambda value, regex: not re.match(regex, value)),
    '!~*': ('NOT_IREGEXP',
            lambda value, regex: not re.match(regex, value, re.IGNORECASE)),
}

What’s less apparent is how — or rather, when — to instruct the SQLite database to use them. As per the API we have to use, custom SQLite functions are created on a per-connection basis. But SQLAlchemy, like any good database interface, takes care of connection management, lifecycle, and pooling. Nowhere in our application is there a connect call (nor should there be!) that we could simply follow with a few create_function invocations.
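For reference, here’s what that per-connection mechanism looks like when talking to the raw sqlite3 module directly, outside of SQLAlchemy (the table and data are made up purely for illustration):

```python
import re
import sqlite3

conn = sqlite3.connect(':memory:')
# register a two-argument REGEXP function on this particular connection
conn.create_function('REGEXP', 2,
                     lambda value, regex: bool(re.match(regex, value)))

conn.execute("CREATE TABLE person (nick TEXT)")
conn.executemany("INSERT INTO person VALUES (?)",
                 [('alice',), ('bob42',), ('carol7',)])

# nicks that are some letters followed by some digits
rows = conn.execute("SELECT nick FROM person "
                    "WHERE REGEXP(nick, '^[a-z]+[0-9]+$') "
                    "ORDER BY nick").fetchall()
print(rows)  # -> [('bob42',), ('carol7',)]
```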

Yet, there is a way of doing exactly what we require, and it involves utilizing the event subsystem included in SQLAlchemy. Anytime something interesting happens to any of its ORM or core objects — models, sessions, connection pools, etc. — SQLAlchemy publishes an event that our application code can listen (subscribe) to. It’s a classic PubSub system that introduces some serious potential for extensibility.

Our use of it will be pretty modest, though. All we’re interested in is the establishing of a connection to the SQLite database. This translates directly to a 'connect' event of the Engine object:

from sqlalchemy import event
from sqlalchemy.engine import Engine
import sqlite3

@event.listens_for(Engine, 'connect')
def sqlite_engine_connect(dbapi_connection, connection_record):
    """Listener for the event of establishing connection to a SQLite database.

    Creates the functions handling regular expression operators
    within SQLite engine, pointing them to their Python implementations above.
    """
    if not isinstance(dbapi_connection, sqlite3.Connection):
        return

    for name, function in SQLITE_REGEX_FUNCTIONS.values():
        dbapi_connection.create_function(name, 2, function)

Note that this will catch all the connection events, so we have to verify it’s really SQLite we’re talking to. Afterwards, the creation of REGEXP, IREGEXP, etc. functions is extremely straightforward.

Compilation, take two

This was quite a build-up, but now we’re very close to the finale. What remains is finishing the compilation routine:

@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    # ...

We know that element corresponds to an expression in the form of a ~ 'foo'. For SQLite, however, the compatible version is a function call: REGEXP(a, 'foo'). At first this may appear rather disconcerting, because it’s basically a completely different AST node to build.

But it’s actually not a problem at all. Inside compiler hooks, we are allowed to use much of the same API that’s available when drafting regular queries. This includes the func factory object which produces calls to arbitrary SQL functions. Rather than compiling the original binary expression, we’ll simply poach its operands and use them as arguments to one of the new functions:

from sqlalchemy import exc
from sqlalchemy.sql.expression import func

@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
    # determine the name of a custom SQLite function to use for the operator
    operator = element.operator.opstring
    try:
        func_name, _ = SQLITE_REGEX_FUNCTIONS[operator]
    except (KeyError, ValueError) as e:
        would_be_sql_string = ' '.join((compiler.process(element.left),
                                        operator,
                                        compiler.process(element.right)))
        raise exc.StatementError(
            "unknown regular expression match operator: %s" % operator,
            would_be_sql_string, None, e)

    # compile the expression as an invocation of the custom function
    regex_func = getattr(func, func_name)
    regex_func_call = regex_func(element.left, element.right)
    return compiler.process(regex_func_call)

Notice how compiler.process is used to obtain the final result. The compiler object doesn’t care that we use it on a totally different AST node; it will dutifully carry out its compilation into raw SQL. We can even use this capability to add a little bit of paranoid error handling: if we encounter an unknown operator, the resulting error message will include fully compiled, SQL versions of both arguments.


This concludes our efforts: the original query examples with Person.nick.regexp should now work in both Postgres and SQLite. For your convenience, I’ve bundled all the code in this gist.

If you feel like tinkering with it further, I would suggest you try to remove the superfluous NOT_* functions. They make little sense given that SQL has a perfectly adequate NOT keyword. A clean solution would probably prefer an additional reverse flag in RegexMatchExpression over looking for a '!' character in the operator string.

  1. It may or may not be a good practice, though. 

  2. Although it’s possible to write Column(Integer), it’s merely a syntactical convenience. SQLAlchemy interprets it readily as Column(Integer()). Parameterized types — like String — always require explicit instantiation. 

Continue reading