Regular expression filters in SQLAlchemy
Posted on Wed 14 October 2015 in Code
The nice part about SQLAlchemy — an ORM library/framework for Python — is that it offers much more than just mapping of SQL relations to model classes. At its lower level, for example, there exists an additional layer between those classes and raw SQL queries: an abstraction for the SQL language itself.
Although it may sound byzantine at first, this extra layer serves two important purposes:
- portability: compiling of the same query to potentially different SQL syntaxes, as required by different database backends
- extensibility: allowing the user to introduce new elements: operators, functions, or even whole statements
I’m going to take advantage of both of these qualities here, and show how you can implement support for regular expression operators in query filters. On supported database backends, it will allow for powerful pattern matching on string columns.
Database side
We will use the Postgres syntax of POSIX regular expression matchers as a reference. It includes four operators: for case sensitive or insensitive matching, in regular or negated versions.
Since it’s a common practice to use an in-memory SQLite for running database-involved
“unit” tests1, we will also add support for that backend. Regular expressions are not implemented there directly,
but thanks to
sqlite3
‘s custom function ability,
it’ll be easy enough to provide the necessary support in Python.
Desired API
In essence, what we want to do is to enhance the String
column type with additional comparators, which will then
be usable as arguments for the Query.filter
method.
As an example, consider the following model class with a string column:
class Person(Model):
nick = Column(String(255))
# ...
Once we’re done, it should be possible to query for e.g. all the people whose nicks contain numbers in a pretty straightforward fashion:
numerics = session.query(Person).filter(Person.nick.regexp(r'\d+')).all()
Because of the limitations of underlying databases backends, only literal regular expressions would be supported. It means that, from the SQL standpoint, they have to be constants: you cannot use the value of one column as part of a regular expression that another column is matched against.
But of course, you can still construct those literals in Python code:
def get_copycats(nick):
"""Return all the people who tried to parrot given nick
but failed and had to append some numbers to it.
"""
nick_regexp = '^' + re.escape(nick) + r'\d+$'
return session.query(Person).filter(Person.nick.regexp(nick_regexp)).all()
Considering that regular expressions themselves are already pretty powerful, this really ought to be sufficient for all reasonable purposes.
How comparators work?
So how do we add the regexp
method to the String
type? It may seem logical to simply extend the class
and add it directly:
from sqlalchemy import String as _String
class String(_String):
"""Enhanced version of the standard string type."""
def regexp(self, value):
# hmm...
But this won’t actually work. In SQLAlchemy, objects like String
or Integer
do not represent columns of certain type;
they correspond to types themselves2. Extending them with additional methods won’t do much good, because regular code
rarely operates on column types.
On the other hand, though, we obviously don’t want to mess with the Column
class itself. Our additions are only
meaningful for string columns, but putting them there would expose them to all columns, regardless of type!
Thankfully, there is an intermediate mechanism which SQLAlchemy introduced precisely to address the need we’ve got here.
Every column type can define a
comparator_factory
: a kind of mixin class whose methods are incorporated into columns of that type.
By overriding this inner class,
we can both modify the behavior of existing operators, as well as introduce completely new ones.
So in order to add regexp
and other methods to all String
columns, our new string type must define its own
comparator_factory
:
class String(_String):
class comparator_factory(_String.comparator_factory):
def regexp(self, other):
# ...
We need to remember about deriving it from the original one, too. Otherwise, all the standard operators
you’d want to use in queries (==
, +
, etc.) would cease to work, because the new comparator_factory
wouldn’t
include an implementation of any of the necessary magic methods (__eq__
, __add__
, etc.).
SQL, abstracted
Knowing where to put our new comparator methods is certainly desirable, but the more interesting question is how do we implement them?
Like I’ve mentioned in the beginning, SQLAlchemy employs an additional layer between
ORM models and raw, textual SQL language. Basically, it’s an AST
for backend-independent queries which includes almost all of the various SQL constructs, codified in a platform-agnostic
way inside the sqlalchemy.sql
package.
You may have used this layer directly if you ever wrote queries based on
Table
objects and ran them with
Session.execute
. But even those constructed using the more familiar
Query
class interface
end up in this intermediate representation. Often there is little to no additional processing involved.
Arguments to the
Query.filter
method,
for example, are already given as SQL clause objects. It just so happens that its creation is hidden behind
a very neat, operator-based API.
Thus, if our regular expression filters are to cooperate, they also need to return pieces of the SQL syntax tree.
Not very complicated pieces, mind you, since we only need to represent simple expressions: something like foo ~ 'bar'
, where ~
may optionally be replaced by one of the other three operators.
Creating the node
They are all binary operators, by the way (i.e. taking exactly two arguments), so it makes sense that the corresponding
AST node class is called BinaryExpression
. The node’s children are the left argument, the right argument, and the operator itself.
With a little help from a few more SQL syntax wrappers, the implementation of regexp
and the other methods
turns out to be quite straightforward:
from sqlalchemy.sql.expression import BinaryExpression, literal
from sqlalchemy.sql.operators import custom_op
# (inside String.comparator_factory)
def regexp(self, other):
return BinaryExpression(self.expr, literal(other), custom_op('~'))
def iregexp(self, other):
return BinaryExpression(self.expr, literal(other), custom_op('~*'))
def not_regexp(self, other):
return BinaryExpression(self.expr, literal(other), custom_op('!~'))
def not_iregexp(self, other):
return BinaryExpression(self.expr, literal(other), custom_op('!~*'))
The use of
literal
function
is dictated by the limitation that was mentioned earlier: any regular expression given in the query must be a SQL literal.
If we now try to pass a column-like clause, we’ll get an exception right at the query definition, rather than
a database error when we try to execute it.
The custom_op
function,
on the other hand, is simply the easiest way to create an “operator node” that’s required as a child
of BinaryExpression
. Since it’s a custom operator, it won’t be interpreted by SQLAlchemy in any way; it will simply
be used verbatim in the final SQL string that’s sent to the database.
Compile!
You may have noticed that this would pose a problem if said database doesn’t support ~
or the other operators,
which happens to be the case for everything besides Postgres. Because we originally intended to support SQLite in addition
to Postgres, this is evidently a problem.
It’s also where the portability of an intermediate SQL representation comes into play. Since our new operators return AST nodes and not just textual SQL, we can redefine the way those nodes are compiled into actual query fragments on various database backends.
To accomplish that, first we need to distinguish our regex filters from other BinaryExpression
s:
class RegexMatchExpression(BinaryExpression):
"""Represents matching of a column againsts a regular expression."""
# (inside String.comparator_factory)
def regexp(self, other):
return RegexMatchExpression(self.expr, literal(other), custom_op('~'))
# etc.
Once we’ve introduced such a distinction, it becomes possible to provide a different way for those filters to be turned into SQL. We can namely define a new compilation routine for them, and mark it as canonical for a specific SQL dialect:
from sqlalchemy.ext.compiler import compiles
@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
"""Compile the SQL expression representing a regular expression match
for the SQLite engine.
"""
# ...
The function receives an AST element
to process (here, the RegexMatchExpression
), along with a
special compiler
object
that controls the whole translation process. Armed with those tools, we are allowed to modify the process
in arbitrary ways and output just the right SQL statement that’ll do the job in SQLite.
Regex support lite
How does such a statement look like, though? As I’ve remarked early on, SQLite is very easy to extend with your own
functions, and the sqlite3
driver used by SQLAlchemy enables us to write those functions directly in Python.
Obviously, this is fantastic news when you have something like
the standard re
module at your disposal.
Indeed, coding the four required functions is quite trivial:
import re
# Mapping from the regular expression matching operators
# to named Python functions that implement them for SQLite.
SQLITE_REGEX_FUNCTIONS = {
'~': ('REGEXP',
lambda value, regex: bool(re.match(regex, value))),
'~*': ('IREGEXP',
lambda value, regex: bool(re.match(regex, value, re.IGNORECASE))),
'!~': ('NOT_REGEXP',
lambda value, regex: not re.match(regex, value)),
'!~*': ('NOT_IREGEXP',
lambda value, regex: not re.match(regex, value, re.IGNORECASE)),
}
What’s less apparent is how — or rather, when — to instruct the SQLite database to use them.
As per the API we have to use,
custom SQLite functions are created on a per-connection basis. But SQLAlchemy, like any good database interface,
takes care of connection management, lifecycle, and pooling. Nowhere in our application there is a connect
call
(nor there should be!) that we could just follow with a few create_function
invocations.
Yet, there is a way of doing exactly what we require, and it involves utilizing the event subsystem included in SQLAlchemy. Anytime something interesting happens to any of its ORM or core objects — models, sessions, connection pools, etc. — SQLAlchemy publishes an event that our application code can listen (subscribe) to. It’s a classic PubSub system that introduces some serious potential for extensibility.
Our use of it will be pretty modest, though. All we’re interested in is the establishing of a connection
to the SQLite database. This translates directly to a 'connect'
event of the
Engine
object:
from sqlalchemy import event
from sqlalchemy.engine import Engine
import sqlite3
@event.listens_for(Engine, 'connect')
def sqlite_engine_connect(dbapi_connection, connection_record):
"""Listener for the event of establishing connection to a SQLite database.
Creates the functions handling regular expression operators
within SQLite engine, pointing them to their Python implementations above.
"""
if not isinstance(dbapi_connection, sqlite3.Connection):
return
for name, function in SQLITE_REGEX_FUNCTIONS.values():
dbapi_connection.create_function(name, 2, function)
Note that this will catch all the connection events, so we have to verify it’s really SQLite we’re talking to.
Afterwards, the creation of REGEXP
, IREGEXP
, etc. functions is extremely straightforward.
Compilation, take two
This was quite a build-up, but now we’re very close to the finale. What remains is finishing the compilation routine:
@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
pass
We know that element
corresponds to an expression in the form of a ~ 'foo'
. For SQLite, however, the compatible
version is a function call: REGEXP(a, 'foo')
. At first this may appear rather disconcerting, because it’s basically
a completely different AST node to build.
But it’s actually not a problem at all. Inside compiler hooks, we are allowed to use the much of the same API that’s
available when drafting regular queries. This includes the
func
factory object
which produces calls to arbitrary SQL functions. Rather than compiling the original binary expression, we’ll simply
poach its operands and use them as arguments to one of the new functions:
from sqlalchemy import exc
from sqlalchemy.sql.expression import func
@compiles(RegexMatchExpression, 'sqlite')
def sqlite_regex_match(element, compiler, **kw):
# determine the name of a custom SQLite function to use for the operator
operator = element.operator.opstring
try:
func_name, _ = SQLITE_REGEX_FUNCTIONS[operator]
except (KeyError, ValueError), e:
would_be_sql_string = ' '.join((compiler.process(element.left),
operator,
compiler.process(element.right)))
raise exc.StatementError(
"unknown regular expression match operator: %s" % operator,
would_be_sql_string, None, e)
# compile the expression as an invocation of the custom function
regex_func = getattr(func, func_name)
regex_func_call = regex_func(element.left, element.right)
return compiler.process(regex_func_call)
Notice how compiler.process
is used to obtain the final result. The compiler
object doesn’t care that we use it on
a totally different AST node; it will dutifully carry out its compilation into raw SQL. We can even use this capability
to add a little bit of paranoid error handling: if we encounter an unknown operator, the resulting error message
will include fully compiled, SQL versions of both arguments.
Summary
This concludes our efforts: the original query examples with Person.nick.regexp
should now work
in both Postgres and SQLite.
For you convenience, I’ve bundled all the code in this gist.
If you feel like tinkering with it further, I would suggest you try to remove the superfluous NOT_*
functions.
They make little sense given that SQL has a perfectly adequate NOT
keyword. A clean solution would probably prefer
an additional reverse
flag in RegexMatchExpression
over looking for a '!'
character in the operator string.
-
It may or may not be a good practice, though. ↩
-
Although it’s possible to write
Column(Integer)
, it’s merely a syntactical convenience. SQLAlchemy interprets it readily asColumn(Integer())
. Parameterized types — likeString
— always require explicit instantiation. ↩