…or: The Most Hideous Hack I’ve (Almost) Done
In callee, the argument matcher library for Python that
I released recently, there is this lovely
for a seemingly simple feature. When using the Matching construct
with a simple lambda:
mock_foo.assert_called_with(Matching(lambda x: x % 2 == 0))
it would be great to see its code in the error message if the assertion fails. Right now it’s just going to say
<Matching <function <lambda> at 0x7f5d8a06eb18>>. Provided you don’t possess a supernatural ability
of dereferencing pointers in your head, this won’t give you any immediate hint as to what went wrong. Wouldn’t it be nice
if it read as, say,
<Matching \x: x % 2> instead? [1]
So I thought: why not try and implement such a mechanism? This is Python, after all — a language where you can spawn completely new classes at runtime, walk the stack backwards (or even forward) and read the local variables, or change the behavior of the import system itself. Surely it would be possible — nay, easy — to get the source code of a short lambda function, right?
Boy, was I wrong.
Make no mistake, though: the task turned out to be absolutely doable, at least in the scope I wanted it done. But what would you think of a solution that involves not just the usual Python hackery, but also AST inspection, transformations of the source code as text, and bytecode shenanigans?…
The code, all the code, and… much more than the code
Let’s start from the beginning, though. Here’s a short lambda function, the kind of which we’d like to obtain the source code of:
is_even = lambda x: x % 2 == 0
If the documentation for the Python standard library is to be believed, this should be pretty easy.
In the inspect module, there is a function called no different than
getsource. For our purposes, however,
getsourcelines is a little more
convenient, because we can easily tell when the lambda is too long:
import inspect

def get_short_lambda_source(lambda_func):
    try:
        source_lines, _ = inspect.getsourcelines(lambda_func)
    except IOError:
        return None
    if len(source_lines) > 1:
        return None  # the lambda spans more than one line
    return source_lines[0].strip()
Of course, if you have programmed in Python for any longer period of time, you know very well that the standard docs
are not to be trusted. And it’s not just that the
except clause should also include
TypeError, because it
will be thrown when you try to pass any of the Python builtins to getsourcelines.
More important is the ambiguity of what “source lines for an object” actually means. “Source lines containing
the object definition” would be much more accurate, and this seemingly small distinction is rather crucial here.
Passing a lambda function to either getsourcelines or
getsource, we’ll get its source and everything else
that the returned lines include.
That’s right. Say hello to the complete
is_even = assignment, and the entire mock_foo.assert_called_with(...) invocation.
And in case you are wondering: yes, the result will also include any end-of-line comments. No token left behind!
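Both caveats are easy to poke at in a small experiment. The sketch below is mine, not part of callee: the helper names (getsourcelines_or_none, load_lambda_from_file) and the one-line module are made up for illustration, and the temporary file is only there because inspect needs a real .py file to read from.

```python
import inspect
import os
import tempfile

# Hypothetical one-line module; inspect needs a real .py file to read from.
MODULE_SOURCE = "is_even = lambda x: x % 2 == 0  # parity check\n"


def getsourcelines_or_none(obj):
    """Return the source lines for obj, or None when inspect can't find them."""
    try:
        lines, _ = inspect.getsourcelines(obj)
    except (OSError, TypeError):  # IOError is an alias of OSError on Python 3
        return None
    return lines


def load_lambda_from_file(source, name):
    """Write source to a temporary .py file, execute it, and look up name."""
    handle, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(handle, "w") as module_file:
        module_file.write(source)
    namespace = {}
    exec(compile(source, path, "exec"), namespace)
    return namespace[name], path


is_even, module_path = load_lambda_from_file(MODULE_SOURCE, "is_even")
try:
    lambda_lines = getsourcelines_or_none(is_even)
finally:
    os.remove(module_path)

builtin_lines = getsourcelines_or_none(len)

print(lambda_lines)   # the line(s) containing the lambda, extras included
print(builtin_lines)  # None, thanks to the TypeError branch
```

As far as I can tell, the first print shows the assignment and the trailing comment right next to the lambda, which is exactly the problem.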
Clearly this is more than we’ve bargained for. Maybe there is a way to strip away the unnecessary cruft? Python does
know how to parse itself, after all: the standard
ast module is a manifestation of this knowledge. Perhaps we can use it to retrieve the
lambda AST node in order to turn it,
and just it, back into Python code?…
import ast

def get_short_lambda_ast_node(lambda_func):
    source_text = get_short_lambda_source(lambda_func)
    if source_text:
        source_ast = ast.parse(source_text)
        # The first Lambda node found in the parsed line, or None.
        return next((node for node in ast.walk(source_ast)
                     if isinstance(node, ast.Lambda)), None)
But as it turns out, getting the source text back this way is only mostly possible.
See, every substantial AST node, which is either an expression (
ast.expr) or a statement (
ast.stmt), has two common attributes:
lineno and col_offset. When combined, they point to a place in the original source code
where the node was parsed from. This is how we can find out where to look for the definition of our lambda function.
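Here is a minimal sketch of reading those two attributes, using the assertion line from the beginning of the post as plain text. Note that col_offset counts UTF-8 bytes, which only coincides with character positions because this line is pure ASCII.

```python
import ast

source_text = "mock_foo.assert_called_with(Matching(lambda x: x % 2 == 0))"

tree = ast.parse(source_text)
# Pick the first (and here the only) lambda in the parsed line.
lambda_node = next(node for node in ast.walk(tree)
                   if isinstance(node, ast.Lambda))

start = lambda_node.col_offset
print(lambda_node.lineno, start)
# The offset points right at the "lambda" keyword.
print(source_text[start:start + len("lambda")])
```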
Looks promising, right? The only problem is we don’t know when to stop looking.
That’s right: nodes created by
ast.parse are annotated with their start offset, but not with length nor the end offset.
As a result, the best we can do when it comes to carving out the lambda source from the very first example is this:
lambda x: x % 2 == 0))
So close! Those hanging parentheses are evidently just taunting us, but how can we remove them?
A lambda is basically
just a Python expression, so in principle it can be followed by almost anything. This is doubly true for lambdas inside
the Matching construct, as they may be a part of some larger mock assertion:
mock_foo.assert_called_with(Matching(lambda x: x % 2 == 0), Integer() & GreaterThan(42))
Here, the extraneous suffix is the entirety of
), Integer() & GreaterThan(42)), quite a lot more than just
the two closing parentheses from before.
And that’s of course nowhere near the limit of possibilities: for one, there may be more
lambdas in there, too!
Back off, slowly
It seems, however, that there is one thing those troublesome tails have in common: they aren’t syntactically valid.
A lambda node nested within some other syntactical constructs will have their closing fragments (e.g. a closing parenthesis)
appear somewhere after its end. Without the corresponding openings (e.g.
Matching(), those fragments won’t parse.
So here’s the crazy idea. What we have is invalid Python, but only because of some unspecified number of extra characters. How about we just try and remove them, one by one, until we get something that is syntactically correct? If we are not mistaken, this will finally be our lambda and nothing else.
The fortune favors the brave, so let’s go ahead and try it:
    # ... continuing get_short_lambda_source() ...
    source_text = source_lines[0].strip()
    lambda_node = get_short_lambda_ast_node(lambda_func)

    lambda_text = source_text[lambda_node.col_offset:]
    min_length = len('lambda:_')  # shortest possible lambda expression
    while len(lambda_text) > min_length:
        try:
            ast.parse(lambda_text)
            return lambda_text
        except SyntaxError:
            # Still invalid: drop the last character and try again.
            lambda_text = lambda_text[:-1]
    return None
Considering that we’re basically taking lessons from the dusty old tomes in the Restricted Section of Hogwarts library,
the magic here looks quite simple. As long as there is something that can pass for a lambda definition,
we try to parse it and see if it succeeds. The line that says
except SyntaxError: is obviously not something for
the faint of heart, but at least we are specifying
what exception we anticipate catching.
And the kicker? It works. By that I mean it doesn’t return garbage results for a few obvious and not so obvious test cases, which is already more than you would normally expect from hacks of this magnitude. All the lambdas defined until this paragraph, for example, can have their source code extracted without issue.
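If you’d like to play with the trimming loop without any mocks or source files around, here is a standalone sketch that works on plain strings. The function name trim_until_parseable is made up for this example, and the two inputs are the leftovers from the assertions shown earlier.

```python
import ast


def trim_until_parseable(candidate, min_length=len("lambda:_")):
    """Drop trailing characters until the text parses, or give up."""
    while len(candidate) >= min_length:
        try:
            ast.parse(candidate)
            return candidate
        except SyntaxError:
            candidate = candidate[:-1]
    return None


short_tail = "lambda x: x % 2 == 0))"
long_tail = "lambda x: x % 2 == 0), Integer() & GreaterThan(42))"

print(trim_until_parseable(short_tail))
print(trim_until_parseable(long_tail))
```

Both calls should end up with the same bare lambda, because every longer prefix still contains an unmatched closing parenthesis.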
Just one more thing
So… victory? Not quite. Astute readers may recall my promise of some bytecode arcana, and now’s the time for it.
Despite the initial success of our gradual, character-dropping approach, there are cases where it doesn’t produce the correct result. Consider, for example, a lambda definition that’s nestled within a tuple [2]:
>>> x = lambda _: True, 0
>>> get_short_lambda_source(x[0])
lambda _: True, 0
We would of course expect the result to be
lambda _: True, without a comma or zero.
Unfortunately, here’s where our earlier assumption fails rather spectacularly. The line of code extracted from AST
is syntactically valid even with the extra characters. As a result,
ast.parse succeeds too early and returns an
incorrect definition. It should have been just the lambda contained within a tuple, but what we get back is the source of the entire tuple.
You may say that this is the sharp end of a narrow edge case, and anyone who defines functions like that deserves all the trouble they get. And sure, I wouldn’t mind if we just threw hands in the air and told them we’re simply unable to retrieve the source here. But my opinion is that it doesn’t justify serving them obviously wrong results!
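For the curious, it is easy to check how the parser sees the over-long text. This is only a quick illustration with the ast module, not a part of the extraction code itself.

```python
import ast

tree = ast.parse("lambda _: True, 0")
expression = tree.body[0].value  # the expression behind the only statement

# The top-level node is a tuple; the lambda is merely its first element.
print(type(expression).__name__)
print(type(expression.elts[0]).__name__)
```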
A halting problem
Not if we can help it, anyway. Have a look at the expected source code and the one we’ve extracted, side by side:
lambda _: True
lambda _: True, 0
The second line isn’t just longer: it is also doing more. It isn’t just defining a lambda; it defines it,
conjures up a constant
0, and then packs them both into a tuple. That’s at least two additional steps compared to the first line.
Those steps have a more precise name, too: they are the bytecode instructions. Every piece of Python source is compiled to a binary bytecode before it’s executed, because the interpreter can only work with this representation. Compilation typically happens when a Python module is first imported, producing a .pyc file corresponding to its .py file. Subsequent imports will simply reuse the cached bytecode.
Moreover, any function or class object has its bytecode accessible (read-only) at runtime. There is even a
dedicated data type to hold it — called simply
code — with a buffer of raw bytes under one of its attributes.
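A short sketch of what that looks like in practice, with the raw buffer living under co_code. The disassembly printed by dis differs between interpreter versions, so I’m not reproducing it here.

```python
import dis

is_even = lambda x: x % 2 == 0

code = is_even.__code__
print(type(code).__name__)  # the "code" type mentioned above
print(len(code.co_code))    # size of the raw bytecode buffer, in bytes

# A human-readable listing of the instructions in that buffer.
dis.dis(is_even)
```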
Finally, the bytecode compiler itself is also available to Python programs as a built-in
compile function. You don’t see it used as often as its
relatives, eval and exec (which hopefully are a rare sight themselves!),
but it taps into the same internal machinery of Python.
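In case you haven’t used it before, here is compile in its simplest form: turning the text of an expression into a code object in 'eval' mode. The "<demo>" filename is just a placeholder that shows up in tracebacks.

```python
expression_code = compile("x % 2 == 0", "<demo>", "eval")

# The resulting code object can be evaluated against any namespace.
print(eval(expression_code, {"x": 4}))
print(eval(expression_code, {"x": 7}))

# Text with leftover parentheses is rejected with a SyntaxError.
try:
    compile("x % 2 == 0))", "<demo>", "eval")
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)
```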
So how does it all add up? The idea is, basically, to cross-check the alleged source code of the lambda with its own bytecode. Any junk that’s still left to trim — even if syntactically valid — will surface as a divergence after compilation. Thus we can simply continue dropping characters until the bytecodes match:
    lambda_text = source_text[lambda_node.col_offset:]
    lambda_body_text = source_text[lambda_node.body.col_offset:]
    min_length = len('lambda:_')  # shortest possible lambda expression
    while len(lambda_text) > min_length:
        try:
            code = compile(lambda_body_text, '<unused filename>', 'eval')
            if len(code.co_code) == len(lambda_func.__code__.co_code):
                return lambda_text
        except SyntaxError:
            pass
        # Not there yet: drop the last character from both texts.
        lambda_text = lambda_text[:-1]
        lambda_body_text = lambda_body_text[:-1]
    return None
Okay, maybe not the exact bytes [3], but stopping at the identical bytecode length is good enough a strategy.
As an obvious bonus,
compile will also take care of detecting syntax errors in the candidate source code,
so we don’t need the
ast parsing anymore.
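To get a feel for the length comparison, here is a small sketch on strings alone. The helper name expression_bytecode_length is hypothetical. I also deliberately picked a body with a variable in it, because a tuple made only of constants may get folded into a single constant by newer compilers, in which case the lengths wouldn’t differ. The actual numbers depend on the interpreter version, so treat them as something to observe rather than rely on.

```python
def expression_bytecode_length(expression_text):
    """Compile the text as a standalone expression; return its bytecode size."""
    code = compile(expression_text, "<candidate>", "eval")
    return len(code.co_code)


add_one = lambda x: x + 1
target_length = len(add_one.__code__.co_code)

exact = expression_bytecode_length("x + 1")        # just the lambda body
too_long = expression_bytecode_length("x + 1, 0")  # body plus tuple leftovers

# The over-long candidate needs extra instructions to build the tuple.
print(target_length, exact, too_long)
```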
That escalated quickly!
Believe it or not, there aren’t any more objections to this solution. You can view it in its glorious entirety by looking at this gist.
Does it mean it is also making its cameo in the callee library?…
No, I’m afraid not.
Normally, I’m not the one to shy away from, ahem, bold solutions to tough problems. But in this case, the magnitude of hackery required is just too great, the result not satisfactory enough, the feature’s priority not really all that high, and the maintenance burden it’d introduce most likely too large.
In the end, it was great fun figuring it out: yet another example of how you can fiddle with Python to do basically anything. Still, we must not get so preoccupied with whether or not we can that we forget to ask if we should.
[1] \ is how lambda functions are denoted in Haskell. We want to be short and sweet, so it feels like a natural choice.
[2] This isn’t an actual snippet from a Python REPL, because
inspect.getsourcelines requires the object to be defined in a .py file.
[3] Why won’t we always get identical bytecode? The short answer is that some instructions may be swapped for their approximate equivalents.
The long answer is that with
compile, we aren’t able to replicate the exact closure environment of the original lambda. When a function refers to a free variable (like foo in
lambda x: x + foo), it is its closure where the value for that variable comes from. For ad-hoc lambdas, this is typically the local scope of its outer function. Code produced by
compile, however, isn’t associated with any such local scope. All free names are thus assumed to refer to global variables. Because Python uses different bytecode instructions for referencing local and global names (
LOAD_FAST vs. LOAD_GLOBAL), the result of
compile may differ from a piece of bytecode produced in the regular manner.