str.startswith() with tuple argument

Posted on Tue 28 June 2016 in Code • Tagged with Python, strings, tuples

Here’s a little-known trick that applies to the startswith and endswith methods of Python’s str (and unicode).

Suppose you’re checking whether a string starts with some prefix:

if s.startswith('http://'):
    # totally an URL

You eventually add more possible prefixes (or suffixes) to your condition:

if s.startswith('http://') or s.startswith('https://'):
    # ...

Later on you notice the repetition and refactor it into something like this:

SCHEMES = ['http://', 'https://', 'ftp://', 'git://']
if any(s.startswith(p) for p in SCHEMES):
    # ...

or if you’re feeling extra functional:

if any(map(s.startswith, SCHEMES)):
    # ...

Turns out, however, that startswith (and endswith) support this use case natively. Rather than passing just a single string as the argument, you can provide a tuple of strings instead:

SCHEMES = ('http://', 'https://', 'ftp://', 'git://')
if s.startswith(SCHEMES):
    # ...

Either method will then check the original string against every element of the passed tuple, and return True if at least one of them is a prefix (or suffix) of the string. As you can see, that’s exactly what we did previously with any.
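If you’d like to convince yourself of that equivalence, running both forms side by side is enough. This is only a quick sketch, and the two helper names are made up for illustration:

```python
SCHEMES = ('http://', 'https://', 'ftp://', 'git://')


def has_known_scheme(s):
    """Tuple form: a single call checks every prefix."""
    return s.startswith(SCHEMES)


def has_known_scheme_any(s):
    """The earlier any()-based form, kept only for comparison."""
    return any(s.startswith(p) for p in SCHEMES)


# Both forms should agree on every input, matching or not.
for candidate in ('https://example.com', 'git://host/repo',
                  'mailto:me@example.com', ''):
    assert has_known_scheme(candidate) == has_known_scheme_any(candidate)
```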

Somewhat surprisingly, however, the feature only works for actual tuples. Trying to pass a seemingly equivalent iterable (a list or set, for example) will be met with the interpreter’s refusal:

>>> is_jpeg = filename.endswith(['.jpg', '.jpeg'])
TypeError: endswith first arg must be str, unicode, or tuple, not list

If you dig into it, there doesn’t seem to be a compelling reason for this behavior. The relevant feature request talks about consistency with the built-in isinstance function, but it’s quite difficult to see how those two are related.

In any case, this can be worked around without much difficulty:

PROTOCOLS = ('http', 'https', 'ftp', 'git')
if s.startswith(tuple(p + '://' for p in PROTOCOLS)):
    # ...

though ideally, you’d want to pack the prefixes in a tuple to begin with.
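When the prefixes or suffixes arrive as some other iterable, a tiny wrapper can do the conversion in one place. The sketch below assumes a hypothetical helper called endswith_any; it also shows the TypeError from above being raised for a plain list:

```python
def endswith_any(s, suffixes):
    """Hypothetical helper: accepts any iterable of suffixes, not just a tuple."""
    return s.endswith(tuple(suffixes))


# Passing a list directly is rejected by str.endswith itself...
try:
    'photo.jpeg'.endswith(['.jpg', '.jpeg'])
    list_accepted = True
except TypeError:
    list_accepted = False

# ...but converting it to a tuple first works as expected.
is_jpeg = endswith_any('photo.jpeg', ['.jpg', '.jpeg'])
```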


…or lambda?

Posted on Mon 20 June 2016 in Code • Tagged with Python, syntax, lambda, operators

a.k.a. Curious Facts about Python Syntax

In Python 3.3, a new method has been added to the str type: casefold. Its purpose is to return a “sanitized” version of the string that’s suitable for case-insensitive comparison. For older versions of Python, an alternative way that’s mostly compatible is to use the str.lower method, which simply changes all letters in the string to lowercase.

Syntax is hard

Easy enough for a compatibility shim, right? That’s exactly what I thought when I came up with this:

casefold = getattr(str, 'casefold', None) or lambda s: s.lower()

Let’s ignore for a moment the fact that for a correct handling of unicode objects in Python 2, a much more sophisticated approach is necessary. What’s rather more pertinent is that this simple code doesn’t parse:

  File "foo.py", line 42
    getattr(str, 'casefold', None) or lambda s: s.lower()
                                      ^
SyntaxError: invalid syntax

It’s not very often that you would produce a SyntaxError with code that looks perfectly valid to most pythonistas. The last time I had it happen, the explanation was rather surprising and not exactly trivial to come by.

Fortunately, there is always one place where we can definitively resolve any syntactic confusion. That place is the full grammar specification of the Python language.

It may be a little intimidating at first, especially if you’re not familiar with the EBNF notation it uses. All of Python’s language constructs are there, though, so the SyntaxError from above should be traceable to some rule of the grammar[1].

The culprit

And indeed, the offending bit is right here:

or_test: and_test ('or' and_test)*
and_test: not_test ('and' not_test)*
...

It says, essentially, that Python defines the or expression (or_test) as a sequence of and expressions (and_test). If you follow the syntax definition further, however, you will notice that and_test expands to comparisons (a < b, etc.), arithmetic expressions (x + y, etc.), list & dict constructors ([foo, bar], etc.), and finally to atoms such as literal strings and numbers.

What you won’t see along the way are lambda definitions:

lambdef: 'lambda' [varargslist] ':' test

In fact, the branch to allow them is directly above the or_test:

test: or_test ['if' or_test 'else' test] | lambdef

As you can see, the rule puts lambdas at the same syntactical level as conditional expressions (x if a else b), which is very high up. The only thing you can do with a lambda to make a larger expression is to add a yield keyword before it[2], or follow it with a comma to create a tuple[3].

You cannot, however, pass it as an argument to a binary operator, even if it otherwise makes sense and even looks unambiguous. This is also why the nonsensical expressions such as this one:

1 + lambda: None

will fail not with TypeError but with SyntaxError, as they won’t even be evaluated.

More parentheses

Savvy readers may have noticed that this phenomenon is very much reminiscent of the issue of operator precedence.

Indeed, in Python and in many other languages it is the grammar that ultimately specifies the order of operations. It does so simply by defining how expressions can be constructed.

Addition, for example, will be of lower priority than multiplication simply because a sum is said to consist of terms that are products:

arith_expr: term (('+'|'-') term)*
term: factor (('*'|'/'|'%'|'//') factor)*

This makes operator precedence a syntactic feature, and its resolution is baked into the language parser and handled implicitly[4].
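One way to watch the parser make that decision is to look at the tree it builds. The following is a small sketch using the standard ast module (written against Python 3); the variable names are mine:

```python
import ast

# Parse a single expression and look at the outermost operation.
tree = ast.parse('1 + 2 * 3', mode='eval')
top = tree.body

# The addition ends up on top; the multiplication is nested as its
# right operand, so it gets evaluated first.
top_op_name = type(top.op).__name__
nested_op_name = type(top.right.op).__name__
```

No precedence table is consulted at runtime: the shape of the tree already encodes the order of operations.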

We know, however, that precedence can be worked around where necessary by enclosing the operator and its arguments in a pair of parentheses. On the syntax level, this means creating an entirely new, top-level expression:

atom: ('(' [yield_expr|testlist_comp] ')' |  # parenthesized expression
       '[' [listmaker] ']' |
       '{' [dictorsetmaker] '}' |
       '`' testlist1 '`' |
       NAME | NUMBER | STRING+)

There, it is again possible to use even the highest-level constructs, including also the silly stuff such as trying to add a number to a function:

1 + (lambda: None)

This expression will now parse correctly, and produce TypeError as expected.
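The difference between the two failure modes can be checked without crashing the interpreter, by compiling the expressions from strings. This is just an illustrative sketch; outcome is a made-up helper, not anything from the standard library:

```python
def outcome(source):
    """Report whether an expression fails at parse time or at run time."""
    try:
        code = compile(source, '<example>', 'eval')
    except SyntaxError:
        return 'SyntaxError'   # rejected by the grammar, never evaluated
    try:
        eval(code)
    except TypeError:
        return 'TypeError'     # parsed fine, failed during evaluation
    return 'ok'


bare = outcome('1 + lambda: None')
parenthesized = outcome('1 + (lambda: None)')
```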

In the end, the resolution of our initial dilemma is therefore rather simple:

casefold = getattr(str, 'casefold', None) or (lambda s: s.lower())
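For completeness, here is how the shim might be put to use. The equals_ignoring_case function is a hypothetical example of mine, and the sketch still ignores the Python 2 unicode caveat mentioned earlier:

```python
# Prefer str.casefold where it exists, fall back to lowercasing otherwise.
casefold = getattr(str, 'casefold', None) or (lambda s: s.lower())


def equals_ignoring_case(a, b):
    """Compare two strings case-insensitively using the shim."""
    return casefold(a) == casefold(b)
```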

  1. Such rules are sometimes called productions of the grammar, a term from computational linguistics. 

  2. Yes, yield foo is an expression. Its result is the value sent to the generator by outer code via the send method. Since most generators are used as iterables, typically no values are passed this way, so the result of a yield expression is None.

  3. There are also legacy corner cases of lambdas in list/dict/etc. comprehensions, but those only apply under Python 2.x.

  4. That said, there are languages where the order is resolved at a later stage, after the expressions have already been parsed. They usually allow the programmer to change the precedence of their own operators, as is the case in Haskell.


& vs. ref in Rust patterns

Posted on Thu 02 June 2016 in Code • Tagged with Rust, pattern matching, borrowing, references

Rust is one of those nice languages with pattern matching. If you don’t know, it can be thought of as a generalization of the switch statement: comparing objects not just by value (or overloaded equality operator, etc.) but by structure:

match hashmap.get(&key) {
    Some(value) => do_something_with(value),
    None => { panic!("Oh noes!"); },
}

It doesn’t end here. As you can see above, objects can also be destructured during the match (Some(value)), their parts assigned to bindings (value), and those bindings can subsequently be used in the match branch.

Neat? Definitely. In Rust, pattern matching is the bread and butter of not only the match statement, but also for, (if) let, and even ordinary function arguments.

Mixing in Rust semantics

For a long time, however, I was somewhat confused as to what happens when references and borrowing are involved in matching. The two “operators” that often occur there are & (ampersand) and ref.

You should readily recognize the first one, as it is used pervasively in Rust to create references (and reference types). The second one quite obviously hints towards references as well. Yet those two constructs serve very different purposes when used within a pattern.

To add to the confusion, they are quite often encountered together:

use hyper::Url;

// print query string params of some URL
let url = Url::parse(some_url).unwrap();
let query_params: Vec<(String, String)> = url.query_pairs().unwrap_or(vec![]);
for &(ref name, ref value) in &query_params {
    println!("{}={}", name, value);
}

Lack of one or the other will be (at best) pointed out to you by the compiler, along with a suggestion where to add it. But addressing problems in this manner can only go so far. So how about we delve deeper and see what it’s really about?

Part of the reference, part of the pattern

Rust is very flexible as to what value can be a subject of pattern matching. You would be very hard pressed to find anything that cannot be used within a match statement, really. Both actual objects and references to objects are perfectly allowed:

struct Foo(i32);
// ...
let foo = &Foo(42);
match foo {
    x => println!("Matched!"),
}

In the latter case, however, we aren’t typically interested in the reference itself (like above). Instead, we want to determine some facts about the object it points to:

match foo {
    &Foo(num) => println!("Matched with number {}", num),
}

As you can see, this is where the ampersand comes into play. Just like a type constructor (Some, Ok, or Foo above), the & operator informs the Rust compiler what kind of value we’re expecting from the match. When it sees the ampersand, it knows we’re looking for references to certain objects, and not for the objects themselves.

Why is the distinction between an object and its reference important, though? In many other places, Rust is perfectly happy to blur the gap between references and actual objects[1], for example when calling most of their methods.

Pattern matching, however, due to its ability to unpack values into their constituent parts, is a destructive operation. Anything we apply match (or similar construct) to will be moved into the block by default:

let maybe_name = Some(String::from("Alice"));
// ...
match maybe_name {
    Some(n) => println!("Hello, {}", n),
    _ => {},
}
do_something_with(maybe_name);

Following the typical ownership semantics, this will prevent any subsequent moves and essentially consume the value:

error: use of partially moved value: `maybe_name` [E0382]
    do_something_with(maybe_name);
                      ^~~~~~~~~~

So just like the aforementioned type constructors (Some, etc.), the ampersand operator is simply part of the pattern that we match against. And just like with Some and friends, there is an obvious symmetry here: if & was used to create the value, it needs to be used when unpacking it.

The syntax used in a pattern that destructures an object is analogous to one used by the expression which created it.

Preventing the move

Errors like the one above often contain helpful notes:

note: `(maybe_name:core::option::Option::Some).0` moved here because it has type `collections::string::String`, which is moved by default
         Some(n) => println!("Hello, {}", n),
              ^

as well as hints for resolving them:

help: if you would like to borrow the value instead, use a `ref` binding as shown:
        Some(ref n) => println!("Hello, {}", n),

Here’s where ref enters the scene.

The message tells us that if we add a ref keyword in the suggested spot, we will switch from moving to borrowing for the match binding that follows (here, n). It will still capture its value exactly as before, but it will no longer assume ownership of it.

This is the crucial difference.

Unlike the ampersand, ref is not something we match against. It doesn’t affect what values match the pattern it’s in, and what values don’t[2].

The only thing it changes is how parts of the matched value are captured by the pattern’s bindings:

  • by default, without ref, they are moved into the match arms
  • with ref, they are borrowed instead and represented as references

Looking at our example, the n binding in Some(n) is of type String: the actual field type from the matched structure. By contrast, the other n in Some(ref n) is a &String — that is, a reference to the field.

One is a move, the other one is a borrow.

ref annotates pattern bindings to make them borrow rather than move. It is not a part of the pattern as far as matching is concerned.

Used together

To finish off, let’s untangle the confusing example from the beginning of this post:

for &(ref name, ref value) in &query_params {
    println!("{}={}", name, value);
}

Since we know ref doesn’t affect whether or not the pattern matches, we could just as well have something like &(a, b). And this should be quite a bit easier to read: it clearly denotes we expect a reference to a 2-tuple of simple objects. Not coincidentally, such tuples are items from the vector we’re iterating over.

Problem is, without the refs we will attempt to move those items into the loop scope. But due to the way the vector is iterated over (&query_params), we’re only borrowing each item, so this is actually impossible. In fact, it would be a classic attempt to move out of a borrowed context.

It is also wholly unnecessary. The only thing this loop does is printing the items out, so accessing them through references is perfectly fine.

And this is exactly what the ref operator gives us. Adding the keyword back, we will switch from moving the values to just borrowing them instead.

To sum up

  • & denotes that your pattern expects a reference to an object. Hence & is a part of said pattern: &Foo matches different objects than Foo does.

  • ref indicates that you want a reference to an unpacked value. It is not matched against: Foo(ref foo) matches the same objects as Foo(foo).


  1. The technical term for this is a Deref coercion.

  2. We can say that it doesn’t affect the satisfiability (or conversely, refutability) of the pattern. 


Please don’t use Click

Posted on Fri 20 May 2016 in Programming • Tagged with Python, CLI, UI, Click

…not for standalone programs anyway.

Chances are, you have written some command line programs in Python. This is quite probable even if you normally code in some other language. And if you have, it is not unlikely that you needed to parse the argv of your program at one point or another.

There are plenty of options here, both in the standard library as well as among third party packages. One does stand out, however, and it’s mostly for how it is often overused. I’m talking about Click here.

If you wanted to use it in your next Python program, I hereby urge you to reconsider.

What’s the fuss?

The somewhat bizarrely named Click library is described as a “package for creating beautiful command line interfaces”. Its main trick is the ability to create subcommands by adorning Python functions with the @click.command() decorator[1]. It then makes them coalesce into an argument parser, equipped with the necessary dispatching logic.

This idea isn’t new, of course. Prior art goes back at least seven years to the now-abandoned opster package. Click, however, was the first one of its kind to garner noticeable popularity, which is easily attributed to whom it’s been authored by.

So while my arguments against using this kind of CLI framework would apply to any package implementing the paradigm, it just happens that Click is currently its most prominent example. Purely for the sake of convenience, I will therefore refer to it as if it was interchangeable with the whole concept. Because why not? Whatever you may say about the library’s name, it’s hard to imagine a more concise moniker than a simple Click.

What’s wrong, then, with the way Click handles command line interfaces?

CLI: Little Interfaces

It’s how it encourages us to treat them as an accidental afterthought rather than a deliberate design decision.

For applications invoked repeatedly from a terminal, their command line arguments and flags are the primary means of user interaction[2]. It is how users communicate their intent to perform an action; provide the necessary input data to carry it through; decide how they want to receive the output; and control many other aspects of the program’s execution. Absent graphical components and widgets, the command line is virtually the only way to interact with a terminal program.

In other words, it is the UI.

And how important is the UI for any application? It seems to be important enough that entire fields of study are devoted to reducing the friction of human-computer interaction. In many projects, the emphasis on user interface design is on par with that of actual software engineering.
Like everything, of course, it is susceptible to trends and fads (such as the recent “mobile/responsive everything!” craze). But its significance remains undiminished. Quite the opposite: in the age of ubiquitous computing, user interfaces are probably more important than ever.

Yes, this includes CLI. The main reasons we turn to the command line are speed and efficacy. Common tasks must utilize short and convenient syntax that is quick to integrate into the user’s muscle memory. Others should not only be possible, but discoverable and accessible without going through reams of man pages.

Any terminal program intended for frequent use by humans should therefore strive to excel in those two qualities. But except for the simplest of cases, it won’t happen by itself. Designing an efficient CLI for any non-trivial application is a challenging and demanding task.

It doesn’t click

With Click, however, we’re encouraged to just wing it.

Click tells us to slap some decorators on our top-level functions and call it a day. Sure, you can dig deep enough and uncover the underlying layers of abstraction that may eventually allow you to do things for which argparse has first-class support.

By default, however, Click shoehorns your programs into predefined patterns that, incidentally, mirror those of some of the least intuitive command-line tools in existence.

Indeed, the whole idea of subdividing your program into several distinct commands is already suspect, for it appears at odds with the fundamental Unix philosophy of doing one thing well. While it is occasionally justified, it shouldn’t be the first thing that comes to your mind. But that’s completely at odds with Click’s approach, where not ending up with multiple distinct commands is something you have to consciously avoid.

…though it sometimes might

So, what am I suggesting you use instead of libraries such as Click?… Nothing outrageous, really.

If you care about your command line interface, consider just using the argparse module. Yes, it will force you to create parser objects, add arguments & flags to it, and in general pay some attention to the whole business. When it comes to UI, it’s always good to make it an explicit concern, maybe even sufficient to warrant its own module.
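To give a rough idea of what that explicit approach looks like, here is a minimal sketch of a parser living in its own function. The program name fetch and all of its arguments are invented for the sake of the example:

```python
import argparse


def create_argument_parser():
    """Build the CLI in one place, so it can be reviewed as an interface."""
    parser = argparse.ArgumentParser(
        prog='fetch',  # hypothetical program name
        description='Download a URL and print or save its content.')
    parser.add_argument('url', help='address of the resource to download')
    parser.add_argument('-o', '--output', default='-', metavar='FILE',
                        help='where to write the result ("-" means stdout)')
    parser.add_argument('-q', '--quiet', action='store_true',
                        help='suppress progress messages')
    return parser


# Parsing an explicit list instead of sys.argv keeps the example testable.
args = create_argument_parser().parse_args(['-q', 'http://example.com'])
```

Every flag, default, and help string is spelled out where you can see it, which is rather the point.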

Alternatively, the docopt library provides another take on the UI-first approach to CLI, though it is more limited in its capabilities[3].

Finally, I’m not advocating to ditch Click in all scenarios. There are plenty of situations when we’re interested in getting any CLI up and running, and not so much in making the most efficient and intuitive interface possible. The prime example is any kind of automation script that is ancillary to some bigger project, like manage.py is in Django[4]. The Python ecosystem doesn’t really have dedicated task runners that are as featureful as Grunt or Gulp, and that makes Click a viable and compelling option[5].

But for standalone programs whose CLI is the main interface? Yeah, not really.


  1. Oddly enough, that pair of parentheses seems to be mandatory. 

  2. Environment variables and config files deserve an honorary mention, of course. But those are usually derivatives of the command line arguments, containing e.g. the default values for flags.

  3. Click’s own documentation actually describes quite nicely how theirs and docopt’s philosophies differ in a way that’s consistent with this article.

  4. Incidentally, this appears to be a major motivation behind creating Click in the first place: to support web applications built upon the Flask framework, and possibly obviate the need for extensions such as Flask-Script.

  5. That said, there are some task runners which offer a similar experience, like Invoke.


Mock.configure_mock fix for Python

Posted on Sat 07 May 2016 in Code • Tagged with Python, mock, patching

Python’s mocking library is rather uncomplicated. Most of what it does is creating mock objects: veritable sponges that absorb every interaction with any code that we pass them to.

This simplicity is also surfaced in the API, especially in the main part of it — the mock.Mock constructor:

some_mock = mock.Mock(url='http://example.com')
assert some_mock.url == 'http://example.com'

Any arguments that we pass there become attributes on the resulting mock object. This is really useful when patching, because it allows us to completely specify the replacement object within a @mock.patch decorator:

@patch.object(requests, 'get', new=Mock(return_value=Mock(status_code=404)))
def test_404(self):
    with self.assertRaises(NotFoundError):
        do_stuff()

You have to keep in mind, however, that the mock.Mock class also has some constructor arguments of its own. For this reason, there exists some potential for name collision: some of Mock’s own arguments may have the same names as the attributes we’d like to set on the mock object:

some_mock = mock.Mock(name="John Doe")  # doesn't set the `name` attribute
assert some_mock.name == "John Doe"  # blows up!

Here, the name argument is inherent to the Mock class. Its constructor will interpret it in a special way, and so it won’t set a name attribute on the resulting mock. Other possible culprits include the spec and wraps parameters, both of which have relatively common names that we may want to use as object attributes[1].
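What does the name argument do, then? A quick experiment shows where the value ends up. Note that this sketch assumes Python 3, where the library ships as unittest.mock rather than the standalone mock package used elsewhere in this post:

```python
from unittest import mock  # assumption: Python 3's bundled version of the library

named = mock.Mock(name="John Doe")

# The value is used for the mock's repr, not stored as an attribute...
name_in_repr = "John Doe" in repr(named)

# ...so accessing .name just auto-creates yet another child mock.
attribute = named.name
```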

Collision avoidance

It’s trivial to fix the issue, of course:

some_mock = mock.Mock()
some_mock.name = "John Doe"
assert some_mock.name == "John Doe"

but this approach has a downside. Creating and configuring a mock is no longer a single expression, which means we cannot use it with patchers as easily as before:

@patch.object(foo, 'bar')
def test_something(self, mock_bar):
    mock_bar.name = "John Doe"
    # (...rest of the test...)

We can either configure the mock after patching, like above, or perhaps introduce some utility functions to be called inside the @patch decorator.

The almost-there method

In any case, this is somewhat disappointing. And it is even more so when we discover that there is a method called configure_mock which looks like it was designed to solve this very issue. Its arguments are always interpreted as attributes of the mock: it has no “special” or “reserved” names. Indeed, this method is what allows us to actually write the mock setup as a single expression:

some_mock.configure_mock(name="John Doe")

Problem is, this expression returns None.

Yes, configure_mock returns nothing.
Or in other words, it doesn’t return anything.
In fact, it has no return statement whatsoever.

Most importantly, it doesn’t have the return self line that’d enable us to write this:

some_mock = mock.Mock().configure_mock(name="John Doe")

Well, that is quite a let-down.
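The behavior is easy to confirm. Again, this is a sketch that assumes the unittest.mock flavor of the library from Python 3:

```python
from unittest import mock  # assumption: Python 3's bundled version of the library

some_mock = mock.Mock()

# The attribute does get set correctly, reserved name or not...
result = some_mock.configure_mock(name="John Doe")

# ...but the call itself evaluates to None, so it cannot be chained.
configured_name = some_mock.name
```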

Fixing it

But hey, this is Python! Shortcomings like that don’t necessarily mean we have to fork whole libraries. Let’s just add the missing return, shall we?

from mock import Mock as _Mock

class Mock(_Mock):
    def configure_mock(self, **kwargs):
        super(Mock, self).configure_mock(**kwargs)
        return self  # <-- there!

Whew, that was quick!

…Alright, that’s not actually the whole fix, but it’s close. To complete it, we need to apply the same treatment to three more Mock classes: MagicMock, NonCallableMock, and NonCallableMagicMock.

A complete solution can be seen in this gist.
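For illustration, one possible way to cover all four classes without repeating the method is a small mixin. To be clear, this is my own sketch and not necessarily how the gist does it; the mixin name is made up, and the code assumes Python 3’s unittest.mock:

```python
from unittest import mock as _mock  # assumption: Python 3's bundled library


class _ReturnSelfMixin(object):
    """Hypothetical mixin that makes configure_mock chainable."""

    def configure_mock(self, **kwargs):
        super(_ReturnSelfMixin, self).configure_mock(**kwargs)
        return self


class Mock(_ReturnSelfMixin, _mock.Mock):
    pass


class MagicMock(_ReturnSelfMixin, _mock.MagicMock):
    pass


class NonCallableMock(_ReturnSelfMixin, _mock.NonCallableMock):
    pass


class NonCallableMagicMock(_ReturnSelfMixin, _mock.NonCallableMagicMock):
    pass


# Creation and configuration fit in a single expression again.
some_mock = Mock().configure_mock(name="John Doe")
```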


  1. Collision may also occur with mock.patch constructs. The most likely offender there is probably the new parameter.


Source code of a Python lambda

Posted on Tue 19 April 2016 in Code • Tagged with Python, functions, AST, bytecode

…or: The Most Hideous Hack I’ve (Almost) Done

In callee, the argument matcher library for Python that I released recently, there is this lovely TODO note for a seemingly simple feature. When using the Matching construct with a simple lambda predicate:

mock_foo.assert_called_with(Matching(lambda x: x % 2 == 0))

it would be great to see its code in the error message if the assertion fails. Right now it’s just going to say something like <Matching <function <lambda> at 0x7f5d8a06eb18>>. Provided you don’t possess a supernatural ability of dereferencing pointers in your head, this won’t give you any immediate hint as to what went wrong. Wouldn’t it be nice if it read as, say, <Matching \x: x % 2> instead?[1]

So I thought: why not try and implement such a mechanism? This is Python, after all — a language where you can spawn completely new classes at runtime, walk the stack backwards (or even forward) and read the local variables, or change the behavior of the import system itself. Surely it would be possible — nay, easy — to get the source code of a short lambda function, right?

Boy, was I wrong.

Make no mistake, though: the task turned out to be absolutely doable, at least in the scope I wanted it done. But what would you think of a solution that involves not just the usual Python hackery, but also AST inspection, transformations of the source code as text, and bytecode shenanigans?…

The code, all the code, and… much more than the code

Let’s start from the beginning, though. Here’s a short lambda function, the kind of which we’d like to obtain the source code of:

is_even = lambda x: x % 2 == 0

If the documentation for Python standard library is to be believed, this should be pretty easy. In the inspect module, there is a function called none other than getsource. For our purposes, however, getsourcelines is a little more convenient, because we can easily tell when the lambda is too long:

import inspect

def get_short_lambda_source(lambda_func):
    try:
        source_lines, _ = inspect.getsourcelines(lambda_func)
    except IOError:
        return None
    if len(source_lines) > 1:
        return None
    return source_lines[0].strip()

Of course, if you have programmed in Python for any length of time, you know very well that the standard docs are not to be trusted. And it’s not just that the except clause should also include TypeError, because it will be thrown when you try to pass any of the Python builtins to getsourcelines.
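With that TypeError taken into account, the helper might look as follows. It is the same function as before with a wider except clause, and the call with len at the end is merely my way of exercising the new branch:

```python
import inspect


def get_short_lambda_source(lambda_func):
    """Return the single source line of a short lambda, or None."""
    try:
        source_lines, _ = inspect.getsourcelines(lambda_func)
    except (IOError, TypeError):
        # IOError: the source is unavailable;
        # TypeError: the object is a builtin with no Python source at all.
        return None
    if len(source_lines) > 1:
        return None
    return source_lines[0].strip()


# Builtins are implemented in C, so there is no source to return.
builtin_result = get_short_lambda_source(len)
```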

More important is the ambiguity of what “source lines for an object” actually means. “Source lines containing the object definition” would be much more accurate, and this seemingly small distinction is rather crucial here. Passing a lambda function to either getsourcelines or getsource, we’ll get its source and everything else that the returned lines included.

That’s right. Say hello to the complete is_even = assignment, and the entire assert_called_with invocation! And in case you are wondering: yes, the result will also include any end-of-line comments. No token left behind!
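Should you want to see it for yourself, the sketch below writes a one-line module to a temporary file, imports it, and asks inspect for the lambda’s source. The module name lambda_demo and the whole tempfile dance are scaffolding of mine, there only so that the example doesn’t depend on where it is run from (Python 3 assumed):

```python
import importlib.util
import inspect
import os
import tempfile

SOURCE = "is_even = lambda x: x % 2 == 0  # a trailing comment\n"

with tempfile.TemporaryDirectory() as tmp:
    # inspect needs a real file on disk to read the source from.
    path = os.path.join(tmp, 'lambda_demo.py')
    with open(path, 'w') as f:
        f.write(SOURCE)

    spec = importlib.util.spec_from_file_location('lambda_demo', path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # Ask for the source of the lambda alone...
    lines, _ = inspect.getsourcelines(module.is_even)

# ...and receive the whole line: assignment, comment and all.
first_line = lines[0]
```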

Trim left

Clearly this is more than we’ve bargained for. Maybe there is a way to strip away the unnecessary cruft? Python does know how to parse itself, after all: the standard ast module is a manifestation of this knowledge. Perhaps we can use it to retrieve the lambda AST node in order to turn it — and just it — back into Python code?…

import ast

def get_short_lambda_ast_node(lambda_func):
    source_text = get_short_lambda_source(lambda_func)
    if source_text:
        source_ast = ast.parse(source_text)
        return next((node for node in ast.walk(source_ast)
                     if isinstance(node, ast.Lambda)), None)

But as it turns out, getting the source text back this way is only mostly possible.

See, every substantial AST node — which is either an expression (ast.expr) or a statement (ast.stmt) — has two common attributes: lineno and col_offset. When combined, they point to a place in the original source code where the node was parsed from. This is how we can find out where to look for the definition of our lambda function.

Looks promising, right? The only problem is we don’t know when to stop looking. That’s right: nodes created by ast.parse are annotated with their start offset, but not with length nor the end offset. As a result, the best we can do when it comes to carving out the lambda source from the very first example is this:

lambda x: x % 2 == 0))
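Here is a small sketch of how that fragment comes about, working directly on the source line as a string rather than on a live function object:

```python
import ast

line = "mock_foo.assert_called_with(Matching(lambda x: x % 2 == 0))"

# Find the first lambda node anywhere in the parsed line.
lambda_node = next(node for node in ast.walk(ast.parse(line))
                   if isinstance(node, ast.Lambda))

# We know where the lambda starts, so we can cut off everything before it,
# but this alone tells us nothing about where it ends.
tail = line[lambda_node.col_offset:]
```

As far as I know, interpreters from Python 3.8 onwards additionally record end_lineno and end_col_offset on AST nodes, which would make the rest of this exercise largely moot. The approach described here assumes those attributes aren’t available.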

So close! Those hanging parentheses are evidently just taunting us, but how can we remove them? lambda is basically just a Python expression, so in principle it can be followed by almost anything. This is doubly true for lambdas inside the Matching construct, as they may be a part of some larger mock assertion:

mock_foo.assert_called_with(Matching(lambda x: x % 2 == 0), Integer() & GreaterThan(42))

Here, the extraneous suffix is the entirety of ), Integer() & GreaterThan(42)), quite a lot more than just )). And that’s of course nowhere near the limit of possibilities: for one, there may be more lambdas in there, too!

Back off, slowly

It seems, however, that there is one thing those troublesome tails have in common: they aren’t syntactically valid.

Intuitively, a lambda node nested within some other syntactical constructs will have their closing fragments (e.g. )) appear somewhere after its end. Without the corresponding openings (e.g. Matching(), those fragments won’t parse.

So here’s the crazy idea. What we have is invalid Python, but only because of some unspecified number of extra characters. How about we just try and remove them, one by one, until we get something that is syntactically correct? If we are not mistaken, this will finally be our lambda and nothing else.

Fortune favors the brave, so let’s go ahead and try it:

# ... continuing get_short_lambda_source() ...

source_text = source_lines[0].strip()
lambda_node = get_short_lambda_ast_node(lambda_func)

lambda_text = source_text[lambda_node.col_offset:]
min_length = len('lambda:_')  # shortest possible lambda expression
while len(lambda_text) > min_length:
    try:
        ast.parse(lambda_text)
        return lambda_text
    except SyntaxError:
        lambda_text = lambda_text[:-1]
return None
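Since the snippet above is only a fragment, here is a self-contained sketch of the same idea that takes a line of source text instead of a function object. The name carve_lambda_text is hypothetical; the logic mirrors the fragment, including its loop condition:

```python
import ast

MIN_LAMBDA_LENGTH = len('lambda:_')  # shortest possible lambda expression


def carve_lambda_text(source_line):
    """Cut the first lambda out of a line by trimming its invalid tail."""
    source_line = source_line.strip()
    try:
        tree = ast.parse(source_line)
    except SyntaxError:
        return None
    lambda_node = next((node for node in ast.walk(tree)
                        if isinstance(node, ast.Lambda)), None)
    if lambda_node is None:
        return None

    # Start at the lambda and drop characters from the right
    # until whatever remains parses on its own.
    lambda_text = source_line[lambda_node.col_offset:]
    while len(lambda_text) > MIN_LAMBDA_LENGTH:
        try:
            ast.parse(lambda_text)
            return lambda_text
        except SyntaxError:
            lambda_text = lambda_text[:-1]
    return None
```

Both mock assertions shown earlier should come out as just the lambda when passed through this function.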

Considering that we’re basically taking lessons from the dusty old tomes in the Restricted Section of Hogwarts library, the magic here looks quite simple. As long as there is something that can pass for a lambda definition, we try to parse it and see if it succeeds. The line that says except SyntaxError: is obviously not something for the faint of heart, but at least we are specifying what exception we anticipate catching.

And the kicker? It works. By that I mean it doesn’t return garbage results for a few obvious and not so obvious test cases, which is already more than you would normally expect from hacks of this magnitude. All the lambdas defined until this paragraph, for example, can have their source code extracted without issue.

Just one more thing

So… victory? Not quite. Astute readers may recall my promise of some bytecode arcana, and now’s the time for it.

Despite the initial success of our gradual, character-dropping approach, there are cases where it doesn’t produce the correct result. Consider, for example, a lambda definition that’s nestled within a tuple[2]:

>>> x = lambda _: True, 0
>>> get_short_lambda_source(x[0])
lambda _: True, 0

We would of course expect the result to be lambda _: True, without a comma or zero.

Unfortunately, here’s where our earlier assumption fails rather spectacularly. The line of code extracted from AST is syntactically valid even with the extra characters. As a result, ast.parse succeeds too early and we return an incorrect definition. It should have been that of a lambda contained within a tuple, but what we get reads as if the tuple were what the lambda returns.
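A quick check with the ast module shows how Python itself reads the over-long text. This is only an illustration, and it assumes the candidate is parsed in 'eval' mode:

```python
import ast

tree = ast.parse("lambda _: True, 0", mode="eval")

# The top-level node is a tuple; the lambda is merely its first element.
print(type(tree.body).__name__)
print([type(element).__name__ for element in tree.body.elts])
```

The parser is perfectly happy with the text, which is exactly why a syntax check alone cannot tell us when to stop trimming.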

You may say that this is the sharp end of a narrow edge case, and anyone who defines functions like that deserves all the trouble they get. And sure, I wouldn’t mind if we just threw our hands in the air and told them we’re simply unable to retrieve the source here. But my opinion is that it doesn’t justify serving them obviously wrong results!

A halting problem

Not if we can help it, anyway. Have a look at the expected source code and the one we’ve extracted, side by side:

lambda _: True
lambda _: True, 0

The second line isn’t just longer: it is also doing more. It isn’t just defining a lambda; it defines it, conjures up a constant 0, and then packs them both into a tuple. That’s at least two additional steps compared to the original.

Those steps have a more precise name, too: they are the bytecode instructions. Every piece of Python source is compiled to a binary bytecode before it’s executed, because the interpreter can only work with this representation. Compilation typically happens when a Python module is first imported, producing a .pyc file corresponding to its .py file. Subsequent imports will simply reuse the cached bytecode.

Moreover, any function object (including the methods of a class) has its bytecode accessible (read-only) at runtime through its __code__ attribute. There is even a dedicated data type to hold it, called simply code, with a buffer of raw bytes under its co_code attribute.

Finally, the bytecode compiler itself is also available to Python programs as a built-in compile function. You don’t see it used as often as its counterparts eval and exec (which hopefully are a rare sight themselves!), but it taps into the same internal machinery of Python.
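As a small illustration of these pieces (assuming Python 3; the lambda and the "<example>" filename are placeholders), a function's code object and the result of compile are the same kind of thing:

```python
func = lambda x: x + 1

# Every function carries a code object; its raw bytecode lives in co_code.
code = func.__code__
print(type(code).__name__)
print(type(code.co_code).__name__, len(code.co_code))

# compile() produces the same kind of object straight from source text.
compiled = compile("x + 1", "<example>", "eval")
print(type(compiled) is type(code))
print(eval(compiled, {"x": 41}))
```

The exact number of bytes in co_code depends on the interpreter version, so it is printed here rather than assumed.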

So how does it all add up? The idea is, basically, to cross-check the alleged source code of the lambda with its own bytecode. Any junk that’s still left to trim — even if syntactically valid — will surface as a divergence after compilation. Thus we can simply continue dropping characters until the bytecodes match:

lambda_text = source_text[lambda_node.col_offset:]
lambda_body_text = source_text[lambda_node.body.col_offset:]
min_length = len('lambda:_')  # shortest possible lambda expression
while len(lambda_text) >= min_length:  # >= so a lambda of exactly minimal length is still tried
    try:
        code = compile(lambda_body_text, '<unused filename>', 'eval')
        if len(code.co_code) == len(lambda_func.__code__.co_code):
            return lambda_text
    except SyntaxError:
        pass
    lambda_text = lambda_text[:-1]
    lambda_body_text = lambda_body_text[:-1]
return None

Okay, maybe not the exact bytes[3], but stopping at an identical bytecode length is a good enough strategy. As an obvious bonus, compile will also take care of detecting syntax errors in the candidate source code, so we don’t need the ast parsing anymore.
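The length comparison can be tried on its own with a rough sketch like the one below. The helper bytecode_length and the names x and y are placeholders. Exact byte counts vary between interpreter versions, and some versions may fold a constants-only tuple into a single constant, so the sketch uses plain names rather than True and 0.

```python
def bytecode_length(expression):
    """Compile an expression and return the size of its raw bytecode."""
    code = compile(expression, '<candidate>', 'eval')
    return len(code.co_code)


exact_body = bytecode_length('x')
with_junk = bytecode_length('x, y')  # also loads y and builds a tuple

print(exact_body, with_junk)
```

The candidate with trailing junk compiles to more instructions, which is the divergence the loop above relies on.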

That escalated quickly!

Believe it or not, there aren’t any more objections to this solution. You can view it in its glorious entirety by looking at this gist.

Does it mean it is also making its cameo in the callee library?…

No, I’m afraid not.

Normally, I’m not one to shy away from, ahem, bold solutions to tough problems. But in this case, the magnitude of hackery required is just too great, the result isn’t satisfactory enough, the feature’s priority isn’t really all that high, and the maintenance burden it’d introduce is most likely too large.

In the end, it was great fun figuring it out: yet another example of how you can fiddle with Python to do basically anything. Still, we must not get so preoccupied with whether or not we can that we forget to ask if we should.


  1. Backslash (\) is how lambda functions are denoted in Haskell. We want to be short and sweet, so it feels like a natural choice. 

  2. This isn’t an actual snippet from a Python REPL, because inspect.getsourcelines requires the object to be defined in a .py file. 

  3. Why won’t we always get identical bytecode? The short answer is that some instructions may be swapped for their approximate equivalents.
    The long answer is that with compile, we aren’t able to replicate the exact closure environment of the original lambda. When a function refers to a free variable (like foo in lambda x: x + foo), it is its closure where the value for that variable comes from. For ad-hoc lambdas, this is typically the local scope of its outer function.
    Code produced by compile, however, isn’t associated with any such local scope. All free names are thus assumed to refer to global variables. Because Python uses different bytecode instructions for referencing local, closure, and global names (LOAD_FAST, LOAD_DEREF, and LOAD_GLOBAL, respectively), the result of compile may differ from a piece of bytecode produced in the regular manner. 
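The difference described in this footnote can be observed with the dis module. The following is only a sketch, with make_adder and opnames as made-up helper names:

```python
import dis


def make_adder(foo):
    # foo is a free variable of the lambda, supplied by its closure
    return lambda x: x + foo


def opnames(code_or_function):
    """Collect the names of all bytecode instructions used."""
    return {ins.opname for ins in dis.get_instructions(code_or_function)}


closure_ops = opnames(make_adder(1))

# Compile the same text on its own; the lambda's code object ends up
# among the constants of the compiled expression.
expression = compile('lambda x: x + foo', '<example>', 'eval')
lambda_code = next(c for c in expression.co_consts if hasattr(c, 'co_code'))
standalone_ops = opnames(lambda_code)

print('LOAD_DEREF' in closure_ops, 'LOAD_GLOBAL' in standalone_ops)
```

The same source text thus yields a closure lookup in one case and a global lookup in the other.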


In Microsoft we trust

Posted on Fri 08 April 2016 in Thoughts • Tagged with Microsoft, Windows, GitHub, Apple, Facebook, Google, tech culture

Just like many other people, I was following the news from last week’s BUILD conference with piqued interest. The ability to run Linux userland programs on Windows (including, of course, bash) is something to be excited about. If nothing else, it should dramatically improve Windows support of new programming languages that seem to pop up all the time.

There was something else, however, that I couldn’t help but notice. The reactions of tech communities to this and similar developments focused very frequently on Microsoft itself.

The beloved “new Microsoft”, as some call it, embraces open source, supports Linux, and generally does almost a full 180 with their stance on proprietary vs. free software. The circumstances fit this narrative rather snugly, too: a new CEO makes a clean break with the past to pivot the company in this new world ruled by mobile and cloud.

Still… love? Even considering how Internet comments are grotesquely exaggerated most of the time, that’s quite a declaration. The sentiment is nowhere near isolated either. But the question isn’t whether this infatuation has rational merit, or whether those feelings will eventually turn out to be misplaced.

The question is: why does it exist at all? We are talking about a company here, a for-profit organization. How can such language even enter the picture?

Then I realized this is not really a new phenomenon. Quite the opposite: the broad developer community seems to always need a company to champion its core values. Nowadays, we’re simply trying to find someone new to carry the standard.

Why? Because we feel that our old heroes have forsaken us.

Hall of past fame

Take GitHub for example. Once a darling of the open source community, it’s been suffering sharp criticism for many months now. Rightfully or not, many people aren’t exactly excited (to put it mildly) about changes to the policy and atmosphere at GitHub, typified by the ill-fated meritocracy rug. The widely backed and long-standing plea for a few critical features has only recently stopped falling on deaf ears. And in the background, there are always concerns about GitHub simply becoming too big, and exerting too much control over the open source ecosystem.

Nota bene, the very same ecosystem it had once been lauded for nurturing.

Among the other flagship tech companies, neither Apple nor Facebook has ever garnered much goodwill. Sure, they are recognized for the pure utilitarian value of hardware they produce; or convenience, broad applicability, and stability of APIs they offer. Facebook may be scoring some additional points for trying to sort out the mess of frontend development, but many say it’s not doing anybody any favors. Neither company is easy to portray as a paragon of openness, though, and it’s probably easier to argue the exact opposite.

And then there is Google, of course[1]. Some time in the past few years, a palpable shift occurred in how the company is perceived by the techie crowd. It’s difficult to pinpoint the exact pivotal moment, and the one event that leaps to attention doesn’t seem sufficient to explain it. But the tone has been set, allowing news to be molded to fit it. Add this to the usual backdrop of complaints about the onerous interview process and general fearmongering, and the picture doesn’t look very bright.

Shades of grey

No similar woes seem to have been plaguing Microsoft as of late, even though their record of recent “unwholesome” deeds isn’t exactly clean either. Does it mean they are indeed the most fitting candidate for a (new) enterprise ally of the hacker community?…

Or maybe we can finally recognize the whole notion as the utterly silly concept it actually is. Hard to shake though it may be, this quasi-Manichaean mentality of assigning labels of virtue or sin is, at best, a naive idealism. At worst, it’s a peculiar kind of harmful partisanship that technologists are particularly susceptible to. You may recognize it as something that has a very long tradition in the hacker community.

But it doesn’t mean it’s a tradition worth keeping.


  1. I have no illusions I will be attributed any objectivity, but it bears mentioning again that nothing I say here is representative of anything but my own opinions. 


Pipe `xargs` into `find`

Posted on Sun 03 April 2016 in Code • Tagged with shell scripting, Bash, xargs, find, zsh

Here’s a trick that’s hardly new, but if you haven’t heard about it, it will save you a trip to a man page or two.

Assuming you’re a person who mostly prefers the terminal over some fancy GUI, you’ve probably used the find command along with xargs at least a few times. It’s very common, for example, to use the results of find as arguments to some other program. It could be something as simple as figuring out which modules in your project have grown slightly too large:

$ find . -name '*.py' | xargs wc -l | sort -hr
1467 total
 322 callee/base.py
 261 callee/general.py
 251 callee/collections.py
# etc.

We find them all first, then use xargs to build a long wc invocation, and finally display the results in reverse order. Pretty easy stuff: I don’t usually have to try more than a dozen times to get it right![1]

But how about the opposite situation? Let’s say you have a list of directories you want to search through with find. Doing so may seem easy enough[2]:

$ cat packagedirs.txt | xargs find -name '__init__.py'

Except it’s not going to work. Like a few other Unix commands, find is very particular about the order of arguments it receives. Not only are the predicate flags (like -name) considered in sequence, but they also have to appear after the directories we want to search through.

But in the xargs invocation above, essentially the opposite is going to happen.

The replacement flag

So how to remedy this? Enter the -I flag to xargs:

$ cat packagedirs.txt | xargs -I{} find {} -name '__init__.py'

This flag will tell xargs quite a few things.

The most important one is to stop putting the arguments at the end of the command invocation. Instead, it shall place them wherever it sees the replacement string (here, a pair of braces[3]: {}). And because we placed the braces where find is normally expecting the list of directories to search through, the command will now get us exactly the results we wanted.

What’s almost impossible to see, however, is that it may not obtain those results in exactly the way we intended. The difference is easier to spot when we replace find with echo:

$ cat >/tmp/list
foo
bar
$ cat /tmp/list | xargs echo
foo bar
$ cat /tmp/list | xargs -I{} echo {}
foo
bar

or, better yet, use xargs with the -t flag to print the commands on stderr before executing them:

$ cat packagedirs.txt | xargs -I{} -t find {} -name '__init__.py' >/dev/null
find callee -name '__init__.py'
find tests -name '__init__.py'

As you can see, we actually have more than one find invocation here!

This is the second effect of -I: it causes xargs to execute the given command line for each argument separately. It so happens that it doesn’t really make any difference for our usage of find, which is why it wasn’t at all obvious we were running it multiple times.
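The two behaviors can be modeled with a toy Python sketch. This is not how xargs is implemented: the function names are invented, and the model ignores input parsing, quoting, and the command-length limits that can make the real tool split its work further.

```python
def xargs_append(command, items):
    """Default mode: one invocation with all items appended at the end."""
    return [command + items]


def xargs_replace(command, items, placeholder='{}'):
    """-I mode: one invocation per item, substituted for the placeholder."""
    return [[arg.replace(placeholder, item) for arg in command]
            for item in items]


dirs = ['callee', 'tests']
print(xargs_append(['find', '-name', '__init__.py'], dirs))
print(xargs_replace(['find', '{}', '-name', '__init__.py'], dirs))
```

The first call yields a single command line with the directories in the wrong place for find; the second yields one well-formed command line per directory.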

To avoid problems, though, you should definitely be cognizant of this fact when calling other programs with xargs -I.

Make arguments spaced again

Incidentally, I’m not aware of any method that’d actually make xargs produce find foo bar -name ... calls. If you need this exact form, probably the easiest way is to use plain old shell variables:

$ (d=$(cat packagedirs.txt); find $d -name '*.py')

This takes advantage of the word splitting feature of Bash and a few other compatible shells. The caveat is that you may be using a shell where this behavior is disabled by default. In that case, find would interpret the content of $d as a single directory name: foo bar rather than foo and bar.

zsh is one such shell. Although probably a good thing overall, in times like these you’d want to bring the “normal” behavior back. In zsh, it’s fortunately pretty simple:

$ (d=$(cat packagedirs.txt); find ${=d} -name '*.py')

What about a portable solution? As far as I can tell, the only certain way you can ensure word splitting occurs is to use eval. Here, the xargs command can actually come in handy again, albeit only as a prop:

$ (d=$(cat packagedirs.txt | xargs echo); eval "find $d -name '*.py'")

One would hope such hacks aren’t needed very often.


  1. A completely kosher version would also use the -print0 flag to find and the -0 flag to xargs. It’s not necessary here because Python module files cannot contain spaces. 

  2. Purists shall excuse my use of cat here; it’s merely for illustrative purposes. 

  3. This use of braces in find has of course nothing to do with the other possible occurrences of {} there, like in the -exec flag. Since you cannot force find to expect a different placeholder, you should use something else for xargs in those cases, e.g.: xargs -I^ find ^ -name '__main__.py' -exec python {} \;