…or lambda?

Posted on Mon 20 June 2016 in Code

a.k.a. Curious Facts about Python Syntax

In Python 3.3, a new method has been added to the str type: casefold. Its purpose is to return a “sanitized” version of the string that’s suitable for case-insensitive comparison. For older versions of Python, an alternative way that’s mostly compatible is to use the str.lower method, which simply changes all letters in the string to lowercase.

Syntax is hard

Easy enough for a compatibility shim, right? That’s exactly what I thought when I came up with this:

casefold = getattr(str, 'casefold', None) or lambda s: s.lower()

Let’s ignore for a moment the fact that for a correct handling of unicode objects in Python 2, a much more sophisticated approach is necessary. What’s rather more pertinent is that this simple code doesn’t parse:

  File "foo.py", line 42
    getattr(str, 'casefold', None) or lambda s: s.lower()
                                      ^
SyntaxError: invalid syntax

It’s not very often that you would produce a SyntaxError with code that looks perfectly valid to most pythonistas. The last time I had it happen, the explanation was rather surprising and not exactly trivial to come by.

Fortunately, there is always one place where we can definitively resolve any syntactic confusion. That place is the full grammar specification of the Python language.

It may be a little intimidating at first, especially if you’re not familiar with the ENBF notation it uses. All the Python’s language constructs are there, though, so the SyntaxError from above should be traceable to a some rule of the grammar¹.

The culprit

And indeed, the offending bit is right here:

or_test: and_test ('or' and_test)*
and_test: not_test ('and' not_test)*
...

It says, essentially, that Python defines the or expression (or_test) as a sequence of and expressions (and_test). If you follow the syntax definition further, however, you will notice that and_test expands to comparisons (a < b, etc.), arithmetic expressions (x + y, etc.), list & dict constructors ([foo, bar], etc.), and finally to atoms such as literal strings and numbers.

What you won’t see along the way are lambda definitions:

lambdef: 'lambda' [varargslist] ':' test

In fact, the branch to allow them is directly above the or_test:

test: or_test ['if' or_test 'else' test] | lambdef

As you can see, the rule puts lambdas at the same syntactical level as conditional expressions (x if a else b), which is very high up. The only thing you can do with a lambda to make a larger expression is to add a yield keyword before it², or follow it with a comma to create a tuple³.

You cannot, however, pass it as an argument to a binary operator, even if it otherwise makes sense and even looks unambiguous. This is also why the nonsensical expressions such as this one:

1 + lambda: None

will fail not with TypeError, but also with SyntaxError, as they won’t even be evaluated.

More parentheses

Savvy readers may have noticed that this phenomenon is very much reminiscent of the issue of operator precedence.

Indeed, in Python and in many other languages it is the grammar that ultimately specifies the order of operations. It does so simply by defining how expressions can be constructed.

Addition, for example, will be of lower priority than multiplication simply because a sum is said to comprise of terms that are products:

arith_expr: term (('+'|'-') term)*
term: factor (('*'|'/'|'%'|'//') factor)*

This makes operator precedence a syntactic feature, and its resolution is baked into the language parser and handled implicitly⁴.

We know, however, that precedence can be worked around where necessary by enclosing the operator and its arguments in a pair of parenthesis. On the syntax level, this means creating an entirely new, top-level expression:

atom: '(' [yield_expr|testlist_comp] ')' |  # parenthesized expression
       '[' [listmaker] ']' |
       '{' [dictorsetmaker] '}' |
       '`' testlist1 '`' |
       NAME | NUMBER | STRING+)

There, it is again possible to use even the highest-level constructs, including also the silly stuff such as trying to add a number to a function:

1 + (lambda: None)

This expression will now parse correctly, and produce TypeError as expected.

In the end, the resolution of our initial dilemma is therefore rather simple:

casefold = getattr(str, 'casefold', None) or (lambda s: s.lower())

Such rules are sometimes called productions of the grammar, a term from computational linguistics. ↩
Yes, yield foo is an expression. Its result is the value sent to the generator by outer code via the send method. Since most generators are used as iterables, typically no values are passed this way so the result of a yield expression is None. ↩
There are also a legacy corner cases of lambdas in list/dict/etc. comprehensions, but those only apply under Python 2.x. ↩
This saying, there are languages where the order is resolved at later stage, after the expressions have already been parsed. They usually allow the programmer to change the precedence of their own operators, as it’s the case in Haskell. ↩