Karol Kuczmarski's Blog

The brave “new” world of Python 3

Posted on Mon 15 August 2016 in Code • Tagged with Python, Python 3, Unicode, lazy evaluation, iterables • Leave a comment

I’ll blurt it straight up: I’m not a big fan of Python 3.

For a long time, I resisted the appeal of various incremental improvements that early 3.x releases offered. And the world agreed with me: a mere two years ago, Python 3 wasn’t even a blip on the PyPI radar.

Lately, however, things seem to be picking up some steam.

As if to compensate for years of “good enough”, Python 3 development team has given in to a steadily accelerating feature creep. Sure, some of it results in bad ideas (or even ideas you’d hope are jokes), but it nevertheless causes an increasingly wide functional gap between the 2.x and 3.x series.

Starting from around Python 3.5, this gap becomes really noticeable, even when partially bridged with many excellent backports. The ecosystem support is also mostly there, at least insofar as “not breaking horribly when a package is used in Python 3”.

And then, of course, there is the 2.7 EoL date looming ever closer.

Given all those portents, even old curmudg… ahem… seasoned developers cannot really ignore Python 3 anymore. For better or for worse, 3.x is how Python will look like in the coming years and decades. Might as well prepare for it.

In this post, I will discuss some important issues one should be aware of before trying to switch from Python 2 to 3. I won’t be talking about all the minute changes and additions, but cover the more significant, broader concepts that mark the divide between the 2.x and 3.x generations.

The two concepts I’ll be mentioning here are Unicode (obviously) and lazy vs. eager computation.

Unicode handling

You have probably heard it before. Python 3 was going to solve your Unicode problems once and for all. You haven’t believed it, of course, like you wouldn’t believe in any other silver bullet.

Still, it may be rather surprising to learn that in Python 3, you’ll actually see much more Unicode-related errors.

And strange as it may sound, it is a good thing.

In any case, either version of Python gets the most important thing about Unicode right. They both distinguish, at the type level, between strings (of Unicode codepoints) and their encodings (sequences of bytes). The type that holds the latter is called bytes in both versions, while strings are stored within the str type in Python 3 and unicode in Python 2.

It is from this crucial distinction — or rather, failing to account for it — where all the dreaded Unicode errors ultimately stem.

But where Python 2 does poorly is in the choice of defaults. You probably know all too well that bytes there is just an alias for str. That str is a fully functional string type, even though it can only contain ASCII characters. Moreover, it is also the default: quoted string literals, for example, will be of this type unless specially marked.

This poor choice of defaults is the primary source of latent Unicode bugs in Python 2 programs.

What Python 3 does here is to help expose those bugs sooner. If you already deal with Unicode correctly in your programs — maybe because you watched this excellent talk by Ned Batchelder — your main benefit will be not having to write that u"" quotes anymore. Otherwise, it’ll force you to consider the issue from the very beginning, rather than letting you write “working” programs that crash the moment they have to process some non-ASCII input.

Laziness by default

The second major change that Python 3 brings is of similar nature. It is also a change of defaults, but the impetus for it is much less evident.

What’s different in Python 3 is that many built-in functions and methods which used to return lists are now giving out bespoke objects that only mostly behave like lists. Included in these are functions like map or filter, as well as common dictionary methods such as keys or values.

This change is usually presented as removal of unnecessary cruft:

itertools.ifilter is now just filter
xrange is now just range
dict.iteritems is now just dict.items

and so on.

In some cases, this is exactly what happens. For example, there is virtually no downside to the new implementation of range, especially considering the way it is used most often.

But not every built-in managed to preserve all the functionality of lists. Indeed, many have downgraded their API guaratees to those of mere generators, i.e. the most simplistic and limited flavor of Python iterables. Working with them is trickier and more error-prone than with lists, which is due to various pitfalls that generators expose us to.

Navigating around those gotchas used to be something that Python code had to opt-in to, by explicitly importing the itertools module and using its functions in place of the built-ins. What you could gain in return was increased performance, and a lesser memory footprint. All those benefits came from making the computations lazy and refraining from storage of the intermediate results.

In Python 3, however, laziness is preordained. Even if we don’t need or care about the aforementioned perks, we have to devise some way of dealing with the pervasive generators.

One option is to embrace lazy evaluation fully, and adapt to handling unspecified iterables throughout our code bases.
The risk is an increased frequency of bugs stemming from generator misuse — including a common mistake of trying to iterate over lazy foos the second time, deeper down a long function, after it’s been already exhausted.

The alternative is to engage in a lot of “defensive listing”: wrapping of unknown (or known-but-lazy) iterables in list() calls in order to “sanitize” them for later (re)use.
Examples include immediate listification of a generator object:

primes = list(filter(is_prime, range(1000)))

or preemptive conversion of an incoming iterable argument:

def do_something(foos):
    foos = list(foos)
    # ...the rest of a long function...

Even if you choose the first path, and somehow use lazy generators everywhere, conversions are still required at the serialization boundaries:

d = {'foo': 42}
json.dumps({'keys': d.keys()})  # TypeError: dict_keys(['foo']) is not JSON serializable
json.dumps({'keys': list(d.keys())})  # works

At least in this case, the lazy iterable will vocally fail with an exception, rather than silently doing nothing (in case of repeated iteration) or always posing as truthy even when it’s empty (in if iterable: checks).

from future import doubts

So, here they are: the highlights of Python 3. If you are disappointed they all turned out to be mixed blessings, don’t worry: you are in a good company.

The truth is that Python 3 is more finnicky, less forgiving, and much less beginner-friendly than its predecessor. Its various superficial simplifications are almost squarely balanced by many new concerns that are thrust upon an unsuspecting programmer from the very beginning.

In one possible view, this is simply a sign that the language has matured. Perhaps it’s not a coincidence that almost exactly 18 years has passed between the first public version of Python (0.9) and the release of Python 3.0. By no conceivable means it is a toy language anymore, and it’s adequately equipped to tackle challenges presented by the computing world of today.

But on the other hand, it’s clear something is being gradually lost in the process.

It’s becoming harder to claim the language favors simplicity over complexity.
It is no longer so easy to pick which way is the obvious way to do it.
It is increasingly often that ugly replaces beautiful and nested replaces flat.

Little by little, Python itself is becoming less and less pythonic. The pace isn’t breakneck, but it’s definitely noticeable. But who knows? Maybe after two decades, a wholesale redefinition of the language’s core principles really is in order.

…Well, certainly that’s necessary if some of the latest ideas are about to get in!

Unicode handling

Laziness by default

from __future__ import doubts

from future import doubts