Karol Kuczmarski's Blog

Arguments to Python generator functions

Posted on Tue 14 March 2017 in Code • Tagged with Python, generators, functions, arguments, closures • Leave a comment

In Python, a generator function is one that contains a yield statement inside the function body. Although this language construct has many fascinating use cases (PDF), the most common one is creating concise and readable iterators.

A typical case

Consider, for example, this simple function:

def multiples(of):
    """Yields all multiples of given integer."""
    x = of
    while True:
        yield x
        x += of

which creates an (infinite) iterator over all multiples of given integer. A sample of its output looks like this:

>>> from itertools import islice
>>> list(islice(multiples(of=5), 10))
[5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

If you were to replicate in a language such as Java or Rust — neither of which supports an equivalent of yield — you’d end up writing an iterator class. Python also has them, of course:

class Multiples(object):
    """Yields all multiples of given integer."""

    def __init__(self, of):
        self.of = of
        self.current = 0

    def __iter__(self):
        return self

    def next(self):
        self.current += self.of
        return self.current

    __next__ = next  # Python 3

but they are usually not the first choice¹.

It’s also pretty easy to see why: they require explicit bookkeeping of any auxiliary state between iterations. Perhaps it’s not too much to ask for a trivial walk over integers, but it can get quite tricky if we were to iterate over recursive data structures, like trees or graphs. In yield-based generators, this isn’t a problem, because the state is stored within local variables on the coroutine stack.

Lazy!

It’s important to remember, however, that generator functions behave differently than regular functions do, even if the surface appearance often says otherwise.

The difference I wanted to explore in this post becomes apparent when we add some argument checking to the initial example:

def multiples(of):
    """Yields all multiples of given integer."""
    if of < 0:
        raise ValueError("expected a natural number, got %r" % (of,))

    x = of
    while True:
        yield x
        x += of

With that if in place, passing a negative number shall result in an exception. Yet when we attempt to do just that, it will seem as if nothing is happening:

>>> m = multiples(-10)
>>>

And to a certain degree, this is pretty much correct. Simply calling a generator function does comparatively little, and doesn’t actually execute any of its code! Instead, we get back a generator object:

>>> m
<generator object multiples at 0x10f0ceb40>

which is essentially a built-in analogue to the Multiples iterator instance. Commonly, it is said that both generator functions and iterator classes are lazy: they only do work when we asked (i.e. iterated over).

Getting eager

Oftentimes, this is perfectly okay. The laziness of generators is in fact one of their great strengths, which is particularly evident in the immense usefulness of theitertools module.

On the other hand, however, delaying argument checks and similar operations until later may hamper debugging. The classic engineering principle of failing fast applies here very fittingly: any errors should be signaled immediately. In Python, this means raising exceptions as soon as problems are detected.

Fortunately, it is possible to reconcile the benefits of laziness with (more) defensive programming. We can make the generator functions only a little more eager, just enough to verify the correctness of their arguments.

The trick is simple. We shall extract an inner generator function and only call it after we have checked the arguments:

def multiples(of):
    """Yields all multiples of given integer."""
    if of < 0:
        raise ValueError("expected a natural number, got %r" % (of,))

    def multiples():
        x = of
        while True:
            yield x
            x += of

    return multiples()

From the caller’s point of view, nothing has changed in the typical case:

>>> multiples(10)
<generator object multiples at 0x110579190>

but if we try to make an incorrect invocation now, the problem is detected immediately:

>>> multiples(-5)

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    multiples(of=-5)
  File "<pyshell#0>", line 4, in multiples
    raise ValueError("expected a natural number, got %r" % (of,))
ValueError: expected a natural number, got -5

Pretty neat, especially for something that requires only two lines of code!

The last (micro)optimization

Indeed, we didn’t even have to pass the arguments to the inner (generator) function, because they are already captured by the closure.

Unfortunately, this also has a slight performance cost. A captured variable (also known as a cell variable) is stored on the function object itself, so Python has to emit a different bytecode instruction (LOAD_DEREF) that involves an extra pointer dereference. Normally, this is not a problem, but in a tight generator loop it can make a difference.

We can eliminate this extra work² by passing the parameters explicitly:

    # (snip)

    def multiples(of):
        x = of
        while True:
            yield x
            x += of

    return multiples(of)

This turns them into local variables of the inner function, replacing the LOAD_DEREF instructions with (aptly named) LOAD_FAST ones.

Technically, the Multiples class is here is both an iterator (because it has the next/__next__ methods) and iterable (because it has __iter__ method that returns an iterator, which happens to be the same object). This is common feature of iterators that are not associated with any collection, like the ones defined in the built-in itertools module. ↩
Note that if you engage in this kind of microoptimizations, I’d assume you have already changed your global lookup into local ones :) ↩

Simulating exceptions in Rust with IIFE

Posted on Sat 17 December 2016 in Code • Tagged with Rust, IIFE, error handling, exceptions, closures, lambdas • Leave a comment

While many languages use exceptions for handling errors, Rust prefers a slightly different, yet very classical approach: return values.

Now, they aren’t exactly the same thing as in C, where the error is indicated by a special value within the same return type. In Rust, the Result enum can neatly separate the two, in similar vein to how ad-hoc tuples in Go do¹. But unlike Go, Rust also offers additional facilities for error propagation, including the try! macro and the recently stabilized ? operator. And finally, the Result wrappings can be straightforwardly unpacked, possibly by defaulting to a known safe value.

Some conveniences of exceptions may be hard to pass up, though. The try-catch construct is evidently one of them, and Rust might eventually get it in one form or another. Before that happens, however, there is a trick that can often work as an acceptable substitute.

Many `let`s

Here’s an example where it can be very useful.

Have a look at the following function. Its purpose is to retrieve a GitHub login of a user who owns a specific gist — a small sample of code posted to the gists.github.com website².

Let’s assume we have already talked to GitHub API and received the following JSON response from its relevant endpoint:

{
    "id": "12345678",
    "owner": {
        "login": "Octocat",
        ...
    }
    ...
}

Parsing it is easy: we can do it with the rustc_serialize crate, among other options. What proves a little more involved is to dig through the JSON tree in order to reach the interesting value:

use rustc_serialize::json::Json;


/// Retrieve the gist owner from a JSON received from
/// the /gists/$ID endpoint of the GitHub API.
///
/// If the gist is anonymous, "anonymous" is returned.
fn gist_owner_from_info(info: &Json) -> String {
    if let Some(info) = info.as_object() {
        if let Some(owner) = info.get("owner").and_then(|o| o.as_object()) {
            if let Some(result) = owner.get("login").and_then(|l| l.as_string()) {
                return result.to_owned();
            }
        }
    }
    "anonymous".into()
}

Whew! I guess we’re lucky we don’t need to go too deep into that JSON. The code is clearly exhibiting a rightward slant, which some people refer to as the “arrow code”, Unsurprisingly, it is generally considered bad for readability.

There are few other ways of writing this, of course, including a style reminiscent of JavaScript promises — that is, relying completely on the and_then method. Neither seem very satisfying, though, especially if you compare it with something like this:

try:
    return str(info["owner"]["login"])
except (KeyError, TypeError):
    return "anonymous"

Yes, exceptions are quite useful sometimes.

So, how can we get something like this in Rust?

JavaScript for the rescue

Succor comes from an unexpected direction. To emulate exceptions — specifically, the try-catch exception blocks — we can utilize a technique that is most popular in… JavaScript.

At least until recently, JavaScript did not have a block local scope. Since every variable declaration within a function is hoisted to the top of that function, it essentially makes function scope the only usable one (besides global, of course).

As a result, a variety of JavaScript idioms rely on introducing “superfluous” functions, solely for the purpose of creating a nested scope. Many times, those functions are neither named nor stored in any variable; rather, they are immediately invoked.

This is what is commonly understood as Immediately Invoked Function Expression, or IIFE for short.

An oft-cited example involves an IIFE which itself returns another function:

for (var i = 0; i < 10; ++i) {
    var $para = $("p#" + i);  // <p id="0">, <p id="1">, etc.
    var clickHandler = (function(i) {  // IIFE!
        return function() {
            alert("Clicked element no. " + (i + 1));
        };
    })(i);
    $para.on('click', clickHandler);
}

The function expression is necessary here, because it allows to control what exactly goes into the closure of the inner function. If the clickHandlers were assigned the function() { alert(...) } expression directly, they would all close over the same loop counter variable. All would then display the exact same message.

We don’t need to employ those workarounds in Rust. Thanks to local scoping, a simple pair of { braces } would work exactly the same. You can imagine a direct rewrite of the above example, though, where an anonymous closure is used to similar effect:

// WARNING: Not idiomatic! (Also not a real DOM library).

for i in (0..10) {
    let para = dom.find_element_by_id("p", i.to_string()).unwrap();
    let click_handler = |i| {
        move |_: Event| { dom.exec_js(&format!(
            "alert('Clicked element no. #{}');", i + 1)); }
    }(i);
    para.add_event_listener(Event::Click, click_handler)
}

In other words, Rust supports IIFEs just fine.

Just put a function on it

Okay, this is quite amusing and probably pretty neat. But does it help us with the error handling story exactly?…

Let’s take another stab at rewriting the gist_owner_from_info routine. This time, we’ll extract the meaty part into a separate function. We will also take advantage of one trivial, but very useful try_opt crate which is essentially an equivalent of the try! macro for Options:

#[macro_use] extern crate try_opt;

fn gist_owner_from_info(info: &Json) -> String {
    gist_owner_from_info_internal(info).unwrap_or("anonymous".into())
}

fn gist_owner_from_info_internal(info: &Json) -> Option<String> {
    let info = try_opt!(info.as_object());
    let owner = try_opt!(info.get("owner").and_then(|o| o.as_object()));
    let login = try_opt!(owner.get("login").and_then(|l| l.as_string()));
    Some(login.to_owned())
}

Now this should be a little easier on the eyes. (And if you want, you can eschew and_then completely in favor of more try_opt!).

The downside is that we now have this _internal function that’s awkwardly sticking out. We could pull it in, and turn it into an inner function, but why stop half-way? Let’s just make it an IIFE already:

fn gist_owner_from_info(info: &Json) -> String {
    || -> Option<String> {
        let info = try_opt!(info.as_object());
        let owner = try_opt!(info.get("owner").and_then(|o| o.as_object()));
        let login = try_opt!(owner.get("login").and_then(|l| l.as_string()));
        Some(login.to_owned())
    }().unwrap_or("anonymous".into())
}

Not bad, eh? The analogies with exception handling should be pretty evident, too³:

The closure itself works as a try block, with closure’s body containing the “guarded” code.
The unwrap family of methods (especially unwrap_or_else) dubs for a catch/except section.

Sure, we do need try! (or try_opt!) macros to mark instructions that may “throw an exception”, but with the ?-based syntax it shouldn’t be too big of a deal. And when the time comes, this code will be very easy to port to a trait-based exception handling solution that’s currently in the works.

Oh, and the best part? Both Rust and the underlying LLVM are very adept at inlining closures, so everything here should compile to optimal code.

Bonus: a lifetime conundrum

Well, almost optimal. There is one more thing left to do before we can call this a truly zero-cost abstraction.

We need to stop allocating so damn much!

It should be pretty obvious that the function doesn’t need to create a brand new String every time it’s called. The text is in the input Json, and we take that Json by reference already. It’s only fair we stop creating Strings and simply return a &str reference instead.

In fact, this should be as easy as removing the to_owned/into calls, right?

fn gist_owner_from_info(info: &Json) -> &str {
    || -> Option<&str> {
        let info = try_opt!(info.as_object());
        let owner = try_opt!(info.get("owner").and_then(|o| o.as_object()));
        owner.get("login").and_then(|l| l.as_string()))
    }().unwrap_or("anonymous")
}

Wrong, apparently. If you present this code to the compiler, it will serve you quite a mouthful of an error, including helpful tidbits in the vein of “expected A, found A”:

error[E0495]: cannot infer an appropriate lifetime for autoref due to conflicting requirements
   --> src/github.rs:3:34
    |
  3 |         let info = try_opt!(info.as_object());
    |                                  ^^^^^^^^^
    |
note: first, the lifetime cannot outlive the anonymous lifetime #1 defined on the block at 1:45...
   --> src/github.rs:1:46
    |
  1 | fn gist_owner_from_info(info: &Json) -> &str {
    |                                              ^
note: ...so that reference does not outlive borrowed content
   --> src/github.rs:3:29
    |
  3 |         let info = try_opt!(info.as_object());
    |                             ^^^^
note: but, the lifetime must be valid for the anonymous lifetime #1 defined on the block at 2:23...
   --> src/github.rs:2:24
    |
  2 |     || -> Option<&str> {
    |                        ^
note: ...so that expression is assignable (expected std::option::Option<&str>, found std::option::Option<&str>)
   --> src/github.rs:5:9
    |
  5 |         owner.get("login").and_then(|l| l.as_string())
    |

The crux of this verbiage is that the Rust compiler is unable to reconcile the lifetime of the closure’s return value, the input, and final result of the function.

It shouldn’t really be trying very hard, though, for the lifetime is obvious. It’s the same as the one implicitly attached to the input &Json. Seems like in this case, we need to be a little more helpful and label it explicitly:

fn gist_owner_from_info<'i>(info: &'i Json) -> &'i str {
    || -> Option<&'i str> {
// (rest as before)

Voila, this should now compile without any issues.

Once again, “Keep calm and add more 'lifetimes” proves to be an effective approach ;)

Technically, they aren’t called tuples there but “multiple return values“. ↩
This is something I needed to do when rewriting this Python project of mine to Rust. ↩
This is also the closest Rust can currently get to a do notation from Haskell, at least without any macro-based hacks. ↩