Karol Kuczmarski's Blog

Taking string arguments in Rust

Posted on Tue 24 December 2019 in Code • Tagged with Rust, strings, arguments, borrowing, ownership • Leave a comment

Strings of text seem to always be a complicated topic when it comes to programming. This counts double for low-level languages which expose the programmer to the full complexity of memory management and allocation.

Rust is, obviously, one of those languages. Strings in Rust are therefore represented using two distinct types: str (the string slice) and String (the owned/allocated string). Learning how to juggle those types is something you need to do very early if you want to be productive in the language.

But even after you’ve programmed in Rust for some time, you may still trip on some more subtle issues with string handling. In this post, I will concentrate on just one common task: writing a function that takes a string argument. We’ll see that even there, we can encounter a fair number of gotchas.

Just reading it

Let’s start with a simple case: a function which merely reads its string argument:

fn hello(name: &str) {
    println!("Hello, {}!", name);
}

As you’re probably well aware, using str rather than String is the idiomatic approach here. Because a &str reference is essentially an address + length, it can point to any string wheresoever: a 'static literal, a heap-allocated String, or any portion or substring thereof:

hello("world");
hello(&String::from("Alice"));
hello(&"Dennis Ritchie"[0..6]);

Contrast this with an argument of type &String:

fn hello(name: &String) {
    println!("Hello, {}!", name);
}

which mandates an actual, full-blown String object:

hello(&String::from("Bob"));
// (the other examples won't work)

There are virtually no circumstances when you would want to do this, as it potentially forces the caller to needlessly put the string on the heap. Even if you anticipate all function calls to involve actual String objects, the automatic Deref coercion from &String to &str should still allow you to use the more universal, str-based API.

Hiding the reference

If rustc can successfully turn a &String into &str, then perhaps it should also be possible to simply use String when that’s more convenient?

In general, this kind of “reverse Deref” doesn’t happen in Rust outside of method calls with &self. It seems, however, that it would sometimes be desirable; one reasonable use case involves chains of iterator adapters, most importantly map and for_each:

let strings: Vec<String> = vec!["Alice".into(), "Bob".into()];
strings.into_iter().for_each(hello);

Since the compiler doesn’t take advantage ofDeref coercions when inferring closure types, their argument types have to match exactly. As a result, we often need explicit |x| foo(x) closures which suffer from poorer readability in long Iterator or Stream-based expressions.

We can make the above code work — and also retain the ability to make calls like hello("Charlie"); — by using one of the built-in traits that generalize over the borrowing relationships. The one that works best for accepting string arguments is called AsRef¹:

fn hello<N: AsRef<str>>(name: N) {
    println!("Hello, {}!", name.as_ref());
}

Its sole method, AsRef::as_ref, returns a reference to the trait’s type parameter. In the case above, that reference will obviously be of type &str, which circles back to our initial example, one with a direct &str argument.

The difference is, however, that AsRef<str> is implemented for all interesting string types — both in their owned and borrowed versions. This obviates the need for Deref coercions and makes the API more convenient.

Own it

Things get a little more complicated when the string parameter is needed for more than just reading. For storage and potential mutation, a &str reference is not enough: you need an actual, full-blown String object.

Now, you may think this is not a huge obstacle. After all, it’s pretty easy to “turn” &str into a String:

struct Greetings {
    Vec<String> names,
}

impl Greetings {
    // Don't do this!
    pub fn hello(&mut self, name: &str) {
        self.names.push(name.clone());
    }
}

But I strongly advise against this practice, at least in public APIs. If you expose such function to your users, you are essentially tricking them into thinking their input will only ever be read, not copied, which has implications on both performance and memory usage.

Instead, if you need to take ownership of the resulting String, it is much better to indicate this in the function signature directly:

pub fn hello(&mut self, name: String) {
    self.names.push(name);
}

This shifts the burden on creating the String onto the caller, but that’s not necessarily a bad thing. On their side, the added boilerplate can pretty minimal:

let mut greetings = Greetings::new();
grettings.hello(String::from("Dylan"));  // uhm...
greetings.hello("Eva".to_string());      // somewhat better...
grettings.hello("Frank".to_owned());     // not too bad
greetings.hello("Gene".into());          // good enough

while clearly indicating where does the memory allocation happen.

It is also idiomatically used for functions taking Path parameters, i.e. AsRef<Path>. ↩

Arguments to Python generator functions

Posted on Tue 14 March 2017 in Code • Tagged with Python, generators, functions, arguments, closures • Leave a comment

In Python, a generator function is one that contains a yield statement inside the function body. Although this language construct has many fascinating use cases (PDF), the most common one is creating concise and readable iterators.

A typical case

Consider, for example, this simple function:

def multiples(of):
    """Yields all multiples of given integer."""
    x = of
    while True:
        yield x
        x += of

which creates an (infinite) iterator over all multiples of given integer. A sample of its output looks like this:

>>> from itertools import islice
>>> list(islice(multiples(of=5), 10))
[5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

If you were to replicate in a language such as Java or Rust — neither of which supports an equivalent of yield — you’d end up writing an iterator class. Python also has them, of course:

class Multiples(object):
    """Yields all multiples of given integer."""

    def __init__(self, of):
        self.of = of
        self.current = 0

    def __iter__(self):
        return self

    def next(self):
        self.current += self.of
        return self.current

    __next__ = next  # Python 3

but they are usually not the first choice¹.

It’s also pretty easy to see why: they require explicit bookkeeping of any auxiliary state between iterations. Perhaps it’s not too much to ask for a trivial walk over integers, but it can get quite tricky if we were to iterate over recursive data structures, like trees or graphs. In yield-based generators, this isn’t a problem, because the state is stored within local variables on the coroutine stack.

Lazy!

It’s important to remember, however, that generator functions behave differently than regular functions do, even if the surface appearance often says otherwise.

The difference I wanted to explore in this post becomes apparent when we add some argument checking to the initial example:

def multiples(of):
    """Yields all multiples of given integer."""
    if of < 0:
        raise ValueError("expected a natural number, got %r" % (of,))

    x = of
    while True:
        yield x
        x += of

With that if in place, passing a negative number shall result in an exception. Yet when we attempt to do just that, it will seem as if nothing is happening:

>>> m = multiples(-10)
>>>

And to a certain degree, this is pretty much correct. Simply calling a generator function does comparatively little, and doesn’t actually execute any of its code! Instead, we get back a generator object:

>>> m
<generator object multiples at 0x10f0ceb40>

which is essentially a built-in analogue to the Multiples iterator instance. Commonly, it is said that both generator functions and iterator classes are lazy: they only do work when we asked (i.e. iterated over).

Getting eager

Oftentimes, this is perfectly okay. The laziness of generators is in fact one of their great strengths, which is particularly evident in the immense usefulness of theitertools module.

On the other hand, however, delaying argument checks and similar operations until later may hamper debugging. The classic engineering principle of failing fast applies here very fittingly: any errors should be signaled immediately. In Python, this means raising exceptions as soon as problems are detected.

Fortunately, it is possible to reconcile the benefits of laziness with (more) defensive programming. We can make the generator functions only a little more eager, just enough to verify the correctness of their arguments.

The trick is simple. We shall extract an inner generator function and only call it after we have checked the arguments:

def multiples(of):
    """Yields all multiples of given integer."""
    if of < 0:
        raise ValueError("expected a natural number, got %r" % (of,))

    def multiples():
        x = of
        while True:
            yield x
            x += of

    return multiples()

From the caller’s point of view, nothing has changed in the typical case:

>>> multiples(10)
<generator object multiples at 0x110579190>

but if we try to make an incorrect invocation now, the problem is detected immediately:

>>> multiples(-5)

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    multiples(of=-5)
  File "<pyshell#0>", line 4, in multiples
    raise ValueError("expected a natural number, got %r" % (of,))
ValueError: expected a natural number, got -5

Pretty neat, especially for something that requires only two lines of code!

The last (micro)optimization

Indeed, we didn’t even have to pass the arguments to the inner (generator) function, because they are already captured by the closure.

Unfortunately, this also has a slight performance cost. A captured variable (also known as a cell variable) is stored on the function object itself, so Python has to emit a different bytecode instruction (LOAD_DEREF) that involves an extra pointer dereference. Normally, this is not a problem, but in a tight generator loop it can make a difference.

We can eliminate this extra work² by passing the parameters explicitly:

    # (snip)

    def multiples(of):
        x = of
        while True:
            yield x
            x += of

    return multiples(of)

This turns them into local variables of the inner function, replacing the LOAD_DEREF instructions with (aptly named) LOAD_FAST ones.

Technically, the Multiples class is here is both an iterator (because it has the next/__next__ methods) and iterable (because it has __iter__ method that returns an iterator, which happens to be the same object). This is common feature of iterators that are not associated with any collection, like the ones defined in the built-in itertools module. ↩
Note that if you engage in this kind of microoptimizations, I’d assume you have already changed your global lookup into local ones :) ↩

Optional arguments in Rust 1.12

Posted on Thu 29 September 2016 in Code • Tagged with Rust, arguments, parameters, functions • Leave a comment

Today’s announcement of Rust 1.12 contains, among other things, this innocous little tidbit:

Option implements From for its contained type

If you’re not very familiar with it, From is a basic converstion trait which any Rust type can implement. By doing so, it defines how to create its values from some other type — hence its name.

Perhaps the most widespread application of this trait (and its from method) is allocating owned String objects from literal str values:

let hello = String::from("Hello, world!");

What the change above means is that we can do similar thing with the Option type:

let maybe_int = Option::from(42);

At a first glance, this doesn’t look like a big deal at all. For one, this syntax is much more wordy than the traditional Some(42), so it’s not very clear what benefits it offers.

But this first impression is rather deceptive. In many cases, this change can actually reduce the number of times we have to type Some(x), allowing us to replace it with just x. That’s because this new impl brings Rust quite a bit closer to having optional function arguments as a first class feature in the language.

Until now, a function defined like this:

fn maybe_plus_5(x: Option<i32>) -> i32 {
    x.unwrap_or(0) + 5
}

was the closest Rust had to default argument values. While this works perfectly — and is bolstered by compile-time checks! — callers are unfortunately required to build the Option objects manually:

let _ = maybe_plus_5(Some(42));  // OK
let _ = maybe_plus_5(None);      // OK
let _ = maybe_plus_5(42);        // error!

After Option<T> implements From<T>, however, this can change for the better. Much better, in fact, for the last line above can be made valid. All that is necessary is to take advantage of this new impl in the function definition:

fn maybe_plus_5<T>(x: T) -> i32 where Option<i32>: From<T> {
    Option::from(x).unwrap_or(0) + 5
}

Unfortunately, this results in quite a bit of complexity, up to and including the where clause: a telltale sign of convoluted, generic code. Still, this trade-off may be well worth it, as a function defined once can be called many times throughout the code base, and possibly across multiple crates if it’s a part of the public API.

But we can do better than this. Indeed, using the From trait to constrain argument types is just complicating things for no good reason. What we should so instead is use the symmetrical trait, Into, and take advantage of its standard impl:

impl<T, U> Into<U> for T where U: From<T>

Once we translate it to the Option case (now that Option<T> implements From<T>), we can switch the trait bounds around and get rid of the where clause completely:

fn maybe_plus_5<T: Into<Option<i32>>>(x: T) -> i32 {
    x.into().unwrap_or(0) + 5
}

As a small bonus, the function body has also gotten a little simpler.

So, should you go wild and change all your functions taking Optionals to look like this?… Well, technically you can, although the benefits may not outweigh the downsides for small, private functions that are called infrequently.

On the other hand, if you can afford to only support Rust 1.12 and up, this technique can make it much more pleasant to use the external API of your crates.

What’s best is the full backward compatibility with any callers that still pass Some(x): for them, the old syntax will continue to work exactly like before. Also note that the Rust compiler is smart about eliding the no-op conversion calls like the Into::into above, so you shouldn’t observe any changes in the performance department either.

And who knows, maybe at some point Rust makes the final leap, and allows skipping the Nones?…