Karol Kuczmarski's Blog

Taking string arguments in Rust

Posted on Tue 24 December 2019 in Code • Tagged with Rust, strings, arguments, borrowing, ownership • Leave a comment

Strings of text seem to always be a complicated topic when it comes to programming. This counts double for low-level languages which expose the programmer to the full complexity of memory management and allocation.

Rust is, obviously, one of those languages. Strings in Rust are therefore represented using two distinct types: str (the string slice) and String (the owned/allocated string). Learning how to juggle those types is something you need to do very early if you want to be productive in the language.

But even after you’ve programmed in Rust for some time, you may still trip on some more subtle issues with string handling. In this post, I will concentrate on just one common task: writing a function that takes a string argument. We’ll see that even there, we can encounter a fair number of gotchas.

Just reading it

Let’s start with a simple case: a function which merely reads its string argument:

fn hello(name: &str) {
    println!("Hello, {}!", name);
}

As you’re probably well aware, using str rather than String is the idiomatic approach here. Because a &str reference is essentially an address + length, it can point to any string wheresoever: a 'static literal, a heap-allocated String, or any portion or substring thereof:

hello("world");
hello(&String::from("Alice"));
hello(&"Dennis Ritchie"[0..6]);

Contrast this with an argument of type &String:

fn hello(name: &String) {
    println!("Hello, {}!", name);
}

which mandates an actual, full-blown String object:

hello(&String::from("Bob"));
// (the other examples won't work)

There are virtually no circumstances when you would want to do this, as it potentially forces the caller to needlessly put the string on the heap. Even if you anticipate all function calls to involve actual String objects, the automatic Deref coercion from &String to &str should still allow you to use the more universal, str-based API.

Hiding the reference

If rustc can successfully turn a &String into &str, then perhaps it should also be possible to simply use String when that’s more convenient?

In general, this kind of “reverse Deref” doesn’t happen in Rust outside of method calls with &self. It seems, however, that it would sometimes be desirable; one reasonable use case involves chains of iterator adapters, most importantly map and for_each:

let strings: Vec<String> = vec!["Alice".into(), "Bob".into()];
strings.into_iter().for_each(hello);

Since the compiler doesn’t take advantage ofDeref coercions when inferring closure types, their argument types have to match exactly. As a result, we often need explicit |x| foo(x) closures which suffer from poorer readability in long Iterator or Stream-based expressions.

We can make the above code work — and also retain the ability to make calls like hello("Charlie"); — by using one of the built-in traits that generalize over the borrowing relationships. The one that works best for accepting string arguments is called AsRef¹:

fn hello<N: AsRef<str>>(name: N) {
    println!("Hello, {}!", name.as_ref());
}

Its sole method, AsRef::as_ref, returns a reference to the trait’s type parameter. In the case above, that reference will obviously be of type &str, which circles back to our initial example, one with a direct &str argument.

The difference is, however, that AsRef<str> is implemented for all interesting string types — both in their owned and borrowed versions. This obviates the need for Deref coercions and makes the API more convenient.

Own it

Things get a little more complicated when the string parameter is needed for more than just reading. For storage and potential mutation, a &str reference is not enough: you need an actual, full-blown String object.

Now, you may think this is not a huge obstacle. After all, it’s pretty easy to “turn” &str into a String:

struct Greetings {
    Vec<String> names,
}

impl Greetings {
    // Don't do this!
    pub fn hello(&mut self, name: &str) {
        self.names.push(name.clone());
    }
}

But I strongly advise against this practice, at least in public APIs. If you expose such function to your users, you are essentially tricking them into thinking their input will only ever be read, not copied, which has implications on both performance and memory usage.

Instead, if you need to take ownership of the resulting String, it is much better to indicate this in the function signature directly:

pub fn hello(&mut self, name: String) {
    self.names.push(name);
}

This shifts the burden on creating the String onto the caller, but that’s not necessarily a bad thing. On their side, the added boilerplate can pretty minimal:

let mut greetings = Greetings::new();
grettings.hello(String::from("Dylan"));  // uhm...
greetings.hello("Eva".to_string());      // somewhat better...
grettings.hello("Frank".to_owned());     // not too bad
greetings.hello("Gene".into());          // good enough

while clearly indicating where does the memory allocation happen.

It is also idiomatically used for functions taking Path parameters, i.e. AsRef<Path>. ↩

Add examples to your Rust libraries

Posted on Wed 28 February 2018 in Code • Tagged with Rust, Cargo, examples, documentation, packaging • Leave a comment

When you’re writing a library for other programs to depend on, it is paramount to think how the developers are going to use it in their code.

The best way to ensure they have a pleasant experience is to put yourself in their shoes. Forget the internal details of your package, and consider only its outward interface. Then, come up with a realistic use case and just implement it.

In other words, you should create complete, end-to-end, and (somewhat) usable example applications.

Examples are trouble

You may think this is asking a lot, and I wouldn’t really disagree here.

In most languages and programming platforms, it is indeed quite cumbersome to create example apps. This happens for at least several different reasons:

It typically requires bootstrapping an entire project from scratch. If you are lucky, you will have something like create-react-app to get you going relatively quickly. Still, you need to wire up the new project so that it depends on the source code of your library rather than its published version, and this tends to be a non-standard option — if it is available at all.
It’s unclear where should the example code live. Should you just throw it away, once it has served its immediate purpose? I’m sure this would discourage many people from creating examples in the first place. It’s certainly better to keep them in the version control, allowing their code to serve as additional documentation.

But if you intend to do this, you need to be careful not to deploy the example along with your library when you upload it to the package registry for your language. This may require maintaining an explicit blacklist and/or whitelist, in the vein of MANIFEST files in Python.
Examples may break as the library changes. Although example apps aren’t integration tests that have a clear, expected outcome, they should at the very least compile correctly.

The only way to ensure that is to include them in the build/test pipeline of your library. To accomplish this, however, you may need to complicate your CI setup, perhaps by introducing additional languages like Bash or Python.
It’s harder to maintain quality of example code. Any linters and static analyzers that you’re normally running will likely need to be configured to also apply to the examples. On the other hand, however, you probably don’t want those checkers to be too strict (it’s just example code, after all), so you may want to turn off some of the warnings, adjust the level of others, and so on.

So essentially, writing examples involves quite a lot of hassle. It would be great if the default tooling of your language helped to lessen the burden at least a little bit.

Well, good news! If you’re a Rust programmer, the language has basically got you covered.

Cargo — the standard build tool and package manager for Rust — has some dedicated features to support examples as a first-class concept. While it doesn’t completely address all the pain points outlined above, it goes a long way towards minimizing them.

What are Cargo examples?

In Cargo’s parlance, an example is nothing else but a Rust source code of a standalone executable¹ that typically resides in a single .rs file. All such files should be places in the examples/ directory, at the same level as src/ and the Cargo.toml manifest itself².

Here’s the simplest example of, ahem, an example:

// examples/hello.rs
fn main() {
    println!("Hello from an example!");
}

You can run it through the typical cargo run command; simply pass the example name after the --example flag:

$ cargo run --example hello
Hello from an example!

It is also possible to run the example with some additional arguments:

$ cargo run --example hello2 -- Alice
Hello, Alice!

which are relayed directly to the underlying binary:

// examples/hello2.rs
use std::env;

fn main() {
    let name = env::args().skip(1).next();
    println!("Hello, {}!", name.unwrap_or("world".into()));
}

As you can see, the way we run examples is very similar to how we’d run the src/bin binaries, which some people use as normal entry points to their Rust programs.

The important thing is that you don’t have to worry what to do with your example code anymore. All you need to do is drop it in the examples/ directory, and let Cargo do the rest.

Dependency included

Of course in reality, your examples will be at least a little more complicated than that. For one, they will surely call into your library to use its API, which means they need to depend on it & import its symbols.

Fortunately, this doesn’t complicate things even one bit.

The library crate itself is already an implied dependency of any code inside the examples/ directory. This is automatically handled by Cargo, so you don’t have to modify Cargo.toml (or do anything else really) to make it happen.

So without any additional effort, you can just to link to your library crate in the usual manner, i.e. by putting extern crate on top of the Rust file:

// examples/real.rs
extern crate mylib;

fn main() {
    let thing = mylib::make_a_thing();
    println!("I made a thing: {:?}", thing);
}

This goes even further, and extends to any dependency of the library itself. All such third-party crates are automatically available to the example code, which proves handy in common cases such as Tokio-based asynchronous APIs:

// example/async.rs
extern crate mylib;
extern crate tokio_core;  // assuming it's in mylib's [dependencies]

fn main() {
    let mut core = tokio_core::reactor::Core::new().unwrap();
    let thing = core.run(mylib::make_a_thing_asynchronously()).unwrap();
    println!("I made a thing: {:?}", thing);
}

More deps

Sometimes, however, it is very useful to pull in an additional package or two, just for the example code.

A typical case may involve logging.

If your library uses the usual log crate to output debug messages, you probably want to see them printed out when you run your examples. Since the log crate is just a facade, it doesn’t offer any built-in way to pipe log messages to standard output. To handle this part, you need something like the env_logger package:

// example/with_logging.rs
extern crate env_logger;
extern crate mylib;

fn main() {
    env_logger::init();
    println("{:?}", mylib::make_a_thing());
}

To be able to import env_logger like this, it natually has to be declared as a dependency in our Cargo.toml.

We won’t put it in the [dependencies] section of the manifest, however, as it’s not needed by the library code. Instead, we should place it in a separate section called [dev-dependencies]:

[dev-dependencies]
env_logger = "0.5"

Packages listed there are shared by tests, benchmarks, and — yes, examples. They are not, however, linked into regular builds of your library, so you don’t have to worry about bloating it with unnecessary code.

Growing bigger

So far, we have seen examples that span just a single Rust file. Practical applications tend to be bigger than that, so it’d be nice if we could provide some multi-file examples as well.

This is easily done, although for some reason it doesn’t seem to be mentioned in the official docs.

In any case, the approach is identical to executables inside src/bin/. Basically, if we have a single foo.rs file with executable code, we can expand it to a foo/ subdirectory with foo/main.rs as the entry point. Then, we can add whatever other submodules we want — just like we would do for a regular Rust binary crate:

// examples/multifile/main.rs
extern crate env_logger;
extern crate mylib;

mod util;

fn main() {
    env_logger::init();
    let ingredient = util::create_ingredient();
    let thing = mylib::make_a_thing_with(ingredient);
    println("{:?}", thing);
}

// examples/multifile/util.rs

pub fn create_ingredient() -> u64 {
    42
}

Of course, it won’t be often that examples this large are necessary. Showing how a library can scale to bigger applications can, however, be very encouraging to potential users.

Maintaining maintainability

Thus far, we have discussed how to create small and larger examples, how to use additional third-party crates in example programs, and how to easily build & run them using built-in Cargo commands.

All this effort spent on writing examples would be of little use if we couldn’t ensure that they work.

Like every type of code, examples are prone to breakage whenever the underlying API changes. If the library is actively developed, its interface represents a moving target. It is quite expected that changes may sometimes cause old examples to stop compiling.

Thankfully, Cargo is very dilligent in reporting such breakages. Whenever you run:

$ cargo test

all examples are built simultaneously with the execution of your regular test suite³. You get the compilation guarantee for your examples essentially for free — there is no need to even edit your .travis.yml, or to adjust your continuous integration setup in any other way!

Pretty neat, right?

This saying, you should keep in mind that simply compiling your examples on a regular basis is not a foolproof guarantee that their code never becomes outdated. Examples are not integration tests, and they won’t catch important changes in your implementation that aren’t breaking the interface.

Examples-Driven Development?

You may be wondering then, what’s exactly the point of writing examples? If you got tests on one hand to verify correctness, and documentation on the other hand to inform your users, then having a bunch of dedicated executable examples may seem superfluous.

To me, however, an impeccable test suite and amazing docs — which also remain comprehensive and awesome for an entire lifetime of the library! — sound a bit too much like a perfect world :) Adding examples to the mix can almost always improve things, and their maintenance burden should, in most cases, be very minimal.

But I have also found out that starting off with examples early on is a great way to validate the interface design.

Once the friction of creating small test programs has been eliminated, they become indispensable for prototyping new features. Wanna try out that new thing you’ve just added? Simple: just make a quick example for it, run it, and see what happens!

In many ways, doing this feels similar to trying out things in a REPL — something that’s almost exclusive to dynamic/interpreted languages. But unlike mucking around in Python shell, examples are not throwaway code: they become part of your project, and remain useful for both you & your users.

It is also possible to create examples which are themselves just libraries. I don’t think this is particularly useful, though, since all you can do with such examples is build them, so they don’t provide any additional value over normal tests (and especially doc tests). ↩
Because they are outside of the src/ directory, examples do not become a part of your library’s code, and are not deployed to crates.io. ↩
You can also run cargo build --examples to only compile the examples, without running any kind of tests. ↩

Unfolding a Stream of paginated items

Posted on Wed 24 January 2018 in Code • Tagged with Rust, Tokio, streams, HTTP • Leave a comment

My most recent Rust crate is an API client for the Path of Exile’s public stash tabs. One problem that I had to solve while writing it was to turn a sequence of paginated items (in this case, player stash tabs) into a single, asynchronous Stream.

In this post, I’ll explain how to use the Stream interface, along with functions from the futures crate, to create a single Stream from multiple batches of entities.

Pagination 101

To divide a long list of items into pages is a very common pattern in many HTTP-based APIs.

If the client requests a sequence of entities that would be too large to serve as a single response, there has to be some way to split it over multiple HTTP roundtrips. To accomplish that, API servers will often return a constant number of items at first (like 50), followed by some form of continuation token:

$ curl http://api.example.com/items
{
    "items": [
        {...},
        {...},
        {...}
    ],
    "continuationToken": "e53c68db0ee412ac239173db147a02a0"
}

Such token is preferably an opaque sequence of bytes, though sometimes it can be an explicit offset (index) into the list of results¹. Regardless of its exact nature, clients need to pass the token with their next request in order to obtain another batch of results:

$ curl 'http://api.example.com/items?after=e53c68db0ee412ac239173db147a02a0'
{
    "items": [
        {...},
        {...}
    ],
    "continuationToken": "4e3986e4c7f591b8cb17cf14addd40a6"
}

Repeat this procedure for as long as the response contains a continuation token, and you will eventually go through the entire sequence. If it’s really, really long (e.g. it’s a Twitter firehose for a popular hashtag), then you may of course hit some problems due to the sheer number of requests. For many datasets, however, this pagination scheme is absolutely sufficient while remaining relatively simple for clients to implement.

Stream it in Rust

What the client code would typically do, however, is to hide the pagination details completely and present only the final, unified sequence of items. Such abstraction is useful even for end-user applications, but it’s definitely expected from any shared library that wraps the third-party API.

Depending on your programming language of choice, this abstraction layer may be very simple to implement. Here’s how it could be done in Python, whose concepts of iterables and generators are a perfect fit for this task²:

import requests

def iter_items(after=None):
    """Yield items from an example API.
    :param after: Optional continuation token
    """
    while True:
        url = "http://api.example.com/items"
        if after is not None:
            url += "?after=%s" % after
        response = requests.get(url)
        response.raise_for_status()
        for item in response.json()['items']:
            yield item
        after = response.json().get("continuationToken")
        if after is None:
            break

# consumer
for item in iter_items():
    print(item)

In Rust, you can find their analogues in the Iterator and Stream traits, so we’re off to a pretty good start. What’s missing, however, is the equivalent of yield: something to tell the consumer “Here, have the next item!”, and then go back to the exact same place in the producer function.

This ability to jump back and forth between two (or more) functions involves having a language support for coroutines. Not many mainstream languages pass this requirement, although Python and C# would readily come to mind. In case of Rust, there have been some nightly proposals and experiments, but nothing seems to be stabilizing anytime soon.

DIY streaming

But of course, if you do want a Stream of paginated items, there is at least one straightforward solution: just implement the Stream trait directly.

This is actually quite a viable approach, very similar to rolling out a custom Iterator. Some minor differences stem mostly from a more complicated state management in Stream::poll compared to Iterator::next. While an iterator is either exhausted or not, a stream can also be waiting for the next item to “arrive” (Ok(Async::NotReady)), or have errored out permanently (Err(e)). As a consequence, the return value of Stream::poll is slightly more complex than just plain Option, but nevertheless quite manageable.

Irrespective of difficulty, writing a custom Stream from scratch would inevitably involve a lot of boilerplate. You may find it necessary in more complicated applications, of course, but for something that’s basically a glorified while loop, it doesn’t seem like a big ask to have a more concise solution.

The stream unfolds

Fortunately there is one! Its crucial element is the standalone stream::unfold function from the futures crate:

pub fn unfold<T, F, Fut, It>(init: T, f: F) -> Unfold<T, F, Fut> where
    F: FnMut(T) -> Option<Fut>,
    Fut: IntoFuture<Item = (It, T)>,

Reading through the signature of this function can be a little intimidating at first. Part of it is Rust’s verbose syntax for anything that involves both trait bounds and closures³, making stream::unfold seem more complicated than it actually is. Indeed, if you’ve ever used Iterator adapters like .filter_map or .fold, the unfold function will be pretty easy to understand. (And if you haven’t, don’t worry! It’s really quite simple :))

If you look closely, you’ll see that stream::unfold takes the following two arguments:

first one is essentially an arbitrary initial value, called a seed
second one is a closure that receives the seed and returns an optional pair of values

What are those values?… Well, the entire purpose of the unfold function is to create a Stream, and a stream should inevitably produce some items. Consequently, the first value in the returned pair will be the next item in the stream.

And what about the second value? That’s just the next state of the seed! It will be received by the very same closure when someone asks the Stream to produce its next item. By passing around a useful value — say, a continuation token — you can create something that’s effectively a while loop from the Python example above.

The last important bits about this pair of values is the wrapping.

First, it is actually a Future, allowing your stream to yield objects that it doesn’t quite have yet — for example, those which ultimately come from an HTTP response.

Secondly, its outermost layer is an Option. This enables you to terminate the stream when the underlying source is exhausted by simply returning None. Until then, however, you should return Some with the (future of) aforementioned pair of values.

Paginate! Paginate!

If you have doubts about how all those pieces of stream::unfold fit in, then looking at the usage example in the docs may give you some idea of what it enables you to do. It’s a very artificial example, though: the resulting Stream isn’t waiting for any asynchronous Futures, which is the very reason you’d use a Stream over an Iterator in the first place⁴.

We can find a more natural application for unfold if we go back to our original problem. To reiterate, we want to repeatedly query an HTTP API for a long list of items, giving our callers a Stream of such items they can process at their leisure. At the same time, all the details about pagination and handling of continuation tokens or offsets should be completely hidden from the caller.

To employ stream::unfold for this task, we need two things: the initial seed, and an appropriate closure.

I have hinted already at using the continuation token as our seed, or the state that we pass around from one closure invocation to another. What remains is mostly making the actual HTTP request and interpreting the JSON response, for which we’ll use the defacto standard Rust crates: hyper, Serde, and serde_json:

use std::error::Error;

use futures::{future, Future, stream, Stream};
use hyper::{Client, Method};
use hyper::client::Request;
use serde_json;
use tokio_core::reactor::Handle;

const URL: &str = "http://api.example.com/items";

fn items(
    handle: &Handle, after: Option<String>
) -> Box<Stream<Item=Item, Error=Box<Error>>>
{
    let client = Client::new(handle);
    Box::new(stream::unfold(after, move |cont_token| {
        let url = match cont_token {
            Some(ct) => format!("{}?after={}", URL, ct),
            None => return None,
        };
        let req = Request::new(Method::Get, url.parse().unwrap());
        Some(client.request(req).from_err().and_then(move |resp| {
            let status = resp.status();
            resp.body().concat2().from_err().and_then(move |body| {
                if status.is_success() {
                    serde_json::from_slice::<ItemsResponse>(&body)
                        .map_err(Box::<Error>::from)
                } else {
                    Err(format!("HTTP status: {}", status).into())
                }
            })
            .map(move |items_resp| {
                (stream::iter_ok(items_resp.items), items_resp.continuation_token)
            })
        }))
    })
    .flatten())
}

#[derive(Deserialize)]
struct ItemsResponse {
    items: Vec<Item>,
    #[serde(rename = "continuationToken")]
    continuation_token: Option<String>,
}

While this code may be a little challenging to decipher at first, it’s not out of line compared to how working with Futures and Streams looks like in general. In either case, you can expect a lot of .and_then callbacks :)

There is one detail here that I haven’t mentioned previously, though. It relates to the stream::iter_ok and Stream::flatten calls which you may have already noticed.
The issue with stream::unfold is that it only allows to yield an item once per closure invocation. For us, this is too limiting: a single batch response from the API will contain many such items, but we have no way of “splitting” them.

What we can do instead is to produce a Stream of entire batches of items, at least at first, and then flatten it. What Stream::flatten does here is to turn a nested Stream<Stream<Item>> into a flat Stream<Item>. The latter is what we eventually want to return, so all we need now is to create this nested stream of streams.

How? Well, that’s actually pretty easy.

We can already deserialize a Vec<Item> from the JSON response — that’s our item batch! — which is essentially an iterable of Items⁵. Another utility function from the stream module, namely stream::iter_ok, can readily turn such iterable into a “immediate” Stream. Such Stream won’t be asynchronous at all — its items will have been ready from the very beginning — but it will still conform to the Stream interface, enabling it to be flattened as we request.

But wait! There is a bug!

So in the end, is this the solution we’re looking for?…

Well, almost. First, here’s the expected usage of the function we just wrote:

let mut core = tokio_core::reactor::Core::new().unwrap();
core.run({
    let continuation_token = None;  // start from the beginning
    items(&core.handle(), continuation_token).for_each(|item| {
        println!("{:?}", item);
        Ok(())
    })
}).unwrap();

While this is more complicated than the plain for loop in Python, most of it is just Tokio boilerplate. The notable part is the invocation of items(), where we pass None as a continuation token to indicate that we want the entire sequence, right from its beginning.

And since we’re talking about fetching long sequences, we would indeed expect a lot of items. So it is probably quite surprising to hear that the stream we’ve created here will be completely empty.

…What? How?!

If you look again at the source code of items(), the direct reason should be pretty easy to find. The culprit lies in the return None branch of the first match. If we don’t pass Some(continuation_token) as a parameter to items(), this branch will be hit immediately, terminating the stream before it had a chance to produce anything.

It may not be very clear how to fix this problem. After all, the purpose of the match was to detect the end of the sequence, but it apparently prevents us from starting it in the first place!

Looking at the problem from another angle, we can see we’ve conflated two distinct states of our stream — “before it has started” and “after it’s ended” — into a single one (“no continuation token”). Since we obviously don’t want to make the after parameter mandatory — users should be able to say “Give me everything!” — we need another way of telling those two states apart.

In terms of Rust types, it seems that Option<String> is no longer sufficient for encoding all possible states of our Stream. Although we could try to fix that in some ad-hoc way (e.g. by adding another bool flag), it feels cleaner to just define a new, dedicated type. For one, this allows us to designate a name for each of the states in question, improving the readability and maintainability of our code:

enum State {
    Start(Option<String>),
    Next(String),
    End,
}

Note that we can put this definition directly inside the items() function, without cluttering the module namespace. All the relevant details of our Stream are thus nicely contained within a single function:

fn items(
    handle: &Handle, after: Option<String>
) -> Box<Stream<Item=Item, Error=Box<Error>>>
{
    // (definition of State enum can go here)

    let client = Client::new(handle);
    Box::new(stream::unfold(State::Start(after), move |state| {
        let cont_token = match state {
            State::Start(opt_ct) => opt_ct,
            State::Next(ct) => Some(ct),
            State::End => return None,
        };
        let url = match cont_token {
            Some(ct) => format!("{}?after={}", URL, ct),
            None => URL.into(),
        };
        let req = Request::new(Method::Get, url.parse().unwrap());
        Some(client.request(req).from_err().and_then(move |resp| {
            let status = resp.status();
            resp.body().concat2().from_err().and_then(move |body| {
                if status.is_success() {
                    serde_json::from_slice::<ItemsResponse>(&body)
                        .map_err(Box::<Error>::from)
                } else {
                    Err(format!("HTTP status: {}", status).into())
                }
            })
            .map(move |items_resp| {
                let next_state = match items_resp.continuation_token {
                    Some(ct) => State::Next(ct),
                    None => State::End,
                };
                (stream::iter_ok(items_resp.items), next_state)
            })
        }))
    })
    .flatten())
}

Sure, there is a little more bookkeeping required now, but at least all the items are being emitted by the Stream as intended.

You can see the complete source in the playground here.

Furthermore, the token doesn’t have to come as part of the HTTP response body. Some API providers (such as GitHub) may use the Link: header to point directly to the next URL to query. ↩
This example uses “traditional”, synchronous Python code. However, it should be easy to convert it to the asynchronous equivalent that works in Python 3.5 and above, provided you can replace requests with some async HTTP library. ↩
If you are curious whether other languages could express it better, you can check the Data.Conduit.List.unfold function from the Haskell’s conduit package. For most intents and purposes, it is their equivalent of stream::unfold. ↩
Coincidentally, you can create iterators in the very same manner through the itertools::unfold function from the itertools crate. ↩
In more technical Rust terms, it means Vec implements the IntoIterator trait, allowing anyone to get an Iterator from it. ↩

Terminating a Stream in Rust

Posted on Sat 16 December 2017 in Code • Tagged with Rust, streams, Tokio, async • Leave a comment

Here’s a little trick that may be useful in dealing with asynchronous Streams in Rust.

When you consume a Stream using the for_each method, its default behavior is to finish early should an error be produced by the stream:

use futures::prelude::*;
use futures::stream;
use tokio_core::reactor::Core;

let s = stream::iter_result(vec![Ok(1), Ok(2), Err(false), Ok(3)]);
let fut = s.for_each(|n| {
    println!("{}", n);
    Ok(())
});

In more precise terms, it means that the Future returned by for_each will resolve with the first error from the underlying stream:

// Prints 1, 2, and then panics with "false".
Core::new().unwrap().run(fut).unwrap();

For most purposes, this is perfectly alright; errors are generally meant to propagate, after all.

Certain kinds of errors, however, are better off silenced. Perhaps they are expected to pop up during normal program operation, or maybe their occurrence should merely affect program execution in a particular way, and not halt it outright. In a simple case like above, you can of course check what for_each itself has returned, but that doesn’t scale to building larger Stream pipelines.

I encountered a situation like this myself when using the hubcaps library. The code I was writing was meant to search for GitHub issues within a specific repository. In GitHub API, this is accomplished by sending a search query like repo:$OWNER/$NAME, which may result in a rather obscure HTTP error (422 Unprocessable Entity) if the given repository doesn’t actually exist. But I didn’t care about this error; should it occur, I’d simply return an empty stream, because doing so was more convenient for the larger bit of logic that was consuming it.

Unfortunately, the Stream trait offers no interface that’d target this use case. There are only a few methods that even allow to look at errors mid-stream, and even fewer that can end it prematurely. On the flip side, at least we don’t have to consider too many combinations when looking for the solution ;)

Indeed, it seems there are only two Stream methods that are worthy of our attention:

Stream::then, because it allows for a closure to receive all stream values (items and errors)
Stream::take_while, because it accepts a closure that can end the stream early (but only based on items, not errors)

Combining them both, we arrive at the following recipe:

Inside a .then closure, look for Errors that you consider non-fatal and replace them with a special item value. The natural choice for such a value is None. As a side effect, this forces us to convert the regular (“successful”) items into Some(item), effectively transforming a Stream<Item=T> into Stream<Item=Option<T>>.
Looks for the special value (i.e. None) in the .take_while closure and terminate the stream when it’s been found.
Finally, convert the wrapped items back into their original form using .map, thus giving us back a Stream of T‘s.

Applying this technique to our initial example, we get something that looks like this:

let s = stream::iter_result(vec![Ok(1), Ok(2), Err(false), Ok(3)])
    .then(|r| match r {
        Ok(r) => Ok(Some(r)),  // no-op passthrough of items
        Err(false) => Ok(None) // non-fatal error, terminate the stream
        Err(e) => Err(e),      // no-op passthrough of other errors
    })
    .take_while(|x| future::ok(x.is_some()))
    .map(Option::unwrap);

If we now try to consume this stream like before:

Core::new().run(
    s.for_each(|n| { println!("{}", n); Ok(()) })
).unwrap();

it will still end after the first two items, but without producing any errors afterwards.

For a more reusable version of the trick, you can check this gist; it adds a Stream::take_while_err method through an extension trait.

This isn’t a perfect solution, however, because it requires Boxing even on nightly Rust¹. We can fix that by introducing a dedicated TakeWhileErr stream type, similarly to what native Stream methods do. I leave that as an exercise for the reader ;-)

This is due to a limitation in the impl Trait feature which prevents it from being used as a return type of trait methods. ↩

Small Rust crates I (almost) always use

Posted on Tue 31 October 2017 in Code • Tagged with Rust, libraries • Leave a comment

Alternative clickbait title: My Little Crates: Rust is Magic :-)

Due to its relatively scant standard library, programming in Rust inevitably involves pulling in a good number of third-party dependencies.

Some of them deal with problems that are solved with built-ins in languages that take a more “batteries included” approach. A good example would be the Python’s re module, whose moral equivalent in the Rust ecosystem is the regex crate.

Things like regular expressions, however, represent comparatively large problems. It isn’t very surprising that dedicated libraries exist to address them. It is less common for a language to offer small packages that target very specialized applications.

As in, one function/type/macro-kind of specialized, or perhaps only a little larger than that.

In this post, we’ll take a whirlwind tour through a bunch of such essential “micropackages”.

either

Rust has the built-in Result type, which is a sum¹ of an Ok outcome or an Error. It forms the basis of a general error handling mechanism in the language.

Structurally, however, Result<T, E> is just an alternative between the types T and E. You may want to use such an enum for other purposes than representing results of fallible operations. Unfortunately, because of the strong inherent meaning of Result, such usage would be unidiomatic and highly confusing.

This is why the either crate exists. It contains the following Either type:

enum Either<L, R> {
    Left(L),
    Right(R),
}

While it is isomorphic to Result, it carries no connotation to the entrenched error handling practices². Additionally, it offers symmetric combinator methods such as map_left or right_and_then for chaining computations involving the Either values.

lazy_static

As a design choice, Rust doesn’t allow for safe access to global mutable variables. The semi-standard way of introducing those into your code is therefore the lazy_static crate.

However, the most important usage for it is to declare lazy initialized constants of more complex types:

lazy_static! {
    static ref TICK_INTERVAL: Duration = Duration::from_secs(7 * 24 * 60 * 60);
}

The trick isn’t entirely transparent³, but it’s the best you can do until we get a proper support for compile-time expressions in the language.

maplit

To go nicely with the crate above — and to act as a natural syntactic follow-up to the standard vec![] macro — we’ve got the maplit crate.

What it does is add HashMap and HashSet “literals” by defining some very simple hashmap! and hashset! macros:

lazy_static! {
    static ref IMAGE_EXTENSIONS: HashMap<&'static str, ImageFormat> = hashmap!{
        "gif" => ImageFormat::GIF,
        "jpeg" => ImageFormat::JPEG,
        "jpg" => ImageFormat::JPG,
        "png" => ImageFormat::PNG,
    };
}

Internally, hashmap! expands to the appropriate amount of HashMap::insert calls, returning the finished hash map with all the keys and values given.

try_opt

Before the ? operator was introduced to Rust, the idiomatic way of propagating erroneous Results was the try! macro.

A similar macro can also be implemented for Option types so that it propagates the Nones upstream. The try_opt crate is doing precisely that, and the macro can be used in a straightforward manner:

fn parse_ipv4(s: &str) -> Option<(u8, u8, u8, u8)> {
    lazy_static! {
        static ref RE: Regex = Regex::new(
            r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$"
        ).unwrap();
    }
    let caps = try_opt!(RE.captures(s));
    let a = try_opt!(caps.get(1)).as_str();
    let b = try_opt!(caps.get(2)).as_str();
    let c = try_opt!(caps.get(3)).as_str();
    let d = try_opt!(caps.get(4)).as_str();
    Some((
        try_opt!(a.parse().ok()),
        try_opt!(b.parse().ok()),
        try_opt!(c.parse().ok()),
        try_opt!(d.parse().ok()),
    ))
}

Until Rust supports ? for Options (which is planned), this try_opt! macro can serve as an acceptable workaround.

exitcode

It is a common convention in basically every mainstream OS that a process has finished with an error if it exits with a code different than 0 (zero), Linux divides the space of error codes further, and — along with BSD — it also includes the sysexits.h header with some more specialized codes.

These have been adopted by great many programs and languages. In Rust, those semi-standard names for common errors can be used, too. All you need to do is add the exitcode crate to your project:

fn main() {
    let options = args::parse().unwrap_or_else(|e| {
        print_args_error(e).unwrap();
        std::process::exit(exitcode::USAGE);
    });

In addition to constants like USAGE or TEMPFAIL, the exitcode crate also defines an ExitCode alias for the integer type holding the exit codes. You can use it, among other things, as a return type of your top-level functions:

    let code = do_stuff(options);
    std::process::exit(code);
}

fn do_stuff(options: Options) -> exitcode::ExitCode {
    // ...
}

enum-set

In Java, there is a specialization of the general Set interface that works for enum types: the EnumSet class. Its members are represented very compactly as bits rather than hashed elements.

A similar (albeit slightly less powerful⁴) structure has been implemented in the enum-set crate. Given a #[repr(u32)] enum type:

#[repr(u32)]
#[derive(Clone, Copy, Debug Eq, Hash, PartialEq)]
enum Weekday {
    Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday,
}

you can create an EnumSet of its variants:

let mut weekend: EnumSet<Weekday> = EnumSet::new();
weekend.insert(Weekday::Saturday);
weekend.insert(Weekday::Sunday);

as long as you provide a simple trait impl that specifies how to convert those enum values to and from u32:

impl enum_set::CLike for Weekday {
    fn to_u32(&self) -> u32            { *self as u32 }
    unsafe fn from_u32(v: u32) -> Self { std::mem::transmute(v) }
}

The advantage is having a set structure represented by a single, unsigned 32-bit integer, leading to O(1) complexity of all common set operations. This includes membership checks, the union of two sets, their intersection, difference, and so on.

antidote

As part of fulfilling the promise of Fearless Concurrency™, Rust offers multiple synchronization primitives that are all defined in the std::sync module. One thing that Mutex, RwLock, and similar mechanisms there have in common is that their locks can become “poisoned” if a thread panicks while holding them. As a result, acquiring a lock requires handling the potential PoisonError.

For many programs, however, lock poisoning is not even a remote, but a straight-up impossible situation. If you follow the best practices of concurrent resource sharing, you won’t be holding locks for more than a few instructions, devoid of unwraps or any other opportunity to panic!(). Unfortunately, you cannot prove this to the Rust compiler statically, so it will still require you to handle a PoisonError that cannot happen.

This is where the aptly named antidote crate crate offers help. In it, you can find all the same locks & guards API that is offered by std::sync, just without the PoisonError. In many cases, this removal has radically simplified the interface, for example by turning Result<Guard, Error> return types into just Guard.

The caveat, of course, is that you need to ensure all threads holding these “immunized” locks either:

don’t panic at all; or
don’t leave guarded resources in an inconsistent state if they do panic

Like it’s been mentioned earlier, the best way to make that happen is to keep lock-guarded critical sections minimal and infallible.

matches

Pattern matching is one of the most important features of Rust, but some of the relevant language constructs have awkward shortcomings. The if let conditional, for example, cannot be combined with boolean tests:

if let Foo(_) = x && y.is_good() {

and thus requires additional nesting, or a different approach altogether.

Thankfully, to help with situations like this, there is the matches crate with a bunch of convenient macros. Besides its namesake, matches!:

if matches!(x, Foo(_)) && y.is_good() {

it also exposes assertion macros (assert_match! and debug_assert_match!) that can be used in both production and test code.

This concludes the overview of small Rust crates, at least for now.

To be certain, these crates are by far not the only ones that are small in size and simultaneously almost indispensable. Many more great libraries can be found e.g. in the Awesome Rust registry, though obviously you could argue if all of them are truly “micro” ;-)

If you know more crates in the similar vein, make sure to mention them in the comments!

A sum type consists of several alternatives, out of which only one has been picked for a particular instance. The other common name for it is a tagged union. ↩
Unless you come from Haskell, that is, where Either is the equivalent of Rust’s Result :) ↩
You will occasionally need an explicit * to trigger the Deref coercion it uses. ↩
It only supports unitary enums of up to 32 variants. ↩

The Printer Monad in Haskell

Posted on Fri 04 August 2017 in Code • Tagged with Haskell, Writer, monads, monad transformers, WriterT • Leave a comment

Quite recently, I have encountered an interesting case of monad-based refactoring in Haskell.

Suppose you have a ComplicatedRecord that holds the results of some lengthy and important process in your program. You want to present that data to the user in a nicely formatted way, so you write a function which begins somewhat like this:

{-# LANGUAGE RecordWildcards #-}

-- | Pretty-print the content of the record.
ppRecord :: ComplicatedRecord -> IO ()
ppRecord ComplicatedRecord{..} = do
    -- ...

Inside, there is plenty of putStrLn calls, likely hidden inside more specific subfunctions that format all the numerous parts of ComplicatedRecord. But the IO monad isn’t there just for printing: because the code went through multiple iterations, some of this logic actually takes advantage of it by making additional system & network calls.

So yeah, it’s not particularly pretty.

Now, however, we find out that the output we’re printing here shouldn’t always go directly to stdout. In some cases, unsurprisingly, we actually want it back as a single string, without having it sent to the standard output at all.

Just $ return . it

Your first instinct here may be to simply give back the final string (well, Text) as the function result¹:

ppRecord :: ComplicatedRecord -> IO Text

However, this turns out to be rather awkward. While in most other languages we would simply accumulate output by progressively adding more data to a mutable result, this would be much more inconvenient (and somewhat weird) to do in Haskell.

This is where the stdout-based approach seems cleaner; instead of straightforward, sequential code like this:

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.Extra (whenJust)
import Data.Text.IO
import TextShow

ppOrder Order{..} = do
    putStrLn $ "Order #" <> ordNumber
    ppAddress ordDeliveryAddress
    forM_ (zip [1..] ordItems) $ \(i, Item{..}) -> do
        putStrLn $ showt (i::Int) <> ". " <> itName <> " x" <> showt itQuantity
    whenJust ordBillingAddress ppAddress

ppAddress Address{..} = do
    putStrLn $ addrFirstName <> " " <> addrLastName
    putStrLn addrLine1
    whenJust addrLine2 putStrLn
    putStrLn $ addrCity <> ", " <> addrPostalCode

we have to overhaul each function and turn it into a much less pleasant “mappendage”:

ppOrder Order{..} = unlines $ mconcat
    [ "Order #" <> ordNumber
    , ppAddress ordDeliveryAddress
    , ppItems ordItems
    ] + maybe [] (\addr -> [ppAddress addr]) ordBillingAddress
    where
    ppItems = mconcat . map (uncurry ppItem) . zip [1..]
    ppItem i Item{..} = showt i <> ". " <> itName <> " x" <> itQuantity

One may argue that this is, in fact, the more idiomatic approach, but I’m not very fond of all those commas. Plus, it shows rather clearly that any conditional logic (like with ordBillingAddress here) is going to get pretty cumbersome.

Along comes the `Writer`

What I’m saying here is that even in pure code, it is sometimes very desirable to have a do notation. For that, however, we need a suitable Monad² to provide the meaning of “invisible semicolon” in a do block.

And Text, obviously, isn’t one. Neither is [Text] (lines of text), nor any other type we’d use to represent the final output of formatting & printing. They are unsuitable, because they cannot encode the computation that eventually produces said output — either the top-level one (ppRecord) or any of its building blocks (like the ppOrder or ppAddress), down to a most elementary putStrLn. The only thing they can stand for is the result itself.

Fortunately, the pattern of executing code and occassionally producing some “additional” output has been abstracted over in the Haskell standard library. This is exactly the use case for the Writer monad!

The definition of Writer is roughly equivalent to the following:

newtype Writer w a = ... -- omitted

Of the two type parameters it takes, the w one signifies what output it can produce “on the side”. This is contrasted with a which is the regular result of a monadic expression or function. In our case, a will basically always be () (unit/”empty” type), but it is nonetheless necessary for the Writer to behave as a monad.

To complement the above definition, Writer comes with several useful functions. Among those, the most interesting one is tell:

tell :: w -> Writer w ()

write would’ve probably been a better name for it, as it’s definitely the main and defining operation of Writer.

Looking at its signature, we can see it takes a bit of the Writer‘s output (w) and results in a Writer action. Internally, it will simply add the argument to the already accumulated output of the writer³.

To make everything more concrete, here’s a literal “Hello world” example coded very verbosly as a Writer action:

import Control.Monad.Writer

hello :: Writer Text ()
hello = do
    tell "Hello"
    tell " "
    tell "world"

main :: IO ()
main = do
    let (_, greeting) = runWriter hello
    Text.putStrLn greeting

It also contains the last element of the Writer puzzle:

runWriter :: Writer w a -> (a, w)

Like its name suggests, this function will “run” any Writer action that we give it, returning both the “regular” result (a) plus any output passed in tells (w).⁴

My little monad: transformers are magic

The last example may be very simple, but it contains all the building blocks for many of the printing functions we need. If we define a convenience wrapper for tell:

putLn :: Text -> Writer Text ()
putLn line = tell $ line <> "\n"

then both ppAddress and ppOrder can be translated through a mere mechanical substitution of putStrLn with putLn:

ppAddress Address{..} = do
    putLn $ addrFirstName <> " " <> addrLastName
    putLn addrLine1
    whenJust addrLine2 putLn
    putLn $ addrCity <> ", " <> addrPostalCode

-- ppOrder omitted

Unfortunately, a bare Writer like this can only work for pure code, which isn’t a luxury we can expect in every situtation. In my case, some of the printing logic was tied pretty strongly to IO, and it would be difficult and time consuming to decouple it.

Thankfully, the reliance on IO isn’t a complete deal breaker. While we cannot ensure that nothing calls putStrLn anymore, we can provide the tell/putLn capabilities alongside whatever other IO calls our code has to make (for now).

To achieve that, we need to create a monad stack with WriterT:

newtype WriterT w m a = ... -- omitted

WriterT is a monad transformer, one of those scary Haskell concepts that are actually simpler than they appear on the surface. This is because transfomers like WriterT are mere wrappers. The only difference between it and a regular Writer is the additional m parameter, which is the inner monad we’re packaging inside a new Writer.

Here (and in many other cases), m will be substituted with IO:

type Printer a = WriterT Text IO a  -- w == Text, m == IO

thus creating the titular Printer monad. This hybrid beast can both output Text through the Writer API, as well as perform any additional IO operations that the code may (still) require.

Below is an example; the User record requires an I/O call to get the size of its $HOME directory:

import Control.Monad.IO.Class (liftIO)
import System.Directory (getFileSize)

-- To print this data type nicely, we sadly require I/O :(
data User = User { usrName :: Text
                 , usrHomeDir :: FilePath
                 }

ppUser :: User -> IO Text
ppUser User{..} = snd <$> runWriterT $ do
    putLn $ "Name: " <> usrName
    homeSize <- liftIO $ getFileSize usrHomeDir
    putLn $ "$HOME: " <> showt usrHomeDir <> "(" <> showt homeSize <> " bytes)"

As a bit of necessary cruft, we have to use liftIO to “lift” (wrap) IO actions such as getFileSize in a full Printer monad before executing them. Besides everything else you can think of, this is yet another argument for eventually getting rid of the IO :)

Making the monads coexist

But our job isn’t done yet. Despite looking very reasonable, this version of ppUser doesn’t actually compile! The actual type error may vary a little, but it all boils down to a difference between WriterT Text IO () (i.e. Printer ()) and Writer Text () at each call site of putLn.

GHC is obviously correct. However, the problem lies not in how we’re calling putLn, but rather the way it’s been defined:

putLn :: Text -> Writer Text ()

This type can only produce a specific, pure Writer action. But to fit inside the do block of our compound monad, we need the Writer + IO combo from WriterT Text IO (i.e. Printer).

We can try to address the mismatch by changing the signature to:

putLn :: Text -> WriterT Text IO ()  -- or just: Printer ()

but this will only result in the opposite problem. Now, all the pure printers like ppAddress are facing the fact that putLn is a (wrapped) IO action, despite not actually doing any I/O whatsoever.

The obvious question is, can we have something that fits both?

Earlier on, I’ve said that both vanilla Writer and the IO-spruced Printer support the “Writer API”, most notably the tell function. This notion of a “monadic interface” isn’t just hand-waving, though, and Haskell (obviously!) provides a way to express it programmatically.

Meet the MonadWriter typeclass:

class (Monad m, Monoid w) => MonadWriter w m

Any monad that can work as a Writer will be an instance of it, regardless of whether it wraps over IO or anything else. Functions like tell are defined to be polymorphic over it, enabling us to leverage the same technique they use when we define putLn:

{-# LANGUAGE FlexibleContexts #-}

import Control.Monad.Writer.Class

putLn :: MonadWriter Text m => Text -> m ()
putLn line = tell $ line <> "\n"

If you aren’t very familiar with this syntax, the part before => is a typeclass constraint, or context. It defines the requirements to be satisfied by types which are later used in the function signature.

Here, we request a MonadWriter instance — one where Text is the output but anything can be the inner monad. We refer to that unknown monad only as m, a type variable. The compiler will figure out what to substitute for it at every call site of putLn.

As a result, both a pure Writer and the IO-bound Printer can now use it. In the second case, the relevant instance of MonadWriter will, naturally, have IO fill in the m position.

But curiously, the “pure” Writer also has an inner monad. It just literally does nothing but wrap some other value:

newtype Identity a = Identity { runIdentity :: a }

In most cases, this fact is hidden behind the real definition of Writer, though runIdentity may sometimes come handy for some on-the-spot type hacks⁵.

The wrap

The many things we’ve talked about here could of course be a starting point for even more advanced stuff, but obviously we have to stop somewhere! But don’t worry: knowing about MonadWriter and other monad typeclasses like this is enough to write quite idiomatic code…

…at least until you learn about free monads, effects, and the like ;-)

In any case, you can check this gist for the complete code from this post.

IO is still necessary due to ad-hoc network fetches and syscalls mentioned earlier. ↩
Or at least an Applicative, via the ApplicativeDo GHC extension. ↩
The adding is done via mappend, requiring w to be a Monoid. ↩
There is also the execWriter variant which is actually more practical here as it only returns the accumulated output. ↩
We could, for example, use it alongside mapWriterT to “fix” the calls to putLn if we didn’t have control over its definition. ↩

Extension traits in Rust

Posted on Tue 20 June 2017 in Code • Tagged with Rust, C#, methods, extension methods, traits • Leave a comment

In a few object-oriented languages, it is possible to add methods to a class after it’s already been defined.

This feature arises quite naturally if the language has a dynamic type system that’s modifiable at runtime. In those cases, even replacing existing methods is perfectly possible¹.

In addition to that, some statically typed languages — most notably in C# — offer extension methods as a dedicated feature of their type systems. The premise is that you would write standalone functions whose first argument is specially designated (usually by this keyword) as a receiver of the resulting method call:

public static int WordCount(this String str) {
    return str.Split(new char[] { ' ', '.', '?' },
                     StringSplitOptions.RemoveEmptyEntries).Length;
}

At the call site, the new method is indistinguishable from any of the existing ones:

string s = "Alice has a cat.";
int n = s.WordCount();

That’s assuming you have imported both the original class (or it’s a built-in like String), as well as the module in which the extension method is defined.

Rewrite it in Rust

The curious thing about Rust‘s type system is that it permits extension methods solely as a side effect of its core building block: traits.

In this post, I’m going to describe a certain design pattern in Rust which involves third-party types and user-defined traits. Several popular crates — like itertools or unicode-normalization — utilize it very successfully to add new, useful methods to the language standard types.

I’m not sure if this pattern has an official or widely accepted name. Personally, I’ve taken to calling it extension traits.

Let’s have a look at how they are commonly implemented.

Ingredients

We can use the extension trait pattern if we want to have additional methods in a type that we don’t otherwise control (or don’t want to modify).

Common cases include:

Rust standard library types, like Result, String, or anything else inside the std namespace
types imported from third-party libraries
types from the current crate if additional methods only make sense in certain scenarios (e.g. conditional compilation / testing)²

The crux of this technique is really simple. Like with most design patterns, however, it involves a certain degree of boilerplate and duplication.

So without further ado… In order to “patch” some new method(s) into an external type you will need to:

Define a trait with signatures of all the methods you want to add.
Implement it for the external type.
There is no step three.

As an important note on the usage side, the calling code needs to import your new trait in addition to the external type. Once that’s done, it can proceed to use the new methods is if they were there to begin with.

I’m sure you are keen on seeing some examples!

Broadening your `Option`s

We’re going to add two new methods to Rust’s standard Option type. The goal is to make it more convenient to operate on mutable Options by allowing to easily replace an existing value with another one³.

Here’s the appropriate extension trait⁴:

/// Additional mutation methods for `Option`.
pub trait OptionMutExt<T> {
    /// Replace the existing `Some` value with a new one.
    ///
    /// Returns the previous value if it was present, or `None` if no replacement was made.
    fn replace(&mut self, val: T) -> Option<T>;

    /// Replace the existing `Some` value with the result of given closure.
    ///
    /// Returns the previous value if it was present, or `None` if no replacement was made.
    fn replace_with<F: FnOnce() -> T>(&mut self, f: F) -> Option<T>;
}

It may feel at little bit weird to implement it.
You will basically have to pretend you are inside the Option type itself:

impl<T> OptionMutExt<T> for Option<T> {
    fn replace(&mut self, val: T) -> Option<T> {
        self.replace_with(move || val)
    }

    fn replace_with<F: FnOnce() -> T>(&mut self, f: F) -> Option<T> {
        if self.is_some() {
            let result = self.take();
            *self = Some(f());
            result
        } else {
            None
        }
    }
}

Unfortunately, this is just an illusion. Extension traits grant no special powers that’d allow you to bypass any of the regular visibility rules. All you can use inside the new methods is still just the public interface of the type you’re augmenting (here, Option).

In our case, however, this is good enough, mostly thanks to the recently introduced Option::take.

To use our shiny new methods in other places, all we have to do is import the extension trait:

use ext::rust::OptionMutExt;  // assuming you put it in ext/rust.rs

// ...somewhere...
let mut opt: Option<u32> = ...;
match opt.replace(42) {
    Some(x) => debug!("Option had a value of {} before replacement", x),
    None => assert_eq!(None, opt),
}

It doesn’t matter where it was defined either, meaning we can ship it away to crates.io and let it accrue as many happy users as Itertools has ;-)

Are you `hyper::Body` ready?

Our second example will demonstrate attaching more methods to a third-party type.

Last week, there was a new release of Hyper, a popular Rust framework for HTTP servers & clients. It was notable because it marked a switch from synchronous, straightforward API to a more complex, asynchronous one (which I incidentally wrote about a few weeks ago).

Predictably, there has been some confusion among its new and existing users.

We’re going to help by pinning a more convenient interface on hyper’s Body type. Body here is a struct representing the content of an HTTP request or response. After the ‘asyncatastrophe’, it doesn’t allow to access the raw incoming bytes as easily as it did before.

Thanks to extension traits, we can fix this rather quickly:

use std::error::Error;

use futures::{BoxFuture, future, Future, Stream};
use hyper::{self, Body};

pub trait BodyExt {
    /// Collect all the bytes from all the `Chunk`s from `Body`
    /// and return it as `Vec<u8>`.
    fn into_bytes(self) -> BoxFuture<Vec<u8>, hyper::Error>;

    /// Collect all the bytes from all the `Chunk`s from `Body`,
    /// decode them as UTF8, and return the resulting `String`.
    fn into_string(self) -> BoxFuture<String, Box<Error + Send>>;
}

impl BodyExt for Body {
    fn into_bytes(self) -> BoxFuture<Vec<u8>, hyper::Error> {
        self.concat()
            .and_then(|bytes| future::ok::<_, hyper::Error>(bytes.to_vec()))
            .boxed()
    }

    fn into_string(self) -> BoxFuture<String, Box<Error + Send>> {
        self.into_bytes()
            .map_err(|e| Box::new(e) as Box<Error + Send>)
            .and_then(|bytes| String::from_utf8(bytes)
                .map_err(|e| Box::new(e) as Box<Error + Send>))
            .boxed()
    }
}

With these new methods in hand, it is relatively straightforward to implement, say, a simple character-counting service:

use std::error::Error;

use futures::{BoxFuture, future, Future};
use hyper::server::{Service, Request, Response};

use ext::hyper::BodyExt;  // assuming the above is in ext/hyper.rs

pub struct Length;
impl Service for Length {
    type Request = Request;
    type Response = Response;
    type Error = Box<Error + Send>;
    type Future = BoxFuture<Self::Response, Self::Error>;

    fn call(&self, request: Request) -> Self::Future {
        let (_, _, _, _, body) = request.deconstruct();
        body.into_string().and_then(|s| future::ok(
            Response::new().with_body(s.len().to_string())
        )).boxed()
    }
}

Replacing Box<Error + Send> with an idiomatic error enum is left as an exercise for the reader :)

Extra credit bonus explanation

Reading this section is not necessary to use extension traits.

So far, we have seen what extension traits are capable of. It is only right to mention what they cannot do.

Indeed, this technique has some limitations. They are a conscious choice on the part of Rust authors, and they were decided upon in an effort to keep the type system coherent.

Coherence isn’t an everyday topic in Rust, but it becomes important when working with traits and types that cross package boundaries. Rules of trait coherence (described briefly towards the end of this section of the Rust book) state that the following combinations of “local” (this crate) and “external” (other crates⁵) are legal:

implement a local trait for a local type.
This is common in larger programs that use polymorphic abstractions.
implement an external trait for a local type.
We do this often to integrate with third-party libraries and frameworks, just like with hyper above.
implement a local trait for an external type.
That’s extension traits for you!

What is not possible, however, is to:

implement an external trait for an external type

This case is prohibited in order to make the choice of trait implementations more predictable, both for the compiler and for the programmer. Without this rule in place, you could introduce many instances of impl Trait for Type (same Trait and same Type), each one with different functionality, leaving the compiler to “guess” the right impl for any given situation⁶.

The decision was thus made to disallow the impl ExternalTrait for ExternalType case altogether. If you like, you can read some more extensive backstory behind it.

Bear in mind, however, that this isn’t the unequivocally “correct” solution. Some languages choose to allow this so-called orphan case, and try to resolve the potential ambiguities in various different ways. It is a genuinely useful feature, too, as it makes easier it to glue together two unrelated libraries⁷.

Thankfully for extension traits, the coherence restriction doesn’t apply as long as you keep those traits and their impls in the same crate.

This practice is often referred to as monkeypatching, especially in Python and Ruby. ↩
In this case, a more common solution is to just open another impl Foo block, annotated with #[cfg(test)] or similar. An extension trait, however, makes it easier to extract Foo into a separate crate along with some handy, test-only API. ↩
Note that this is not the same as the unstable (as of 1.18) Option methods guarded behind the options_entry feature gate. ↩
My own convention is to call those traits FooExt if they are meant to enhance the interface of type Foo. The other practice is to mirror the name of the crate that the trait is packaged in; both Itertools and UnicodeNormalization are examples of this style. ↩
Standard library (std or core namespaces) counts as external crate for this purpose. ↩
Or throw an error. However, trait impls are always imported implicitly, so this could essentially prevent some combination of different modules/libraries in the ecosystem from being used together, and generally create an unfathomable mess. ↩
The usual workaround for coherence/orphan rules in Rust involves creating a wrapper around the external type in order to make it “local”, and therefore allow external trait impls for it. This is called the newtype pattern and there are some crates to support it. ↩

Iteration patterns for Result & Option

Posted on Mon 10 April 2017 in Code • Tagged with Rust, iterators • Leave a comment

Towards the end of my previous post about for loops in Rust, I mentioned how those loops can often be expressed in a more declarative way. This alternative approach involves chaining methods of the Iterator trait to create specialized transformation pipelines:

let odds_squared: Vec<_> = (1..100)
    .filter(|x| x % 2 != 0)
    .map(|x| x * x)
    .collect();

Playground link

Code like this isn’t unique to Rust, of course. Similar patterns are prevalent in functional languages such as F#, and can also be found in Java (Streams), imperative .NET (LINQ), JavaScript (LoDash) and elsewhere.

This saying, Rust also has its fair share of unique iteration idioms. In this post, we’re going to explore those arising on the intersection of iterators and the most common Rust enums: Result and Option.

filter_map()

When working with iterators, we’re almost always interested in selecting elements that match some criterion or passing them through a transformation function. It’s not even uncommon to want both of those things, as demonstrated by the initial example in this post.

You can, of course, accomplish those two tasks independently: Rust’s filter and map methods work just fine for this purpose. But there exists an alternative, and in some cases it fits the problem amazingly well.

Meet filter_map. Here’s what the official docs have to say about it:

Creates an iterator that both filters and maps.

Well, duh.

On a more serious note, the common pattern that filter_map simplifies is unwrapping a series of Options. If you have a sequence of maybe-values, and you want to retain only those that are actually there, filter_map can do it in a single step:

// Get the sequence of all files matching a glob pattern via the glob crate.
let some_files = glob::glob("foo.*").unwrap().map(|x| x.unwrap());
// Retain only their extensions, e.g. ".txt" or ".md".
let file_extensions = some_files.filter_map(|p| p.extension());

The equivalent that doesn’t use filter_map would have to split the checking & unwrapping of Options into separate steps:

let file_extensions = some_files.map(|p| p.extension())
    .filter(|e| e.is_some()).map(|e| e.unwrap());

Because of this check & unwrap logic, filter_map can be useful even with a no-op predicate (.filter_map(|x| x)) if we already have the Option objects handy. Otherwise, it’s often very easy to obtain them, which is exactly the case for the Result type:

// Read all text lines from a file:
let lines: Vec<_> = BufReader::new(fs::File::open("file.ext")?)
    .lines().filter_map(Result::ok).collect();

With a simple .filter_map(Result::ok), like above, we can pass through a sequence of Results and yield only the “successful” values. I find this particular idiom to be extremely useful in practice, as long as you remember that Errors will be discarded by it¹.

As a final note on filter_map, you need to keep in mind that regardless of how great it often is, not all combinations of filter and map should be replaced by it. When deciding whether it’s appropriate in your case, it is helpful to consider the equivalence of these two expressions:

iter.filter(f).map(m)
iter.filter_map(|x| if f(x) { Some(m(x)) } else { None })

Simply put, if you find yourself writing conditions like this inside filter_map, you’re probably better off with two separate processing steps.

collect()

Let’s go back to the last example with a sequence of Results. Since the final sequence won’t include any Erroneous values, you may be wondering if there is a way to preserve them.

In more formal terms, the question is about turning a vector of results (Vec<Result<T, E>>) into a result with a vector (Result<Vec<T>, E>). We’d like for this aggregated result to only be Ok if all original results were Ok. Otherwise, we should just get the first Error.

Believe it or not, but this is probably the most common Rust problem!²

Of course, that doesn’t necessarily mean the problem is particularly hard. Possible solutions exist in both an iterator version:

let result = results.into_iter().fold(Ok(vec![]), |mut v, r| match r {
    Ok(x) => { v.as_mut().map(|v| v.push(x)); v },
    Err(e) => Err(e),
});

and in a loop form:

let mut result = Ok(vec![]);
for r in results {
    match r {
        Ok(x) => result.as_mut().map(|v| v.push(x)),
        Err(e) => { result = Err(e); break; },
    };
}

but I suspect not many people would call them clear and readable, let alone pretty³.

Fortunately, you don’t need to pollute your codebase with any of those workarounds. Rust offers an out-of-the-box solution which solves this particular problem, and its only flaw is one that I hope to address through this very post.

So, here it goes:

let result: Result<Vec<_>, _> = results.collect();

Yep, that’s all of it.

The background story is that Result<Vec<T>, E> simply “knows” how to construct itself from a sequence of Results. Unfortunately, this API is hidden behind Rust’s iterator abstraction, and specifically the fact that Result implements FromIterator in this particular manner. The way the documentation page for Result is structured, however — with trait implementations at the very end — ensures this useful fact remains virtually undiscoverable.

Because let’s be honest: no one scrolls that far.

Incidentally, Option offers analogous functionally: a sequence of Option<T> can be collected into Option<Vec<T>>, which will be None if any of the input elements were. As you may suspect, this fact is equally hard to find in the relevant docs.

But the good news is: you know about all this now! :) And perhaps thanks to this post, those handy tricks become a little better in a wider Rust community.

partition()

The last technique I wanted to present here follows naturally from the other idioms that apply to Results. Instead of extracting just the Ok values with flat_map, or keeping only the first error through collect, we will now learn how to retain all the errors and all the values, both neatly separated.

The partition method, as this is what the section is about, is essentially a more powerful variant of filter. While the latter only returns items that do match a predicate, partition will also give us the ones which don’t.

Using it to slice an iterable of Results is straightforward:

let (oks, fails): (Vec<_>, Vec<_>) = results.partition(Result::is_ok);

The only thing that remains cumbersome is the fact that both parts of the resulting tuple still contain just Results. Ideally, we would like them to be already unwrapped into values and errors, but unfortunately we need to do this ourselves:

let values: Vec<_> = oks.into_iter().map(Result::unwrap).collect();
let errors: Vec<_> = fails.into_iter().map(Result::unwrap_err).collect();

As an alternative, the partition_map method from the itertools crate can accomplish the same thing in a single step, albeit a more verbose one.

A symmetrical technique is to use .filter_map(Result::err) to get just the Error objects, but that’s probably much less useful as it drops all the successful values. ↩
Based on my completely unsystematic and anecdotal observations, someone asks about this on the #rust-beginners IRC approximately every other day. ↩
The fold variant is also rife with type inference traps, often requiring explicit type annotations, a “no-op” Err arm in match, or both. ↩

Older Posts