Karol Kuczmarski's Blog

Taking string arguments in Rust

Posted on Tue 24 December 2019 in Code • Tagged with Rust, strings, arguments, borrowing, ownership • Leave a comment

Strings of text seem to always be a complicated topic when it comes to programming. This counts double for low-level languages which expose the programmer to the full complexity of memory management and allocation.

Rust is, obviously, one of those languages. Strings in Rust are therefore represented using two distinct types: str (the string slice) and String (the owned/allocated string). Learning how to juggle those types is something you need to do very early if you want to be productive in the language.

But even after you’ve programmed in Rust for some time, you may still trip on some more subtle issues with string handling. In this post, I will concentrate on just one common task: writing a function that takes a string argument. We’ll see that even there, we can encounter a fair number of gotchas.

Just reading it

Let’s start with a simple case: a function which merely reads its string argument:

fn hello(name: &str) {
    println!("Hello, {}!", name);
}

As you’re probably well aware, using str rather than String is the idiomatic approach here. Because a &str reference is essentially an address + length, it can point to any string wheresoever: a 'static literal, a heap-allocated String, or any portion or substring thereof:

hello("world");
hello(&String::from("Alice"));
hello(&"Dennis Ritchie"[0..6]);

Contrast this with an argument of type &String:

fn hello(name: &String) {
    println!("Hello, {}!", name);
}

which mandates an actual, full-blown String object:

hello(&String::from("Bob"));
// (the other examples won't work)

There are virtually no circumstances when you would want to do this, as it potentially forces the caller to needlessly put the string on the heap. Even if you anticipate all function calls to involve actual String objects, the automatic Deref coercion from &String to &str should still allow you to use the more universal, str-based API.

Hiding the reference

If rustc can successfully turn a &String into &str, then perhaps it should also be possible to simply use String when that’s more convenient?

In general, this kind of “reverse Deref” doesn’t happen in Rust outside of method calls with &self. It seems, however, that it would sometimes be desirable; one reasonable use case involves chains of iterator adapters, most importantly map and for_each:

let strings: Vec<String> = vec!["Alice".into(), "Bob".into()];
strings.into_iter().for_each(hello);

Since the compiler doesn’t take advantage ofDeref coercions when inferring closure types, their argument types have to match exactly. As a result, we often need explicit |x| foo(x) closures which suffer from poorer readability in long Iterator or Stream-based expressions.

We can make the above code work — and also retain the ability to make calls like hello("Charlie"); — by using one of the built-in traits that generalize over the borrowing relationships. The one that works best for accepting string arguments is called AsRef¹:

fn hello<N: AsRef<str>>(name: N) {
    println!("Hello, {}!", name.as_ref());
}

Its sole method, AsRef::as_ref, returns a reference to the trait’s type parameter. In the case above, that reference will obviously be of type &str, which circles back to our initial example, one with a direct &str argument.

The difference is, however, that AsRef<str> is implemented for all interesting string types — both in their owned and borrowed versions. This obviates the need for Deref coercions and makes the API more convenient.

Own it

Things get a little more complicated when the string parameter is needed for more than just reading. For storage and potential mutation, a &str reference is not enough: you need an actual, full-blown String object.

Now, you may think this is not a huge obstacle. After all, it’s pretty easy to “turn” &str into a String:

struct Greetings {
    Vec<String> names,
}

impl Greetings {
    // Don't do this!
    pub fn hello(&mut self, name: &str) {
        self.names.push(name.clone());
    }
}

But I strongly advise against this practice, at least in public APIs. If you expose such function to your users, you are essentially tricking them into thinking their input will only ever be read, not copied, which has implications on both performance and memory usage.

Instead, if you need to take ownership of the resulting String, it is much better to indicate this in the function signature directly:

pub fn hello(&mut self, name: String) {
    self.names.push(name);
}

This shifts the burden on creating the String onto the caller, but that’s not necessarily a bad thing. On their side, the added boilerplate can pretty minimal:

let mut greetings = Greetings::new();
grettings.hello(String::from("Dylan"));  // uhm...
greetings.hello("Eva".to_string());      // somewhat better...
grettings.hello("Frank".to_owned());     // not too bad
greetings.hello("Gene".into());          // good enough

while clearly indicating where does the memory allocation happen.

It is also idiomatically used for functions taking Path parameters, i.e. AsRef<Path>. ↩

Add examples to your Rust libraries

Posted on Wed 28 February 2018 in Code • Tagged with Rust, Cargo, examples, documentation, packaging • Leave a comment

When you’re writing a library for other programs to depend on, it is paramount to think how the developers are going to use it in their code.

The best way to ensure they have a pleasant experience is to put yourself in their shoes. Forget the internal details of your package, and consider only its outward interface. Then, come up with a realistic use case and just implement it.

In other words, you should create complete, end-to-end, and (somewhat) usable example applications.

Examples are trouble

You may think this is asking a lot, and I wouldn’t really disagree here.

In most languages and programming platforms, it is indeed quite cumbersome to create example apps. This happens for at least several different reasons:

It typically requires bootstrapping an entire project from scratch. If you are lucky, you will have something like create-react-app to get you going relatively quickly. Still, you need to wire up the new project so that it depends on the source code of your library rather than its published version, and this tends to be a non-standard option — if it is available at all.
It’s unclear where should the example code live. Should you just throw it away, once it has served its immediate purpose? I’m sure this would discourage many people from creating examples in the first place. It’s certainly better to keep them in the version control, allowing their code to serve as additional documentation.

But if you intend to do this, you need to be careful not to deploy the example along with your library when you upload it to the package registry for your language. This may require maintaining an explicit blacklist and/or whitelist, in the vein of MANIFEST files in Python.
Examples may break as the library changes. Although example apps aren’t integration tests that have a clear, expected outcome, they should at the very least compile correctly.

The only way to ensure that is to include them in the build/test pipeline of your library. To accomplish this, however, you may need to complicate your CI setup, perhaps by introducing additional languages like Bash or Python.
It’s harder to maintain quality of example code. Any linters and static analyzers that you’re normally running will likely need to be configured to also apply to the examples. On the other hand, however, you probably don’t want those checkers to be too strict (it’s just example code, after all), so you may want to turn off some of the warnings, adjust the level of others, and so on.

So essentially, writing examples involves quite a lot of hassle. It would be great if the default tooling of your language helped to lessen the burden at least a little bit.

Well, good news! If you’re a Rust programmer, the language has basically got you covered.

Cargo — the standard build tool and package manager for Rust — has some dedicated features to support examples as a first-class concept. While it doesn’t completely address all the pain points outlined above, it goes a long way towards minimizing them.

What are Cargo examples?

In Cargo’s parlance, an example is nothing else but a Rust source code of a standalone executable¹ that typically resides in a single .rs file. All such files should be places in the examples/ directory, at the same level as src/ and the Cargo.toml manifest itself².

Here’s the simplest example of, ahem, an example:

// examples/hello.rs
fn main() {
    println!("Hello from an example!");
}

You can run it through the typical cargo run command; simply pass the example name after the --example flag:

$ cargo run --example hello
Hello from an example!

It is also possible to run the example with some additional arguments:

$ cargo run --example hello2 -- Alice
Hello, Alice!

which are relayed directly to the underlying binary:

// examples/hello2.rs
use std::env;

fn main() {
    let name = env::args().skip(1).next();
    println!("Hello, {}!", name.unwrap_or("world".into()));
}

As you can see, the way we run examples is very similar to how we’d run the src/bin binaries, which some people use as normal entry points to their Rust programs.

The important thing is that you don’t have to worry what to do with your example code anymore. All you need to do is drop it in the examples/ directory, and let Cargo do the rest.

Dependency included

Of course in reality, your examples will be at least a little more complicated than that. For one, they will surely call into your library to use its API, which means they need to depend on it & import its symbols.

Fortunately, this doesn’t complicate things even one bit.

The library crate itself is already an implied dependency of any code inside the examples/ directory. This is automatically handled by Cargo, so you don’t have to modify Cargo.toml (or do anything else really) to make it happen.

So without any additional effort, you can just to link to your library crate in the usual manner, i.e. by putting extern crate on top of the Rust file:

// examples/real.rs
extern crate mylib;

fn main() {
    let thing = mylib::make_a_thing();
    println!("I made a thing: {:?}", thing);
}

This goes even further, and extends to any dependency of the library itself. All such third-party crates are automatically available to the example code, which proves handy in common cases such as Tokio-based asynchronous APIs:

// example/async.rs
extern crate mylib;
extern crate tokio_core;  // assuming it's in mylib's [dependencies]

fn main() {
    let mut core = tokio_core::reactor::Core::new().unwrap();
    let thing = core.run(mylib::make_a_thing_asynchronously()).unwrap();
    println!("I made a thing: {:?}", thing);
}

More deps

Sometimes, however, it is very useful to pull in an additional package or two, just for the example code.

A typical case may involve logging.

If your library uses the usual log crate to output debug messages, you probably want to see them printed out when you run your examples. Since the log crate is just a facade, it doesn’t offer any built-in way to pipe log messages to standard output. To handle this part, you need something like the env_logger package:

// example/with_logging.rs
extern crate env_logger;
extern crate mylib;

fn main() {
    env_logger::init();
    println("{:?}", mylib::make_a_thing());
}

To be able to import env_logger like this, it natually has to be declared as a dependency in our Cargo.toml.

We won’t put it in the [dependencies] section of the manifest, however, as it’s not needed by the library code. Instead, we should place it in a separate section called [dev-dependencies]:

[dev-dependencies]
env_logger = "0.5"

Packages listed there are shared by tests, benchmarks, and — yes, examples. They are not, however, linked into regular builds of your library, so you don’t have to worry about bloating it with unnecessary code.

Growing bigger

So far, we have seen examples that span just a single Rust file. Practical applications tend to be bigger than that, so it’d be nice if we could provide some multi-file examples as well.

This is easily done, although for some reason it doesn’t seem to be mentioned in the official docs.

In any case, the approach is identical to executables inside src/bin/. Basically, if we have a single foo.rs file with executable code, we can expand it to a foo/ subdirectory with foo/main.rs as the entry point. Then, we can add whatever other submodules we want — just like we would do for a regular Rust binary crate:

// examples/multifile/main.rs
extern crate env_logger;
extern crate mylib;

mod util;

fn main() {
    env_logger::init();
    let ingredient = util::create_ingredient();
    let thing = mylib::make_a_thing_with(ingredient);
    println("{:?}", thing);
}

// examples/multifile/util.rs

pub fn create_ingredient() -> u64 {
    42
}

Of course, it won’t be often that examples this large are necessary. Showing how a library can scale to bigger applications can, however, be very encouraging to potential users.

Maintaining maintainability

Thus far, we have discussed how to create small and larger examples, how to use additional third-party crates in example programs, and how to easily build & run them using built-in Cargo commands.

All this effort spent on writing examples would be of little use if we couldn’t ensure that they work.

Like every type of code, examples are prone to breakage whenever the underlying API changes. If the library is actively developed, its interface represents a moving target. It is quite expected that changes may sometimes cause old examples to stop compiling.

Thankfully, Cargo is very dilligent in reporting such breakages. Whenever you run:

$ cargo test

all examples are built simultaneously with the execution of your regular test suite³. You get the compilation guarantee for your examples essentially for free — there is no need to even edit your .travis.yml, or to adjust your continuous integration setup in any other way!

Pretty neat, right?

This saying, you should keep in mind that simply compiling your examples on a regular basis is not a foolproof guarantee that their code never becomes outdated. Examples are not integration tests, and they won’t catch important changes in your implementation that aren’t breaking the interface.

Examples-Driven Development?

You may be wondering then, what’s exactly the point of writing examples? If you got tests on one hand to verify correctness, and documentation on the other hand to inform your users, then having a bunch of dedicated executable examples may seem superfluous.

To me, however, an impeccable test suite and amazing docs — which also remain comprehensive and awesome for an entire lifetime of the library! — sound a bit too much like a perfect world :) Adding examples to the mix can almost always improve things, and their maintenance burden should, in most cases, be very minimal.

But I have also found out that starting off with examples early on is a great way to validate the interface design.

Once the friction of creating small test programs has been eliminated, they become indispensable for prototyping new features. Wanna try out that new thing you’ve just added? Simple: just make a quick example for it, run it, and see what happens!

In many ways, doing this feels similar to trying out things in a REPL — something that’s almost exclusive to dynamic/interpreted languages. But unlike mucking around in Python shell, examples are not throwaway code: they become part of your project, and remain useful for both you & your users.

It is also possible to create examples which are themselves just libraries. I don’t think this is particularly useful, though, since all you can do with such examples is build them, so they don’t provide any additional value over normal tests (and especially doc tests). ↩
Because they are outside of the src/ directory, examples do not become a part of your library’s code, and are not deployed to crates.io. ↩
You can also run cargo build --examples to only compile the examples, without running any kind of tests. ↩

Unfolding a Stream of paginated items

Posted on Wed 24 January 2018 in Code • Tagged with Rust, Tokio, streams, HTTP • Leave a comment

My most recent Rust crate is an API client for the Path of Exile’s public stash tabs. One problem that I had to solve while writing it was to turn a sequence of paginated items (in this case, player stash tabs) into a single, asynchronous Stream.

In this post, I’ll explain how to use the Stream interface, along with functions from the futures crate, to create a single Stream from multiple batches of entities.

Pagination 101

To divide a long list of items into pages is a very common pattern in many HTTP-based APIs.

If the client requests a sequence of entities that would be too large to serve as a single response, there has to be some way to split it over multiple HTTP roundtrips. To accomplish that, API servers will often return a constant number of items at first (like 50), followed by some form of continuation token:

$ curl http://api.example.com/items
{
    "items": [
        {...},
        {...},
        {...}
    ],
    "continuationToken": "e53c68db0ee412ac239173db147a02a0"
}

Such token is preferably an opaque sequence of bytes, though sometimes it can be an explicit offset (index) into the list of results¹. Regardless of its exact nature, clients need to pass the token with their next request in order to obtain another batch of results:

$ curl 'http://api.example.com/items?after=e53c68db0ee412ac239173db147a02a0'
{
    "items": [
        {...},
        {...}
    ],
    "continuationToken": "4e3986e4c7f591b8cb17cf14addd40a6"
}

Repeat this procedure for as long as the response contains a continuation token, and you will eventually go through the entire sequence. If it’s really, really long (e.g. it’s a Twitter firehose for a popular hashtag), then you may of course hit some problems due to the sheer number of requests. For many datasets, however, this pagination scheme is absolutely sufficient while remaining relatively simple for clients to implement.

Stream it in Rust

What the client code would typically do, however, is to hide the pagination details completely and present only the final, unified sequence of items. Such abstraction is useful even for end-user applications, but it’s definitely expected from any shared library that wraps the third-party API.

Depending on your programming language of choice, this abstraction layer may be very simple to implement. Here’s how it could be done in Python, whose concepts of iterables and generators are a perfect fit for this task²:

import requests

def iter_items(after=None):
    """Yield items from an example API.
    :param after: Optional continuation token
    """
    while True:
        url = "http://api.example.com/items"
        if after is not None:
            url += "?after=%s" % after
        response = requests.get(url)
        response.raise_for_status()
        for item in response.json()['items']:
            yield item
        after = response.json().get("continuationToken")
        if after is None:
            break

# consumer
for item in iter_items():
    print(item)

In Rust, you can find their analogues in the Iterator and Stream traits, so we’re off to a pretty good start. What’s missing, however, is the equivalent of yield: something to tell the consumer “Here, have the next item!”, and then go back to the exact same place in the producer function.

This ability to jump back and forth between two (or more) functions involves having a language support for coroutines. Not many mainstream languages pass this requirement, although Python and C# would readily come to mind. In case of Rust, there have been some nightly proposals and experiments, but nothing seems to be stabilizing anytime soon.

DIY streaming

But of course, if you do want a Stream of paginated items, there is at least one straightforward solution: just implement the Stream trait directly.

This is actually quite a viable approach, very similar to rolling out a custom Iterator. Some minor differences stem mostly from a more complicated state management in Stream::poll compared to Iterator::next. While an iterator is either exhausted or not, a stream can also be waiting for the next item to “arrive” (Ok(Async::NotReady)), or have errored out permanently (Err(e)). As a consequence, the return value of Stream::poll is slightly more complex than just plain Option, but nevertheless quite manageable.

Irrespective of difficulty, writing a custom Stream from scratch would inevitably involve a lot of boilerplate. You may find it necessary in more complicated applications, of course, but for something that’s basically a glorified while loop, it doesn’t seem like a big ask to have a more concise solution.

The stream unfolds

Fortunately there is one! Its crucial element is the standalone stream::unfold function from the futures crate:

pub fn unfold<T, F, Fut, It>(init: T, f: F) -> Unfold<T, F, Fut> where
    F: FnMut(T) -> Option<Fut>,
    Fut: IntoFuture<Item = (It, T)>,

Reading through the signature of this function can be a little intimidating at first. Part of it is Rust’s verbose syntax for anything that involves both trait bounds and closures³, making stream::unfold seem more complicated than it actually is. Indeed, if you’ve ever used Iterator adapters like .filter_map or .fold, the unfold function will be pretty easy to understand. (And if you haven’t, don’t worry! It’s really quite simple :))

If you look closely, you’ll see that stream::unfold takes the following two arguments:

first one is essentially an arbitrary initial value, called a seed
second one is a closure that receives the seed and returns an optional pair of values

What are those values?… Well, the entire purpose of the unfold function is to create a Stream, and a stream should inevitably produce some items. Consequently, the first value in the returned pair will be the next item in the stream.

And what about the second value? That’s just the next state of the seed! It will be received by the very same closure when someone asks the Stream to produce its next item. By passing around a useful value — say, a continuation token — you can create something that’s effectively a while loop from the Python example above.

The last important bits about this pair of values is the wrapping.

First, it is actually a Future, allowing your stream to yield objects that it doesn’t quite have yet — for example, those which ultimately come from an HTTP response.

Secondly, its outermost layer is an Option. This enables you to terminate the stream when the underlying source is exhausted by simply returning None. Until then, however, you should return Some with the (future of) aforementioned pair of values.

Paginate! Paginate!

If you have doubts about how all those pieces of stream::unfold fit in, then looking at the usage example in the docs may give you some idea of what it enables you to do. It’s a very artificial example, though: the resulting Stream isn’t waiting for any asynchronous Futures, which is the very reason you’d use a Stream over an Iterator in the first place⁴.

We can find a more natural application for unfold if we go back to our original problem. To reiterate, we want to repeatedly query an HTTP API for a long list of items, giving our callers a Stream of such items they can process at their leisure. At the same time, all the details about pagination and handling of continuation tokens or offsets should be completely hidden from the caller.

To employ stream::unfold for this task, we need two things: the initial seed, and an appropriate closure.

I have hinted already at using the continuation token as our seed, or the state that we pass around from one closure invocation to another. What remains is mostly making the actual HTTP request and interpreting the JSON response, for which we’ll use the defacto standard Rust crates: hyper, Serde, and serde_json:

use std::error::Error;

use futures::{future, Future, stream, Stream};
use hyper::{Client, Method};
use hyper::client::Request;
use serde_json;
use tokio_core::reactor::Handle;

const URL: &str = "http://api.example.com/items";

fn items(
    handle: &Handle, after: Option<String>
) -> Box<Stream<Item=Item, Error=Box<Error>>>
{
    let client = Client::new(handle);
    Box::new(stream::unfold(after, move |cont_token| {
        let url = match cont_token {
            Some(ct) => format!("{}?after={}", URL, ct),
            None => return None,
        };
        let req = Request::new(Method::Get, url.parse().unwrap());
        Some(client.request(req).from_err().and_then(move |resp| {
            let status = resp.status();
            resp.body().concat2().from_err().and_then(move |body| {
                if status.is_success() {
                    serde_json::from_slice::<ItemsResponse>(&body)
                        .map_err(Box::<Error>::from)
                } else {
                    Err(format!("HTTP status: {}", status).into())
                }
            })
            .map(move |items_resp| {
                (stream::iter_ok(items_resp.items), items_resp.continuation_token)
            })
        }))
    })
    .flatten())
}

#[derive(Deserialize)]
struct ItemsResponse {
    items: Vec<Item>,
    #[serde(rename = "continuationToken")]
    continuation_token: Option<String>,
}

While this code may be a little challenging to decipher at first, it’s not out of line compared to how working with Futures and Streams looks like in general. In either case, you can expect a lot of .and_then callbacks :)

There is one detail here that I haven’t mentioned previously, though. It relates to the stream::iter_ok and Stream::flatten calls which you may have already noticed.
The issue with stream::unfold is that it only allows to yield an item once per closure invocation. For us, this is too limiting: a single batch response from the API will contain many such items, but we have no way of “splitting” them.

What we can do instead is to produce a Stream of entire batches of items, at least at first, and then flatten it. What Stream::flatten does here is to turn a nested Stream<Stream<Item>> into a flat Stream<Item>. The latter is what we eventually want to return, so all we need now is to create this nested stream of streams.

How? Well, that’s actually pretty easy.

We can already deserialize a Vec<Item> from the JSON response — that’s our item batch! — which is essentially an iterable of Items⁵. Another utility function from the stream module, namely stream::iter_ok, can readily turn such iterable into a “immediate” Stream. Such Stream won’t be asynchronous at all — its items will have been ready from the very beginning — but it will still conform to the Stream interface, enabling it to be flattened as we request.

But wait! There is a bug!

So in the end, is this the solution we’re looking for?…

Well, almost. First, here’s the expected usage of the function we just wrote:

let mut core = tokio_core::reactor::Core::new().unwrap();
core.run({
    let continuation_token = None;  // start from the beginning
    items(&core.handle(), continuation_token).for_each(|item| {
        println!("{:?}", item);
        Ok(())
    })
}).unwrap();

While this is more complicated than the plain for loop in Python, most of it is just Tokio boilerplate. The notable part is the invocation of items(), where we pass None as a continuation token to indicate that we want the entire sequence, right from its beginning.

And since we’re talking about fetching long sequences, we would indeed expect a lot of items. So it is probably quite surprising to hear that the stream we’ve created here will be completely empty.

…What? How?!

If you look again at the source code of items(), the direct reason should be pretty easy to find. The culprit lies in the return None branch of the first match. If we don’t pass Some(continuation_token) as a parameter to items(), this branch will be hit immediately, terminating the stream before it had a chance to produce anything.

It may not be very clear how to fix this problem. After all, the purpose of the match was to detect the end of the sequence, but it apparently prevents us from starting it in the first place!

Looking at the problem from another angle, we can see we’ve conflated two distinct states of our stream — “before it has started” and “after it’s ended” — into a single one (“no continuation token”). Since we obviously don’t want to make the after parameter mandatory — users should be able to say “Give me everything!” — we need another way of telling those two states apart.

In terms of Rust types, it seems that Option<String> is no longer sufficient for encoding all possible states of our Stream. Although we could try to fix that in some ad-hoc way (e.g. by adding another bool flag), it feels cleaner to just define a new, dedicated type. For one, this allows us to designate a name for each of the states in question, improving the readability and maintainability of our code:

enum State {
    Start(Option<String>),
    Next(String),
    End,
}

Note that we can put this definition directly inside the items() function, without cluttering the module namespace. All the relevant details of our Stream are thus nicely contained within a single function:

fn items(
    handle: &Handle, after: Option<String>
) -> Box<Stream<Item=Item, Error=Box<Error>>>
{
    // (definition of State enum can go here)

    let client = Client::new(handle);
    Box::new(stream::unfold(State::Start(after), move |state| {
        let cont_token = match state {
            State::Start(opt_ct) => opt_ct,
            State::Next(ct) => Some(ct),
            State::End => return None,
        };
        let url = match cont_token {
            Some(ct) => format!("{}?after={}", URL, ct),
            None => URL.into(),
        };
        let req = Request::new(Method::Get, url.parse().unwrap());
        Some(client.request(req).from_err().and_then(move |resp| {
            let status = resp.status();
            resp.body().concat2().from_err().and_then(move |body| {
                if status.is_success() {
                    serde_json::from_slice::<ItemsResponse>(&body)
                        .map_err(Box::<Error>::from)
                } else {
                    Err(format!("HTTP status: {}", status).into())
                }
            })
            .map(move |items_resp| {
                let next_state = match items_resp.continuation_token {
                    Some(ct) => State::Next(ct),
                    None => State::End,
                };
                (stream::iter_ok(items_resp.items), next_state)
            })
        }))
    })
    .flatten())
}

Sure, there is a little more bookkeeping required now, but at least all the items are being emitted by the Stream as intended.

You can see the complete source in the playground here.

Furthermore, the token doesn’t have to come as part of the HTTP response body. Some API providers (such as GitHub) may use the Link: header to point directly to the next URL to query. ↩
This example uses “traditional”, synchronous Python code. However, it should be easy to convert it to the asynchronous equivalent that works in Python 3.5 and above, provided you can replace requests with some async HTTP library. ↩
If you are curious whether other languages could express it better, you can check the Data.Conduit.List.unfold function from the Haskell’s conduit package. For most intents and purposes, it is their equivalent of stream::unfold. ↩
Coincidentally, you can create iterators in the very same manner through the itertools::unfold function from the itertools crate. ↩
In more technical Rust terms, it means Vec implements the IntoIterator trait, allowing anyone to get an Iterator from it. ↩

Terminating a Stream in Rust

Posted on Sat 16 December 2017 in Code • Tagged with Rust, streams, Tokio, async • Leave a comment

Here’s a little trick that may be useful in dealing with asynchronous Streams in Rust.

When you consume a Stream using the for_each method, its default behavior is to finish early should an error be produced by the stream:

use futures::prelude::*;
use futures::stream;
use tokio_core::reactor::Core;

let s = stream::iter_result(vec![Ok(1), Ok(2), Err(false), Ok(3)]);
let fut = s.for_each(|n| {
    println!("{}", n);
    Ok(())
});

In more precise terms, it means that the Future returned by for_each will resolve with the first error from the underlying stream:

// Prints 1, 2, and then panics with "false".
Core::new().unwrap().run(fut).unwrap();

For most purposes, this is perfectly alright; errors are generally meant to propagate, after all.

Certain kinds of errors, however, are better off silenced. Perhaps they are expected to pop up during normal program operation, or maybe their occurrence should merely affect program execution in a particular way, and not halt it outright. In a simple case like above, you can of course check what for_each itself has returned, but that doesn’t scale to building larger Stream pipelines.

I encountered a situation like this myself when using the hubcaps library. The code I was writing was meant to search for GitHub issues within a specific repository. In GitHub API, this is accomplished by sending a search query like repo:$OWNER/$NAME, which may result in a rather obscure HTTP error (422 Unprocessable Entity) if the given repository doesn’t actually exist. But I didn’t care about this error; should it occur, I’d simply return an empty stream, because doing so was more convenient for the larger bit of logic that was consuming it.

Unfortunately, the Stream trait offers no interface that’d target this use case. There are only a few methods that even allow to look at errors mid-stream, and even fewer that can end it prematurely. On the flip side, at least we don’t have to consider too many combinations when looking for the solution ;)

Indeed, it seems there are only two Stream methods that are worthy of our attention:

Stream::then, because it allows for a closure to receive all stream values (items and errors)
Stream::take_while, because it accepts a closure that can end the stream early (but only based on items, not errors)

Combining them both, we arrive at the following recipe:

Inside a .then closure, look for Errors that you consider non-fatal and replace them with a special item value. The natural choice for such a value is None. As a side effect, this forces us to convert the regular (“successful”) items into Some(item), effectively transforming a Stream<Item=T> into Stream<Item=Option<T>>.
Looks for the special value (i.e. None) in the .take_while closure and terminate the stream when it’s been found.
Finally, convert the wrapped items back into their original form using .map, thus giving us back a Stream of T‘s.

Applying this technique to our initial example, we get something that looks like this:

let s = stream::iter_result(vec![Ok(1), Ok(2), Err(false), Ok(3)])
    .then(|r| match r {
        Ok(r) => Ok(Some(r)),  // no-op passthrough of items
        Err(false) => Ok(None) // non-fatal error, terminate the stream
        Err(e) => Err(e),      // no-op passthrough of other errors
    })
    .take_while(|x| future::ok(x.is_some()))
    .map(Option::unwrap);

If we now try to consume this stream like before:

Core::new().run(
    s.for_each(|n| { println!("{}", n); Ok(()) })
).unwrap();

it will still end after the first two items, but without producing any errors afterwards.

For a more reusable version of the trick, you can check this gist; it adds a Stream::take_while_err method through an extension trait.

This isn’t a perfect solution, however, because it requires Boxing even on nightly Rust¹. We can fix that by introducing a dedicated TakeWhileErr stream type, similarly to what native Stream methods do. I leave that as an exercise for the reader ;-)

This is due to a limitation in the impl Trait feature which prevents it from being used as a return type of trait methods. ↩

Recap of the gisht project

Posted on Fri 24 November 2017 in Programming • Tagged with Rust, gisht, CLI, GitHub, Python, testing • Leave a comment

In this post, I want to discuss some of the experiences I had with a project that I recently finished, gisht. By “finished” I mean that I don’t anticipate developing any new major features for it, though smaller things, bug fixes, or non-code stuff, is of course still very possible.

I’m thinking this is as much “done” as most software projects can ever hope to be. Thus, it is probably the best time for a recap / summary / postmortem / etc. — something to recount the lessons learned, and assess the choices made.

Some context

The original purpose of gisht was to facilitate download & execution of GitHub gists straight from the command line:

$ gisht Xion/git-outgoing  # run the https://gist.github.com/Xion/git-outgoing gist

I initially wrote its first version in Python because I’ve accumulated a sizable number of small & useful scripts (for Git, Unix, Python, etc.) which were all posted as gists. Sure, I could download them manually to ~/bin every time I used a new machine but that’s rather cumbersome, and I’m quite lazy.

Well, lazy and impatient :) I noticed pretty fast that the speed tax of Python is basically unacceptable for a program like gisht.

What I’m referring to here is not the speed of code execution, however, but only the startup time of Python interpreter. Irrespective of the machine, operating system, or language version, it doesn’t seem to go lower than about one hundred milliseconds; empirically, it’s often 2 or 3 times higher than that. For the common case of finding a cached gist (no downloads) and doing a simple fork+exec, this startup time was very noticeable and extremely jarring. It also precluded some more sophisticated uses for gisht, like putting its invocation into the shell’s $PROMPT¹.

Speed: delivered

And so the obvious solution emerged: let’s rewrite it in Rust!…

Because if I’m executing code straight from the internet, I should at least do it in a safe language.

But jokes aside, it is obvious that a language compiling to native code is likely a good pick if you want to optimize for startup speed. So while the choice of Rust was in large part educational (gisht was one of my first projects to be written in it), it definitely hasn’t disappointed there.

Even without any intentional optimization efforts, the app still runs instantaneously. I tried to take some measurements using the time command, but it never ticked into more than 0.001s. Perceptively, it is at least on par with git, so that’s acceptable for me :)

Can’t segfault if your code doesn’t build

Achieving the performance objective wouldn’t do us much good, however, if the road to get there involved excessive penalties on productivity. Such negative impact could manifest in many ways, including troublesome debugging due to a tricky runtime², or difficulty in getting the code to compile in the first place.

If you had even a passing contact with Rust, you’d expect the latter to be much more likely than the former.

Indeed, Rust’s very design eschews runtime flexibility to a ridiculous degree (in its “safe” mode, at least), while also forcing you to absorb subtle & complex ideas to even get your code past the compiler. The reward is increased likelihood your program will behave as intended — although it’s definitely not on the level of “if it compiles, it works” that can be offered by Haskell or Idris.

But since gisht is hardly mission critical, I didn’t actually care too much about this increased reliability. I don’t think it’s likely that Rust would buy me much over something like modern C++. And if I were to really do some kind of cost-benefit analysis of several languages — rather than going with Rust simply to learn it better — then it would be hard to justify it over something like Go.

It scales

So the real question is: has Rust not hampered my productivity too much? Having the benefit of hindsight, I’m happy to say that the trade-off was definitely acceptable :)

One thing I was particularly satisfied with was the language’s scalability. What I mean here is the ability to adapt as the project grows, but also to start quickly and remain nimble while the codebase is still pretty small.

Many languages (most, perhaps) are naturally tailored towards the large end, doing their best to make it more bearable to work with big codebases. In turn, they often forget about helping projects take off in the first place. Between complicated build systems and dependency managers (Java), or a virtual lack of either (C++), it can be really hard to get going in a “serious” language like this.

On the other hand, languages like Python make it very easy to start up and achieve relatively impressive results. Some people, however, report having encountered problems once the code evolves past certain size. While I’m actually very unsympathetic to those claims, I realize perception plays a significant role here, making those anecdotal experiences into a sort of self-fulfilling prophecy.

This perception problem should almost certainly spare Rust, as it’s a natively compiled and statically typed language, with a respectable type system to boot. There is also some evidence that the language works well in large projects already. So the only question that we might want to ask is: how easy it is to actually start a project in Rust, and carry it towards some kind of MVP?

Based on my experiences with gisht, I can say that it is, in fact, quite easy. Thanks mostly to the impressive Swiss army knife of cargo — acting as both package manager and a rudimentary build system — it was almost Python-trivial to cook a “Hello World” program that does something tangible, like talk to a JSON API. From there, it only took a few coding sessions to grow it into a functioning prototype.

Abstractions galore

As part of rewriting gisht from Python to Rust, I also wanted to fix some longstanding issues that limited its capabilities.

The most important one was the hopeless coupling to GitHub and their particular flavor of gists. Sure, this is where the project even got its name from, but people use a dozen of different services to share code snippets and it should very possible to support them all.

Here’s where it became necessary to utilize the abstraction capabilities that Rust has to offer. It was somewhat obvious to define a Host trait but of course its exact form had to be shaped over numerous iterations. Along the way, it even turned out that Result<Option<T>> and Option<Result<T>> are sometimes both necessary as return types :)

Besides cleaner architecture, another neat thing about an explicit abstraction is the ability to slice a concept into smaller pieces — and then put some of them back together. While the Host trait could support a very diverse set of gist services and pastebins, many of them turned out to be just a slight variation of one central theme. Because of this similarity, it was possible to introduce a single Basic implementation which handles multiple services through varying sets of URL patterns.

Devices like these aren’t of course specific to Rust: interfaces (traits) and classes are a staple of OO languages in general. But some other techniques were more idiomatic; the concept of iterators, for example, is flexible enough to accommodate looping over GitHub user’s gists, even as they read directly from HTTP responses.

Hacking time

Not everything was sunshine and rainbows, though.

Take clap, for example. It’s mostly a very good crate for parsing command line arguments, but it couldn’t quite cope with the unusual requirements that gisht had. To make gisht Foo/bar work alongside gisht run Foo/bar, it was necessary to analyze argv before even handing it over to clap. This turned out to be surprisingly tricky to get right. Like, really tricky, with edges cases and stuff. But as it is often the case in software, the answer turned out to be yet another layer of indirection plus a copious amount of tests.

In another instance, however, a direct library support was crucial.

It so happened that hyper, the crate I’ve been using for HTTP requests, didn’t handle the Link: response header out of the box³. This was a stumbling block that prevented the gist iterator (mentioned earlier) from correctly handling pagination in the responses from GitHub API. Thankfully, having the Header abstraction in hyper meant it was possible to add the missing support in a relatively straighforward manner. Yes, it’s not a universal implementation that’d be suitable for every HTTP client, but it does the job for gisht just fine.

Test-Reluctant Development

And so the program kept growing steadily over the months, most notably through more and more gist hosts it could now support.

Eventually, some of them would fall into a sort of twilight zone. They weren’t as complicated as GitHub to warrant writing a completely new Host instance, but they also couldn’t be handled via the Basic structure alone. A good example would be sprunge.us: mostly an ordinary pastebin, except for its optional syntax highlighting which may add some “junk” to the otherwise regular URLs.

In order to handle those odd cases, I went for a classic wrapper/decorator pattern which, in its essence, boils down to something like this:

pub struct Sprunge {
    inner: Basic,
}

impl Sprunge {
    pub fn new() -> {
        Sprunge{inner: Basic::new(ID, "sprunge.us",
                                  "http://sprunge.us/${id}", ...)}
    }
}

impl Host for Sprunge {
    // override & wrap methods that require custom logic:
    fn resolve_url(&self, url: &str) -> Option<io::Result<Gist>> {
        let mut url_obj = try_opt!(Url::parse(url).ok());
        url_obj.set_query(None);
        inner.resolve_url(url_obj.to_string().as_str())
    }

    // passthrough to the `Basic` struct for others:
    fn fetch_gist(&self, gist: &Gist, mode: FetchMode) -> io::Result<()> {
        self.inner.fetch_gist(gist, mode)
    }
    // (etc.)
}

Despite the noticeable boilerplate of a few pass-through methods, I was pretty happy with this solution, at least initially. After a few more unusual hosts, however, it became cumbersome to fix all the edge cases by looking only at the final output of the inner Basic implementation. The code was evidently asking for some tests, if only to check how the inner structure is being called.

Shouldn’t be too hard, right?… Yeah, that’s what I thought, too.

The reality, unfortunately, fell very short of those expectations. Stubs, mocks, fakes — test doubles in general — are a dark and forgotten corner of Rust that almost no one seems to pay any attention to. Absent a proper library support — much less a language one — the only way forward was to roll up my sleeves and implement a fake Host from scratch.

But that was just the beginning. How do you seamlessly inject this fake implementation into the wrapper so that it replaces the Basic struct for testing? If you are not careful and go for the “obvious” solution — a trait object:

pub struct Sprunge {
    inner: Box<Host>,
}

you’ll soon realize that you need not just a Box, but at least an Rc (or maybe even Arc). Without this kind of shared ownership, you’ll lose your chance to interrogate the test double once you hand it over to the wrapper. This, in turn, will heavily limit your ability to write effective tests.

What’s the non-obvious approach, then? The full rationale would probably warrant a separate post, but the working recipe looks more or less like this:

First, parametrize the wrapper with its inner type: pub struct Sprunge<T: Host> { inner: T }.

Put that in an internal module with the correct visibility setup:

mod internal {
    pub struct Sprunge<T: Host> {
        pub(super) inner: T,
    }
}

Make the regular (“production”) version of the wrapper into an alias, giving it the type parameter that you’ve been using directly⁴:
```
pub type Sprunge = internal::Sprunge<Basic>;
```
Change the new constructor to instantiate the internal type.
In tests, create the wrapper with a fake inner object inside.

As you can see in the real example, this convoluted technique removes the need for any pointer indirection. It also permits you to access the out-of-band interface that a fake object would normally expose.

It’s a shame, though, that so much work is required for something that should be very simple. As it appears, testing is still a neglected topic in Rust.

Packing up

It wasn’t just Rust that played a notable role in the development of gisht.

Pretty soon after getting the app to a presentable state, it became clear that a mere cargo build won’t do everything that’s necessary to carry out a complete build. It could do more, admittedly, if I had the foresight to explore Cargo build scripts a little more thoroughly. But overall, I don’t regret dropping back to my trusty ol’ pick: Python.

Like in a few previous projects, I used the Invoke task runner for both the crucial and the auxiliary automation tasks. It is a relatively powerful tool — and probably the best in its class in Python that I know of — though it can be a bit capricious if you want to really fine-tune it. But it does make it much easier to organize your automation code, to reuse it between tasks, and to (ahem) invoke those tasks in a convenient manner.

In any case, it certainly beats a collection of disconnected Bash scripts ;)

What have I automated in this way, you may ask? Well, a couple of small things; those include:

embedding of the current Git commit hash into the binary, to help identify the exact revision in the logs of any potential bug reports⁵
after a successful build, replacing the Usage section in README with the program’s --help output
generating completion scripts for popular shells by invoking the binary with a magic hidden flag (courtesy of clap)

Undoubtedly the biggest task that I relegated to Python/Invoke, was the preparation of release packages. When it comes to the various Linuxes (currently Debian and Red Hat flavors), this wasn’t particularly complicated. Major thanks are due to the amazing fpm tool here, which I recommend to anyone who needs to package their software in a distro-compatible manner.

Homebrew, however — or more precisely, OS X itself — was quite a different story. Many, many failed attempts were needed to even get it to build on Travis, and the additional dependency on Python was partially to blame. To be fair, however, most of the pain was exclusively due to OpenSSL; getting that thing to build is always loads of “fun”, especially in such an opaque and poorly debuggable environment as Travis.

The wrap

There’s probably a lot of minor things and tidbits I could’ve mentioned along the way, but the story so far has most likely covered all the important topics. Let’s wrap it up then, and highlight some interesting points in the classic Yay/Meh/Nay manner.

Yay

It was definitely a good choice to rewrite gisht specifically in Rust. Besides all the advantages I’ve mentioned already, it is also worth noting that the language went through about 10 minor version bumps while I was working on this project. Of all those new releases, I don’t recall a single one that would introduce a breaking change.
Most of the Rust ecosystem (third-party libraries) was a joy to use, and very easy to get started with. Honorable mention goes to serde_json and how easy it was to transition the code from rustc_serialize that I had used at first.
With a possible exception of sucking in node.js as a huge dependency of your project and using Grunt, there is probably no better way of writing automation & support code than Python. There may eventually be some Rust-based task runners that could try to compete, but I’m not very convinced about using a compiled language for this purpose (and especially one that takes so long to build).

Meh

While the clap crate is quite configurable and pretty straightforward to use, it does lack at least one feature that’d be very nice for gisht. Additionally, working with raw clap is often a little tedious, as it doesn’t assist you in translating parsed flags into your own configuration types, and thus requires shuffling those bits manually⁶.
Being a defacto standard for continuous integration in open-source projects, Travis CI could be a little less finicky. In almost every project I decide to use it for, I end up with about half a dozen commits that frantically try to fix silly configuration issues, all before even a simple .travis.yml works as intended. Providing a way to test CI builds locally would be an obvious way to avoid this churn.

Nay

Testing in Rust is such a weird animal. On one hand, there is a first-class, out-of-the-box support for unit tests (and even integration tests) right in the toolchain. On the other hand, the relevant parts of the ecosystem are immature or lacking, as evidenced by the dreary story of mocking and stubbing. It’s no surprise that there is a long way to catch up to languages with the strongest testing culture (Java and C#/.NET⁷), but it’s disappointing to see Rust outclassed even by C++.
Getting anything to build reliably on OSX in a CI environment is already a tall order. But if it involves things as OpenSSL, then it quickly goes from bad to terrible. I’m really not amused anymore how this “Just Works” system often turns out to hardly work at all.

Since I don’t want to end on such a negative note, I feel compelled to state the obvious fact: every technology choice is a trade-off. In case of this project, however, the drawbacks were heavily outweighed by the benefits.

For this reason, I can definitely recommend the software stack I’ve just described to anyone developing non-trivial, cross-platform command line tools.

This is not an isolated complaint, by the way, as the interpreter startup time has recently emerged as an important issue to many developers of the Python language. ↩
Which may also include a practical lack thereof. ↩
It does handle it now, fortunately. ↩
Observant readers may notice that we’re exposing a technically private type (internal::Sprunge) through a publicly visible type alias. If that type was actually private, this would trigger a compiler warning which is slated to become a hard error at some point in the future. But, amusingly, we can fool the compiler by making it a public type inside a private module, which is exactly what we’re doing here. ↩
This has since been rewritten and is now done in build.rs — but that’s only because I implemented the relevant Cargo feature myself :) ↩
For an alternative approach that doesn’t seem to have this problem, check the structopt crate. ↩
Dynamically typed languages, due to their rich runtime, are basically a class of their own when it comes to testing ease, so it wouldn’t really be fair to hold them up for comparison. ↩

Small Rust crates I (almost) always use

Posted on Tue 31 October 2017 in Code • Tagged with Rust, libraries • Leave a comment

Alternative clickbait title: My Little Crates: Rust is Magic :-)

Due to its relatively scant standard library, programming in Rust inevitably involves pulling in a good number of third-party dependencies.

Some of them deal with problems that are solved with built-ins in languages that take a more “batteries included” approach. A good example would be the Python’s re module, whose moral equivalent in the Rust ecosystem is the regex crate.

Things like regular expressions, however, represent comparatively large problems. It isn’t very surprising that dedicated libraries exist to address them. It is less common for a language to offer small packages that target very specialized applications.

As in, one function/type/macro-kind of specialized, or perhaps only a little larger than that.

In this post, we’ll take a whirlwind tour through a bunch of such essential “micropackages”.

either

Rust has the built-in Result type, which is a sum¹ of an Ok outcome or an Error. It forms the basis of a general error handling mechanism in the language.

Structurally, however, Result<T, E> is just an alternative between the types T and E. You may want to use such an enum for other purposes than representing results of fallible operations. Unfortunately, because of the strong inherent meaning of Result, such usage would be unidiomatic and highly confusing.

This is why the either crate exists. It contains the following Either type:

enum Either<L, R> {
    Left(L),
    Right(R),
}

While it is isomorphic to Result, it carries no connotation to the entrenched error handling practices². Additionally, it offers symmetric combinator methods such as map_left or right_and_then for chaining computations involving the Either values.

lazy_static

As a design choice, Rust doesn’t allow for safe access to global mutable variables. The semi-standard way of introducing those into your code is therefore the lazy_static crate.

However, the most important usage for it is to declare lazy initialized constants of more complex types:

lazy_static! {
    static ref TICK_INTERVAL: Duration = Duration::from_secs(7 * 24 * 60 * 60);
}

The trick isn’t entirely transparent³, but it’s the best you can do until we get a proper support for compile-time expressions in the language.

maplit

To go nicely with the crate above — and to act as a natural syntactic follow-up to the standard vec![] macro — we’ve got the maplit crate.

What it does is add HashMap and HashSet “literals” by defining some very simple hashmap! and hashset! macros:

lazy_static! {
    static ref IMAGE_EXTENSIONS: HashMap<&'static str, ImageFormat> = hashmap!{
        "gif" => ImageFormat::GIF,
        "jpeg" => ImageFormat::JPEG,
        "jpg" => ImageFormat::JPG,
        "png" => ImageFormat::PNG,
    };
}

Internally, hashmap! expands to the appropriate amount of HashMap::insert calls, returning the finished hash map with all the keys and values given.

try_opt

Before the ? operator was introduced to Rust, the idiomatic way of propagating erroneous Results was the try! macro.

A similar macro can also be implemented for Option types so that it propagates the Nones upstream. The try_opt crate is doing precisely that, and the macro can be used in a straightforward manner:

fn parse_ipv4(s: &str) -> Option<(u8, u8, u8, u8)> {
    lazy_static! {
        static ref RE: Regex = Regex::new(
            r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$"
        ).unwrap();
    }
    let caps = try_opt!(RE.captures(s));
    let a = try_opt!(caps.get(1)).as_str();
    let b = try_opt!(caps.get(2)).as_str();
    let c = try_opt!(caps.get(3)).as_str();
    let d = try_opt!(caps.get(4)).as_str();
    Some((
        try_opt!(a.parse().ok()),
        try_opt!(b.parse().ok()),
        try_opt!(c.parse().ok()),
        try_opt!(d.parse().ok()),
    ))
}

Until Rust supports ? for Options (which is planned), this try_opt! macro can serve as an acceptable workaround.

exitcode

It is a common convention in basically every mainstream OS that a process has finished with an error if it exits with a code different than 0 (zero), Linux divides the space of error codes further, and — along with BSD — it also includes the sysexits.h header with some more specialized codes.

These have been adopted by great many programs and languages. In Rust, those semi-standard names for common errors can be used, too. All you need to do is add the exitcode crate to your project:

fn main() {
    let options = args::parse().unwrap_or_else(|e| {
        print_args_error(e).unwrap();
        std::process::exit(exitcode::USAGE);
    });

In addition to constants like USAGE or TEMPFAIL, the exitcode crate also defines an ExitCode alias for the integer type holding the exit codes. You can use it, among other things, as a return type of your top-level functions:

    let code = do_stuff(options);
    std::process::exit(code);
}

fn do_stuff(options: Options) -> exitcode::ExitCode {
    // ...
}

enum-set

In Java, there is a specialization of the general Set interface that works for enum types: the EnumSet class. Its members are represented very compactly as bits rather than hashed elements.

A similar (albeit slightly less powerful⁴) structure has been implemented in the enum-set crate. Given a #[repr(u32)] enum type:

#[repr(u32)]
#[derive(Clone, Copy, Debug Eq, Hash, PartialEq)]
enum Weekday {
    Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday,
}

you can create an EnumSet of its variants:

let mut weekend: EnumSet<Weekday> = EnumSet::new();
weekend.insert(Weekday::Saturday);
weekend.insert(Weekday::Sunday);

as long as you provide a simple trait impl that specifies how to convert those enum values to and from u32:

impl enum_set::CLike for Weekday {
    fn to_u32(&self) -> u32            { *self as u32 }
    unsafe fn from_u32(v: u32) -> Self { std::mem::transmute(v) }
}

The advantage is having a set structure represented by a single, unsigned 32-bit integer, leading to O(1) complexity of all common set operations. This includes membership checks, the union of two sets, their intersection, difference, and so on.

antidote

As part of fulfilling the promise of Fearless Concurrency™, Rust offers multiple synchronization primitives that are all defined in the std::sync module. One thing that Mutex, RwLock, and similar mechanisms there have in common is that their locks can become “poisoned” if a thread panicks while holding them. As a result, acquiring a lock requires handling the potential PoisonError.

For many programs, however, lock poisoning is not even a remote, but a straight-up impossible situation. If you follow the best practices of concurrent resource sharing, you won’t be holding locks for more than a few instructions, devoid of unwraps or any other opportunity to panic!(). Unfortunately, you cannot prove this to the Rust compiler statically, so it will still require you to handle a PoisonError that cannot happen.

This is where the aptly named antidote crate crate offers help. In it, you can find all the same locks & guards API that is offered by std::sync, just without the PoisonError. In many cases, this removal has radically simplified the interface, for example by turning Result<Guard, Error> return types into just Guard.

The caveat, of course, is that you need to ensure all threads holding these “immunized” locks either:

don’t panic at all; or
don’t leave guarded resources in an inconsistent state if they do panic

Like it’s been mentioned earlier, the best way to make that happen is to keep lock-guarded critical sections minimal and infallible.

matches

Pattern matching is one of the most important features of Rust, but some of the relevant language constructs have awkward shortcomings. The if let conditional, for example, cannot be combined with boolean tests:

if let Foo(_) = x && y.is_good() {

and thus requires additional nesting, or a different approach altogether.

Thankfully, to help with situations like this, there is the matches crate with a bunch of convenient macros. Besides its namesake, matches!:

if matches!(x, Foo(_)) && y.is_good() {

it also exposes assertion macros (assert_match! and debug_assert_match!) that can be used in both production and test code.

This concludes the overview of small Rust crates, at least for now.

To be certain, these crates are by far not the only ones that are small in size and simultaneously almost indispensable. Many more great libraries can be found e.g. in the Awesome Rust registry, though obviously you could argue if all of them are truly “micro” ;-)

If you know more crates in the similar vein, make sure to mention them in the comments!

A sum type consists of several alternatives, out of which only one has been picked for a particular instance. The other common name for it is a tagged union. ↩
Unless you come from Haskell, that is, where Either is the equivalent of Rust’s Result :) ↩
You will occasionally need an explicit * to trigger the Deref coercion it uses. ↩
It only supports unitary enums of up to 32 variants. ↩

Extension traits in Rust

Posted on Tue 20 June 2017 in Code • Tagged with Rust, C#, methods, extension methods, traits • Leave a comment

In a few object-oriented languages, it is possible to add methods to a class after it’s already been defined.

This feature arises quite naturally if the language has a dynamic type system that’s modifiable at runtime. In those cases, even replacing existing methods is perfectly possible¹.

In addition to that, some statically typed languages — most notably in C# — offer extension methods as a dedicated feature of their type systems. The premise is that you would write standalone functions whose first argument is specially designated (usually by this keyword) as a receiver of the resulting method call:

public static int WordCount(this String str) {
    return str.Split(new char[] { ' ', '.', '?' },
                     StringSplitOptions.RemoveEmptyEntries).Length;
}

At the call site, the new method is indistinguishable from any of the existing ones:

string s = "Alice has a cat.";
int n = s.WordCount();

That’s assuming you have imported both the original class (or it’s a built-in like String), as well as the module in which the extension method is defined.

Rewrite it in Rust

The curious thing about Rust‘s type system is that it permits extension methods solely as a side effect of its core building block: traits.

In this post, I’m going to describe a certain design pattern in Rust which involves third-party types and user-defined traits. Several popular crates — like itertools or unicode-normalization — utilize it very successfully to add new, useful methods to the language standard types.

I’m not sure if this pattern has an official or widely accepted name. Personally, I’ve taken to calling it extension traits.

Let’s have a look at how they are commonly implemented.

Ingredients

We can use the extension trait pattern if we want to have additional methods in a type that we don’t otherwise control (or don’t want to modify).

Common cases include:

Rust standard library types, like Result, String, or anything else inside the std namespace
types imported from third-party libraries
types from the current crate if additional methods only make sense in certain scenarios (e.g. conditional compilation / testing)²

The crux of this technique is really simple. Like with most design patterns, however, it involves a certain degree of boilerplate and duplication.

So without further ado… In order to “patch” some new method(s) into an external type you will need to:

Define a trait with signatures of all the methods you want to add.
Implement it for the external type.
There is no step three.

As an important note on the usage side, the calling code needs to import your new trait in addition to the external type. Once that’s done, it can proceed to use the new methods is if they were there to begin with.

I’m sure you are keen on seeing some examples!

Broadening your `Option`s

We’re going to add two new methods to Rust’s standard Option type. The goal is to make it more convenient to operate on mutable Options by allowing to easily replace an existing value with another one³.

Here’s the appropriate extension trait⁴:

/// Additional mutation methods for `Option`.
pub trait OptionMutExt<T> {
    /// Replace the existing `Some` value with a new one.
    ///
    /// Returns the previous value if it was present, or `None` if no replacement was made.
    fn replace(&mut self, val: T) -> Option<T>;

    /// Replace the existing `Some` value with the result of given closure.
    ///
    /// Returns the previous value if it was present, or `None` if no replacement was made.
    fn replace_with<F: FnOnce() -> T>(&mut self, f: F) -> Option<T>;
}

It may feel at little bit weird to implement it.
You will basically have to pretend you are inside the Option type itself:

impl<T> OptionMutExt<T> for Option<T> {
    fn replace(&mut self, val: T) -> Option<T> {
        self.replace_with(move || val)
    }

    fn replace_with<F: FnOnce() -> T>(&mut self, f: F) -> Option<T> {
        if self.is_some() {
            let result = self.take();
            *self = Some(f());
            result
        } else {
            None
        }
    }
}

Unfortunately, this is just an illusion. Extension traits grant no special powers that’d allow you to bypass any of the regular visibility rules. All you can use inside the new methods is still just the public interface of the type you’re augmenting (here, Option).

In our case, however, this is good enough, mostly thanks to the recently introduced Option::take.

To use our shiny new methods in other places, all we have to do is import the extension trait:

use ext::rust::OptionMutExt;  // assuming you put it in ext/rust.rs

// ...somewhere...
let mut opt: Option<u32> = ...;
match opt.replace(42) {
    Some(x) => debug!("Option had a value of {} before replacement", x),
    None => assert_eq!(None, opt),
}

It doesn’t matter where it was defined either, meaning we can ship it away to crates.io and let it accrue as many happy users as Itertools has ;-)

Are you `hyper::Body` ready?

Our second example will demonstrate attaching more methods to a third-party type.

Last week, there was a new release of Hyper, a popular Rust framework for HTTP servers & clients. It was notable because it marked a switch from synchronous, straightforward API to a more complex, asynchronous one (which I incidentally wrote about a few weeks ago).

Predictably, there has been some confusion among its new and existing users.

We’re going to help by pinning a more convenient interface on hyper’s Body type. Body here is a struct representing the content of an HTTP request or response. After the ‘asyncatastrophe’, it doesn’t allow to access the raw incoming bytes as easily as it did before.

Thanks to extension traits, we can fix this rather quickly:

use std::error::Error;

use futures::{BoxFuture, future, Future, Stream};
use hyper::{self, Body};

pub trait BodyExt {
    /// Collect all the bytes from all the `Chunk`s from `Body`
    /// and return it as `Vec<u8>`.
    fn into_bytes(self) -> BoxFuture<Vec<u8>, hyper::Error>;

    /// Collect all the bytes from all the `Chunk`s from `Body`,
    /// decode them as UTF8, and return the resulting `String`.
    fn into_string(self) -> BoxFuture<String, Box<Error + Send>>;
}

impl BodyExt for Body {
    fn into_bytes(self) -> BoxFuture<Vec<u8>, hyper::Error> {
        self.concat()
            .and_then(|bytes| future::ok::<_, hyper::Error>(bytes.to_vec()))
            .boxed()
    }

    fn into_string(self) -> BoxFuture<String, Box<Error + Send>> {
        self.into_bytes()
            .map_err(|e| Box::new(e) as Box<Error + Send>)
            .and_then(|bytes| String::from_utf8(bytes)
                .map_err(|e| Box::new(e) as Box<Error + Send>))
            .boxed()
    }
}

With these new methods in hand, it is relatively straightforward to implement, say, a simple character-counting service:

use std::error::Error;

use futures::{BoxFuture, future, Future};
use hyper::server::{Service, Request, Response};

use ext::hyper::BodyExt;  // assuming the above is in ext/hyper.rs

pub struct Length;
impl Service for Length {
    type Request = Request;
    type Response = Response;
    type Error = Box<Error + Send>;
    type Future = BoxFuture<Self::Response, Self::Error>;

    fn call(&self, request: Request) -> Self::Future {
        let (_, _, _, _, body) = request.deconstruct();
        body.into_string().and_then(|s| future::ok(
            Response::new().with_body(s.len().to_string())
        )).boxed()
    }
}

Replacing Box<Error + Send> with an idiomatic error enum is left as an exercise for the reader :)

Extra credit bonus explanation

Reading this section is not necessary to use extension traits.

So far, we have seen what extension traits are capable of. It is only right to mention what they cannot do.

Indeed, this technique has some limitations. They are a conscious choice on the part of Rust authors, and they were decided upon in an effort to keep the type system coherent.

Coherence isn’t an everyday topic in Rust, but it becomes important when working with traits and types that cross package boundaries. Rules of trait coherence (described briefly towards the end of this section of the Rust book) state that the following combinations of “local” (this crate) and “external” (other crates⁵) are legal:

implement a local trait for a local type.
This is common in larger programs that use polymorphic abstractions.
implement an external trait for a local type.
We do this often to integrate with third-party libraries and frameworks, just like with hyper above.
implement a local trait for an external type.
That’s extension traits for you!

What is not possible, however, is to:

implement an external trait for an external type

This case is prohibited in order to make the choice of trait implementations more predictable, both for the compiler and for the programmer. Without this rule in place, you could introduce many instances of impl Trait for Type (same Trait and same Type), each one with different functionality, leaving the compiler to “guess” the right impl for any given situation⁶.

The decision was thus made to disallow the impl ExternalTrait for ExternalType case altogether. If you like, you can read some more extensive backstory behind it.

Bear in mind, however, that this isn’t the unequivocally “correct” solution. Some languages choose to allow this so-called orphan case, and try to resolve the potential ambiguities in various different ways. It is a genuinely useful feature, too, as it makes easier it to glue together two unrelated libraries⁷.

Thankfully for extension traits, the coherence restriction doesn’t apply as long as you keep those traits and their impls in the same crate.

This practice is often referred to as monkeypatching, especially in Python and Ruby. ↩
In this case, a more common solution is to just open another impl Foo block, annotated with #[cfg(test)] or similar. An extension trait, however, makes it easier to extract Foo into a separate crate along with some handy, test-only API. ↩
Note that this is not the same as the unstable (as of 1.18) Option methods guarded behind the options_entry feature gate. ↩
My own convention is to call those traits FooExt if they are meant to enhance the interface of type Foo. The other practice is to mirror the name of the crate that the trait is packaged in; both Itertools and UnicodeNormalization are examples of this style. ↩
Standard library (std or core namespaces) counts as external crate for this purpose. ↩
Or throw an error. However, trait impls are always imported implicitly, so this could essentially prevent some combination of different modules/libraries in the ecosystem from being used together, and generally create an unfathomable mess. ↩
The usual workaround for coherence/orphan rules in Rust involves creating a wrapper around the external type in order to make it “local”, and therefore allow external trait impls for it. This is called the newtype pattern and there are some crates to support it. ↩

Rust as a gateway drug to Haskell

Posted on Tue 13 June 2017 in Programming • Tagged with Rust, Haskell, traits, typeclasses, monads, ADTs, FP • Leave a comment

For work-related reasons, I had to recently get up to speed on programming in Haskell.

Before that, I had very little actual experience with the language, clocking probably at less than a thousand lines of working code over a couple of years. Nothing impressive either: some wrapper script here, some experimental rewrite there…

These days, I heard, there are a few resources for learning Haskell¹ that don’t require having a PhD in category theory². They may be quite helpful when your exposure to the functional programming is limited. In my case, however, the one thing that really enabled me to become (somewhat) productive was not even related to Haskell at all.

It was Rust.

In theory, this shouldn’t really make much of a sense. If you compare both languages by putting checkmarks in a feature chart, you won’t find them to have much in common.

Some of the obvious differences include:

predominantly functional vs. mostly imperative
garbage collection vs. explicit memory management
lazy vs. eager evaluation
rich runtime³ vs. almost no runtime
global vs. localized type inference
indentation vs. braces
two decades (!) vs. barely two years since release

Setting aside syntax, most of those differences are pretty significant.

You probably wouldn’t use Haskell for embedded programming, for instance, both for performance (GC) and memory usage reasons (laziness). Similarly, Rust’s ownership system can be too much of a hassle for high level code that isn’t subject to real time requirements.

But if you look a little deeper, beyond just the surface descriptions of both languages, you can find plenty of concepts they share.

Traits: they are typeclasses, essentially

Take Haskell’s typeclasses, for example — the cornerstone of its rich and expressive type system.

A typeclass is, simply speaking, a list of capabilities: it defines what a type can do. There exist analogs of typeclasses in most programming languages, but they are normally called interfaces or protocols, and remain closely tied to the object-oriented paradigm.

Not so in Haskell.

Or in Rust for that matter, where the equivalent concept exists under the name of traits. What typeclasses and traits have in common is that they’re used for all kinds of polymorphism in their respective languages.

Generics

For example, let’s consider parametrized types, sometimes also referred to as templates (C++) or generics (C#).

In many cases, a generic function or type requires its type arguments to exhibit certain characteristics. In some languages (like the legacy C++), this is checked only implicitly: as long as the template type-checks after its expansion, everything is okay:

template <typename T> T min(T a, T b) {
    return a > b ? b : a;
}

struct Foo {};

int main() {
    min(1, 2);  // OK
    min(Foo(), Foo());  // ERROR, no operator `>`
}

More advanced type systems, however, allow to specify the generic constraints explicitly. This is the case in Rust:

fn min<T: Ord>(a: T, b: T) -> T {
    if a > b { b } else { a }
}

as well as in Haskell:

min :: (Ord a) => a -> a -> a
min a b = if a > b then b else a

In both languages, the notion of a type supporting certain operations (like comparison/ordering) is represented as its own, first-class concept: a trait (Rust) or a typeclass (Haskell). Since the compiler is aware of those constraints, it can verify that the min function is used correctly even before it tries to generate code for a specific substitution of T.

Dynamic dispatch

On the other hand, let’s look at runtime polymorphism: the one that OO languages implement through abstract base classes and virtual methods. It’s the tool of choice if you need a container of objects of different types, which nevertheless all expose the same interface.

To offer it, Rust has trait objects, and they work pretty much exactly like base class pointers/references from Java, C++, or C#.

// Trait definition
trait Draw {
    fn draw(&self);
}

// Data type implementing the trait
struct Circle { radius: i32 }
impl Draw for Circle {
    fn draw(&self) { /* omitted */ }
}

// Usage
fn draw_all(objects: &Vec<Box<Draw>>) {
    for &obj in objects {
        obj.draw();
    }
}

The Haskell analogue is, in turn, based on typeclasses, though the specifics can be a little bit trickier:

{-# LANGUAGE ExistentialQuantification #-}

-- Typeclass definition
class Draw a where
    draw :: a -> IO ()

-- Polymorphic wrapper type
data Draw' = forall a. Draw a => Draw' a
instance Draw Draw' where
    draw (Draw' d) = draw d

-- Data types instantiating ("implementing") the typeclass
data Circle = Circle ()
instance Draw Circle where draw = undefined -- omitted
data Square = Square ()
instance Draw Square where draw = undefined -- omitted

-- Usage
drawAll :: (Draw a) => [a] -> IO ()
drawAll ds = mapM_ draw ds

main = do
    let shapes = [Draw' Circle (), Draw' Square ()]
    drawAll shapes

Here, the generic function can use typeclass constraints directly ((Draw a) => ...), but creating a container of different object types requires a polymorphic wrapper⁴.

Differences

All those similarities do not mean that Rust traits and Haskell typeclasses are one and the same. There are, in fact, quite a few differences, owing mostly to the fact that Haskell’s type system is more expressive:

Rust lacks higher kinded types, making certain abstractions impossible to encode as traits. It is possible, however, to implement a trait for infinitely many types at once if the implementation itself is generic (like here).
When defining a trait in Rust, you can ask implementors to provide some auxiliary, associated types in addition to just methods⁵. A similar mechanism in Haskell is expanded into type families, and requires enabling a GHC extension.
While typeclasses in Haskell can be implemented for multiple types simultaneously via a GHC extension, Rust’s take on this feature is to make traits themselves generic (e.g. trait Foo<T>). The end result is roughly similar; however, the “main implementing type” (one after for in impl ... for ...) is still a method receiver (self), just like in OO languages.
Rust enforces coherence rules on trait implementations. The topic is actually rather complicated, but the gist is about local (current package) vs. remote (other packages / standard library) traits and types.
Without too much detail, coherence demands that there be a local type or trait somewhere in the impl ... for ... construct. Haskell doesn’t have this limitation, although it is recommended not to take advantage of this.

The M-word

Another area of overlap between Haskell and Rust exists in the data model utilized by those languages. Both are taking heavy advantage of algebraic data types (ADT), including the ability to define both product types (“regular” structs and records) as well as sum types (tagged unions).

`Maybe` you’d like `Some(T)`?

Even more interestingly, code in both languages makes extensive use of the two most basic ADTs:

Option (Rust) or Maybe (Haskell) — for denoting a presence or absence of a value
Result (Rust) or Either (Haskell) — for representing the alternative of “correct” and “erroneous” value

These aren’t just simple datatypes. They are deeply interwoven into the basic semantics of both languages, not to mention their standard libraries and community-provided packages.

The Option/Maybe type, for example, is the alternative to nullable references: something that’s been heavily criticized for making programs prone to unexpected NullReferenceExceptions. The idea behind both of those types is to make actual values impossible to confuse with nulls by encoding the potential nullability into the type system:

enum Option<T> { Some(T), None }

data Maybe a = Just a | Nothing

Result and Either, on the other hand, can be thought as an extension of this idea. They also represent two possibilities, but the “wrong” one isn’t just None or Nothing — it has some more information associated with it:

enum Result<T, E> { Ok(T), Err(E) }

data Either e a = Left e | Right a

This dichotomy between the Ok (or Right) value and the Error value (or the Left one) makes it a great vehicle for carrying results of functions that can fail.

In Rust, this replaces the traditional error handling mechanisms based on exceptions. In Haskell, the exceptions are present and sometimes necessary, but Either is nevertheless the preferred approach to dealing with errors.

What to `do`?

One thing that Haskell does better is composing those fallible functions into bigger chunks of logic.

Relatively recently, Rust has added the ? operator as a replacement for the try! macro. This is now the preferred way of error propagation, allowing for a more concise composition of functions that return Results:

/// Read an integer from given file.
fn int_from_file(path: &Path) -> io::Result<i32> {
    let mut file = fs::File::open(path)?;
    let mut s = String::new();
    file.read_to_string(&mut s)?;
    let result = s.parse().map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(result)
}

But Haskell had it for much longer, and it’s something of a hallmark of the language and functional programming in general — even though it looks thoroughly imperative:

intFromFile :: FilePath -> IO Int
intFromFile path = do
    s <- readFile path
    i <- readIO s
    return i

If you haven’t seen it before, this is of course a monad — the IO monad, to be precise. While discussing monads in detail is way outside of the scope of this article, we can definitely notice some analogies with Rust. The do notation with <- arrows is evidently similar to how in Rust you’d assign the result of a fallible operation after “unpacking” it with ?.

But of course, there’s plenty of different monads in Haskell: not just IO, but also Either, Maybe, Reader, Writer, Cont, STM, and many others. In Rust (at least as of 1.19), the ? operator only works for Result types, although there is some talk about extending it to Option as well⁶.

Eventually, we may see the language adopt some variant of the do notation, though the motivation for this will most likely come from asynchronous programming with Futures rather than plain Results. General monads, however, require support for higher kinded types which isn’t coming anytime soon.

A path through Rust?

Now that we’ve discussed those similarities, the obvious question arises.

Is learning Rust worthwhile if your ultimate goal is getting proficient at functional programming in general, or Haskell in particular?

My answer to that is actually pretty straightforward.

If “getting to FP” is your main goal, then Rust will not help you very much. Functional paradigm isn’t the main idea behind the language — its shtick is mostly memory safety, and zero-cost abstractions. While it succeeds somewhat at being “Haskell Lite”, it really strives to be safer C++⁷.

But if, on the other hand, you regard FP mostly as a curiosity that seems to be seeping into your favorite imperative language at an increasing rate, Rust can be a good way to gain familiarity with this peculiar beast.

At the very least, you will learn the functional way of modeling programs, with lots of smart enums/unions and structs but without inheritance.

And the best part is: you will be so busy fighting the borrow checker you won’t even notice when it happens ;-)

Just ask in #haskell-beginners on Freenode if you’re interested. ↩
Though ironically, I found the CT lectures by Bartosz Milewski very helpful in developing the right intuitions, even though they’re very abstract. ↩
For example, Haskell has green threads (created with forkIO) which are somewhat similar to goroutines from Go. To get anything remotely similar in Rust, you need to use external libraries. ↩
Note that such containers aren’t very idiomatic Haskell. A more typical solution would be to just curry the draw function, implicitly putting the Draw object inside its closure. ↩
This mechanisms expands to associated constants in Rust 1.20. ↩
Those two types also have a form of monadic bind (>>= in Haskell) exposed as the and_then method. ↩
If you want another language for easing into the concept of functional programming, I’ve heard that Scala fills that niche quite well. ↩

Older Posts