Karol Kuczmarski's Blog

In Microsoft we trust

Posted on Fri 08 April 2016 in Thoughts • Tagged with Microsoft, Windows, GitHub, Apple, Facebook, Google, tech culture

Just like many other people, I was following news from the last week’s BUILD conference with piqued interest. The ability to run Linux userland programs on Windows — including, of course, bash — is something to be excited about. If nothing else, it should dramatically improve Windows support of new programming languages that seem to pop up all the time.

There was something else, however, that I couldn’t help but notice. The reactions of tech communities to this and similar developments focused very frequently on Microsoft itself.

The beloved “new Microsoft”, as some call it, embraces open source, supports Linux, and generally does almost a full 180 with their stance on proprietary vs. free software. The circumstances fit this narrative rather snugly, too: a new CEO makes a clean break with the past to pivot the company in this new world ruled by mobile and cloud.

Still… love? Even considering how Internet comments are grotesquely exaggerated most of the time, that’s quite a declaration. The sentiment is nowhere near isolated either. But it’s not a question whether this infatuation has a rational merit, or whether those feelings will eventually turn out to be misplaced.

The question is: why does it exist at all? We are talking about a company here, a for-profit organization. How can such language even enter the picture?

Then I realized this is not really a new phenomenon. Quite the opposite: the broad developer community seems to always need a company to champion its core values. Nowadays, we’re simply trying to find someone new to carry the standard.

Why? Because we feel that our old heroes have forsaken us.

Hall of past fame

Take GitHub for example. Once a darling of the open source community, it’s been suffering sharp criticism for many months now. Rightfully or not, many people aren’t exactly excited, to put it mildly, about changes to the policy and atmosphere at GitHub, typified by the ill-fated meritocracy rug. The widely backed and long-standing plea for a few critical features has only recently stopped falling on deaf ears. And in the background, there are always concerns about GitHub simply becoming too big, and exerting too much control over the open source ecosystem.

Nota bene, the very same ecosystem it had once been lauded for nurturing.

Among the other flagship tech companies, neither Apple nor Facebook has ever garnered much goodwill. Sure, they are recognized for the pure utilitarian value of the hardware they produce; or the convenience, broad applicability, and stability of the APIs they offer. Facebook may be scoring some additional points for trying to sort out the mess of frontend development, but many say it’s not doing anybody any favors. Neither company is easy to portray as a paragon of openness, though, and it’s probably easier to argue the exact opposite.

And then there is Google, of course¹. Some time in the past few years, a palpable shift occurred in how the company is perceived by the techie crowd. It’s difficult to pinpoint the exact pivotal moment, and the one event that leaps to attention doesn’t seem sufficient to explain it. But the tone has been set, allowing news to be molded to fit it. Add this to the usual backdrop of complaints about the onerous interview process and general fearmongering, and the picture doesn’t look very bright.

Shades of grey

No similar woes seem to have been plaguing Microsoft as of late, even though their record of recent “unwholesome” deeds isn’t exactly clean either. Does it mean they are indeed the most fitting candidate for a (new) enterprise ally of the hacker community?…

Or maybe we can finally recognize the whole notion as the utterly silly concept it actually is. Hard to shake though it be, this quasi-Manicheic mentality of assigning labels of virtue or sin is, at best, a naive idealism. At worst, it’s a peculiar kind of harmful partisanship that technologists are particularly susceptible to. You may recognize it as something that has a very long tradition in the hacker community.

But it doesn’t mean it’s a tradition worth keeping.

I have no illusions I will be attributed any objectivity, but it bears mentioning again that nothing I say here is representative of anything but my own opinions. ↩

Pipe `xargs` into `find`

Posted on Sun 03 April 2016 in Code • Tagged with shell scripting, Bash, xargs, find, zsh

Here’s a trick that’s hardly new, but if you haven’t heard about it, it will save you a trip to a man page or two.

Assuming you’re a person who mostly prefers the terminal over some fancy GUI, you’ve probably used the find command along with xargs at least a few times. It’s very common, for example, to use the results of find as arguments to some other program. It could be something as simple as figuring out which modules in your project have grown slightly too large:

$ find . -name '*.py' | xargs wc -l | sort -hr
1467 total
 322 callee/base.py
 261 callee/general.py
 251 callee/collections.py
# etc.

We find them all first, and then use xargs to build a long wc invocation, and we finally display results in the reverse order. Pretty easy stuff: I don’t usually have to try more than a dozen times to get it right!¹

But how about the opposite situation? Let’s say you have a list of directories you want to search through with find. Doing so may seem easy enough²:

$ cat packagedirs.txt | xargs find -name '__init__.py'

Except it’s not going to work. Like a few other Unix commands, find is very particular about the order of arguments it receives. Not only are the predicate flags (like -name) considered in sequence, but they also have to appear after the directories we want to search through.

But in the xargs invocation above, essentially the opposite is going to happen.

The replacement flag

So how to remedy this? Enter the -I flag to xargs:

$ cat packagedirs.txt | xargs -I{} find {} -name '__init__.py'

This flag will tell xargs quite a few things.

The most important one is to stop putting the arguments at the end of the command invocation. Instead, it shall place them wherever it sees the replacement string (here, a pair of braces³: {}). And because we placed the braces where find is normally expecting the list of directories to search through, the command will now get us exactly the results we wanted.

What’s almost impossible to see, however, is that it may not obtain those results in exactly the way we intended. The difference is easier to spot when we replace find with echo:

$ cat >/tmp/list
foo
bar
$ cat /tmp/list | xargs echo
foo bar
$ cat /tmp/list | xargs -I{} echo {}
foo
bar

or, better yet, use xargs with the -t flag to print the commands on stderr before executing them:

$ cat packagedirs.txt | xargs -I{} -t find {} -name '__init__.py' >/dev/null
find callee -name '__init__.py'
find tests -name '__init__.py'

As you can see, we actually have more than one find invocation here!

This is the second effect of -I: it causes xargs to execute the given command line for each argument separately. It so happens that it doesn’t really make any difference for our usage of find, which is why it wasn’t at all obvious we were running it multiple times.

To avoid problems, though, you should definitely be cognizant of this fact when calling other programs with xargs -I.

Make arguments spaced again

Incidentally, I’m not aware of any method that’d actually make xargs produce find foo bar -name ... calls. If you need this exact form, probably the easiest way is to use plain old shell variables:

$ (d=$(cat packagedirs.txt); find $d -name '*.py')

This takes advantage of the word splitting feature of Bash and a few other compatible shells. The caveat is that you may be using a shell where this behavior is disabled by default. The result would be find interpreting the content of $d as a single directory name: foo bar rather than foo and bar.

zsh is one such shell. Although probably a good thing overall, in times like these you’d want to bring the “normal” behavior back. In zsh, it’s fortunately pretty simple:

$ (d=$(cat packagedirs.txt); find ${=d} -name '*.py')

What about a portable solution? As far as I can tell, the only certain way you can ensure word splitting occurs is to use eval. Here, the xargs command can actually come in handy again, albeit only as a prop:

$ (d=$(cat packagedirs.txt | xargs echo); eval "find $d -name '*.py'")

One would hope such hacks aren’t needed very often.
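If a scripting language is at hand, another way to get the single find foo bar -name ... form is to build the argument list explicitly and sidestep shell word splitting altogether. Below is a minimal sketch in Python rather than a shell one-liner. The helper names (build_find_argv, read_dirs, run_find) are made up for this illustration, and it assumes packagedirs.txt holds one directory per line:

```python
import subprocess


def build_find_argv(dirs, pattern):
    """Build a single `find DIR... -name PATTERN` argument vector."""
    # Directories have to come before the predicates, as find insists.
    return ["find"] + list(dirs) + ["-name", pattern]


def read_dirs(path):
    """Read one directory per line, ignoring blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def run_find(list_path, pattern):
    """Run one find over all listed directories; return its output lines."""
    argv = build_find_argv(read_dirs(list_path), pattern)
    # Passing a list (and no shell=True) means no word splitting happens,
    # so directory names containing spaces stay in one piece.
    result = subprocess.run(argv, stdout=subprocess.PIPE, check=True,
                            text=True)
    return result.stdout.splitlines()
```

Calling run_find('packagedirs.txt', '*.py') would then execute find exactly once, with all the directories placed before -name.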

A completely kosher version would also use the -print0 flag to find and the -0 flag to xargs. It’s not necessary here because Python module files cannot contain spaces. ↩
Purists shall excuse my use of cat here, it’s merely for illustrative purposes. ↩
This use of braces in find has of course nothing to do with the other possible occurrences of {} there, like in the -exec flag. Since you cannot force find to expect a different placeholder, you should use something else for xargs in those cases, e.g.: xargs -I^ find ^ -name '__main__.py' -exec python {} \;. ↩

Package managers’ appreciation day

Posted on Sat 26 March 2016 in Programming • Tagged with packages, package manager, node.js, npm, C++

By now you have probably heard about the infamous “npm-gate” that swept through the developer community over the last week. It has been brought up, discussed, covered, meta-discussed, satirized, and even featured by some mainstream media. Evidently the nerds have managed to stir up some serious trouble again, and it only took them 11 lines of that strange thing they call “code”.

No good things in small packages

When looking for a culprit, the one party that everyone pounced on immediately was of course npm itself. With its myriad of packages that could each fit in a tweet, it invites us to build the exact house of cards we’ve seen collapse.

This serves as a good wake-up call, of course. But it also tempts us to throw the baby out with the bathwater, and draw a conclusion that may be a little too far-fetched. Like perhaps declaring the entire idea of managing dependencies “the npm way” suspect. If packages tend to degenerate into something as ludicrous as isArray (to say nothing of left-pad, which started the whole debacle), then maybe this approach to software reusability has simply bankrupted itself?

A world without *pm

I’m responding to that right away with a resounding “No!”. Package management as a concept is not responsible for the poor decision making of one specific developer collective. And of anyone who might think tools like npm do more harm than good, I ask: have you recently written any C++?

See, C++ is the odd one among languages that at least pretend to be keeping up with the times. It doesn’t present a package management story at all. That’s right — the C++ “ecosystem”, as it stands now, has:

no package manager
no repository of packages
no unified way of managing dependencies
no way to isolate development environments of different projects from one another

Adding any kind of third-party dependency to a C++ project — especially a portable one, which is allegedly one of C++’s strengths — is a considerable pain, even when it doesn’t require any additional libraries by itself. And environment isolation? Some people are using Linux containers (!) for this, which is like dealing with a mosquito by shooting it with a howitzer.

Billions upon billions of lines
To build a C++ binary, you must first build the userspace.

But hey, at least they can use apt-get, right?…

So, string padding incidents aside, package managers are absolutely essential. Sure, we can and should discuss the merits of their particular flavors and implementation details, like whether it’s prudent to allow “delisting” of packages. As a whole, however, package managers deserve recognition as a crucial part of modern language tooling that we cannot really do without.

Introducing callee: matchers for unittest.mock

Posted on Sun 20 March 2016 in News • Tagged with Python, testing, mocking, argument matchers

Combined with dynamic typing, the lax object model of Python makes it pretty easy to write unit tests that involve mocking some parts of the tested code. Still, there are plenty of third-party libraries (usually with “mock” somewhere in the name) which promise to make it even shorter and simpler. There’s evidently been a need here that the standard library didn’t address adequately.

This changed, however, with Python 3.3. This version introduced a new unittest.mock module which essentially obsoleted all the other, third-party solutions. The module, available also as a backport for Python 2.6 and above, is flexible, easy to use, and offers most if not all of the requisite functionality.

Called it, maybe?

Well, with one notable exception. If we mock a function, or any other callable object:

# (production code)
class ProductionClass(object):
    def foo(self):
        self.florb(...)

# (test code)
from unittest import mock  # with the backport: import mock

something = ProductionClass()
something.florb = mock.Mock()

we have very limited capabilities when it comes to verifying how they were called. Sure, we can be extremely specific:

something.florb.assert_called_with('this particular string')

or very lenient:

something.florb.assert_called_with(mock.ANY)  # works for _any_ argument

but there is virtually nothing in between. Suppose we don’t care too much what exact string the method was called with, only that it was some string that’s long enough? Things get awkward very fast then:

for posargs, kwargs in something.florb.call_args_list:
    arg = posargs[0]
    self.assertIsInstance(arg, str)
    self.assertGreaterEqual(len(arg), 16)
    break
else:
    self.fail('required call to %r not found' % (something.florb,))

Yes, this is what’s required: an actual loop through all the mock’s calls¹; unpacking tuples of positional and keyword arguments; and manual assertions that won’t even emit useful error messages when they fail.

Eww! Surely Python can do better than that?

Meet `callee`

Why, of course it can! Today, I’m releasing callee: a library of argument matchers compatible with unittest.mock. It enables you to write much simpler, more readable, and delightfully declarative assertions:

from callee import String, LongerOrEqual
something.florb.assert_called_with(String() & LongerOrEqual(16))

The wide array of matchers offered by callee includes numeric, string, and collection ones. And if none suits your particular needs, it’s a trivial matter to implement a custom one.
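To give a rough idea of how matchers like these can plug into unittest.mock at all: assert_called_with compares the expected and actual arguments with ==, so any object with a suitable __eq__ method can stand in for an argument (this is the same mechanism mock.ANY relies on). The sketch below is a homegrown illustration of that idea, not callee's actual implementation or API; the LongString class is a made-up name.

```python
from unittest import mock


class LongString(object):
    """Hypothetical matcher: equals any str of at least `min_length` chars."""

    def __init__(self, min_length):
        self.min_length = min_length

    def __eq__(self, other):
        return isinstance(other, str) and len(other) >= self.min_length

    def __repr__(self):
        # Shown in the assertion message when the match fails.
        return '<LongString(%d)>' % self.min_length


florb = mock.Mock()
florb('a string that is long enough')

# Passes: the argument is a str with 16 or more characters.
florb.assert_called_with(LongString(16))
```

Had the mock been called with a shorter string (or a non-string), the same assertion would raise AssertionError, with the matcher's repr appearing in the message in place of the expected argument.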

Check the library’s documentation for more examples, or jump right in to the installation guide or a full matcher reference.

A loop is not necessary if we simulate assert_called_once_with rather than assert_called_with, but we still have to dig into mock.call objects to eke out the arguments. ↩

Tricks with ownership in Rust

Posted on Mon 07 March 2016 in Code • Tagged with Rust, borrow checker, reference counting, traits

…or how I learned to stop worrying and love the borrow checker.

Having no equivalents in other languages, the borrow checker is arguably the most difficult thing to come to terms with when learning Rust. It’s easy to understand why it’s immensely useful, especially if you recall the various vulnerabilities stemming from memory mismanagement. But that knowledge doesn’t exactly help when the compiler is whining about what seems like perfectly correct code.

Let’s face it: it will take some time to become productive writing efficient and safe code. It’s not entirely unlike adjusting to a different paradigm such as functional programming when you’ve been writing mostly imperative code. Before that happens, though, you can use some tricks to make the transition a little easier.

Just `clone` it

Ideally, we’d want our code to be both correct and fast. But if we cannot quite get to the “correctness” part yet — because our program doesn’t, you know, compile — then how about paying for it with a small (and refundable) performance hit?

This is where the clone method comes in handy. Many problems with the borrow checker stem from trying to spread object ownership too thin. It is a precious resource and it’s not very cheap to “produce”, which is why good Rust code often deals with just immutable or mutable references.

But if that proves difficult, then “making more objects” is a great intermediate solution. Incidentally, this is what higher level languages are doing all the time, and often transparently. To ease the transition to Rust from those languages, we can start off by replicating their behavior.

As an example, consider a function that tries to convert some value to String:

struct Error;

fn maybe_to_string<T>(v: T) -> Result<String, Error> {
    // omitted
}

If we attempt to build upon it and create a Vector version:

fn maybe_all_to_string<T>(v: Vec<T>) -> Result<Vec<String>, Error> {
    let results: Vec<_> = v.iter().map(maybe_to_string).collect();
    if let Some(res) = results.iter().find(|r| r.is_err()) {
        return Err(Error);
    }
    Ok(results.iter().map(|r| r.ok().unwrap()).collect())
}

then we’ll be unpleasantly surprised by a borrow checker error:

error: cannot move out of borrowed content [E0507]
    Ok(results.iter().map(|r| r.ok().unwrap()).collect())
                              ^

Much head scratching will ensue, and we may eventually find an idiomatic and efficient solution. However, a simple stepping stone in the shape of an additional clone() call can help move things forward just a little quicker:

#[derive(Clone)]
struct Error;

// ...
Ok(results.iter().map(|r| r.clone().ok().unwrap()).collect())

The performance tradeoff is explicit, and easy to find later on with a simple grep clone or similar. When you learn to do things the Rusty way, it won’t be hard to go back to your “hack” and fix it properly.

Refcounting to the rescue

Adding clone() willy-nilly to make the code compile is a valid workaround when we’re just learning. Sometimes, however, even some gratuitous cloning doesn’t quite solve the problem, because the clone() itself can become an issue.

For one, it requires our objects to implement the Clone trait. This was apparent even in our previous example, since we had to add a #[derive(Clone)] attribute to the struct Error in order to make it clone-able.

Fortunately, in the vast majority of cases this will be all that’s necessary, as most built-in types in Rust implement Clone already. One notable exception is the function traits (FnOnce, Fn, and FnMut), which are used to store and refer to closures¹. Structures and other custom types that contain them (or those which may contain them) therefore cannot implement Clone through a simple #[derive] annotation:

/// A value that's either there already
/// or can be obtained by calling a function.
#[derive(Clone)]
enum LazyValue<T: Clone> {
    Immediate(T),
    Deferred(Fn() -> T),
}

error: the trait `core::marker::Sized` is not implemented for the type `core::ops::Fn() -> T + 'static` [E0277]
    #[derive(Clone)]
             ^~~~~

What can we do in this case, then? Well, there is yet another kind of performance concession we can make, and this one will likely sound familiar if you’ve ever worked with a higher level language before. Instead of actually cloning an object, you can merely increment its reference counter. As the most rudimentary kind of garbage collection, this allows us to safely share the object between multiple “owners”, where each can behave as if it had its own copy of it.

Rust’s pointer type that provides reference counting capabilities is called std::rc::Rc. Conceptually, it is analogous to std::shared_ptr from C++, and it similarly keeps the refcount updated when the pointer is “acquired” (clone-ed) and “released” (drop-ed). Because no data is moved around during either of those two operations, Rc can refer even to types whose size isn’t known at compilation time, like abstract closures:

use std::rc::Rc;

#[derive(Clone)]
enum LazyValue<T: Clone> {
    Immediate(T),
    Deferred(Rc<dyn Fn() -> T>),  // `dyn` is required for trait objects in current Rust
}

Wrapping them in Rc therefore makes them “cloneable”. They aren’t actually cloned, of course, but because of the inherent immutability of Rust types they will appear so to any outside observer².

Move it!

Ultimately, most problems with the borrow checker boil down to unskillful mixing of the two ways you handle data in Rust. There is ownership, which is passed around by moving the values; and there is borrowing, which means operating on them through references.

When you try to switch from one to the other, some friction is bound to occur. Code that uses references, for example, has to be copiously sprinkled with & and &mut, and may sometimes require explicit lifetime annotations. All these have to be added or removed, and changes like that tend to propagate quite readily to the upper layers of the program’s logic.

Therefore it is generally preferable, if at all possible, to deal with data directly and not through references. To maintain efficiency, however, we need to learn how to move the objects through the various stages of our algorithms. It turns out it’s surprisingly easy to inadvertently borrow something, hindering the possibility of producing a moved value.

Take our first example. The intuitively named Vec::iter method produces an iterator that we can map over, but does it really go over the actual items in the vector? Nope! It gives us a reference to each one — a borrow, if you will — which is exactly why we originally had to use clone to get out of this bind.

Instead, why not just get the elements themselves, by moving them out of the vector? Vec::into_iter allows us to do exactly this:

Ok(results.into_iter().map(|r| r.ok().unwrap()).collect())

and enables us to remove the clone() call. The family of similar into_X (or even just into) methods can be reliably counted on at least in the standard library. They are also part of a more-or-less official naming convention that you should also follow in your own code.

Note how this is different from function types, i.e. fn(A, B, C, ...) -> Ret. It is because plain functions do not carry their closure environments along with them. This makes them little more than just pointers to some code, and those can be freely Clone-d (or even Copy-ed). ↩
If you want both shared ownership (“fake cloneability”) and the ability to mutate the shared value, take a look at the RefCell type and how it can be wrapped in Rc to achieve both. ↩

Requirements for Python’s pip

Posted on Sun 21 February 2016 in Code • Tagged with Python, pip, packages, dependencies

In this post I’ll describe all (hopefully all!) the various ways you can specify a single dependency for a Python package.

This assumes pip is used for installation. The list of dependencies then goes either in the install_requires parameter of the setup function within setup.py, or in a separate requirements.txt file. Commonly, it will actually go in both places, with the latter being the canonical source of truth:

from setuptools import setup

with open('requirements.txt') as rf:
    setup(
        # ...
        install_requires=rf.readlines(),
    )

More details about this approach can be found in one of my previous posts.
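Since rf.readlines() hands every line over verbatim, any comments and blank lines in requirements.txt come along too. A slightly more defensive variant could filter them out first. This is only a sketch: the parse_requirements and read_requirements names are made up, and pip-only option lines (such as -r or -e) are simply skipped rather than interpreted.

```python
import re

# A '#' starts a comment only at the line start or after whitespace,
# so URL fragments such as '#egg=name' are left alone.
_COMMENT = re.compile(r'(^|\s+)#.*$')


def parse_requirements(lines):
    """Return requirement specifiers, skipping comments, blanks, options."""
    requirements = []
    for line in lines:
        line = _COMMENT.sub('', line).strip()
        if not line or line.startswith('-'):
            continue  # blank line, comment, or a pip option like -r / -e
        requirements.append(line)
    return requirements


def read_requirements(path='requirements.txt'):
    """Read and clean up the requirement lines from a file."""
    with open(path) as rf:
        return parse_requirements(rf)
```

In setup.py, the call would then become install_requires=read_requirements() instead of passing the raw lines.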

Here, I will concentrate on the format of a single line in requirements.txt that defines a dependency. There are numerous variants that pip supports, and they are all described in excruciating detail in PEP 440. This post shall serve as a short reference on the most useful ones.

Package name (and version)

The simplest and most common option is to identify a dependency by its package name:

SQLAlchemy

This will locate it in a global index of packages, which is sometimes called a “cheese shop”. Currently, by far the most popular package registry for Python is PyPI, and pip uses it by default¹.

Without any further modifiers, pip will download and install the “current” version of the package — either the newest, or the one designated explicitly by a maintainer. This obviously makes the dependency somewhat unpredictable, for it can mean unintended upgrades that introduce breaking changes to your code.

To prevent this, you’d normally pin the dependency to an exact version²:

SQLAlchemy==0.9.10

Other comparison operators are also available:

SQLAlchemy>=0.9.10
SQLAlchemy<1.0.0

and can even be combined:

SQLAlchemy>=0.9,<1.0.0

Specs like that will make pip find the newest version that’s within given range. Assuming your dependency follows the semantic versioning scheme, this will allow you to stay on top of any minor bugfixes and improvements to an older release (0.9.x here), without the risk of accidentally upgrading to a new one (1.x) that your code is not compatible with yet.
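To make the range semantics concrete, here is a toy model of how a spec like >=0.9,<1.0.0 selects a version. It rests on loud assumptions: versions are treated as plain dotted integers (no pre-releases, epochs, or other PEP 440 subtleties), and the function names are invented for this example. What pip really does is considerably more involved.

```python
import operator

_OPERATORS = {
    '==': operator.eq, '!=': operator.ne,
    '>=': operator.ge, '<=': operator.le,
    '>': operator.gt, '<': operator.lt,
}


def parse_version(text):
    """'0.9.10' -> (0, 9, 10); only plain dotted integers are handled."""
    return tuple(int(part) for part in text.split('.'))


def pad(version, length):
    """Pad with zeros so that (0, 9) compares like (0, 9, 0)."""
    return version + (0,) * (length - len(version))


def satisfies(version, spec):
    """Check a version string against a spec like '>=0.9,<1.0.0'."""
    candidate = parse_version(version)
    for clause in spec.split(','):
        clause = clause.strip()
        # Try two-character operators before one-character ones.
        for symbol in sorted(_OPERATORS, key=len, reverse=True):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):].strip())
                width = max(len(candidate), len(bound))
                if not _OPERATORS[symbol](pad(candidate, width),
                                          pad(bound, width)):
                    return False
                break
        else:
            raise ValueError('unsupported clause: %r' % clause)
    return True


def newest_matching(available, spec):
    """Pick the highest version among `available` that fits `spec`."""
    matching = [v for v in available if satisfies(v, spec)]
    return max(matching, key=parse_version) if matching else None
```

With versions 0.8.7, 0.9.9, 0.9.10, and 1.0.0 on offer, newest_matching(..., '>=0.9,<1.0.0') settles on 0.9.10: the newest bugfix of the 0.9.x line, but never 1.0.0.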

Repository URL

Sometimes you want to live on the bleeding edge, though, and depend not just on the latest release, but the head commit to the package’s repository. This makes sense especially in large systems that are distributed among multiple repos, and where development happens in lockstep.

For those occasions, and a few others, pip can recognize direct repository URLs. They are in the format:

$VCS+$PROTOCOL://$URL@$LABEL#egg=$PACKAGE

where the $PROTOCOL part can be optional if the version control system has a default there. That’s for example the case for git, which is of course the most important VCS you’d be interested in³:

git://git.example.com/somepackage#egg=somepackage

Note that the #egg=$PACKAGE part is not a part of the $URL, and it’s only there to give a local name for the package distribution. This is what makes it possible to refer to it later via pip, if only to remove it with pip uninstall $PACKAGE. Of course, the sanest practice is to use the PyPI moniker if possible.

When no $LABEL is given, pip will use the HEAD, trunk, tip, or the equivalent default/current revision from the repo. Often though (at least in case of Git), you would also pick a branch, tag, or even a particular commit hash:

git+https://github.com/Xion/unmatcher.git@0.1.3.1#egg=unmatcher
git+ssh://github.com/You/yourpackage.git@master#egg=yourpackage
git+https://github.com/mitsuhiko/jinja2.git@5b498453b5898257b2287f14ef6c363799f1405a#egg=Jinja2

The last two options could be a good choice even with third party packages, when you don’t want to wait for a new PyPI release to get a necessary feature or an urgent bug fix.
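To illustrate how the pieces of such a URL fit together, the sketch below takes one apart using only the standard library. It is not how pip itself parses requirements; split_vcs_url is a hypothetical helper, and it assumes the $VCS+ prefix is present.

```python
from urllib.parse import urlsplit


def split_vcs_url(requirement):
    """Split '$VCS+$PROTOCOL://$URL@$LABEL#egg=$PACKAGE' into its parts."""
    parts = urlsplit(requirement)
    vcs, _, protocol = parts.scheme.partition('+')

    # The label follows the last '@' in the path; an '@' in the host part
    # (as in ssh://git@host/...) lives in netloc and is not affected.
    path, _, label = parts.path.rpartition('@')
    if not path:  # no '@' at all, so the whole thing is the path
        path, label = parts.path, None

    package = None
    if parts.fragment.startswith('egg='):
        package = parts.fragment[len('egg='):]

    return {
        'vcs': vcs,
        'protocol': protocol,
        'location': parts.netloc + path,
        'label': label,
        'package': package,
    }
```

For the first example above, this yields git as the VCS, https as the protocol, 0.1.3.1 as the label, and unmatcher as the package name; a URL without any @$LABEL part comes back with the label set to None.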

Local filesystem

Lastly, you can ask pip to install a package from a local directory or archive. The former option is often used with the -e (--editable) flag for pip install. This installs the package in the so-called development mode, allowing you to edit its source code in-place:

$ pip install -e /home/me/Code/myotherpackage

You almost certainly don’t want to put this line in requirements.txt: you should still be pulling the other package from PyPI. But if it’s your own one — maybe a self-contained utility library used by your main program — this setup will be very helpful for making changes to it, informed by your own usage of the package.

This can be changed with the --index-url flag to pip install. Running local indexes is a good practice for Python shops, especially those that rely on pip install as part of their deployment process. ↩
If the package uses semantic versioning, a possible alternative to == is ~=, which means “compatible” version. The precise meaning of this is somewhat complicated, but it roughly means that upgrades are permitted as long as nothing in the public interface changes. ↩
Other options include hg (Mercurial), svn, and bzr (Bazaar). ↩

Moving out of a container in Rust

Posted on Fri 05 February 2016 in Code • Tagged with Rust, vector, borrow checker, references

To prevent the kind of memory errors that plague many C programs, the borrow checker in Rust tracks how data is moved between variables, or accessed via references. This is all done at compile time, with zero runtime overhead, and is a sizeable part of Rust’s value offering.

Like all rigid and automated systems, however, it is necessarily constrained and cannot handle all situations perfectly. One of its limitations is treating all objects as atomic. It’s impossible for a variable to own a part of some bigger structure, nor is it possible to maintain mutable references to two or more elements of a collection.

If we nonetheless try:

fn get_name() -> String {
    let names = vec!["John".to_owned(), "Smith".to_owned()];
    join(names[0], names[1])
}

fn join(a: String, b: String) -> String {
    a + " " + &b
}

we’ll be served with a classic borrow checker error:

<anon>:3:25: 3:33 error: cannot move out of indexed content [E0507]
<anon>:3     let fullname = join(names[0], names[1]);
                                 ^~~~~~~~

Behind its rather cryptic verbiage, it informs us that we tried to move a part of the names vector — its first element — to a new variable (here, a function parameter). This isn’t allowed, because in principle it would render the vector invalid from the standpoint of strict memory safety. Rust would no longer guarantee names[0] to be a legal String: its internal pointer could’ve been invalidated by the code which the element moved to (the join function)¹.

But while commendable, this guarantee isn’t exactly useful here. Even though names[0] would technically be invalid, there isn’t anyone to actually notice this fact. The names vector is inaccessible outside of the function it’s defined in, and even the function itself doesn’t look at it after the move. In its present form, the program ~~is inarguably correct~~² could’ve been accepted if partial moves from Vec were allowed by the borrow checker.

Pointers to the rescue?

Vectors wouldn’t be very useful or efficient, though, if we could only obtain copies or clones of their elements. As this is an inherent limitation of Rust’s memory model, and applies to all compound types (structs, hashmaps, etc.), it’s been recognized and countermeasures are available.

However, the idiomatic practice is to actually leave the elements be and access them solely through references:

fn get_name() -> String {
    let names = vec!["John".to_owned(), "Smith".to_owned()];
    join(&names[0], &names[1])
}

fn join(a: &String, b: &String) -> String {
    a.clone() + " " + b
}

The obvious downside of this approach is that it requires an interface change to join: it now has to accept pointers instead of actual objects³. And since the result is a completely new String, we have to either bite the bullet and clone, or write a more awkward join_into(a: &mut String, b: &String) function.
In general, making an API switch from actual objects to references has an annoying tendency to percolate up the call stacks and abstraction layers.

Vector solution

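If we still insist on moving the elements out, at least in the case of a vector we aren’t completely out of luck. The Vec type offers several specialized methods that can slice, dice, and splice the collection in various ways. Of the ones below, the split_first, split_last, and split_at families hand out references or slices to disjoint parts of the vector, while split_off and drain actually move elements out: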

split_first (and split_first_mut) for cutting right after the first element
split_last (and split_last_mut) for a similar cut right before the last element
split_at (and split_at_mut), generalized versions of the above methods
split_off, a partially-in-place version of split_at_mut
drain for moving all elements from a specified range

Other types may offer different methods, depending on their particular data layout, though drain should be available on any data structure that can be iterated over.

Structural advantage

What about user-defined types, such as structs?

Fortunately, these are covered by the compiler itself. Since accessing struct fields is a fully compile-time operation, it is possible to track the ownership of each individual object that makes up the structure. Thus there are no obstacles to simply moving all the fields:

struct Person {
    first_name: String,
    last_name: String,
}

fn get_name() -> String {
    let p = Person{first_name: "John".to_owned(),
                   last_name: "Smith".to_owned()};
    join(p.first_name, p.last_name)
}

If all else fails…

This leaves us with some rare cases when the container’s interface doesn’t quite support the exact subset of elements we want to move out. If we don’t want to drain them all and inspect every item for potential preservation, it may be time to skirt around the more dangerous areas of the language.

But I don’t necessarily mean going all out with unsafe blocks, pointers, and (let’s be honest) segfaults. Instead, we can look at the gray zone between them and the regular, borrow-checked Rust code.

Some of the functions inside the std::mem module can be said to fall into this category. Most notably, mem::swap and mem::replace allow us to operate directly on the memory blocks that back every Rust object, albeit without the dangerous ability to freely modify them.

What those functions enable is a small sleight of hand — a quick exchange of two variables or objects while the borrow checker “isn’t looking”. Possessing such an ability, we can smuggle any item out of a container as long as we’re able to provide a suitable replacement:

use std::mem;

/// Pick only the items under indices that are powers of two.
fn pick_powers_of_2<T: Default>(mut v: Vec<T>) -> Vec<T> {
    let mut result: Vec<T> = Vec::new();
    let mut i = 1;
    while i < v.len() {
        let elem = mem::replace(&mut v[i], T::default());
        result.push(elem);
        i *= 2;
    }
    result
}

[Image: “Swap!”] Pictured: implementation of mem::replace.

The Default value, if available, is usually a great choice here. Alternately, a Copy or Clone of some other element can also work if it’s cheap to obtain.

¹ In Rust jargon, it is sometimes said that the object has been “consumed” there.
² As /u/Gankro points out on /r/rust, since Vec isn’t a part of the language itself, it doesn’t get to bend the borrow checking rules. Therefore speaking of counterfactual correctness is a bit too far-fetched in this case.
³ For Strings specifically, the usual practice is to require a more generic &str type (string slice) instead of &String.

Retry idiom for Python

Posted on Wed 27 January 2016 in Code • Tagged with Python, exceptions, else

A relatively little known feature of Python is the else block for control flow statements other than if.

If you haven’t heard about it before, you can provide such a block for both while and for loops, as well as any variant of the try statement. Its functionality is roughly analogous in both cases:

in loops, the else block is executed if the loop didn’t exit abnormally (i.e. with break)
in try constructs, the else block runs if no exception happened
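
Both behaviors can be seen in a minimal sketch. The function names (find_first_even, parse_int) are made up purely for illustration:

```python
def find_first_even(numbers):
    """Return the first even number, or None if the loop never hits `break`."""
    for n in numbers:
        if n % 2 == 0:
            found = n
            break
    else:
        # Runs only when the loop was exhausted without a `break`.
        found = None
    return found


def parse_int(text):
    """Return (value, status) to show which branch of try/except/else ran."""
    try:
        value = int(text)
    except ValueError:
        return None, "failed"
    else:
        # Runs only when the try block raised no exception.
        return value, "parsed"


print(find_first_even([1, 3, 4, 5]))  # 4
print(find_first_even([1, 3, 5]))     # None
print(parse_int("42"))                # (42, 'parsed')
print(parse_int("forty-two"))         # (None, 'failed')
```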

Likely because of the unique semantics that don’t exist in other languages, neither of those constructs has been seen much in real world code. Recently, however, I’ve found they can be combined into a very pythonic pattern that’s also quite useful.

The trick

Say you have a task that won’t always succeed. Perhaps it’s a request made to a janky server, or some other network operation that’s prone to timeouts. Since failures are likely to be transient, you’d like to retry it several more times before giving up permanently.

With a try/except block, you can detect those half-expected failures. With a simple loop, you can repeat the attempt as many times as you deem feasible. Combined, they can solve the problem rather neatly:

for _ in range(MAX_RETRIES):
    try:
        # ... do stuff ...
    except SomeTransientError:
        # ... log it, sleep, etc. ...
        continue
    else:
        break
else:
    raise PermanentError()
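
To try the pattern out, here is a self-contained sketch. Everything around the loop is an assumption made for the sake of a runnable example: the two exception classes are stand-ins for whatever errors your code deals with, and make_flaky_task and run_with_retries are hypothetical helpers.

```python
MAX_RETRIES = 3


class SomeTransientError(Exception):
    """Stand-in for a failure that is worth retrying."""


class PermanentError(Exception):
    """Stand-in for the error raised once we give up."""


def make_flaky_task(failures_before_success):
    """Build a fake task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise SomeTransientError("attempt %d failed" % state["calls"])
        return "ok after %d attempt(s)" % state["calls"]

    return task


def run_with_retries(task):
    for _ in range(MAX_RETRIES):
        try:
            result = task()
        except SomeTransientError:
            # A real program would likely log and sleep here.
            continue
        else:
            break  # success: leave the loop, skipping the loop's else
    else:
        # Reached only if every attempt ended in `continue`.
        raise PermanentError()
    return result


print(run_with_retries(make_flaky_task(2)))  # ok after 3 attempt(s)
```

With MAX_RETRIES set to 3, a task that fails three times in a row makes run_with_retries raise PermanentError instead.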

But why?

What’s the deal with the elses here, though? Are they both necessary?

The simple answer is, of course, no. An else after either a loop or a try/except block is always syntactic sugar. Any code that contains it can be transformed into an equivalent snippet that utilizes different techniques to achieve the same effect.

But this view isn’t very useful, for many of the essential features in any programming language can be dismissed as superfluous using this reasoning. The real question is whether the above idiom is more readable and understandable than the alternatives.

To that, I posit, the answer is: absolutely.

Desugaring

Without the double else, this example would have to be written in a considerably more convoluted way:

retries = MAX_RETRIES
while retries > 0:
    try:
        # ... do stuff ...
        break
    except SomeTransientError:
        # ... log it, sleep, etc. ...
        retries -= 1
if retries == 0:
    raise PermanentError()

Although at first glance the difference may be minuscule, this version adds significant extra busywork the programmer has to pay careful attention to:

The retries variable now has to be explicit, because the final conditional statement must look at its value.
We can’t use a for loop anymore (e.g. for retries in range(MAX_RETRIES)), because we wouldn’t distinguish the “success at last try” and “retry limit exceeded” cases: they’d both result in retries equal to MAX_RETRIES - 1 after the loop¹.
As a result, we have to remember to decrement the counter ourselves upon an error.

Additionally, the break is easy to miss amidst the actual logic within the try block, both for the developer who writes the code and for any subsequent readers. An alternative is to move it outside of the try/except clause, but that in turn reintroduces continue into the except branch and further complicates the whole flow.

In short, the desugared version is more error-prone (all those off-by-ones!) and also quite inscrutable.
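
The point about the for loop is easy to check with a small experiment. In this sketch, last_counter_value is a hypothetical helper that simulates the attempts with a list of booleans (True meaning success) and reports the loop variable afterwards:

```python
MAX_RETRIES = 3


def last_counter_value(outcomes):
    """Run a plain for loop (no else) and return the final loop variable."""
    for retries in range(MAX_RETRIES):
        if outcomes[retries]:
            break
    return retries


success_at_last_try = last_counter_value([False, False, True])
limit_exceeded = last_counter_value([False, False, False])

# Both cases leave the counter at MAX_RETRIES - 1, so the value alone
# cannot tell us whether the final attempt worked.
print(success_at_last_try, limit_exceeded)  # 2 2
```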

¹ Or to 1, if we count from MAX_RETRIES down to zero.
