Karol Kuczmarski's Blog

Better location for unit tests in Rust

Posted on Fri 06 January 2017 in Code • Tagged with Rust, unit tests, testing, modules

For a unit test to be comprehensive, it must often access some private symbols from the module it checks.

In Rust, this is permitted for submodules: they can freely refer to anything defined “upwards” in the module hierarchy. The only requirement is that they import it explicitly by name, using statements such as use super::foo.

To illustrate this, here’s an example of a ridiculously well-factored FizzBuzz along with its accompanying unit test:

use std::borrow::Cow;

pub fn fizzbuzz(n: u32) {
    for i in 1..n+1 {
        println!("{}", fizzbuzz_string(i));
    }
}

fn fizzbuzz_string(i: u32) -> Cow<'static, str> {
    let by3 = i % 3 == 0;
    let by5 = i % 5 == 0;
    if by3 && by5 { "FizzBuzz".into() }
    else if by3   { "Fizz".into() }
    else if by5   { "Buzz".into() }
    else          { format!("{}", i).into() }
}


#[cfg(test)]
mod tests {
    use super::fizzbuzz_string;

    #[test]
    fn single_numbers() {
        assert_eq!("1", fizzbuzz_string(1));
        assert_eq!("2", fizzbuzz_string(2));
        assert_eq!("Fizz", fizzbuzz_string(3));
        assert_eq!("Buzz", fizzbuzz_string(5));
        assert_eq!("7", fizzbuzz_string(7));
        assert_eq!("Fizz", fizzbuzz_string(9));
        assert_eq!("Buzz", fizzbuzz_string(10));
        assert_eq!("FizzBuzz", fizzbuzz_string(15));
        // etc.
    }
}

The internal function, as shown above, can be imported and verified independently of the public one. This is done through a #[test] procedure in an inline submodule.

Such factorization and granular testing is commonplace, especially when the public API may cause unwanted side effects, such as printing stuff to stdout here.

The issue of length

But if you are like me and prefer your modules to be short and sweet, you may feel justifiably concerned about this inline submodule business.

In the toy example above, tests have already taken at least as many lines as the actual code. Real-world code usually matches this ratio. A module with a couple hundred lines of regular code starts to be measured in KLOCs if we also include its tests.

While this could be taken as a strong hint to split things up, it can just as easily disincentivize testing instead.

The obvious solution is to move those tests somewhere else. What is not so evident is how to preserve this crucial module-submodule relation, enabling us to write comprehensive tests in the first place.

Looking for inspiration

I must quickly disappoint anyone who would like to round up all their unit tests and sequester them in some distant tests/ directory. Such a layout is reserved for crate-level (“integration”) tests. Unit tests, on the other hand, are predestined to live among production code¹.

So let’s at least relocate them to separate files.

To make this goal more concrete, we will try to emulate the project layout described in Google’s C++ style guide. By this convention, a conceptual “module” or “unit” consists of the following files:

foo.h
foo.cc
foo_test.cc

Translating this to Rust, we get:

foo.rs
foo_test.rs

The first one is obviously our production code. The second file, foo_test.rs, contains all the tests we would previously put in the mod tests { } construct.

Seems pretty clean and straightforward, right? Unfortunately, Rust will not accept this setup without some convincing.

Family problems

To understand why, recall that the mere presence of some .rs files is not enough for the Rust compiler to care. If we want them picked up and included in the project, we also need to add some module declarations first.

In other words, there must also be a mod.rs file in this directory, containing at the very least the following content:

// (mod.rs)

mod foo;
#[cfg(test)]
mod foo_test;

Now it should be clearer that something is wrong.

We got two modules here, but they are siblings. Both foo and foo_test are on the same level, children of whatever parent module contains them both. More to the point, it’s foo_test that’s not a child module of foo, meaning it can only see the public symbols of the latter.

This is not quite enough to write a proper unit test. It definitely isn’t for our initial FizzBuzz example, because the fizzbuzz_string function cannot even be imported!

Existential crises

Okay, so how about we move the mod foo_test; declaration to foo.rs? This should be enough to establish the proper hierarchy. After all, this is how the module tree is normally reconstructed: from the appropriate placement of the mod statements.

So, here we go:

// (foo.rs)

#[cfg(test)]
mod foo_test;

error: cannot declare a new module at this location
  --> src/parent/foo.rs:4:5
   |
 4 | mod foo_test;

…Really?

Well, yes. A declaration like this simply isn’t allowed. The reason for this is actually much less arbitrary than the error message would indicate.

To put it bluntly, foo_test simply cannot exist if it’s introduced there. To deliver on its declaration promise, the submodule would have to reside within foo itself. But of course, foo.rs is just a file, so this setup is evidently impossible.

All in all, Rust seems to be looking for our module in all the wrong places.

Perhaps we can just tell it where it should be going instead?…

The right path

Enter the #[path] attribute, which fulfills this exact purpose:

// (foo.rs)

#[cfg(test)]
#[path = "./foo_test.rs"]
mod foo_test;

#[path] tells the Rust compiler where to look for the module it is attached to. Its argument is relative to the location of the outer module (like foo here), and can be either a single file, or a directory with mod.rs.

Conceptually, this is similar to a custom ClassLoader in Java, or the common sys.path hacks in Python. Unlike those two languages, however, the #[path] attribute is only relevant at compile time.

Additionally, and somewhat confusingly, #[path] can also be applied retroactively to a module that the compiler has already located. In such a case, it will affect the lookup of any child modules by making rustc search for them in the new location.

With #[path] handy, it is therefore possible to implement custom layouts of regular source modules and test files.

But as with every tool that can be used to defy conventions, it should be used with appropriate care. While the straightforward and self-documenting approach described here is unlikely to raise any eyebrows, rewriting module paths willy-nilly is most certainly a bad idea.

¹ Okay, technically it is possible to completely isolate them, essentially by abusing the approach I describe later in this post.

__all__ and wild imports in Python

Posted on Mon 26 December 2016 in Code • Tagged with Python, modules, imports, testing

An often misunderstood piece of Python import machinery is the __all__ attribute. While it is completely optional, it’s common to see modules with the __all__ list populated explicitly:

__all__ = ['Foo', 'bar']

class Foo(object):
    pass  # ...

def bar():
    pass  # ...

def baz():
    pass  # ...

Before explaining what the real purpose of __all__ is (and how it relates to the titular wild imports), let’s deconstruct some common misconceptions by highlighting what it isn’t:

__all__ doesn’t prevent any of the module symbols (functions, classes, etc.) from being directly imported. In our example, the seemingly omitted baz function (which is not included in __all__) is still perfectly importable by writing from module import baz.
Similarly, __all__ doesn’t influence what symbols are included in the results of dir(module) or vars(module). So in the case above, the list returned by a dir call would contain 'Foo', 'bar', and 'baz' (next to the usual dunder names such as '__all__' itself), even though 'baz' does not occur in __all__.
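Both points are easy to check interactively. The snippet below is only a sketch: it builds a throwaway in-memory module under the made-up name demo_mod (an assumption for illustration, standing in for the module shown earlier) and registers it in sys.modules so that the import statement can find it.

```python
import sys
import types

# Source of a stand-in module; "demo_mod" is a hypothetical name.
SOURCE = '''
__all__ = ['Foo', 'bar']

class Foo(object):
    pass

def bar():
    return 'bar'

def baz():
    return 'baz'
'''

demo_mod = types.ModuleType("demo_mod")
exec(SOURCE, demo_mod.__dict__)
sys.modules["demo_mod"] = demo_mod  # make it visible to `import`

# baz is absent from __all__, yet importing it by name works.
from demo_mod import baz
print(baz())  # prints "baz"

# dir() lists it as well, regardless of __all__.
print("baz" in dir(demo_mod))  # prints "True"
```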

In other words, the content of __all__ is more of a convention rather than a strict limitation. Regardless of what you put there, every symbol defined in your module will still be accessible from the outside.

This is a clear reflection of the common policy in Python: assume everyone is a consenting adult, and that visibility controls are not necessary. Without an explicit __all__ list, Python simply treats all of the module’s “public” symbols as if they were listed there¹.

The meaning of it `all`

So, what does __all__ actually affect?

This is neatly summed up in this brief StackOverflow answer. Simply speaking, its purpose is twofold:

It tells the readers of the source code — be it humans or automated tools — what’s the conventional public API exposed by the module.
It lists names to import when performing the so-called wild import: from module import *.

Because of the default content of __all__ that I mentioned earlier, the public API of a module can also be defined implicitly. Some style guides (like the Google one) are therefore relying on the public and _private naming exclusively. Nevertheless, an explicit __all__ list is still a perfectly valid option, especially considering that no approach offers any form of actual access control.

Import star

The second point, however, has some real runtime significance.

In Python, like in many other languages, it is recommended to be explicit about the exact functions and classes we’re importing. Commonly, the import statement will thus take one of the following forms:

import random
import urllib.parse
from random import randint
from logging import fatal, warning as warn
from urllib.parse import urlparse
# etc.

In each case, it’s easy to see the relevant name being imported. Regardless of the exact syntax and the possible presence of aliasing (as), it’s always the last (qualified) name in the import statement, before a newline or comma.

Contrast this with an import that ends with an asterisk:

from itertools import *

This is called a star or wild import, and it isn’t so straightforward. This is also the reason why using it is generally discouraged, except for some very specific situations.

Why? Because you cannot easily see what exact names are being imported here. For that you’d have to go to the module’s source and — you guessed it — look at the __all__ list².

Taming the wild

Barring some less important details, the mechanics of import * could therefore be expressed in the following Python (pseudo)code:

import module as __temp
for __name in __temp.__all__:
    globals()[__name] = getattr(__temp, __name)
del __temp
del __name
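For the curious, here is a runnable approximation of the same mechanics. It is a sketch rather than a description of what the interpreter literally does: emulate_star_import is a hypothetical helper name, and the fallback to non-underscore names follows the default behavior described earlier. The standard string module is used for comparison simply because it defines __all__.

```python
import importlib


def emulate_star_import(module_name, namespace):
    """Bind into `namespace` the names `from module_name import *` would bind."""
    module = importlib.import_module(module_name)
    names = getattr(module, "__all__", None)
    if names is None:
        # No explicit __all__: fall back to the "public" names.
        names = [n for n in vars(module) if not n.startswith("_")]
    for name in names:
        namespace[name] = getattr(module, name)


# Compare the emulation with a real wild import run in a fresh namespace.
emulated = {}
emulate_star_import("string", emulated)

real = {}
exec("from string import *", real)
real.pop("__builtins__", None)  # exec() adds this key on its own

print(sorted(emulated) == sorted(real))  # prints "True"
```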

One interesting case to consider is what happens when __all__ contains a wrong name.

What if one of the strings there doesn’t correspond to any name within the module?…

# foo.py
__all__ = ['Foo']

def bar():
    pass

>>> import foo
>>> from foo import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'Foo'

Quite predictably, import * blows up.
Notice, however, that regular import still works.

All in all (ahem), this hints at a cute little trick which is also very self-evident:

__all__ = ['DO_NOT_WILD_IMPORT']

Put this in a Python module, and no one will be able to import * from it!
Much more effective than any lint warning ;-)
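If you want to see the trick in action without creating files, the following sketch does it with an in-memory module. The name guarded_mod and the helper function are made up for this illustration; the wild import is run through exec() only so that the failure can be caught.

```python
import sys
import types

# A hypothetical module guarded with a bogus __all__ entry.
guarded = types.ModuleType("guarded_mod")
exec(
    "__all__ = ['DO_NOT_WILD_IMPORT']\n"
    "\n"
    "def helper():\n"
    "    return 42\n",
    guarded.__dict__,
)
sys.modules["guarded_mod"] = guarded


def try_wild_import(module_name):
    """Return True if `from module_name import *` succeeds, False otherwise."""
    try:
        exec("from %s import *" % module_name, {})
        return True
    except AttributeError:
        return False


print(try_wild_import("guarded_mod"))  # prints "False"

# Importing by name is unaffected.
from guarded_mod import helper
print(helper())  # prints "42"
```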

Test `all` the things

Jokes aside, this phenomenon (__all__ with an out-of-place name in it) can also backfire. Especially when reexporting, it’s relatively easy to introduce a stray 'name' into __all__: one which doesn’t correspond to any name that’s actually present in the namespace.

If we commit such a mishap, we are inadvertently lying about the public API of our package. What’s worse is that this mistake can propagate through documentation generators, and ultimately mislead our users.

While some linters may be able to catch this, a simple test like this one:

def test_all(self):
    """Test that __all__ contains only names that are actually exported."""
    import yourpackage

    missing = set(n for n in yourpackage.__all__
                  if getattr(yourpackage, n, None) is None)
    self.assertFalse(
        missing, msg="__all__ contains unresolved names: %s" % (
            ", ".join(missing),))

is a quick & easy way to ensure this never happens.

¹ “Public” symbols have names that don’t begin with underscore (_). Of course, “non-public” ones are still accessible but are treated as implicitly unstable & discouraged.
² Or check what symbols there don’t have a leading underscore.

Optional loading of RequireJS modules

Posted on Tue 29 September 2015 in Code • Tagged with JavaScript, RequireJS, modules, Web Workers, DOM, AJAX

RequireJS is a module loader for JavaScript. Similar to its alternatives such as Browserify, it tries to solve an important problem on the web front(end): dividing JavaScript code into modules for better maintainability while still loading them correctly and efficiently without manual curation of the <script> tags.

Once it’s configured correctly (which can be rather non-trivial, though), modules in RequireJS are simply defined as functions that return arbitrary JavaScript objects:

define([
    'jquery',
    'lodash',

    'myapp/dep1',
    'myapp/dep2',
], function($, _, dep1, dep2) {
    // ... all of the module's code ...

    return {
        exportedSymbol1: ...,
        exportedSymbol2: ...,
    };
});

Before executing the function, RequireJS loads all the specified dependencies, repeating the process recursively and asynchronously. Return values from module functions are passed as parameters to the next module function, and thus the whole mechanism clicks, serving as a crafty workaround for the lack of proper import functionality¹.

Relative failures

If, at some point in the chain, the desired module cannot be found or loaded, the entire process grinds to a halt with an error. Most of the time, this is perfectly acceptable (or even desirable) behavior, equivalent to an incorrect import statement, invalid #include directive, or similar mistake in other languages.

But there are situations when we’d like to proceed with a missing module, because the dependent code is prepared to handle it. The canonical example is Web Workers. Unlike traditional web application code, Web Worker scripts operate outside of the context of any single page, having no access to the DOM tree (because which DOM tree would it be?). Correspondingly, they have neither document nor window objects in their global scope.

Unfortunately, some libraries (*cough* jQuery *cough*) require those objects as a hard (and usually implicit) dependency. This doesn’t exactly help if we’d like to use them in worker code for other features, not related to DOM. In the case of jQuery, for example, it could be the API for making AJAX calls, which is still decidedly more pleasant than dealing with bare XMLHttpRequest if we’re doing anything non-trivial.

Due to this hard dependency on DOM, however, Web Workers cannot require jQuery. No biggie, you may think: browsers supporting workers also offer an excellent, promise-based Fetch API that largely replaces the old AJAX, so we may just use it in worker code. Good thinking indeed, but it doesn’t solve the issue of sharing code between the main (“UI”) part of the app and Web Workers.

Suppose you have the following dependency graph:

The common module has some logic that we’d want reused between regular <script>-based code and a Web Worker, but its dependency on jQuery makes it impossible. It would work, however, if this dependency was a soft one. If common could detect that jQuery is not available and fall back to other solutions (like the Fetch API), we would be able to require it in both execution environments.

The `optional` plugin

What we need, it seems, is an ability to say that some dependencies (like 'jquery') are optional. They can be loaded if they’re available but otherwise, they shouldn’t cause the whole dependency structure to crumble. RequireJS does not support this functionality by default, but it’s easy enough to add it via a plugin.

There are already several useful plugins available for RequireJS that offer some interesting features. As of this writing, however, optional module loading doesn’t seem to be among them. That’s not a big problem: rolling out our own² plugin turns out to be relatively easy.

RequireJS plugins are themselves modules: you create them as separate JavaScript files having code wrapped in a define call. They can also declare their own dependencies like any other module. The only requirement is that they export an object with a certain API: at minimum, it has to include the load method. Since our optional plugin is very simple, load is in fact the only method we have to implement:

/* Skeleton of a simple RequireJS plugin module. */

define([], function() {

function load(moduleName, parentRequire, onload, config) {
    // ...
}

return {
    load: load,
};

});

As its name would hint, load carries out the actual module loading which a plugin is allowed to influence, modify, or even replace with something altogether different. In our case, we don’t want to be too invasive, but we need to detect failure in the original loading procedure and step in.

I mentioned previously that module loading is asynchronous, which in JavaScript often translates to “callbacks”. Here, load receives the onload callback which we eventually need to invoke. It also gets the mysterious parentRequire argument; this is simply the regular require function that’d normally be used if our plugin didn’t stand in the way.

Those two are the most important pieces of the puzzle, which overall has a pretty succinct solution:

/**
 * RequireJS plugin for optional module loading.
 */
define ([], function() {


/** Default value to return when a module failed to load. */
var DEFAULT = null;

function load(moduleName, parentRequire, onload) {
    parentRequire([moduleName], onload, function (err) {
        var failedModule = err.requireModules && err.requireModules[0];
        console.warn("Could not load optional module: " + failedModule);
        requirejs.undef(failedModule);

        define(failedModule, [], function() { return DEFAULT; });
        parentRequire([failedModule], onload);
    });
}

return {
    load: load,
};

});

The logic here is as follows:

First, try to load the module normally (via the outer parentRequire call).
If it succeeds, onload is called and there is nothing for us to do.
If it fails, we log the failedModule and clean up some internal RequireJS state with requirejs.undef.
Most importantly, we define the module as a trivial shim that returns some DEFAULT (here, null).
As a result, when we require it again (through the inner parentRequire call), we know it’ll be loaded successfully.

Usage

Plugins in RequireJS are invoked on a per-module basis. You can specify that a certain dependency 'bar' shall be loaded through a plugin 'foo' by putting 'foo!bar' on the dependency list:

define([ 'foo!bar'], function(bar) {
    // ...
});

Both 'foo' and 'bar' represent module paths here: the first one is the path to the plugin module, while the second one is the actual dependency. In a more realistic example, like when our optional loader is involved, both of them would most likely be multi-segment paths:

define([
    'myapp/ext/require/optional!myapp/common/buttons/awesome-button',
], function(AwesomeButtonController) {
    // ...
});

As you can see, they can get pretty unreadable rather quickly. It would be better if the plugin prefix consisted of just one segment (i.e. optional!) instead. We can make that happen by adding a mapping to the RequireJS config:

requirejs.config({
    // ...
    map: {
        '*': {
            'optional': 'myapp/ext/require/optional',
        }
    }
})

With this renaming in place, the loading of non-mandatory dependencies becomes quite a bit clearer:

define([
    'optional!myapp/common/buttons/awesome-button',
], function(AwesomeButtonController) {

// ...
if (!AwesomeButtonController) {
    // ... (some workaround) ...
}

});

Of course, you still need to actually code around the potential lack of an optional dependency. The if statement above is just an illustrative example; you may find it more sensible to provide some shim instead:

AwesomeButtonController = AwesomeButtonController || function() {
    // ...
};

Either way, I recommend trying to keep the size of such conditional logic to a minimum. Ideally, it should be confined to a single place, or — better yet — abstracted behind a function.

¹ An actual import statement has made it into the ES6 (ECMAScript 2015) standard but, as of this writing, no browser implements it.
² Most of the code for the plugin presented here is based on this StackOverflow answer.