Rust #6: Exploring crates

I often install tools via cargo and use crates in my code that have many dependencies. If you're like me, you wonder, while everything downloads and compiles, what all those crates do. I could just look at the top N popular crates on http://crates.io but I thought that was boring. Rather, I thought I'd clone the exa command-line tool from GitHub and see what crates it used. Surely it is better to look at a released tool that is out in the wild?

The 'exa' tool is a Rust-built drop-in replacement for the Unix command ls.

A simple command shows us the crate dependency hierarchy for this tool:

$ cargo tree

The top-level dependencies are:

  • ansi_term
  • datetime
  • git2
  • glob
  • lazy_static
  • libc
  • locale
  • log
  • natord
  • num_cpus
  • number_prefix
  • scoped_threadpool
  • term_grid
  • term_size
  • unicode-width
  • users
  • zoneinfo_compiled

Some interesting crates that I recognise also appeared as dependencies of the above:

  • bitflags
  • byteorder
  • matches
  • pad
  • tinyvec
  • url

So now I will boot up my favourite browser, go to http://docs.rs, figure out what all these crates do and see if any are useful. At least a cursory knowledge of them will stop me from reinventing the wheel if I need their functionality.

ansi_term

This is a library that generates the ANSI control codes that provide colour and formatting (e.g. bold, italic, etc.) on your terminal. If you are not sure what ANSI control codes are, the Wikipedia article on ANSI escape codes should explain them. The crate provides styles and colours via the builder pattern or a colour enum. For example:

use ansi_term::Colour::Yellow;

println!("My name is {}", Yellow.bold().paint("Matt"));

There's support for blink, bold, italic, underline, inverse (confusingly called reverse here), 256 colours and 24-bit colours. This is a very useful crate for wrapping strings with the ANSI control codes that you need.

But I have seen crates that do this with different syntax using extension traits on &str and String. The one that I use frequently is colored. For example, to recreate the last snippet using colored, it is:

use colored::Colorize;

println!("My name is {}", "Matt".yellow().bold());

I prefer the latter form, but it is totally subjective.

datetime

This is one crate I am familiar with. Unfortunately, I found the standard library's time and date support lacking for a tool I was writing. Some research unearthed this crate, which provides a lot more functionality. Be careful though, there is another crate called date_time that does similar stuff. The datetime library here provides structures for representing a calendar date and time of day in both timezone-aware and local form. It also provides formatting functions that convert a date or time into a string and back again, but unfortunately no documentation on how to use them. The format function takes a time value and also a locale of type Time, but there is no information on how to generate one. I couldn't figure it out myself.

My go-to date and time crate is chrono. It is fully featured, efficient and better documented. You can do UTC, local and fixed offset times (times in a certain timezone). It can format to and parse from strings containing times and dates. Times also have nanosecond precision. Times and dates are complicated things and chrono is a useful weapon in your coding arsenal.

bitflags

This is a very useful crate for generating enums that are bitmasks. You often see this in low-level programming or when dealing with operating system interfaces. Unlike normal enums, bitflag enums can be combined using bitwise operators. This is trivial to do in C but is not supported by Rust's enums. In C, you could do something like:

typedef enum Flags {
    A = 0x01,
    B = 0x02,
    C = 0x04,
    ABC = A | B | C,
} Flags;

int ab = A | B;

In Rust, enums are distinct and cannot be combined like that. With bitflags you can:

use bitflags::bitflags;

bitflags! {
    struct Flags: u8 {
        const A = 0b001;
        const B = 0b010;
        const C = 0b100;
        const ABC = Self::A.bits | Self::B.bits | Self::C.bits;
    }
}

let ab = Flags::A | Flags::B;

Not as elegant as C, but at least the enumerations are scoped like C++'s enum class. However, Rust's bitflags crate does support set difference using the - operator. This would go very wrong in C and C++ as - would be treated as a normal integer subtraction. For example:

let ac = Flags::ABC - Flags::B;
let we_dont_want_b = ac - Flags::B;

would do the right thing. The equivalent code in C would not.
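To see why plain subtraction misbehaves, here is a small sketch that uses raw u8 masks rather than the bitflags crate. The constants mirror the example above; the set_difference helper is a name I made up for illustration and is not part of any crate.

```rust
const A: u8 = 0b001;
const B: u8 = 0b010;
const C: u8 = 0b100;
const ABC: u8 = A | B | C;

/// Clears every bit of `remove` from `flags` (set difference).
fn set_difference(flags: u8, remove: u8) -> u8 {
    flags & !remove
}

fn main() {
    // Removing B once happens to work with integer subtraction.
    let ac = ABC - B;
    assert_eq!(ac, A | C);

    // Removing B a second time corrupts the mask:
    // 0b101 - 0b010 = 0b011, which is A | B.
    assert_eq!(ac - B, A | B);

    // Set difference leaves the mask alone because B is already absent.
    assert_eq!(set_difference(ac, B), A | C);
}
```

The second subtraction is the C behaviour I was warning about: the result silently gains B and loses C.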

byteorder

This simple crate allows reading and writing values in little-endian or big-endian byte order. The standard library does have some support with the to_le_bytes et al. family of functions on the integer primitive types so a lot of this crate is redundant now. Where this crate is useful is with implementing the Read and Write interfaces.

If you're wondering what endian means, it refers to how computers store numeric values that require more than a byte to store. For example, a u32 takes 4 bytes of storage. There are 2 conventional ways of storing this value. You could put the high bytes first (big-endian) or the low bytes first (little-endian) into memory. So for example, the number 42 could be stored as the bytes 42, 0, 0, 0 or the bytes 0, 0, 0, 42. Most modern CPUs, by default, use the former, which is little-endian. However, data that goes over the network is usually big-endian. So these routines are critical for putting the data in the correct form. There is also a third convention called native-endian that is either little or big depending on the CPU's preferred form.
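The standard library functions mentioned above make the two layouts easy to see. This is a small sketch using only std, not the byteorder crate:

```rust
fn main() {
    let n: u32 = 42;

    // Low byte first: little-endian.
    assert_eq!(n.to_le_bytes(), [42, 0, 0, 0]);

    // High byte first: big-endian, the usual network order.
    assert_eq!(n.to_be_bytes(), [0, 0, 0, 42]);

    // Reading a value back from big-endian bytes.
    let from_network = u32::from_be_bytes([0, 0, 0, 42]);
    assert_eq!(from_network, 42);

    // Native-endian depends on the CPU this runs on.
    println!("native layout: {:?}", n.to_ne_bytes());
}
```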

git2

This crate offers bindings over the C-based libgit2 library. exa uses this to implement its .gitignore support. This is a large crate and way beyond the scope of this article.

glob

One of the most obvious jobs of exa is to iterate over all the files in a directory. glob does this with the bonus that you can use the wildcards * and **. It provides a function glob that takes a file pattern and gives back an iterator returning paths. For example:

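// Read in all the markdown articles under all folders in my blogs folder.
// Note: glob does not expand `~`, so the pattern here is relative to the
// current directory.
use glob::glob;

for entry in glob("blogs/**/*.md").unwrap() {
    match entry {
        // path is a PathBuf, which needs display() to be printed
        Ok(path) => println!("{}", path.display()),

        // e is a GlobError
        Err(e) => eprintln!("{}", e),
    }
}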

Of course, being an iterator, you can run it through all the iteration operators, such as filter, map etc. But there is no need to sort since paths are yielded in alphabetical order.

lazy_static

Now this is a crate that is often used. Normally, static variables cannot have run-time calculated values. Try this:

fn main() {
    println!("Number = {}", MY_STATIC);
}

static MY_STATIC: u32 = foo();

fn foo() -> u32 {
    42
}

You will be greeted with the error:

error[E0015]: calls in statics are limited to constant functions, tuple structs and tuple variants
 --> src/main.rs:5:25
  |
5 | static MY_STATIC: u32 = foo();
  |                         ^^^^^

Using lazy_static it becomes:

use lazy_static::lazy_static;

fn main() {
    println!("Number = {}", *MY_STATIC);
}

lazy_static! {
    static ref MY_STATIC: u32 = foo();
}

fn foo() -> u32 {
    42
}

There are three main changes to the code. Firstly, there's the lazy_static! macro to wrap the static declarations. Secondly, there's an added ref keyword. The static that the macro generates is not a u32 itself but a wrapper that initialises the value on first use and hands out a reference to it through the Deref trait. This means that thirdly, I had to dereference it so that the Display trait could be found for u32. In Rust, Deref is not applied when looking for trait implementations, so I had to do it manually.
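For comparison, newer versions of the standard library can do lazy initialisation without a macro. This is a minimal sketch, assuming Rust 1.70 or later for std::sync::OnceLock; the accessor function my_static is a name I made up for illustration.

```rust
use std::sync::OnceLock;

fn foo() -> u32 {
    42
}

/// Returns the lazily initialised value; `foo` runs at most once,
/// the first time this function is called.
fn my_static() -> &'static u32 {
    static MY_STATIC: OnceLock<u32> = OnceLock::new();
    MY_STATIC.get_or_init(foo)
}

fn main() {
    println!("Number = {}", my_static());
}
```

Because the accessor returns a plain &u32, no manual dereference is needed for printing. From Rust 1.80 there is also std::sync::LazyLock, which is closer in spirit to lazy_static.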

libc

There is a vast sea of C code out there implementing many useful libraries. To speed up Rust's adoption, it was important that these libraries did not all have to be rewritten in Rust. Fortunately, the designers of Rust realised that and made it easy to interoperate with C. libc provides more support for interoperating with C code. It adds type definitions (like c_int), constants and function declarations for standard C functions (e.g. malloc).

locale

This crate is documented as mostly useless as it is being rewritten for its version 0.3. It provides information on how to format numbers and time.

use locale::*;

fn main() {
    let mut l = user_locale_factory();

    let numeric_locale = l.get_numeric().unwrap();
    println!(
        "Numbers: decimal sep: {} thousands sep: {}",
        numeric_locale.decimal_sep, numeric_locale.thousands_sep
    );

    let time = l.get_time().unwrap();
    println!("Time:");
    println!(
        "  January: Long: {}, Short: {}",
        time.long_month_name(0),
        time.short_month_name(0)
    );
    println!(
        "  Monday: Long: {}, Short: {}",
        time.long_day_name(1),
        time.short_day_name(1)
    );
}

This outputs:

Numbers: decimal sep: . thousands sep: ,
Time:
  January: Long: January, Short: Jan
  Monday: Long: Mon, Short: Mon

Mmmm... there seems to be a bug at the time of writing. Surely time.long_day_name(1) should return Monday and not Mon. Whether this is an operating system issue or a problem with locale, I am not sure.

log

This crate provides an interface for logging. The user is expected to provide the implementation of the logger. This can be done through other crates such as env_logger, simple_logger and a few others. log is not used directly by exa itself, but rather by some of its dependencies.

Essentially, it provides a few macros such as error! and warn! to pass formatted messages to a logger. There are multiple levels of logging and they range from trace! to error! in order of rising priority:

  • trace!
  • debug!
  • info!
  • warn!
  • error!

I think this crate is missing a fatal! as most logging systems have such a level.

Log messages can be filtered by level. The maximum level acts as a threshold: set it to Error and only error messages get through; set it to Trace and every message gets through. This is set using the set_max_level function. By default, it is set to Off and no messages are sent to loggers. Levels can also be capped at compile-time using various features. All of this is described in the documentation.

Loggers implement the Log trait and users install them by calling the set_logger function.

How should you use the logging levels? Below I provide some opinionated guidance:

Trace

Very fine-grained information is provided at this level. This is very verbose and high traffic. You could use this to annotate each step of an algorithm. Or for logging function parameters.

Debug

Meant for everyday development and diagnosing issues. You should rarely submit code that outputs to debug level. At the very least it shouldn't output in release builds.

Info

The standard logging level for describing changes in application state. For example, logging that a user has been created. This should be purely informative and not contain anything that needs acting upon.

Warn

This indicates that something unexpected happened in the application. However, the application has not failed and work can continue. A warning could be, for example, a missing file that the application tried to load but does not strictly need in order to run (e.g. a configuration file).

Error

Something bad happened to stop the application from performing a task.

matches

Allows a check to see if an expression matches a Rust pattern via a macro:

// Macro version
let b = matches!(my_expr, Foo::A(_));

// is the same as:
let b = match my_expr {
    Foo::A(_) => true,
    _ => false,
};

// or even:
let b = if let Foo::A(_) = my_expr { true } else { false };

It also provides assert versions. It is just a small convenience crate.
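Since Rust 1.42 the standard library has shipped a matches! macro of its own, so new code can often do without the crate. A minimal sketch; the Foo enum and the helper functions are made up for illustration:

```rust
enum Foo {
    A(u32),
    B,
}

/// True for any `Foo::A`, whatever it carries.
fn is_a(value: &Foo) -> bool {
    matches!(value, Foo::A(_))
}

/// The macro also accepts a guard after the pattern.
fn is_big_a(value: &Foo) -> bool {
    matches!(value, Foo::A(n) if *n > 10)
}

fn main() {
    assert!(is_a(&Foo::A(1)));
    assert!(!is_a(&Foo::B));
    assert!(is_big_a(&Foo::A(42)));
    assert!(!is_big_a(&Foo::A(1)));
}
```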

natord

If you were to sort these strings using the normal method: ["foo169", "foo42", "foo2"], you would get the sequence ["foo169", "foo2", "foo42"]. This might not be the order you would prefer. What you might want is sometimes referred to as natural ordering. You might prefer the order ["foo2", "foo42", "foo169"] where the numbers in the strings increase in value and not by ASCII ordering.

This functionality is what natord provides. Natural ordering can work well for filenames with numbering inside them and IP addresses to name just a couple. For example, natural ordering will handle strings where the numbers are in the middle (e.g. "myfile42.txt") and not just at the end.

I am sure I will make use of this crate at some point in the future.
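To make the idea of natural ordering concrete, here is a rough sketch of such a comparison using only the standard library. It is not natord's implementation or API: natural_cmp and split_digits are names I made up, and the sketch only treats ASCII digits as numbers.

```rust
use std::cmp::Ordering;

/// Splits off the leading run of ASCII digits (possibly empty) from `s`.
fn split_digits(s: &str) -> (&str, &str) {
    let end = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    s.split_at(end)
}

/// Compares two strings, treating runs of digits as whole numbers.
fn natural_cmp(a: &str, b: &str) -> Ordering {
    let (mut a, mut b) = (a, b);
    loop {
        match (a.chars().next(), b.chars().next()) {
            (None, None) => return Ordering::Equal,
            (None, Some(_)) => return Ordering::Less,
            (Some(_), None) => return Ordering::Greater,
            (Some(x), Some(y)) if x.is_ascii_digit() && y.is_ascii_digit() => {
                let (digits_a, rest_a) = split_digits(a);
                let (digits_b, rest_b) = split_digits(b);
                // Compare as numbers without parsing: ignore leading
                // zeros, then the longer run of digits is the larger.
                let na = digits_a.trim_start_matches('0');
                let nb = digits_b.trim_start_matches('0');
                let ord = na.len().cmp(&nb.len()).then_with(|| na.cmp(nb));
                if ord != Ordering::Equal {
                    return ord;
                }
                a = rest_a;
                b = rest_b;
            }
            (Some(x), Some(y)) => {
                if x != y {
                    return x.cmp(&y);
                }
                a = &a[x.len_utf8()..];
                b = &b[y.len_utf8()..];
            }
        }
    }
}

fn main() {
    let mut names = vec!["foo169", "foo42", "foo2"];
    names.sort_by(|a, b| natural_cmp(a, b));
    assert_eq!(names, ["foo2", "foo42", "foo169"]);
}
```

Comparing digit runs by length avoids overflow on very long numbers, which parsing into an integer would not.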

num_cpus

Short and sweet this one. It provides a get() function to obtain the number of logical cores on the running system, and get_physical() function to obtain the number of physical cores. Very useful if you want to set up a worker thread pool.
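The standard library has since gained a rough counterpart to get(). This is a sketch, assuming Rust 1.59 or later; worker_count is a name I made up, and as far as I know std has no equivalent of get_physical().

```rust
use std::thread;

/// Number of threads worth spawning, falling back to 1 if the
/// system cannot report its available parallelism.
fn worker_count() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

fn main() {
    println!("Using {} worker threads", worker_count());
}
```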

number_prefix

This crate determines the prefix of numerical units. Given a number, it determines whether there should be a prefix such as "kilo", "mega", "K", "M" etc., or not. It can even handle prefixes that describe binary multipliers such as 1024, something that many programmers will appreciate, but a prefix like K will not be used. Rather Ki will be used.

A prefix can also be converted into its full name. For example, the prefix for 1000 would be k, but call upper() on it and you will get KILO, call lower() for kilo and caps() for Kilo.

This is very useful for listing file lengths as exa is required to do.
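The idea of picking a binary prefix can be sketched in a few lines of std-only Rust. This is not the number_prefix API: binary_prefix and its prefix table are made up for illustration.

```rust
/// Scales `bytes` down by 1024 until it fits under the next prefix,
/// returning the scaled value and the prefix symbol to print with it.
fn binary_prefix(bytes: u64) -> (f64, &'static str) {
    const PREFIXES: [&str; 5] = ["", "Ki", "Mi", "Gi", "Ti"];

    let mut value = bytes as f64;
    let mut index = 0;
    while value >= 1024.0 && index < PREFIXES.len() - 1 {
        value /= 1024.0;
        index += 1;
    }
    (value, PREFIXES[index])
}

fn main() {
    for bytes in [512_u64, 1536, 5 * 1024 * 1024] {
        let (value, prefix) = binary_prefix(bytes);
        println!("{} bytes = {:.1} {}B", bytes, value, prefix);
    }
}
```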

pad

This crate is used for padding strings at run-time. The standard library format! macro can do a lot of padding but this crate can do more.

For example, it can add spaces at the front to right-align a string within a set field width:

use pad::{Alignment, PadStr};

let s = "to the right!".pad_to_width_with_alignment(20, Alignment::Right);

Neat huh?

You don't have to pad with spaces either. It can use any character you wish. It will also truncate strings that are too long for a particular width.

If you need to format the textual output on a terminal, you may need to look this crate up.
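For comparison, here is what the standard format! macro manages on its own. The helper names are made up for illustration. Note that the fill character has to be written into the format string, so it cannot be chosen at run-time this way.

```rust
/// Right-aligns `s` in a field of `width` characters using spaces.
fn right_align(s: &str, width: usize) -> String {
    format!("{:>width$}", s, width = width)
}

/// Left-aligns `s` and fills the rest of the field with `*`.
fn pad_with_stars(s: &str, width: usize) -> String {
    format!("{:*<width$}", s, width = width)
}

fn main() {
    assert_eq!(right_align("abc", 6), "   abc");
    assert_eq!(pad_with_stars("abc", 6), "abc***");

    // Strings longer than the width are left as they are, not truncated.
    assert_eq!(right_align("abcdefgh", 6), "abcdefgh");
}
```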

scoped_threadpool

This can produce a struct that manages a pool of threads. These threads can be used to run closures that can access variables in the original scope. This is useful because normally threads can only access values that have a 'static lifetime or that are entirely owned by the thread.

Let us look at the example in the documentation.

use scoped_threadpool::Pool;

fn main() {
    // We create a pool of 4 threads to be utilised later.
    let mut pool = Pool::new(4);

    // Some data we can do work on and reference to.
    let mut vec = vec![0, 1, 2, 3, 4, 5, 6, 7];

    // Use the pool of threads and give them access to the scope of `main`.
    // `vec` is guaranteed to have a lifetime longer than the work we
    // will do on the threads.
    pool.scoped(|scope| {
        // Now we grab a reference to each element in the vector.
        // Remember `vec` is still around during `pool.scoped()`.
        for e in &mut vec {
            // Do some work on the threads - we move the reference
            // in as it's mutably borrowed.  We still cannot mutably
            // borrow a reference in the for loop and the thread.
            scope.execute(move || {
                // Mutate the elements.  This is allowed as the lifetime
                // of the element is guaranteed to be longer than the
                // work we're doing on the thread.
                *e += 1;
            });
        }
    });
}

This crate allows us to run work on threads that we know will not outlive other variables in the same scope. pool.scoped must block until all work is done to allow this to happen.

This is very useful for quickly doing short-lived jobs in parallel.
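The standard library now has scoped threads of its own. This is a sketch of the same increment job, assuming Rust 1.63 or later for std::thread::scope. It spawns one thread per element rather than reusing a pool, so it is not a replacement for scoped_threadpool; increment_all is a name I made up.

```rust
use std::thread;

/// Adds one to every element, each on its own scoped thread.
fn increment_all(values: &mut [i32]) {
    // `thread::scope` blocks until every thread spawned inside it has
    // finished, which is what makes borrowing `values` sound.
    thread::scope(|scope| {
        for e in values.iter_mut() {
            scope.spawn(move || {
                *e += 1;
            });
        }
    });
}

fn main() {
    let mut vec = vec![0, 1, 2, 3, 4, 5, 6, 7];
    increment_all(&mut vec);
    assert_eq!(vec, [1, 2, 3, 4, 5, 6, 7, 8]);
}
```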

term_grid

To produce standard ls output, exa must show filenames in a grid formation. Given a width, this crate provides a function fit_into_width that can help to produce a grid of strings. By working out the longest string in a collection, it can calculate how many of those strings can fit in a horizontal space, such as a line on a terminal.
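The arithmetic behind that can be sketched as follows. This is my own simplification and not term_grid's algorithm or API: it assumes every column is as wide as the longest string, and it counts chars rather than true display width. The columns_that_fit function and its gap parameter are made up for illustration.

```rust
/// How many equal-width columns fit into `terminal_width`, with `gap`
/// spaces between neighbouring columns. Always returns at least 1.
fn columns_that_fit(cells: &[&str], terminal_width: usize, gap: usize) -> usize {
    let longest = cells.iter().map(|c| c.chars().count()).max().unwrap_or(0);
    if longest + gap == 0 {
        return 1;
    }
    // n columns need n * longest + (n - 1) * gap characters, which
    // rearranges to n <= (width + gap) / (longest + gap).
    ((terminal_width + gap) / (longest + gap)).max(1)
}

fn main() {
    let cells = ["a.txt", "readme.md", "src"];
    let columns = columns_that_fit(&cells, 40, 2);
    println!("{} columns", columns);
}
```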

term_size

Very simple but crucial if you want to provide textual output on a terminal in a highly formatted way. Rather than using ANSI control codes, it asks the operating system for the size of your terminal view (on Unix-like systems this is an ioctl call on the terminal).

let (width, height) = match term_size::dimensions() {
    Some((width, height)) => (width, height),
    None => (80, 25),
};

The output may not be attached to a terminal at all (for example, when it is redirected to a file) and so the result of dimensions() is an Option<(usize, usize)>.

tinyvec

This mainly provides 2 vector types: ArrayVec and TinyVec.

ArrayVec is a safe array-backed drop-in replacement for Vec. This means that a fixed array is allocated to act as storage for ArrayVec. In every other way, ArrayVec acts like a Vec but it cannot reallocate that storage. If you push more elements than the array can hold, a panic will occur.

This is very useful to avoid reallocations all the time. ArrayVec even allows access to the unused space in the backing array that is not currently being used for elements.

Because ArrayVec uses a fixed array, a heap allocation does not occur.

TinyVec is a drop-in replacement for Vec too. However, for small arrays, it starts life as an ArrayVec using a fixed array as the backing store. As soon as it becomes too big, it will automatically switch to using a normal Vec, hence using the heap to store the array.

This requires the alloc feature to be activated.

It is basically an enum that can be an ArrayVec or a Vec:

enum TinyVec<A: Array> {
    Inline(ArrayVec<A>),
    Heap(Vec<A::Item>),
}

You have got to love algebraic types! You have to provide the backing store type and a macro helps you do this:

let my_vec = tiny_vec!([u8; 16] => 1, 2, 3);

This example can only store the first 16 elements on the stack before it switches to a more regular Vec<u8>.
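The switch from inline storage to the heap can be illustrated with a toy version of the same enum. This is a hypothetical miniature, fixed to u8 items and an inline capacity of 4; MiniVec and its methods are my own names and not tinyvec's API.

```rust
/// Toy version of the inline-then-heap idea.
enum MiniVec {
    Inline { data: [u8; 4], len: usize },
    Heap(Vec<u8>),
}

impl MiniVec {
    fn new() -> Self {
        MiniVec::Inline { data: [0; 4], len: 0 }
    }

    fn push(&mut self, value: u8) {
        match self {
            MiniVec::Heap(vec) => vec.push(value),
            MiniVec::Inline { data, len } => {
                if *len < data.len() {
                    // Still room in the fixed array.
                    data[*len] = value;
                    *len += 1;
                } else {
                    // Inline storage is full: copy to the heap and switch.
                    let mut spilled = data[..*len].to_vec();
                    spilled.push(value);
                    *self = MiniVec::Heap(spilled);
                }
            }
        }
    }

    fn as_slice(&self) -> &[u8] {
        match self {
            MiniVec::Inline { data, len } => &data[..*len],
            MiniVec::Heap(vec) => vec.as_slice(),
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, MiniVec::Inline { .. })
    }
}

fn main() {
    let mut v = MiniVec::new();
    for i in 0..4u8 {
        v.push(i);
    }
    assert!(v.is_inline());

    // The fifth element no longer fits inline.
    v.push(4);
    assert!(!v.is_inline());
    assert_eq!(v.as_slice(), &[0u8, 1, 2, 3, 4][..]);
}
```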

unicode-width

Unicode characters can be wide, and on the terminal, it's important to know how many columns a Unicode character or string might take up. There is a standard for working out that information and it is called Unicode Standard Annex #11.

This is very useful when displaying filenames with international characters, which is exactly what exa needs to deal with. Fortunately, there is a crate that can provide that information.

The documentation for unicode-width does remind us that the character width may not match the rendered width. So be careful but I don't think there is much you can do in these situations.

url

A typical URL can encode a lot of information including:

  • the protocol scheme (e.g. http)
  • the host (e.g. docs.rs)
  • a port number (e.g. the number in http://docs.rs:8080)
  • a username and password
  • a path (e.g. foo/bar in the URL http://docs.rs/foo/bar)
  • a query (e.g. everything after the ? in myurl.com/foo?var=value)

A lot of these are optional and so a URL parsing library will need to handle this too.

This crate provides a Url::parse function that constructs an object describing the various parts of a URL. Various methods can then be called to provide string slices into the URL such as scheme() or host().

This crate also provides serde support, so if you know what that is, you will understand what this means. Maybe I will write about serde in a future article.

users

This is a very Unixy thing and so is useful on macOS too. To handle various permissions in a Unix operating system, a process is assigned an effective user ID. This user ID and the various group IDs the user belongs to determine the permissions a process has within the operating system, including file and directory access.

The users crate provides this functionality by wrapping the standard C library functions that can obtain user and group information.

If you have trace level logging on, this crate will log all interactions with the system.

This crate also has a mocking feature to use pretend users and groups that you can set up for purposes of development. You do not necessarily want to access real IDs.

Even though this is a very Unixy thing, Windows does have similar permission features, so this crate might work on Windows too. I haven't tested this and the documentation does not say so, but I have seen exa work on Windows so I am assuming.

zoneinfo_compiled

This crate seems to get its data directly from the operating system via zoneinfo files. This data allows you to obtain information on leap seconds and daylight saving time for your current time zone.

This information is maintained by a single person and is distributed and stored on your hard disk if you use a Unix-based system. You can find this data in the /usr/share/zoneinfo folder. Each timezone has a binary file and this crate can parse this file and extract the information it holds.

The whole proper handling of time and time zones is beyond me at this moment in time. So I am not sure how exa would be using this crate. It's a very complex topic and something I would love to dig deeper into, hopefully without going insane at the same time.

Another name for the database is the tz database and you can find more information about it, if you so desire, at Wikipedia.

Conclusion

I hope you enjoyed this little trip into a few crates used by a single project. I encourage all of you to try this simple exercise. I learnt a lot by researching for this article and I still feel I have not even scratched the surface.

Please write in the discussion below about interesting crates that you have found and used. I would love to hear about them.

Until next time!
