Monoids

Let’s start with monoids. Monoids are awesome because they are extremely general, yet they’re also very simple, and there’s an easy intuition to guide your understanding. And even though the interface and properties a monoid gives you are very simple, there are some very useful things you can do with any type just by virtue of its being a monoid.

In short, if a data type is a monoid, you can put two things of that type together and always get a new instance of that type.

With a few restrictions. One is that the order of the items you combine can matter, but nesting can’t. The other is that there has to be a neutral element of that type for that operation, so that combining anything with the neutral element returns the thing.

Getting more formally acquainted.

Formally[1], a data type t with a binary operation * is a monoid iff:

  1. For any two instances a and b of type t, a * b is of type t. This is the rule of closure.
  2. There is an instance i of type t such that for any a of type t, i*a = a*i = a. This is the rule of identity.
  3. For any a, b, and c of type t, (a*b)*c = a*(b*c). This is the rule of associativity.

I will refer to the associativity and identity rules as the monoid laws.

Lots of things are monoids. For real numbers, addition and multiplication are monoids. Boolean and and or are both monoids. String and Array concatenation are both monoids. Set union is a monoid. I’ll come back to these and more later, but first, I need to lay down some code foundations.

Doing it in JavaScript

Monoids provide an interface for data types to implement, with a single binary operator and a distinguished identity instance. JavaScript doesn’t really have explicit interfaces or a rich set of data types. So I’m going to model data types the traditional way they’re modeled in JavaScript, with constructor functions and prototypes. And I’m going to make Monoid a function that takes a constructor function and installs a specific interface on that function and its prototype. Here it is[2]:

function Monoid(type, props) {
  var proto = type.prototype;

  proto.id  = props.id;
  proto.dot = props.dot;
  proto._   = proto.dot; //_ alias because js doesn't allow user defined operators

  type.dot = function(a, b) { return a._(b); }
  type.id  = proto.id;

  //optional properties. Used for quick checking laws
  proto.eq  = props.eq;
  type.eq  = function(a, b) { return a.eq(b) };
  type.arb = props.arb;

  type.laws = {
    left_identity:  function left_identity(a) { return type.id._(a).eq(a); },
    right_identity: function right_identity(a){ return a._(type.id).eq(a); },
    associativity:  function associativity(a, b, c){ return a._(b._(c)).eq( (a._(b))._(c) ); }
  };

  Monoid.known.push(type);
  Monoid.byName[type.name] = type;
}

We take a constructor and an object props containing the definitions for the monoid’s interface. props must minimally have a dot and an id property, which get added to the type’s prototype, with static versions added to the type itself.

There are two optional properties, eq and arb. eq tests instances for equality, and arb generates an arbitrary instance of the monoid. The Monoid function then installs the monoid laws as static boolean functions on the type. I’ll have more to say on these properties and their uses in the section on testing.

For convenient introspection, I have the Monoid function itself track all the types that have been declared monoids, with an array known, and a hash of string names to the type constructor functions, byName. We need to initialize these.

Monoid.known = []; //known monoid types.
Monoid.byName = {};//known types by name

And now we can declare some monoids.

Monoid instances[3]:

Numbers are monoids. You can add them.

function Sum (val) { this.val = val; }
Monoid(Sum, {
  id: new Sum(0),
  dot: function(other) { return new Sum(this.val + other.val) },
});
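
To see how this reads in practice, here’s what combining Sums looks like, given the Monoid installer above:

new Sum(3)._(new Sum(4)).val;        //=> 7, instance form
Sum.dot(new Sum(3), new Sum(4)).val; //=> 7, static form
Sum.id.val;                          //=> 0, the identity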

Numbers are also monoids. You can multiply them.

function Product (val) { this.val = val; }
Monoid(Product, {
  id: new Product(1),
  dot: function(other) { return new Product(this.val * other.val) },
});

Numbers can be monoids too, if you take the smaller number each time you put two together.

function Min (val) { this.val = val; }
Monoid(Min, {
  id: new Min(Number.POSITIVE_INFINITY),
  dot: function(other) { return new Min(Math.min(this.val, other.val)) },
});

If you weren’t sure if Numbers were monoids, it turns out they are.

function Max (val) { this.val = val; }
Monoid(Max, {
  id: new Max(Number.NEGATIVE_INFINITY),
  dot: function(other) { return new Max(Math.max(this.val, other.val)) },
});

I want to pause here to reflect on how coding with this monoid API would look. If we want to combine a bunch of numbers in an abstract way, allowing us to swap out monoids, we would have to pass around a monoid variable and wrap every primitive value with new monoid(val). Not pleasant.

There are a few approaches to solving this. The proof-of-concept code for this article uses a few strategies. It has a coercion protocol for combining a wrapped value with a primitive transparently. It has a dotPrimitive method that combines primitive values using the monoid’s dot operator (so Sum.dotPrimitive(3, 4) == 7).

An alternative approach is to build monoids around the binary operator itself. With this sort of API, Monoid would be a constructor that takes a binary operator and an identity, and constructs a new monoid. Something like Sum might be:

sum = new Monoid(function(a, b) {return a + b}, {id: 0});
sum.id;     //=> 0
sum(5, 12); //=> 17

I may refactor my monoid code to use the operator-oriented representation in the future. For now, let’s continue with this data-oriented representation.

Booleans also belong to (at least) two monoids:

function Any (val) { this.val = val; }
Monoid(Any, {
  id: new Any(false),
  dot: function(other) { return new Any(this.val || other.val) },
});

function All (val) { this.val = val; }
Monoid(All, {
  id: new All(true),
  dot: function(other) { return new All(this.val && other.val) },
});

Concatenating Strings and Arrays is a monoidal operation:

Monoid(String, {
  id: '',
  dot: function(other) { return this + other }
});

Monoid(Array, {
  id: [],
  dot: Array.prototype.concat
});
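
With those declarations in place, primitive strings and arrays pick up the interface through their prototypes; a quick check:

'mono'._('id'); //=> 'monoid'
[1, 2]._([3]);  //=> [1, 2, 3]
String.id;      //=> ''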

The set union operator forms a monoid for sets, with the identity being the empty set. Intersection can also be a monoid, with the Universe (the set of all the items under consideration) as the identity. What sorts of sets do we have in JavaScript, and how can we define monoids around them?

The most obvious sets are object properties. We can define a union monoid quite easily.

function ObjectUnion(object) {
  var self = this;
  Object.keys(object).forEach(function(key) {
    self[key] = object[key];
  });
}
Monoid(ObjectUnion, {
  id: new ObjectUnion({}),
  dot: function(other){
    var result = new ObjectUnion({});
    var self = this;
    Object.keys(self).forEach(function(k) { result[k] = self[k]; });
    Object.keys(other).forEach(function(k) { result[k] = other[k]; });
    return result;
  }
});

The dot operator is a naive property union, where the value for a property shared by both objects is resolved by letting the right property win. We’ll see a better object union when I talk about using monoids for logging.
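
For instance, with the right-hand side winning the shared key:

var a = new ObjectUnion({x: 1, y: 2});
var b = new ObjectUnion({y: 3});
a._(b); //=> ObjectUnion {x: 1, y: 3}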

Let’s go back to our intuitions about what a monoid is: it’s a way to combine objects, where the ordering of the objects might matter but the nesting doesn’t, plus a special neutral object that does nothing to whatever you combine it with. When you think of it this way, we can come up with all sorts of ways to put things together. We can put strings together while inserting a string between them, if we allow null and are careful to do nothing with null.

function Line(line) { this.val = line }
Monoid(Line, {
  id: new Line(null),
  dot: function(other) {
    if(other.val == null) return this;
    if(this.val == null) return other;
    return new Line(this.val + "\n" + other.val);
  }
});

Now we can combine strings in a nice line-oriented manner if we want.
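
For example:

new Line('first')._(new Line('second')).val; //=> 'first\nsecond'
new Line('first')._(Line.id).val;            //=> 'first', the null identity does nothing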

We could just always have the first or last thing win, unless it’s undefined.

function First(val) { this.val = val; }
Monoid(First, {
  id: new First(undefined),
  dot: function(other) { if(this.val === undefined){return other} return this; }
});

function Last(val) { this.val = val; }
Monoid(Last, {
  id: new Last(undefined),
  dot: function(other) { if(other.val === undefined){return this} return other; }
});
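
A couple of spot checks:

new First(3)._(new First(5)).val;         //=> 3
new First(undefined)._(new First(5)).val; //=> 5
new Last(3)._(new Last(5)).val;           //=> 5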

And if you want something stupendously useless:

function MaximallyUseless(){}
Monoid(MaximallyUseless, {
  id: new MaximallyUseless(),
  dot: function() { return MaximallyUseless.id; }
});

Generic monoid algorithms.

One thing you can do with any monoid is take a collection of instances and reduce them down to a single value by repeated application of the dot:

function aggregate(array, monoid) {
  return array.reduce(monoid.dot, monoid.id);
}
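
For example, with the Sum monoid from earlier:

aggregate([new Sum(1), new Sum(2), new Sum(3)], Sum).val; //=> 6
aggregate([], Sum).val; //=> 0, courtesy of the identity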

Here I show it for an array, but we can do this for any container type that has an order. So given a tree and a traversal algorithm (say breadth-first or depth-first preorder), if the nodes contain elements of a monoid, we can reduce them down to a single value (which you may hear called the monoidal summary of that tree).

So you can reduce with a monoid operator. That may not be a big deal. But associativity buys us something.

Parallelism

Because the monoidal operation is associative, we can reduce sub-parts of the array to their monoidal summaries in separate threads. We then gather the results from each thread, reduce this smaller collection, and return the result.
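
Here’s a minimal single-process sketch of the idea; chunkedAggregate is a hypothetical helper, and a real parallel version would ship each chunk to a separate worker:

function chunkedAggregate(array, monoid, chunkSize) {
  var partials = [];
  //reduce each chunk to its monoidal summary
  for (var i = 0; i < array.length; i += chunkSize) {
    partials.push(aggregate(array.slice(i, i + chunkSize), monoid));
  }
  //associativity guarantees this equals aggregate(array, monoid)
  return aggregate(partials, monoid);
}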

You can find some proof of concept code in the parallel and worker parts of the monoid code. This project also contains the following test code:

var time = process.hrtime();
var answer = m.aggregatePrimitive(arrPrim, Sum);
var diff = process.hrtime(time);
console.log('aggregate took %d seconds and %d nanoseconds', diff[0], diff[1]);

time = process.hrtime();
agg_p(arrPrim, Sum, 6, function(result) {
  var diff = process.hrtime(time);
  console.log('aggregate_p took %d seconds and %d nanoseconds', diff[0], diff[1]);
  console.log('answer: ' + answer);
  console.log('got:    ' + result);
  process.exit(0);
});

Here aggregatePrimitive(array, monoid) reduces an array of primitive values using monoid.dotPrimitive, while agg_p(array, monoid, num_workers, callback) does the same thing in chunks asynchronously across at most num_workers child processes, collects and reduces the results, and passes it on as the argument to callback.

Unfortunately, a parent process can only communicate with a child process by passing messages, which serializes the message data as a JSON value. So to parallelize the aggregation, I have to slice the array into pieces (a memory allocation, and possibly a copy), serialize each piece into a message with some extra bookkeeping info, receive the message in the worker process, deserialize it, perform the reduce, send another message back to the parent, and put the results together in the parent.

If that sounds like it can’t possibly be faster than doing a simple summation in the parent, you are right. aggregatePrimitive can sum an array of 20,000,000 numbers in about a second on my computer. agg_p took 32 seconds on 3 cores. On the bright side, it took 16 seconds on 6 cores, and produced correct results in all cases.

But the insight here is important: collections of a monoid are good things to reduce. The identity law means we don’t have to worry about the base case of an empty container: the identity gives us a value to use when our collection happens to be empty. The associativity law means we can chunk the task into smaller pieces and recombine those pieces, because grouping doesn’t matter. The “Reduce” part of “MapReduce” implicitly uses monoids.

To get an actual performance boost out of agg_p we’d need a native plugin that could pass a reference, offset, and length for the array to the workers, to save us the memory allocation and serialization that bog down my implementation. Or a library like Intel’s RiverTrail to take care of the hard parts for us.

So parallel reduction must implicitly be using monoids. Why be explicit about it? Testing.

Testing those laws

We have two constraints that apply to all monoids. These are excellent properties to check. If you aren’t familiar with property based testing, check out my primer.

I decided to split the identity law into two properties, left and right identity. Recall that the following properties are added to every monoid you declare:

  type.laws = {
    left_identity:  function left_identity(a) { return type.id._(a).eq(a); },
    right_identity: function right_identity(a){ return a._(type.id).eq(a); },
    associativity:  function associativity(a, b, c){ return a._(b._(c)).eq( (a._(b))._(c) ); }
  };

We can write a simple law checking function that tests that these laws hold:

function check_laws(m) {
  console.log("Testing " + m.name + "'s monoid laws:");
  qc.forAll(m.laws.left_identity,  m.arb);
  qc.forAll(m.laws.right_identity, m.arb);
  qc.forAll(m.laws.associativity,  m.arb, m.arb, m.arb);
  console.log()
}

This function requires that the monoid define arb to generate arbitrary data, and eq to check instances for equality. I’m not going to show arb and eq for most of the monoids we’ve seen so far. arb for the numeric, boolean, and string monoids is just taken from the quickcheck library. eq for most of these is obvious. For numbers, eq is tricky, but I talk about that quite a bit in the property based testing primer, so you can reference that or the project source code on GitHub for the details.

Since we kept track of all the known monoids as we registered them, we can test all the ones fit for testing.

function test_all_monoids() {
  m.Monoid.known
    .filter(function(m) {return m.eq && m.arb;})
    .forEach(check_laws);
}

It’s that simple.

Most of the monoids I’ve shown you are pretty simple. But square matrices of the same size form a monoid under matrix multiplication, with the identity matrix as the identity element. Testing the identity and associativity laws for matrix multiplication might substantially increase our confidence in our implementation of matrix multiplication, especially since highly optimized algorithms for matrix multiplication are not trivial.

But if you’re like me, you’re looking at this code and thinking it could be better. It could collect the results of those tests, instead of just printing them, and pass them back to the caller, so they could be collected with the results of other tests.

It sounds like we’d want a test log object that summarizes our test run. Here’s a few things that I’d like to do with a test log.

  1. Combine two logs into a new log with a summary of both sets of tests.
  2. We may or may not want the combined summary to preserve the order of the sub-tests. We probably don’t need to track sub-test nesting.
  3. Combining an empty log (one from a sub-test that ran no tests) with another test log shouldn’t change that other test log in any way.

This test log should be a monoid!

Logging with Monoids

Here are a few other design requirements for a logging/summary system:

  1. Logs should combine easily. When two logs have summaries of the same thing, they should combine summaries. When they summarize different things, the new log should contain all the different summaries present in both sub-logs.
  2. Code that logs information should not have to know about any other information logged by other code.
  3. Code should not need to be passed a log to properly log its information. It should be able to create its own log to hand off to the code that handles log aggregation.

Logs will simply be objects. Each property of the object is a sub-log, whose value must be in a monoid. When two logs are combined with the dot method, their properties are unioned. The value for each property in the new log is the monoidal dot combination of the values of the two original logs. If one of the original logs has a property that the other doesn’t, then the resulting log’s value is the value from the log containing that property.

function add_log_key(logs, key, value) {
  if( value instanceof Function )
  { //assume we got a constructor
    logs[key] = value.id;
  }
  else
  { //got a starting value
    logs[key] = value;
  }
}

function Log(logs) {
  var self = this;
  Object.keys(logs).forEach(function(key){
    add_log_key(self, key, logs[key]);
  });
}
var proto = Log.prototype;
proto.addLog = function(key, initial) {
  if(this === Log.id){ throw "Tried to mutate Log.id" };
  add_log_key(this, key, initial);
  return this;
}
proto.log = function(values) {
  if(this === Log.id){ throw "Tried to mutate Log.id" };
  var self = this;
  Object.keys(values).forEach(function(key){
    self[key] = self[key]._(values[key]);
  });
  return this;
};

This defines a (non-monoidal) public interface for Log. You add a new log to your Log by invoking the addLog method with the property name you want and a monoidal constructor (which defaults to that monoid’s id) or initial monoidal value. You log new data for a log property with the log method, passing in a plain-old js object whose keys are the log properties you want to add. The values of your logs get updated by monoidal concatenation with the values in your object argument. As the guard conditions indicate, both these methods mutate your log object in place. Don’t try to do it with Log.id. Then Log.id would no longer be an identity for Log.dot, and all hell would break loose.
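
For instance, a hypothetical snippet using the Sum and Any monoids from earlier:

var log = new Log({count: Sum, failed: Any}); //count starts at Sum.id, failed at Any.id
log.log({count: new Sum(1), failed: new Any(false)});
log.addLog('errors', []); //a new Array-valued sub-log, starting at []
log.count.val; //=> 1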

Note also that Object.keys(obj) returns a string array of obj's own keys. These functions will work even if you extend Object.prototype without marking your new properties non-enumerable.

This will be the most complex monoid in this post[4], so we definitely want an arb method and an eq method to test those monoid laws just in case I’ve done something boneheaded with my dot method[5].

function arbObject() {
  var random_prop = String.arb();

  var obj = { a: Sum.arb(), b: All.arb(), stuff: Array.arb() };
  obj[random_prop] = String.arb();
  return obj;
}
function deepEq(other) {
  if(this === other) return true;
  var self = this;

  return Object.keys(this).every(function(key) {
    return other[key] != null && self[key].eq(other[key]);
  });
}
function logDot(other) {
  var self = this;
  var result = {};
  Object.keys(this).forEach(function(k) { result[k] = self[k]; });
  Object.keys(other).forEach(function(k) {
    if(result.hasOwnProperty(k)){
      result[k] = result[k].dot(other[k]);
    }
    else { result[k] = other[k]; }
  });
  return new Log(result);
};
Monoid(Log, {id: new Log({}), dot: logDot, arb: arbObject, eq: deepEq});
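
A quick check that dot behaves as described, using the Sum and Any monoids from earlier:

var l1 = new Log({count: new Sum(1)});
var l2 = new Log({count: new Sum(2), failed: new Any(true)});
var l3 = l1._(l2);
l3.count.val;  //=> 3, the shared property combined with Sum's dot
l3.failed.val; //=> true, the unshared property carried over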

So let’s see this in action. First we need to hack quickcheck’s forAll function:

var m = require('./monoid');
function forAll(property) {
  var
    generators = Array.prototype.slice.call(arguments, 1),
    fn         = function (f) { return f(); },
    log        = new m.Log({
      total_failed: m.Sum,
      failed: m.Any,
    }),
    i,
    values;

  for (i = 0; i < 100; i ++) {
    values = generators.map(fn);

    if (!property.apply(null, values)) {
      console.log("*** Failed!\n" + values);
      log.info = new m.Log({property: [property.name], args: [values]});
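      //the primitives 1 and true combine with Sum and Any via the project's coercion protocol (see above)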
      return log.log({total_failed: 1, failed: true})
    }
  }

  console.log("+++ OK, passed 100 tests.");
  return log;
}

Start with the initialization. We create a log with two properties, total_failed which starts at 0 and sums all the failures, and failed which starts false and flips to true just in case any test fails. If all the tests pass, this initial log is returned unchanged.

The interesting case is when a test fails. First, we add a new property to the log, info, which is itself a log containing property and args. property is an array with the name of the failing property, and args is an array containing an array of the arguments that falsified the property. The nested log allows us to namespace logs. The property and args values are wrapped in arrays so that when we combine two logs summarizing failing tests, the properties and arguments that failed will be concatenated together, while each failing property will be at the same index as its falsifying arguments.
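
To make that concrete, here’s how two hypothetical failure logs combine:

var f1 = new m.Log({property: ['left_identity'], args: [[1]]});
var f2 = new m.Log({property: ['associativity'], args: [[2, 3, 4]]});
var both = f1._(f2);
both.property; //=> ['left_identity', 'associativity']
both.args;     //=> [[1], [2, 3, 4]]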

Let’s see the code for combining logs. Here’s a modified check_laws:

function check_laws(m) {
  var log;
  console.log("Testing " + m.name + "'s monoid laws:");
  log = qc.forAll(m.laws.left_identity,  m.arb);
  log = log._( qc.forAll(m.laws.right_identity, m.arb) );
  log = log._( qc.forAll(m.laws.associativity,  m.arb, m.arb, m.arb) );
  if(log.failed.val){
    log.addLog('groups', [{name: m.name, info: log.info}]);
  };
  console.log()
  return log;
}

Now we combine the information from the three tests with Log's dot method (using its _ alias here). Because I’d like to keep the information about failed tests for separate monoids separate, I define a new groups log property that puts the failing monoid’s name and its failure info together into an array.

Now when we test all our monoids, all we have to do is combine the resulting logs and format the information nicely for the console.

function test_all_monoids() {
  var results = m.aggregate(
    m.Monoid.known
      .filter(function(m) {return m.eq && m.arb;})
      .map(check_laws)
  );

  if(results.failed.val) {
    console.log( results.total_failed.val + ' test(s) failed:');
    results['groups'].forEach(function(group){
      console.log('  '+group.name+' failed:');
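      //zipWith(fn, xs, ys) pairs corresponding elements; assumed to be a small helper from the project code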
      zipWith(
        function(prop, args){ return '    ' +prop + ': ' + args.join(','); },
        group.info.property, group.info.args
      ).forEach(function(msg) { console.log(msg)});
      console.log();
    })
  }
  else {
    console.log('All tests passed.');
  }
  return results;
}

Now, if I tweak Product to divide numbers and All.id to false, I get the following message.

4 test(s) failed:
  Product failed:
    left_identity: 709.1053305605122
    associativity: 709.7295254795905,708.1961540475892,709.2243852080616

  All failed:
    left_identity: true
    right_identity: true

That’s much nicer.

I managed to collect and manipulate test result data with very little violence to my original code. Individual tests can just generate their own result data. Test runners can aggregate those results easily. In a larger example, very different parts of an application can generate logs with overlapping and orthogonal data. The only logging knowledge that has to leak from one component of the application to another is namespacing. The different pieces must put the information with the same meaning under the same namespace in the log, and information with distinct meanings under different names. That’s some pretty nice decoupling of data collecting concerns across an app.

Conclusion

Monoids land in a nice sweet spot for programming. They’re rich enough to be useful, but simple enough to be everywhere. I think parallelizing reduce and logging summaries are some of the killer applications of monoids in programming, but people have found many other uses as well. Cabal and Xmonad both use monoids to manage configuration and combine configs from files, defaults, and command line options in a sensible way. Finger trees are another interesting application of monoids. Their abstractness allows you to swap monoids easily, radically changing the behavior of the data structure’s basic operations.

Further Reading

Haskell Monoids on Wikibooks
Haskell Monoids and their Uses over at A Neighborhood of Infinity.
Fast incremental regular expression matching with monoids, also from A Neighborhood of Infinity.

Footnotes

[1] Typically, monoids are characterized in terms of a set, rather than a data type. You can think of a type as a set containing all the possible instances of that type. See the Wikipedia page for the formal definition. (back)
[2] This is a simplified version of a more featureful implementation in the node-algebraic GitHub project. If this project seems useful to you, please let me know. It’s currently just proof-of-concept code, and poorly tested, but that can change quickly if there’s interest.(back)
[3] More featureful implementations of most of these, and a few other monoids, can be found in the instances part of that GitHub project. (back)
[4] The Log monoid is inspired by Haskell’s various tuple instances of Monoid. JavaScript’s dynamic type system allows this to be much more flexible. To get something similar in Haskell, you would have to declare a record type whose fields were the individual logs. The benefit of the static type is that accidental name collision would be much less likely to occur. But that one data type would have to know all the types of information logged in the various parts of the system, and it would have to change any time a new type of information was collected.(back)
[5] Of course, I did. I forgot to declare result with var, so nested logs crashed and burned horribly. Unfortunately my arb function doesn’t generate nested logs, so all my tests passed the quickcheck. It’s important to generate test data of at least the complexity found in your app’s data if you want quickcheck to really give you confidence in your code. (back)

Property Based Testing Primer

Property based testing is a testing method made popular by the Haskell QuickCheck[1] library. Proponents of property based testing might argue that xUnit frameworks don’t automate enough of the testing task. They leave the programmer to manually generate test data.

Property based testing automates test data generation. The technique works like this:

  1. The programmer states a property, which is true for any data of the appropriate type.
  2. The programmer supplies a way to generate arbitrary data of the appropriate type.
  3. The testing framework generates hundreds of arbitrary values to test the property. If any of the generated values falsifies the property, it reports that value. Otherwise it reports the number of successful attempts it made.

Let me illustrate with some examples. I’ll be using the node-quickcheck library. It’s an extremely simple implementation that’s missing some key features[2], but it works well for demonstration purposes, and the source code is very easy to understand.

Here’s a simple property of string concatenation: the length of the concatenation of two strings is equal to the sum of the lengths of those strings. At a node console:

> function str_concat_prop(a, b) { return (a+b).length === a.length + b.length }
> qc = require('quickcheck')
> qc.forAll(str_concat_prop, qc.arbString, qc.arbString);
+++ OK, passed 100 tests.

And here’s what a failing test looks like:

> function propertyEven(x) { return x % 2 == 0; }
> qc.forAll(propertyEven, qc.arbByte);
*** Failed!
11

Abstract Math and Property Based Testing

A blog whose stated purpose is to argue the benefit of modelling abstract concepts directly first detours into a post on property based testing. Why?

The abstractions I’m going to model come with properties. One large benefit for directly modeling them is automated test generation for these properties on all the concrete instances of these abstractions. I’ll be testing the associativity property of addition below, which is a special case of a general property of all monoidal operators. When I declare a type a Monoid, in addition to adding a generic interface to that type, I can also get a suite of tests for very little work.

The API

Most QuickCheck ports for dynamic languages contain two important things: a checker and a variety of arbitrary data generators. In the previous code, the checker is forAll, and arbString and arbByte were used to generate arbitrary strings and bytes.

forAll takes a property as its first argument. The rest of the arguments are generator functions, which will be invoked with no arguments to get arbitrary values that will then be used as arguments for the property.

This library also offers a few generators. arbByte gives a uniformly distributed random integer in the range [0, 255]. arbChar gives a character, arbString a string, and arbBool does a fair coin flip. It also has a generator factory for arrays, arbArray, which takes a generator and returns a generator that yields arrays of random length from 0 to 100 whose elements are created by the supplied generator. So arbArray(arbByte) will yield successive arrays of between 0 and 100 bytes.
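
For example (the output array shown is illustrative):

> bytes = qc.arbArray(qc.arbByte)
> bytes()
[ 207, 14, 88 ]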

There are two more generator functions, but they have some unexpected properties. They make a good example of property based testing.

Numbers and Equality

arbInt and arbDouble are especially interesting generators provided by the qc library. Let’s use them to test associativity of addition.

> function prop_sum_associativity(a, b, c) { return a + (b+c) === (a+b) + c }
> qc.forAll(prop_sum_associativity, qc.arbDouble, qc.arbDouble, qc.arbDouble);
*** Failed!
-1.117531802955579e+308,1.101464035399532e+308,-3.0497536018773165e+307
> qc.forAll(prop_sum_associativity, qc.arbByte, qc.arbByte, qc.arbByte);
+++ OK, passed 100 tests.

Many programmers know a few things about floating point imprecision, and regard it as a bug to compare floats for equality. The arbitrary bytes all pass with addition, just as we’d expect, since there are no imprecision issues with adding integers of such small magnitude. But there’s more going on here. Let’s try out a simple check for approximate equality.

> Number.eqDelta = 1e-8
> Number.prototype.eq = function(b) { return Math.abs(this - b) < Number.eqDelta }
> function prop_sum_associativity(a, b, c) { return (a+(b+c)).eq((a+b)+c) }
> qc.forAll(prop_sum_associativity, qc.arbDouble, qc.arbDouble, qc.arbDouble);
*** Failed!
-9.978852290214801e+307,1.3645480265786847e+308,-5.20595870142343e+307

Hmm. Our naive check for approximate equality doesn’t work very well for high magnitude values. At those magnitudes the gap between adjacent representable doubles dwarfs our fixed delta, even when the value represented is an integer. Let’s give this another try:

> Number.prototype.eq = function(b) { return Math.abs(this/b - 1) < Number.eqDelta }
> qc.forAll(prop_sum_associativity, qc.arbDouble, qc.arbDouble, qc.arbDouble);
*** Failed!
-1.6821833463461643e+308,-1.7370330545041583e+307,1.0744120916385978e+308

What happened here? Maybe we should check this sum.

> function sum(a, b, c) { return a + b + c; }
> sum(-1.6821833463461643e+308,-1.7370330545041583e+307,1.0744120916385978e+308)
-Infinity

The addition overflowed to -Infinity when we added the first two numbers first. Then we added a positive number to -Infinity, which is still -Infinity. But if you add the second two numbers first, and then add the first to that result, you get no overflow.

> -1.6821833463461643e+308 + (-1.7370330545041583e+307 + 1.0744120916385978e+308)
-7.814745601579823e+307

So at the high magnitude boundary cases, floating point addition just isn’t associative. This isn’t a bug in floating point arithmetic. It’s the most reasonable thing that addition could do without casting to a higher precision data type. But it is a bug for your code to use a naive equality comparison, or even an approximate one at extreme values, and still expect associativity to hold.

But you may be thinking, why does arbDouble only ever seem to generate such extreme numbers? These aren’t the sorts of doubles one encounters in typical applications. The answer is that it samples doubles from a uniform distribution ranging from -Number.MAX_VALUE to Number.MAX_VALUE. If you generate a random number between 0 and 1000, you’ll expect most of the values to have 3 digits, one tenth to have 2 digits, and about one hundredth to have 1 digit. Likewise, if you get a random value between -1e308 and 1e308, you expect the overwhelming majority to be in the magnitude of 1e308 or 1e307.

It turns out arbInt is implemented by flooring an arbDouble. Since the sample is always high magnitude integers anyway, arbInt is indistinguishable from arbDouble in practice.

> qc.forAll(prop_sum_associativity, qc.arbInt, qc.arbInt, qc.arbInt);
*** Failed!
2.3561210728985633e+307,1.1122776667786118e+308,-9.35712340491941e+307

So it turns out this QuickCheck port is a bit too simple and a bit too naive to use for many projects. I’ve been working on a fork of this with Jeremy Karmel that adds features and fixes bugs. The API is still evolving, and the whole project isn’t being thoroughly tested. We’re doing this for fun, so unless we start using it on a project the whole thing may never be more than a prototype.

I’d be remiss not to point out another JavaScript port by Darrin Thompson. It has some rather advanced features, including shrinking, which attempts to simplify a counterexample to the minimal case when a test does fail.

Conclusion

The main take-away from the floating point addition example is this:

Property based testing is frighteningly good at finding bugs.

It revealed the boundary cases for associativity of addition with ease. With a better floating point generator, it would easily find more flaws in our eq method: if both results are 0, or both are NaN, they would not pass eq.

We don’t get this for free. We have to write and maintain generators to create our test data. But those generators are often small and reusable. More complex data generators are often very easily defined in terms of simpler generators. Much of the difficulty is finding the invariants that software should maintain, yet identifying and maintaining invariants is one of the most important and underused tools in good software engineering. Property based testing makes it even more rewarding.

And for that small upfront cost, you get a test suite that gives you greater confidence in your code every time you run it. Even if nothing in your code base has changed, running your test suite still has a chance of finding latent bugs. This is a very powerful feature.

If you’re still not convinced, watch John Hughes’s excellent presentation on the virtues of property based testing. Actually, watch it anyway.

Footnotes

[1] QuickCheck has been ported to a large number of programming languages and environments. For a list of links, see the Wikipedia page. (back)
[2] Python has a very nice quick-check port. It integrates nicely with python testing frameworks, allows users to specify bounds on arbitrary data generators where appropriate, and makes sure to test those bounds first, which makes checking edge cases easier. And its source is still straightforward and easy to understand. (back)

Not just for the love of math

This blog exists to demonstrate a point: very abstract mathematical concepts are worth modeling directly in your programs. Abstract algebra and category theory give us concepts like monoids, groups, rings, lattices, fields, functors, monads, etc. Other bloggers have written a lot of good material on how these concepts offer design insights, such as Dave Fayram’s article on monadic patterns in Ruby, John Bender’s article on categories and functors in JavaScript, and Gabriel Gonzalez’s article on functors as a design pattern.

But these articles focus on design insights and patterns, not modelling these patterns directly. Gonzalez talks about functors, which are modeled directly in Haskell, but the focus of his article is on implicit functors in Haskell and the design philosophy behind them. Bender models a single functor, not the general concept, and focuses on the insights from the exercise, not the benefits of using his model. Dave Fayram says Haskell “is one of the few environments where [monads] make sense to model directly”.

I will focus on practical benefits of modeling abstract mathematical structures directly. I value the design patterns these authors are exposing, and will link to similar articles in future posts. But you may think that these patterns are only guides, and perhaps not too important to the actual practice of solving programming problems. You may think that some of these are too abstract to gain much practical value by modeling them directly. I aim to demonstrate their value.

I will focus on four kinds of benefits for each abstraction I consider:

  1. Modularity of design that comes from programming to an abstract interface. You can change the behavior of your code with minimal effort when that change involves swapping one monoid or functor for another.

  2. Improved testing. Most of these objects have axioms or laws, constraints on their behavior. These provide ready properties for testing, using property based testing methods to check those properties with computer generated data.

  3. More accurate reasoning about your code. The interface and laws associated with an abstraction form a contract, which allows you to make assumptions and reason about invariants. They also tend to have some intuitive purpose, which helps communicate the intent of your code.

  4. Easier performance optimizations. The laws associated with these abstractions are typically equivalences. This means that you can substitute one statement for another more efficient statement without affecting the correctness of your code. They aid in reasoning about parallelism, informing how a computation can be restructured to run across multiple cores.

Throughout I’ll be using JavaScript on Node.js. The code I’ve written for these posts can be found here: https://github.com/adambaker/node-algebraic

I may switch to using JRuby or port my examples to a similar language in the future.

I hope you find these useful and interesting.