Implementing map-reduce in F#
May 24, 2010 § 6 Comments
Introduction
MapReduce is a software paradigm popularised by Google in which we take a set of tuples (key-value pairs), transform (map) them into an intermediate set of key-value pairs, and then perform some aggregation (reduce) operation on the intermediate values to obtain a result set. This is a useful way to express a problem because it yields an obvious way to “divide and conquer” the computation in a way that lends itself to parallel/distributed computing, thus providing a fairly simple way to perform computations on extremely large data sets.
It can be quite difficult to grok at first, so I decided to try implementing one of the examples from the MongoDB documentation in F# (if interested, see shell example 2). In this example, we have a list of people and the types of pet each of them has. We wish to calculate the total number of each animal.
The Code
Again, F# proves to be a remarkably succinct language to express problems, in this case the built in syntactic sugar for tuples is a godsend!
UPDATE (25-May-2010) – Controlflow helpfully suggested that I could make my original code somewhat neater by using pattern matching to decompose tuples. I’ve updated the code below with these improvements.
#light
// Simple example of map-reduce in F#
// Counts the total numbers of each animal
// Map function for our problem domain
let mapfunc (k,v) =
v |> Seq.map (fun(pet) -> (pet, 1))
// Reduce function for our problem domain
let reducefunc (k,(vs:seq<int>)) =
let count = vs |> Seq.sum
k, Seq.ofList([count])
// Performs map-reduce operation on a given set of input tuples
let mapreduce map reduce (inputs:seq<_*_>) =
let intermediates = inputs |> Seq.map map |> Seq.concat
let groupings = intermediates |> Seq.groupBy fst |> Seq.map (fun(x,y) -> x, Seq.map snd y)
let results = groupings |> Seq.map reduce
results
// Run the example...
let alice = ("Alice",["Dog";"Cat"])
let bob = ("Bob",["Cat"])
let charlie = ("Charlie",["Mouse"; "Cat"; "Dog"])
let dennis = ("Dennis",[])
let people = [alice;bob;charlie;dennis]
let results = people |> mapreduce mapfunc reducefunc
for result in results do
let animal = fst result
let count = ((snd result) |> Seq.toArray).[0]
printfn "%s : %s" animal (count.ToString())
printfn "Press any key to exit."
System.Console.ReadKey() |> ignore
This yields the expected results:
Dog : 2
Cat : 3
Mouse : 1
Exercise for the reader
Parallelise this implementation (for a single machine this should be trivial by using the Parallel LINQ integration provided in the F# Powerpack).