Insipired by Daniel Burrows’ tests, I decided to do mine. The object is to come up with a Haskell program that performs the job of wc -l as efficiently as possible. My test material is a randomly generated file with all lines less than 80 characters long.
Summary: my final idiomatic version is twice as slow as wc -l, and my final nonidiomatic version is as fast as wc -l, or at least within the margin of error.
First, the baseline:
$ time wc -l testfile 7798137 testfile real 0m12.127s user 0m0.765s sys 0m0.399s
Then the most idiomatic program. Honestly, I expect it to suck performance-wise:
main = getContents >>= print . length . lines
And it does:
$ time ./wcl1.hs < testfile 7798137 real 1m14.993s user 0m54.182s sys 0m0.573s
But only by a factor of six. My first hypothesis is that garbage collection is the culprit. And, indeed:
$ time ./wcl1.hs +RTS -sstderr -RTS < testfile ./wcl1.hs +RTS -sstderr 7798137 25,236,174,340 bytes allocated in the heap 5,070,193,924 bytes copied during GC 117,484 bytes maximum residency (3750 sample(s)) 96274 collections in generation 0 ( 26.36s) 3750 collections in generation 1 ( 0.74s) 2 Mb total memory in use INIT time 0.00s ( 0.00s elapsed) MUT time 27.17s ( 42.72s elapsed) GC time 27.10s ( 33.72s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 54.27s ( 76.44s elapsed) %GC time 49.9% (44.1% elapsed) Alloc rate 928,824,966 bytes per MUT second Productivity 50.1% of total user, 35.5% of total elapsed real 1m16.616s user 0m54.278s sys 0m0.700s
As you can see, GC takes half of the processing time. Unfortunately, minimizing GC time does not have as big an effect as I hoped:
$ time ./wcl1.hs +RTS -H100M -RTS < testfile 7798137 real 1m1.354s user 0m44.371s sys 0m1.171s
Here, I'm effectively telling it to use a 100-megabyte heap. A separate GC profiling run tells me that GC time has dropped from 50 % to 1 %.
The memory problem seems to be a red herring. What we're doing here is creating four separately allocated memory cells for each character in the input. Of course, the cells are generated lazily, that is, a cell pair is allocated when the lines function demands another character; it then creates another cell pair to denote the same character in a list of lines. As soon as each line has been counted, the four cells become garbage. This sounds like a lot of work! But apparently it isn't; I tried a few variations that did away with the stream of input altogether, not much unlike what Daniel did in his tests. Surprisingly, the performance worsened a lot. The lesson: functional programming language implementations, unlike typical imperative ones, are built on the assumption that you'll be doing a lot of allocation.
What I suspect is taking a lot of the time is arithmetic. Haskell's Int is boxed, and lazy by default, adding a lot of overhead. I hand-fused length and line, yielding a tail-recursive countLines function, and I added the command-line switch -funbox-strict-fields, which enables a few significant optimizations of the laziness of Int. The results were nice:
ajk@kukkamaljakko:~/scratch$ cat wcl4.hs module Main (main) where main :: IO () main = getContents >>= print . countLines 0 countLines :: Int -> String -> Int countLines n ('\n':r) = countLines (n + 1) r countLines n (_:r) = countLines n r countLines n  = n ajk@kukkamaljakko:~/scratch$ ghc -O -Wall -funbox-strict-fields --make wcl4.hs -o wcl4 Chasing modules from: wcl4.hs Compiling Main ( wcl4.hs, wcl4.o ) Linking ... ajk@kukkamaljakko:~/scratch$ time ./wcl4 < testfile 7798137 real 0m37.961s user 0m29.853s sys 0m0.495s ajk@kukkamaljakko:~/scratch$ time ./wcl4 +RTS -H100M < testfile 7798137 real 0m25.153s user 0m13.497s sys 0m0.803s ajk@kukkamaljakko:~/scratch$
With a large heap, the program is only twice as slow as the baseline. This is pretty good, considering that the code is still idiomatically functional.
The following rather ugly, imperative program does away with the lazy stream and carefully avoids any laziness without doing any too dirty tricks:
ajk@kukkamaljakko:~/scratch$ cat wcl7.hs module Main (main) where import Data.Word import Data.Array.IO import IO (stdin) count :: Int -> (Int, Int) -> IOUArray Int Word8 -> IO Int count 0 _ _ | False = undefined -- force ac evaluation count ac (l,u) _ | l > u = return ac count ac (l,u) ar = do a < - readArray ar l let ac' = case a of 10 -> ac + 1 _ -> ac count ac' (l+1, u) ar maxn :: Int maxn = 4096 main :: IO () main = do arr < - newArray (0,maxn) 0 let loop :: Int -> IO Int loop acc = do n < - hGetArray stdin arr maxn if n == 0 then return acc else do acc' <- count acc (0,n) arr loop acc' res <- loop 0 print res ajk@kukkamaljakko:~/scratch$ ghc -O -Wall -funbox-strict-fields --make wcl7.hs -o wcl7 Chasing modules from: wcl7.hs Compiling Main ( wcl7.hs, wcl7.o ) Linking ... ajk@kukkamaljakko:~/scratch$ time ./wcl7 < testfile 7798137 real 0m13.431s user 0m10.342s sys 0m0.467s
Of course, this program is not Haskell 98, but the nonstandard things it uses are fairly common. As can be seen, this version's performance is, considering the margin of error, equal to the baseline.
The compiler used in this test is The Glorious Glasgow Haskell Compilation System, version 6.4.1, as packaged in Debian sid.