Counting lines

Insipired by Daniel Burrows’ tests, I decided to do mine. The object is to come up with a Haskell program that performs the job of wc -l as efficiently as possible. My test material is a randomly generated file with all lines less than 80 characters long.

Summary: my final idiomatic version is twice as slow as wc -l, and my final nonidiomatic version is as fast as wc -l, or at least within the margin of error.

First, the baseline:

$ time wc -l testfile
7798137 testfile

real    0m12.127s
user    0m0.765s
sys     0m0.399s

Then the most idiomatic program. Honestly, I expect it to suck performance-wise:

main = getContents >>= print . length . lines

And it does:

$ time ./wcl1.hs < testfile

real    1m14.993s
user    0m54.182s
sys     0m0.573s

But only by a factor of six. My first hypothesis is that garbage collection is the culprit. And, indeed:

$ time ./wcl1.hs +RTS -sstderr -RTS < testfile
./wcl1.hs +RTS -sstderr
25,236,174,340 bytes allocated in the heap
5,070,193,924 bytes copied during GC
117,484 bytes maximum residency (3750 sample(s))

96274 collections in generation 0 ( 26.36s)
3750 collections in generation 1 (  0.74s)

2 Mb total memory in use

INIT  time    0.00s  (  0.00s elapsed)
MUT   time   27.17s  ( 42.72s elapsed)
GC    time   27.10s  ( 33.72s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   54.27s  ( 76.44s elapsed)

%GC time      49.9%  (44.1% elapsed)

Alloc rate    928,824,966 bytes per MUT second

Productivity  50.1% of total user, 35.5% of total elapsed

real    1m16.616s
user    0m54.278s
sys     0m0.700s

As you can see, GC takes half of the processing time. Unfortunately, minimizing GC time does not have as big an effect as I hoped:

$ time ./wcl1.hs +RTS -H100M  -RTS  < testfile

real    1m1.354s
user    0m44.371s
sys     0m1.171s

Here, I'm effectively telling it to use a 100-megabyte heap. A separate GC profiling run tells me that GC time has dropped from 50 % to 1 %.

The memory problem seems to be a red herring. What we're doing here is creating four separately allocated memory cells for each character in the input. Of course, the cells are generated lazily, that is, a cell pair is allocated when the lines function demands another character; it then creates another cell pair to denote the same character in a list of lines. As soon as each line has been counted, the four cells become garbage. This sounds like a lot of work! But apparently it isn't; I tried a few variations that did away with the stream of input altogether, not much unlike what Daniel did in his tests. Surprisingly, the performance worsened a lot. The lesson: functional programming language implementations, unlike typical imperative ones, are built on the assumption that you'll be doing a lot of allocation.

What I suspect is taking a lot of the time is arithmetic. Haskell's Int is boxed, and lazy by default, adding a lot of overhead. I hand-fused length and line, yielding a tail-recursive countLines function, and I added the command-line switch -funbox-strict-fields, which enables a few significant optimizations of the laziness of Int. The results were nice:

ajk@kukkamaljakko:~/scratch$ cat wcl4.hs
module Main (main) where

main :: IO ()
main = getContents >>= print . countLines 0

countLines :: Int -> String -> Int
countLines n ('\n':r) = countLines (n + 1) r
countLines n (_:r) = countLines n r
countLines n [] = n
ajk@kukkamaljakko:~/scratch$ ghc -O -Wall -funbox-strict-fields --make wcl4.hs -o wcl4
Chasing modules from: wcl4.hs
Compiling Main             ( wcl4.hs, wcl4.o )
Linking ...
ajk@kukkamaljakko:~/scratch$ time ./wcl4 < testfile

real    0m37.961s
user    0m29.853s
sys     0m0.495s
ajk@kukkamaljakko:~/scratch$ time ./wcl4 +RTS -H100M < testfile

real    0m25.153s
user    0m13.497s
sys     0m0.803s

With a large heap, the program is only twice as slow as the baseline. This is pretty good, considering that the code is still idiomatically functional.

The following rather ugly, imperative program does away with the lazy stream and carefully avoids any laziness without doing any too dirty tricks:

ajk@kukkamaljakko:~/scratch$ cat wcl7.hs
module Main (main) where

import Data.Word
import Data.Array.IO
import IO (stdin)

count :: Int -> (Int, Int) -> IOUArray Int Word8 -> IO Int
count 0 _ _ | False = undefined -- force ac evaluation
count ac (l,u) _ | l > u = return ac
count ac (l,u) ar
    = do a < - readArray ar l
         let ac' = case a of 10 -> ac + 1
                             _  -> ac
         count ac' (l+1, u) ar

maxn :: Int
maxn = 4096

main :: IO ()
main = do
  arr < - newArray (0,maxn) 0
  let loop :: Int -> IO Int
      loop acc = do n < - hGetArray stdin arr maxn
                    if n == 0
                       then return acc
                       else do acc' <- count acc (0,n) arr
                               loop acc'
  res <- loop 0
  print res

ajk@kukkamaljakko:~/scratch$ ghc -O  -Wall -funbox-strict-fields --make wcl7.hs -o wcl7
Chasing modules from: wcl7.hs
Compiling Main             ( wcl7.hs, wcl7.o )
Linking ...
ajk@kukkamaljakko:~/scratch$ time ./wcl7 < testfile

real    0m13.431s
user    0m10.342s
sys     0m0.467s

Of course, this program is not Haskell 98, but the nonstandard things it uses are fairly common. As can be seen, this version's performance is, considering the margin of error, equal to the baseline.

The compiler used in this test is The Glorious Glasgow Haskell Compilation System, version 6.4.1, as packaged in Debian sid.

6 thoughts on “Counting lines

  1. Isn’t “user” supposed to contain much more accurate information on the program’s run time that “real”?
    So we rather have a factorof more than sixty for the first program.

    otoh, I am not sure whether such a heavy IO task is the optimal thing to demonstrate the performance of a functional programming language (implementation).

    Kind regards,

  2. I think that the reasoning in these experiments is sound, but the numbers are not trustworthy.

  3. Please specify your locale for your tests. I don’t know about Haskell, but if your wc tests were not run in the C locale, please rerun them; a tenfold improvement is not uncommon.

  4. Zeno, good point. I’ll investigate.

    Ralf, I use the Finnish locale. Rerunning baseline in C locale gives a noticeable but insiginificant performance boost (2 seconds in real titme), given the margin of error in these measurements.

  5. Zeno, “user” specifies how much CPU the program used in user time. Whether it is a better measurement than “real” is a matter of what you are measuring :) I was mainly concerned with the “response time”, so “real” was an appropriate measure. However, the large “user” means that the programs do an awful lot more work than baseline. I need to investigate that.

  6. Zeno, I chose wc -l as a benchmark precisely because Haskell seems likely to be poorly suited to it. The point is that a medium-to-large program will have to do many different algorithmic tasks, and the real world is not exclusively populated by tasks for which functional languages are well-suited :-). So it would be nice to know that your initial choice of a language of implementation doesn’t inevitably doom your program to suck in one way or another. Attacking problems that are non-ideal cases for the language is a way to get some idea of what the worst case will be.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>