Insipired by Daniel Burrows’ tests, I decided to do mine. The object is to come up with a Haskell program that performs the job of wc -l as efficiently as possible. My test material is a randomly generated file with all lines less than 80 characters long.
Summary: my final idiomatic version is twice as slow as wc -l, and my final nonidiomatic version is as fast as wc -l, or at least within the margin of error.
First, the baseline:
$ time wc -l testfile 7798137 testfile real 0m12.127s user 0m0.765s sys 0m0.399s
Then the most idiomatic program. Honestly, I expect it to suck performance-wise:
main = getContents >>= print . length . lines
And it does:
$ time ./wcl1.hs < testfile 7798137 real 1m14.993s user 0m54.182s sys 0m0.573s
But only by a factor of six. My first hypothesis is that garbage collection is the culprit. And, indeed:
$ time ./wcl1.hs +RTS -sstderr -RTS < testfile ./wcl1.hs +RTS -sstderr 7798137 25,236,174,340 bytes allocated in the heap 5,070,193,924 bytes copied during GC 117,484 bytes maximum residency (3750 sample(s)) 96274 collections in generation 0 ( 26.36s) 3750 collections in generation 1 ( 0.74s) 2 Mb total memory in use INIT time 0.00s ( 0.00s elapsed) MUT time 27.17s ( 42.72s elapsed) GC time 27.10s ( 33.72s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 54.27s ( 76.44s elapsed) %GC time 49.9% (44.1% elapsed) Alloc rate 928,824,966 bytes per MUT second Productivity 50.1% of total user, 35.5% of total elapsed real 1m16.616s user 0m54.278s sys 0m0.700s
As you can see, GC takes half of the processing time. Unfortunately, minimizing GC time does not have as big an effect as I hoped:
$ time ./wcl1.hs +RTS -H100M -RTS < testfile 7798137 real 1m1.354s user 0m44.371s sys 0m1.171s
Here, I'm effectively telling it to use a 100-megabyte heap. A separate GC profiling run tells me that GC time has dropped from 50 % to 1 %.
The memory problem seems to be a red herring. What we're doing here is creating four separately allocated memory cells for each character in the input. Of course, the cells are generated lazily, that is, a cell pair is allocated when the lines function demands another character; it then creates another cell pair to denote the same character in a list of lines. As soon as each line has been counted, the four cells become garbage. This sounds like a lot of work! But apparently it isn't; I tried a few variations that did away with the stream of input altogether, not much unlike what Daniel did in his tests. Surprisingly, the performance worsened a lot. The lesson: functional programming language implementations, unlike typical imperative ones, are built on the assumption that you'll be doing a lot of allocation.
What I suspect is taking a lot of the time is arithmetic. Haskell's Int is boxed, and lazy by default, adding a lot of overhead. I hand-fused length and line, yielding a tail-recursive countLines function, and I added the command-line switch -funbox-strict-fields, which enables a few significant optimizations of the laziness of Int. The results were nice:
ajk@kukkamaljakko:~/scratch$ cat wcl4.hs
module Main (main) where
main :: IO ()
main = getContents >>= print . countLines 0
countLines :: Int -> String -> Int
countLines n ('\n':r) = countLines (n + 1) r
countLines n (_:r) = countLines n r
countLines n [] = n
ajk@kukkamaljakko:~/scratch$ ghc -O -Wall -funbox-strict-fields --make wcl4.hs -o wcl4
Chasing modules from: wcl4.hs
Compiling Main ( wcl4.hs, wcl4.o )
Linking ...
ajk@kukkamaljakko:~/scratch$ time ./wcl4 < testfile
7798137
real 0m37.961s
user 0m29.853s
sys 0m0.495s
ajk@kukkamaljakko:~/scratch$ time ./wcl4 +RTS -H100M < testfile
7798137
real 0m25.153s
user 0m13.497s
sys 0m0.803s
ajk@kukkamaljakko:~/scratch$
With a large heap, the program is only twice as slow as the baseline. This is pretty good, considering that the code is still idiomatically functional.
The following rather ugly, imperative program does away with the lazy stream and carefully avoids any laziness without doing any too dirty tricks:
ajk@kukkamaljakko:~/scratch$ cat wcl7.hs
module Main (main) where
import Data.Word
import Data.Array.IO
import IO (stdin)
count :: Int -> (Int, Int) -> IOUArray Int Word8 -> IO Int
count 0 _ _ | False = undefined -- force ac evaluation
count ac (l,u) _ | l > u = return ac
count ac (l,u) ar
= do a < - readArray ar l
let ac' = case a of 10 -> ac + 1
_ -> ac
count ac' (l+1, u) ar
maxn :: Int
maxn = 4096
main :: IO ()
main = do
arr < - newArray (0,maxn) 0
let loop :: Int -> IO Int
loop acc = do n < - hGetArray stdin arr maxn
if n == 0
then return acc
else do acc' <- count acc (0,n) arr
loop acc'
res <- loop 0
print res
ajk@kukkamaljakko:~/scratch$ ghc -O -Wall -funbox-strict-fields --make wcl7.hs -o wcl7
Chasing modules from: wcl7.hs
Compiling Main ( wcl7.hs, wcl7.o )
Linking ...
ajk@kukkamaljakko:~/scratch$ time ./wcl7 < testfile
7798137
real 0m13.431s
user 0m10.342s
sys 0m0.467s
Of course, this program is not Haskell 98, but the nonstandard things it uses are fairly common. As can be seen, this version's performance is, considering the margin of error, equal to the baseline.
The compiler used in this test is The Glorious Glasgow Haskell Compilation System, version 6.4.1, as packaged in Debian sid.
Isn’t “user” supposed to contain much more accurate information on the program’s run time that “real”?
So we rather have a factorof more than sixty for the first program.
otoh, I am not sure whether such a heavy IO task is the optimal thing to demonstrate the performance of a functional programming language (implementation).
Kind regards,
Zeno
I think that the reasoning in these experiments is sound, but the numbers are not trustworthy.
Please specify your locale for your tests. I don’t know about Haskell, but if your wc tests were not run in the C locale, please rerun them; a tenfold improvement is not uncommon.
Zeno, good point. I’ll investigate.
Ralf, I use the Finnish locale. Rerunning baseline in C locale gives a noticeable but insiginificant performance boost (2 seconds in real titme), given the margin of error in these measurements.
Zeno, “user” specifies how much CPU the program used in user time. Whether it is a better measurement than “real” is a matter of what you are measuring
I was mainly concerned with the “response time”, so “real” was an appropriate measure. However, the large “user” means that the programs do an awful lot more work than baseline. I need to investigate that.
Zeno, I chose wc -l as a benchmark precisely because Haskell seems likely to be poorly suited to it. The point is that a medium-to-large program will have to do many different algorithmic tasks, and the real world is not exclusively populated by tasks for which functional languages are well-suited
. So it would be nice to know that your initial choice of a language of implementation doesn’t inevitably doom your program to suck in one way or another. Attacking problems that are non-ideal cases for the language is a way to get some idea of what the worst case will be.