Linux versus Mac OS X on Intel Dual Core

2006-05-25

In early May I posted benchmarks comparing Linux and OS X on a MacBookpro running my R packages (I later added Windows XP benchmarks). In one of the original benchmarks, both Linux and Windows XP were more than twice as fast as OS X. And in a second (more representative) benchmark, Linux was about 20% faster. The benchmarks were posted on Digg and a variety of other high traffic Internet websites such as OSnews. This attention generated a lot of comments and suggestions.

With the help of a variety of developers working at Apple and elsewhere, the large OS X performance gap previously reported here has been significantly reduced. The most important improvement is the use of a more efficient algorithm which relies on optimized BLAS to perform key matrix operations. This change increased the performance of the code on all platforms. The performance gap was further closed by compiling and linking R on OS X against Doug Lea's malloc (called dmalloc for short).

However, a Linux speed advantage remains which varies with the size of the dataset used. For example, the gap ranges from 0% for a small dataset to 10% for what is a medium size dataset for the algorithm in question. The gap shrinks again to 0% for a large dataset. The performance gap is much greater if the default OS X malloc is used notwithstanding the new algorithm: the gap goes from essentially zero for a small dataset, to 40% for a medium one, and up to 50% for a large one. Therefore, I recommend that R for OS X be linked against dmalloc just as it is for Windows. At the very least, packages such as the Scythe Statistical Library should be so compiled.

The default malloc on OS X, like the default malloc on Windows XP, causes a large performance degradation relative to the default mallocs on Linux and Solaris. R developers use the default system malloc on every operating system but Windows. It turns out that this decision is a bad one in the case of OS X because its memory allocator makes system calls more often (at 15KB and larger allocations) than dmalloc (at 256KB and larger). Indeed, as suggested by Kazushige Goto, the performance of this code could possibly be further improved by avoiding mmap altogether. GNU malloc, for example, always makes the munmap system call for large allocations which results in page faults with every such allocation. Calls to mmap can be avoided on Linux by adjusting two environmental variables: MALLOC_TRIM_THRESHOLD to -1 and MALLOC_MMAP_MAX to 0. It is unfortunate that it is not possible to do something similar with OS X's default malloc because it would help alleviate the performance issue. I am less clear on why Linux performs better than OS X with dmalloc. Also, I have been unable to find an article justifying the 15KB threshold for the switch to the kernel's virtual memory system. Is this really the optimal threshold for today's computers? For example, dmalloc's mmap threshold used to be 128KB but it was increased to 256KB as computers have changed---compare version 2.6.6 with 2.8.3 (search for DEFAULT_MMAP_THRESHOLD).

While the new BLAS version of the code is faster on all platforms, the quality of BLAS implementations is not constant across operating systems. Goto's BLAS implementation is currently the fastest and it is not yet available for x86 OS X. So, a direct comparison between OS X and Linux using the best BLAS library is not currently possible. For the purposes of these comparisons, the ATLAS (non-threaded) BLAS were used on both OS X and Linux. It turns out that the default OS X BLAS libraries in the vecLib Framework provide somewhat better performance for this code on OS X, so the OS X default BLAS implementation is used while the default Ubuntu ATLAS BLAS libraries are used for Linux.

Linux performance improves if Goto's BLAS are used instead of the ATLAS BLAS. Goto's work is described on an Apple webpage because he added a custom BLAS for the Apple super computer at Virginia Tech. Virginia Tech was able to achieve their excellent performance by using a kernel level memory manager which provides physically contiguous memory yielding both high and consistent performance.

The benchmarks are based on one of my statistical software packages for R: Matching (Multivariate and Propensity Score Matching Software). The code uses C++ code extensively. The benchmark scripts only vary by the sample size of the dataset being examined. The data sizes are: 445, 890, 1780 and 5340 observations. The script with 445 observations is exactly the same as the original Benchmark 1, but it is now run with the new algorithm which makes use of optimized BLAS. Each script runs the benchmark three times and the best runtime of the three is recorded. Each script is executed 100 times and the average times are reported below.

The computer being used is:
Label OS and Chip
OS X Core Duo Tiger on MacBookpro, Intel 2.16GHz Dual Core 2GB RAM
Linux Core Duo Ubuntu Linux (Drapper Drake) with i686-SMP kernel on MacBookpro, Intel 2.16GHz Dual Core 2GB RAM. Note: Xorg server running with GNOME


GenMatch

Unlike the original Benchmark 1 results (which were obtained using exactly this script), this benchmarks results in only small differences. There is no difference in performance between Linux and the OS X code when dmalloc is used. And the difference between Linux and the code with the OS X malloc is small but statistically significant---the p-value based on the empirical distribution over 100 simulations is 0.09.

As we increase the sample size, differences begin to become more pronounced.
GenMatch_890obs

Linux is now 20% faster than the default OS X version (p-value=0.00) and 10% faster than the OS X version using dmalloc (p-value=0.00).
GenMatch_1780obs

There is some evidence that the difference between Linux and OS X with dmalloc is either asymptoting at about 10% or possibly even shrinking---from 1.11 times as slow as Linux in the previous benchmark to 1.08 times as slow now. The next simulation will help to nail this down. In any case, the gap between Linux and the default OS X malloc version has doubled and OS X is now about 1.4 times slower than Linux.

In an attempt to answer the asymptoting vs shrinking gap question, a benchmarks was run with 5340 observations (12 times the original dataset).
GenMatch_5340obs
So it appears that the Linux advantage over OS X using dmalloc was only present for a given range of dataset size and at this point it has disappeared once again. The gap between OS X (default malloc) and Linux continues to grow. Interestingly, Shark reports that with the default OS X malloc, about 12% of the runtime is spent in 'mach_msg_trap' which is a symbol in the libSystem.B.dylib library. The only other libSystem calls taking more than 1% are '__isnand' (1.4%) and 'dyld_stub__isnand' (1.3%). 'malloc' itself is reported to take up (directly) 0.0% of the runtime---so calls to it are being accounted elsewhere. However, with dmalloc the only libSystem calls which take up greater than 1% of the time are '__inand' (2.5%), 'dyld_stub___isnand' (2.5%) and 'syscall' (1.2%). '__mmap' takes up 0.3% of the runtime and 'malloc' 0.4%. Does the large amount of time spent in 'mach_msg_trap' indicate that XNU kernel is spending a lot of time message passing as it is often accused of or is this simply how Shark is reporting the kernel's default virtual memory manager?

The claims that I had previously made about the inefficiency of OS X system calls relative to Linux still hold although they may or may not be an issue for these benchmarks. As one email correspondent noted, compare Darwin system call handling versus the Linux equivalent (Tiger version is available here). Anandtech (Part I and Part II) has conducted a series of benchmarks relevant to this issue, and Darwin is slower than the 2.6 Linux kernel in these benchmarks sometimes by several times.

I was a bit surprised at the degree to which the malloc implementation is still making a difference, but small memory allocations are still being frequently made by my code. The efficiency of the malloc implementation had a large impact on the original version of the code because malloc was being called in the inner loop by the Scythe Statistical Library. This library does not require that the dimensions of matrices be explicitly declared. In my updated algorithm most, but not all, matrix dimensions are declared at the time of variable creation and key operations now make use of system BLAS libraries. These changes make the code faster on all platforms. However, there is a tradeoff. The Scythe Statistical Library allows for the quick development of C++ code that is faster than R (an interpreted language). One can certainly write more efficient C++ code. But code using Scythe is clear and straightforward to write, read and debug especially for R programmers because it uses a similar syntax. Therefore, it is reasonable for developers to use Scythe even though there are more efficient alternatives.

Suggestions


If you have any suggestions or if you think something here is erroneous, please contact me.

Links


For an article on this see: "The Art of Benchmarking: Evaluating the Performance of R on Linux and OS X". The Political Methodologist, 14(1), 2006.

AnandTech's OS X benchmarks: "No more mysteries: Apple's G5 versus x86, Mac OS X versus Linux" and "No more mysteries, part two".


Return to Jasjeet Sekhon's Homepage