In early May I posted benchmarks comparing Linux and OS X on a
MacBook Pro running my R packages (I later added Windows XP
benchmarks). In one of the
original benchmarks, both Linux and Windows XP were more than twice as
fast as OS X. And in a second (more representative) benchmark, Linux
was about 20% faster. The benchmarks were posted on Digg and a
variety of other high-traffic websites such as OSnews.
This attention generated a lot of comments and suggestions.
With the help of a variety of developers working at Apple and
elsewhere, the large OS X performance gap previously reported here has
been significantly reduced. The most important improvement is the use
of a more efficient algorithm which relies on optimized BLAS to perform key
matrix operations. This change increased the performance of the code
on all platforms. The performance gap was further closed by compiling
and linking R on OS X against Doug Lea's malloc
(dlmalloc, called dmalloc for short here).
However, a Linux speed advantage remains which varies with the
size of the dataset used. For example, the gap ranges from 0% for a
small dataset to 10% for a medium-sized dataset for the
algorithm in question. The gap shrinks again to 0% for a large
dataset. The performance gap is much greater if the default OS X
malloc is used, even with the new algorithm: the gap goes from
essentially zero for a small dataset, to 40% for a medium one, and up
to 50% for a large one. Therefore, I recommend that R for OS X be
linked against dmalloc just as it is for Windows. At the very least,
packages such as the Scythe
Statistical Library should be so compiled.
The default malloc on OS X, like the default malloc on
Windows XP, causes a large performance degradation relative to the
default mallocs on Linux and Solaris. R developers use the default
system malloc on every operating system but Windows. This decision
turns out to be a bad one in the case of OS X because its allocator
switches to mmap-backed allocations at a much lower threshold (15KB)
than dmalloc does (256KB), and so makes system calls far more
often. Indeed, as suggested
by Kazushige Goto,
the performance of this code could possibly be further improved by
avoiding mmap altogether. GNU malloc, for example, always makes the
munmap system call for large allocations, which results in page
faults on every such allocation. Calls to mmap can be avoided on Linux by
adjusting two environment variables: MALLOC_TRIM_THRESHOLD
to -1 and MALLOC_MMAP_MAX to 0. Unfortunately, nothing similar is
possible with OS X's default malloc, although it would help alleviate
the performance issue. I am less clear on why
Linux performs better than OS X with dmalloc. Also, I have been
unable to find an article justifying the 15KB threshold for the switch
to the kernel's virtual memory system. Is this really the optimal
threshold for today's computers? For example, dmalloc's mmap
threshold used to be 128KB but it was increased to 256KB as computers
have changed---compare version 2.6.6 with 2.8.3 (search for
DEFAULT_MMAP_THRESHOLD).
While the new BLAS version of the code is faster on all platforms, the
quality of BLAS implementations is not constant across operating
systems. Goto's BLAS
implementation is currently the fastest, but it is not yet available
for x86 OS X. So, a direct comparison between OS X and Linux using
the best BLAS library is not currently possible. For the purposes of
these comparisons, the (non-threaded) ATLAS BLAS were initially used
on both OS X and Linux. It turns out, however, that the default OS X
BLAS libraries in the vecLib Framework provide somewhat better
performance for this code, so the vecLib implementation is used for
OS X while the default Ubuntu ATLAS BLAS libraries are used for Linux.
Linux performance improves if Goto's BLAS are used instead of the
ATLAS BLAS. Goto's work is described on an Apple
webpage because he added a custom BLAS for the Apple supercomputer
at Virginia Tech. Virginia Tech was able to achieve their excellent
performance by using a kernel-level memory manager which provides
physically contiguous memory, yielding both high and consistent
performance.
The benchmarks are based on one of my statistical software packages
for R: Matching (Multivariate and Propensity Score
Matching Software). The package uses C++ extensively. The
benchmark scripts only vary by the sample size of the dataset being
examined. The data sizes are: 445, 890, 1780 and 5340 observations. The script with 445
observations is exactly the same as the original Benchmark 1, but it is now run with the new
algorithm, which makes use of optimized BLAS. Each script runs the
benchmark three times and the best runtime of the three is recorded.
Each script is executed 100 times and the average times are reported
below.
Ubuntu Linux (Dapper
Drake) with i686-SMP kernel on a MacBook Pro, Intel 2.16GHz dual
core, 2GB RAM. Note: Xorg server running with GNOME.
Unlike the original Benchmark 1 results
(which were obtained using exactly this script), this benchmark
shows only small differences. There is no difference in
performance between Linux and the OS X code when dmalloc is used. And
the difference between Linux and the code with the OS X malloc is
small but statistically significant---the p-value based on the
empirical distribution over 100 simulations is 0.09.
As we increase the sample size, differences begin to become more
pronounced.
Linux is now 20% faster than the default OS X version (p-value=0.00)
and 10% faster than the OS X version using dmalloc
(p-value=0.00).
There is some evidence that the difference between Linux and OS X with
dmalloc is either asymptoting at about 10% or possibly even
shrinking---from 1.11 times as slow as Linux in the previous benchmark
to 1.08 times as slow now. The next simulation will help to nail this
down. In any case, the gap between Linux and the default OS X malloc
version has doubled, and OS X is now about 1.4 times as slow as Linux.
In an attempt to answer the asymptoting vs shrinking gap question, a
benchmark was run with 5340 observations (12 times the original
dataset).
So it appears that the Linux advantage over OS X using dmalloc was
only present for a certain range of dataset sizes, and at this point
it has disappeared once again. The gap between OS X (default malloc) and
Linux continues to grow. Interestingly, Shark
reports that with the default OS X malloc, about 12% of the runtime is
spent in 'mach_msg_trap', which is a symbol in the libSystem.B.dylib
library. The only other libSystem calls taking more than 1% are
'__isnand' (1.4%) and 'dyld_stub__isnand' (1.3%). 'malloc' itself is
reported to take up (directly) 0.0% of the runtime---so calls to it
are being accounted elsewhere. However, with dmalloc the only
libSystem calls which take up greater than 1% of the time are
'__isnand' (2.5%), 'dyld_stub___isnand' (2.5%) and 'syscall' (1.2%).
'__mmap' takes up 0.3% of the runtime and 'malloc' 0.4%. Does the
large amount of time spent in 'mach_msg_trap' indicate that the XNU
kernel is spending a lot of time message passing, as it is often
accused of doing, or is this simply how Shark reports the kernel's
default virtual memory manager?
The claims that I had previously made about the inefficiency of OS X
system calls relative to Linux still hold, although they may or may not
be an issue for these benchmarks. As one email correspondent noted,
compare Darwin
system call handling versus the Linux
equivalent (Tiger version is available here). Anandtech (Part I and
Part
II) has conducted a series of benchmarks relevant to this issue,
and Darwin is slower than the 2.6 Linux kernel in these benchmarks,
sometimes by several times.
I was a bit surprised at the degree to which the malloc implementation
is still making a difference, but my code still frequently makes
small memory allocations. The efficiency of the malloc
implementation had a large impact on the original version of the code
because malloc was being called in the inner loop by the Scythe Statistical Library. This
library does not require that the dimensions of matrices be explicitly
declared. In my updated algorithm most, but not all, matrix
dimensions are declared at the time of variable creation and key
operations now make use of system BLAS libraries. These changes
make the code faster on all platforms. However, there is a tradeoff.
The Scythe Statistical Library
allows for the quick development of C++ code that is faster than R (an
interpreted language). One can certainly write more efficient C++
code. But code using Scythe is clear and straightforward to write,
read, and debug, especially for R programmers, because it uses
similar syntax. Therefore, it is reasonable for developers to use Scythe even
though there are more efficient alternatives.
Suggestions
If you have any suggestions or if you think something here is
erroneous, please contact me.