
Intel Phi Coprocessor?

8 messages · ivo welch, Dirk Eddelbuettel, Simon Urbanek +1 more

#
does R run on the intel phi coprocessor?  the intel literature makes
it seem as if it can be treated just like a 50-core 200-thread
just-like-i686 processor running linux, albeit with only 8GB of very
fast shared RAM.  some posts have suggested it can be 2-3 times as
fast as two high-end Intel Xeon 8-core machines.  how do simple
library(parallel) R tasks scale on it?

regards,

/iaw

----
Ivo Welch (ivo.welch at gmail.com)
#
On Jun 10, 2013, at 1:44 AM, ivo welch wrote:

Given that R is not thread-safe and almost everything (apart from parallel BLAS) is single-threaded, it's exactly the opposite of what you need for R. Explicit parallelization in R has overhead and cannot use threads, so you're better off with a higher clock speed than a large number of cores (unless you use those cores explicitly for particular tasks by writing your own low-level code).

I was not able to test the Phi, but generally, in our experience, scaling to many cores does not work very well, in particular when you have so little RAM (the only way parallel can scale is by running multiple processes, which limits the amount of memory sharing that can be done). So, the way I see it, you'd have to treat the Phi like a GPU: you'll be able to reach the claimed speeds with very specific code and algorithms written for it (or, e.g., by running BLAS on it if that's what you do often), but it will be much slower than Xeons for regular use of R.

Your mileage may vary - this is just my personal experience evaluating high-core machines (250+ cores) with R (the lesson was that it's better to get multiple low-core, high-clock-speed, high-RAM machines instead - the opposite of the Phi), not experience with the Phi in particular.
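[Editor's note: the overhead Simon describes is easy to measure. The sketch below - not from the original thread, and with timings that will vary by machine - runs the same small CPU-bound task serially and under mclapply with increasing worker counts; mclapply is Unix-only.]

```r
library(parallel)

f <- function(i) sum(rnorm(1e5))   # a small, CPU-bound unit of work

serial <- system.time(lapply(1:200, f))["elapsed"]
for (nc in c(2, 4, 8)) {
  par_t <- system.time(mclapply(1:200, f, mc.cores = nc))["elapsed"]
  cat(sprintf("%2d workers: speedup %.2fx\n", nc, serial / par_t))
}
```

On tasks this small, fork and scheduling overhead eats much of the gain, which is exactly the clock-speed-over-core-count point above.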

Cheers,
Simon
#
thanks, simon.  for me, it is all about running the same code faster,
i.e., without much optimization.  so no phi for me.  An Intel i7 costs
about $100 more (of a b-o-m cost of $1,000) than an i5.   I presume this
is still worth it, because library parallel does use threads.  correct?

[this is asking too much, but here is a related quick question.  does
stock R take advantage of SSE?  SSEx?  AVX?  AVX2?  are these all
single-float based which stock R does not support anyway?]

/iaw

----
Ivo Welch (ivo.welch at gmail.com)
http://www.ivo-welch.info/
J. Fred Weston Professor of Finance
Anderson School at UCLA, C519
Director, UCLA Anderson Fink Center for Finance and Investments
Free Finance Textbook, http://book.ivo-welch.info/
Editor, Critical Finance Review, http://www.critical-finance-review.org/



On Mon, Jun 10, 2013 at 5:15 AM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
#
On Jun 10, 2013, at 10:00 AM, ivo welch wrote:

No, parallel does not use threads. multicore forks off a process from the current session, all the other methods create a new process, send the data to it etc.
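[Editor's note: the process-based distinction Simon draws - forking versus spawning fresh processes and shipping data to them - looks like this in practice; a minimal sketch, not from the original thread.]

```r
library(parallel)

x <- matrix(rnorm(1e6), ncol = 100)

# Fork-based (Unix only): children share the parent's memory copy-on-write,
# so `x` does not have to be re-sent to each worker.
res1 <- mclapply(1:100, function(j) mean(x[, j]), mc.cores = 4)

# Socket-based (works everywhere): each worker is a fresh R process, so the
# data must be shipped to it explicitly before it can be used.
cl <- makeCluster(4)
clusterExport(cl, "x")
res2 <- parLapply(cl, 1:100, function(j) mean(x[, j]))
stopCluster(cl)
```

The clusterExport step is where the "send the data to it" cost lives; for large objects it can dominate the computation itself.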

However, with as few cores as the i5/i7 have, I would certainly go for more. If you want real speed, you have to pay much more ;)

That depends on how you compile R. R itself doesn't, but the compiler will use SSE or AVX if instructed to, and the system libraries do.

At least with gcc I found this to be a mixed bag: SSE is certainly faster, but although AVX can in theory speed up double-precision arithmetic, it can also slow it down quite a bit. So far the only consistently faster use of AVX I have found is in MKL, not in code generated by the gcc or llvm compilers. This may change as the compilers get better ... hopefully ...
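[Editor's note: since which BLAS/LAPACK R is linked against matters more here than compiler flags, it helps to check what the running session is actually using. The helpers below were added to base R in releases after this 2013 thread (extSoftVersion in R 3.2, BLAS/LAPACK paths in sessionInfo from R 3.4), so treat this as a sketch for newer R versions.]

```r
# Ask the running R session which numerical libraries it actually uses.
extSoftVersion()   # versions of external libraries R was built against
La_version()       # LAPACK version in use
sessionInfo()      # on newer R, also reports the BLAS/LAPACK shared-library paths
```

If these report the reference BLAS, swapping in a tuned BLAS (OpenBLAS, MKL) is usually a far bigger win for matrix-heavy code than recompiling R with AVX flags.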

Cheers,
Simon
#
thx, simon.
there is a paragraph in the library(parallel) package description (apr
18, 2013) that says

On Windows the default is to report the number of logical CPUs. On modern hardware (e.g. Intel Core i7) the latter may not be unreasonable as hyper-threading does give a significant extra throughput.

I took this to mean that threads are useful.
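[Editor's note: the passage quoted from the parallel docs concerns detectCores, which can report either count; a quick way to see the hyper-threading distinction on your own machine, assuming a recent enough version of the parallel package.]

```r
library(parallel)
detectCores(logical = TRUE)    # logical CPUs, counting hyper-threaded siblings
detectCores(logical = FALSE)   # physical cores only (NA on some platforms)
```

The gap between the two numbers is what hyper-threading "adds"; as Simon notes below, those extra logical CPUs can run extra processes, not just threads.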
depressingly more.  the 6-core and 8-core xeons are now two generations
old (still based on Sandy Bridge).  haswell only exists in 4-core
versions, though, and ivy bridge-E has been delayed.  maybe amd kaveri
will be a quantum leap...again next year.  alas, I am getting so old, I
just hope to still be alive by then.  if my computer is twice as fast,
maybe I can write twice as many papers until next year... [= weird
sense of humor]

I am now wondering what the smallest-form-factor i7-haswell board is,
to patch together a few of them.  I wish there were a good
5-motherboard chassis, but they don't exist.  I guess I will need 5
SFFs.

I tried gcc with more compiler optimization flags and sse2.  dirk e
predicted correctly that it would make little difference over the
stock binary R distribution.  :-(

I don't understand enough about the internals of intel processors,
compilers, and R, but it is surprising to me that with all the focus
on vector processing, all the various MMX instruction derivatives
still seem to make little difference in R in 2013.  what exactly is
the default here?  are we still using the old intel 80387
instructions, "just" better optimized for a vector language like R?
#
Ivo,

You are making a lot of assertions here.  As an empirically-minded
investigator, should you not simply be _profiling_ a lot more?

For what it is worth, Armadillo [a C++ library for linear algebra I like a
lot and connect to R via RcppArmadillo] just added SSE2 operations if and
only if -O3 is used and g++ is at least 4.7.1.  I have not had a chance to
time this to see if any differences materialize.

The rest of the discussion seems somewhat irrelevant.  R is a hybrid
system which uses an interpreter as well as compiled code.  I think you
should make sure your code is actually constrained by the compiled
portions before going down all these roads.
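[Editor's note: Dirk's suggestion to profile first can be done with base R alone; a minimal sketch, with a stand-in workload - profile your own code in place of the lm.fit call.]

```r
# Profile a representative run to see whether time is spent in compiled
# code (BLAS/LAPACK, C routines) or in the R interpreter itself.
Rprof("prof.out")
X   <- matrix(rnorm(2e6), ncol = 200)
fit <- lm.fit(X, rnorm(nrow(X)))   # mostly compiled/LAPACK work
Rprof(NULL)
summaryRprof("prof.out")$by.self
```

If the top entries are .C/.Call/BLAS routines, SIMD and BLAS tuning can pay off; if they are interpreted R functions, no compiler flag will help much.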

Dirk
#
On Jun 10, 2013, at 11:43 AM, ivo welch wrote:

The HT feature of CPUs has really nothing to do with threads - it's a bit of a misnomer. It just says that you can run more tasks than you have cores - those can be processes, it is not limited to threads. The CPU pretends to have more cores than it does. Think of it as priming the pipeline into the one computing core with two tasks at once so when one of the tasks needs to wait for something, the other can jump ahead with practically no overhead. So, yes, that is a good feature, assuming that you have enough parallel jobs to keep it busy ;).
The issue is rather that the trend is towards massively parallel processors at low clock speeds. This is great if you can code up your very specific embarrassingly parallel task, but doesn't buy you much if you can't. But that's why most of today's focus is to find out how you can ;).
Unsurprisingly - see below.
What you are possibly missing is that the default *is* to use all of that in stock R already. I assume you're talking about the 64-bit AMD/Intel architecture, which is newer than the SIMD instructions, so their presence can always be assumed. Hence "stock R" (whatever that means) is already optimized - that's why some results differ between 32-bit and 64-bit builds: the latter typically uses SIMD to do FP math instead of the x87 FPU. But also remember that SIMD instructions are very primitive operations; they won't help with complex computations (unless you hand-code an algorithm with them, which some libraries do). That's why there are hand-tuned versions of BLAS, and R simply leverages them if you let it.
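[Editor's note: the "R leverages BLAS if you let it" point is visible in a timing contrast; a sketch, not from the thread - the naive function is for illustration only and the absolute times depend on the machine and BLAS.]

```r
n <- 300
A <- matrix(rnorm(n * n), n)

# Whole-matrix operations dispatch to compiled, possibly SIMD-tuned BLAS.
blas_time <- system.time(A %*% A)["elapsed"]

# The same product written as interpreted R loops (illustration only).
naive <- function(A) {
  n <- nrow(A); C <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) C[i, j] <- sum(A[i, ] * A[, j])
  C
}
naive_time <- system.time(naive(A))["elapsed"]

c(blas = blas_time, naive = naive_time)
```

The orders-of-magnitude gap is the payoff of hand-tuned compiled kernels; SIMD helps inside those kernels, not inside interpreted R loops.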

Cheers,
Simon
#
I have not yet used the device, but my understanding is that it would
take some fiddling to even make it run R's parallel/multicore (multiple
forked processes).  They mean it when they call it a coprocessor, so it
is somewhat like a GPU.  There are issues with the host processor memory
versus the coprocessor memory, etc.  

On the other hand, it is set up to use OpenMP, which makes C/C++/Fortran
programming fairly easy.  Thus selected operations in R could be
extended to a version involving this hardware.

I mentioned this device in my 2012 useR! talk, in the context of the
Thrust library, which features multiple backends, including CUDA and
OpenMP.  I argued that interfacing R to Thrust allows one to "hedge
one's bets" in guessing whether Intel or NVIDIA will ultimately prevail
in this market segment.

Norm