Parallel Scan of Large File

----------------------------------------
I can't comment on R approaches, but if your rationale here is speed
and you hope to scale this up to bigger files, I would suggest more
analysis or measurement first. In the case you outline, disk IO is probably
going to be the rate-limiting step. It usually helps if you can make
access predictable so the disk and memory caches can be used efficiently.
If you split up disk IO among different threads, there is no reasonable
way for the hardware to figure out which access is likely to come next.
Further, things like "skip()" are often implemented as dummy reads
on sequential file access calls, so skipping ahead can cost as much as
reading.

If you pursue this, I'd be curious to see what kind of results you get as
you go from 1 to 8 cores with larger files.
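
One way to take that measurement is sketched below in Python (not R, since
I can't speak to the R side). It is only an illustration: the function names,
the 1 MiB read size, and the choice to count newlines as the "scan" are all
assumptions of mine, not anything from the original question. Threads are
used for simplicity; a CPU-heavy scan in CPython would likely need processes
to use several cores.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def count_newlines_range(path, start, end):
    """Count newline bytes in the half-open byte range [start, end)."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            buf = f.read(min(CHUNK, remaining))
            if not buf:
                break
            count += buf.count(b"\n")
            remaining -= len(buf)
    return count


def scan_split(path, n_workers):
    """Split the file into up to n_workers byte ranges and scan them concurrently."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    step = -(-size // n_workers)  # ceiling division
    ranges = [(s, min(s + step, size)) for s in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(lambda r: count_newlines_range(path, *r), ranges))


def time_scan(path, n_workers):
    """Return (newline count, elapsed seconds) for one scan."""
    t0 = time.perf_counter()
    n = scan_split(path, n_workers)
    return n, time.perf_counter() - t0
```

Calling time_scan(path, 1) through time_scan(path, 8) on the same file gives
a rough scaling curve. Keep in mind that a second pass over the same file may
be served from the OS cache, so repeated runs are not measuring the disk.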

You would probably be better off if you could find a way to pipeline this
work rather than split it up as you suggest. The idea sounds good, of course:
you end up with 8 cores looking at your text. But you could easily be limited
by some other resource, like bus bandwidth to disk or memory. As each core
gets a bigger chunk, eventually you run out of physical memory, and then of
course you are just doing disk IO for virtual memory paging.
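
To make the pipelining suggestion concrete, here is a minimal Python sketch
under my own assumptions: one reader thread walks the file sequentially and
hands batches of lines to worker threads through a bounded queue, so disk
access stays predictable and memory in flight stays capped. The name
pipelined_count, the batch sizes, and the "count lines containing a byte
string" task are all hypothetical stand-ins for whatever the real scan does.

```python
import queue
import threading


def pipelined_count(path, needle, n_workers=4, batch_lines=10000, max_batches=8):
    """Count lines containing `needle` (bytes) with one sequential reader."""
    q = queue.Queue(maxsize=max_batches)  # bounded: reader blocks when full
    totals = [0] * n_workers              # one slot per worker, no sharing

    def reader():
        batch = []
        with open(path, "rb") as f:
            for line in f:                # strictly sequential file access
                batch.append(line)
                if len(batch) >= batch_lines:
                    q.put(batch)
                    batch = []
        if batch:
            q.put(batch)
        for _ in range(n_workers):
            q.put(None)                   # one stop marker per worker

    def worker(i):
        while True:
            batch = q.get()
            if batch is None:
                break
            totals[i] += sum(1 for line in batch if needle in line)

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(totals)
```

At most max_batches batches wait in the queue at any time, plus the one each
worker is processing, so memory use does not grow with file size. That is the
property the split-by-chunk approach loses as the chunks get bigger.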