Tips for improving performance when handling huge data?
For certain calculations, I have to handle a dataframe with, say, 10 million rows and multiple columns of different datatypes. When I try to perform calculations on certain elements in each row, the program just stays in "busy" mode for a very long time. To avoid this "busy" mode, I split the dataframe into subsets of 10,000 rows; the calculation then finished within a reasonable time. Are there any other tips to improve performance?
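The post doesn't include code or name a language, so the following is only a minimal sketch of the chunking idea it describes, written in Python/pandas; the frame, column names, and the per-chunk computation are all placeholders, not the poster's actual workload.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the large one (the real case had ~10 million rows).
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({"a": rng.normal(size=n_rows),
                   "b": rng.normal(size=n_rows)})

# Process the frame in fixed-size chunks instead of all at once,
# mirroring the poster's split into subsets of 10,000 rows.
chunk_size = 10_000
results = []
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    # Placeholder per-chunk computation; the real one is unspecified.
    results.append(chunk["a"] * chunk["b"])

out = pd.concat(results)
```

Chunking mainly helps when intermediate results would otherwise not fit in memory; if the per-row work itself is slow, vectorizing it (as the reply below suggests) usually matters more.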
Depending on what exactly you are doing and what causes the slowdown, there are a number of useful strategies:

- Buy RAM (lots of it) - it's cheap
- Vectorize whatever you are doing
- Don't use all the data you have, but draw a random sample of reasonable size
- ...

To be more helpful, we'd have to know:

- what are the computations involved?
- how are they implemented at the moment? -> example code
- what is the range of "really long time"?

cu
Philipp
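To make the "vectorize" suggestion concrete: the thread shows no code, so here is an illustrative Python/pandas sketch contrasting a row-by-row computation with its vectorized equivalent; the frame, columns, and formula are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame; the real one would have ~10 million rows.
df = pd.DataFrame({"x": np.arange(100_000, dtype=float),
                   "y": np.arange(100_000, dtype=float)})

# Row-by-row: one Python-level function call per row -- very slow on big
# frames (only the first 1000 rows are used here to keep it quick).
slow = df.head(1000).apply(lambda row: row["x"] * 2 + row["y"], axis=1)

# Vectorized: one expression over whole columns, executed in compiled code.
fast = df["x"] * 2 + df["y"]
```

The two produce the same numbers, but the vectorized form avoids the per-row interpreter overhead, which is typically where the "busy" time goes.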
Dr. Philipp Pagel
Lehrstuhl für Genomorientierte Bioinformatik
Technische Universität München
Wissenschaftszentrum Weihenstephan
85350 Freising, Germany
http://mips.gsf.de/staff/pagel