For certain calculations, I have to handle a dataframe with, say, 10
million rows and multiple columns of different datatypes.
When I try to perform calculations on certain elements in each row, the
program just goes into "busy" mode for a really long time.
To avoid this "busy" mode, I split the dataframe into subsets of 10,000
rows. The calculations then finished quickly, within a reasonable time.
Are there any other tips to improve the performance?
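The chunked approach described above might look like the following sketch (shown in Python/pandas; the column names and the per-row arithmetic are made-up placeholders, since the original computation isn't given):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the large dataframe (columns "a" and "b" are invented).
df = pd.DataFrame({
    "a": np.random.rand(100_000),
    "b": np.random.rand(100_000),
})

chunk_size = 10_000
results = []
for start in range(0, len(df), chunk_size):
    # Work on one 10,000-row slice at a time instead of the whole frame.
    chunk = df.iloc[start:start + chunk_size]
    # Placeholder for the actual per-chunk calculation.
    results.append(chunk["a"] * chunk["b"])

# Stitch the per-chunk results back into one Series.
out = pd.concat(results)
```

Whether chunking actually helps depends on the workload; it mainly pays off when intermediate results would otherwise exhaust memory.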
Depending on what exactly it is you are doing and what causes the slowdown
there may be a number of useful strategies:
- Buy RAM (lots of it) - it's cheap
- Vectorize whatever you are doing
- Don't use all the data you have but draw a random sample of reasonable
  size
- ...
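Of the strategies above, vectorization usually gives the biggest win: express the computation on whole columns at once so it runs in compiled code instead of a Python-level loop over rows. A minimal sketch (the formula and column names are arbitrary stand-ins), which also shows the random-sample suggestion:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": np.arange(1_000_000, dtype=float),
    "y": np.arange(1_000_000, dtype=float),
})

# Slow: a Python-level loop over rows, one interpreter call per row.
# z_slow = df.apply(lambda row: row["x"] + row["y"] ** 2, axis=1)

# Fast: the same computation expressed on whole columns at once.
z_fast = df["x"] + df["y"] ** 2

# Alternatively, work on a random sample instead of the full data.
sample = df.sample(n=10_000, random_state=0)
```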
To be more helpful, we'd have to know:
- what are the computations involved?
- how are they implemented at the moment?
-> example code
- what is the range of "really long time"?
cu
Philipp
--
Dr. Philipp Pagel
Lehrstuhl für Genomorientierte Bioinformatik
Technische Universität München
Wissenschaftszentrum Weihenstephan
85350 Freising, Germany
http://mips.gsf.de/staff/pagel