Handling data with thousands of variables
OK, but what about memory usage? For now I have implemented my analysis in Python with NumPy arrays, with only 100 000 cases and 10 000 keywords, and the memory required for such a large array or matrix is already massive. In R one possibility is the bigmemory package, but it is slow, and if I remember correctly its big.matrix objects are not supported by most other R packages. -Håvard
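As a rough sanity check on why the dense representation blows up (a back-of-envelope figure, not specific to NumPy or R): a 100 000 x 10 000 matrix of doubles already needs about 8 GB, before either language makes any working copies.

100000 * 10000 * 8 / 2^30   # bytes for a dense double matrix, ~7.45 GiB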
On Fri, Jul 1, 2011 at 10:02 AM, Han De Vries <handevries at gmail.com> wrote:
Perhaps you want to store your data in a big 10 mln rows x 20000 columns matrix, where each cell is 1 when the corresponding keyword applies to a record and 0 otherwise. Because you will end up with very many zeroes, such a matrix can be stored efficiently as a sparse matrix (using the Matrix package).

The Matrix package itself offers various tools to quickly summarize by rows or columns, and supports many other types of estimation as long as they can be translated into matrix operations (like linear regression). Some other packages, such as glmnet, can read these matrices directly for more specific analyses. If you have sufficient memory (you want to keep the entire sparse matrix in memory), handling the data can be really fast.

Because you're asking about personal experiences: I have been using this approach with (sparse) matrices of up to a few million rows (records) and 20K columns (variables).

Kind regards,
Han
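A minimal sketch of the workflow Han describes, assuming only the Matrix and glmnet packages; the sizes are shrunk for illustration and the 0/1 response y is made up:

library(Matrix)

# Toy indicator matrix: rows = records, columns = keywords; a cell is 1
# when the keyword applies to the record (sizes shrunk for illustration).
set.seed(1)
n <- 10000; p <- 2000; nnz <- 100000
X <- sparseMatrix(i = sample(n, nnz, replace = TRUE),
                  j = sample(p, nnz, replace = TRUE),
                  x = 1, dims = c(n, p))
X <- 1 * (X > 0)   # binarize: duplicated (i, j) pairs would otherwise sum

# Fast row/column summaries without ever densifying:
keyword_counts <- colSums(X)   # how often each keyword occurs
record_sizes <- rowSums(X)     # how many keywords each record carries

# glmnet accepts the sparse (dgCMatrix) matrix directly:
library(glmnet)
y <- rbinom(n, 1, 0.5)   # hypothetical response, purely for illustration
fit <- glmnet(X, y, family = "binomial")

Since the sparse form stores only the nonzero cells plus their indices, memory scales with the number of keyword hits rather than with rows x columns, which is what makes the 10 mln x 20000 case feasible in RAM.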