Handling data with thousands of variables
On Sun, 2011-06-26 at 12:52 +0200, Håvard Wahl Kongsgård wrote:
Thanks, I am not very good at describing issues. Your assumptions are correct; however, in my case I have 20,000 distinct keywords and 10,000,000 records, so it is a large data set.
OK, so then I would definitely reiterate the suggestion to pre-process the tuples into factors, so that each of your 20,000 keywords has a numeric representation.
I don't use R for programming, but are slots more or less like a Python dictionary class in R?
'slots' are the (named) list elements in the older S3 class system in R, and many people (including me) use the word 'slot' generically to refer to named list elements in general. Lists in R can contain elements that hold any other arbitrary class of object/data, and so are useful for 'mixed type' collections.

Given the additional information you've provided, and making the further assumption that your tuples are always the same length (three keywords), I would:

1. First construct the factor levels for your keywords; see ?factor.
2. 'Flatten' the data into a four-column representation (response, tuple1, tuple2, tuple3), using the factor-numeric representation for each tuple. This could be stored using either a data.frame or a matrix; if you use the pure numeric representation, a matrix will be faster.
3. Then, as discussed previously, if you intend to parallelize: since your individual records are small, consider how to batch them up for sending to your worker/compute processes.

Regards,

- Brian
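A minimal sketch of those steps in R. The data sizes, column names, and batch size here are stand-ins I've invented for illustration, not values from the thread; in practice you would read your real records and use all 20,000 keywords.

```r
## Hypothetical example data; sizes are scaled down for illustration.
set.seed(1)
keywords <- paste0("kw", 1:20)   # stand-in for the 20,000 real keywords
n <- 100                         # stand-in for the 10,000,000 records

## Simulated records: a response plus a three-keyword tuple
records <- data.frame(
  response = rnorm(n),
  k1 = sample(keywords, n, replace = TRUE),
  k2 = sample(keywords, n, replace = TRUE),
  k3 = sample(keywords, n, replace = TRUE),
  stringsAsFactors = FALSE
)

## Step 1: one common set of factor levels across all tuple positions,
## so the same keyword maps to the same integer code everywhere
levs <- sort(unique(c(records$k1, records$k2, records$k3)))

## Step 2: flatten to a purely numeric matrix:
## response, tuple1, tuple2, tuple3 (keywords as integer codes)
m <- cbind(
  response = records$response,
  tuple1   = as.integer(factor(records$k1, levels = levs)),
  tuple2   = as.integer(factor(records$k2, levels = levs)),
  tuple3   = as.integer(factor(records$k3, levels = levs))
)

## Step 3: batch row indices for dispatch to worker processes
## (batch size of 25 rows is arbitrary here)
batches <- split(seq_len(n), ceiling(seq_len(n) / 25))
```

Keeping a single `levs` vector is the important detail: it guarantees a consistent keyword-to-integer mapping across all three tuple columns, and it is all a worker needs to translate codes back to keywords.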
Brian G. Peterson http://braverock.com/brian/ Ph: 773-459-4973 IM: bgpbraverock