Handling data with thousands of variables
On Sun, 2011-06-26 at 12:52 +0200, Håvard Wahl Kongsgård wrote:
Thanks, I am not very good at describing issues. Your assumptions are correct; however, in my case I have 20,000 distinct keywords and 10,000,000 records, so it is a large data set.
OK, so then I would definitely reiterate the suggestion to pre-process the tuples into factors, so that each of your 20,000 keywords has a numeric representation.
I don't use R for programming, but are slots more or less like a Python dictionary class in R?
'slots' are the (named) list elements in the older S3 class system in R, and many people (including me) use the word 'slot' generically to refer to named list elements in general. Lists in R can contain elements that hold any other arbitrary class of object/data, and so are useful for 'mixed type' collections.

Given the additional information you've provided, and making the further assumption that your tuples are always the same length (three keywords), I would:

1. First construct the factor levels for your keywords; see ?factor.
2. 'Flatten' the data into a four-column representation (response, tuple1, tuple2, tuple3), using the factor-numeric representation for each tuple. This could be stored using either a data.frame or a matrix; if you use the pure numeric representation, a matrix will be faster.
3. Then, as discussed previously, if you intend to parallelize: since your individual records are small, consider how to batch them up for sending to your worker/compute processes.

Regards,

- Brian
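A minimal sketch of those steps in R. The data sizes, column names, and batch size here are stand-ins I've invented for illustration, not values from the thread; in practice you would read your real records and use all 20,000 keywords.

```r
## Hypothetical example data; sizes are scaled down for illustration.
set.seed(1)
keywords <- paste0("kw", 1:20)   # stand-in for the 20,000 real keywords
n <- 100                         # stand-in for the 10,000,000 records

## Simulated records: a response plus a three-keyword tuple
records <- data.frame(
  response = rnorm(n),
  k1 = sample(keywords, n, replace = TRUE),
  k2 = sample(keywords, n, replace = TRUE),
  k3 = sample(keywords, n, replace = TRUE),
  stringsAsFactors = FALSE
)

## Step 1: one common set of factor levels across all tuple positions,
## so the same keyword maps to the same integer code everywhere
levs <- sort(unique(c(records$k1, records$k2, records$k3)))

## Step 2: flatten to a purely numeric matrix:
## response, tuple1, tuple2, tuple3 (keywords as integer codes)
m <- cbind(
  response = records$response,
  tuple1   = as.integer(factor(records$k1, levels = levs)),
  tuple2   = as.integer(factor(records$k2, levels = levs)),
  tuple3   = as.integer(factor(records$k3, levels = levs))
)

## Step 3: batch row indices for dispatch to worker processes
## (batch size of 25 rows is arbitrary here)
batches <- split(seq_len(n), ceiling(seq_len(n) / 25))
```

Keeping a single `levs` vector is the important detail: it guarantees a consistent keyword-to-integer mapping across all three tuple columns, and it is all a worker needs to translate codes back to keywords.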
Brian G. Peterson http://braverock.com/brian/ Ph: 773-459-4973 IM: bgpbraverock