
Creating a Factor Object in C code?

Rory,
On Dec 27, 2012, at 3:14 AM, Rory Winston wrote:

            
varchars are character strings. A factor consists of an index and a level set, so if your DB doesn't keep those separate, it is not a factor (and below you suggest it doesn't). Even if the DB supports ordered and unordered sets, the drivers typically return only the strings anyway, so you don't get at the set (without querying the schema). To make the point - with a factor you can have a column consisting of the values A,A,B,B and a level set of A,B,C (i.e. C is not used, so it is extra information that you cannot express in a character string). If you have neither the levels information nor the order, then it's just a character vector.
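To make the A,A,B,B example concrete, here is a minimal plain-C sketch of the index-plus-level-set idea. The names factor_view, factor_value and factor_unused_levels are made up for illustration and are not part of R's C API; the codes are 1-based, as they are in R.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical picture of a factor: integer codes that index into a level set. */
typedef struct {
    const int   *codes;    /* 1-based level code per element */
    size_t       n;        /* number of elements */
    const char **levels;   /* the level set, used or not */
    size_t       nlevels;
} factor_view;

/* String value of element i, recovered through the level set. */
const char *factor_value(const factor_view *f, size_t i) {
    assert(i < f->n);
    int code = f->codes[i];
    assert(code >= 1 && (size_t)code <= f->nlevels);
    return f->levels[code - 1];
}

/* Number of levels that no element refers to. */
size_t factor_unused_levels(const factor_view *f) {
    size_t unused = 0;
    for (size_t l = 1; l <= f->nlevels; l++) {
        int seen = 0;
        for (size_t i = 0; i < f->n && !seen; i++)
            seen = (f->codes[i] == (int)l);
        if (!seen) unused++;
    }
    return unused;
}
```

With codes {1,1,2,2} and levels {"A","B","C"}, the values read back as A,A,B,B while one level (C) is unused. The plain strings A,A,B,B could not carry that extra level.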
It really depends on what you want to get out and what your input really is. If your DB will be delivering results in rows, probably the most efficient way to construct a factor from string input is to simply create the index as you go and keep a hash of the levels. Then at the end you just put the two together into one factor object. Note that if your DB doesn't pre-specify the levels then the order is undefined.
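The row-by-row approach above could look roughly like the following in plain C. This is only a sketch under assumptions: the factor_builder type and the fb_* functions are hypothetical names, the hash is a simple FNV-1a with linear probing, allocation failure just aborts, and NULL/NA values are not handled. Levels come out in first-seen order, which is one way the order ends up arbitrary when the DB does not pre-specify it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical builder; none of these names come from R's C API. */
typedef struct {
    char  **levels;   /* distinct strings in first-seen order */
    size_t  nlevels;
    int    *codes;    /* 1-based level code for each row */
    size_t  n, cap;
    int    *slots;    /* hash table: 0 = empty, otherwise a level code */
    size_t  nslots;   /* always a power of two, at least 2 * nlevels */
} factor_builder;

static void *xrealloc(void *p, size_t bytes) {
    void *q = realloc(p, bytes);
    if (!q) abort();
    return q;
}

static int *new_slots(size_t n) {
    int *s = calloc(n, sizeof *s);
    if (!s) abort();
    return s;
}

/* FNV-1a string hash. */
static uint32_t hash_str(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 16777619u; }
    return h;
}

/* Put an already-stored level code into the hash table. */
static void fb_place(factor_builder *fb, int code) {
    assert(code >= 1 && (size_t)code <= fb->nlevels);
    size_t i = hash_str(fb->levels[code - 1]) & (fb->nslots - 1);
    while (fb->slots[i]) i = (i + 1) & (fb->nslots - 1);
    fb->slots[i] = code;
}

/* Double the table (and the level array) and rehash all levels. */
static void fb_grow(factor_builder *fb) {
    free(fb->slots);
    fb->nslots *= 2;
    fb->slots  = new_slots(fb->nslots);
    fb->levels = xrealloc(fb->levels, fb->nslots * sizeof *fb->levels);
    for (size_t k = 0; k < fb->nlevels; k++) fb_place(fb, (int)k + 1);
}

/* Return the code for s, registering it as a new level if unseen. */
static int fb_level(factor_builder *fb, const char *s) {
    size_t i = hash_str(s) & (fb->nslots - 1);
    while (fb->slots[i]) {
        if (strcmp(fb->levels[fb->slots[i] - 1], s) == 0) return fb->slots[i];
        i = (i + 1) & (fb->nslots - 1);
    }
    if (2 * (fb->nlevels + 1) > fb->nslots) fb_grow(fb);
    size_t len = strlen(s) + 1;
    char *copy = xrealloc(NULL, len);
    memcpy(copy, s, len);
    fb->levels[fb->nlevels++] = copy;
    fb_place(fb, (int)fb->nlevels);
    return (int)fb->nlevels;
}

void fb_init(factor_builder *fb) {
    memset(fb, 0, sizeof *fb);
    fb->nslots = 16;
    fb->slots  = new_slots(fb->nslots);
    fb->levels = xrealloc(NULL, fb->nslots * sizeof *fb->levels);
}

/* Append one row value: the "create the index as you go" step. */
void fb_add(factor_builder *fb, const char *s) {
    if (fb->n == fb->cap) {
        fb->cap   = fb->cap ? 2 * fb->cap : 64;
        fb->codes = xrealloc(fb->codes, fb->cap * sizeof *fb->codes);
    }
    fb->codes[fb->n++] = fb_level(fb, s);
}

void fb_free(factor_builder *fb) {
    for (size_t k = 0; k < fb->nlevels; k++) free(fb->levels[k]);
    free(fb->levels);
    free(fb->codes);
    free(fb->slots);
}
```

After calling fb_add() once per row, codes and levels hold the two halves. On the R side a factor is an integer vector with a "levels" attribute and class "factor", so the final step would copy codes into an integer vector and attach levels as a character vector. That part needs R's headers and is left out of this sketch.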

If you are collecting the whole character vector first anyway, then I see no real point in not using as.factor() - even from C code.
Note, however, that in such a case you should really give the user an option not to do that - dealing with factors is very painful and they are bad for data manipulation, so many users (including me) prefer to set the stringsAsFactors default to FALSE, because it's much more efficient and less error-prone to deal with character vectors. Having to convert factors back to strings is very inefficient (in particular with large data) and superfluous, since you already had strings to start with.
It would not, for the reasons above, which is why it's typically done at the R level as an optional post-processing step. That doesn't mean you can't do it in C, but it is somewhat painful as you'll have to hash the levels - it's more convenient to have R do that for you.

Cheers,
Simon