Skip to content

RMySQL and factors

5 messages · James Bullard, mikko@virt@@e@ m@iii@g oii kti@ii, Don MacQueen +1 more

#
I have just started using the RMySQL package and things are working out
well, but I would like to know if it is possible to return a data.frame
from dbGetQuery with class(dta$race) => factor rather than
class(dta$race) => character. I want to use glm on the resultant data,
but I cant figure out how to change the columns without effort. 

Currently what I do is write the dta to a local file and then re-read it
at which point the columns is of class factor.

Thanks in advance, The code I am using to get the data is:

getrace <- function(connection) {
  sql <- "SELECT race FROM PATIENT"
  rs <- dbGetQuery(connection, sql)

  class(rs$race)  
}

which returns "character" 


Jim
#
How about:

getrace <- function(connection) {
   sql <- "SELECT race FROM PATIENT"
   rs <- dbGetQuery(connection, sql)

   rs$trace <- factor(rs$trace)

   class(rs$race)
}
At 1:50 AM -0800 2/3/05, James Bullard wrote:

  
    
#
Don MacQueen <macq at llnl.gov> wrote:
This has the potential problem that if the data changes or the sql statement has
a WHERE in it, each run might get slightly different levels().

I guess the MySQL does have the ENUM data type, and the RSDBI does have the
hooks for using this information, but nothing seems to be implemented at the
moment.

   MJ;
#
At 10:03 PM +0200 2/3/05, mikko.virtanen at ktl.fi wrote:
I don't think that presents a problem. Either you know from the 
experimental design what all of the possible levels are, or you don't.

If you don't, there is no way to make sure that all will be present 
every time the function is used., no matter whether a 'where' clause 
is added or not, or whether or not the table changes.

If you do know what the possible levels are, you can explicitly list 
them in the call to factor(). See ?factor.

If at any given point in time the database table is considered to 
have all the possible levels present, and you want to have all of 
them present in the factor variable in R, even when not all levels 
are retrieved from the database, it's still easy:

Add another query, a select distinct on the field that is to become a 
factor. Use the results to specify the levels in the call to 
factor(). Then you will always have the same factors, regardless of 
what subset is selected.
-Don
#
To expand on the above, one option is to make the database design 
accurately reflect the experiment:

Have two tables instead of one:
Patient (int patient_id, char* name, int race_id)
Race(int race_id, char* race_name)

Race contains all possible factor levels, and might be useful elsewhere 
in the experiment/project.  Patient reflects all observed patient 
instances, which may or may not include all race levels.  You can 
modify the query to something like:

getAllRaces <- function(connection) {
	sql <- "SELECT race_name FROM RACE"
	rs <- dbGetQuery(connection, sql)
	allRaces <- rs$race_name
}
	
getRaceAsFactors <- function(connection) {
	sql <- "SELECT t2.race_name FROM PATIENT t1, RACE t2 where t1.race_id 
= t2.race_id"
	rs <- dbGetQuery(connection, sql)
	allRaces <- getAllRaces(connection)
	raceFactors <- factor(rs$race_name, levels=allRaces)
  }

of course, making a second database table may be overkill depending on 
the size of your project, and how often all race levels are needed in 
your code.

Jim