I have just started using the RMySQL package and things are working out
well, but I would like to know if it is possible to return a data.frame
from dbGetQuery with class(dta$race) => factor rather than
class(dta$race) => character. I want to use glm on the resultant data,
but I cant figure out how to change the columns without effort.
Currently what I do is write the dta to a local file and then re-read it
at which point the columns is of class factor.
Thanks in advance, The code I am using to get the data is:
getrace <- function(connection) {
sql <- "SELECT race FROM PATIENT"
rs <- dbGetQuery(connection, sql)
class(rs$race)
}
which returns "character"
Jim
RMySQL and factors
5 messages · James Bullard, mikko@virt@@e@ m@iii@g oii kti@ii, Don MacQueen +1 more
How about:
getrace <- function(connection) {
sql <- "SELECT race FROM PATIENT"
rs <- dbGetQuery(connection, sql)
rs$trace <- factor(rs$trace)
class(rs$race)
}
At 1:50 AM -0800 2/3/05, James Bullard wrote:
I have just started using the RMySQL package and things are working out
well, but I would like to know if it is possible to return a data.frame
from dbGetQuery with class(dta$race) => factor rather than
class(dta$race) => character. I want to use glm on the resultant data,
but I cant figure out how to change the columns without effort.
Currently what I do is write the dta to a local file and then re-read it
at which point the columns is of class factor.
Thanks in advance, The code I am using to get the data is:
getrace <- function(connection) {
sql <- "SELECT race FROM PATIENT"
rs <- dbGetQuery(connection, sql)
class(rs$race)
}
which returns "character"
Jim
_______________________________________________ R-sig-DB mailing list -- R Special Interest Group R-sig-DB at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/r-sig-db
-------------------------------------- Don MacQueen Environmental Protection Department Lawrence Livermore National Laboratory Livermore, CA, USA
Don MacQueen <macq at llnl.gov> wrote:
How about:
getrace <- function(connection) {
sql <- "SELECT race FROM PATIENT"
rs <- dbGetQuery(connection, sql)
rs$trace <- factor(rs$trace)
class(rs$race)
}
This has the potential problem that if the data changes or the sql statement has a WHERE in it, each run might get slightly different levels(). I guess the MySQL does have the ENUM data type, and the RSDBI does have the hooks for using this information, but nothing seems to be implemented at the moment. MJ;
At 10:03 PM +0200 2/3/05, mikko.virtanen at ktl.fi wrote:
Don MacQueen <macq at llnl.gov> wrote:
How about:
getrace <- function(connection) {
sql <- "SELECT race FROM PATIENT"
rs <- dbGetQuery(connection, sql)
rs$trace <- factor(rs$trace)
class(rs$race)
}
This has the potential problem that if the data changes or the sql statement has a WHERE in it, each run might get slightly different levels().
I don't think that presents a problem. Either you know from the experimental design what all of the possible levels are, or you don't. If you don't, there is no way to make sure that all will be present every time the function is used., no matter whether a 'where' clause is added or not, or whether or not the table changes. If you do know what the possible levels are, you can explicitly list them in the call to factor(). See ?factor. If at any given point in time the database table is considered to have all the possible levels present, and you want to have all of them present in the factor variable in R, even when not all levels are retrieved from the database, it's still easy: Add another query, a select distinct on the field that is to become a factor. Use the results to specify the levels in the call to factor(). Then you will always have the same factors, regardless of what subset is selected.
I guess the MySQL does have the ENUM data type, and the RSDBI does have the hooks for using this information, but nothing seems to be implemented at the moment. MJ;
-Don
-------------------------------------- Don MacQueen Environmental Protection Department Lawrence Livermore National Laboratory Livermore, CA, USA
This has the potential problem that if the data changes or the sql statement has a WHERE in it, each run might get slightly different levels().
I don't think that presents a problem. Either you know from the experimental design what all of the possible levels are, or you don't. If you don't, there is no way to make sure that all will be present every time the function is used., no matter whether a 'where' clause is added or not, or whether or not the table changes. If you do know what the possible levels are, you can explicitly list them in the call to factor(). See ?factor. If at any given point in time the database table is considered to have all the possible levels present, and you want to have all of them present in the factor variable in R, even when not all levels are retrieved from the database, it's still easy:
To expand on the above, one option is to make the database design
accurately reflect the experiment:
Have two tables instead of one:
Patient (int patient_id, char* name, int race_id)
Race(int race_id, char* race_name)
Race contains all possible factor levels, and might be useful elsewhere
in the experiment/project. Patient reflects all observed patient
instances, which may or may not include all race levels. You can
modify the query to something like:
getAllRaces <- function(connection) {
sql <- "SELECT race_name FROM RACE"
rs <- dbGetQuery(connection, sql)
allRaces <- rs$race_name
}
getRaceAsFactors <- function(connection) {
sql <- "SELECT t2.race_name FROM PATIENT t1, RACE t2 where t1.race_id
= t2.race_id"
rs <- dbGetQuery(connection, sql)
allRaces <- getAllRaces(connection)
raceFactors <- factor(rs$race_name, levels=allRaces)
}
of course, making a second database table may be overkill depending on
the size of your project, and how often all race levels are needed in
your code.
Jim