I am using R as a data manipulation tool for a SQL database. In some of
my R scripts I use the RODBC package to retrieve data, run an analysis,
and then use the sqlSave function from the same package to store the
results in a database.
There are two closely related problems I want to avoid: (1) having R
rerun an analysis that has already been done and saved to the output
database table, and (2) ending up with more than one identical row in
my output database table.
-------------------------------------
The analysis I am running allows the user to input a large number of
variables, for example:
date, version, a, b, c, d, e, f, g, ...
After R completes its analysis, I write the results to a database table in
the format:
Value, date, version, a, b, c, d, e, f, g, ...
where Value is the result of the R analysis, and the rest of the columns are
the criteria that were used to get that value.
--------------------------------------
Can anyone think of a way to address these problems? The only thing I can
think of so far is to run an sqlQuery at the start to get a table of all
the variable combinations that are already saved, and then simply avoid
computing and re-outputting those results. However, my results database
table currently has over 200K rows (and will grow very quickly as I keep
going with this project), so I doubt that would be the most expeditious
answer: just the SQL query to download 200K rows x 10+ columns is likely
to be time consuming in and of itself.
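To make the idea concrete, here is a rough sketch in base R of the skip
logic I have in mind. Everything in it is made up for illustration: the
data frames `done` and `requested`, the reduced column set (date, version,
a, b), and the helper `make_key` are hypothetical, and `done` stands in for
what an sqlQuery on the output table would return.

```r
# Hypothetical stand-in for the key columns already stored in the output
# table. In practice this might come from something like
#   done <- sqlQuery(channel, "SELECT date, version, a, b FROM results")
# where `channel` and the table name `results` are assumptions.
done <- data.frame(
  date    = c("2011-01-01", "2011-01-01"),
  version = c(1, 2),
  a       = c(10, 10),
  b       = c(5, 5),
  stringsAsFactors = FALSE
)

# Combinations requested in this run (also invented for illustration).
requested <- data.frame(
  date    = c("2011-01-01", "2011-01-01", "2011-01-02"),
  version = c(1, 3, 1),
  a       = c(10, 10, 10),
  b       = c(5, 5, 5),
  stringsAsFactors = FALSE
)

key_cols <- c("date", "version", "a", "b")

# Build one string key per row so rows can be compared with %in%.
# Assumes no value contains the "|" separator.
make_key <- function(df) do.call(paste, c(df[key_cols], sep = "|"))

# Drop duplicate requests, then keep only combinations not yet stored.
requested <- unique(requested)
todo <- requested[!(make_key(requested) %in% make_key(done)), , drop = FALSE]

print(todo)
```

Only the rows left in `todo` would then be analysed and appended, for
example with sqlSave(channel, results, tablename = "results", append =
TRUE, rownames = FALSE), where `results` is again a hypothetical data
frame. This still downloads every stored key combination, which is
exactly the cost I am worried about.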
I know this is kind of a weird problem, and I am open to all sorts of ideas...
Thanks!