
Do you use R for data manipulation?

18 messages · Farrel Buchinsky, Wensui Liu, milton ruser +14 more

#
take a look at sqldf package(http://code.google.com/p/sqldf/), you
will be amazed.
On Wed, May 6, 2009 at 12:22 AM, Farrel Buchinsky <fjbuch at gmail.com> wrote:

#
Well, I am less proficient in R compared with other tools/languages.
Therefore my biased opinion is: it is possible in R, but it may be
easier if you use other tools, especially if you have to build a
user-friendly GUI.

The most accessible (although limited to MS Windows) method would
be building the GUI with an HTA (HTML Application) and JavaScript, which is
nearly the same as creating a web page and calling R from there when
necessary. Less limited, but with a steeper learning curve: Python, Perl,
Tcl/Tk - all open-source tools that can communicate with R, and all with
decent GUI-building tools. Then there are the proprietary Adobe Flex, Flash
and AIR (the latter somewhat resembles HTA), and Runtime Revolution (RR),
which all make it easy to build cross-platform eye candy; these are not free,
although not too expensive either if you can allocate some resources to your
project. I usually hide all the command-line utilities behind GUIs built
with RR. All the tools listed above can easily do any kind of data
manipulation and reshaping, but each has its strengths: Python - tidy
object-oriented syntax and tons of third-party modules; Perl - powerful
regular expressions and tons of modules; RR - database connectivity, chunk
expressions (item, char, word, line, etc.) and a syntax that makes data
manipulation much, much easier.

But I may be wrong, so let me ask here another related question
(new thread?..) for the group - what do you use to build graphical user
interfaces for end-users of your tools in R?

All the best
Viktoras
Farrel Buchinsky wrote:
#
Sorry for replying to the wrong person, I lost the original email.
I personally started to use R because I got tired of manually writing scripts
for data manipulation and processing.  The argument of your new recruit smells
of ignorance and resistance to learning something new.  Ask her _how_ she
assessed R, how much time she spent on her assessment, and whether she
actually tried to run it and perform some concrete simple tasks.

(Yes, R is somewhat "different", it has a steep learning curve, but the effort
of learning it is worth it.  And yes, R can be used in the same way as any
other scripting language, i.e., it is not restricted to interactive work.)

Take a look at the plyr and reshape packages (http://had.co.nz/), I have a hunch
that they would have saved me a lot of headache had I found out about them
earlier :)
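plyr's core idea is split-apply-combine: split the data by a grouping variable, apply a function to each piece, and combine the results. A minimal sketch of that pattern in plain Python, on invented data (a group mean per key):

    from collections import defaultdict

    records = [("a", 1), ("b", 4), ("a", 3), ("b", 2)]

    # Split by group key, apply a summary, combine into one result.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)

    summary = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    print(summary)  # {'a': 2.0, 'b': 3.0}
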

I would also recommend investing in Phil Spector's book "Data manipulation with
R", it will get you started much faster.

I also find R's image files very convenient for sharing data (and code!) in a
very compact format (a single file, portable across architectures).  When you
quit your R session, all the variables and functions get saved in the image
file, which you can take with you (or send to somebody else); start R again,
load the image into a new session and continue from where you left off.  You
won't get this kind of automatic persistence in any scripting language out of
the box.
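The closest thing in a general scripting language is pickling your workspace by hand; R's .RData image does the equivalent automatically on quit. A rough Python sketch of what you would have to script yourself (the variable names and file name are arbitrary):

    import os
    import pickle
    import tempfile

    workspace = {"x": [1, 2, 3], "scale": 10}  # stand-in for session variables

    path = os.path.join(tempfile.mkdtemp(), "workspace.pkl")
    with open(path, "wb") as f:          # "quit": save everything
        pickle.dump(workspace, f)

    with open(path, "rb") as f:          # "start again, load the image"
        restored = pickle.load(f)
    print(restored["x"])  # [1, 2, 3]
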
I'd go with 1).  R also has interfaces to databases through RODBC, so you
do not have to go through several conversions when you're about to process or
plot data in R.
#
On Wednesday 6 May 2009 at 00:22 -0400, Farrel Buchinsky wrote:
[ Large Snip ! ... ]

Depends on what you have to do.

I've done what can be more or less termed "data management" with almost
uncountable tools (from Excel (sigh...) to R, with SQL, APL, Pascal, C,
Basic (in 1982!), Fortran and even Lisp in passing...).

SQL has strong points: a join is, to my taste, more easily expressed in
SQL than in most languages, and projection and aggregation are natural.

However, in SQL there is no "natural" ordering of table rows, which
makes it difficult to express algorithms that use this order. Try, for
example, to express the differences of a time series... (it can be done,
but it is *not* a pretty sight).
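To make that point concrete: first-differencing a series in classic SQL (no window functions) needs a correlated subquery to find each row's predecessor, where R would just use diff(x). A sketch via Python's sqlite3, with an invented table:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE ts (t INTEGER, val REAL)")
    con.executemany("INSERT INTO ts VALUES (?, ?)",
                    [(1, 10.0), (2, 13.0), (3, 11.0)])

    # First differences without window functions: not a pretty sight.
    diffs = con.execute("""
        SELECT a.t, a.val - (SELECT b.val FROM ts b
                             WHERE b.t < a.t ORDER BY b.t DESC LIMIT 1)
        FROM ts a WHERE a.t > (SELECT MIN(t) FROM ts) ORDER BY a.t
    """).fetchall()
    print(diffs)  # [(2, 3.0), (3, -2.0)]
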

On the other hand, R has some unique expressive possibilities (reshape()
comes to mind).

So I tend to use a combination of tools: except for very small samples,
I manage my data in SQL and with associated tools (think data
editing, for example; a simple form in OpenOffice's Base is quite easy
to create, can handle anything for which an ODBC driver exists, and
won't crap out at more than a few hundred lines...). Finer manipulation
is usually done in R with native tools and sqldf.

But, at least in my trade, the ability to handle Excel files is a must
(it is considered a standard for data entry. Sigh...). So the
first task is usually to a) import the data into an SQL database, and b)
prepare some routines to dump SQL tables / R data frames to Excel for
returning data to the original author...

HTH

					Emmanuel Charpentier
#
On Wed, May 06, 2009 at 12:22:45AM -0400, Farrel Buchinsky wrote:
I happily use both approaches depending on the original format the
data come in:

For data that are not in a "well behaved" format and require actual
parsing, I tend to use Python scripts for transmogrifying the data
into nice and tidy tables (and maybe some very basic filtering). For
everything after that I prefer R. I also use Python if the relevant
data need to be harvested and assembled from many different sources
(e.g. data files + web + databases).
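As an example of that kind of transmogrification, here is a sketch that turns an irregular "key: value" export (the format is invented for illustration) into a tidy CSV table:

    import csv
    import io

    raw = """\
    id: 1
    weight: 72.5

    id: 2
    weight: 81.0
    """

    # Parse blank-line-separated records into dicts, then write a tidy CSV.
    records, current = [], {}
    for line in raw.splitlines():
        if not line.strip():
            if current:
                records.append(current)
                current = {}
            continue
        key, value = line.split(":", 1)
        current[key.strip()] = value.strip()
    if current:
        records.append(current)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "weight"])
    writer.writeheader()
    writer.writerows(records)
    print(out.getvalue())
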

Once the data files are easy to read (csv, tab-separated, database,
...) and the task is to reshape, filter and clean the data, I usually
do it in R. R has true advantages here:

 - After reading a table into a data frame I can immediately tell if all
   measurements are what they are supposed to be (integer, numeric,
   factor, boolean), and functions like read.table even do quite a bit of
   error checking for me (equal number of columns etc.)

 - Finding out if factors have the right (or plausible) number of levels is easy
 
 - Filtering by logical indexing

 - Powerful and reliable reshaping (reshape package)

 - Very convenient diagnostics: str(), dim(), table(), summary(),
   plotting the data in various ways, ...
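To appreciate what read.table checks for free, here is roughly the sort of thing one ends up scripting by hand elsewhere - a minimal sketch of column-count and type validation in Python (the data are invented):

    rows = [["a", "1.5"], ["b", "2.0"], ["c", "oops"]]

    # read.table-style checks done manually: rectangular shape, numeric column.
    assert all(len(r) == 2 for r in rows), "ragged rows"

    bad = []
    for i, (label, value) in enumerate(rows):
        try:
            float(value)
        except ValueError:
            bad.append(i)
    print(bad)  # indices of rows whose second field is not numeric
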

cu
	Philipp
#
I also use the approach Philipp describes below.  I use Python and shell 
scripts for processing thousands of input files and getting all the data 
into one tidy csv table.  From that point onwards it's R all the way 
(often with the reshape package).
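A minimal sketch of that first stage, concatenating many small CSV inputs into one tidy table (the in-memory "files" here stand in for a hypothetical directory of per-run outputs):

    import csv
    import io

    # Stand-ins for thousands of per-experiment input files.
    inputs = ["id,value\n1,10\n", "id,value\n2,20\n"]

    combined = io.StringIO()
    writer = None
    for text in inputs:
        reader = csv.reader(io.StringIO(text))
        header = next(reader)          # skip each file's header
        if writer is None:
            writer = csv.writer(combined)
            writer.writerow(header)    # keep the first header only
        writer.writerows(reader)
    print(combined.getvalue())
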

Paul
Philipp Pagel wrote:
#
I second what Zeljko wrote.  In addition, see the data manipulation 
section in Chapter 4 of 
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RS/sintro.pdf

Frank
Zeljko Vrba wrote:

#
As the author of these two packages, I'm admittedly biased, but I
think R is unparalleled for data preparation, manipulation, and
cleaning (with the small caveat that your data needs to fit in
memory).  The R data frame is a fantastic abstraction that most other
programming languages lack, and vectorised subscripting makes it
possible to express many transformations in an elegant and efficient
manner.  On top of the facilities provided by base R, there are a huge
number of packages available to load data from just about every data
format, as well as a number of packages (plyr, reshape, sqldf, doBy,
gdata, scope, ...) for data manipulation - just pick the metaphor that
is most natural to you.

Hadley
#
In my opinion, no statistician's toolbox should contain only one tool (even if it is as amazing a tool as R).  Learning different tools helps you appreciate when each is most appropriate to use, and teaches different ways of looking at problems.  There are some tasks that I (it could easily differ for others) find quickest to handle by doing some data extraction in Perl, then loading the results into R.

Having said the above, I do admit that the percentage of time I spend using tools other than R for working with data has gone down quite a bit over time.  Three possible reasons:

1. my clients are getting better at giving me the data in appropriate forms
2. my proficiency with R continues to grow, so I can better see how to do something using R
3. R continues to grow, with more and more tools to help manage data.

And a possible fourth: 4. I am getting too lazy in my old age to switch to other programs.

While I like to think that I am having success at educating my clients, number 1 contributes very little overall; number 3 is definitely a big contributor, and hopefully number 2 is part of the reason as well.
#
Another tool I find useful is Matthew Dowle's data.table package. It
has very fast indexing, can have much lower memory requirements than a
data frame, and has some built-in data manipulation capability.
Especially with a 64-bit OS, you can use this to keep things in memory
where you otherwise would have to use a database.

See here: http://article.gmane.org/gmane.comp.lang.r.packages/282

- Tom
1 day later
#
+1. I worked with Matthew for a while and saw in practice just how 
powerful that package is.
I'm surprised it isn't more widely used.

Martin
Tom Short wrote:
3 days later
#
2009/5/6 Emmanuel Charpentier <charpent at bacbuc.dyndns.org>:
I don't think Excel is a standard tool for data entry. EpiData Entry
is much more professional.

#
I am not a statistician and not a computer scientist by education. I
consider myself an R novice and came to R - thanks to my boss - from
an SPSS background. I work for a market research company and the most
typical data files we deal with are not huge - up to several thousand
rows and up to a thousand variables.
I would say, on certain projects, most of what we do in R (if you look
at the number of lines in R we devote to a given task) is data
manipulation. The actual statistical method is frequently just a line
- all the rest is getting the data shaped right and then spitting out
the results of the analysis in a way that is usable (i.e.,
presentable).
I find R to be excellent for the data manipulations that we perform. First
of all, it's great that you can always grab the variables/cases you need
and ignore all the rest. In SPSS you just keep staring at all those
data and variables that you don't need, trying to find the one you
do.
Second - I find R to be incredibly fast (as opposed to SPSS or Excel)
with the amounts of data we are dealing with.
And third - nothing is "written in stone" and your original data are
always untouched - you can always read them in again and again. For
example, if I create a new variable and make a mistake, I can always
fix the code, rerun that piece of the code, and that gives me the
corrected object containing the new variable. I never touch the
original data and hence never "spoil" them.

Dimitri
On Mon, May 11, 2009 at 11:20 AM, ronggui <ronggui.huang at gmail.com> wrote:

#
On Monday 11 May 2009 at 23:20 +0800, ronggui wrote:

[ Snip... ]
[ Re-snip... ]
Irony squared ?

This *must* go in the fortunes file !
					Emmanuel Charpentier
1 day later
#
Farrel Buchinsky wrote:
It's hard to shift people's individual preferences, but impressive 
objective comparisons are easy to come by.  Ask her how many lines it 
would take to do this trivial R task in Python:

	data <- read.csv('original-data.csv')
	write.csv(data * 10, 'scaled-data.csv')

R's ability to do something to an entire data structure -- or a slice of 
it, or some other subset -- in a single operation is very useful when 
cleaning up data for presentation and analysis.  Also point out how easy 
it is to get data *out* of R, as above, not just into it, so you can 
then hack on it in Python, if that's the better language for further 
manipulation.

If she gives you static about how a few more lines are no big deal,
remind her that bug counts tend to scale with line count, a finding
that goes back to the 1970s.

While making your points, remember that she has a good one, too: R is 
not the only good language out there.  You should learn Python while 
she's learning R.
#
Warren Young wrote:
You might want to learn that this is a question of appropriate
libraries.  In R, read.csv and write.csv reside in the utils package.
In Python, you'd use numpy:

    from numpy import loadtxt, savetxt
    savetxt('scaled.csv', loadtxt('original.csv', delimiter=',') * 10,
            delimiter=',')

That makes two lines, together with importing the library.
But this is really *hardly* R-specific.  You can do it in many, many
languages, be assured.  Just look around.
That "bug count" line is a slogan, especially when you think of how compact
(but unreadable, and thus error-prone) code written in Perl can be.  Often,
more lines of code make it easier to maintain, and thus help avoid bugs.
+1