Do you use R for data manipulation?
18 messages · Farrel Buchinsky, Wensui Liu, milton ruser +14 more
Take a look at the sqldf package (http://code.google.com/p/sqldf/); you will be amazed.
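For readers unfamiliar with it, here is a minimal sketch of what sqldf offers: plain SQL run directly against an ordinary data frame. The table and values below are invented for illustration, and the example assumes the sqldf package (with its SQLite backend) is installed.

```r
library(sqldf)

# A toy clinical table, standing in for data pulled from a database.
patients <- data.frame(id = 1:4,
                       age = c(34, 51, 47, 29),
                       genotype = c("AA", "AG", "GG", "AG"))

# Plain SQL against an ordinary data frame:
adults <- sqldf("SELECT id, genotype FROM patients WHERE age > 40")
print(adults)
```

Anyone comfortable with SQL joins and aggregation can reuse that knowledge on R data frames without learning R's subsetting idioms first.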
On Wed, May 6, 2009 at 12:22 AM, Farrel Buchinsky <fjbuch at gmail.com> wrote:
Is R an appropriate tool for data manipulation, data reshaping, and data organizing? I think so, but someone who recently joined our group thinks not. The new recruit believes that Python or another language is a far better tool for developing data manipulation scripts that can then be used by several members of our research group. Her assessment is that R is useful only when it comes to data analysis and working with statistical models.

So what do you think:
1) R is a phenomenally powerful and flexible tool, and since you are going to do analyses in R you might as well use it to read data in, merge it, and reshape it to whatever you need.
OR
2) Are you crazy? Nobody in their right mind uses R to pipe the data around their lab and assemble it for analysis.

Your insights would be appreciated.

Details if you are interested. Our setup: hundreds of patients recorded as cases with about 60 variables, entered and stored in a Sybase relational database. High-throughput SNP genotyping platforms saved data output to csv or Excel tables. Previously, not knowing any SQL, I had used Microsoft Access to write queries to get the data that I needed and to merge the genotyping with the clinical database. It was horrible. I could not even use it on anything other than my desktop machine at work.

When I realized that I was going to need to learn R to handle the genetic analyses, I decided to keep Sybase as the data repository for the clinical information and then do all the data manipulation, merging, and piping with R using RODBC. I was and am a very amateur coder. Nevertheless, many, many hours later I have scripts that do what I need them to do, and I understand R code and can tinker with it as needed. My scripts work for me, but they are not exactly user-friendly for others in the laboratory to just run. For instance, depending on what machine the script is being run from, one may need to change the file name or file path and tinker under the hood to accomplish that.
My bias is to fulfill all our data manipulation and reshaping with R. Since I am the principal investigator, it is I who stay constant, while coders or analysts may come and go. I am even more enamored with R for data manipulation since reading a book about it.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
============================== WenSui Liu Acquisition Risk, Chase Blog : statcompute.spaces.live.com Tough Times Never Last. But Tough People Do. - Robert Schuller ==============================
Well, I am less proficient in R compared with other tools/languages, so my biased opinion is: it is possible in R, but it may be easier if you use other tools, especially if you have to build a user-friendly GUI.

The most accessible (although limited to MS Windows only) method would be building a GUI with HTA (HTML Application)/JavaScript, which is nearly the same as creating a web page and calling R from there when necessary. Less limited, but with a steeper learning curve: Python, Perl, Tcl/Tk, all open-source tools that can communicate with R, and all with decent GUI-building tools. Then there are the proprietary Adobe Flex, Flash, and AIR (the latter somewhat resembles HTA) or Runtime Revolution (RR), which all allow you to easily build cross-platform eye candy; these are not free, although not too expensive either if you can allocate some resources for your project. I usually hide all the command-line utilities behind GUIs built with RR.

All the tools listed above can easily do any kind of data manipulation and reshaping, but each has its strong sides: Python has tidy object-oriented syntax and tons of third-party modules; Perl has powerful regular expressions and tons of modules; RR has database connectivity, chunk expressions (item, char, word, line, etc.) and syntax that makes data manipulation much, much easier.

But I may be wrong, so let me ask another related question (new thread?..) for the group: what do you use to build graphical user interfaces for end-users of your tools in R?

All the best
Viktoras
Farrel Buchinsky wrote:
Is R an appropriate tool for data manipulation and data reshaping and data organizing? [...]
Sorry for replying to the wrong person; I lost the original email.
Farrel Buchinsky wrote:
Is R an appropriate tool for data manipulation and data reshaping and data organizing? I think so but someone who recently joined our group thinks not. The new recruit believes that python or another language is a far better tool for developing data manipulation scripts that can be then used by several members of our research group. Her assessment is that R is useful only when it comes to data analysis and working with statistical models.
I personally started to use R because I got tired of manually writing scripts for data manipulation and processing. The argument of your new recruit smells of ignorance and resistance to learning something new. Ask her _how_ she assessed R, how much time she spent on her assessment, and whether she actually tried to run it and perform some concrete simple tasks. (Yes, R is somewhat "different" and has a steep learning curve, but the effort of learning it is worth it. And yes, R can be used in the same way as any other scripting language, i.e., it is not restricted to interactive work.)

Take a look at the plyr and reshape packages (http://had.co.nz/); I have a hunch that they would have saved me a lot of headache had I found out about them earlier :) I would also recommend investing in Phil Spector's book "Data Manipulation with R"; it will get you started much faster.

I also find R's image files very convenient for sharing data (and code!) in a very compact format (a single file, portable across architectures). When you quit your R session, all the variables and functions get saved in the image file, which you can take with you (or send to somebody else); start R again, load the image into a new session, and continue from where you left off. You won't get this kind of automatic persistence in any scripting language out of the box.
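The workspace-image persistence described above can be sketched in a few lines; the objects and file name here are invented for illustration.

```r
# Save objects to a single portable file, then restore them,
# simulating quitting one R session and starting another.
results <- data.frame(snp = c("rs123", "rs456"), p = c(0.01, 0.2))
note <- "toy example, not real genotype data"

save(results, note, file = "session.RData")  # one portable file
rm(results, note)                            # simulate a fresh session
load("session.RData")                        # everything is back
print(results$snp)
```

Quitting R with `q("yes")` does the same thing automatically, writing the whole workspace to `.RData`.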
So what do you think: 1)R is a phenomenally powerful and flexible tool and since you are going to do analyses in R you might as well use it to read data in and merge it and reshape it to whatever you need. OR 2) Are you crazy? Nobody in their right mind uses R to pipe the data around their lab and assemble it for analysis.
I'd go with 1). R also has interfaces to databases through RODBC, so you do not have to go through several conversions when you're about to process or plot data in R.
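RODBC itself needs a live DSN to demonstrate, so here is the same query-from-R round trip sketched instead with an in-memory SQLite database; the table and values are invented, and the sketch assumes the DBI and RSQLite packages are installed.

```r
library(DBI)

# In-memory SQLite stands in for the Sybase/ODBC connection.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "clinical", data.frame(id = 1:3, age = c(40, 55, 62)))

# One round trip: SQL in, data frame out, ready for analysis.
older <- dbGetQuery(con, "SELECT id FROM clinical WHERE age >= 55")
dbDisconnect(con)
print(older$id)
```

With RODBC the shape is the same: `odbcConnect()` to a DSN, `sqlQuery()` to pull a data frame, no intermediate csv export needed.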
On Wednesday, 06 May 2009 at 00:22 -0400, Farrel Buchinsky wrote:
Is R an appropriate tool for data manipulation and data reshaping and data organizing?
[ Large Snip ! ... ]

Depends on what you have to do. I've done what can be more or less termed "data management" with almost uncountable tools (from Excel (sigh...) to R, with SQL, APL, Pascal, C, Basic (in 1982!), Fortran and even Lisp in passing...).

SQL has strong points: a join is, to my taste, more easily expressed in SQL than in most languages, and projection and aggregation are natural. However, in SQL there is no "natural" ordering of table rows, which makes expressing algorithms that use this order difficult. Try, for example, to express the differences of a time series... (it can be done, but it is *not* a pretty sight). On the other hand, R has some unique expressive possibilities (reshape() comes to mind). So I tend to use a combination of tools: except for very small samples, I tend to manage my data in SQL and with associated tools (think data editing, for example; a simple form in OpenOffice's Base is quite easy to create, can handle anything for which an ODBC driver exists, and won't crap out at more than a few hundred lines...). Finer manipulation is usually done in R with native tools and sqldf.

But, at least in my trade, the ability to handle Excel files is a must (this is considered a standard for data entry. Sigh...). So the first task is usually a) import the data into an SQL database, and b) prepare some routines to dump SQL tables / R data frames to Excel for returning back to the original data author...

HTH

Emmanuel Charpentier
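The time-series example above really is a one-liner in R, precisely because rows have a natural order; the numbers below are invented.

```r
# First differences of a series: painful in SQL (self-join on an
# explicit ordering column), a single call in R.
x <- c(10, 12, 15, 14, 20)
d <- diff(x)   # x[t] - x[t-1]
print(d)       # 2  3 -1  6
```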
On Wed, May 06, 2009 at 12:22:45AM -0400, Farrel Buchinsky wrote:
Is R an appropriate tool for data manipulation and data reshaping and data organizing? I think so but someone who recently joined our group thinks not. The new recruit believes that python or another language is a far better tool for developing data manipulation scripts that can be then used by several members of our research group.
I happily use both approaches, depending on the original format the data come in. For data that are not in a "well behaved" format and require actual parsing, I tend to use Python scripts for transmogrifying the data into nice and tidy tables (and maybe some very basic filtering); for everything after that I prefer R. I also use Python if the relevant data need to be harvested and assembled from many different sources (e.g. data files + web + databases). Once the data files are easy to read (csv, tab-separated, database, ...) and the task is to reshape, filter and clean the data, I usually do it in R. R has true advantages here:

- After reading a table into a data frame I can immediately tell whether all measurements are what they are supposed to be (integer, numeric, factor, boolean), and functions like read.table even do quite some error checking for me (equal number of columns etc.)
- Finding out whether factors have the right (or plausible) number of levels is easy
- Filtering by logical indexing
- Powerful and reliable reshaping (reshape package)
- Very convenient diagnostics: str(), dim(), table(), summary(), plotting the data in various ways, ...

cu
Philipp
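A toy sketch of the diagnostics listed above, on an invented data frame (purely illustrative, not data from the thread):

```r
# Invented measurements standing in for a freshly read table.
df <- data.frame(dose  = c(1L, 2L, 2L, 3L),
                 resp  = c(0.5, 0.7, 0.6, 0.9),
                 group = factor(c("a", "b", "a", "b")))

str(df)          # types at a glance: integer, numeric, factor
table(df$group)  # are the factor levels plausible?
summary(df)      # ranges and obvious outliers

# Filtering by logical indexing:
high <- df[df$resp > 0.6, ]
print(nrow(high))
```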
Dr. Philipp Pagel Lehrstuhl für Genomorientierte Bioinformatik Technische Universität München Wissenschaftszentrum Weihenstephan 85350 Freising, Germany http://mips.gsf.de/staff/pagel
I also use the approach Philipp describes below. I use Python and shell scripts for processing thousands of input files and getting all the data into one tidy csv table. From that point onwards it's R all the way (often with the reshape package). Paul
Philipp Pagel wrote:
[ Snip ... ]
I second what Zeljko wrote. In addition, see the data manipulation section in Chapter 4 of http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RS/sintro.pdf Frank
Zeljko Vrba wrote:
[ Snip ... ]
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University
Take a look at plyr and reshape packages (http://had.co.nz/), I have a hunch that they would have saved me a lot of headache had I found out about them earlier :)
As the author of these two packages, I'm admittedly biased, but I think R is unparalleled for data preparation, manipulation, and cleaning (with the small caveat that your data need to fit in memory). The R data frame is a fantastic abstraction that most other programming languages lack, and vectorised subscripting makes it possible to express many transformations in an elegant and efficient manner. On top of the facilities provided by base R, there is a huge number of packages available to load data from just about every data format, as well as a number of packages (plyr, reshape, sqldf, doBy, gdata, scope, ...) for data manipulation - just pick the metaphor that is most natural to you. Hadley
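A small illustration of the vectorised subscripting just described, on invented data:

```r
df <- data.frame(id = 1:5, value = c(3, 8, 2, 9, 5))

# A new column from a whole-column expression, no loop:
df$scaled <- df$value / max(df$value)

# Subset rows and columns in one vectorised operation:
top <- df[df$value > 4, c("id", "scaled")]
print(top$id)
```

Each line expresses a transformation over the whole data frame at once, which is the elegance being claimed for the data-frame abstraction.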
In my opinion, no statistician's toolbox should contain only one tool (even if it is as amazing a tool as R). Learning different tools helps you appreciate when each is most appropriate to use, and teaches you different ways of looking at problems. There are some tasks that I (it could easily differ for others) find quickest to do by extracting the data with Perl, then loading the results into R.

Having said the above, I do admit that the percentage of time I spend using tools other than R for working with data has gone down quite a bit over time. Three possible reasons:

1. My clients are getting better at giving me the data in appropriate forms.
2. My proficiency with R continues to grow, and I can better see how to do something using R.
3. R continues to grow, with more and more tools to help manage data.

And a possible 4th:

4. I am getting too lazy in my old age to switch to other programs.

While I like to think that I am having success at educating my clients, number 1 contributes only very little to the overall picture; 3 is definitely a big contributor, and hopefully 2 is part of the reason as well.
Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.snow at imail.org 801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Farrel Buchinsky
> Sent: Tuesday, May 05, 2009 10:23 PM
> To: R
> Cc: Ross; gregory_warnes at urmc.rochester.edu; greg at warnes.net
> Subject: [R] Do you use R for data manipulation?
>
> Is R an appropriate tool for data manipulation and data reshaping and data organizing? [...]
Another tool I find useful is Matthew Dowle's data.table package. It has very fast indexing, can have much lower memory requirements than a data frame, and has some built-in data manipulation capability. Especially with a 64-bit OS, you can use this to keep things in memory where you otherwise would have to use a database. See here: http://article.gmane.org/gmane.comp.lang.r.packages/282 - Tom
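A minimal sketch of data.table's keyed indexing; it assumes the data.table package is installed, and the syntax shown follows the current package, which has evolved considerably since this thread.

```r
library(data.table)

dt <- data.table(id = c("a", "b", "c"), value = c(10, 20, 30))
setkey(dt, id)     # sort once; later lookups use binary search

row_b <- dt["b"]   # fast keyed lookup instead of a full vector scan
print(row_b$value)
```

On small tables the difference is invisible, but on millions of rows keyed lookups are where the speed claims come from.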
1 day later
+1. I worked with Matthew for a while and saw in practice just how powerful that package is. I'm surprised it isn't more widely used. Martin
Tom Short wrote:
[ Snip ... ]
3 days later
2009/5/6 Emmanuel Charpentier <charpent at bacbuc.dyndns.org>:
On Wednesday, 06 May 2009 at 00:22 -0400, Farrel Buchinsky wrote:
[ Snip ... ]
But, at least in my trade, the ability to handle Excel files is a must (this is considered a standard for data entry. Sigh...). So the first task is usually a) import the data into an SQL database, and b) prepare some routines to dump SQL tables / R data frames to Excel for returning back to the original data author...
I don't think Excel is a standard tool for data entry; EpiData Entry is much more professional.
HTH

Emmanuel Charpentier
HUANG Ronggui, Wincent PhD Candidate Dept of Public and Social Administration City University of Hong Kong Home page: http://asrr.r-forge.r-project.org/rghuang.html
I am not a statistician and not a computer scientist by education. I consider myself an R novice, and came to R - thanks to my boss - from an SPSS background. I work for a market research company, and the most typical data files we deal with are not huge: up to several thousand rows and up to a thousand variables. I would say that, on certain projects, most of what we do in R (if you look at the number of lines of R we devote to a given task) is data manipulation. The actual statistical method is frequently just a line; all the rest is getting the data shaped right and then spitting out the results of the analysis in a way that is usable (i.e., presentable).

I find R to be excellent for the data manipulations we perform. First of all, it's great that you can always grab the variables/cases you need and ignore all the rest. In SPSS you just keep staring at all those data and variables that you don't need, trying to find the one you do. Second, I find R to be incredibly fast (as opposed to SPSS or Excel) with the amounts of data we deal with. And third, nothing is "written in stone" and your original data are always untouched: you can always read them in again and again. For example, if I create a new variable and make a mistake, I can always fix the code and rerun that piece of it, which gives me the corrected object that contains the new variable. I never touch the original data and hence never "spoil" it.

Dimitri
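The fix-and-rerun workflow described above can be sketched like this, with invented data: derived variables live in R objects, and the source data are never modified in place.

```r
# Source data, as read in; never modified.
raw <- data.frame(price = c(100, 250, 80))

work <- raw                       # derive on a copy
work$log_price <- log(work$price) # wrong? fix this line and rerun it

# 'raw' is provably untouched:
stopifnot(identical(raw, data.frame(price = c(100, 250, 80))))
print(round(work$log_price, 2))
```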
On Mon, May 11, 2009 at 11:20 AM, ronggui <ronggui.huang at gmail.com> wrote:
[ Snip ... ]
Dimitri Liakhovitski MarketTools, Inc. Dimitri.Liakhovitski at markettools.com
On Monday, 11 May 2009 at 23:20 +0800, ronggui wrote: [ Snip... ]
But, at least in my trade, the ability to handle Excel files is a must (this is considered as a standard for data entry. Sigh ...).
[ Re-snip... ]
I don't think Excel is a standard tool for data entry. Epidata entry is much more professional.
Irony squared? This *must* go in the fortunes file! Emmanuel Charpentier
1 day later
Farrel Buchinsky wrote:
Is R an appropriate tool for data manipulation and data reshaping and data organizing? I think so but someone who recently joined our group thinks not. The new recruit believes that python or another language is a far better tool for developing data manipulation scripts that can be then used by several members of our research group. Her assessment is that R is useful only when it comes to data analysis and working with statistical models.
It's hard to shift people's individual preferences, but impressive
objective comparisons are easy to come by. Ask her how many lines it
would take to do this trivial R task in Python:
data <- read.csv('original-data.csv')
write.csv(data * 10, 'scaled-data.csv')
R's ability to do something to an entire data structure -- or a slice of
it, or some other subset -- in a single operation is very useful when
cleaning up data for presentation and analysis. Also point out how easy
it is to get data *out* of R, as above, not just into it, so you can
then hack on it in Python, if that's the better language for further
manipulation.
If she gives you static about how a few more lines are no big deal,
remind her that it's well established that bug count is always a simple
function of line count. This fact has been known since the 70's.
While making your points, remember that she has a good one, too: R is
not the only good language out there. You should learn Python while
she's learning R.
Warren Young wrote:
Farrel Buchinsky wrote:
Is R an appropriate tool for data manipulation and data reshaping and data organizing? I think so but someone who recently joined our group thinks not. The new recruit believes that python or another language is a far better tool for developing data manipulation scripts that can be then used by several members of our research group. Her assessment is that R is useful only when it comes to data analysis and working with statistical models.
It's hard to shift people's individual preferences, but impressive
objective comparisons are easy to come by. Ask her how many lines it
would take to do this trivial R task in Python:
data <- read.csv('original-data.csv')
write.csv(data * 10, 'scaled-data.csv')
You might want to learn that this is a question of appropriate libraries. In R, read.csv and write.csv reside in the package utils. In Python, you'd use numpy:
from numpy import loadtxt, savetxt
savetxt('scaled.csv', loadtxt('original.csv', delimiter=',') * 10, delimiter=',')
That makes two lines, including the library import.
R's ability to do something to an entire data structure -- or a slice of it, or some other subset -- in a single operation is very useful when cleaning up data for presentation and analysis.
But this is really *hardly* R-specific; you can do that in many, many languages, be assured. Just look around.
Also point out how easy it is to get data *out* of R, as above, not just into it, so you can then hack on it in Python, if that's the better language for further manipulation. If she gives you static about how a few more lines are no big deal, remind her that it's well established that bug count is always a simple function of line count. This fact has been known since the 70's.
That's a slogan, especially when you think of how compact (but unreadable, and thus error-prone) code written in Perl can be. Often, more lines of code make a program easier to maintain, and thus help avoid bugs.
While making your points, remember that she has a good one, too: R is not the only good language out there. You should learn Python while she's learning R.
+1