Scripting in R -- pattern matching, logic, system calls, the works!
Don,
Excellent advice. I've gone back and done a bit of coding and wanted to see
what you think and possibly "shore up" some of the technical stuff I am
still having a bit of difficulty with.
I'll paste the code I have so far, with the important annotations:
topdir="~"
library(gmodels)
setwd(topdir)
### Will probably want to do two for loops as opposed to recursive
files=list.files(path=topdir,pattern="Coverage")
for (i in files)
{
dir=paste("~/hangers/",i,sep="")
files2=list.files(path=dir,pattern="Length")
### Make an empty matrix that will have the independent variable as
### the filenum and the dependent variable as the mean of the length,
### or should I have two vectors for the regression? Basically the
### Length_(\d+) is the independent variable (taken from the filename),
### which all the regressions will have, and inside each Length_(\d+)
### file is a 1-d set of numbers whose mean becomes the dependent
### variable. So in essence the points are:
###   f(length) = mean(length$V1)
###   f(45) = 50
###   f(50) = 60
###   etc ...
for (j in files2)
{
## I just rearranged the following line, but I'm not sure what the
## command is doing. I am assuming 'as.numeric' means take the input
## as a number instead of a string, and the gsub has me stumped.
filenum=as.numeric(gsub('Length_','',j))
## Can I assign variables at the top instead of hardcoding? Like
## upper=50, lower=30?
## And I don't need to put brackets for this if statement, do I?
## Does it basically just say that if the filenum is outside those
## parameters, just go to the next j in files2?
if (filenum > 200 | filenum < -10) next
dir2=paste("~/hangers",i,j,sep="/")
tmp=read.table(dir2)
mean(tmp$V1)
## Now should I put these in a matrix or a vector (all j values:
## length vs mean(tmp$V1) for each i iteration)?
}
}
Lastly, I'd like to get a printout of each of the regressions (one per
iteration of i). Is that when I use the summary command? And, as in
Unix, can I redirect the output to a file?
Best
Don MacQueen wrote:
I can't go through all the details, but hopefully this will help get
you started.
If you look at the help page for the list.files() function, you will see
this:
list.files(path = ".", pattern = NULL, all.files = FALSE,
full.names = FALSE, recursive = FALSE,
ignore.case = FALSE)
The "." in path means to start at your current working directory.
Assuming your 5 Coverage directories are subdirectories of your
current working directory, that's what you want.
Then, setting recursive to TRUE will cause it to also list the
contents of all subdirectories. Since your Length files are in the
Coverage subdirectories, that's what you want.
Finally, the pattern argument returns only files that match the
pattern, so something like
pattern="Length"
should get you just the files you want.
The result is a character vector containing the names of all your
Length files. Try it and see.
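To make the list.files() behaviour concrete, here is a small sketch that builds a throwaway directory tree and lists it. The directory and file names (lf_demo, Coverage_10, notes.txt and so on) are made up for illustration and only mimic the layout described in this thread.

```r
# Build a throwaway directory tree (hypothetical layout) under tempdir()
top <- file.path(tempdir(), "lf_demo")
unlink(top, recursive = TRUE)
for (cov in c("Coverage_10", "Coverage_20")) {
  dir.create(file.path(top, cov), recursive = TRUE)
  for (len in c(20, 25)) {
    writeLines(as.character(1:5), file.path(top, cov, paste0("Length_", len)))
  }
}
# An extra file that should NOT be picked up by the pattern
writeLines("x", file.path(top, "Coverage_10", "notes.txt"))

# recursive = TRUE descends into the Coverage_* subdirectories;
# pattern keeps only file names that contain "Length"
files <- list.files(top, pattern = "Length", recursive = TRUE)
print(files)
```

The result is a character vector of paths relative to the starting directory. Adding full.names = TRUE would prepend that directory, which is usually what read.table() needs.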
Then, a simple loop over the vector of filenames, with an
appropriate scan() or read.table() command for each, will read the
data in.
If you need to restrict the files, say Length_20, Length_25,
Length_30, etc. then you'll have to do some more work.
Look at
as.numeric(gsub( 'Length_', '', filename))
to get just the number part of the filename, as a number, and then
you can use numeric inequalities to identify whether or not any
particular file is to be processed.
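As a rough illustration of what that line does: gsub() replaces every match of a pattern with the replacement text (here, nothing), and as.numeric() turns the remaining string into a number. The file names and the lower/upper limits below are hypothetical.

```r
# Hypothetical file names, as list.files(recursive = TRUE) might return them
files <- c("Coverage_10/Length_5", "Coverage_10/Length_45", "Coverage_10/Length_100")

# basename() drops the directory part; gsub() replaces 'Length_' with ''
# (nothing); as.numeric() converts the leftover string "45" to the number 45
filenum <- as.numeric(gsub("Length_", "", basename(files)))

# Limits set once at the top instead of being hardcoded in the comparison
lower <- 20
upper <- 50
keep <- filenum >= lower & filenum <= upper
print(files[keep])
```

Because gsub() and the comparisons are vectorized, the whole vector of names can be filtered at once; inside a loop, the same test on a single filenum works with next.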
Since you haven't shown what the contents of your files look like
(two columns of numbers or what), I have no idea what to suggest for
the part having to do with reading them in, plotting or doing linear
regression.
The basic function for linear regression is lm().
Here is a summary:
files <- list.files( '~' , pattern='Length', recursive=TRUE, full.names=TRUE)
for (fl in files) {
## optional, to restrict to only certain files
## (basename() drops the directory part of the path before gsub())
filenum <- as.numeric(gsub( 'Length_', '', basename(fl)))
## skip to next file if it isn't in the correct number range
if (filenum > 50 | filenum < 20) next
## a command to read the current file. perhaps:
## tmp <- read.table(fl)
## commands to do statistics on the data in the current file. perhaps:
## fit <- lm( y ~ x, data=tmp)
## some output (print(fit) needs the lm() line above uncommented)
cat('------ file =',fl,'-----\n')
print(fit)
}
This example doesn't restrict only to certain Coverage subdirectories.
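For what it's worth, here is one way the whole task might be sketched: one regression per Coverage directory, with the mean of each Length file as y and the number in the file name as x. Everything about the data is an assumption made for the sake of a runnable example: the directory name hangers_demo, the simulated file contents, the lower/upper limits, and the output file regressions.txt.

```r
# Simulated test data (hypothetical): 2 Coverage directories, each holding
# files Length_0, Length_5, ..., Length_100 with 30 numbers in one column
top <- file.path(tempdir(), "hangers_demo")
unlink(top, recursive = TRUE)
set.seed(1)
for (cov in c("Coverage_10", "Coverage_20")) {
  dir.create(file.path(top, cov), recursive = TRUE)
  for (len in seq(0, 100, by = 5)) {
    write.table(rnorm(30, mean = len + 5),
                file.path(top, cov, paste0("Length_", len)),
                row.names = FALSE, col.names = FALSE)
  }
}

lower <- 20   # only use Length files in this range
upper <- 50
fits <- list()
for (cov in list.files(top, pattern = "^Coverage_")) {
  covdir <- file.path(top, cov)
  files <- list.files(covdir, pattern = "^Length_")
  len <- as.numeric(gsub("Length_", "", files))
  inrange <- len >= lower & len <= upper
  files <- files[inrange]
  len <- len[inrange]
  # One mean per file; read.table() names a single unnamed column V1
  avg <- sapply(file.path(covdir, files), function(f) mean(read.table(f)$V1))
  # Two plain vectors in a data frame are enough for lm(); no matrix needed
  fits[[cov]] <- lm(avg ~ len, data = data.frame(len = len, avg = avg))
}

# capture.output() collects what printing would show, so the regression
# summaries can be written to a file instead of the console
outfile <- file.path(top, "regressions.txt")
for (cov in names(fits)) {
  cat("------", cov, "------\n", file = outfile, append = TRUE)
  cat(capture.output(summary(fits[[cov]])), sep = "\n",
      file = outfile, append = TRUE)
}
```

sink("somefile.txt") followed later by sink() is the other common route and is closer in spirit to shell redirection; capture.output() is used here only because it keeps the redirection local to one call.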
-Don
At 9:29 AM -0700 9/15/08, bioinformatics_guy wrote:
I'm very new to R so this might be a very simple question. First I'll lay
out the hierarchy of my directories and my goals. I have, say, 5 directories
of the form "Coverage_(some number)", and in each one of these I have text
files of the form "Length_(some number)", each comprising, say, 30 numbers.
Each one of these Length files (which are basically incremented by 5 from
0 to 100, Length_(0,5,10,15,20)) is to be averaged, where the average is the
y-value and the length is the x-value in a linear regression.
What I want to do is write a script that looks in each of the Coverage
directories, reads in each of the files, takes the means, and plots them
in the form I specified above. The catch is, what if I only want to plot,
say, Length_(20-50), and what command/method is best for a linear
regression?
I've looked at m1(), but have not gotten it to work properly. Below is some
of the code I've put together:
topdir="~"
setwd(topdir)
### Took this function from a friend so I'm not sure what it's doing besides
### grep-ing a directory?
ll<-function(string)
{
grep(string,dir(),value=T)
}
### I believe this is looking for all files of form below
subdir = ll("Coverage_[1-9][0-9]$")
### A for loop iterating through each of the sub directories.
for (i in subdir)
{
# not sure what this line is doing as I found it on the internet
# in a similar function
setwd(paste(getwd(),i,sep="/"))
#This makes a vector of all the file names
filelist=ll("Length_")
Can I use a regex or logic to only take the filelist variables I want?
And can I now get the mean of each Length_* and set in a matrix (length x
mean)?
Then finally, how to do a linear regression of this.
--
View this message in context: http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19496451.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
View this message in context: http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19512508.html Sent from the R help mailing list archive at Nabble.com.