Extracting arithmetic mean for specific values from multiple .txt-files

16 messages · vimmster, jim holtman, Rui Barradas

#
Hello,

I'm coming straight to the point:

I have 65 .txt-Files named "XYZ_1.txt" to "XYZ_65.txt" (each number
represents a test subject).

I have to open them in Microsoft Excel to see the exact structure.

In each of those .txt-files there are reaction time values (in milliseconds)
from line 15, column H to line 166, column H for each test subject (and some
other data in the other columns, of course).

My problem is that I only need the arithmetic mean of these
reaction times per test subject.

--> Again: I have 65 test subjects and according to Excel 152 reaction times 
for each test subject / in each .txt-file.

Is there an easy way to only extract the arithmetic mean for each test
subject in an Excel file column?

Thanks for your answers!

--
View this message in context: http://r.789695.n4.nabble.com/Extracting-arithmetic-mean-for-specific-values-from-multiple-txt-files-tp4635809.html
Sent from the R help mailing list archive at Nabble.com.
#
Since you did not provide an example of the file, I will take a guess
at the content and show how to extract the values and take the mean of
all of them, since you did not say whether you want the mean of each file
or a single mean.

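# Read rows 15-166 of the reaction-time column from each file and
# concatenate them into one vector ('colH' is a guessed column name).
myData <- do.call(c, lapply(1:65, function(.file){
	x <- read.csv(paste0("XYZ_", .file, ".txt"))
	x[15:166, 'colH']
}))

mean(myData)  # a single mean over all files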
On Sun, Jul 8, 2012 at 7:01 PM, vimmster <superdodge at gmx.net> wrote:

  
    
#
Dear Mr. Holtman,

thank you for your reply.

I think I did say which mean I needed: "all of these reaction times per test
subject", which means that I need a file with the mean of the reaction times
of each file, i.e. of each test subject (because file XYZ_34.txt contains
subject 34's data).

There are 65 x 152 reaction times and I need 65 x mean(152 reaction times
per test subject file) = 65 mean reaction times.

I have now provided an example for a test subject 34:

http://r.789695.n4.nabble.com/file/n4635834/XYZ_34.txt XYZ_34.txt 

Kind regards

#
Hello,

Your data example has dots in the column of interest. If those values 
are integers, this might do it.


fun <- function(x){
	dat <- read.table(x, skip=14)
	H <- as.numeric(gsub("\\.", "", dat[, 8]))
	mean(H)
}

sapply(list.files(pattern="XYZ.*\\.txt"), fun)

Now do what you want with the result, for instance, write.table().
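
As a minimal sketch of that last step, the per-file means could be written to a file that Excel opens directly. The numbers below are made up and only stand in for the result of the sapply() call; the output file name and the regular expression for the subject number are assumptions, not part of the original code.

```r
# Hypothetical per-file means, standing in for sapply(list.files(...), fun)
means <- c(XYZ_1.txt = 351.2, XYZ_2.txt = 345.21, XYZ_10.txt = 402.7)

# Recover the subject number from each file name
subject <- as.integer(sub("^XYZ_([0-9]+)\\.txt$", "\\1", names(means)))

# One row per test subject, ordered by subject number
out <- data.frame(subject = subject, mean_rt = unname(means))
out <- out[order(out$subject), ]

# Comma-separated output; written to a temporary directory for this sketch
out_file <- file.path(tempdir(), "mean_rt.csv")
write.csv(out, out_file, row.names = FALSE)
```

If Excel is set to a locale that uses a decimal comma, write.csv2() (semicolon separator, comma as decimal mark) may be the more convenient variant.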

Hope this helps.

Rui Barradas

On 09-07-2012 12:20, vimmster wrote:
#
Dear Mr. Barradas,

your solution comes very close to what I want.

But I have two questions left:


First question: If "R" computes the mean for the reaction times of test
subject 34 (the example I provided above), it says "310112.0", but if I use
the "mean"-function in Excel it says "345.210". Apart from the dots in the
column of interest (which you mentioned before), the mean is obviously not
the same. Do you have any idea why?

Second question: Why are the dots in the column of interest problematic?

Kind regards

#
I think the real problem is the first data line:

2	1	1	3	27	0	6	1.200.995

Notice the two periods in the value.  The previous solution was
getting rid of all the periods.  If you leave out this value, you get
339.5.  If you change it to 1200.995, you get 345.21, so your data is
incorrect.
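
The two means quoted above are consistent with each other, which can be checked with a little arithmetic. This sketch assumes 151 observations in the file (the count that str() reports elsewhere in this thread); the variable names are illustrative only.

```r
n          <- 151       # observations in the example file (assumed from str())
mean_fixed <- 345.21    # mean when the first value is read as 1200.995
outlier    <- 1200.995  # the corrected first value

# Sum of all values, then the mean of the remaining n - 1 values
total        <- mean_fixed * n
mean_without <- (total - outlier) / (n - 1)

round(mean_without, 1)  # 339.5
```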
On Mon, Jul 9, 2012 at 9:54 AM, vimmster <superdodge at gmx.net> wrote:

  
    
#
Hello,

There must be a difference between the file you are processing and the one 
that Excel and I are processing:


 > fun <- function(x){
+ dat <- read.table(x, skip=14)
+ dat[ , 8] <- as.numeric(gsub("\\.", "", dat[, 8]))
+ mean(dat[, 8])
+ }
 >
 > sapply(list.files(pattern="XYZ.*\\.txt"), fun)
XYZ_34.txt
   345210.4

This result is even better, more accurate than Excel's.

As for the second question: because of the dots, those values are read 
by R as character and, when put into the data.frame, converted to factors, 
the name R gives to categorical variables. You can see this with the 
following instruction, right after the read.table:

print(str(dat))
 > str(dat)
'data.frame':   151 obs. of  8 variables:
  $ V1: int  2 2 2 2 2 2 2 2 2 2 ...
  $ V2: int  1 2 3 4 5 6 7 8 9 10 ...
  $ V3: int  1 2 3 4 5 6 7 8 9 10 ...
  $ V4: int  3 2 4 3 3 1 3 1 3 2 ...
  $ V5: int  27 16 16 27 27 27 27 27 27 16 ...
  $ V6: int  0 16 16 16 27 27 27 27 16 16 ...
  $ V7: int  6 1 1 2 1 1 1 1 2 1 ...
  $ V8: Factor w/ 151 levels "1.200.995","247.102",..: 1 139 135 39 133 73 142 63 77 67 ...

V8 is the column we want. The values actually stored are the integer 
codes 1, 139, 135, 39, etc. The levels are the category labels, and the 
codes are 1-based indices into those labels.
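
The distinction between codes and labels can be seen on a tiny factor. This is only an illustration with three made-up strings in the style of the file, not the real column. (In R 4.0.0 and later, read.table() no longer converts strings to factors by default, so the column would arrive as character instead.)

```r
# Three made-up values in the style of the file's column 8
v8 <- factor(c("310.500", "1.200.995", "247.102"))

levels(v8)        # the labels, sorted: "1.200.995" "247.102" "310.500"
as.integer(v8)    # the stored codes:   3 1 2
as.character(v8)  # the original strings, in their original order
```

Converting a factor with as.numeric() returns the codes, not the numbers the labels look like, which is why the strings have to be recovered with as.character() first.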

Anyway, what's important is that the code is working, and if there's an 
error maybe it can be solved with this modification:


fun <- function(x, skip = 14){
	dat <- read.table(x, skip=skip)

And the rest is the same. Inspect the file and see if the data starts at 
line 15.

(And please, Rui is enough, NO 'Mr.')

Hope this helps,

Rui Barradas

On 09-07-2012 14:54, vimmster wrote:
#
Dear Mr. Holtman,

but I cannot leave out the value and cannot change the values to 1200.995
manually (for each test subject with a reaction time > 1000 ms), because the
former would lead to incomplete data and the latter would be too
time-consuming.

Dear Rui,

here I have three files, which have exactly the same content as
"XYZ_34.txt", EXCEPT that the file "XYZ_50.txt" doesn't have a period in the
first value 1200.9952 IF YOU OPEN IT WITH THE EDITOR (!), maybe because I
didn't change the structure with MS Excel. The other two files should be
identical.

http://r.789695.n4.nabble.com/file/n4635962/XYZ_2.txt XYZ_2.txt 
http://r.789695.n4.nabble.com/file/n4635962/XYZ_50.txt XYZ_50.txt 
http://r.789695.n4.nabble.com/file/n4635962/XYZ_1112.txt XYZ_1112.txt 

R gives me the following output:
+     dat <- read.table(x, skip=14)
+     dat[ , 8] <- as.numeric(gsub("\\.", "", dat[, 8]))
+     mean(dat[, 8])
+ }
XYZ_1112.txt    XYZ_2.txt   XYZ_50.txt 
    345210.4     345210.4     310112.0

Your second suggestion leads to the same output:
+     dat <- read.table(x, skip=skip) 
+     dat[ , 8] <- as.numeric(gsub("\\.", "", dat[, 8]))
+     mean(dat[, 8])
+ }
XYZ_1112.txt    XYZ_2.txt   XYZ_50.txt 
    345210.4     345210.4     310112.0

Thank you for your replies!

Kind regards

#
Hello,

OK, I think that there were two problems.
One: gsub substitutes all (g = global) occurrences of the search 
pattern, so both periods were removed.
The other: the code always treated column 8 as character, but when 
there are no values with two periods the column is read in as numeric.
Both are now corrected.



fun <- function(x, skip = 14){
     dat <- read.table(x, skip=skip, stringsAsFactors = FALSE)
     if(is.character(dat[, 8])){
         len <- sapply(strsplit(dat[, 8], "\\."), length)
         dat[len == 3 , 8] <- sub("\\.", "", dat[len == 3 , 8])
         dat[, 8] <- as.numeric(dat[, 8])
     }
     mean(dat[, 8])
}

sapply(list.files(pattern="XYZ.*\\.txt"), fun)
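
The difference between sub and gsub, and the "exactly two periods" test used in the function above, can be tried on a few strings. The three values here are made up for illustration; only "1.200.995" appears in the thread.

```r
x <- c("1.200.995", "247.102", "310.5")

gsub("\\.", "", x)  # removes every period:    "1200995"  "247102"  "3105"
sub("\\.", "", x)   # removes only the first:  "1200.995" "247102"  "3105"

# Strip the first period only where a value has two of them
# (splitting on "." gives three pieces in that case)
parts <- lengths(strsplit(x, ".", fixed = TRUE))
x[parts == 3] <- sub(".", "", x[parts == 3], fixed = TRUE)

rt <- as.numeric(x)
rt  # 1200.995 247.102 310.500
```

Applying sub() to every element would damage the values that have a single period, which is why the replacement is restricted to the elements with three pieces.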


Rui Barradas

On 10-07-2012 09:35, vimmster wrote:
#
Dear Rui,

thank you very much.

Your solution works perfectly.

One last question:

I need to write a function that returns ONE value (here: a ratio) FOR EACH
test subject: the number of correct reactions divided by trial or
trialCount, respectively.

"/" means "divided by" in the following.

I need the ratio correct (reactions)/trial or correct
(reactions)/trialCount, respectively (because trial and trialCount are the
same WITHIN test SUBJECTS, BUT they differ in length BETWEEN test
SUBJECTS!).

It would be very helpful if I had a data frame in R in the end, with one
column for "trialCount"/"trial", one column for "correct reactions" (= 1)
AND (more importantly) one column for "correct (= 1) answers / trialCount".

legend (just as additional information) for the variable "correct":
1 = correct reaction
2 = false reaction
3 = reaction too slow
4 = reaction too fast
5 = more than one button pressed
6 = no reaction within RT window

I would be very thankful for an answer!

Sorry for the questions, but I am doing this for the first time!

Kind regards

#
Hello,

I'm glad it helped.

As for this second question, you should explain yourself better.

1. What is a test subject, which column records its id? vpNum?
2. You say "divided per trials or trialCount". Does this mean per trial 
number (example: divide by 1, by 2, by 3, etc., up to 149) or per number of 
trials (149 in the previous example)?
3. 'correct' now seems to be categorical. Divide WHAT by trial or 
trialCount?

Hint: post a small data example with three or four subjects and the 
wanted output.

Rui Barradas

On 10-07-2012 18:06, vimmster wrote:
#
Dear Rui,

1) By test subject I mean each file. (I have posted three similar files
above: 2, 50 and 1112 are the same file, which I renamed for the problem
described and solved above. In the real data each test subject has exactly
one file, and these files differ, of course.) Within a file the vpNum is
always the same (for example, for test subject 44 it is always vpNum = 44).
The examples above (2, 50 and 1112) are in fact all the second test
subject's file (vpNum always "2").

2) With "trials" or "trialCount" I mean the number of trials (149 in your
example, and 151 in the examples 2, 50 and 1112). But the number of trials
differs between subjects, because in the examples below (vpNum = 3, 43 and
63) it is 152 (for vpNum = 3), 150 (for vpNum = 43) and 157 (for vpNum =
63).

http://r.789695.n4.nabble.com/file/n4636106/XYZ_3.txt XYZ_3.txt 
http://r.789695.n4.nabble.com/file/n4636106/XYZ_43.txt XYZ_43.txt 
http://r.789695.n4.nabble.com/file/n4636106/XYZ_63.txt XYZ_63.txt 

3) I mean the number of correct answers given per test subject (for example
for test subject 3 in the previous example (5 rows above) we have 152 trials
and 4 trials that are not correct, which means 148 correct trials (with the
categorical value "1")). So in R this would be the ratio of:
[1] 0.9736842
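
For reference, a minimal sketch of how that ratio comes out of a "correct" column with 152 trials. The four non-1 codes are made up; only the counts (148 correct out of 152) come from the description above.

```r
# 148 correct trials (code 1) plus four made-up error codes
correct <- c(rep(1, 148), 2, 3, 6, 6)

n_trials  <- length(correct)      # 152
n_correct <- sum(correct == 1)    # 148
ratio     <- n_correct / n_trials

ratio  # prints 0.9736842
```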

The wanted output should (if possible) look like this (here only for
vpNum = 3):

vpNum   trial OR trialCount   correct (reactions)   ratio (number correct / trialCount)
3       152                   148                   0.9736842

The final output should look like this:

vpNum   trial OR trialCount   correct (reactions)   ratio (number correct / trialCount)
1       n                     n                     x
2       n                     n                     x
3       152                   148                   0.9736842

and so on until vpNum = 65 (In my question before I forgot to ask for the
column "vpNum", sorry about that!).

I hope this makes it more clear!

Thanks for your time and help!

Kind regards 

#
Hello,

Try


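# Build one summary row (vpNum, number of trials, number correct, ratio)
# for a single subject file.
make.row <- function(x, skip = 14, column){
     # skip - 1 so that the line holding the column names is read as header
     dat <- read.table(x, skip = skip - 1, header = TRUE,
                       stringsAsFactors = FALSE)
     vpNum <- dat$vpNum[1]
     trial <- length(dat[[ column ]])      # number of trials in this file
     correct <- sum(dat$correct == 1)      # code 1 = correct reaction
     result <- c(vpNum, trial, correct, correct/trial)
     names(result) <- c("vpNum", column, "correct", "ratio")
     result
}


# Escape the dot and anchor the end so only names ending in ".txt" match
files <- list.files(pattern = "^XYZ_.*\\.txt$")
ratios <- t(sapply(files, make.row, column = "trial"))
ratios <- data.frame(ratios, row.names = seq_len(nrow(ratios)))
ratios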


I think it's what you want.
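
To try the idea without the real data, here is a self-contained sketch on two toy files. The file layout (13 placeholder lines, then a tab-separated header row with vpNum, trial and correct) is an assumption, as are the helper names write_toy and summarise_subject. It is a variation that binds one-row data frames together rather than the t(sapply(...)) construction above.

```r
# Write a toy subject file: 13 placeholder lines, a header row, then the data.
# This layout is assumed; the real files are not reproduced in the thread.
write_toy <- function(path, vp, correct) {
  rows <- paste(vp, seq_along(correct), correct, sep = "\t")
  writeLines(c(rep("meta", 13), "vpNum\ttrial\tcorrect", rows), path)
}

# One-row data frame per file: subject id, trial count, correct count, ratio
summarise_subject <- function(path, skip = 13) {
  dat <- read.table(path, skip = skip, header = TRUE, sep = "\t")
  data.frame(vpNum   = dat$vpNum[1],
             trial   = nrow(dat),
             correct = sum(dat$correct == 1),
             ratio   = mean(dat$correct == 1))
}

toy_dir <- tempfile("toy_subjects")
dir.create(toy_dir)
write_toy(file.path(toy_dir, "XYZ_3.txt"),  3,  c(rep(1, 148), 2, 3, 6, 6))
write_toy(file.path(toy_dir, "XYZ_43.txt"), 43, c(rep(1, 147), 2, 2, 3))

files  <- list.files(toy_dir, pattern = "^XYZ_.*\\.txt$", full.names = TRUE)
ratios <- do.call(rbind, lapply(files, summarise_subject))
ratios
```

Returning a data frame per file keeps each column's type intact, whereas a named numeric vector forces everything to numeric; for these four numeric columns either approach gives the same table.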

Rui Barradas

On 11-07-2012 06:18, vimmster wrote:
#
Dear Mr. Holtman and especially dear Rui,

thank you VERY much.

You helped me a lot!

I've just added the following:

rsort <- ratios[order(ratios$vpNum),]

Now the test subjects are arranged according to their vpNum.
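
A small sketch of why the sort is needed: list.files() returns names in alphabetical order, so "XYZ_10.txt" comes before "XYZ_2.txt". The data frame below is made up and only mimics the shape of ratios.

```r
# Made-up rows in the order the files would be listed alphabetically
ratios <- data.frame(vpNum = c(1, 10, 2),
                     ratio = c(0.95, 0.90, 0.97))

# Reorder the rows by subject number
rsort <- ratios[order(ratios$vpNum), ]
rownames(rsort) <- NULL  # optional: renumber the rows after sorting

rsort
```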

Thanks a lot again! 
