Why does matrix selection behave differently when using which?
5 messages · Asis Hallab, David Winsemius, Berend Hasselman
On 17-12-2012, at 20:22, Asis Hallab wrote:
Dear R community,
I have a medium sized matrix stored in variable "t" and a simple function "countRows" (see below) to count the number of rows in which a selected column "C" matches a given value. If I count all rows matching all pairwise distinct values in the column "C" and sum these counts up, I get the number of rows of "t". If I delete the "which" calls from function "countRows" the resulting sum of matching row numbers is much greater than the number of rows in "t".
The table "t" I use can be downloaded from here: https://github.com/groupschoof/PhyloFun/archive/test_selector.zip
Unzip the file and read in the table "t" using
t <- read.table("test.tbl")
The above function "sumRows" is defined as follows:
sumRows <- function( tbl, ps ) {
  sum(
    sapply(ps, function(x) {
      t <- if ( is.na(x) ) {
        tbl[ which( is.na(tbl[ , "Domain.Architecture.Distance" ]) ), , drop=F]
      } else {
        tbl[ which( tbl[ , "Domain.Architecture.Distance" ] == x ), , drop=F]
      }
      nrow(t)
    } )
  )
}
And how are we supposed to call sumRows()?
sumRows(???, ???)
Berend
On Dec 17, 2012, at 11:22 AM, Asis Hallab wrote:
Dear R community, I have a medium sized matrix stored in variable "t" and a simple function "countRows" (see below) to count the number of rows in which a selected column "C" matches a given value. If I count all rows matching all pairwise distinct values in the column "C" and sum these counts up, I get the number of rows of "t". If I delete the "which" calls from function "countRows" the resulting sum of matching row numbers is much greater than the number of rows in "t". The table "t" I use can be downloaded from here: https://github.com/groupschoof/PhyloFun/archive/test_selector.zip
What part of "minimal" example are you having difficulty understanding? That zip file expands to a 1.8 MB file!
Unzip the file and read in the table "t" using t <- read.table("test.tbl")
Since it has a header line, you will be creating all factors and it's doubtful you are getting what you want.
Instead:
t <- read.table("test.tbl", header=TRUE)
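The effect of the header argument can be checked on a tiny file. This is only a sketch: the file contents below are made up for illustration and are not the poster's test.tbl, and the column type you get without header=TRUE (character or factor) depends on the R version.

```r
# Hypothetical stand-in for test.tbl, written to a temporary file.
tmp <- tempfile(fileext = ".tbl")
writeLines(c("Id Domain.Architecture.Distance",
             "a 0.25",
             "b NA",
             "c 0.99"), tmp)

# Here the header line has as many fields as the data lines,
# so read.table() does not detect it as a header on its own.
no_header   <- read.table(tmp)
with_header <- read.table(tmp, header = TRUE)

nrow(no_header)    # the header line is counted as a data row
nrow(with_header)  # only the three data rows

# Without header = TRUE the distance column is not numeric
# (character or factor, depending on the R version).
is.numeric(no_header$V2)
is.numeric(with_header$Domain.Architecture.Distance)
```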
The above function "sumRows" is defined as follows:
sumRows <- function( tbl, ps ) {
sum(
sapply(ps,
'ps'? What is ps????
function(x) {
t <- if ( is.na(x) ) {
I suspect that it is not `which` that is the problem, but rather your understanding of how `if` processes vectors. (This also should be simplified greatly to avoid stepping through vectors one element at a time.)
tbl[ which( is.na(tbl[ , "Domain.Architecture.Distance" ]) ), , drop=F]
You didn't do anything with that result!
} else {
tbl[ which( tbl[ , "Domain.Architecture.Distance" ] == x ), ,
drop=F]
}
nrow(t)
That value will not depend in any manner on what preceded it. ???? It will simply be the number of rows in the local copy of "t".
Your goal is _only_ to get a count? Why not just this:
sum( tbl[!is.na(tbl$Domain.Architecture.Distance), "Domain.Architecture.Distance" ] == x )
E.g.:
sum( tbl[!is.na(tbl$Domain.Architecture.Distance), "Domain.Architecture.Distance" ] == 0.99)
[1] 3440
You should probably be creating a factor variable with `cut` to create reasonable intervals for grouping, and if you do not know this it suggests you need to do more study of the text or introductory materials. To get a quick look at the distribution this is useful:
plot( density(tbl[!is.na(tbl$Domain.Architecture.Distance), "Domain.Architecture.Distance" ] ))
(125 KB file so not attached)
table( cut(tbl$Domain.Architecture.Distance, breaks=(0:10)/10) )
  (0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1]
      616      1864       328       103       923      1763      1151      2490      3709     38563
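The same binning can be tried on a handful of values to see how `cut` and `table` interact with NA. A minimal sketch, assuming made-up distances (the vector `d` is hypothetical, not the poster's data):

```r
# Made-up distances, including one missing value.
d <- c(0.05, 0.15, 0.15, 0.55, 0.99, 0.99, 1, NA)

# Right-closed intervals (0,0.1], (0.1,0.2], ..., (0.9,1].
# A value of exactly 0 would fall outside them and become NA
# unless include.lowest = TRUE is added.
bins <- cut(d, breaks = (0:10) / 10)

# useNA = "ifany" adds a separate cell for missing values,
# so the counts add up to length(d).
counts <- table(bins, useNA = "ifany")
counts
sum(counts) == length(d)
```

With the default useNA = "no", as in the call above on the real data, missing values are silently left out of the counts.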
} ) ) }
What causes the different behavior of sumRows when the which calls are deleted? What does which do that I seem not to grasp?
The question ... as yet unanswered .... is _how_ exactly are you calling that function. You posted a link to data "t" but there is no code that calls that function with the data. I do not see anything that would resemble a "ps"-object.
Or is there an error in my test.tbl?
(See above.)
Any help on this subject will be greatly appreciated. Kind regards and *merry christmas*!
[[alternative HTML version deleted]]
Please read the Posting Guide and learn to post in plain text.
David Winsemius Alameda, CA, USA
On 17-12-2012, at 21:03, Asis Hallab wrote:
Dear R experts, please accept my apologies for the missing information. You need to call sumRows in the following manner:
sumRows(t, sort( unique( t[,"Domain.Architecture.Distance"] ) ) )
Thank you Berend and David for pointing out my mistake.
Use this alternative sumRows
sumRows.1 <- function( tbl, ps ) {
  sum(
    sapply(ps,
      function(x) {
        t <- if ( is.na(x) ) {
          tbl[ is.na(tbl[ , "Domain.Architecture.Distance" ] ), , drop=FALSE]
        } else {
          # explicit check for NA
          tbl[ !is.na(tbl[ , "Domain.Architecture.Distance" ]) & tbl[ , "Domain.Architecture.Distance" ] == x , , drop=FALSE]
        }
        nrow(t)
      }
    )
  )
}
z <- sort( unique( t[,"Domain.Architecture.Distance"] ) )
sumRows(t,z)
sumRows.1(t,z)
You must check with is.na() when not using which.
More insight can be gained by reading the help for Logical operators.
Try
?'!'
and read the bit about NA.
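A small made-up example may make the difference concrete. This is a sketch on a hypothetical five-row data frame (the names `tbl`, `d` and `cond` are illustrative only), not the poster's table: a comparison with NA yields NA, a logical index containing NA selects an all-NA row for each NA, and which() returns only the positions that are TRUE.

```r
# Hypothetical table with two missing distances.
tbl <- data.frame(id = 1:5, d = c(0.5, NA, 0.5, 0.9, NA))

cond <- tbl$d == 0.5   # TRUE NA TRUE FALSE NA

# Logical indexing: every NA in the index produces a row of NAs,
# so the two NA rows are counted in addition to the two matches.
n_plain <- nrow(tbl[cond, , drop = FALSE])

# which() drops NA and FALSE and keeps only the TRUE positions.
n_which <- nrow(tbl[which(cond), , drop = FALSE])

# Guarding with !is.na() turns the NA entries into FALSE.
n_guard <- nrow(tbl[!is.na(tbl$d) & cond, , drop = FALSE])

c(plain = n_plain, which = n_which, guarded = n_guard)
```

Because the NA rows are picked up again for every value that is tested, summing the unguarded counts over all distinct values can exceed the number of rows in the table.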
I'm too lazy to check if the modification with !is.na completely accounts for the difference between the which and the not which versions.
And please don't use t as an object name.
Berend