Entering data into a multi-way array? - R-help

Sat, Oct 1, 2011 1:38 PM #

I am trying to replicate the script, appended below.  My data is in OOCalc
files.  The script (below) synthesizes a dataset (it serves as a
"tutorial"), but I will need to get my data from OOCalc into R for use in
that script (which uses arrays).

I've worked my way through the script, and understand how most of it works
(except the first bit - Step 1 - which is irrelevant to me, anyway).

[begin script]

### Supplementary material with the paper
### Interpretation of ANOVA models for microarray data using PCA
### J.R. de Haan et al. Bioinformatics (2006)
### Please cite this paper when you use this code in a publication.
### Written by J.R. de Haan, December 18, 2006

### Step1: a synthetic dataset of 500 genes is generated with 5 classes
###	1 unresponsive genes (300 genes)
###	2 constant genes (50 genes)
###	3 profile 1 (50 genes)
###	4 profile 2 (50 genes)
###	5 profile 3 (50 genes)

#generate synthetic dataset with similar dimensions:
# 500 genes, 3 replicates, 10 timepoints, 4 treatments
X <- array(0, c(500, 3, 10, 4))
labs.synth <- c(rep(1, 300), rep(2, 50), rep(3, 50), rep(4, 50), rep(5, 50))
gnames <- cbind(labs.synth, labs.synth)
#print(dim(gnames))
gnames[1:300,2] <- "A"
gnames[301:350,2] <- "B"
gnames[351:400,2] <- "C"
gnames[401:450,2] <- "D"
gnames[451:500,2] <- "E"

### generate 300 "noise" genes with expressions slightly larger than
### the detection limit (class 1)
X[labs.synth==1,1,,] <- rnorm(length(X[labs.synth==1,1,,]), mean=50, sd=40)
X[labs.synth==1,2,,] <- X[labs.synth==1,1,,] +
  rnorm(length(X[labs.synth==1,1,,]), mean=0, sd=10)
X[labs.synth==1,3,,] <- X[labs.synth==1,1,,] +
  rnorm(length(X[labs.synth==1,1,,]), mean=0, sd=10)

# generate 50 stable genes at two levels (class 2)
X[301:325,1,,] <- rnorm(length(X[301:325,1,,]), mean=1500, sd=40)
X[301:325,2,,] <- X[301:325,1,,] + rnorm(length(X[301:325,1,,]), mean=0,
sd=10)
X[301:325,3,,] <- X[301:325,1,,] + rnorm(length(X[301:325,1,,]), mean=0,
sd=10)

X[326:350,1,,] <- rnorm(length(X[326:350,1,,]), mean=3000, sd=40)
X[326:350,2,,] <- X[326:350,1,,] + rnorm(length(X[326:350,1,,]), mean=0,
sd=10)
X[326:350,3,,] <- X[326:350,1,,] + rnorm(length(X[326:350,1,,]), mean=0,
sd=10)

# generate50 genes with profile 1 (class 3)
increase.range <- matrix(rep(1:50, 10), ncol=10, byrow=FALSE)
profA3 <- matrix(rep(c(10, 60, 110, 150, 150, 150, 150, 150, 150, 150) ,
50),
                 ncol=10, byrow=TRUE) * increase.range
X[351:400,1,,1] <- profA3 + rnorm(length(profA3), mean=0, sd=40)
profB3 <- matrix(rep(c(10, 100, 220, 280, 280, 280, 280, 280, 280, 280),
50),
                 ncol=10, byrow=TRUE) * increase.range
X[351:400,1,1:10,2] <- profB3 + rnorm(length(profA3), mean=0, sd=40)
profC3 <- matrix(rep(c(10, 120, 300, 300, 280, 280, 280, 280, 280, 280),
50),
                 ncol=10, byrow=TRUE) * increase.range
X[351:400,1,1:10,3] <- profC3 + rnorm(length(profA3), mean=0, sd=40)
profD3 <- matrix(rep(c(100, 75, 50, 50, 50, 50, 50, 50, 75, 100), 50),
ncol=10,
                 byrow=TRUE)
X[351:400,1,1:10,4] <- profD3 + rnorm(length(profA3), mean=0, sd=40)
#again replicates
X[351:400,2,,] <- X[351:400,1,,] + rnorm(length(X[351:400,2,,]), mean=0,
sd=10)
X[351:400,3,,] <- X[351:400,1,,] + rnorm(length(X[351:400,3,,]), mean=0,
sd=10)

# generate50 genes with profile 2 (class 4)
increase.range <- matrix(rep(1:50, 10), ncol=10, byrow=FALSE)
profA4 <- matrix(rep(c(10, 60, 110, 150, 125, 100, 75, 50, 50, 50) , 50),
                 ncol=10, byrow=TRUE) * increase.range
X[401:450,1,,1] <- profA4 + rnorm(length(profA4), mean=0, sd=40)
profB4 <- matrix(rep(c(10, 100, 220, 280, 200, 150, 100, 50, 50, 50), 50),
                 ncol=10, byrow=TRUE) * increase.range
X[401:450,1,1:10,2] <- profB4 + rnorm(length(profA4), mean=0, sd=40)
profC4 <- matrix(rep(c(10, 150, 300, 220, 150, 100, 50, 50, 50, 50), 50),
                 ncol=10, byrow=TRUE) * increase.range
X[401:450,1,1:10,3] <- profC4 + rnorm(length(profA4), mean=0, sd=40)
profD4 <- matrix(rep(c(150, 100, 50, 50, 75, 75, 75, 100, 100, 100), 50),
                 ncol=10, byrow=TRUE)
X[401:450,1,1:10,4] <- profD4 + rnorm(length(profA4), mean=0, sd=40)
#again replicates
X[401:450,2,,] <- X[401:450,1,,] + rnorm(length(X[401:450,2,,]), mean=0,
sd=10)
X[401:450,3,,] <- X[401:450,1,,] + rnorm(length(X[401:450,3,,]), mean=0,
sd=10)

# generate50 genes with profile 3 (class 5)
increase.range <- matrix(rep(1:25, 20), ncol=10, byrow=FALSE)
profA4 <- matrix(rep((200 - c(10, 60, 110, 150, 125, 100, 75, 50, 50, 50)),
50),
                 ncol=10, byrow=TRUE) * increase.range
X[451:500,1,,1] <- profA4 + rnorm(length(profA4), mean=0, sd=40)
profB4 <- matrix(rep((200 - c(10, 100, 180, 200, 200, 150, 100, 50, 50,
50)), 50),
                 ncol=10, byrow=TRUE) * increase.range
X[451:500,1,1:10,2] <- profB4 + rnorm(length(profA4), mean=0, sd=40)
profC4 <- matrix(rep((200 - c(10, 150, 200, 180, 150, 100, 50, 50, 50, 50)),
50),
                 ncol=10, byrow=TRUE) * increase.range
X[451:500,1,1:10,3] <- profC4 + rnorm(length(profA4), mean=0, sd=40)
profD4 <- matrix(rep((200 - c(150, 100, 50, 50, 75, 75, 75, 100, 100, 100)),
50),
                 ncol=10, byrow=TRUE)
X[451:500,1,1:10,4] <- profD4 + rnorm(length(profA4), mean=0, sd=40)
#again replicates
X[451:500,2,,] <- X[451:500,1,,] + rnorm(length(X[451:500,2,,]), mean=0,
sd=10)
X[451:500,3,,] <- X[451:500,1,,] + rnorm(length(X[451:500,3,,]), mean=0,
sd=10)


# Step 2: Now the effects for different factors in the ANOVA model
# can be calculated:

# subtraction of the general mean
x <- X - mean(X, na.rm=TRUE)
tpoints <- c(1, 3, 6, 12,c( 24*c(1, 2, 3, 5, 8, 12)))
nrgenes <- dim(x)[1]

# calculation of the three main effects
cat("calculating main effects\n")
timemeans <- apply(x, 3, mean, na.rm=TRUE)
treatmeans <- apply(x, 4, mean, na.rm=TRUE)
genemeans <- apply(x, 1, mean, na.rm=TRUE)

#par(mfrow=c(2, 3))


# calculation of the interaction effects
# interaction time-treatment
cat("calculating interaction time-treatment\n")
mean.ti.tr <- apply(x, c(3,4), mean, na.rm=TRUE)
#print(dim(mean.ti.tr))
time1.m <- matrix(rep(timemeans, 4), nrow=10, byrow=FALSE)
tr1.m <- matrix(rep(treatmeans, 10), nrow=10, byrow=TRUE)
int.ti.tr <- (mean.ti.tr - time1.m) - tr1.m

# interaction time-gene
cat("calculating interaction time-gene\n")
mean.ti.gene <- apply(x, c(1,3), mean, na.rm=TRUE)
#print(dim(mean.ti.gene))
time2.m <- matrix(rep(timemeans, dim(x)[1]), nrow=dim(x)[1], byrow=TRUE)
gene2.m <- matrix(rep(genemeans, 10), nrow=dim(x)[1], byrow=FALSE)
int.ti.gene <- (mean.ti.gene - time2.m) - gene2.m

# interaction gene-treatment
cat("calculating interaction gene-treatment\n")
mean.gene.tr <- apply(x, c(1,4), mean, na.rm=TRUE)
#print(dim(mean.gene.tr))
gene3.m <- matrix(rep(genemeans, 4), ncol=4, byrow=FALSE)
tr3.m <- matrix(rep(treatmeans, dim(x)[1]), ncol=4, byrow=TRUE)
int.gene.tr <- (mean.gene.tr - gene3.m) - tr3.m

# calculation of 3 factor interaction
cat("calculating 3 factor interaction\n")
mean.gene.time.tr <- apply(x, c(1, 3, 4), mean, na.rm=TRUE)
#print(dim(mean.gene.time.tr))

ar.ti.tr <- array(0, c(nrgenes, 10, 4))
for (i in 1:nrgenes){ar.ti.tr[i,,] <- int.ti.tr}

ar.ti.gene <- array(0, c(nrgenes, 10, 4))
for (i in 1:4){ar.ti.gene[,,i] <- int.ti.gene}

ar.gene.tr <- array(0, c(nrgenes, 10, 4))
for (i in 1:10){ar.gene.tr[,i,] <- int.gene.tr}

ar.ti <- array(0, c(nrgenes, 10, 4))
for (i in 1:4){ar.ti[,,i] <- time2.m}

ar.tr <- array(0, c(nrgenes, 10, 4))
for (i in 1:10){ar.tr[,i,] <- tr3.m}

ar.gene <- array(0, c(nrgenes, 10, 4))
for (i in 1:4){ar.gene[,,i] <- gene2.m}

int.gene.time.tr <- mean.gene.time.tr - ar.ti - ar.tr - ar.gene -
  ar.ti.tr - ar.ti.gene - ar.gene.tr
imat.gtt <- cbind(int.gene.time.tr[,,1],
                  int.gene.time.tr[,,2],
                  int.gene.time.tr[,,3],
                  int.gene.time.tr[,,4])

### calculation of error term
error.term1 <- abs(sweep(x, c(1, 3, 4), mean.gene.time.tr))
cell.error <- apply(error.term1, c(1, 3, 4), mean, na.rm=TRUE)
mncn.error <- scale(cbind(cell.error[,,1], cell.error[,,2],
                          cell.error[,,3], cell.error[,,4]), scale=FALSE)

### Step 3: The results of the model can now be inspected with
### different plots (PCA for the interactions)
#plot(timemeans, type="l")
#plot(treatmeans, type="l")
#plot(genemeans, type="l")

#source("biplot.R")
#jbiplot(1, 2, int.ti.gene, gnames[,2], as.character(tpoints), rep(2, 10))
#jbiplot(1, 2, int.gene.tr, gnames[,2], c("TR1", "TR2", "TR3", "UNT"), 2:5)
#jbiplot(1, 2, int.ti.tr, as.character(tpoints), c("TR1", "TR2", "TR3",
"UNT"), rep(2, 4))
#jbiplot(1, 2, imat.gtt, gnames[,2], as.character(rep(tpoints, 4)),
sort(rep(2:5, 10)))

[end script]

--
View this message in context: http://r.789695.n4.nabble.com/Entering-data-into-a-multi-way-array-tp3862054p3863670.html
Sent from the R help mailing list archive at Nabble.com.

Victoria_Stuart

Wed, Oct 5, 2011 7:41 PM #

Solution:  I figured this out on my own (below).

V1        V2        V3        V4        V5           V6        V7
1 NM_005588 NM_004407 NM_006136 NM_004817 NM_006012 NM_001008693 NM_181435
[snip]
       V497      V498      V499      V500
1 NM_152546 XM_375762 NM_020696 NM_138459

# Saved 500 genes (values only): 3 replicates of (8 time points for
treatment 1) + 3 replicates of (8 time points for treatment 2)  = 500 x 48
.csv file

Read 24000 items

[1] 500  48

# Looks good: Checked all, below, against 'source'  file "X.ods" : OK!

, , 1, 1   # = Time Point 1 replicates (columns) for Treatment 1
                [,1]          [,2]         [,3]
  [1,] -3.2093000000 -2.1463000000  0.347800000
  [2,]  0.1689000000 -0.0945000000 -0.383400000
  [3,] -0.0487000000 -0.8695000000 -1.990200000
[snip]

, , 2, 1   # = Time Point 2 replicates (columns) for Treatment 1
          [,1]          [,2]    [,3]
  [1,] -2.8531 -3.2044000000 -1.3212
  [2,]  0.5820  0.1358000000  0.4183
  [3,]  1.0075  0.3211000000 -1.7940
[snip]

, , 8, 1   # = Time Point 8 replicates (columns) for Treatment 1
                [,1]          [,2]    [,3]
  [1,] -2.3661000000 -1.8353000000  0.0682
  [2,]  0.6607000000  0.0726000000 -0.1255
  [3,] -0.4960000000 -1.8894000000 -0.9466
[snip]

, , 1, 2   # = Time Point 1 replicates (columns) for Treatment 2
                [,1]    [,2]    [,3]
  [1,] -2.0151000000 -1.4385 -0.9281
  [2,] -0.0482000000  0.2382 -0.1319
  [3,] -1.3375000000 -1.4805 -1.4416
[snip]

, , 2, 2   # = Time Point 2 replicates (columns) for Treatment 2
          [,1]    [,2]          [,3]
  [1,] -2.3212 -1.3964 -1.1155000000
  [2,]  0.0264  0.2057  0.1566000000
  [3,] -0.3569 -0.9750 -0.8666000000
[snip]

, , 7, 2   # = Time Point 7 replicates (columns) for Treatment 2
                [,1]          [,2]    [,3]
  [1,] -1.2615000000 -2.1288000000 -1.8844
  [2,]  0.1048000000 -0.2306000000 -0.0634
  [3,] -1.4571000000  0.7725000000 -1.0384
[snip]

, , 8, 2   # = Time Point 8 replicates (columns) for Treatment 2
          [,1]          [,2]    [,3]
  [1,] -0.0819 -1.8137000000 -1.4085
  [2,] -0.1336  0.2744000000  1.2913
  [3,] -1.1404 -0.7546000000 -1.2400
[snip]

Thanks anyway - Victoria  :-)


--
View this message in context: http://r.789695.n4.nabble.com/Entering-data-into-a-multi-way-array-tp3862054p3876874.html
Sent from the R help mailing list archive at Nabble.com.

Victoria_Stuart

Sat, Oct 8, 2011 2:13 PM #

Update:

1. Here is a better illustration.  I assumed that the gene expression was
the same for the three replicates for each gene, in the respective Time and
Treatment groups, so that I could follow (verify) what was going on.

2. To better-view the following, copy and paste into a plain-text editor
with a monospace font (e.g. gedit on Ubuntu with the Monospace font).

Sincerely, Victoria  :-)

 ------------------------------------

5 Genes x 3 Replicates x 2 Times x 2 Treatments

Read 60 items

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    1    1    1    2    2    2    3    3    3     4     4     4   #
Gene1 avg. = row avg. = 2.5
[2,]    1    1    1    2    2    2    3    3    3     4     4     4   #
Gene2 avg. = row avg. = 2.5
[3,]    1    1    1    2    2    2    3    3    3     4     4     4   #
Gene3 avg. = row avg. = 2.5
[4,]    1    1    1    2    2    2    3    3    3     4     4     4   #
Gene4 avg. = row avg. = 2.5
[5,]    1    1    1    2    2    2    3    3    3     4     4     4   #
Gene5 avg. = row avg. = 2.5

#      |------- Treatment1 -------|  |-------- Treatment2 ---------| Tr1
avg. = (1+2)/2; Tr2 avg. = (3+4)/2
#      |-- Time1 --|  |-- Time2 --|  |-- Time1 --|     |-- Time2 --| Ti1
avg. = (1+3)/2; Ti2 avg. = (2+4)/2
#          (x3)           (x3)           (x3)              (x3)  

# This, above,is for illustration only; the actual data will have different
values for eah of the above
# [ 5x3x2x2 = 60 data individual, independent data points ]

[1]  5 12

[1] 5

[1] 5 3 2 2

[1] 2.5 2.5 2.5 2.5 2.5  # = row averages (5 Genes)

[1] 2 3  # = Time1 avg  Time2 avg  (1+1+1+3+3+3)/6; (2+2+2+4+4+4)/6

[1] 1.5 3.5  # Treatment1 avg  Treatment2 avg  (1+1+1+2+2+2)/6;
(3+3+3+4+4+4)/6

------------------------------------


--
View this message in context: http://r.789695.n4.nabble.com/Entering-data-into-a-multi-way-array-tp3862054p3886079.html
Sent from the R help mailing list archive at Nabble.com.

Victoria_Stuart

Sat, Oct 8, 2011 5:17 PM #

One last comment: add sep="," to the read.csv assignment, to avoid "Error in
scan(file ... : scan() expected 'a real', got" -type errors:

x<-matrix(scan("/home/victoria/R/5.3.2.2.csv",n=5*12,sep=","),5,12,byrow=TRUE)



--
View this message in context: http://r.789695.n4.nabble.com/Entering-data-into-a-multi-way-array-tp3862054p3886346.html
Sent from the R help mailing list archive at Nabble.com.