parts of data frames: subset vs. [-c()] - R-help

Fri, Aug 26, 2005 9:17 AM #

Dear all

I have a problem with splitting up a data frame called ReVerb:

> str(ReVerb)
`data.frame':   92713 obs. of  16 variables:
 $ CHILD    : Factor w/ 7 levels "ABE","ADA","EVE",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ AGE      : Factor w/ 484 levels "1;06.00","1;06.16",..: 43 43 43 99 99 99 99 99 99 99 ...
 $ AGE_Q    : num  2.0 2.0 2.0 2.4 2.4 ...
 $ INTERVALS: num  2 2 2 2.25 2.25 2.25 2.25 2.25 2.25 2.25 ...
 $ RND      : int  34368 38311 14949 20586 72516 27186 88019 10767 114448 86146 ...
 $ SYNTAX   : Factor w/ 17 levels "Acmp","Amats",..: 15 12 8 15 7 16 7 7 16 7 ...
 $ LEXICAL  : Factor w/ 1643 levels "$ACHE","$ACT",..: 194 803 803 294 299 803 1562 299 679 1562 ...
 $ MORPH    : Factor w/ 337 levels "$","$ =inf","$ =prs",..: 9 20 9 39 184 231 57 67 231 39 ...
 $ COMPLEM  : Factor w/ 1989 levels "$","$ V PR=Lp [1.2]",..: 203 547 220 203 1101 368 1834 1667 368 1834 ...
 $ MATRIX   : Factor w/ 906 levels "$ ???","$ be PR=Aen",..: 5 5 5 308 5 856 5 5 856 308 ...
 $ SITUATION: Factor w/ 9 levels "[imitation of Mom: you know what I said]",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ V_ANN    : int  1 1 1 4 4 4 4 3 3 3 ...
 $ QUEST    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ EXCL     : int  0 0 0 1 1 1 1 0 0 0 ...
 $ U_LEN    : int  3 4 5 13 13 13 13 8 8 8 ...
 $ UTTERANCE: Factor w/ 55113 levels "","# (be)cause he wanted to .",..: 5696 39091 52180 2262 2262 2262 2262 3593 3593 3593 ...

The variable causing the problem is SYNTAX:

> as.data.frame(sort(table(SYNTAX)))
              sort(table(SYNTAX))
Particles                     100
PR=N1                         144
Amats                         271
Trans_PR=A2                   787
Ditrans                      1181
Intrans_PR=A1                1399
Acmp                         2402
Trans_PR=V2                  2433
CPcmps                       2769
Vpreps                       4896
Intrans_V0                   5182
Trans_PR=L2                  7653
Trans_V02                    8117
Intrans_PR=L1                8457
Intrans_V1                   9643
Intrans_PR=V1               14987
Trans_V12                   22288


I would like to extract all cases where SYNTAX=="Ditrans" from ReVerb, store that in a file, and then generate ReVerb again without these cases and factor levels. My problem is probably obvious from the following lines of code:

> ditrans<-which(SYNTAX=="Ditrans")
> ReVerb1<-ReVerb[-c(ditrans),]; dim(ReVerb1)
[1] 91532    16
>
> # ok, so the 92713-91532=1181 cases where SYNTAX=="Ditrans" have been removed, but ...
>
> ReVerb1<-subset(ReVerb, SYNTAX!="Ditrans"); dim(ReVerb1)
[1] 91528    16
>
> # ... so why don't I get 91532 again as the number of rows?
>
Any ideas??

> R.version # on Windows XP with Service Pack 2
         _              
platform i386-pc-mingw32
arch     i386           
os       mingw32        
system   i386, mingw32  
status                  
major    2              
minor    1.1            
year     2005           
month    06             
day      20             
language R              

Thanks a lot,
STG
--
Stefan Th. Gries
----------------------------------------
Max Planck Inst. for Evol. Anthropology
http://people.freenet.de/Stefan_Th_Gries
----------------------------------------

Peter Dalgaard

Fri, Aug 26, 2005 9:33 AM #

The SYNTAX variable is not necessarily the same. Could you retry the
first case with 

 ditrans <- which(ReVerb$SYNTAX=="Ditrans")

? 

Otherwise, try doing a setdiff() on the rownames of the two discrepant
results and see which are the four cases that differ.
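The setdiff() check suggested above can be sketched on a toy data frame. The data frame `df` and its values are made up for illustration and are not the poster's ReVerb data; only the comparison technique carries over.

```r
# Toy stand-in for ReVerb (hypothetical values), with two NAs in SYNTAX
df <- data.frame(
  SYNTAX = factor(c("Ditrans", "Acmp", NA, "Vpreps", "Ditrans", NA)),
  U_LEN  = c(3, 4, 5, 13, 8, 8)
)

# Approach 1: negative indexing with which(), using df$SYNTAX explicitly
ditrans  <- which(df$SYNTAX == "Ditrans")
by_index <- df[-ditrans, ]

# Approach 2: subset() with the negated condition
by_subset <- subset(df, SYNTAX != "Ditrans")

# Row names that survive approach 1 but not approach 2
missing_rows <- setdiff(rownames(by_index), rownames(by_subset))
print(missing_rows)

# Inspect the discrepant rows in the original data
print(df[missing_rows, ])
```

In this toy case the discrepant rows are exactly those where SYNTAX is NA. Whether that is also true of ReVerb can only be confirmed by running the same comparison on the real data.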

   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

Brian Ripley

Fri, Aug 26, 2005 10:40 AM #

Are there NAs in the variable?

SYNTAX=="Ditrans" and SYNTAX!="Ditrans" do not cover every row between them: both are NA wherever SYNTAX is NA.
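A small sketch of why NAs matter here, using a made-up factor `x` rather than the real SYNTAX column. Note that the counts in the table(SYNTAX) output of the original post add up to 92709, four fewer than the 92713 rows, which is consistent with four NAs, since table() leaves NA out by default.

```r
# Hypothetical factor with one missing value
x <- factor(c("Ditrans", "Acmp", NA, "Acmp"))

eq  <- x == "Ditrans"   # TRUE where equal, NA where x is NA
neq <- x != "Ditrans"   # TRUE where different, NA where x is NA
print(eq)
print(neq)

# table() omits NA by default, so its counts can be short of length(x)
n_tabled <- sum(table(x))
n_na     <- sum(is.na(x))

# which() skips NA, so negative indexing keeps the NA element ...
kept_by_index <- length(x[-which(eq)])

# ... whereas subset() treats NA in the condition as FALSE and drops it
kept_by_subset <- length(subset(x, neq))

print(c(kept_by_index = kept_by_index, kept_by_subset = kept_by_subset))
```

The gap between the two counts equals the number of NAs in `x`, which mirrors the 91532 vs. 91528 discrepancy if ReVerb$SYNTAX contains four NAs.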

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Stefan Th. Gries

Fri, Aug 26, 2005 10:42 AM #

With 'ditrans<-which(ReVerb$SYNTAX=="Ditrans")', the results were the same as with 'ditrans<-which(SYNTAX=="Ditrans")'.

This solved the issue: using setdiff(), I found that the cases which the second approach with subset() fails to include are NAs. I was not aware of how subset() treats NA, sorry.
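For the original goal (extract the "Ditrans" cases, store them in a file, and rebuild the data frame without those cases and without the unused factor level), one NA-safe way is sketched below. The data frame `df`, the helper vector `is_ditrans`, and the output file are illustrative assumptions, not part of the thread; whether NA rows belong in the remainder is a choice the analyst has to make.

```r
# Toy stand-in for ReVerb (hypothetical values)
df <- data.frame(
  SYNTAX = factor(c("Ditrans", "Acmp", NA, "Vpreps", "Ditrans")),
  U_LEN  = c(3, 4, 5, 13, 8)
)

# Logical selector that is never NA: rows with missing SYNTAX count as "not Ditrans"
is_ditrans <- !is.na(df$SYNTAX) & df$SYNTAX == "Ditrans"

ditrans_rows <- df[is_ditrans, ]    # the extracted cases
rest         <- df[!is_ditrans, ]   # everything else, NA rows included

# Re-create the factor so the now-unused level "Ditrans" disappears
rest$SYNTAX <- factor(rest$SYNTAX)
print(levels(rest$SYNTAX))

# Store the extracted cases; a temporary file is used here as a placeholder path
out_file <- tempfile(fileext = ".txt")
write.table(ditrans_rows, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

Logical indexing also avoids a pitfall of df[-which(cond), ]: if no row matches, which() returns an empty vector and the negative index then selects zero rows instead of all of them. Later R versions also provide droplevels() for dropping unused levels across a whole data frame.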

Thanks a lot,
STG
--
Stefan Th. Gries
----------------------------------------
Max Planck Inst. for Evol. Anthropology
http://people.freenet.de/Stefan_Th_Gries
----------------------------------------
