data.table/ifelse conditional new variable question

10 messages · Kate Ignatius, Jorge I Velez, John McKown +1 more

#
Hi,

I have a data.table question (as well as if else statement query).

I have a large list of families (the file has 935 individuals, sorted
by family, and the families vary in size).  At the moment the file has
the columns:

SampleID FamilyID Relationship

To avoid making a pedigree file by hand (i.e., adding a PaternalID
and a MaternalID one by one), I want to try to write a script that
will quickly do this for me (I eventually want to run this through a
program such as plink).  Is there a way to use data.table (maybe in
conjunction with ifelse) to do this effectively?

An example of the file is something like:

Family.ID  Sample.ID  Relationship
14         62         sibling
14         94         father
14         63         sibling
14         59         mother
17         6004       father
17         6003       mother
17         6005       sibling
17         368        sibling
130        202        mother
130        203        father
130        204        sibling
130        205        sibling
130        206        sibling
222        9          mother
222        45         sibling
222        34         sibling
222        10         sibling
222        11         sibling
222        18         father

But the goal is to have a file like this:

Family.ID  Sample.ID  Relationship  PID   MID
14         62         sibling       94    59
14         94         father        0     0
14         63         sibling       94    59
14         59         mother        0     0
17         6004       father        0     0
17         6003       mother        0     0
17         6005       sibling       6004  6003
17         368        sibling       6004  6003
130        202        mother        0     0
130        203        father        0     0
130        204        sibling       203   202
130        205        sibling       203   202
130        206        sibling       203   202
222        9          mother        0     0
222        45         sibling       18    9
222        34         sibling       18    9
222        10         sibling       18    9
222        11         sibling       18    9
222        18         father        0     0

I've searched for this but with no luck.  I'd greatly appreciate any
help, even if it's just a link to a great example/solution!

Thanks!
#
Dear Kate,

Assuming you have nuclear families, one option would be:

x <- read.table(textConnection("Family.ID Sample.ID Relationship
14           62  sibling
14          94  father
14           63  sibling
14           59 mother
17         6004  father
17           6003 mother
17         6005   sibling
17         368   sibling
130           202 mother
130           203  father
130           204   sibling
130           205   sibling
130           206   sibling
222         9 mother
222         45  sibling
222         34  sibling
222         10  sibling
222         11  sibling
222         18  father"), header = TRUE)
closeAllConnections()

# Split by family, fill in the parent IDs within each family, recombine.
# This version assumes every family has exactly one father and one mother.
xs <- with(x, split(x, Family.ID))
res <- do.call(rbind, lapply(xs, function(l) {
  l$PID <- l$MID <- 0
  father <- with(l, Relationship == 'father')
  mother <- with(l, Relationship == 'mother')
  l$PID[l$Relationship == 'sibling'] <- l$Sample.ID[father]
  l$MID[l$Relationship == 'sibling'] <- l$Sample.ID[mother]
  l
}))
res

HTH,
Jorge.-





On Sun, Aug 17, 2014 at 5:42 AM, Kate Ignatius <kate.ignatius at gmail.com>
wrote:

  
  
#
Thanks!

I think I know what is being done here but not sure how to fix the
following error:

Error in l$PID[l$Relationship == "sibling"] <- l$Sample.ID[father] :
  replacement has length zero
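
For reference, this error can be reproduced with a family that lists no father. The logical index `father` is then all FALSE, so `l$Sample.ID[father]` has length zero and there is nothing to assign to the sibling rows. A minimal sketch, using a made-up two-row family (the data frame below is an illustration, not taken from the real file):

```r
# Hypothetical family: a mother and one sibling, but no father
l <- data.frame(Sample.ID    = c(349, 3456),
                Relationship = c("mother", "sibling"),
                PID          = 0)

father <- l$Relationship == "father"   # all FALSE for this family
n_fathers <- sum(father)               # no rows match

# The assignment fails because the right-hand side is empty;
# tryCatch() captures the error message instead of stopping.
msg <- tryCatch({
  l$PID[l$Relationship == "sibling"] <- l$Sample.ID[father]
  "no error"
}, error = function(e) conditionMessage(e))
msg
```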
On Sat, Aug 16, 2014 at 6:48 PM, Jorge I Velez <jorgeivanvelez at gmail.com> wrote:
#
Actually - I didn't check this before, but these are not all nuclear
families (as I assumed they were).  That is, some don't have a father
or don't have a mother.... Usually if this is the case PID or MID will
become 0, respectively, for the child.  How can the code be edited to
account for this?
On Sat, Aug 16, 2014 at 8:02 PM, Kate Ignatius <kate.ignatius at gmail.com> wrote:
#
Dear Kate,

Try this:

res <- do.call(rbind, lapply(xs, function(l) {
  l$PID <- l$MID <- 0
  father <- with(l, Relationship == 'father')
  mother <- with(l, Relationship == 'mother')
  # Siblings keep 0 when the family has no father (or no mother)
  if (sum(father) == 0) {
    l$PID[l$Relationship == 'sibling'] <- 0
  } else {
    l$PID[l$Relationship == 'sibling'] <- l$Sample.ID[father]
  }
  if (sum(mother) == 0) {
    l$MID[l$Relationship == 'sibling'] <- 0
  } else {
    l$MID[l$Relationship == 'sibling'] <- l$Sample.ID[mother]
  }
  l
}))

It is assumed that when either parent is not available the M/PID is 0.
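
Since the original question asked about data.table, the same per-family logic could perhaps be sketched there as well. This is only an illustration under some assumptions: the IDs are stored as integers, a missing parent is coded as 0, and the helper `first_or_zero` is a made-up name, not part of data.table. The small table reuses a few rows posted in this thread.

```r
library(data.table)

# A few rows from the thread; family 2702 has no father
dt <- data.table(
  Family.ID    = c(2702L, 2702L, 2702L, 3064L, 3064L, 3064L),
  Sample.ID    = c(349L, 3456L, 9980L, 3L, 4L, 5L),
  Relationship = c("mother", "sibling", "sibling",
                   "father", "mother", "sibling")
)

# Hypothetical helper: first matching parent ID in a family, or 0L if none.
# Taking the first match silently ignores duplicate parents, so those
# need to be checked for separately.
first_or_zero <- function(ids, is_parent) {
  if (any(is_parent)) ids[is_parent][1] else 0L
}

# One father ID and one mother ID per family, recycled to every row ...
dt[, `:=`(PID = first_or_zero(Sample.ID, Relationship == "father"),
          MID = first_or_zero(Sample.ID, Relationship == "mother")),
   by = Family.ID]

# ... then set the parents' own rows back to 0, as in the desired output
dt[Relationship != "sibling", `:=`(PID = 0L, MID = 0L)]
dt
```

The helper returns `0L` rather than `0` because data.table expects the values assigned within each group to have one consistent type; if the IDs were read in as doubles or characters, the fallback value would need to match.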

Best,
Jorge.-


On Sun, Aug 17, 2014 at 10:58 AM, Kate Ignatius <kate.ignatius at gmail.com>
wrote:

  
  
#
Yep - you're right - missing parents are indicated as zero in the M/PID field.

The above code ran, but with a few warnings:

1: In l$PID[l$Relationship == "sibling"] <- l$Sample.ID[father] :
  number of items to replace is not a multiple of replacement length
2: In l$PID[l$Relationship == "sibling"] <- l$Sample.ID[father] :
  number of items to replace is not a multiple of replacement length
3: In l$PID[l$Relationship == "sibling"] <- l$Sample.ID[father] :
  number of items to replace is not a multiple of replacement length
4: In l$MID[l$Relationship == "sibling"] <- l$Sample.ID[mother] :
  number of items to replace is not a multiple of replacement length

Looking at the output, I get unexpected numbers where the father/mother
ID should be in the M/PID fields.  For example:

2702  349       mother   0   0
2702  3456  sibling   0 842
2702  9980  sibling   0 842
3064  3  father   0   0
3064  4  mother   0   0
3064  5    sibling 879 880
3064  86   sibling 879 880
3064  87   sibling 879 880
On Sat, Aug 16, 2014 at 9:31 PM, Jorge I Velez <jorgeivanvelez at gmail.com> wrote:
#
Perhaps I am missing something but I do not get the same result:

x <- read.table(textConnection("Family.ID Sample.ID Relationship
2702  349       mother
2702  3456  sibling
2702  9980  sibling
3064  3  father
3064  4  mother
3064  5    sibling
3064  86   sibling
3064  87   sibling"), header = TRUE)
closeAllConnections()

xs <- with(x, split(x, Family.ID))
res <- do.call(rbind, lapply(xs, function(l) {
  l$PID <- l$MID <- 0
  father <- with(l, Relationship == 'father')
  mother <- with(l, Relationship == 'mother')
  if (sum(father) == 0) {
    l$PID[l$Relationship == 'sibling'] <- 0
  } else {
    l$PID[l$Relationship == 'sibling'] <- l$Sample.ID[father]
  }
  if (sum(mother) == 0) {
    l$MID[l$Relationship == 'sibling'] <- 0
  } else {
    l$MID[l$Relationship == 'sibling'] <- l$Sample.ID[mother]
  }
  l
}))
#       Family.ID Sample.ID Relationship MID PID
#2702.1      2702       349       mother   0   0
#2702.2      2702      3456      sibling 349   0
#2702.3      2702      9980      sibling 349   0
#3064.4      3064         3       father   0   0
#3064.5      3064         4       mother   0   0
#3064.6      3064         5      sibling   4   3
#3064.7      3064        86      sibling   4   3
#3064.8      3064        87      sibling   4   3

HTH,
Jorge.-




On Sun, Aug 17, 2014 at 11:47 AM, Kate Ignatius <kate.ignatius at gmail.com>
wrote:

  
  
#
Actually - your code is not wrong... because this is a large file I
went through the file to see if there was anything wrong with it -
looks like there are two fathers or three mothers in some families.
Taking these duplicates out fixed the problem.
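
A check along these lines might help spot such families before running the code above. It is a rough sketch on made-up data (family 1 deliberately lists two fathers); the column names follow the thread, everything else is illustrative.

```r
# Made-up example: family 1 lists two fathers, family 2 is fine
x <- data.frame(
  Family.ID    = c(1, 1, 1, 1, 2, 2, 2),
  Sample.ID    = c(10, 11, 12, 13, 20, 21, 22),
  Relationship = c("father", "father", "mother", "sibling",
                   "father", "mother", "sibling")
)

# Count fathers and mothers per family
parents <- x[x$Relationship %in% c("father", "mother"), ]
counts  <- table(parents$Family.ID, parents$Relationship)

# Families where either count exceeds one need a closer look
bad <- rownames(counts)[apply(counts > 1, 1, any)]
bad
```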

Sorry about the confusion!  And thanks so much for your help!
On Sat, Aug 16, 2014 at 9:53 PM, Jorge I Velez <jorgeivanvelez at gmail.com> wrote:
#
On Sat, Aug 16, 2014 at 9:02 PM, Kate Ignatius <kate.ignatius at gmail.com>
wrote:
Kate,
I hope you don't mind, but I have a question out of curiosity. Were the
families with multiple fathers or mothers a mistake, just duplicates (same
Family.ID & Sample.ID), or more like an "intermixed" family due to divorce
and remarriage? Or even, as in some countries, a case of polygamy? Sorry,
I just get curious about the strangest things sometimes.
#
On 17-Aug-2014 03:50:33 John McKown wrote:
When Kate first posted her query, similar thoughts to John's occurred
to me. The potential for convoluted ancestry and kinship is enormous!

For perhaps (or perhaps not) the ultimate convolution, try reconstructing
a canine pedigree from a breeding register of thoroughbreds, where
again the primary data for each individual is
  * ID of individual
  * ID of litter the individual was born in ("family")
  * ID of male parent
  * ID of female parent
(as, for instance, registered with the UK Kennel Club).

Similar convolutions can be found with race-horses.

But even humans can compete. Here is a little challenge for anyone
who has an R program that will work out a pedigree from data such as
described above. I have used Kate's notation. Individuals are numbered
from 1 up (with a gap): Sample.ID; Families from 101 up: Family.ID.
Relationships are "sibling", "father", "mother".

ID for father/mother may be "NA" (data not given).

Family.ID Sample.ID Relationship
101       01        sibling
101       02        father
101       03        mother

102       02        sibling
102       04        father
102       05        mother

103       03        sibling
103       06        father
103       07        mother

104       04        sibling
104       08        father
104       09        mother

104       05        sibling
104       08        father
104       09        mother

104       06        sibling
104       08        father
104       09        mother

104       15        sibling
104       08        father
104       09        mother

105       07        sibling
105       04        father
105       15        mother

106       08        sibling
106       16        father
106       17        mother

106       18        sibling
106       16        father
106       17        mother

106       19        sibling
106       16        father
106       17        mother

107       09        sibling
107       18        father
107       19        mother

108       16        sibling
108       NA        father
108       NA        mother

109       17        sibling
109       NA        father
109       NA        mother

That's the data. Now a little quiz question: Can you guess the
identity of the person with sample.ID = 01 ?
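
As a starting point for the challenge (not an answer to the quiz), the records above can be read as one (child, father, mother) triple per block, and the known ancestors of an individual collected generation by generation. The function name `ancestors` and the matrix layout are just one possible sketch.

```r
# The records above as (child, father, mother) triples, one per block
ped <- matrix(c(
   1,  2,  3,
   2,  4,  5,
   3,  6,  7,
   4,  8,  9,
   5,  8,  9,
   6,  8,  9,
  15,  8,  9,
   7,  4, 15,
   8, 16, 17,
  18, 16, 17,
  19, 16, 17,
   9, 18, 19,
  16, NA, NA,
  17, NA, NA
), ncol = 3, byrow = TRUE,
   dimnames = list(NULL, c("id", "father", "mother")))

# Collect all known ancestors of one individual, generation by generation
ancestors <- function(id, ped) {
  found   <- numeric(0)
  current <- id
  while (length(current) > 0) {
    rows    <- ped[, "id"] %in% current
    parents <- unique(na.omit(as.vector(ped[rows, c("father", "mother")])))
    current <- setdiff(parents, found)   # only follow newly seen ancestors
    found   <- c(found, current)
  }
  sort(found)
}

anc <- ancestors(1, ped)
anc
length(anc)   # compare with 2 + 4 + 8 + ... for a pedigree without inbreeding
```

Tracking the ancestors already seen keeps the loop from revisiting the same individuals, which matters here because several people appear on more than one branch.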

Best wishes to all,
Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at wlandres.net>
Date: 17-Aug-2014  Time: 19:41:38
This message was sent by XFMail