Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
(801) 408-8111
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Andres Legarra
> Sent: Thursday, March 20, 2008 2:25 AM
> To: Michael Dewey
> Cc: R-help at r-project.org
> Subject: Re: [R] two cols in a data frame are the same factor
>
> Hi,
> I am afraid you misunderstood it. I do not have repeated
> records, but for every record I have two, possibly different,
> simultaneously present, instantiations of an explanatory variable.
>
> My data is as follows :
>
> yield haplo1 haplo2
> 100 A B
> 151 B A
> 212 A A
>
> So I have one effect (haplo), but two copies of each affect "yield".
> If I use lm() I get:
> > a=data.frame(yield=c(100,151,212),haplo1=c("A","B","A"),haplo2=c("B","A","A"))
> Call:
> lm(formula = yield ~ -1 + haplo1 + haplo2, data = a)
>
> Coefficients:
> haplo1A haplo1B haplo2B
>     212     151    -112
>
>
> But I get different coefficients for the two "A"s (in fact one
> was set to 0) and the two "B"s. That is, the model has four
> unknowns but in my example I have just two!
>
> A least-squares solution is simple to do by hand:
>
> X=matrix(c(1,1,1,1,2,0),ncol=2) #the incidence matrix
> > X
> [,1] [,2]
> [1,] 1 1
> [2,] 1 2
> [3,] 1 0
> > solve(crossprod(X,X),crossprod(X,a$yield))
> [,1]
> [1,] 184.8333
> [2,] -30.5000
>
> where [1,] is the solution for A and [2,] is the solution for B
>
> This is not difficult to do by hand, but it is for a simple
> case and I miss all the machinery in lm()
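
One possible way to keep the lm() machinery, offered as a sketch rather than a definitive answer: build a single incidence matrix that counts how many copies of each haplotype a record carries, and pass that matrix to lm() without an intercept. The names lev, X and fit are illustrative, not from the original post. Note that counting copies for the three records gives rows (1,1), (1,1), (2,0), which is not the X typed above, so the estimates differ from the hand calculation.

```r
# Toy data from the post
a <- data.frame(yield  = c(100, 151, 212),
                haplo1 = c("A", "B", "A"),
                haplo2 = c("B", "A", "A"))

# All haplotype classes seen in either column (illustrative name: lev)
lev <- sort(unique(c(as.character(a$haplo1), as.character(a$haplo2))))

# One column per class; each entry is the number of copies (0, 1 or 2)
X <- sapply(lev, function(l) (a$haplo1 == l) + (a$haplo2 == l))

# One coefficient per haplotype class, shared by both columns
fit <- lm(yield ~ X - 1, data = a)
coef(fit)   # XA = 106, XB = 19.5 for this toy data
```

Since fit is an ordinary lm object, summary(), anova() and predict() should then be available as usual.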
>
> Thank you
> Andres
>
> On Wed, Mar 19, 2008 at 6:57 PM, Michael Dewey
> <info at aghmed.fsnet.co.uk> wrote:
> > At 09:11 18/03/2008, Andres Legarra wrote:
> > >Dear all,
> > >I have a data set (QTL detection) where I have two cols of factors in
> > >the data frame that correspond logically (in my model) to the same
> > >factor. In fact these are haplotype classes.
> > >Another real-life example would be family gas consumption as a
> > >function of car company (e.g. Ford, GM, and Honda) (assuming 2 cars by
> > >family).
> >
> > Unless I completely misunderstand this it looks like you have the
> > dataset in wide format when you really wanted it in long format (to
> > use the terminology of ?reshape). Then you would fit a model allowing
> > for the clustering by family.
> >
> >
> >
> >
> > >An artificial example follows:
> > >set.seed(1234)
> > >L3 <- LETTERS[1:3]
> > >(d <- data.frame(y=rnorm(10), fac=sample(L3, 10, replace=TRUE),
> > >                 fac1=sample(L3, 10, replace=TRUE)))
> > >
> > > lm(y ~ fac+fac1,data=d)
> > >
> > >and I get:
> > >
> > >Coefficients:
> > >(Intercept) facB facC fac1B fac1C
> > > 0.3612 -0.9359 -0.2004 -2.1376 -0.5438
> > >
> > >However, to respect my model, I need to constrain effects in fac and
> > >fac1 to be the same, i.e. facB=fac1B and facC=fac1C. There are
> > >logically just 4 unknowns (average,A,B,C).
> > >With continuous covariates one might do y ~ I(cov1+cov2), but this is
> > >not the case.
> > >
> > >Is there any trick to do that?
> > >Thanks,
> > >
> > >Andres Legarra
> > >INRA-SAGA
> > >Toulouse, France
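
For the three-level example with an intercept, the same idea might be sketched as follows. This is only one possible approach, and X, Xc and fit are hypothetical names. Every record carries exactly two levels, so the rows of the count matrix always sum to 2 and one column is redundant once an intercept is in the model; here the column for A is dropped so that A acts as the baseline.

```r
set.seed(1234)
L3 <- LETTERS[1:3]
d <- data.frame(y    = rnorm(10),
                fac  = sample(L3, 10, replace = TRUE),
                fac1 = sample(L3, 10, replace = TRUE))

# Number of copies of A, B and C carried by each record (0, 1 or 2)
X <- sapply(L3, function(l) (d$fac == l) + (d$fac1 == l))

# Rows sum to 2, so drop the A column to avoid collinearity with the intercept
Xc <- X[, -1, drop = FALSE]

# A single B effect and a single C effect, shared by fac and fac1
fit <- lm(y ~ Xc, data = d)
coef(fit)
```

Under this coding the intercept corresponds to a record with two copies of A, and each coefficient is the change per copy of B or C relative to A. The numbers depend on the random draw, so none are quoted here.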
> >
> > Michael Dewey
> > http://www.aghmed.fsnet.co.uk
> >
> >
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>