Fitting linear models

16 messages · Vemuri, Aparna, Bert Gunter, David Winsemius +2 more

#
Is this homework? If so, you need to read the text and/or class notes more
carefully.

-- Bert Gunter


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Vemuri, Aparna
Sent: Monday, April 20, 2009 4:26 PM
To: r-help at r-project.org
Subject: [R] Fitting linear models

I am not sure if this is an R-users question, but since most of you here
are statisticians, I decided to give it a shot. 

I am using the lm() function in R to fit a dependent variable to a set
of 3 to 5 independent variables. For this, I used the following
commands:
Coefficients:
(Intercept)          SO4          NO3          NH4
    0.01323      0.01968      0.01856           NA

and
Coefficients:
(Intercept)          SO4          NO3          NH4           Na           Cl
 -0.0006987   -0.0119750   -0.0295042    0.0842989    0.1344751           NA

In both cases, the last independent variable has a coefficient of NA in
the result. I say last variable because, when I change the order of the
variables, the coefficient changes (see below). Can anyone point me to
the reason R behaves this way? Is there any way for me to force R to use
all the variables? I checked the correlation matrices to make sure
there is no orthogonality between the variables.

Thanks
Aparna 

model1<-lm(formula = PBW ~ SO4 + NH4 +NO3)
Call:
lm(formula = PBW ~ SO4 + NH4 + NO3)

Coefficients:
(Intercept)          SO4          NH4          NO3
    0.01323     -0.00430      0.06394           NA
Call:
lm(formula = PBW ~ SO4 + NO3 + Na + Cl + NH4)

Coefficients:
(Intercept)          SO4          NO3           Na           Cl          NH4
 -0.0006987    0.0196371   -0.0050303    0.0685020    0.0427431           NA


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
On Apr 20, 2009, at 7:26 PM, Vemuri, Aparna wrote:

            
You can omit the unnecessary preambles.
You really did not name your dependent variable "function" did you?  
Please stop that.

Just a guess, ... since you have not provided enough information to do  
otherwise, ... Are all of those variables 1/0 dummy variables? If so  
and if you want to have an output that satisfies your need for  
labeling the coefficients as you naively anticipate, then put "0+" at  
the beginning of the formula or "-1" at the end, so that the intercept  
will disappear and then all variables will get labeled as you expect.
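
To make the situation David is guessing at concrete, here is a minimal sketch with made-up data. The names site, d_a, d_b, d_c and y are hypothetical and have nothing to do with Aparna's variables. It assumes a complete set of 1/0 dummy variables, which sum to the intercept column, so the last one cannot be estimated unless the intercept is dropped.

```r
# Hypothetical data: three sites coded as a complete set of 1/0 dummies.
set.seed(42)
site <- rep(c("a", "b", "c"), each = 10)
d_a <- as.numeric(site == "a")
d_b <- as.numeric(site == "b")
d_c <- as.numeric(site == "c")
y <- 1 + 2 * d_b + 3 * d_c + rnorm(30, sd = 0.1)

# With an intercept, d_a + d_b + d_c equals the intercept column,
# so the last dummy in the formula comes back as NA.
fit_int <- lm(y ~ d_a + d_b + d_c)
print(coef(fit_int))

# Dropping the intercept ("0 +" or "- 1") makes all three estimable.
fit_noint <- lm(y ~ 0 + d_a + d_b + d_c)
print(coef(fit_noint))
```

This only applies if the predictors really are a full set of dummies; for continuous predictors, removing the intercept does not cure a linear dependency among the predictors themselves.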
#
Try:
model1<-lm(PBW~SO4+NO3+NH4)
Does it work?
Dimitri
On Mon, Apr 20, 2009 at 7:26 PM, Vemuri, Aparna <avemuri at epri.com> wrote:

  
    
#
David,
Thanks for the suggestions. No, I did not label my dependent variable "function".

My dependent variable PBW and all the independent variables are continuous variables. It is especially troubling since the order in which I input the independent variables determines whether or not each one gets a coefficient. As I already mentioned, I checked the correlation matrix and picked the variables with moderate to high correlation with the dependent variable. So I guess it is not so naïve to expect a regression coefficient on all of them.

Dimitri 
model1<-lm(PBW~SO4+NO3+NH4), gives me the same result as before.

Bert:
 This is not homework. But I will remember to do my research before posting here.

Aparna 


-----Original Message-----
From: David Winsemius [mailto:dwinsemius at comcast.net] 
Sent: Monday, April 20, 2009 5:35 PM
To: Vemuri, Aparna
Cc: r-help at r-project.org
Subject: Re: [R] Fitting linear models
#
Aparna,

I should have been more explicit. Run ?lm . You'll see this:

"lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)"

So, in addition to specifying the formula, you have to specify the
data frame in which you keep your variables. I assume they are in a
data frame? (unless for some reason you keep all variables as
separate vectors).
So, after you wrote the formula, you have to indicate the name of the
data frame, for example "MyData":

model1<-lm(PBW~SO4+NO3+NH4, MyData)

Dimitri
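
A minimal sketch of the two calling styles under discussion, using simulated stand-ins for the variables (the values are random and the name MyData is only illustrative). It assumes the workspace vectors and the data frame columns hold the same numbers; in that case both calls return the same coefficients, because the data argument only controls where lm() looks up the names in the formula.

```r
# Simulated stand-ins; not Aparna's data.
set.seed(1)
PBW <- rnorm(50)
SO4 <- rnorm(50)
NO3 <- rnorm(50)
NH4 <- rnorm(50)

# Style 1: variables found as separate vectors in the workspace.
fit_vec <- lm(PBW ~ SO4 + NO3 + NH4)

# Style 2: the same variables collected in a data frame.
MyData <- data.frame(PBW, SO4, NO3, NH4)
fit_df <- lm(PBW ~ SO4 + NO3 + NH4, data = MyData)

# TRUE when the two fits agree.
print(all.equal(coef(fit_vec), coef(fit_df)))
```

If the two styles disagree on real data, that points to the vectors and the data frame not containing identical values.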
On Tue, Apr 21, 2009 at 11:12 AM, Vemuri, Aparna <avemuri at epri.com> wrote:

  
    
#
The variables are all in separate vectors. 

-----Original Message-----
From: Dimitri Liakhovitski [mailto:ld7631 at gmail.com] 
Sent: Tuesday, April 21, 2009 8:26 AM
To: Vemuri, Aparna
Cc: David Winsemius; r-help at r-project.org
Subject: Re: [R] Fitting linear models

#
Are they of the same length?
On Tue, Apr 21, 2009 at 11:31 AM, Vemuri, Aparna <avemuri at epri.com> wrote:

  
    
#
Yes, they are all of the same length.

-----Original Message-----
From: Dimitri Liakhovitski [mailto:ld7631 at gmail.com] 
Sent: Tuesday, April 21, 2009 8:32 AM
To: Vemuri, Aparna
Cc: r-help at r-project.org
Subject: Re: [R] Fitting linear models

#
Can we see your data to be able to replicate the error? Or maybe a
subset of data with some fake variable names?
On Tue, Apr 21, 2009 at 11:32 AM, Vemuri, Aparna <avemuri at epri.com> wrote:

  
    
#
On Apr 21, 2009, at 11:12 AM, Vemuri, Aparna wrote:

            
That was from my error in reading your call to lm. In my defense I am  
reasonably sure the proper assignment to arguments is lm(formula= ...)  
rather than lm(function= ...).
Did you get the expected results with:
model1<-lm(formula=PBW~SO4+NO3+NH4+0)

You could, of course, provide either the data or the results of str()  
applied to each of the variables and then we could all stop guessing.
#
Attached are the first hundred rows of my data in comma-separated format.
Forcing the regression line through the origin still does not give a coefficient on the last independent variable. Also, I don't mind if there is an intercept on the dependent axis. I just want all of the variables to have coefficients in the regression equation, or at least a consistent result irrespective of the order in which the variables are entered.

-----Original Message-----
From: David Winsemius [mailto:dwinsemius at comcast.net] 
Sent: Tuesday, April 21, 2009 8:38 AM
To: Vemuri, Aparna
Cc: r-help at r-project.org
Subject: Re: [R] Fitting linear models
#
I am not sure what the problem is.
I found no errors:

data<-read.csv(file.choose())  # I had to change your file extension to .csv first
dim(data)
names(data)

lapply(data,function(x){sum(is.na(x))})
lm.model.1<-lm(PBW~SO4+NO3+NH4,data)
lm.model.2<-lm(PBW~SO4+NH4+NO3,data)
print(lm.model.1)  # Getting nice results
print(lm.model.2) # Getting same results

# Another method (gets exactly the same results):
# (the Design package has since been superseded by rms, which also provides ols())
library(Design)
ols.model.1<-ols(PBW~SO4+NO3+NH4,data)
ols.model.2<-ols(PBW~SO4+NH4+NO3,data)

Dimitri
On Tue, Apr 21, 2009 at 11:50 AM, Vemuri, Aparna <avemuri at epri.com> wrote:

  
    
#
On Apr 21, 2009, at 10:37 AM, David Winsemius wrote:

            
I am going to take a wild stab in the dark here and suggest that 'NH4'  
is exactly correlated to or even identical to one of the other IVs  
used in the formula.

  set.seed(1)
  PBW <- rnorm(100)
  SO4 <- rnorm(100)
  NO3 <- rnorm(100)
  NH4 <- rnorm(100)

 > lm(PBW ~ SO4 + NO3 + NH4)

Call:
lm(formula = PBW ~ SO4 + NO3 + NH4)

Coefficients:
(Intercept)          SO4          NO3          NH4
     0.11065     -0.00273      0.02096     -0.04826


Now watch:

NH4 <- NO3 * 1.5

 > lm(PBW ~ SO4 + NO3 + NH4)

Call:
lm(formula = PBW ~ SO4 + NO3 + NH4)

Coefficients:
(Intercept)          SO4          NO3          NH4
   1.084e-01   -7.871e-05    1.596e-02           NA


 > cor(cbind(SO4, NO3, NH4))
             SO4         NO3         NH4
SO4  1.00000000 -0.04953621 -0.04953621
NO3 -0.04953621  1.00000000  1.00000000
NH4 -0.04953621  1.00000000  1.00000000


I suspect that there is a collinearity problem here. Aparna, post back  
with the correlation matrix of your IV's (full data set) and that  
should either support or refute my theory. If supported and you use:

 > summary(lm(PBW ~ SO4 + NO3 + NH4))

Call:
lm(formula = PBW ~ SO4 + NO3 + NH4)

Residuals:
      Min       1Q   Median       3Q      Max
-2.30129 -0.60350  0.01765  0.58513  2.27806

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.084e-01  9.083e-02   1.194    0.236
SO4         -7.871e-05  9.531e-02  -0.001    0.999
NO3          1.596e-02  8.827e-02   0.181    0.857
NH4                 NA         NA      NA       NA

Residual standard error: 0.9073 on 97 degrees of freedom
Multiple R-squared: 0.0003379,	Adjusted R-squared: -0.02027
F-statistic: 0.01639 on 2 and 97 DF,  p-value: 0.9837


Note the warning message about singularities for NH4.
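
A short sketch of how the dependency itself can be identified, rather than only noticed through the NA. It recreates the simulated variables from this message so that it runs on its own, and uses only base R: the rank stored in the fitted model and alias() from the stats package.

```r
# Same simulated setup as in this message.
set.seed(1)
PBW <- rnorm(100)
SO4 <- rnorm(100)
NO3 <- rnorm(100)
NH4 <- NO3 * 1.5   # exact linear function of NO3

fit <- lm(PBW ~ SO4 + NO3 + NH4)

# The design matrix has more columns than its numerical rank.
X <- model.matrix(fit)
print(c(columns = ncol(X), rank = fit$rank))

# alias() lists the terms that are linear combinations of the others.
a <- alias(fit)
print(a$Complete)
```

Comparing the column count with the rank says how many terms are redundant; the alias() output says which ones, and in terms of which other columns.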

BTW, as an aside, picking variables for a model based upon their  
correlation with the DV is not a good way to go. You might want to  
pick up a copy of Frank's book "Regression Modeling Strategies":

   http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS

HTH,

Marc Schwartz
#
Thanks Dimitri! Following exactly what you did, I wrote all my individual variable vectors to a data frame and used lm(formula,data) and this time it works for me too. 

Marc, your theory is correct. The NH4 variable is strongly correlated with one of the IVs as well as with the DV.
         SO4      NO3      NH4      PBW
SO4        1  -0.0867    0.999    0.999
NO3  -0.0867        1  -0.0527  -0.0938
NH4    0.999  -0.0527        1    0.999
PBW    0.999  -0.0938    0.999        1
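
A pairwise correlation just below 1 does not by itself produce an NA coefficient; lm() drops a term only when it is (numerically) an exact linear combination of the others, and that combination can involve more than one variable. The sketch below uses simulated data with made-up weights (0.4 and 0.3 are assumptions for illustration, not values from the real data) to show an exact three-variable dependency whose pairwise correlations all stay below 1.

```r
# Simulated data; the weights below are invented for illustration.
set.seed(7)
SO4 <- rexp(200, rate = 1 / 5)    # large, variable term
NO3 <- rexp(200, rate = 2)        # small term
NH4 <- 0.4 * SO4 + 0.3 * NO3      # exact combination of both
PBW <- 0.02 * SO4 + 0.01 * NO3 + rnorm(200, sd = 0.001)

# SO4 and NH4 are highly correlated, but not perfectly.
print(round(cor(cbind(SO4, NO3, NH4)), 4))

# Whichever of the dependent terms comes last is reported as NA.
fit_a <- lm(PBW ~ SO4 + NO3 + NH4)
fit_b <- lm(PBW ~ SO4 + NH4 + NO3)
print(coef(fit_a))
print(coef(fit_b))
```

This is why the order of the terms appears to decide which variable "loses" its coefficient: the fitted values are the same either way, only the labelling of the redundant column changes.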


Aparna 

-----Original Message-----
From: Dimitri Liakhovitski [mailto:ld7631 at gmail.com] 
Sent: Tuesday, April 21, 2009 9:02 AM
To: Vemuri, Aparna
Cc: r-help at r-project.org; David Winsemius
Subject: Re: [R] Fitting linear models

#
But if the multicollinearity is so strong, then I am wondering why it
worked in the data frame as opposed to 4 separate vectors? It should
not make any difference...
Dimitri
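
The thread does not settle this question. One possibility worth checking, offered purely as an assumption, is that the values in the file were not identical to the workspace vectors, for example because they were rounded when written out. The sketch below uses simulated data (the weights 0.4 and 0.3 and the rounding to 3 decimals are invented for illustration) to show that an exact dependency in full-precision vectors can become merely a near-dependency after a round trip through a CSV file, at which point lm() reports a coefficient for every term.

```r
# Simulated data with an exact dependency; all numbers are made up.
set.seed(7)
SO4 <- rexp(200, rate = 1 / 5)
NO3 <- rexp(200, rate = 2)
NH4 <- 0.4 * SO4 + 0.3 * NO3
PBW <- 0.02 * SO4 + 0.01 * NO3 + rnorm(200, sd = 0.001)

# Fit on the full-precision workspace vectors: NH4 is dropped.
fit_vec <- lm(PBW ~ SO4 + NO3 + NH4)

# Round trip through a CSV file with values rounded to 3 decimals.
tmp <- tempfile(fileext = ".csv")
write.csv(round(data.frame(PBW, SO4, NO3, NH4), 3), tmp, row.names = FALSE)
dat <- read.csv(tmp)
fit_df <- lm(PBW ~ SO4 + NO3 + NH4, data = dat)

print(coef(fit_vec))
print(coef(fit_df))
```

Coefficients obtained this way are not trustworthy even though none is NA, since they rest on rounding noise; comparing the vectors with the data frame columns directly, for instance with all.equal(), would confirm or rule out this explanation.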
On Tue, Apr 21, 2009 at 12:21 PM, Vemuri, Aparna <avemuri at epri.com> wrote: