Request for functions to calculate correlated factors influencing an outcome.
This would be better posted on a statistical list like stats.stackexchange.com, as it is largely about statistical methodology, not R code. Once you have determined what kinds of methods you want, you might then post back here -- or better yet, just search! -- for packages that implement those methods in R.
Cheers,
Bert
Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374
"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
Clifford Stoll
On Mon, May 4, 2015 at 1:40 AM, Lalitha Viswanathan
<lalitha.viswanathan79 at gmail.com> wrote:
Hi
I used the MASS library (after reading the examples at http://www.statmethods.net/stats/regression.html):
library(MASS)
fit <- lm(Mileage~Disp+HP+Weight+Reliability,data=newx)
step <- stepAIC(fit, direction="both")
step$anova # display results
It showed the most relevant variables affecting Mileage.
While that is a start, I am looking for a model that fits the entire data (including Mileage), not factors that influence Mileage: multi-model inference / selection. I was reading about glmulti. Are there any other packages I could look at for inferring models that best fit the data?
To use nlm / nls, I need a formula as one of the parameters to best fit the data, and I am looking for functions that will help infer that formula from the data.
Thanks
lalitha
On Sun, May 3, 2015 at 11:33 PM, Prashant Sethi <theseth.prashant at gmail.com> wrote:
Hi,
I'm not an expert in data analysis (a beginner still learning the tricks of the trade), but I believe that in your case, since you're trying to determine the correlation of a dependent variable with a number of factor variables, you should try a regression analysis of your model. The function you'll use for that is lm(). You can use forward selection or backward elimination to build a model that includes the most critical factors. Maybe you can give it a try.
Thanks and regards,
Prashant Sethi
On 3 May 2015 23:18, "Lalitha Viswanathan" <lalitha.viswanathan79 at gmail.com> wrote:
Hi
I am sorry, I saved the file after removing the dot after Disp (I was going
wrong on a read.delim call, which threw an error about !header, etc.). The
dot was not the culprit, but I continued to leave it out.
Let me paste the full code here.
x<-read.table("/Users/Documents/StatsTest/fuelEfficiency.txt",
header=TRUE,
sep="\t")
x<-data.frame(x)
for (i in unique(x$Country)) { print (i); y <- subset(x, x$Country == i);
print(y); }
newx <- subset (x, select = c(Price, Reliability, Mileage, Weight, Disp,
HP))
cor(newx, method="pearson")
my.cor <-cor.test(newx$Weight, newx$Price, method="spearman")
my.cor <-cor.test(newx$Weight, newx$HP, method="spearman")
my.cor <-cor.test(newx$Disp, newx$HP, method="spearman")
Putting exact=NULL still doesn't remove the warning
my.cor <-cor.test(newx$Disp, newx$HP, method="kendall", exact=NULL)
I tried to find the correlation coefficient for various combinations of
variables, but am unable to interpret the results. (Results are pasted below,
in an earlier post.)
Followed it up with a normality test
shapiro.test(newx$Disp)
shapiro.test(newx$HP)
Then decided to do a kruskal.test(newx)
with the result
Kruskal-Wallis chi-squared = 328.94, df = 5, p-value < 2.2e-16
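A note on the call above: when kruskal.test() is given a data frame, it treats each column as a separate sample. That is why the result has df = 5 for six columns. It compares Price, Reliability, Mileage and the other columns with one another rather than testing what influences Mileage. The sketch below shows the formula interface (response ~ group) on the built-in airquality data, used here as a stand-in because the poster's file is not available. With the poster's data, the analogous call would presumably be kruskal.test(Mileage ~ Country, data = x), assuming Country is the grouping of interest.

```r
# Stand-in data: airquality ships with R; Month plays the role of the grouping
# variable and Ozone the role of the response (both are illustrative choices).
kw <- kruskal.test(Ozone ~ Month, data = airquality)

# Degrees of freedom equal the number of groups minus one (5 months -> 4).
print(kw$parameter)

# A small p-value suggests the response distribution differs between groups.
print(kw$p.value)
```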
Question is: I am trying to find factors influencing efficiency (in this
case mileage).
What range of functions / examples should I be looking at to find
a factor or combination of factors influencing efficiency?
Any pointers will be helpful
Thanks
Lalitha
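A minimal sketch of the regression route suggested elsewhere in this thread, using R's built-in mtcars data as a stand-in because fuelEfficiency.txt is not available. The column mapping (mpg for Mileage, disp for Disp, hp for HP, wt for Weight) is an assumption made purely for illustration. step() from the stats package performs AIC-based stepwise selection, similar in spirit to MASS::stepAIC.

```r
# Stand-in data: mtcars ships with R; its columns only approximate the poster's.
cars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Full linear model: mileage explained by displacement, horsepower, and weight.
full_fit <- lm(mpg ~ disp + hp + wt, data = cars)

# AIC-based stepwise selection in both directions (trace = 0 silences the log).
selected <- step(full_fit, direction = "both", trace = 0)

print(formula(selected))               # predictors retained by the search
print(summary(selected)$coefficients)  # estimates, standard errors, p-values
```

When predictors are strongly correlated with one another, the set retained by a stepwise search can be sensitive to small changes in the data, so the selected formula is best treated as a starting point rather than a final answer.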
On Sun, May 3, 2015 at 2:49 PM, Lalitha Viswanathan <lalitha.viswanathan79 at gmail.com> wrote:
Hi
I have a dataset of the type attached.
Here's my code thus far.
dataset <-data.frame(read.delim("data", sep="\t", header=TRUE));
newData<-subset(dataset, select = c(Price, Reliability, Mileage, Weight,
Disp, HP));
cor(newData, method="pearson");
Results are
                 Price Reliability    Mileage     Weight       Disp         HP
Price        1.0000000          NA -0.6537541  0.7017999  0.4856769  0.6536433
Reliability         NA           1         NA         NA         NA         NA
Mileage     -0.6537541          NA  1.0000000 -0.8478541 -0.6931928 -0.6667146
Weight       0.7017999          NA -0.8478541  1.0000000  0.8032804  0.7629322
Disp         0.4856769          NA -0.6931928  0.8032804  1.0000000  0.8181881
HP           0.6536433          NA -0.6667146  0.7629322  0.8181881  1.0000000
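The NA entries for Reliability most likely mean that the Reliability column contains missing values, since cor() by default returns NA for any pair involving an NA. A minimal sketch of the use argument follows, with the built-in airquality data standing in for the poster's file (an assumption; the real data is not shown in the thread).

```r
# Stand-in data: airquality ships with R and has NAs in Ozone and Solar.R.
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Default (use = "everything"): any pair involving a column with NAs is NA.
default_cor <- cor(aq, method = "pearson")

# Drop every row that has a missing value in any column, then correlate.
complete_cor <- cor(aq, method = "pearson", use = "complete.obs")

# Alternatively, drop rows pair by pair, so each pair uses all rows it can.
pairwise_cor <- cor(aq, method = "pearson", use = "pairwise.complete.obs")

print(colSums(is.na(aq)))      # how many values are missing per column
print(round(complete_cor, 3))  # no NA entries remain
```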
It appears that Weight and Price, Weight and Disp, Weight and HP, Disp and HP,
and HP and Price are strongly correlated.
To find the statistical significance, I am trying
sample.correln<-cor.test(newData$Disp, newData$HP, method="kendall", exact=NULL)
Kendall's rank correlation tau
data: newx$Disp and newx$HP
z = 7.2192, p-value = 5.229e-13
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.6563871
If I try the same with
sample.correln<-cor.test(newData$Disp, newData$HP, method="spearman", exact=NULL)
I get
Warning message:
In cor.test.default(newx$Disp, newx$HP, method = "spearman", exact = NULL) :
  Cannot compute exact p-value with ties
sample.correln
Spearman's rank correlation rho
data: newx$Disp and newx$HP
S = 5716.8, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8411566
I am not sure how to interpret these values.
Basically, I am trying to figure out which combination of factors
influences efficiency.
Thanks
Lalitha
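A minimal sketch of how cor.test() results like those above can be read, again on the built-in mtcars data as a stand-in (an assumption; the poster's data is not available). With exact = NULL, R attempts an exact p-value for sufficiently small samples and warns when ties make that impossible. Passing exact = FALSE requests the asymptotic approximation directly, which avoids the warning.

```r
# Stand-in data: mtcars ships with R; disp and hp both contain tied values.
res <- cor.test(mtcars$disp, mtcars$hp, method = "spearman", exact = FALSE)

rho <- unname(res$estimate)  # rank correlation, between -1 and 1
p   <- res$p.value           # p-value for H0: true rho is equal to 0

cat(sprintf("rho = %.3f, p-value = %.3g\n", rho, p))

# A rho near +1 means hp tends to rise as disp rises; a small p-value means
# an association this strong would be unlikely if the true rho were 0.
if (p < 0.05) {
  cat("Reject H0 at the 5% level: disp and hp are monotonically associated.\n")
} else {
  cat("No evidence against H0 at the 5% level.\n")
}
```

The estimate measures the strength and direction of the association, while the p-value only says whether an association of that size is distinguishable from zero. Neither one shows which variable drives mileage, which is the job of a model such as lm().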
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.