Dear All,
I have been struggling to understand why for the housing data in MASS
library R and stata give coef. estimates that are really different. I also
tried to come up with many many examples myself (see below, of course I
did not have the set.seed command included) and all of my
`random' examples seem to give verry similar output. For the housing data,
I have changed the data into numeric vectors instead of factors/ordered
factors. I did so to try and get the same results as in STATA and to have
the housing example as close as possible to the one I constructed.
I run a debian sid, kernel 2.4, R 2.0.0, and STATA version 8.2, MASS
version 7.2-8.
here's the example ( I assume that you have STATA installed and can run in
batch mode, if not the output is also given below)
library(MASS)
library(foreign)
set.seed(100)
X <- rnorm(1000)
X1 <- rnorm(1000)
X2 <- rnorm(1000)
X <- X +X1+X2
XX <- X<=quantile(X, .25)
XX[X>quantile(X, .25) & X<=quantile(X, .50)] <- 2
XX[X>quantile(X, .5) & X<=quantile(X, .75)] <- 3
XX[X>quantile(X, .75)] <- 4
temp <- data.frame(XX=XX, X1=X1, X2=X2, X=X)
summary(polr(factor(XX)~X1 +X2, data=temp, method="probit"))
write.dta(temp, "temp.dta")
####################################
#Stata stuff
####################################
cat("use temp.dta\n oprobit XX X1 X2\n", file="temp.ado")
system("stata -b do temp.ado&")
system("cat temp.log")
#
##### here's R's output
#############################
Re-fitting to get Hessian
Call:
polr(formula = factor(XX) ~ X1 + X2, data = temp, method = "probit")
Coefficients:
Value Std. Error t value
X1 0.9891735 0.04749225 20.82811
X2 0.9400804 0.04527653 20.76309
Intercepts:
Value Std. Error t value
1|2 -1.1411 0.0572 -19.9613
2|3 -0.0372 0.0486 -0.7656
3|4 1.1101 0.0579 19.1865
Residual Deviance: 1969.734
AIC: 1979.734
##############################################
#and here Stata's output
##############################################
Ordered probit estimates Number of obs =
1000
LR chi2(2) =
802.86
Prob > chi2 =
0.0000
Log likelihood = -984.86675 Pseudo R2 =
0.2896
------------------------------------------------------------------------------
XX | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
X1 | .9891651 .0474922 20.83 0.000 .8960822 1.082248
X2 | .94007 .0452765 20.76 0.000 .8513298 1.02881
-------------+----------------------------------------------------------------
_cut1 | -1.141119 .0571667 (Ancillary parameters)
_cut2 | -.0371779 .0485592
_cut3 | 1.110117 .0578593
polr probit versus stata oprobit
7 messages · Thomas Lumley, Jean Eid
On Wed, 10 Nov 2004, Jean Eid wrote:
Dear All, I have been struggling to understand why for the housing data in MASS library R and stata give coef. estimates that are really different. I also tried to come up with many many examples myself (see below, of course I did not have the set.seed command included) and all of my `random' examples seem to give verry similar output. For the housing data, I have changed the data into numeric vectors instead of factors/ordered factors. I did so to try and get the same results as in STATA and to have the housing example as close as possible to the one I constructed. I run a debian sid, kernel 2.4, R 2.0.0, and STATA version 8.2, MASS version 7.2-8. here's the example ( I assume that you have STATA installed and can run in batch mode, if not the output is also given below)
That example shows the same results with Stata and polr() from MASS.
For the housing data, I also get the same coefficients in Stata as with
polr():
In R:
library(MASS)
library(foreign)
write.dta(housing, file="housing.dta")
house.probit<-polr(Sat ~ Infl + Type + Cont, data = housing, weights =
Freq, method = "probit")
summary(house.probit)
-------------------------
Re-fitting to get Hessian
Call:
polr(formula = Sat ~ Infl + Type + Cont, data = housing, weights = Freq,
method = "probit")
Coefficients:
Value Std. Error t value
InflMedium 0.3464233 0.06413706 5.401297
InflHigh 0.7829149 0.07642620 10.244063
TypeApartment -0.3475372 0.07229093 -4.807480
TypeAtrium -0.2178874 0.09476607 -2.299213
TypeTerrace -0.6641737 0.09180004 -7.235005
ContHigh 0.2223862 0.05812267 3.826153
Intercepts:
Value Std. Error t value
Low|Medium -0.2998 0.0762 -3.9371
Medium|High 0.4267 0.0764 5.5850
Residual Deviance: 3479.689
AIC: 3495.689
------------------------
In Stata
-----------------
. use housing.dta
. xi: oprobit Sat i.Infl i.Type i.Cont [fw=Freq]
i.Infl _IInfl_1-3 (naturally coded; _IInfl_1 omitted)
i.Type _IType_1-4 (naturally coded; _IType_1 omitted)
i.Cont _ICont_1-2 (naturally coded; _ICont_1 omitted)
Iteration 0: log likelihood = -1824.4388
Iteration 1: log likelihood = -1739.9254
Iteration 2: log likelihood = -1739.8444
Ordered probit estimates Number of obs = 1681
LR chi2(6) = 169.19
Prob > chi2 = 0.0000
Log likelihood = -1739.8444 Pseudo R2 = 0.0464
------------------------------------------------------------------------------
Sat | Coef. Std. Err. z P>|z| [95% Conf.
Interval]
-------------+----------------------------------------------------------------
_IInfl_2 | .3464228 .064137 5.40 0.000 .2207165 .472129
_IInfl_3 | .7829146 .076426 10.24 0.000 .6331224 .9327069
_IType_2 | -.3475367 .0722908 -4.81 0.000 -.4892241 -.2058493
_IType_3 | -.2178875 .094766 -2.30 0.021 -.4036254 -.0321497
_IType_4 | -.6641735 .0917999 -7.24 0.000 -.844098 -.484249
_ICont_2 | .2223858 .0581226 3.83 0.000 .1084676 .336304
-------------+----------------------------------------------------------------
_cut1 | -.2998279 .0761537 (Ancillary parameters)
_cut2 | .4267208 .0764043
------------------------------------------------------------------------------
-thomas
Dear Thomas, Where you also able to replicate the second example? (the exaample that I turned the housing data into numerical variables) That is the one that my estimates differ. Jean,
On Wed, 10 Nov 2004, Thomas Lumley wrote:
On Wed, 10 Nov 2004, Jean Eid wrote:
Dear All, I have been struggling to understand why for the housing data in MASS library R and stata give coef. estimates that are really different. I also tried to come up with many many examples myself (see below, of course I did not have the set.seed command included) and all of my `random' examples seem to give verry similar output. For the housing data, I have changed the data into numeric vectors instead of factors/ordered factors. I did so to try and get the same results as in STATA and to have the housing example as close as possible to the one I constructed. I run a debian sid, kernel 2.4, R 2.0.0, and STATA version 8.2, MASS version 7.2-8. here's the example ( I assume that you have STATA installed and can run in batch mode, if not the output is also given below)
That example shows the same results with Stata and polr() from MASS.
For the housing data, I also get the same coefficients in Stata as with
polr():
In R:
library(MASS)
library(foreign)
write.dta(housing, file="housing.dta")
house.probit<-polr(Sat ~ Infl + Type + Cont, data = housing, weights =
Freq, method = "probit")
summary(house.probit)
-------------------------
Re-fitting to get Hessian
Call:
polr(formula = Sat ~ Infl + Type + Cont, data = housing, weights = Freq,
method = "probit")
Coefficients:
Value Std. Error t value
InflMedium 0.3464233 0.06413706 5.401297
InflHigh 0.7829149 0.07642620 10.244063
TypeApartment -0.3475372 0.07229093 -4.807480
TypeAtrium -0.2178874 0.09476607 -2.299213
TypeTerrace -0.6641737 0.09180004 -7.235005
ContHigh 0.2223862 0.05812267 3.826153
Intercepts:
Value Std. Error t value
Low|Medium -0.2998 0.0762 -3.9371
Medium|High 0.4267 0.0764 5.5850
Residual Deviance: 3479.689
AIC: 3495.689
------------------------
In Stata
-----------------
. use housing.dta
. xi: oprobit Sat i.Infl i.Type i.Cont [fw=Freq]
i.Infl _IInfl_1-3 (naturally coded; _IInfl_1 omitted)
i.Type _IType_1-4 (naturally coded; _IType_1 omitted)
i.Cont _ICont_1-2 (naturally coded; _ICont_1 omitted)
Iteration 0: log likelihood = -1824.4388
Iteration 1: log likelihood = -1739.9254
Iteration 2: log likelihood = -1739.8444
Ordered probit estimates Number of obs = 1681
LR chi2(6) = 169.19
Prob > chi2 = 0.0000
Log likelihood = -1739.8444 Pseudo R2 = 0.0464
------------------------------------------------------------------------------
Sat | Coef. Std. Err. z P>|z| [95% Conf.
Interval]
-------------+----------------------------------------------------------------
_IInfl_2 | .3464228 .064137 5.40 0.000 .2207165 .472129
_IInfl_3 | .7829146 .076426 10.24 0.000 .6331224 .9327069
_IType_2 | -.3475367 .0722908 -4.81 0.000 -.4892241 -.2058493
_IType_3 | -.2178875 .094766 -2.30 0.021 -.4036254 -.0321497
_IType_4 | -.6641735 .0917999 -7.24 0.000 -.844098 -.484249
_ICont_2 | .2223858 .0581226 3.83 0.000 .1084676 .336304
-------------+----------------------------------------------------------------
_cut1 | -.2998279 .0761537 (Ancillary parameters)
_cut2 | .4267208 .0764043
------------------------------------------------------------------------------
-thomas
On Wed, 10 Nov 2004, Jean Eid wrote:
Dear Thomas, Where you also able to replicate the second example? (the exaample that I turned the housing data into numerical variables) That is the one that my estimates differ.
I don't have your second example, but I get the same results from
polr(formula = Sat ~ as.numeric(Infl) + as.numeric(Type) +
as.numeric(Cont), data = housing, weights = Freq, method = "probit")
and
oprobit Sat Infl Type Cont [fw=Freq]
for example.
-thomas
On Wed, 10 Nov 2004, Thomas Lumley wrote:
On Wed, 10 Nov 2004, Jean Eid wrote:
Dear All, I have been struggling to understand why for the housing data in MASS library R and stata give coef. estimates that are really different. I also tried to come up with many many examples myself (see below, of course I did not have the set.seed command included) and all of my `random' examples seem to give verry similar output. For the housing data, I have changed the data into numeric vectors instead of factors/ordered factors. I did so to try and get the same results as in STATA and to have the housing example as close as possible to the one I constructed. I run a debian sid, kernel 2.4, R 2.0.0, and STATA version 8.2, MASS version 7.2-8. here's the example ( I assume that you have STATA installed and can run in batch mode, if not the output is also given below)
That example shows the same results with Stata and polr() from MASS.
For the housing data, I also get the same coefficients in Stata as with
polr():
In R:
library(MASS)
library(foreign)
write.dta(housing, file="housing.dta")
house.probit<-polr(Sat ~ Infl + Type + Cont, data = housing, weights =
Freq, method = "probit")
summary(house.probit)
-------------------------
Re-fitting to get Hessian
Call:
polr(formula = Sat ~ Infl + Type + Cont, data = housing, weights = Freq,
method = "probit")
Coefficients:
Value Std. Error t value
InflMedium 0.3464233 0.06413706 5.401297
InflHigh 0.7829149 0.07642620 10.244063
TypeApartment -0.3475372 0.07229093 -4.807480
TypeAtrium -0.2178874 0.09476607 -2.299213
TypeTerrace -0.6641737 0.09180004 -7.235005
ContHigh 0.2223862 0.05812267 3.826153
Intercepts:
Value Std. Error t value
Low|Medium -0.2998 0.0762 -3.9371
Medium|High 0.4267 0.0764 5.5850
Residual Deviance: 3479.689
AIC: 3495.689
------------------------
In Stata
-----------------
. use housing.dta
. xi: oprobit Sat i.Infl i.Type i.Cont [fw=Freq]
i.Infl _IInfl_1-3 (naturally coded; _IInfl_1 omitted)
i.Type _IType_1-4 (naturally coded; _IType_1 omitted)
i.Cont _ICont_1-2 (naturally coded; _ICont_1 omitted)
Iteration 0: log likelihood = -1824.4388
Iteration 1: log likelihood = -1739.9254
Iteration 2: log likelihood = -1739.8444
Ordered probit estimates Number of obs = 1681
LR chi2(6) = 169.19
Prob > chi2 = 0.0000
Log likelihood = -1739.8444 Pseudo R2 = 0.0464
------------------------------------------------------------------------------
Sat | Coef. Std. Err. z P>|z| [95% Conf.
Interval]
-------------+----------------------------------------------------------------
_IInfl_2 | .3464228 .064137 5.40 0.000 .2207165 .472129
_IInfl_3 | .7829146 .076426 10.24 0.000 .6331224 .9327069
_IType_2 | -.3475367 .0722908 -4.81 0.000 -.4892241 -.2058493
_IType_3 | -.2178875 .094766 -2.30 0.021 -.4036254 -.0321497
_IType_4 | -.6641735 .0917999 -7.24 0.000 -.844098 -.484249
_ICont_2 | .2223858 .0581226 3.83 0.000 .1084676 .336304
-------------+----------------------------------------------------------------
_cut1 | -.2998279 .0761537 (Ancillary parameters)
_cut2 | .4267208 .0764043
------------------------------------------------------------------------------
-thomas
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Thomas Lumley Assoc. Professor, Biostatistics tlumley at u.washington.edu University of Washington, Seattle
Thank you Thomas for your answer. It was the weights that are giving me problems and I still have no idea why. i.e. when I try your example, everything work fine. However when I do not include the weights=Freq and [fw=Freq] in both softwares, I do get verry different results. Jean,
On Thu, 11 Nov 2004, Thomas Lumley wrote:
On Wed, 10 Nov 2004, Jean Eid wrote:
Dear Thomas, Where you also able to replicate the second example? (the exaample that I turned the housing data into numerical variables) That is the one that my estimates differ.
I don't have your second example, but I get the same results from
polr(formula = Sat ~ as.numeric(Infl) + as.numeric(Type) +
as.numeric(Cont), data = housing, weights = Freq, method = "probit")
and
oprobit Sat Infl Type Cont [fw=Freq]
for example.
-thomas
On Wed, 10 Nov 2004, Thomas Lumley wrote:
On Wed, 10 Nov 2004, Jean Eid wrote:
Dear All, I have been struggling to understand why for the housing data in MASS library R and stata give coef. estimates that are really different. I also tried to come up with many many examples myself (see below, of course I did not have the set.seed command included) and all of my `random' examples seem to give verry similar output. For the housing data, I have changed the data into numeric vectors instead of factors/ordered factors. I did so to try and get the same results as in STATA and to have the housing example as close as possible to the one I constructed. I run a debian sid, kernel 2.4, R 2.0.0, and STATA version 8.2, MASS version 7.2-8. here's the example ( I assume that you have STATA installed and can run in batch mode, if not the output is also given below)
That example shows the same results with Stata and polr() from MASS.
For the housing data, I also get the same coefficients in Stata as with
polr():
In R:
library(MASS)
library(foreign)
write.dta(housing, file="housing.dta")
house.probit<-polr(Sat ~ Infl + Type + Cont, data = housing, weights =
Freq, method = "probit")
summary(house.probit)
-------------------------
Re-fitting to get Hessian
Call:
polr(formula = Sat ~ Infl + Type + Cont, data = housing, weights = Freq,
method = "probit")
Coefficients:
Value Std. Error t value
InflMedium 0.3464233 0.06413706 5.401297
InflHigh 0.7829149 0.07642620 10.244063
TypeApartment -0.3475372 0.07229093 -4.807480
TypeAtrium -0.2178874 0.09476607 -2.299213
TypeTerrace -0.6641737 0.09180004 -7.235005
ContHigh 0.2223862 0.05812267 3.826153
Intercepts:
Value Std. Error t value
Low|Medium -0.2998 0.0762 -3.9371
Medium|High 0.4267 0.0764 5.5850
Residual Deviance: 3479.689
AIC: 3495.689
------------------------
In Stata
-----------------
. use housing.dta
. xi: oprobit Sat i.Infl i.Type i.Cont [fw=Freq]
i.Infl _IInfl_1-3 (naturally coded; _IInfl_1 omitted)
i.Type _IType_1-4 (naturally coded; _IType_1 omitted)
i.Cont _ICont_1-2 (naturally coded; _ICont_1 omitted)
Iteration 0: log likelihood = -1824.4388
Iteration 1: log likelihood = -1739.9254
Iteration 2: log likelihood = -1739.8444
Ordered probit estimates Number of obs = 1681
LR chi2(6) = 169.19
Prob > chi2 = 0.0000
Log likelihood = -1739.8444 Pseudo R2 = 0.0464
------------------------------------------------------------------------------
Sat | Coef. Std. Err. z P>|z| [95% Conf.
Interval]
-------------+----------------------------------------------------------------
_IInfl_2 | .3464228 .064137 5.40 0.000 .2207165 .472129
_IInfl_3 | .7829146 .076426 10.24 0.000 .6331224 .9327069
_IType_2 | -.3475367 .0722908 -4.81 0.000 -.4892241 -.2058493
_IType_3 | -.2178875 .094766 -2.30 0.021 -.4036254 -.0321497
_IType_4 | -.6641735 .0917999 -7.24 0.000 -.844098 -.484249
_ICont_2 | .2223858 .0581226 3.83 0.000 .1084676 .336304
-------------+----------------------------------------------------------------
_cut1 | -.2998279 .0761537 (Ancillary parameters)
_cut2 | .4267208 .0764043
------------------------------------------------------------------------------
-thomas
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Thomas Lumley Assoc. Professor, Biostatistics tlumley at u.washington.edu University of Washington, Seattle
On Thu, 11 Nov 2004, Jean Eid wrote:
Thank you Thomas for your answer. It was the weights that are giving me problems and I still have no idea why. i.e. when I try your example, everything work fine. However when I do not include the weights=Freq and [fw=Freq] in both softwares, I do get verry different results.
I still don't understand what example you are using to find the
difference. I tried two ways of not using weights
1) Expand the data to have a record for each observation (so 1681 rows
instead of 72).
Fitting these expanded data without weights gives the same answers as
fitting the compressed data with weights, in both MASS::polr and Stata's
oprobit.
2) Pretend that the housing data have only 72 observations and ignore the
weights (though why you would do this...)
The true coefficients are all zero in this situation. R gives numbers
zero to about 6 digits and Stata gives zero to about 30 digits. The
intercepts are the same in both packages.
-thomas
Now I understand,
R gives numbers zero to about 6 digits and Stata gives zero to about 30 digits. The intercepts are the same in both packages.
Thank you, Jean,
On Thu, 11 Nov 2004, Thomas Lumley wrote:
On Thu, 11 Nov 2004, Jean Eid wrote:
Thank you Thomas for your answer. It was the weights that are giving me problems and I still have no idea why. i.e. when I try your example, everything work fine. However when I do not include the weights=Freq and [fw=Freq] in both softwares, I do get verry different results.
I still don't understand what example you are using to find the
difference. I tried two ways of not using weights
1) Expand the data to have a record for each observation (so 1681 rows
instead of 72).
Fitting these expanded data without weights gives the same answers as
fitting the compressed data with weights, in both MASS::polr and Stata's
oprobit.
2) Pretend that the housing data have only 72 observations and ignore the
weights (though why you would do this...)
The true coefficients are all zero in this situation. R gives numbers
zero to about 6 digits and Stata gives zero to about 30 digits. The
intercepts are the same in both packages.
-thomas