Skip to content

polr probit versus stata oprobit

7 messages · Thomas Lumley, Jean Eid

#
Dear All,
I have been struggling to understand why for the housing data in MASS
library R and stata give coef. estimates that are really different. I also
tried to come up with many many examples myself (see below, of course I
did not have the set.seed command included) and all of my
`random' examples seem to give verry similar output. For the housing data,
I have changed the data into numeric vectors instead of factors/ordered
factors. I did so to try and get the same results as in STATA and to have
the housing example as close as possible to the one I constructed.

I run a debian sid, kernel 2.4, R 2.0.0, and STATA version 8.2, MASS
version  7.2-8.


here's the example ( I assume that you have STATA installed and can run in
batch mode, if not the output is also given below)

library(MASS)
library(foreign)
set.seed(100)
X <- rnorm(1000)
X1 <- rnorm(1000)
X2 <- rnorm(1000)
X <- X +X1+X2
XX <- X<=quantile(X, .25)
XX[X>quantile(X, .25) & X<=quantile(X, .50)] <- 2
XX[X>quantile(X, .5) & X<=quantile(X, .75)] <- 3
XX[X>quantile(X, .75)] <- 4
temp <- data.frame(XX=XX, X1=X1, X2=X2, X=X)
summary(polr(factor(XX)~X1 +X2, data=temp, method="probit"))
write.dta(temp, "temp.dta")
####################################
#Stata stuff
####################################
cat("use temp.dta\n oprobit XX X1 X2\n", file="temp.ado")
system("stata -b do temp.ado&")
system("cat temp.log")


#
##### here's R's output
#############################
Re-fitting to get Hessian

Call:
polr(formula = factor(XX) ~ X1 + X2, data = temp, method = "probit")

Coefficients:
       Value Std. Error  t value
X1 0.9891735 0.04749225 20.82811
X2 0.9400804 0.04527653 20.76309

Intercepts:
    Value    Std. Error t value
1|2  -1.1411   0.0572   -19.9613
2|3  -0.0372   0.0486    -0.7656
3|4   1.1101   0.0579    19.1865

Residual Deviance: 1969.734
AIC: 1979.734


##############################################
#and here  Stata's output
##############################################

Ordered probit estimates                          Number of obs   =
1000
                                                  LR chi2(2)      =
802.86
                                                  Prob > chi2     =
0.0000
Log likelihood = -984.86675                       Pseudo R2       =
0.2896

------------------------------------------------------------------------------
          XX |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          X1 |   .9891651   .0474922    20.83   0.000     .8960822    1.082248
          X2 |     .94007   .0452765    20.76   0.000     .8513298     1.02881
-------------+----------------------------------------------------------------
       _cut1 |  -1.141119   .0571667          (Ancillary parameters)
       _cut2 |  -.0371779   .0485592
       _cut3 |   1.110117   .0578593
#
On Wed, 10 Nov 2004, Jean Eid wrote:

            
That example shows the same results with Stata and polr() from MASS.

For the housing data, I also get the same coefficients in Stata as with 
polr():

In R:
library(MASS)
library(foreign)
write.dta(housing, file="housing.dta")
house.probit<-polr(Sat ~ Infl + Type + Cont, data = housing, weights = 
Freq, method = "probit")
summary(house.probit)
-------------------------
Re-fitting to get Hessian

Call:
polr(formula = Sat ~ Infl + Type + Cont, data = housing, weights = Freq,
     method = "probit")

Coefficients:
                    Value Std. Error   t value
InflMedium     0.3464233 0.06413706  5.401297
InflHigh       0.7829149 0.07642620 10.244063
TypeApartment -0.3475372 0.07229093 -4.807480
TypeAtrium    -0.2178874 0.09476607 -2.299213
TypeTerrace   -0.6641737 0.09180004 -7.235005
ContHigh       0.2223862 0.05812267  3.826153

Intercepts:
             Value   Std. Error t value
Low|Medium  -0.2998  0.0762    -3.9371
Medium|High  0.4267  0.0764     5.5850

Residual Deviance: 3479.689
AIC: 3495.689
------------------------


In Stata
-----------------
. use housing.dta
. xi: oprobit Sat i.Infl i.Type i.Cont [fw=Freq]
i.Infl            _IInfl_1-3          (naturally coded; _IInfl_1 omitted)
i.Type            _IType_1-4          (naturally coded; _IType_1 omitted)
i.Cont            _ICont_1-2          (naturally coded; _ICont_1 omitted)

Iteration 0:   log likelihood = -1824.4388
Iteration 1:   log likelihood = -1739.9254
Iteration 2:   log likelihood = -1739.8444

Ordered probit estimates                          Number of obs   =    1681
                                                   LR chi2(6)      =   169.19
                                                   Prob > chi2     =   0.0000
Log likelihood = -1739.8444                       Pseudo R2       =   0.0464

------------------------------------------------------------------------------
          Sat |      Coef.   Std. Err.      z    P>|z|     [95% Conf. 
Interval]
-------------+----------------------------------------------------------------
     _IInfl_2 |   .3464228    .064137     5.40   0.000     .2207165     .472129
     _IInfl_3 |   .7829146    .076426    10.24   0.000     .6331224    .9327069
     _IType_2 |  -.3475367   .0722908    -4.81   0.000    -.4892241   -.2058493
     _IType_3 |  -.2178875    .094766    -2.30   0.021    -.4036254   -.0321497
     _IType_4 |  -.6641735   .0917999    -7.24   0.000     -.844098    -.484249
     _ICont_2 |   .2223858   .0581226     3.83   0.000     .1084676     .336304
-------------+----------------------------------------------------------------
        _cut1 |  -.2998279   .0761537          (Ancillary parameters)
        _cut2 |   .4267208   .0764043
------------------------------------------------------------------------------



 	-thomas
#
Dear Thomas,

Where you also able to replicate the second example?  (the exaample
that I turned the housing data into numerical variables) That is the one
that my estimates differ.

Jean,
On Wed, 10 Nov 2004, Thomas Lumley wrote:

            
#
On Wed, 10 Nov 2004, Jean Eid wrote:

            
I don't have your second example, but I get the same results from
    polr(formula = Sat ~ as.numeric(Infl) + as.numeric(Type) +
      as.numeric(Cont), data = housing, weights = Freq, method = "probit")
and
    oprobit Sat Infl Type Cont [fw=Freq]
for example.

 	-thomas
Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
Thank you Thomas for your answer. It was the weights that are giving me
problems and I still have no idea why. i.e. when I try your example,
everything work fine. However when I do not include the weights=Freq and
[fw=Freq] in both softwares, I do get verry different results.


Jean,
On Thu, 11 Nov 2004, Thomas Lumley wrote:

            
#
On Thu, 11 Nov 2004, Jean Eid wrote:

            
I still don't understand what example you are using to find the 
difference.  I tried two ways of not using weights

1)  Expand the data to have a record for each observation (so 1681 rows 
instead of 72).
     Fitting these expanded data  without weights gives the same answers as 
fitting the compressed data with weights, in both MASS::polr and Stata's 
oprobit.


2) Pretend that the housing data have only 72 observations and ignore the 
weights (though why you would do this...)
    The true coefficients are all zero in this situation. R gives numbers 
zero to about 6 digits and Stata gives zero to about 30 digits.  The 
intercepts are the same in both packages.


 	-thomas
#
Now I understand,
Thank you,


Jean,
On Thu, 11 Nov 2004, Thomas Lumley wrote: