
maximum likelihood convergence reproducing Anderson Blundell 1982 Econometrica R vs Stata

20 messages · Ravi Varadhan, Alex Olssen, Martyn Byng +4 more

#
Dear R-help,

I am trying to reproduce some results presented in a paper by Anderson
and Blundell in 1982 in Econometrica using R.
The estimation I want to reproduce concerns maximum likelihood
estimation of a singular equation system.
I can estimate the static model successfully in Stata but for the
dynamic models I have difficulty getting convergence.
My R program, which uses the same likelihood function as in Stata, has
convergence problems even for the static case.

I have copied my R program and the data below.  I realise the code
could be made more elegant - but it is short enough.

Any ideas would be highly appreciated.

## model 18
lnl <- function(theta,y1, y2, x1, x2, x3) {
  n <- length(y1)
  beta <- theta[1:8]
  e1 <- y1 - theta[1] - theta[2]*x1 - theta[3]*x2 - theta[4]*x3
  e2 <- y2 - theta[5] - theta[6]*x1 - theta[7]*x2 - theta[8]*x3
  e <- cbind(e1, e2)
  sigma <- t(e)%*%e
  logl <- -1*n/2*(2*(1+log(2*pi)) + log(det(sigma)))
  return(-logl)
}
p <- optim(0*c(1:8), lnl, method="BFGS", hessian=TRUE, y1=y1, y2=y2,
x1=x1, x2=x2, x3=x3)

"year","y1","y2","x1","x2","x3"
1929,0.554779,0.266051,9.87415,8.60371,3.75673
1930,0.516336,0.297473,9.68621,8.50492,3.80692
1931,0.508201,0.324199,9.4701,8.27596,3.80437
1932,0.500482,0.33958,9.24692,7.99221,3.76251
1933,0.501695,0.276974,9.35356,7.98968,3.69071
1934,0.591426,0.287008,9.42084,8.0362,3.63564
1935,0.565047,0.244096,9.53972,8.15803,3.59285
1936,0.605954,0.239187,9.6914,8.32009,3.56678
1937,0.620161,0.218232,9.76817,8.42001,3.57381
1938,0.592091,0.243161,9.51295,8.19771,3.6024
1939,0.613115,0.217042,9.68047,8.30987,3.58147
1940,0.632455,0.215269,9.78417,8.49624,3.57744
1941,0.663139,0.184409,10.0606,8.69868,3.6095
1942,0.698179,0.164348,10.2892,8.84523,3.66664
1943,0.70459,0.146865,10.4731,8.93024,3.65388
1944,0.694067,0.161722,10.4465,8.96044,3.62434
1945,0.674668,0.197231,10.279,8.82522,3.61489
1946,0.635916,0.204232,10.1536,8.77547,3.67562
1947,0.642855,0.187224,10.2053,8.77481,3.82632
1948,0.641063,0.186566,10.2227,8.83821,3.96038
1949,0.646317,0.203646,10.1127,8.82364,4.0447
1950,0.645476,0.187497,10.2067,8.84161,4.08128
1951,0.63803,0.197361,10.2773,8.9401,4.10951
1952,0.634626,0.209992,10.283,9.01603,4.1693
1953,0.631144,0.219287,10.3217,9.06317,4.21727
1954,0.593088,0.235335,10.2101,9.05664,4.2567
1955,0.60736,0.227035,10.272,9.07566,4.29193
1956,0.607204,0.246631,10.2743,9.12407,4.32252
1957,0.586994,0.256784,10.2396,9.1588,4.37792
1958,0.548281,0.271022,10.1248,9.14025,4.42641
1959,0.553401,0.261815,10.2012,9.1598,4.4346
1960,0.552105,0.275137,10.1846,9.19297,4.43173
1961,0.544133,0.280783,10.1479,9.19533,4.44407
1962,0.55382,0.281286,10.197,9.21544,4.45074
1963,0.549951,0.28303,10.2036,9.22841,4.46403
1964,0.547204,0.291287,10.2271,9.23954,4.48447
1965,0.55511,0.281313,10.2882,9.26531,4.52057
1966,0.558182,0.280151,10.353,9.31675,4.58156
1967,0.545735,0.294385,10.3351,9.35382,4.65983
1968,0.538964,0.294593,10.3525,9.38361,4.71804
1969,0.542764,0.299927,10.3676,9.40725,4.76329
1970,0.534595,0.315319,10.2968,9.39139,4.81136
1971,0.545591,0.315828,10.2592,9.34121,4.84082
#
On May 6, 2011, at 14:29 , Alex Olssen wrote:

            
Better starting values would help. In this case, almost too good values are available:

start <- c(coef(lm(y1~x1+x2+x3)), coef(lm(y2~x1+x2+x3)))

which appears to be the _exact_ solution. 
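
A quick numerical check of why equation-by-equation OLS is exact here. This is only a sketch on simulated data (not the Anderson-Blundell series), and `det_sse` is a made-up helper name. Because both equations share the same regressors, the OLS coefficients minimise det(E'E), so moving away from them can only raise the determinant and hence lower the likelihood.

```r
# Simulated stand-in data: NOT the Anderson-Blundell series.
set.seed(42)
n <- 43
X <- cbind(1, matrix(rnorm(n * 3), n, 3))          # intercept + 3 regressors
Y <- X %*% matrix(rnorm(8), 4, 2) +
     matrix(rnorm(n * 2, sd = 0.1), n, 2)          # two dependent variables

# Determinant of the residual cross-product matrix for a 4 x 2 coefficient matrix B
# (hypothetical helper, plays the role of det(sigma) in lnl)
det_sse <- function(B) {
  E <- Y - X %*% B
  det(crossprod(E))
}

# Equation-by-equation least squares, both columns at once
B_ols <- solve(crossprod(X), crossprod(X, Y))
d_ols <- det_sse(B_ols)

# Perturb the OLS solution in 200 random directions
d_pert <- replicate(200, det_sse(B_ols + matrix(rnorm(8, sd = 0.05), 4, 2)))

# TRUE if no perturbation beats the OLS determinant
all(d_pert >= d_ols)
```

The same argument says nothing about restricted versions of the model (for example with symmetry imposed), where OLS is no longer the maximiser.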

Apart from that, it seems that the conjugate gradient methods have difficulties with this likelihood, for some less than obvious reason. Increasing the maxit gets you closer but still not satisfactory. 

I would suggest trying out the experimental optimx package. Apparently, some of the algorithms in there are much better at handling this likelihood, notably "nlm" and "nlminb".

  
    
#
There is something strange in this problem.  I think the log-likelihood is incorrect.  See the results below from "optimx".  You can get much larger log-likelihood values than for the exact solution that Peter provided.

## model 18
lnl <- function(theta,y1, y2, x1, x2, x3) {
  n <- length(y1)
  beta <- theta[1:8]
  e1 <- y1 - theta[1] - theta[2]*x1 - theta[3]*x2 - theta[4]*x3
  e2 <- y2 - theta[5] - theta[6]*x1 - theta[7]*x2 - theta[8]*x3
  e <- cbind(e1, e2)
  sigma <- t(e)%*%e
  logl <- -1*n/2*(2*(1+log(2*pi)) + log(det(sigma)))  # it looks like there is something wrong here
  return(-logl)
}

data <- read.table("e:/computing/optimx_example.dat", header=TRUE, sep=",")

attach(data)

require(optimx)

start <- c(coef(lm(y1~x1+x2+x3)), coef(lm(y2~x1+x2+x3)))

# the warnings can be safely ignored in the "optimx" calls
p1 <- optimx(start, lnl, hessian=TRUE, y1=y1, y2=y2,
             x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))

p2 <- optimx(rep(0,8), lnl, hessian=TRUE, y1=y1, y2=y2,
             x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))

p3 <- optimx(rep(0.5,8), lnl, hessian=TRUE, y1=y1, y2=y2,
             x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))

Ravi.
#
On May 7, 2011, at 17:51 , Ravi Varadhan wrote:

            
Hm? I get stuff like below (for p3). Some of the entries have a considerably larger NEGATIVE log likelihood, but that's hardly a problem with the likelihood per se.  

         fvalues      method   fns  grs itns conv  KKT1  KKT2 xtimes
12 8.988466e+307      newuoa    NA   NA   NA 9999    NA    NA  0.002
11 8.988466e+307      bobyqa    NA   NA   NA 9999    NA    NA  0.001
3       23.66768 Nelder-Mead  1501   NA NULL    1 FALSE FALSE   0.18
7      -51.76068         spg  1925   NA 1501    1 FALSE FALSE  2.322
4      -55.78708    L-BFGS-B  2093 2093 NULL    0 FALSE FALSE  4.176
2      -70.57023          CG  5360 1501 NULL    1 FALSE  TRUE  3.465
1      -70.66286        BFGS 21481 1500 NULL    1 FALSE  TRUE  5.383
8      -76.73765      ucminf  1500 1500 NULL    0 FALSE  TRUE  0.067
9      -76.73871      Rcgmin  2434  867 NULL    0 FALSE  TRUE  1.514
10     -76.73877      Rvmmin   231   45 NULL    0 FALSE  TRUE  0.101
6      -76.73878      nlminb   130  581   67    0  TRUE  TRUE  0.085
5      -76.73878         nlm    NA   NA   46    0  TRUE  TRUE  0.058

I must admit that I didn't check the likelihood in detail, but minimizing the determinant of the residual SSD matrix is generally what you end up doing in this sort of model. (Of course, you can safely drop the constants, and you can also think up more stable ways of computing the log-determinant, but it's only 2x2 for cryin' out loud.)
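
For reference, a minimal sketch of what "more stable ways of computing the log-determinant" could look like in base R. The matrix `sigma` below is an arbitrary positive-definite stand-in for t(e) %*% e, not a value from this thread.

```r
# Arbitrary symmetric positive-definite stand-in for t(e) %*% e
sigma <- matrix(c(2.0, 0.3,
                  0.3, 1.5), nrow = 2, byrow = TRUE)

# Direct route, as in lnl: take the determinant, then the log
ld_direct <- log(det(sigma))

# Base R can return the log of |det| directly
ld_builtin <- as.numeric(determinant(sigma, logarithm = TRUE)$modulus)

# For a positive-definite matrix the Cholesky factor gives the same value:
# sigma = t(R) %*% R  implies  log det(sigma) = 2 * sum(log(diag(R)))
ld_chol <- 2 * sum(log(diag(chol(sigma))))

c(direct = ld_direct, builtin = ld_builtin, cholesky = ld_chol)

# On a badly scaled matrix the log-scale route stays finite
tiny    <- sigma * 1e-200
ld_tiny <- as.numeric(determinant(tiny, logarithm = TRUE)$modulus)
is.finite(ld_tiny)
```

For a 2x2 matrix with entries of ordinary size the three routes agree to rounding error, so this matters mainly when the residuals are very large or very small.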
1 day later
#
Thank you all for your input.

Unfortunately my problem is not yet resolved.  Before I respond to
individual comments I make a clarification:

In Stata, using the same likelihood function as above, I can reproduce
EXACTLY (to 3 decimal places or more, which counts as exact considering I
am using different software) the results from model 8 of the paper.

I take this as an indication that I am using the same likelihood
function as the authors, and that it does indeed work.
The reason I am trying to estimate the model in R is because while
Stata reproduces model 8 perfectly it has convergence
difficulties for some of the other models.

Peter Dalgaard,

"Better starting values would help. In this case, almost too good
values are available:

start <- c(coef(lm(y1~x1+x2+x3)), coef(lm(y2~x1+x2+x3)))

which appears to be the _exact_ solution."

Thanks for the suggestion.  Using these starting values produces the
exact estimate that Dave Fournier emailed me.
If these are the exact solution then why did the author publish
different answers which are completely reproducible in
Stata and Tsp?

Ravi,

Thanks for introducing optimx to me, I am new to R.  I completely
agree that you can get higher log-likelihood values
than those obtained with optim and the starting values suggested
by Peter.  In fact, in Stata, when I reproduce
the results of model 8 to more than 3 dp I get a log-likelihood of 54.039139.

Furthermore if I estimate model 8 without symmetry imposed on the
system I reproduce the Likelihood Ratio reported
in the paper to 3 decimal places as well, suggesting that the
log-likelihoods I am reporting differ from those in the paper
only due to a constant.

Thanks for your comments,

I am still highly interested in knowing why the results of the
optimisation in R are so different to those in Stata?

I might try making my convergence requirements more stringent.

Kind regards,

Alex
#
On May 9, 2011, at 06:07 , Alex Olssen wrote:

            
Ahem! You might get us interested in your problem, but not to the level that we are going to install Stata and Tsp and actually dig out and study the scientific paper you are talking about. Please cite the results and explain the differences.

Are we maximizing over the same parameter space? You say that the estimates from the paper gives a log-likelihood of 54.04, but the exact solution clocked in at 76.74, which in my book is rather larger.

Confused....   

-p

  
    
#
I wonder if someone with more experience than me on using R to summarise
by group wants to post a reply to this 

http://www.analyticbridge.com/group/sasandstatisticalprogramming/forum/topics/why-still-use-sas-with-a-lot

To save everyone having to follow the link, the text is copied below

"SAS has some nice features, such as the SQL procedure or simple "group
by" features. Try to compute correlations "by group" in R: say you have
2,000 groups, 2 variables e.g. salary and education level, and 2 million
observations - you want to compute correlation between salary and
education within each group.

It is not obvious, your best bet is to use some R package (see sample
code on Analyticbridge to do it), and the solution is painful, you can
not return both correlation and stdev "by group", as the function can
return only one argument, not a vector. So if you want to return not
just two, but say 100 metrics, it becomes a nightmare."



#
On May 9, 2011, at 11:12 , Martyn Byng wrote:

            
Not really (but how did those people manage to overlook by()?). However, another matter

PLEASE DO NOT BE SO B***DY LAZY AND START A NEW TOPIC AS A REPLY TO SOMETHING COMPLETELY UNRELATED. 

Those of us with threading mail clients, will see your post as part of the thread 

"maximum likelihood convergence reproducing Anderson Blundell 1982 Econometrica R vs Stata"

and, at least with Mac OSX's Mail, the header of the entire thread is now "Summarising by group".

Peter D.

  
    
#
Peter said

"Ahem! You might get us interested in your problem, but not to the
level that we are going to install Stata and Tsp and actually dig out
and study the scientific paper you are talking about. Please cite the
results and explain the differences."

Apologies Peter, will do,

The results which I can emulate in Stata but not (yet) in R are reported below.
They come from Econometrica Vol. 50, No. 6 (Nov., 1982), pp. 1569

TABLE II - model 18s

         coef     std err
p10     -0.19     0.078
p11      0.220    0.019
p12     -0.148    0.021
p13     -0.072
p20      0.893    0.072
p21     -0.148
p22      0.050    0.035
p23      0.098

The results which I produced in Stata are reported below.
I spent the last hour rewriting the code to reproduce this - since I
am now at home and not at work :(
My results are "identical" to those published.  The estimates are for
a 3 equation symmetrical singular system.
I have not bothered to report symmetrical results and have backed out
an extra estimate using adding up constraints.
I have also backed out all standard errors using the delta method.

. ereturn display
------------------------------------------------------------------------------
            |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
a            |
         a1 |  -.0188115   .0767759    -0.25   0.806    -.1692895    .1316664
         a2 |   .8926598   .0704068    12.68   0.000     .7546651    1.030655
         a3 |   .1261517   .0590193     2.14   0.033      .010476    .2418275
-------------+----------------------------------------------------------------
g            |
        g11 |   .2199442   .0184075    11.95   0.000      .183866    .2560223
        g12 |  -.1476856   .0211982    -6.97   0.000    -.1892334   -.1061378
        g13 |  -.0722586   .0145154    -4.98   0.000    -.1007082   -.0438089
        g22 |   .0496865   .0348052     1.43   0.153    -.0185305    .1179034
        g23 |   .0979991   .0174397     5.62   0.000     .0638179    .1321803
        g33 |  -.0257405   .0113869    -2.26   0.024    -.0480584   -.0034226
------------------------------------------------------------------------------

In R I cannot get results like this - I think it is probably to do
with my inexperience in using the optimisers.
Any pointers would be appreciated.

Peter said "Are we maximizing over the same parameter space? You say
that the estimates from the paper gives a log-likelihood of 54.04, but
the exact solution clocked in at 76.74, which in my book is rather
larger."

I meant +54.04 > -76.74.  It is quite common to get positive
log-likelihoods in this kind of system estimation.

Kind regards,

Alex
On 9 May 2011 19:04, peter dalgaard <pdalgd at gmail.com> wrote:
#
----------------------------------------
did you actually cut/paste code anywhere and is your first coefficient -.19 or -.019?
Presumably typos would be one possible problem.
is this it, page 1559?

http://www.jstor.org/pss/1913396

generally it helps if we could at least see the equations to check your
code against typos (note the page number?) in lnl; that may fix part of the
mystery. Is the full text available on the author's site? It doesn't come
up on citeseer AFAICT,


http://citeseerx.ist.psu.edu/search?q=blundell+1982&sort=ascdate

I guess one question would be "what is beta" in lnl supposed to be -
it isn't used anywhere, but I will also mention I'm not that familiar
with the R code (I'm trying to work through this to learn R and the optimizers).

maybe some words would help, is sigma supposed to be 2x2 or 8x8 and what are
e1 and e2 supposed to be?
#
Hi Mike,

Mike said
"is this it, page 1559?"

That is the front page yes, page 15*6*9 has the table, of which the
model labelled 18s is the one I replicated.

"did you actually cut/paste code anywhere and is your first
coefficient -.19 or -.019?
Presumably typos would be one possible problem."

-0.19 is not a typo, it is pi10 in the paper, and a1 in my Stata
estimation - as far as I can tell cutting and pasting is not the
problem.

"generally it helps if we could at least see the equations to check your
code against typos ( note page number ?) in lnl that may fix part of the
mystery.  Is full text available
on author's site, doesn't come up on citeseer AFAICT,"

Unfortunately I do not know of any place to get the full version of
this paper that doesn't require access to a database such as JSTOR.
The fact that this likelihood function reproduces the published
results in Stata makes me confident that it is correct - I also have
read a lot on systems estimation and it is a pretty standard
likelihood function.

"I guess one question would be "what is beta" in lnl supposed to be -
it isn't used anywhere, but I will also mention I'm not that familiar
with the R code (I'm trying to work through this to learn R and the
optimizers).

maybe some words would help, is sigma supposed to be 2x2 or 8x8 and what are
e1 and e2 supposed to be?"

The code above is certainly not as elegant as it could be - in this
case the beta is rather superfluous.  It is just a vector to hold
parameters - but since I called all the parameters theta anyway there
is no need for it.  e1 and e2 are the residuals from the first and
second equations of the system.  Sigma is a 2x2 matrix: the
cross-product matrix t(e) %*% e of the two vectors of residuals.

Kind regards,

Alex
On 9 May 2011 23:12, Mike Marchywka <marchywka at hotmail.com> wrote:
#
On May 9, 2011, at 13:40 , Alex Olssen wrote:

            
However, the R code you posted will at best replicate model 18. For 18s, you need to impose symmetry conditions.

The fit of model 18 does seem to give similar results to those in Table II, but not exactly. 

Even more puzzling, if you start from the published coefficients, nlm seems to be getting stuck at a non-concave point of the negative likelihood.
+ x1=x1, x2=x2, x3=x3, gradtol=1e-15)
$minimum
[1] -76.60681

$estimate
[1] -0.64701504  0.27284992 -0.13713084 -0.07205753  1.28200342 -0.24596766  0.13502787  0.06001181

$gradient
[1]   2.782425  12.648629  -1.830244 -29.296795  -2.577734  -4.954042   1.495517  11.016190

$hessian
          [,1]     [,2]     [,3]      [,4]      [,5]     [,6]     [,7]     [,8]
[1,]  197435.9  1977197  1741724  805132.7  135821.9  1358013  1196714   553746
[2,] 1977197.4 19712076 17383474 8064540.2 1359856.0 13527720 11935305  5544903
[3,] 1741724.0 17383474 15336008 7121047.7 1197868.5 11929667 10529439  4896008
[4,]  805132.7  8064540  7121048 3315377.8  553733.1  5534923  4889523  2279404
[5,]  135821.9  1359856  1197868  553733.1  271511.7  2713733  2391332  1106572
[6,] 1358012.7 13527720 11929667 5534923.0 2713732.9 26955636 23784039 11054174
[7,] 1196714.3 11935305 10529439 4889522.6 2391331.9 23784039 20993413  9765022
[8,]  553746.0  5544903  4896008 2279404.1 1106571.9 11054174  9765022  4552693

$code
[1] 2

$iterations
[1] 4
#
Hmm, I tried replacing the x's in the model with their principal component scores, and suddenly everything converges like greased lightning:
+ x1=Z$a, x2=Z$b, x3=Z$c, method="BFGS")
                                                                                                par
1 0.59107682, 0.01043568, -0.20639153, 0.25902746, 0.24675162, -0.01426477, 0.18045177, -0.23657327
    fvalues method fns grs itns conv KKT1 KKT2 xtimes
1 -76.73878   BFGS 157  37 NULL    0 TRUE TRUE  0.055
Warning messages:
1: In max(logpar) : no non-missing arguments to max; returning -Inf
2: In min(logpar) : no non-missing arguments to min; returning Inf

The likelihood appears to be spot on, but, obviously, the parameter estimates need to be back-transformed to the original x1,x2,x3 system. I'll leave that as the proverbial "exercise for the reader"...
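
A sketch of that back-transformation, under assumptions: the thread does not show how `Z` was built, so this uses `prcomp()` with its default centring and no rescaling, simulated data, and a single equation fitted by `lm()` for brevity. The same mapping would apply to each block of four parameters in theta.

```r
# Simulated stand-in data with two nearly collinear regressors
set.seed(1)
n  <- 43
x1 <- rnorm(n, 10, 0.3)
x2 <- 0.9 * x1 + rnorm(n, 0, 0.1)
x3 <- rnorm(n, 4, 0.4)
y  <- 0.5 + 0.2 * x1 - 0.1 * x2 + 0.05 * x3 + rnorm(n, sd = 0.02)

X  <- cbind(x1, x2, x3)
pc <- prcomp(X)              # centres the columns, does not rescale them
Z  <- pc$x                   # principal component scores

fit_pc <- lm(y ~ Z)          # estimates in the score parameterisation
g      <- coef(fit_pc)[-1]   # slopes on the scores

# Z = (X - 1 c') R, so the slopes map back as beta = R g and the
# intercept absorbs the centring: alpha = a_pc - sum(c * beta)
beta  <- drop(pc$rotation %*% g)
alpha <- coef(fit_pc)[1] - sum(pc$center * beta)

back <- c(alpha, beta)
orig <- coef(lm(y ~ x1 + x2 + x3))

# Largest discrepancy between back-transformed and direct estimates
max(abs(back - orig))
```

If the scores were computed with scaling (`scale. = TRUE`), each slope would additionally have to be divided by the corresponding column standard deviation before the intercept is adjusted.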

However, the whole thing leads me to a suspicion that maybe our numerical gradients are misbehaving in cases with high local collinearity?

-pd

            

  
    
#
Trolling? (but see in line below)
On Mon, May 9, 2011 at 5:12 AM, Martyn Byng <Martyn.Byng at nag.co.uk> wrote:
The wealth of R packages is a core strength of the platform. It is not
a disadvantage to have a wealth of well-developed code for almost any
statistical application.

"(see sample code on Analyticbridge to do it), and the solution is painful"

Well, the Analyticbridge code is painful, but that example says far
more about the person who wrote it than it does about R. All they
really needed to do was

v <- ddply(xx, .(country), summarize,
           COR = cor(income, age),
           MEAN_age = mean(age),
           MEAN_income = mean(income),
           MAX_income = max(income),
           STDEV_income = sd(income))

I'm not interested in signing up so I can post a reply to the original
post, but feel free to copy my answer there if you want.
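
For anyone without plyr, roughly the same per-group table can be sketched in base R. The data frame below is simulated and its column names are only illustrative; the point is that a function applied per group can return as many statistics as needed.

```r
# Simulated stand-in data; column names mirror the ddply call
set.seed(1)
xx <- data.frame(
  country = rep(c("A", "B", "C"), each = 50),
  income  = rnorm(150, mean = 50, sd = 10),
  age     = rnorm(150, mean = 40, sd = 8)
)

# Split by group, compute one row of several statistics per group
per_group <- lapply(split(xx, xx$country), function(d) {
  data.frame(country      = d$country[1],
             COR          = cor(d$income, d$age),
             MEAN_age     = mean(d$age),
             STDEV_income = sd(d$income))
})

# Stack the one-row data frames into a single table
stats <- do.call(rbind, per_group)
stats
```

With thousands of groups and millions of rows this split/lapply approach may be slow, so treat it as an illustration of the idea rather than a tuned solution.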

Best,
Ista

"you can not return both correlation and stdev "by group""
Wrong, see example above.

Best,
Ista

  
    
#
Wow that is really interesting,

Sorry I was asleep when you emailed these.

And yes, of course, I had been trying to implement model 18, not 18s,
that was a typo, sorry.

I will have a look at the code you posted.

Thanks,

Alex
On 10 May 2011 02:18, peter dalgaard <pdalgd at gmail.com> wrote:
#
Alex Olssen wrote:
I have run nlm with a zero starting vector on model 18.
+                  y1=y1, y2=y2, x1=x1, x2=x2, x3=x3)
$minimum
[1] -76.73878

$estimate
[1] -0.68124025  0.28686058 -0.15394903 -0.06173779  1.30304998 -0.25446946
[7]  0.14531856  0.05349810

$gradient
[1]  0.0001356761  0.0014407493  0.0013240913  0.0006367333 -0.0005172628
[6] -0.0051911383 -0.0044429086 -0.0019081432

$hessian
          [,1]     [,2]     [,3]      [,4]      [,5]     [,6]     [,7]
[1,]  197993.1  1982843  1746582  807261.9  135628.5  1355944  1194854
[2,] 1982842.8 19766730 17430209 8084282.7 1357883.8 13506353 11915748
[3,] 1746581.8 17430209 15376070 7138194.2 1196134.9 11911149 10512490
[4,]  807261.9  8084283  7138194 3323204.7  552932.9  5526861  4882154
[5,]  135628.5  1357884  1196135  552932.9  271571.7  2713953  2391658
[6,] 1355943.5 13506353 11911149 5526860.9 2713953.0 26956003 23785984
[7,] 1194854.2 11915748 10512490 4882154.2 2391658.1 23785984 20996480
[8,]  552860.7  5534816  4887292 2275676.3 1106862.5 11057031  9768028
           [,8]
[1,]   552860.7
[2,]  5534816.0
[3,]  4887292.4
[4,]  2275676.3
[5,]  1106862.5
[6,] 11057030.7
[7,]  9768027.6
[8,]  4554570.6

$code
[1] 2

$iterations
[1] 68
[1] -0.68124025  0.28686058 -0.15394903 -0.06173779  1.30304998 -0.25446946
[7]  0.14531856  0.05349810

Quite different from what you get if one starts with AB's model 18 results
of Table II.
I have also used Gretl's FIML procedure to estimate the model (I don't know
what starting values Gretl uses).
The results are very similar to those obtained with nlm as above.

Berend

--
View this message in context: http://r.789695.n4.nabble.com/maximum-likelihood-convergence-reproducing-Anderson-Blundell-1982-Econometrica-R-vs-Stata-tp3502516p3512807.html
Sent from the R help mailing list archive at Nabble.com.
1 day later
#
So what was the final verdict on this discussion? I kind of
lost track, so if anyone has a minute to summarize, or to critique my summary below, please do.


Apparently there were two issues, the comparison between R and Stata
was one issue and the "optimum" solution another. As I understand it,
there was some question about R numerical gradient calculation. This would
suggest some features of the function may be of interest to consider. 

The function to be optimized appears to be, as the OP stated,
some function of the residuals of two (unrelated) fits.
The residual vectors e1 and e2 are dotted in various combinations,
creating a matrix whose determinant is (e1.e1)(e2.e2)-(e1.e2)^2, which
is the quantity to be minimized by choice of theta. Theta, it seems, is
an 8-component vector: 4 components determine e1 and the other 4 determine e2.
Presumably a unique solution would require that e1 and e2, both n-component vectors,
point in different directions, or else both could become arbitrarily large
while keeping the error signal at zero. For fixed magnitudes, collinearity
would reduce the "Error." The intent would appear to be to
keep the residuals distributed similarly in the two (unrelated) fits.
I guess my question is,
"did anyone determine that there is a unique solution?" or
am I totally wrong here (I haven't used these myself to any
extent and just try to run some simple teaching examples, so I am asking
for my own clarification as much as anything).

Thanks.










----------------------------------------
9 days later
#
Hi,

I don't think the final verdict has been spoken.  Peter's posts have hinted at ill-conditioning as the crux of the problem.  So, I decided to try a few more things: (1) standardizing the covariates, (2) exact gradient, and (3) both (1) and (2).

I compute the "exact" gradient using a complex-step derivative approach.  This works just like the standard first-order, forward differencing.  The only (but, essential) difference is that an imaginary increment, i*dx, is used.  This, incredibly, gives exact gradients (up to machine precision). 

Here are the code and the results of my experiments:

data <- read.table("h:/computations/optimx_example.dat", header=TRUE, sep=",")
attach(data)
require(optimx)

## model 18
lnl <- function(theta,y1, y2, x1, x2, x3) {
  n <- length(y1)
  beta <- theta[1:8]
  e1 <- y1 - theta[1] - theta[2]*x1 - theta[3]*x2 - theta[4]*x3
  e2 <- y2 - theta[5] - theta[6]*x1 - theta[7]*x2 - theta[8]*x3
  e <- cbind(e1, e2)
  sigma <- t(e)%*%e
  det.sigma <- sigma[1,1] * sigma[2,2] - sigma[1,2] * sigma[2,1]
  logl <- -1*n/2*(2*(1+log(2*pi)) + log(det.sigma)) 
  return(-logl)
}

csd <- function(fn, x, ...) {
  # Complex step derivatives; yields exact derivatives
  h <- .Machine$double.eps
  n <- length(x)
  h0 <- g <- rep(0, n)
  for (i in 1:n) {
    h0[i] <- h * 1i                  # imaginary step in coordinate i only
    g[i] <- Im(fn(x + h0, ...)) / h
    h0[i] <- 0                       # reset before the next coordinate
  }
  g
}

gr.csd <- function(theta, y1, y2, x1, x2, x3) {
  csd(lnl, theta, y1=y1, y2=y2, x1=x1, x2=x2, x3=x3)
}

# exact solution as the starting value
start <- c(coef(lm(y1~x1+x2+x3)), coef(lm(y2~x1+x2+x3)))
p1 <- optimx(start, lnl, y1=y1, y2=y2, x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))

# numerical gradient
p2 <- optimx(rep(0,8), lnl, y1=y1, y2=y2, x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))

# exact gradient
p2g <- optimx(rep(0,8), lnl, gr.csd, y1=y1, y2=y2, x1=x1, x2=x2, x3=x3, control=list(all.methods=TRUE, maxit=1500))

# comparing p2 and p2g, we see the dramatic improvement in BFGS when exact gradient is used, we also see a major difference for L-BFGS-B
# Exact gradient did not affect the gradient methods, CG and spg, much.  However, convergence of Rcgmin improved when exact gradient was used

x1s <- scale(x1)
x2s <- scale(x2)
x3s <- scale(x3)
                                                                                              
p3 <- optimx(rep(0,8),lnl, y1=y1, y2=y2, x1=x1s, x2=x2s, x3=x3s, control=list(all.methods=TRUE, maxit=1500))

# both scaling and exact gradient
p3g <- optimx(rep(0,8),lnl, gr.csd, y1=y1, y2=y2, x1=x1s, x2=x2s, x3=x3s, control=list(all.methods=TRUE, maxit=1500))

# Comparing p3 and p3g, use of exact gradient improved spg dramatically.  However, it made CG worse! 

# Of course, derivative-free methods newuoa, and Nelder-Mead are improved by scaling, but not by the availability of exact gradients.  I don't know what is wrong with bobyqa in this example.

In short, even with scaling and exact gradients, this optimization problem is recalcitrant.
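
As a cross-check on the complex-step idea, the gradient of this likelihood also has a closed form: for E = Y - XB, the derivative of (n/2) log det(E'E) with respect to B is -n X'E (E'E)^{-1}. The sketch below compares the two on simulated data (not the thread's data set); `negll`, `gr_analytic` and `gr_cs` are illustrative names only.

```r
# Simulated stand-in data: NOT the Anderson-Blundell series.
set.seed(7)
n  <- 43
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y1 <- 0.6 + 0.2 * x1 - 0.1 * x2 + rnorm(n, sd = 0.05)
y2 <- 0.3 - 0.2 * x1 + 0.1 * x3 + rnorm(n, sd = 0.05)
X  <- cbind(1, x1, x2, x3)
Y  <- cbind(y1, y2)

# Negative concentrated log-likelihood, same form as lnl
negll <- function(theta) {
  E <- Y - X %*% matrix(theta, 4, 2)
  S <- t(E) %*% E                       # plain transpose, so complex input works
  detS <- S[1, 1] * S[2, 2] - S[1, 2] * S[2, 1]
  n / 2 * (2 * (1 + log(2 * pi)) + log(detS))
}

# Closed-form gradient: -n * X'E (E'E)^{-1}, stacked column by column
gr_analytic <- function(theta) {
  E <- Y - X %*% matrix(theta, 4, 2)
  as.vector(-n * t(X) %*% E %*% solve(crossprod(E)))
}

# Complex-step gradient, same scheme as csd()
gr_cs <- function(theta) {
  h <- .Machine$double.eps
  sapply(seq_along(theta), function(i) {
    step <- rep(0, length(theta))
    step[i] <- h
    Im(negll(theta + 1i * step)) / h
  })
}

theta0 <- rep(0.1, 8)
ga <- gr_analytic(theta0)
gc <- gr_cs(theta0)

# Largest absolute difference between the two gradients
max(abs(ga - gc))
```

A closed-form gradient like this could be passed to the optimisers in place of `gr.csd`; whether it changes the convergence picture for the real data is not something this sketch can show.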

Best,
Ravi.
#
----------------------------------------
Cool, I thought everyone lost interest. I'll get back to this then.
Before launching into this, I was curious however if the STATA
solution (or whatever other product was used) was thought
to represent a good solution or the actual global optimum.

IIRC, your earlier post along these lines,
was what got me started. 


I guess my point was that the problem would not obviously
have a nice surface to optimize and IIRC the title of the paper
suggested the system was a bit difficult (I won't pay for med papers,
not going to pay for econ LOL). The function to be minimized, and this
is stated as fact but only for sake of eliciting criticism,
is the determinant of the 2x2 cross-product matrix of the 2 unrelated
residual vectors. And, from memory,
this gives something like E=|e1|^2*|e2|^2*(1-Cos^2(d)) with
d being the angle between the residual vectors (in a space with dimension
equal to the number of data points). Or, E=F*Sin^2(d), and depending on the
data it would seem possible to move in such a way that F increases but just
more slowly than Sin^2(d) decreases. Any solution for which they are
collinear would seem to be optimal.
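
The decomposition E = |e1|^2 * |e2|^2 * (1 - Cos^2(d)) can be checked numerically. This is a sketch with two arbitrary simulated vectors standing in for the residuals; `Fmag` and `cosd` are illustrative names. Note that when both equations share the same regressors, det(E'E) is bounded below by its value at the OLS fit, so exactly collinear residuals are generally not reachable by choosing theta.

```r
# Two arbitrary correlated vectors standing in for the residuals
set.seed(3)
e1 <- rnorm(43)
e2 <- 0.8 * e1 + rnorm(43, sd = 0.5)

# Determinant of the 2x2 cross-product matrix, as in lnl
e    <- cbind(e1, e2)
dete <- det(t(e) %*% e)

# Cosine of the angle d between the vectors, and F = |e1|^2 * |e2|^2
cosd <- sum(e1 * e2) / sqrt(sum(e1^2) * sum(e2^2))
Fmag <- sum(e1^2) * sum(e2^2)

# Both entries should agree up to rounding
c(det = dete, identity = Fmag * (1 - cosd^2))
```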

I guess if my starting point is not too far off I may see if
I can find some diagnostics to determine whether the data set creates
a condition like the one I have outlined for the optimizer.

It might, for example, be interesting to see what happens to |e1||e2|
and Sin(d) at various points.
( T1 is just theta components 1-4 and T2 is 5-8, I guess presuming "x0"=1 LOL ),
E1 = Y1 - T1*X,  E2 = Y2 - T2*X
E1.E2 = Y1.Y2 - Y1.(T2*X) - (T1*X).Y2 + (T1*X).(T2*X)

I guess I keep working on the above and see if it points
to anything pathological or that doesn't play nice with
optimizer.
#
----------------------------------------
I was too lazy to continue the above on paper but empirically this seems to
be part of symptoms. For example, 

e1m=sum(e1*e1);
e2m=sum(e2*e2);
e12=sum(e1*e2);
phit=e12/sqrt(e1m*e2m);
dete=det(sigma);
xxx=paste("e1m*e2m=",e1m*e2m," aphi=",phit, " dete=",dete,sep="");
print(xxx);

on this call, 

print("doing BFGS from zero");
p <- optim(0*c(1:8), lnl, method="BFGS", hessian=TRUE, y1=data$y1, y2=data$y2,
           x1=data$x1, x2=data$x2, x3=data$x3)


shows that the product of the squared magnitudes of e1 and e2 explodes as the
error vectors become parallel
(since I got an answer I like, I haven't bothered to check for typos
or coding errors LOL; note in particular things like sqrt() etc.)

[1] "e1m*e2m=41.1669479546532 aphi=0.960776458479452 dete=3.16609220303528"
[1] "e1m*e2m=41.0289397617336 aphi=0.960734505863182 dete=3.15878562836847"
[1] "e1m*e2m=41.3051898313859 aphi=0.960818247708638 dete=3.17340730403131"
[1] "e1m*e2m=39.7867410499191 aphi=0.960397122114274 dete=3.08893784983331"
[1] "e1m*e2m=42.5708634641545 aphi=0.961141202950479 dete=3.24422282329344"
[1] "e1m*e2m=39.952815544519 aphi=0.960383767149008 dete=3.10285630456706"
[1] "e1m*e2m=42.3994556719457 aphi=0.961155360764465 dete=3.23000632577225"
[1] "e1m*e2m=40.6077431942131 aphi=0.960502159746792 dete=3.14448500444145"
[1] "e1m*e2m=41.7300849780045 aphi=0.96104579811445 dete=3.18780183350788"
[1] "e1m*e2m=40.8461025929321 aphi=0.960565688832973 dete=3.15795749888956"
[1] "e1m*e2m=41.4890962339491 aphi=0.960985035909235 dete=3.17423784875441"
[1] "e1m*e2m=38.0026479775113 aphi=0.958424486959574 dete=3.09427071121483"
[1] "e1m*e2m=44.4634366247632 aphi=0.962889636822005 dete=3.23887444893151"
[1] "e1m*e2m=38.3679152097842 aphi=0.958864175571152 dete=3.09166714763783"
[1] "e1m*e2m=44.0684332841423 aphi=0.962518076859826 dete=3.24162775623887"
[1] "e1m*e2m=39.8519719942113 aphi=0.960141707858905 dete=3.11355091583515"
[1] "e1m*e2m=42.5038484736989 aphi=0.961384528172175 dete=3.21923251471042"
[1] "e1m*e2m=21375239704009490432 aphi=0.999977697068197 dete=953450394271031"
[1] "e1m*e2m=34177731168061536 aphi=0.999977905000934 dete=1510297191323.93"
[1] "e1m*e2m=54503403593545 aphi=0.999978923036089 dete=2297508328.68597"
[1] "e1m*e2m=85767789487.156 aphi=0.999983466344755 dete=2836086.67942723"
[1] "e1m*e2m=126124258.261637 aphi=0.999991704142303 dete=2092.60911721655"
[1] "e1m*e2m=127695.808241124 aphi=0.999539670386456 dete=117.537264947809"
[1] "e1m*e2m=1.79314748193414 aphi=0.458869014722204 dete=1.41558096262301"
[1] "e1m*e2m=1.82076399719734 aphi=0.469891415967073 dete=1.41874305229269"
[1] "e1m*e2m=1.76630916851626 aphi=0.447604317206231 dete=1.41242978935563"
[1] "e1m*e2m=2.10765446888618 aphi=0.559484707051066 dete=1.44790985442968"
[1] "e1m*e2m=1.55759367420566 aphi=0.33387034624138 dete=1.38396962928268"
[1] "e1m*e2m=2.07345917255291 aphi=0.547539248673881 dete=1.45183771159372"
[1] "e1m*e2m=1.57402829646691 aphi=0.351106401124671 dete=1.37998884867053"
[1] "e1m*e2m=1.92570387464855 aphi=0.500413807670452 dete=1.44348070520072"
[1] "e1m*e2m=1.67368610942138 aphi=0.413157386356889 dete=1.38798952087868"
[1] "e1m*e2m=1.80091875441907 aphi=0.458998972457491 dete=1.42150108909529"
[1] "e1m*e2m=1.78539325145298 aphi=0.458738462509367 dete=1.40967335131397"
[1] "e1m*e2m=1.87239816244392 aphi=0.46015903715983 dete=1.4759247054976"
[1] "e1m*e2m=1.71562581348667 aphi=0.457519283487653 dete=1.3565043362516"
[1] "e1m*e2m=1.86276444029005 aphi=0.460322529304816 dete=1.46805055851996"
[1] "e1m*e2m=1.72487059095196 aphi=0.457356617377108 dete=1.36407065493321"
[1] "e1m*e2m=1.82498046816382 aphi=0.460003447183359 dete=1.43880881331975"
[1] "e1m*e2m=1.76160126627035 aphi=0.457713483547811 dete=1.39254292425402"
[1] "e1m*e2m=1351553015129414 aphi=-0.999994920076885 dete=13731535927.7031"
[1] "e1m*e2m=2152851612889.67 aphi=-0.999994409734947 dete=24069954.9936224"
[1] "e1m*e2m=3367962636.07941 aphi=-0.999991168164692 dete=59490.3199496781"
[1] "e1m*e2m=4794712.10051354 aphi=-0.999956405381663 dete=418.038175816592"
[1] "e1m*e2m=3699.85578987196 aphi=-0.998908076827754 dete=8.07550521781124"
[1] "e1m*e2m=12.6686499822451 aphi=0.956985935569274 dete=1.06642059359798"
[1] "e1m*e2m=12.7330364607788 aphi=0.957056380194584 dete=1.07012366722186"
[1] "e1m*e2m=12.6044297074752 aphi=0.95691501854723 dete=1.0627254405059"

but what is interesting is that each time it emerges, it seems to get a better
plateau.

If you actually look at something like a Poincare map, it is interesting that
aphi heads to +/-1 as the error magnitude product gets large, which is about what
I motivated. There seems to be a cluster of points around aphi=-.5 with minimal
product. Using scatterplot3d on the above with log for magnitudes, there seem
to be "paths", but the interpretation is not immediately clear. I admit at this point
an interactive viewer would be helpful to rotate the thing, and in fact rgl.surface
may be useful ...