Your function named 'gradient' is not the correct gradient. Take as an
example the following point x0, very near to the true minimum,
x0 <- c(-0.2517964, 0.4898680, -0.2517962, 0.4898681, 0.7500995)
then you get
[1] -0.0372110470  0.0001816991 -0.0372102284  0.0001820976  0.0144292657
but the numerical gradient is different:
> library(numDeriv)
> grad(fn, x0)
[1] -6.151645e-07 -5.507219e-07  1.969143e-07 -1.563892e-07 -4.955502e-08
That is, the derivative is close to 0 in every direction -- as it should
be at an optimum.
No wonder optim et al. get confused when given this 'gradient'.
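For anyone who wants to reproduce this kind of check without numDeriv, a
hand-rolled central difference does the same job. A minimal sketch in base
R; f and gf here are toy stand-ins with a known gradient, not Harold's fn
and gradient:

```r
## Central-difference check of an analytic gradient, base R only
num_grad <- function(f, x, h = 1e-6) {
  sapply(seq_along(x), function(k) {
    e <- replace(numeric(length(x)), k, h)   # perturb one coordinate
    (f(x + e) - f(x - e)) / (2 * h)
  })
}

f  <- function(b) sum(b^2)    # toy objective whose true gradient is 2*b
gf <- function(b) 2 * b       # analytic gradient to be checked

b0 <- c(-0.25, 0.49, 0.75)
max(abs(gf(b0) - num_grad(f, b0)))   # near 0 only when gf is correct
```

Running the same comparison with mml2's fn and gradient at x0 reproduces
the discrepancy above.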
Regards
Hans Werner
Doran, Harold wrote:
I have constructed the function mml2 (below) based on the likelihood
function described in the minimal latex I have pasted below for anyone who
wants to look at it. This function finds parameter estimates for a basic
Rasch (IRT) model. Using the function without the gradient, either
nlminb or optim returns the correct parameter estimates and, in the case
of optim, the correct standard errors.
By correct, I mean they match another software program as well as the
rasch() function in the ltm package.
Now, when I pass the gradient to optim, I get a message of successful
convergence, but the parameter estimates are not correct -- though they are
*very* close. When I use nlminb with the gradient, I get a message of
false convergence and, again, the estimates are off but very close to
being correct.
This is best illustrated via the examples:
### Sample data set
set.seed(1234)
tmp <- data.frame(item1 = sample(c(0,1), 20, replace = TRUE),
                  item2 = sample(c(0,1), 20, replace = TRUE),
                  item3 = sample(c(0,1), 20, replace = TRUE),
                  item4 = sample(c(0,1), 20, replace = TRUE),
                  item5 = sample(c(0,1), 20, replace = TRUE))
## Use function mml2 (below) with optim with use of gradient
$par
[1] -0.2438733 0.4889333 -0.2438733 0.4889333 0.7464162
$value
[1] 63.86376
$counts
function gradient
45 6
$convergence
[1] 0
$message
NULL
$hessian
[,1] [,2] [,3] [,4] [,5]
[1,] 4.095479 0.000000 0.000000 0.000000 0.000000
[2,] 0.000000 3.986293 0.000000 0.000000 0.000000
[3,] 0.000000 0.000000 4.095479 0.000000 0.000000
[4,] 0.000000 0.000000 0.000000 3.986293 0.000000
[5,] 0.000000 0.000000 0.000000 0.000000 3.800898
## Use same function but use nlminb with use of gradient
$par
[1] -0.2456398 0.4948889 -0.2456398 0.4948889 0.7516308
$objective
[1] 63.86364
$convergence
[1] 1
$message
[1] "false convergence (8)"
$iterations
[1] 4
$evaluations
function gradient
41 4
### use nlminb but turn off use of gradient
$par
[1] -0.2517961 0.4898682 -0.2517961 0.4898682 0.7500994
$objective
[1] 63.8635
$convergence
[1] 0
$message
[1] "relative convergence (4)"
$iterations
[1] 8
$evaluations
function gradient
11 64
### Use optim and turn off gradient
$par
[1] -0.2517990 0.4898676 -0.2517990 0.4898676 0.7500906
$value
[1] 63.8635
$counts
function gradient
22 7
$convergence
[1] 0
$message
NULL
$hessian
[,1] [,2] [,3] [,4] [,5]
[1,] 3.6311153 -0.3992959 -0.4224747 -0.3992959 -0.3764526
[2,] -0.3992959 3.5338195 -0.3992959 -0.3960956 -0.3798141
[3,] -0.4224747 -0.3992959 3.6311153 -0.3992959 -0.3764526
[4,] -0.3992959 -0.3960956 -0.3992959 3.5338195 -0.3798141
[5,] -0.3764526 -0.3798141 -0.3764526 -0.3798141 3.3784816
The parameter estimates with and without the use of the gradient are so
close that I am inclined to believe the gradient is correct and that the
problem lies elsewhere.
It seems odd that optim converges but nlminb does not with the use of the
gradient. But with the gradient in either case, the parameter estimates
differ from what I think are the correct values. At this point I am unclear
whether the problem is in the way the functions are used, in how I am
passing the gradient, or in the way I have constructed the gradient itself.
Below is the function and also some latex for those interested in looking
at the likelihood function.
Thanks for any reactions
Harold
R version 2.10.0 (2009-10-26)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ltm_0.9-2      polycor_0.7-7  sfsmisc_1.0-9  mvtnorm_0.9-8  MASS_7.3-3     MiscPsycho_1.5
[8] statmod_1.4.1
loaded via a namespace (and not attached):
[1] splines_2.10.0 survival_2.35-7 tools_2.10.0
mml2 <- function(data, Q, startVal = NULL, gr = TRUE, ...){
    if(!is.null(startVal) && length(startVal) != ncol(data)){
        stop("Length of argument startVal not equal to the number of parameters estimated")
    }
    if(!is.null(startVal)){
        startVal <- startVal
    } else {
        p <- colMeans(data)
        startVal <- as.vector(log((1 - p)/p))
    }
    qq <- gauss.quad.prob(Q, dist = 'normal')   # from package statmod
    rr1 <- matrix(0, nrow = Q, ncol = nrow(data))
    data <- as.matrix(data)
    L <- nrow(data)
    C <- ncol(data)
    fn <- function(b){
        for(j in 1:Q){
            for(i in 1:nrow(data)){
                rr1[j,i] <- exp(sum(dbinom(data[i,], 1,
                    plogis(qq$nodes[j] - b), log = TRUE))) * qq$weights[j]
            }
        }
        -sum(log(colSums(rr1)))
    }
    ## the analytic gradient under discussion
    gradient <- function(b){
        p <- outer(qq$nodes, b, plogis) * L
        x <- t(matrix(colSums(data), nrow = C, ncol = Q))
        w <- matrix(qq$weights, Q, C)
        rr2 <- (-x + p) * w
        -colSums(rr2)
    }
    opt <- optim(startVal, fn, gr = gradient, hessian = TRUE,
        method = "BFGS")
    #opt <- nlminb(startVal, fn, gradient = gradient)
    #list("coefficients" = opt$par, "LogLik" = -opt$value,
    #    "Std.Error" = 1/sqrt(diag(opt$hessian)))
    opt
}
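For comparison, a gradient that does agree with the numerical one can be
built by weighting each quadrature node by its *posterior* probability for
each person, instead of by the prior weights alone. A self-contained
sketch, not Harold's code: the names neg_loglik and neg_loglik_grad are
illustrative, and the nodes/weights below are an arbitrary 3-point rule
chosen only to exercise the check (in practice they would come from
gauss.quad.prob):

```r
## Sketch: marginal negative log-likelihood and a matching analytic
## gradient for a Rasch model, using posterior-weighted quadrature nodes.
neg_loglik <- function(b, data, nodes, weights) {
  Q <- length(nodes)
  Lik <- sapply(1:nrow(data), function(i)       # Q x N person-node likelihoods
    sapply(1:Q, function(q)
      prod(dbinom(data[i, ], 1, plogis(nodes[q] - b)))))
  -sum(log(colSums(Lik * weights)))
}

neg_loglik_grad <- function(b, data, nodes, weights) {
  Q <- length(nodes)
  P <- outer(nodes, b, function(th, be) plogis(th - be))   # Q x K P(correct)
  Lik <- sapply(1:nrow(data), function(i)
    sapply(1:Q, function(q)
      prod(dbinom(data[i, ], 1, plogis(nodes[q] - b)))))
  wL   <- Lik * weights
  post <- sweep(wL, 2, colSums(wL), "/")        # posterior over nodes, Q x N
  ## d(-logL)/d(beta_j) = sum_{i,q} post[q,i] * (x[i,j] - P[q,j])
  colSums(data) - drop(rowSums(post) %*% P)
}

## Verify against a central difference at an arbitrary point
set.seed(42)
dat     <- matrix(rbinom(30, 1, 0.5), nrow = 10, ncol = 3)
nodes   <- c(-1.7, 0, 1.7)
weights <- c(0.25, 0.5, 0.25)                   # crude 3-point rule
b0 <- c(-0.2, 0.1, 0.4)
g  <- neg_loglik_grad(b0, dat, nodes, weights)
gn <- sapply(1:3, function(k) {
  e <- replace(numeric(3), k, 1e-5)
  (neg_loglik(b0 + e, dat, nodes, weights) -
   neg_loglik(b0 - e, dat, nodes, weights)) / 2e-5
})
max(abs(g - gn))   # should be tiny
```

The posterior normalization (dividing by each person's marginal
likelihood) is what the gradient in mml2 omits.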
### latex describing the likelihood function
\documentclass{article}
\usepackage{latexsym}
\usepackage{amssymb,amsmath,bm}
\newcommand{\Prb}{\operatorname{Prob}}
\title{Marginal Maximum Likelihood for IRT Models}
\author{Harold C. Doran}
\begin{document}
\maketitle
The likelihood function is written as:
\begin{equation}
\label{eqn:mml}
f(x) =
\prod^N_{i=1}\int_{\Theta}\prod^K_{j=1}p(x|\theta,\beta)f(\theta)d\theta
\end{equation}
\noindent where $N$ indexes the total number of individuals, $K$ indexes
the total number of items, $p(x|\theta,\beta)$ is the data likelihood,
and $f(\theta)$ is a population distribution. For the Rasch model, the
data likelihood is:
\begin{equation}
p(x|\theta,\beta) = \prod^N_{i=1}\prod^K_{j=1} \Pr(x_{ij} = 1 | \theta_i,
\beta_j)^{x_{ij}} \left[1 - \Pr(x_{ij} = 1 | \theta_i,
\beta_j)\right]^{(1-x_{ij})}
\end{equation}
\begin{equation}
\label{rasch}
\Pr(x_{ij} = 1 | \theta_i, \beta_j) = \frac{1}{1 +
e^{-(\theta_i-\beta_j)}} \quad i = (1, \ldots, N); j = (1, \ldots, K)
\end{equation}
\noindent where $\theta_i$ is the ability estimate of person $i$ and
$\beta_j$ is the location parameter for item $j$. The integral in
Equation~(\ref{eqn:mml}) has no closed form expression, so it is
approximated using Gauss-Hermite quadrature as:
\begin{equation}
\label{eqn:mml:approx}
f(x) \approx
\prod^N_{i=1}\sum_{q=1}^{Q}\prod^K_{j=1}p(x|\theta_q,\beta)w_q
\end{equation}
\noindent where $\theta_q$ is the node and $w_q$ is the weight at
quadrature point $q$. With Equation~(\ref{eqn:mml:approx}), the
remaining challenge is to find $\underset{\beta}{\operatorname{arg\,max}}
\, f(x)$.
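One way to obtain the analytic gradient from this approximation (a
sketch, writing $L_{iq} = \prod^K_{j=1}p(x_i|\theta_q,\beta)$ as
shorthand for the likelihood of person $i$'s responses at node $q$) is:
\begin{equation}
\label{eqn:mml:grad}
\frac{\partial \log f(x)}{\partial \beta_j} =
\sum^N_{i=1}\frac{\sum_{q=1}^{Q} w_q L_{iq}
\left[\Pr(x_{ij} = 1 | \theta_q, \beta_j) - x_{ij}\right]}
{\sum_{q=1}^{Q} w_q L_{iq}},
\end{equation}
\noindent that is, each node's contribution is weighted by its posterior
probability for person $i$, not by $w_q$ alone.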
\end{document}