
error with princomp

14 messages · Bill Venables, Faheem Mitha, Göran Broström +3 more

#
At 01:08 PM 5/20/00 -0400, Faheem Mitha wrote:
Did traceback() say it came from princomp?  

Did you use traceback() at all?

Do you know about traceback()?

Don't worry if the answer to all three questions is a shrug - many users
are like that, but it really is a fundamental tool for errors like this and
I recommend it to all.

traceback() would have told you that the error is not in princomp() but in
eigen(), the workhorse that princomp() uses.  That has to be a clue.  Have
you looked at the code for eigen()?
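A minimal sketch (with made-up data, mirroring what later turned out to be the culprit) of the kind of session where traceback() earns its keep:

```r
## Hypothetical data: the first column is constant, as in the original post
set.seed(1)
bad <- data.frame(ones = rep(1, 20), x = rnorm(20), y = rnorm(20))

## princomp() fails on this. In R of that era the error surfaced from
## eigen(), the workhorse princomp() calls; recent versions may catch
## the constant column earlier, but either way princomp() errors out.
res <- try(princomp(bad, cor = TRUE), silent = TRUE)

## At an interactive prompt, calling traceback() immediately after the
## error prints the call stack that produced it.
inherits(res, "try-error")
```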

Bill Venables.
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
On Sun, 21 May 2000, Bill Venables wrote:

No.
[1] "eigen(cv)"                  "princomp(prin.df, cor = T)"
Not till Douglas Bates mentioned it.
Used it in Splus, but Splus helpfully reminds you about it. Never realised
it was in R, never thought of using it. I will from now on, though. I
really would like to learn this stuff.
Yup. I'm guessing the relevant part is 

if (symmetric) {
        if (complex.x) {
            xr <- Re(x)
            xi <- Im(x)
            z <- .Fortran("ch", n, n, xr, xi, values = dbl.n, 
                !only.values, vectors = xr, ivectors = xi, dbl.n, 
                dbl.n, double(2 * n), ierr = integer(1), PACKAGE = "base")
            if (z$ierr) 
                stop(paste("ch returned code ", z$ierr, " in eigen"))
            if (!only.values) 
                z$vectors <- matrix(complex(re = z$vectors, im = z$ivectors),
                  nc = n)
        }

But this does not enlighten me. It seems to involve a call to some Fortran
code, and I don't know Fortran.  And I am still freaked that it worked in
Splus and not in R (same data set, same code).

Could someone speculate on possible reasons? I have given up hope of
running this successfully on R (even if I figure out what is wrong,
princomp for a data set this large requires an unholy amount of memory,
probably more than I have available) but I am curious what is causing it
to go kerblooey. I tried princomp with small data sets and it works fine.

                                          Faheem.

#
Faheem Mitha <faheem at email.unc.edu> writes:
The error message said that the problem was in "if (symmetric) {..."
so I'd suspect that it is because symmetric is not a logical - you're
not getting to the Fortran at all. Now, since eigen is not explicitly
being passed a "symmetric" argument, that must come from missing
values in the covariance matrix itself. So where do they come from?

One more hint: debug(princomp) and have a look at the generation of
"cv". It looks like it should be protected from missings, but there
might be a bug somewhere.
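The hint can be illustrated with a small sketch: a column with zero variance makes cor() produce missing values, which would then propagate into eigen() via "cv".

```r
## Made-up data with a constant first column
set.seed(1)
d <- cbind(ones = rep(1, 20), x = rnorm(20), y = rnorm(20))

## cor() warns that the standard deviation is zero and returns NA
## for every correlation involving the constant column (0/0 is undefined)
cc <- suppressWarnings(cor(d))
cc["ones", "x"]   # NA
cc["x", "y"]      # an ordinary finite correlation
```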
#
Is there a function somewhere for estimating the survivor function
(plus confidence limits) when data are both left truncated and right
censored?  survfit in  survival5  doesn't like left truncation.
#
Dear R people,

I have been doing some more sherlocking of a rather primitive sort. Thanks
very much for hints supplied by Peter Dalgaard and Bill Venables. I found
debug(princomp) would just exit with the same error. So, since the
calculations are really quite simple, I decided to follow them along,
creating the objects as needed. This approach was successful in detecting
the source of the problem. The first column of my data set, corresponding
to the first variable, was all ones. This reasonably enough gave a lot of
zero covariances, and hence NaN's were generated down the line when
calculating the correlation matrix, throwing R into a tizzy. This
seems reasonable enough. I don't know enough about principal components
analysis to know whether R should be expected to do anything further with
a correlation matrix with missing values, but I doubt it. (Yes, the
correlation matrix seems to be the problem). I wonder how Splus coped with
it. I may take a look later if I have the energy.

I deleted this column and then ran into a further problem. When I tried to
do princomp(prin.df, cor = T) again I got

Error in princomp(prin.df, cor = T) : covariance matrix is not
non-negative definite.

I took a look at the covariance matrix and indeed some of the eigenvalues
were negative. However, they were very small negative numbers, i.e.

[86] -1.861163e-18 -3.507169e-18 -1.400653e-17 -4.185918e-17 -5.495597e-17
[91] -8.742655e-17 -1.227594e-16 -1.515011e-16 -2.975572e-16 -3.650603e-16
[96] -5.166079e-16 -1.719909e-15

Now, I am no expert on this stuff, but I think the covariance matrix
estimator is simply calculating the covariance matrix of the sample
distribution, so should be guaranteed to give a non-negative definite
matrix. So, I am inclined to think that this is probably a rounding error
and not my fault. What should be done about it? No, please don't tell me
to rewrite the code. I don't know enough.
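One common remedy for this kind of rounding artefact is a tolerance check before declaring the matrix indefinite. A sketch (clamp_eigenvalues is a made-up helper, not anything princomp() itself does):

```r
## Hypothetical helper: zero out eigenvalues that are negative only by
## rounding error, judged relative to the largest eigenvalue.
clamp_eigenvalues <- function(cv, tol = 1e-10) {
  e <- eigen(cv, symmetric = TRUE)
  if (any(e$values < -tol * max(abs(e$values))))
    stop("covariance matrix is not non-negative definite")
  e$values[e$values < 0] <- 0
  e
}

## A rank-deficient covariance matrix: column c = a + b, so one true
## eigenvalue is zero and may come out as a tiny negative number.
set.seed(2)
X <- cbind(a = rnorm(50), b = rnorm(50))
X <- cbind(X, c = X[, "a"] + X[, "b"])
e <- clamp_eigenvalues(cov(X))
all(e$values >= 0)   # TRUE after clamping
```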
                                                 Faheem.

#
Dear R people,

A followup on my previous message. I did a little more investigation and
experimentation. 

Firstly, I tried princomp, with cor = F. I was relieved that it gave me
the "covariance matrix is not non-negative definite" error, otherwise I
would have been really puzzled. So it seems clear that the first column of
my matrix being all ones is what is causing the if(symmetric) error.

Moving on, I was curious why Splus was not giving me an error for the same
reason. I looked at the covariance matrix, and guess what, I get (for
example)

             tanf2    
 tanf2  1.215290e-27

where R gave me zero. (tanf2 is the variable corresponding to the first
column). Clearly zero is the correct answer, so R is the voice of sanity
here. Three cheers for R!

So, I have decided that I will delete that first column of ones and run
the results again. Goodness knows what terrible things this rounding error
may be doing later on in the calculation.

To close off, I would like to briefly explain what I am doing (which is
very simple) and ask if it seems reasonable. I don't want to be one of
those people out there doing nonsense statistics.

I have a data set, of 2061 rows and 99 columns originally. Now I guess it
is going to be 97 columns, since the first column was all zeros (even
Splus choked on this and I deleted it earlier) and the second one was all
ones. Anyway, the first 64 (was 66) columns are binary data. The last 33
are numeric data. Now, I thought that a reasonable thing to do (in fact,
the only thing I could think of) was to treat the first 64 columns as
numeric zeros and ones, and then use the cor = TRUE flag (i.e. use the
correlation matrix instead of the covariance matrix). This is advertised
as a way of handling cases when the data is not all on the same scale. So
that is what I did. Any comments/suggestions?
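The approach can be sketched with simulated data of the same shape (binary columns treated as 0/1 numerics, cor = TRUE to put everything on a common scale):

```r
set.seed(4)
n <- 200
bin <- matrix(rbinom(n * 5, 1, 0.5), n, 5)          # binary columns as 0/1
num <- matrix(rnorm(n * 3, mean = 10, sd = 4), n, 3) # numeric columns
d <- cbind(bin, num)

## cor = TRUE standardises each column, so the differing scales of the
## binary and numeric variables do not dominate the components
pc <- princomp(d, cor = TRUE)
length(pc$sdev)   # one component per column
```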
                                    Sincerely, Faheem Mitha.

#
Faheem Mitha <faheem at email.unc.edu> writes:
Whatever the software, you're likely to get in trouble trying to
interpret the result of a PCA on binary data (using correlations or
not). It can be tricky enough with continuous data....

Anyway, I bet some of those 64 binary columns will turn out to be
linearly dependent, either overtly by some of them summing to a
constant or more subtly because of some combinations being absent.

A QR decomposition of your data matrix might be enlightening. Look at
the rank and the pivoting information. 
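For example, a small sketch of the kind of dependency that would show up, using made-up binary columns:

```r
## Two binary columns plus their sum: exactly linearly dependent
set.seed(3)
n  <- 100
b1 <- rbinom(n, 1, 0.5)
b2 <- rbinom(n, 1, 0.5)
X  <- cbind(b1, b2, both = b1 + b2)

q <- qr(X)
q$rank    # 2, one short of the number of columns
q$pivot   # pivoting moves rank-deficient columns toward the end
```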

It does seem that we're handling the singular case suboptimally,
though. Ideally, one should apply a fuzz factor before declaring that
the matrix isn't NND and I don't think there's a problem with
factoring out a null space in the PCA.
2 days later
#
On Sun, 21 May 2000, gb wrote:

I got only one answer, and that was a function for left censoring (for
which I am grateful). So I wrote my own function. Here it is. Be careful,
there are a few nasty loops... and no serious testing yet.

Tips on improvements are welcome! (Note, it just plots, no return
values. Easy to change, though.)

Göran
------------------------------------------------------------------------
plot.surv <- function(enter = rep( 0, length(exit) ),
                      exit,
                      event = rep( 1, length(exit) ),
                      group = rep( 1, length(exit) ),
                      limits = F,
                      conf = 0.95)
  {
    ## Input data:
    ##
    ## enter : left truncation point
    ## exit  : exit time point
    ## event : if zero, a censored observation; otherwise an event.
    ## group : one curve for each distinct value in group.
    ## limits: if TRUE, and only one group, pointwise confidence
    ##         limits (Greenwoods formula, log(-log) type).
    ## conf  : confidence level. Can be given as a percentage.
    
    ## Check input data:
    n <- length(exit)
    if (length(enter) != n) stop("enter and exit must have equal length.")
    if (length(event) != n)
      stop("event and exit must have equal length.")
    if (length(group) != n)
      stop("group and exit must have equal length.")
    if (min(exit - enter) <= 0) stop("Interval length must be positive.")
    if (conf > 1) conf <- conf / 100 ## conf given as a percentage(?)
    if ( (conf < 0.5) | (conf >=1) ) stop("Bad conf value")

    grupp <- as.character(group)
   
    strata <- sort(unique(grupp))
    if (length(strata) > 9)
      stop("Too many groups. No more than 9 are allowed.")
    
    ##
    if (length(strata) > 1) limits <- F # No limits if multiple curves.
    gang <- 0
    for (stratum in strata)
      {
        atom <- table.events(enter[grupp == stratum],
                             exit[grupp == stratum],
                             event[grupp == stratum])
        
        gang <- gang + 1
        surv <- c( 1, cumprod(1 - atom$events / atom$riskset.sizes) )
        if (gang == 1)
          {
            plot(c(0, atom$times), surv, type = "s",
                 xlab = "duration", ylab = "Surviving fraction",
                 main = "Survivor function(s)", ylim = c(0, 1),
                 col = gang%%5)
            if (limits) ## Greenwood's formula,
                        ## Kalbfleisch & Prentice, p. 15 (note error!).
              {
                q.alpha <- abs(qnorm((1 - conf) / 2))
                survived <- (atom$riskset.sizes - atom$events)
                se <- sqrt(cumsum(atom$events /
                                   ( atom$riskset.sizes * survived )
                                   )
                           )/
                            cumsum(-log(survived / atom$riskset.sizes))
                upper <- surv ^ exp(q.alpha * c(0, se))
                lower <- surv ^ exp(-q.alpha * c(0, se))
                lines(c(0, atom$times), upper, type = "s",
                      col = gang%%5 + 1)
                lines(c(0, atom$times), lower, type = "s",
                      col = gang%%5 + 1)
              }
          }
        else
          {
            lines(c(0, atom$times), surv, type = "s", 
                  col = gang%%5)
          }
      }
    abline(h=0)
    if (length(strata) > 1)
      {
        ## legend.txt <- as.character(strata)
        colors <- (1:length(strata))%%5
        legend(0, 0.6, lty = 1, legend = strata, col = colors)
      }
  }

table.events <- function(enter = rep(0, length(exit)),
                         exit,
                         event)
{
  n <- length(exit)

  ## Check input data:
  if ( length(enter) != n ) stop("enter and exit must have equal length.")
  if ( length(event) != n ) stop("event and exit must have equal length.")
  ##
  
  event <- (event != 0) ## 0 (F) = censoring, else (T) = event

  times <- c(unique(sort(exit[event])))
  nn <- length(times)

  rs.size <- double(nn)
  n.events <- double(nn)

  for (i in 1:nn) ## Try to avoid this loop!
    {
      rs.size[i] <- sum( (enter < times[i]) &
                        (exit >= times[i]) )
      n.events[i] <- sum( (exit == times[i]) & event )
    }

  stop.at <- which(rs.size == n.events)
  if (length(stop.at))
    {
      stop.at <- min(stop.at) - 1
      if (stop.at <= 0) stop("Bad data. All died immediately!")
      times <- times[1:stop.at]
      n.events <- n.events[1:stop.at]
      rs.size <- rs.size[1:stop.at]
    }
      
  return ( list(times         = times,
                events        = n.events,
                riskset.sizes = rs.size)
          )
}
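The loop flagged in table.events() can indeed be avoided. A sketch of a vectorised replacement (riskset.sizes2 is a made-up name), building the at-risk indicator matrix with outer():

```r
## At risk at time t: entered before t and not yet exited at t
riskset.sizes2 <- function(enter, exit, times)
  colSums(outer(enter, times, "<") & outer(exit, times, ">="))

## Check against the loop on a small made-up sample
enter <- c(0, 0, 1, 2)
exit  <- c(3, 5, 4, 6)
times <- c(3, 4, 5, 6)
rs <- riskset.sizes2(enter, exit, times)   # 4 3 2 1
```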

Have fun! And send bug reports to
1 day later
#
I have got two late answers to this question, from Mai Zhou and Bendix
Carstensen. Both point out a 'trick': Use  coxph  to fit a model with
no covariates and ask for the 'baseline hazard' of that fit!  
Unlike  survfit,  coxph  allows for left truncated data.
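A minimal sketch of the trick, with made-up left-truncated data (in the counting-process form of Surv(), the first argument carries the truncation time):

```r
library(survival)

## Hypothetical data: 'enter' is the left-truncation time
d <- data.frame(enter = c(0, 1, 2, 0, 1, 0),
                exit  = c(5, 4, 6, 3, 7, 2),
                event = c(1, 1, 0, 1, 1, 1))

## Null Cox model: no covariates, just the truncation/censoring structure
fit <- coxph(Surv(enter, exit, event) ~ 1, data = d)
sf  <- survfit(fit)   # survivor estimate of the null fit
```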

Thanks to both!

Göran
#
R-users!

How do I ask for the 'baseline hazard' of a coxph fit??

Fredrik Lundgren
----- Original message -----
From: "gb" <gb at stat.umu.se>
To: <r-help at stat.math.ethz.ch>
Sent: 25 May 2000 21:25
Subject: Re: [R] Kaplan-Meier for left truncated data?
#
On Fri, 26 May 2000, Fredrik Lundgren wrote:

This is interesting:

Error in survfit.km(X, Y, casewt, ...) : Can only handle right censored data

Does not work :-(
Is ok! :-)

Göran

#
On Fri, 26 May 2000, Fredrik Lundgren wrote:

One way is to take -log of the survival function
    survfit(a.coxph.fit)$surv
gives the survival function at the mean covariates, so
   -log(survfit(a.coxph.fit)$surv)
gives the hazard at the mean covariates.

Many people define the baseline hazard as the hazard at all covariates=0
in which case you need to supply these values to survfit. This can be
numerically unstable and in any case is usually not what you really want.
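For instance, with the lung data that ships with the survival package (a sketch, not from the original thread):

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age, data = lung)
sf  <- survfit(fit)    # survival at the mean covariates by default
H   <- -log(sf$surv)   # cumulative hazard at the mean covariates
```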

	-thomas
Thomas Lumley
Assistant Professor, Biostatistics
University of Washington, Seattle

#
summary of "How do I ask for the 'baseline hazard' of a coxph fit??"

Thank you to Thomas Lumley, Brian Ripley, Göran Broström and Bendix Carstensen for their tips to use
plot(survfit(a.coxph.fit)). Thomas Lumley suggests that to get the "real baseline hazard" I should use
-log(survfit(a.coxph.fit)$surv) and set all covariates=0
in which case I need to supply these values to survfit. But how do you supply covariate values to survfit?
Shouldn't  they be supplied to coxph in some way?

Fredrik Lundgren

"Thomas Lumley wrote may 26 2000 "
#
On Fri, 26 May 2000, Fredrik Lundgren wrote:

No, to survfit.  coxph fits the model, it doesn't care what you do to it
afterwards.  If you look at the documentation for survfit() it has a
data= argument to specify which covariate values you want the survival
curve for.
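A sketch of supplying covariate values, again using the lung data; note that in current versions of the survival package the argument is called 'newdata':

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age, data = lung)

## Survivor curve evaluated at age = 0, the "all covariates = 0" baseline
## (as noted above, this can be numerically unstable far from the data)
sf0 <- survfit(fit, newdata = data.frame(age = 0))
```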

	-thomas

Thomas Lumley
Assistant Professor, Biostatistics
University of Washington, Seattle
