Dear all,

not really an R question, but: I want to check the classification accuracy of an LDA, preceded by a PCA for dimensionality reduction, by means of the LOOCV method.

Procedure 1: Is it OK to do the PCA ONCE on the WHOLE dataset and then run the LDA with the CV option set to TRUE (which runs LOOCV)

-- OR --

Procedure 2: do I need
- to compute a PCA for each 'test-bag' (the n-1 observations) (-> my.princomp.1),
- then run the LDA on the test-bag scores (-> my.lda.1),
- then compute the scores of the left-out observation using my.princomp.1 (-> my.scores.2),
- and only then use predict.lda(my.lda.1, my.scores.2) on the scores of the left-out observation?

I have read some articles in which procedure 1 was chosen, but I am not sure whether this is really correct.

Many thanks for a hint,
Christoph
LDA with previous PCA for dimensionality reduction
7 messages · Christoph Lehmann, Torsten Hothorn, David Enot +2 more
Dear Christoph,
I guess you want to assess the error rate of an LDA that has been fitted to a
set of currently existing training data, and that in the future you will get
some new observation(s) for which you want to make a prediction.
Then I'd say that you want to use the second approach. You might find that
the first step (the PCA) turns out to be crucial; after all, your whole subsequent
LDA is contingent on the PC scores you obtain in the previous step. Somewhat
similar issues have been discussed in the microarray literature. Two
references are:
@ARTICLE{ambroise-02,
author = {Ambroise, C. and McLachlan, G. J.},
title = {Selection bias in gene extraction on the basis of microarray
gene-expression data},
journal = {Proc Natl Acad Sci USA},
year = {2002},
volume = {99},
pages = {6562--6566},
number = {10},
}
@ARTICLE{simon-03,
author = {Simon, R. and Radmacher, M. D. and Dobbin, K. and McShane, L. M.},
title = {Pitfalls in the use of DNA microarray data for diagnostic and
prognostic classification},
journal = {Journal of the National Cancer Institute},
year = {2003},
volume = {95},
pages = {14--18},
number = {1},
}
I am not sure, though, why you use PCA followed by LDA. But that's another
story.
Best,
R.
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://ligarto.org/rdiaz
PGP KeyID: 0xE89B3462 (http://ligarto.org/rdiaz/0xE89B3462.asc)
On Wed, 24 Nov 2004, Ramon Diaz-Uriarte wrote:
Dear Christoph, I guess you want to assess the error rate of an LDA that has been fitted to a set of currently existing training data, and that in the future you will get some new observation(s) for which you want to make a prediction. Then I'd say that you want to use the second approach. You might find that the first step (the PCA) turns out to be crucial; after all, your whole subsequent LDA is contingent on the PC scores you obtain in the previous step.
Ramon,
as long as one does not use the information in the response (the class variable, in this case) I don't think that one ends up with an optimistically biased estimate of the error (although leave-one-out is a suboptimal choice). Of course, when one starts to "tune" the method used for dimension reduction, a selection of the procedure with minimal error will produce a bias. Or am I missing something important?
Btw, `ipred::slda' implements something not completely unlike the procedure Christoph is interested in.
Best,
Torsten
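For readers following along, the first procedure from the original question (PCA once on all observations, then LDA with CV = TRUE) could be sketched roughly as below. This is only an illustration under stated assumptions: the iris data stand in for the poster's dataset, and the number of retained components (n.pc) is fixed in advance at an arbitrary value.

```r
library(MASS)  # provides lda()

# Illustrative data (assumption): iris stands in for the real dataset
X <- as.matrix(iris[, 1:4])
y <- iris$Species
n.pc <- 2  # number of components kept; arbitrary, fixed in advance (assumption)

# PCA once on all n observations; the class labels are not used here
my.princomp <- princomp(X)
scores <- my.princomp$scores[, 1:n.pc, drop = FALSE]

# With CV = TRUE, lda() returns leave-one-out class predictions directly
my.lda.cv <- lda(scores, grouping = y, CV = TRUE)

# LOOCV error with the PCA placed outside the cross-validation
loocv.error.outside <- mean(as.character(my.lda.cv$class) != as.character(y))
```

Note that only the LDA step is cross-validated here; every left-out observation has already contributed to the PCA.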
Thank you, Torsten; that's what I thought: as long as one does not use the 'class label' as a constraint in the dimension reduction, the procedure is OK. Of course it is computationally more demanding, since for each new observation (unknown with respect to the class label) one has to compute a new PCA as well.
Cheers,
Christoph
As far as I understand your problem (assessing the predictive ability of your model), the second solution is the one to use: the test set should never be seen during training. If you run your PCA on the whole set, you take your test observation into account while forming your training data. Keep in mind that your classifier is made up of 2 components: PCA followed by LDA. This is fine if you build your model with a given number of PCs; if you also want to find an optimal number of PCs, that selection has to follow the same scheme, using only the (n-1) training examples. A proper validation of the model can quickly become tricky: it requires a bit of computing skill and may take longer (especially with LOO)!
Hope it helps,
David
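A minimal sketch of the second procedure, with the PCA refitted inside each leave-one-out fold, might look like the following. The object names my.princomp.1, my.lda.1 and my.scores.2 follow the original question; the iris data and the fixed number of components (n.pc) are assumptions made for the example. Choosing n.pc from the data would need a further selection step inside each fold, which is not shown.

```r
library(MASS)  # provides lda()

# Illustrative data (assumption): iris stands in for the real dataset
X <- as.matrix(iris[, 1:4])
y <- iris$Species
n <- nrow(X)
n.pc <- 2  # number of components kept; arbitrary, fixed in advance (assumption)

pred <- character(n)
for (i in seq_len(n)) {
  # PCA on the n-1 retained observations only
  my.princomp.1 <- princomp(X[-i, , drop = FALSE])
  train.scores <- my.princomp.1$scores[, 1:n.pc, drop = FALSE]

  # LDA fitted on the scores of the retained observations
  my.lda.1 <- lda(train.scores, grouping = y[-i])

  # Project the left-out observation using the centre and loadings of my.princomp.1
  my.scores.2 <- predict(my.princomp.1,
                         newdata = X[i, , drop = FALSE])[, 1:n.pc, drop = FALSE]

  # Predict the class of the left-out observation
  pred[i] <- as.character(predict(my.lda.1, my.scores.2)$class)
}

# LOOCV error with the PCA placed inside the cross-validation
loocv.error.inside <- mean(pred != as.character(y))
```

The loop refits both components of the classifier n times, which is where the extra computing time mentioned above comes from.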
Torsten Hothorn writes:
as long as one does not use the information in the response (the class variable, in this case) I don't think that one ends up with an optimistically biased estimate of the error
I would be a little careful, though. The left-out sample in the LDA cross-validation will still have influenced the PCA used to build the LDA on the rest of the samples. The sample will have a tendency to lie closer to the centre of the "complete" PCA than to the centre of a PCA on the remaining samples. Also, if the sample has a high leverage on the PCA, the directions of the two PCAs can be quite different. Thus, the LDA is built on data that "fits" the left-out sample better than it would if the sample were completely new. I have no proofs or numerical studies showing that this gives over-optimistic error rates, but I would not recommend placing the PCA "outside" the cross-validation. (The same goes for any resampling-based validation.)
Bjørn-Helge Mevik
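The two effects described above can be made visible on a small simulated example. The sketch below is purely illustrative and does not say anything about error rates: the data are random, the seed is arbitrary, and the first sample is artificially moved far from the rest to give it high leverage (all of these are assumptions).

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated data (assumption): 40 samples, 3 variables
X <- matrix(rnorm(40 * 3), ncol = 3)
X[1, ] <- c(8, 8, 8)  # artificial high-leverage sample

pca.full <- princomp(X)        # "complete" PCA, sample 1 included
pca.loo  <- princomp(X[-1, ])  # PCA on the remaining samples

# Angle (in degrees) between the first principal axes of the two PCAs;
# the sign of a loading vector is arbitrary, hence abs()
v.full <- unclass(pca.full$loadings)[, 1]
v.loo  <- unclass(pca.loo$loadings)[, 1]
cos.angle <- abs(sum(v.full * v.loo))
angle.deg <- acos(min(1, cos.angle)) * 180 / pi

# Distance of sample 1 from the centre in each score space
# (all components kept, so this equals the distance to each PCA's centre)
score.full <- predict(pca.full, newdata = X[1, , drop = FALSE])
score.loo  <- predict(pca.loo,  newdata = X[1, , drop = FALSE])
dist.full <- sqrt(sum(score.full^2))
dist.loo  <- sqrt(sum(score.loo^2))
```

Because the overall mean is pulled towards sample 1 when it is included, dist.full is smaller than dist.loo by the factor (n-1)/n; angle.deg shows how far the first axis moves when the sample is removed.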
Dear Christoph, David, Torsten and Bjørn-Helge,
I think that Bjørn-Helge has made more explicit what I had in mind (which I think is also close to what David mentioned). In addition, at the very least, not placing the PCA inside the cross-validation will underestimate the variance in the predictions.
Best,
R.