
On The Choice of a Classification Approach

3 messages · Alexandre F. Souza, Sarah Goslee, francois gillet

#
Hello,

I am trying to find a method to cluster species based on their quantitative
traits and at the same time obtain a threshold value for each node in the
decision tree. My difficulty is that my dependent variable is the list of
species names, each species appearing as a single line with no repetition.
All explanatory variables are quantitative. As far as I understand,
classification trees need a dependent variable with repeated levels, as in
the iris dataset, in which each species appears several times. All the
examples employing classification trees I found use a dependent variable,
but I do not have one except for the species names. MRT (multivariate
regression trees) uses a species by location matrix as the dependent
variable, and traditional hierarchical cluster analyses do cluster species
but do not use quantitative data to that aim, nor produce threshold values.
I can run a non-hierarchical cluster analysis like kmeans, but these do not
generate threshold values. My concern is that without threshold values any
classification I produce will be restricted to the studied species and will
not be applicable to different species that can be found in the studied
region, which would be a strong limitation to the use of such a
classification.

Thank you very much in advance for any ideas.

Regards,

Alexandre
#
Hi,

I think I understand what you're after, but if not then please correct
me/expand on your question.

The datasets like iris that you reference are for supervised
classification, where the true values are known and the goal is to
quantitatively predict which species an individual belongs to based on
its attributes.

What you have is an unsupervised classification problem, where you
want to group species based on their attributes without knowing the
answer first. The species name is irrelevant here, except as a
reference for you, because you don't yet know anything about the group
membership. You're right that you can't do a classification tree,
because you don't already know the groups.

I think by threshold values you're actually looking for a predict
method that can assign new species to the existing clusters, right?
That's doable. Many clustering functions, like flexclust::kcca, offer
predict methods directly.
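
As a rough illustration of what such a predict step does, here is a minimal base-R sketch, assuming k-means clusters and a simulated trait table. The trait names, the data, and the helper assign_cluster() are all made up for illustration. New species are standardized with the training means and standard deviations and then assigned to the nearest cluster centroid; with flexclust, the equivalent step would be a call to predict() on the fitted kcca object.

```r
# Simulated stand-in for a species x traits table (hypothetical data)
set.seed(1)
traits <- data.frame(
  height = c(rnorm(15, 10, 1), rnorm(15, 30, 2)),
  sla    = c(rnorm(15, 5, 0.5), rnorm(15, 15, 1))
)
z  <- scale(traits)                        # standardize the traits
km <- kmeans(z, centers = 2, nstart = 25)  # partition into 2 clusters

# Hypothetical helper: assign each row of newdata to the nearest centroid
# (squared Euclidean distance to every row of `centers`)
assign_cluster <- function(newdata, centers) {
  apply(newdata, 1, function(x) which.min(colSums((t(centers) - x)^2)))
}

# Two new species, standardized with the *training* means and SDs
new_sp <- data.frame(height = c(11, 28), sla = c(5.5, 14))
new_z  <- scale(new_sp,
                center = attr(z, "scaled:center"),
                scale  = attr(z, "scaled:scale"))
new_cl <- assign_cluster(new_z, km$centers)
new_cl
```

The important detail is that new species are never re-clustered: they are only compared with centroids that were fixed when the original species were grouped.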

After that, the better question is how do the assumptions of different
clustering methods fit your understanding of the structure of the
data? Do you want hierarchical classification into nested groups and
subgroups? Or do you NOT want that structure, in which case a
partitioning method might be more appropriate.
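
To make the distinction concrete, here is a small sketch on simulated traits (the data and trait names are assumptions, not from any real study). Cutting one hierarchical tree at two levels always gives nested groups, whereas k-means solutions for different k are fitted separately and need not nest.

```r
# Simulated stand-in for a species x traits table (hypothetical data)
set.seed(2)
traits <- data.frame(
  height = c(rnorm(10, 10, 1), rnorm(10, 20, 1), rnorm(10, 40, 2)),
  sla    = c(rnorm(10, 5, 0.5), rnorm(10, 8, 0.5), rnorm(10, 15, 1))
)
z <- scale(traits)

# Hierarchical: one tree, cut at different heights, gives nested groups
hc <- hclust(dist(z), method = "ward.D2")
h2 <- cutree(hc, k = 2)
h3 <- cutree(hc, k = 3)
nested <- table(fine = h3, coarse = h2)
nested   # each fine group falls inside exactly one coarse group

# Partitioning: each k is a separate solution, nesting is not guaranteed
k2 <- kmeans(z, centers = 2, nstart = 25)$cluster
k3 <- kmeans(z, centers = 3, nstart = 25)$cluster
table(fine = k3, coarse = k2)
```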

Are the variables correlated? If so, do you need to create uncorrelated
variables first, or do you need a method that works with correlated ones?
Is there a particular distance metric that would be more suitable for
your data?
And so on.
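
One possible way to look at these questions, sketched in base R on simulated traits (the data and names are assumptions): inspect the correlation matrix, then compare plain Euclidean distance on standardized traits with a Mahalanobis-type distance, obtained here by whitening the traits so that they become uncorrelated with unit variance. Whether either distance suits a given dataset is a judgment call, not something this sketch decides.

```r
# Simulated traits where two variables are strongly correlated (hypothetical)
set.seed(3)
height <- rnorm(40, 20, 5)
traits <- cbind(
  height    = height,
  leaf_area = 2 * height + rnorm(40, 0, 3),  # closely tied to height
  sla       = rnorm(40, 10, 2)
)

# How strongly are the traits correlated?
round(cor(traits), 2)

# Option 1: Euclidean distance on standardized traits
# (correlated traits are effectively counted more than once)
d_euc <- dist(scale(traits))

# Option 2: whiten the centered traits with the Cholesky factor of the
# covariance matrix; Euclidean distance on w equals Mahalanobis distance
w     <- scale(traits, scale = FALSE) %*% solve(chol(cov(traits)))
d_mah <- dist(w)
```

Either distance object can be passed to hclust() or another clustering function that accepts dissimilarities.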

If it were my data, I'd also do an ordination and look at the
structure in reduced dimensions.
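
For purely quantitative traits, one simple ordination is a PCA on standardized variables. The sketch below uses simulated data (trait names and values are assumptions) and base R's prcomp(); other ordination methods may fit real trait data better.

```r
# Simulated stand-in for a species x traits table (hypothetical data)
set.seed(4)
traits <- data.frame(
  height    = c(rnorm(15, 10, 1), rnorm(15, 30, 2)),
  sla       = c(rnorm(15, 5, 0.5), rnorm(15, 15, 1)),
  seed_mass = rnorm(30, 2, 0.3)
)

# PCA on centered, unit-variance traits
pca <- prcomp(traits, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each axis
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)

# Species scores on the first two axes
scores <- pca$x[, 1:2]

# Draw the ordination only in an interactive session
if (interactive()) {
  plot(scores, xlab = "PC1", ylab = "PC2", main = "Species in trait space")
}
```

Visible clumps or gradients in such a plot give a first hint of whether discrete groups are a reasonable description of the data.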

Sarah

On Mon, Nov 1, 2021 at 3:39 PM Alexandre F. Souza
<alexsouza.cb.ufrn.br at gmail.com> wrote:

  
    
2 days later
#
Dear Alexandre,

To build a predictive model of functional groups of species based on a set of traits, you can simply apply a classification tree to the clusters obtained from the same species x traits table.
Here is an example with trait data from New Zealand vascular plant species, using the mvpart() function in the mvpart package:

library(FD)
library(mvpart)
# Gower dissimilarity matrix for mixed trait variables
gd <- gowdis(tussock$trait)
# Ward hierarchical clustering
gc <- hclust(gd, "ward.D2")
plot(gc, hang = -1)
rect.hclust(gc, 6)
# 6 clusters or plant functional types
fg <- cutree(gc, 6)
# Classification tree
tra.ct <- mvpart(as.factor(fg) ~ ., tussock$trait)

You get a decision tree with threshold values for the discriminating qualitative or quantitative traits.

Unfortunately, and for obscure reasons, mvpart has not been available from CRAN for years. However, you can still install this great package from the GitHub mirror of the CRAN archive:
devtools::install_github("cran/mvpart", force = TRUE)
You can also use the more limited rpart::rpart function instead.
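
For readers without mvpart, the same cluster-then-tree idea could be sketched with rpart roughly as follows. This is an illustration on simulated quantitative traits, not the tussock example above; the trait names, the number of groups, and the minsplit setting are all assumptions chosen for the example.

```r
library(rpart)

# Simulated stand-in for a species x traits table (hypothetical data)
set.seed(42)
traits <- data.frame(
  height    = c(rnorm(20, 10, 1), rnorm(20, 30, 2), rnorm(20, 30, 2)),
  sla       = c(rnorm(20, 5, 0.5), rnorm(20, 5, 0.5), rnorm(20, 15, 1)),
  seed_mass = rnorm(60, 2, 0.3)
)
rownames(traits) <- paste0("sp", 1:60)

# Step 1: Ward clustering on standardized traits, cut into 3 groups
hc  <- hclust(dist(scale(traits)), method = "ward.D2")
dat <- cbind(traits, group = factor(cutree(hc, k = 3)))

# Step 2: classification tree predicting cluster membership from the traits
fit <- rpart(group ~ ., data = dat, method = "class",
             control = rpart.control(minsplit = 5))
print(fit)   # each split shows a trait and its threshold value

# Step 3: assign a species that was not in the original table
new_species <- data.frame(height = 29, sla = 14.5, seed_mass = 2.1)
pred <- predict(fit, newdata = new_species, type = "class")
pred
```

The tree is fitted on the raw traits, so the printed thresholds are in the original measurement units even though the clustering used standardized values.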

Best,

François


