5 Community (dis)similarity

Here we explore how similar ecological communities are. With this, we mean how much their species composition overlaps. Dissimilarity is just the opposite of similarity, both measures carry the same information.

5.1 Reading and transforming community data

Let’s read our community data and explore abundance data. Because abundance data tends to be skewed (few species have high abundance and many have low abundance), it is often reasonable to transform abundance to get a better distribution.

# Reading previously saved data
load("community.rda", verbose = T) # loading from previous
## Loading objects:
##   vas.plants
##   tree.counts
##   forest.types
##   xy
##   soil.data
##   tables.join
comm.data <- vas.plants # copy of a community data


# Most abundant species

max.spp <- which.max(colSums(comm.data))
max.spp
## OXALacet 
##       33

hist(comm.data[comm.data > 0])  # histogram of requencies (other than 0)

hist(sqrt(comm.data[comm.data > 0])) # square root transformation

hist(log1p(comm.data[comm.data > 0])) # log (x+1) transformation

We can see that log-tranformation gives approximately normal distribution, it downweights frequent taxa.

5.2 Community distance matrix

We can measure dissimilarity by using several indices. Here are some examples.

Euclidean distance:

\(d_{jk} = \sqrt{\sum_{i=1}^{n}{(x_{ij}-x_{ik})^2}}\)

where \(x\) is abundance of species \(i\) in sites \(j\) and \(k\) and \(n\) is the total number of taxa.

Bray-Curtis distance:

\(d_{jk} = \frac{\sum_{i=1}^{n}{|x_{ij}-x_{ik}|}}{\sum_{i=1}^{n}{(x_{ij}+x_{ik})}}\)

Euclidean distance does not have an upper limit, whereas Bray-Curtis distance is bounded between 0 and 1.

With the vegdist function we can calculate various distance measures. It returns a triangular distance matrix, because distance from sample A to B is the same as from B to A.


library(vegan)

vegdist(comm.data[1:5, ], "euclidean") # triangular distance matrix
##              X001Vapramae X002Illi X003Vitipalu X004Konguta
## X002Illi         6.422616                                  
## X003Vitipalu     6.164414 7.729812                         
## X004Konguta      7.297260 8.544004     9.473648            
## X005Vehendi      8.154753 5.766281     9.746794    9.708244


# Calculating some distances and ploting against each other.


vd1 <- vegdist(comm.data, "euclidean")
vd2 <- vegdist(log1p(comm.data), "euclidean") # log transformation
vd3 <- vegdist(log1p(comm.data), "bray")
pairs(cbind(vd1, vd2, vd3))

Explore other community (dis)similarity measures.

5.3 Hierachical clustering

Using a distance matrix we can perform a clustering of our samples. Initially all samples form their own clusters, then we start to join the most similar sites and form clustering trees.

Single linkage is based on the most similar members of two clusters. Complete linkage is based on the most dissimilar member of two clusters. Average linkage is based on calculating the average similarity between all members. Ward method is more complex, aiming to minimize the variance within clusters. You can check the Wildi book for more details.

If we have a hierarchical cluster tree, we can always cut this to any number of clusters.


vd <- vd2  # selecting a distance matrix for future calculations
o.clu.s <- hclust(vd, method = "single")
o.clu.c <- hclust(vd, method = "complete")
o.clu.a <- hclust(vd, method = "average")
o.clu.w <- hclust(vd, method = "ward.D2")

par(mfrow = c(2, 2)) # several figures together 2 rows and 2 columns!

plot(as.dendrogram(o.clu.s), main = "single")
plot(as.dendrogram(o.clu.c), main = "complete")
plot(as.dendrogram(o.clu.a), main = "average")
plot(as.dendrogram(o.clu.w), main = "ward.D2")


o.clu <- o.clu.w # selecting the most logical (Ward linkage)


par(mfrow = c(1, 1)) # single figure again.

## Cutting tree to parts

o.grel <- cutree(o.clu, k = 5)
plot(as.dendrogram(o.clu))
rect.hclust(o.clu, 5, border = "red")


# Similarity between sites (colors), ordered along clusters
image(as.matrix(vd)[order(o.grel), order(o.grel)], asp = T, 
      col = hcl.colors(8, palette = "viridis"))

5.4 k-means clustering

Not hierarchical – just give number of clusters needed. Computation is complex, based on machine learning and iterations.


k.o <- kmeans(comm.data, 5)
k.o$cluster
##   X001Vapramae       X002Illi   X003Vitipalu    X004Konguta    X005Vehendi 
##              4              2              3              1              2 
##    X006.Erumae   X007Porgumae     X008Aardla   X009Kurepalu  X010Sarakuste 
##              5              3              5              2              3 
##      X011Kannu      X012Vonnu     X013Rookse    X014Kammeri     X015Kambja 
##              3              4              3              4              3 
##      X016Reola     X017Ignase    X018Unikula X019Haavametsa X020Kruusahaua 
##              5              4              4              2              3 
##      X021Aravu  X026Pedajamae       X027Miti    X028Puhaste  X031Pimmelaan 
##              2              4              2              2              4 
##     X032Logina  X033Voorepalu X036Taevaskoja    X044Savimae X073Pausakunnu 
##              5              2              2              2              3

image(as.matrix(vd1)[order(k.o$cluster), order(k.o$cluster)], asp = T,
      col = hcl.colors(8, palette = "viridis"))

5.5 How many clusters?

Above we defined 5 clusters but can we find how many clusters are optimal? One method is to inspect at what number of groups the correlation between the real distance matrix between sites and the distance matrix between clusters maximizes. This distance in clusters can only include 0 (in different cluster) and 1 (in the same cluster) but we can still calculate the correlation.


# Lets define a vector for correlations
correls <- numeric() # making a numeric vector

for (i in 2:(nrow(comm.data) - 1)) {
  # loop for possible cluster numbers. We do not use 1 (all in the same cluster) and number of samples (all in different clusters)
  
  clusters <- cutree(o.clu, k = i) # defining clusters
  clusdist <- vegdist(table(1:30, clusters), "bray")  # dist calculates Bray distance
  # distance is 0 (in different cluster) or 1 (same cluster)
  
  # table(1:30,clusters) makes a table of 30 sites vs clusters, 1 if a site is in a cluster
  
  correls[i] <- cor(vd, clusdist) # correlation as a measure
  
}

plot(correls, type = "h", xlab = "No of clusers", ylab = "Correlation")
max.corr <- which.max(correls) # which is max correlation?
points(max.corr, correls[max.corr], pch = 16, col = "red", cex = 2)
axis(side = 1, at = max.corr, labels = max.corr, col.axis = "red")


# A nice graph!

5.6 Human-defined forest types vs. clusterings

Now we will compare how the clusters that we defined using maths are related to human-defined forest types.

forest.types # this vector contains the forest type each site is categorized as
##  [1] "115" "113" "115" "151" "113" "113" "116" "116" "113" "116" "116" "114"
## [13] "116" "114" "116" "114" "114" "115" "113" "116" "151" "114" "113" "113"
## [25] "116" "151" "113" "113" "151" "114"
image(as.matrix(vd1)[order(forest.types), order(forest.types)], asp = T,
      col = hcl.colors(8, palette = "viridis"))


table(o.grel, forest.types) # cross-table
##       forest.types
## o.grel 113 114 115 116 151
##      1   0   3   3   0   0
##      2   8   1   0   0   3
##      3   0   0   0   1   1
##      4   1   0   0   1   0
##      5   0   2   0   6   0
table(k.o$cluster, forest.types)
##    forest.types
##     113 114 115 116 151
##   1   0   0   0   0   1
##   2   8   0   0   0   2
##   3   0   1   1   6   0
##   4   0   4   2   1   0
##   5   1   1   0   1   1
table(k.o$cluster, o.grel)
##    o.grel
##      1  2  3  4  5
##   1  0  0  1  0  0
##   2  0 10  0  0  0
##   3  1  0  0  0  7
##   4  4  1  1  0  1
##   5  1  1  0  2  0

# Fisher exact test of two groups
fisher.test(table(o.grel, forest.types))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(o.grel, forest.types)
## p-value = 6.84e-07
## alternative hypothesis: two.sided
fisher.test(table(k.o$cluster, forest.types))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(k.o$cluster, forest.types)
## p-value = 2.745e-06
## alternative hypothesis: two.sided
fisher.test(k.o$cluster, o.grel)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  k.o$cluster and o.grel
## p-value = 6.744e-09
## alternative hypothesis: two.sided

All tests are highly significant, but there are also some differences.

5.7 Which species define clusters?

We want to know which species actually define clusters. There are several indices which show how well species differ between clusters. Here we use the IndVal Index. There are ways to calculate statistical significance as well for different clusters.

library(indicspecies)

# IndVal index (species indicator values)

indval <- multipatt(comm.data, o.grel, func = "IndVal")
indval$sign
##            s.1 s.2 s.3 s.4 s.5 index      stat p.value
## ACTAspic     1   0   0   0   1     9 0.4629100   0.350
## AEGOpoda     0   0   0   0   1     5 0.5590170   0.165
## ANEMnemo     1   0   0   0   1     9 0.8144110   0.005
## ASAReuro     1   0   0   0   1     9 0.5345225   0.140
## ATHYfil.fe   0   0   0   0   1     5 0.5000000   0.330
## BRACpinn     1   0   0   0   0     1 0.4082483   0.350
## CALAarun     1   1   0   1   1    28 0.8451543   0.180
## CALLvulg     1   0   0   0   0     1 0.3535534   0.720
## CAREdigi     1   0   1   0   1    20 0.7245688   0.015
## CIRCalpi     0   0   1   0   0     3 0.7745967   0.025
## CONVmaja     1   0   1   1   0    19 0.7450517   0.035
## DESCflex     0   1   0   0   1    12 0.3162278   1.000
## DRYOcart     1   0   1   1   1    29 0.7669650   0.045
## DRYOexpa     0   0   0   0   1     5 0.3872983   0.825
## EPILangu     0   0   1   0   0     3 0.7071068   0.140
## FESTovin     0   1   0   0   0     2 0.5000000   0.165
## FRAGvesc     1   0   1   0   1    20 0.7133923   0.015
## GALElute     0   0   0   0   1     5 0.9354143   0.005
## GERAsylv     1   0   0   0   1     9 0.4629100   0.345
## GYMNdryo     0   0   1   0   1    14 0.6454972   0.035
## HEPAnobi     1   0   0   0   1     9 0.7357672   0.010
## IMPAnoli     0   0   0   1   0     4 0.7071068   0.160
## LATHvern     0   0   0   0   1     5 0.5000000   0.255
## LUZUpilo     1   1   1   0   1    27 0.8017837   0.335
## LYCOanno     0   1   1   1   0    22 0.6123724   0.145
## MAIAbifo     1   1   1   0   1    27 0.8237545   0.260
## MELAprat     0   1   0   0   0     2 0.8660254   0.005
## MELInuta     0   0   1   0   0     3 0.7071068   0.175
## MILIeffu     0   0   0   0   1     5 0.7905694   0.005
## MOLIcaer     0   1   0   0   0     2 0.2886751   1.000
## MYCEmura     1   0   1   1   0    19 0.7483315   0.005
## ORTHsecu     0   0   0   1   0     4 0.5000000   0.280
## OXALacet     1   0   1   0   1    20 0.9538920   0.005
## PARIquad     1   0   1   1   1    29 0.6666667   0.085
## PTERaqui     1   0   1   0   1    20 0.4330127   0.490
## RANUcass     0   0   0   0   1     5 0.3535534   0.600
## RUBUsaxa     1   0   1   0   1    20 0.9278844   0.005
## STELholo     0   0   0   0   1     5 0.8134892   0.005
## STELnemo     0   0   1   0   1    14 0.6324555   0.050
## TRIEeuro     1   1   1   0   1    27 0.8017837   0.350
## VACCmyrt     1   1   1   0   0    16 0.9356763   0.005
## VACCvit.id   0   1   1   0   0    10 0.8183171   0.005
## VEROcham     0   0   1   1   0    13 0.6123724   0.075
## VIOLmira     0   0   0   0   1     5 0.3535534   0.540
## ANGEsylv     1   0   0   0   0     1 0.5773503   0.055
## ANTHsylv     1   0   1   0   0     7 0.5000000   0.425
## CAMPpers     1   0   0   0   0     1 0.5773503   0.030
## CAREcane     0   0   0   1   0     4 0.7071068   0.160
## CAREglob     0   0   1   0   0     3 0.4082483   0.510
## CHIMumbe     0   1   0   0   0     2 0.2886751   1.000
## CREPpalu     0   0   1   0   1    14 0.5477226   0.165
## DESCcaes     0   0   0   1   0     4 0.7071068   0.160
## DRYOfili     0   0   1   1   0    13 0.5976143   0.060
## EQUIprat     0   0   0   0   1     5 0.5000000   0.355
## EQUIsylv     1   0   0   1   1    21 0.4330127   0.560
## GALEOPsp     0   1   0   0   0     2 0.2886751   1.000
## GALIalbu.    1   0   0   0   0     1 0.4082483   0.360
## GALIpalu     0   0   1   0   0     3 1.0000000   0.005
## GOODrepe     0   1   0   0   0     2 0.5000000   0.185
## HIERvulg     1   0   0   0   0     1 0.4082483   0.360
## IMPAparv     1   0   1   1   0    19 0.7416198   0.015
## LYSIvulg     0   0   1   0   0     3 0.4472136   0.430
## MELAnemo     1   0   0   0   1     9 0.3779645   0.600
## MELAsylv     0   1   1   0   0    10 0.4780914   0.465
## MOERtrin     0   0   1   1   0    13 0.7500000   0.005
## PHEGconn     0   0   1   0   0     3 0.7071068   0.140
## POAnemo      0   0   0   1   0     4 0.5000000   0.410
## POLYodor     1   0   0   0   0     1 0.7071068   0.025
## POTEerec     0   1   0   0   0     2 0.2886751   1.000
## PULMobsc     0   0   0   0   1     5 0.3535534   0.600
## PYROmino     0   0   1   0   0     3 0.5000000   0.385
## RANUauri.    1   0   0   0   0     1 0.4082483   0.360
## SCORhumi     0   1   0   0   0     2 0.2886751   1.000
## SOLIvirg     0   0   1   0   1    14 0.5477226   0.080
## URTIdioi     1   0   1   0   1    20 0.4330127   0.370
## VEROoffi     0   0   0   0   1     5 0.3535534   0.615
## VIOLcani     1   0   0   0   0     1 0.8164966   0.005
## VIOLepip     1   0   1   0   0     7 0.5000000   0.290
## VIOLrivi.    1   0   1   0   1    20 0.4330127   0.375

For example, let’s see which species are sinificantly indicating cluster 1:

subset(indval$sign, indval$sign$s.1 == 1 & indval$sign$p.value < 0.05)
##          s.1 s.2 s.3 s.4 s.5 index      stat p.value
## ANEMnemo   1   0   0   0   1     9 0.8144110   0.005
## CAREdigi   1   0   1   0   1    20 0.7245688   0.015
## CONVmaja   1   0   1   1   0    19 0.7450517   0.035
## DRYOcart   1   0   1   1   1    29 0.7669650   0.045
## FRAGvesc   1   0   1   0   1    20 0.7133923   0.015
## HEPAnobi   1   0   0   0   1     9 0.7357672   0.010
## MYCEmura   1   0   1   1   0    19 0.7483315   0.005
## OXALacet   1   0   1   0   1    20 0.9538920   0.005
## RUBUsaxa   1   0   1   0   1    20 0.9278844   0.005
## VACCmyrt   1   1   1   0   0    16 0.9356763   0.005
## CAMPpers   1   0   0   0   0     1 0.5773503   0.030
## IMPAparv   1   0   1   1   0    19 0.7416198   0.015
## POLYodor   1   0   0   0   0     1 0.7071068   0.025
## VIOLcani   1   0   0   0   0     1 0.8164966   0.005

5.8 Clustering also taxa

Sometimes we might want to cluster taxa which ofter co-occur. For that we need to transpose our sites x taxa matrix using the function t.

comm.data.2 <- log1p(comm.data[, colSums(comm.data > 0) > 3]) # omitting rare taxa which cannot co-occur much anyway

tdis <- vegdist(t(comm.data.2), "euclidean")
tdis.clus <- hclust(tdis, method = "ward.D2")
plot(as.dendrogram(tdis.clus))



## Plotting both dendrograms together!

tabasco(comm.data.2, o.clu, tdis.clus)


## Saving for future!
save(o.grel, file = "clusters.rda")