Here we explore how similar ecological communities are. With this, we mean how much their species composition overlaps. Dissimilarity is just the opposite of similarity, both measures carry the same information.
Let’s read our community data and explore abundance data. Because abundance data tends to be skewed (few species have high abundance and many have low abundance), it is often reasonable to transform abundance to get a better distribution.
# Reading previously saved data
load("community.rda", verbose = T) # loading from previous
## Loading objects:
## vas.plants
## tree.counts
## forest.types
## xy
## soil.data
## tables.join
comm.data <- vas.plants # copy of a community data
# Most abundant species
max.spp <- which.max(colSums(comm.data))
max.spp
## OXALacet
## 33
hist(comm.data[comm.data > 0]) # histogram of requencies (other than 0)
hist(sqrt(comm.data[comm.data > 0])) # square root transformation
hist(log1p(comm.data[comm.data > 0])) # log (x+1) transformation
We can see that log-tranformation gives approximately normal distribution, it downweights frequent taxa.
We can measure dissimilarity by using several indices. Here are some examples.
Euclidean distance:
\(d_{jk} = \sqrt{\sum_{i=1}^{n}{(x_{ij}-x_{ik})^2}}\)
where \(x\) is abundance of species \(i\) in sites \(j\) and \(k\) and \(n\) is the total number of taxa.
Bray-Curtis distance:
\(d_{jk} = \frac{\sum_{i=1}^{n}{|x_{ij}-x_{ik}|}}{\sum_{i=1}^{n}{(x_{ij}+x_{ik})}}\)
Euclidean distance does not have an upper limit, whereas Bray-Curtis distance is bounded between 0 and 1.
With the vegdist function we can calculate various
distance measures. It returns a triangular distance matrix, because
distance from sample A to B is the same as from B to A.
library(vegan)
vegdist(comm.data[1:5, ], "euclidean") # triangular distance matrix
## X001Vapramae X002Illi X003Vitipalu X004Konguta
## X002Illi 6.422616
## X003Vitipalu 6.164414 7.729812
## X004Konguta 7.297260 8.544004 9.473648
## X005Vehendi 8.154753 5.766281 9.746794 9.708244
# Calculating some distances and ploting against each other.
vd1 <- vegdist(comm.data, "euclidean")
vd2 <- vegdist(log1p(comm.data), "euclidean") # log transformation
vd3 <- vegdist(log1p(comm.data), "bray")
pairs(cbind(vd1, vd2, vd3))
Explore other community (dis)similarity measures.
Using a distance matrix we can perform a clustering of our samples. Initially all samples form their own clusters, then we start to join the most similar sites and form clustering trees.
Single linkage is based on the most similar members of two clusters. Complete linkage is based on the most dissimilar member of two clusters. Average linkage is based on calculating the average similarity between all members. Ward method is more complex, aiming to minimize the variance within clusters. You can check the Wildi book for more details.
If we have a hierarchical cluster tree, we can always cut this to any number of clusters.
vd <- vd2 # selecting a distance matrix for future calculations
o.clu.s <- hclust(vd, method = "single")
o.clu.c <- hclust(vd, method = "complete")
o.clu.a <- hclust(vd, method = "average")
o.clu.w <- hclust(vd, method = "ward.D2")
par(mfrow = c(2, 2)) # several figures together 2 rows and 2 columns!
plot(as.dendrogram(o.clu.s), main = "single")
plot(as.dendrogram(o.clu.c), main = "complete")
plot(as.dendrogram(o.clu.a), main = "average")
plot(as.dendrogram(o.clu.w), main = "ward.D2")
o.clu <- o.clu.w # selecting the most logical (Ward linkage)
par(mfrow = c(1, 1)) # single figure again.
## Cutting tree to parts
o.grel <- cutree(o.clu, k = 5)
plot(as.dendrogram(o.clu))
rect.hclust(o.clu, 5, border = "red")
# Similarity between sites (colors), ordered along clusters
image(as.matrix(vd)[order(o.grel), order(o.grel)], asp = T,
col = hcl.colors(8, palette = "viridis"))
Not hierarchical – just give number of clusters needed. Computation is complex, based on machine learning and iterations.
k.o <- kmeans(comm.data, 5)
k.o$cluster
## X001Vapramae X002Illi X003Vitipalu X004Konguta X005Vehendi
## 4 2 3 1 2
## X006.Erumae X007Porgumae X008Aardla X009Kurepalu X010Sarakuste
## 5 3 5 2 3
## X011Kannu X012Vonnu X013Rookse X014Kammeri X015Kambja
## 3 4 3 4 3
## X016Reola X017Ignase X018Unikula X019Haavametsa X020Kruusahaua
## 5 4 4 2 3
## X021Aravu X026Pedajamae X027Miti X028Puhaste X031Pimmelaan
## 2 4 2 2 4
## X032Logina X033Voorepalu X036Taevaskoja X044Savimae X073Pausakunnu
## 5 2 2 2 3
image(as.matrix(vd1)[order(k.o$cluster), order(k.o$cluster)], asp = T,
col = hcl.colors(8, palette = "viridis"))
Above we defined 5 clusters but can we find how many clusters are optimal? One method is to inspect at what number of groups the correlation between the real distance matrix between sites and the distance matrix between clusters maximizes. This distance in clusters can only include 0 (in different cluster) and 1 (in the same cluster) but we can still calculate the correlation.
# Lets define a vector for correlations
correls <- numeric() # making a numeric vector
for (i in 2:(nrow(comm.data) - 1)) {
# loop for possible cluster numbers. We do not use 1 (all in the same cluster) and number of samples (all in different clusters)
clusters <- cutree(o.clu, k = i) # defining clusters
clusdist <- vegdist(table(1:30, clusters), "bray") # dist calculates Bray distance
# distance is 0 (in different cluster) or 1 (same cluster)
# table(1:30,clusters) makes a table of 30 sites vs clusters, 1 if a site is in a cluster
correls[i] <- cor(vd, clusdist) # correlation as a measure
}
plot(correls, type = "h", xlab = "No of clusers", ylab = "Correlation")
max.corr <- which.max(correls) # which is max correlation?
points(max.corr, correls[max.corr], pch = 16, col = "red", cex = 2)
axis(side = 1, at = max.corr, labels = max.corr, col.axis = "red")
# A nice graph!
Now we will compare how the clusters that we defined using maths are related to human-defined forest types.
forest.types # this vector contains the forest type each site is categorized as
## [1] "115" "113" "115" "151" "113" "113" "116" "116" "113" "116" "116" "114"
## [13] "116" "114" "116" "114" "114" "115" "113" "116" "151" "114" "113" "113"
## [25] "116" "151" "113" "113" "151" "114"
image(as.matrix(vd1)[order(forest.types), order(forest.types)], asp = T,
col = hcl.colors(8, palette = "viridis"))
table(o.grel, forest.types) # cross-table
## forest.types
## o.grel 113 114 115 116 151
## 1 0 3 3 0 0
## 2 8 1 0 0 3
## 3 0 0 0 1 1
## 4 1 0 0 1 0
## 5 0 2 0 6 0
table(k.o$cluster, forest.types)
## forest.types
## 113 114 115 116 151
## 1 0 0 0 0 1
## 2 8 0 0 0 2
## 3 0 1 1 6 0
## 4 0 4 2 1 0
## 5 1 1 0 1 1
table(k.o$cluster, o.grel)
## o.grel
## 1 2 3 4 5
## 1 0 0 1 0 0
## 2 0 10 0 0 0
## 3 1 0 0 0 7
## 4 4 1 1 0 1
## 5 1 1 0 2 0
# Fisher exact test of two groups
fisher.test(table(o.grel, forest.types))
##
## Fisher's Exact Test for Count Data
##
## data: table(o.grel, forest.types)
## p-value = 6.84e-07
## alternative hypothesis: two.sided
fisher.test(table(k.o$cluster, forest.types))
##
## Fisher's Exact Test for Count Data
##
## data: table(k.o$cluster, forest.types)
## p-value = 2.745e-06
## alternative hypothesis: two.sided
fisher.test(k.o$cluster, o.grel)
##
## Fisher's Exact Test for Count Data
##
## data: k.o$cluster and o.grel
## p-value = 6.744e-09
## alternative hypothesis: two.sided
All tests are highly significant, but there are also some differences.
We want to know which species actually define clusters. There are several indices which show how well species differ between clusters. Here we use the IndVal Index. There are ways to calculate statistical significance as well for different clusters.
library(indicspecies)
# IndVal index (species indicator values)
indval <- multipatt(comm.data, o.grel, func = "IndVal")
indval$sign
## s.1 s.2 s.3 s.4 s.5 index stat p.value
## ACTAspic 1 0 0 0 1 9 0.4629100 0.350
## AEGOpoda 0 0 0 0 1 5 0.5590170 0.165
## ANEMnemo 1 0 0 0 1 9 0.8144110 0.005
## ASAReuro 1 0 0 0 1 9 0.5345225 0.140
## ATHYfil.fe 0 0 0 0 1 5 0.5000000 0.330
## BRACpinn 1 0 0 0 0 1 0.4082483 0.350
## CALAarun 1 1 0 1 1 28 0.8451543 0.180
## CALLvulg 1 0 0 0 0 1 0.3535534 0.720
## CAREdigi 1 0 1 0 1 20 0.7245688 0.015
## CIRCalpi 0 0 1 0 0 3 0.7745967 0.025
## CONVmaja 1 0 1 1 0 19 0.7450517 0.035
## DESCflex 0 1 0 0 1 12 0.3162278 1.000
## DRYOcart 1 0 1 1 1 29 0.7669650 0.045
## DRYOexpa 0 0 0 0 1 5 0.3872983 0.825
## EPILangu 0 0 1 0 0 3 0.7071068 0.140
## FESTovin 0 1 0 0 0 2 0.5000000 0.165
## FRAGvesc 1 0 1 0 1 20 0.7133923 0.015
## GALElute 0 0 0 0 1 5 0.9354143 0.005
## GERAsylv 1 0 0 0 1 9 0.4629100 0.345
## GYMNdryo 0 0 1 0 1 14 0.6454972 0.035
## HEPAnobi 1 0 0 0 1 9 0.7357672 0.010
## IMPAnoli 0 0 0 1 0 4 0.7071068 0.160
## LATHvern 0 0 0 0 1 5 0.5000000 0.255
## LUZUpilo 1 1 1 0 1 27 0.8017837 0.335
## LYCOanno 0 1 1 1 0 22 0.6123724 0.145
## MAIAbifo 1 1 1 0 1 27 0.8237545 0.260
## MELAprat 0 1 0 0 0 2 0.8660254 0.005
## MELInuta 0 0 1 0 0 3 0.7071068 0.175
## MILIeffu 0 0 0 0 1 5 0.7905694 0.005
## MOLIcaer 0 1 0 0 0 2 0.2886751 1.000
## MYCEmura 1 0 1 1 0 19 0.7483315 0.005
## ORTHsecu 0 0 0 1 0 4 0.5000000 0.280
## OXALacet 1 0 1 0 1 20 0.9538920 0.005
## PARIquad 1 0 1 1 1 29 0.6666667 0.085
## PTERaqui 1 0 1 0 1 20 0.4330127 0.490
## RANUcass 0 0 0 0 1 5 0.3535534 0.600
## RUBUsaxa 1 0 1 0 1 20 0.9278844 0.005
## STELholo 0 0 0 0 1 5 0.8134892 0.005
## STELnemo 0 0 1 0 1 14 0.6324555 0.050
## TRIEeuro 1 1 1 0 1 27 0.8017837 0.350
## VACCmyrt 1 1 1 0 0 16 0.9356763 0.005
## VACCvit.id 0 1 1 0 0 10 0.8183171 0.005
## VEROcham 0 0 1 1 0 13 0.6123724 0.075
## VIOLmira 0 0 0 0 1 5 0.3535534 0.540
## ANGEsylv 1 0 0 0 0 1 0.5773503 0.055
## ANTHsylv 1 0 1 0 0 7 0.5000000 0.425
## CAMPpers 1 0 0 0 0 1 0.5773503 0.030
## CAREcane 0 0 0 1 0 4 0.7071068 0.160
## CAREglob 0 0 1 0 0 3 0.4082483 0.510
## CHIMumbe 0 1 0 0 0 2 0.2886751 1.000
## CREPpalu 0 0 1 0 1 14 0.5477226 0.165
## DESCcaes 0 0 0 1 0 4 0.7071068 0.160
## DRYOfili 0 0 1 1 0 13 0.5976143 0.060
## EQUIprat 0 0 0 0 1 5 0.5000000 0.355
## EQUIsylv 1 0 0 1 1 21 0.4330127 0.560
## GALEOPsp 0 1 0 0 0 2 0.2886751 1.000
## GALIalbu. 1 0 0 0 0 1 0.4082483 0.360
## GALIpalu 0 0 1 0 0 3 1.0000000 0.005
## GOODrepe 0 1 0 0 0 2 0.5000000 0.185
## HIERvulg 1 0 0 0 0 1 0.4082483 0.360
## IMPAparv 1 0 1 1 0 19 0.7416198 0.015
## LYSIvulg 0 0 1 0 0 3 0.4472136 0.430
## MELAnemo 1 0 0 0 1 9 0.3779645 0.600
## MELAsylv 0 1 1 0 0 10 0.4780914 0.465
## MOERtrin 0 0 1 1 0 13 0.7500000 0.005
## PHEGconn 0 0 1 0 0 3 0.7071068 0.140
## POAnemo 0 0 0 1 0 4 0.5000000 0.410
## POLYodor 1 0 0 0 0 1 0.7071068 0.025
## POTEerec 0 1 0 0 0 2 0.2886751 1.000
## PULMobsc 0 0 0 0 1 5 0.3535534 0.600
## PYROmino 0 0 1 0 0 3 0.5000000 0.385
## RANUauri. 1 0 0 0 0 1 0.4082483 0.360
## SCORhumi 0 1 0 0 0 2 0.2886751 1.000
## SOLIvirg 0 0 1 0 1 14 0.5477226 0.080
## URTIdioi 1 0 1 0 1 20 0.4330127 0.370
## VEROoffi 0 0 0 0 1 5 0.3535534 0.615
## VIOLcani 1 0 0 0 0 1 0.8164966 0.005
## VIOLepip 1 0 1 0 0 7 0.5000000 0.290
## VIOLrivi. 1 0 1 0 1 20 0.4330127 0.375
For example, let’s see which species are sinificantly indicating cluster 1:
subset(indval$sign, indval$sign$s.1 == 1 & indval$sign$p.value < 0.05)
## s.1 s.2 s.3 s.4 s.5 index stat p.value
## ANEMnemo 1 0 0 0 1 9 0.8144110 0.005
## CAREdigi 1 0 1 0 1 20 0.7245688 0.015
## CONVmaja 1 0 1 1 0 19 0.7450517 0.035
## DRYOcart 1 0 1 1 1 29 0.7669650 0.045
## FRAGvesc 1 0 1 0 1 20 0.7133923 0.015
## HEPAnobi 1 0 0 0 1 9 0.7357672 0.010
## MYCEmura 1 0 1 1 0 19 0.7483315 0.005
## OXALacet 1 0 1 0 1 20 0.9538920 0.005
## RUBUsaxa 1 0 1 0 1 20 0.9278844 0.005
## VACCmyrt 1 1 1 0 0 16 0.9356763 0.005
## CAMPpers 1 0 0 0 0 1 0.5773503 0.030
## IMPAparv 1 0 1 1 0 19 0.7416198 0.015
## POLYodor 1 0 0 0 0 1 0.7071068 0.025
## VIOLcani 1 0 0 0 0 1 0.8164966 0.005
Sometimes we might want to cluster taxa which ofter co-occur. For
that we need to transpose our sites x taxa matrix using the
function t.
comm.data.2 <- log1p(comm.data[, colSums(comm.data > 0) > 3]) # omitting rare taxa which cannot co-occur much anyway
tdis <- vegdist(t(comm.data.2), "euclidean")
tdis.clus <- hclust(tdis, method = "ward.D2")
plot(as.dendrogram(tdis.clus))
## Plotting both dendrograms together!
tabasco(comm.data.2, o.clu, tdis.clus)
## Saving for future!
save(o.grel, file = "clusters.rda")