In this document we supply an example of univariate mixture modeling, an example for multivariate normal mixture, and a simple example of regression mixture using R. This document serves as a supplementary material for EDMS 657-Exploratory Latent and Composite Variable Methods, taught by Dr. Gregory R. Hancock at University of Maryland.
Before we get started, let’s install all the packages we are going to use in this demo.
install.packages("mixtools")
install.packages("FactMixtAnalysis")
install.packages("mclust")
devtools::install_github("data-edu/tidyLPA")
devtools::install_github("YiFengEDMS/EDMS657Data")
Research Context
Were head circumference measurements sampled from a homogeneous normally distributed population with a single mean and variance? Were head circumference measurements sampled from a heterogeneous population consisting of multiple latent classes (i.e., unknown groups), each characterized by a normal distribution with a unique mean and variance?
library(EDMS657Data)
data(ME_dat)
str(ME_dat)
## 'data.frame': 2028 obs. of 4 variables:
## $ ID : chr "2000011" "2000051" "2000091" "2000101" ...
## $ sex: int 0 0 0 1 1 1 0 1 1 1 ...
## $ age: int 32 36 4 44 6 3 4 12 44 7 ...
## $ y : num 53.3 47 39 49 17 ...
head(ME_dat)
## ID sex age y
## 1 2000011 0 32 53.33333
## 2 2000051 0 36 47.00000
## 3 2000091 0 4 39.00000
## 4 2000101 1 44 49.00000
## 5 2000121 1 6 17.00000
## 6 2000141 1 3 34.00000
hist(ME_dat$y, main="Child Head Circumference",
xlab="Circumference", ylab="")
There are many different R packages that can be used to fit mixture
models. Here I will demo this same example with three available R
packages: mixtools
, mclust
, and
tidyLPA
.
mixtools
packageYou can use the normalmixEM
function within the
mixtools
R package to fit the mixture models. You will need
to specify the initial or starting values for the model parameters:
mixing proportions (lambda), means (mu), and variances (sigma).
library(mixtools)
mod1 <- normalmixEM(ME_dat$y, lambda = .5, mu = c(45, 15), sigma = c(100, 110))
## number of iterations= 28
# summarize output
mod1[c("lambda", "mu", "sigma")]
## $lambda
## [1] 0.7566855 0.2433145
##
## $mu
## [1] 45.65962 18.25173
##
## $sigma
## [1] 3.863324 2.380447
summary(mod1)
## summary of normalmixEM object:
## comp 1 comp 2
## lambda 0.756685 0.243315
## mu 45.659616 18.251726
## sigma 3.863324 2.380447
## loglik at estimate: -6501.699
You could obtain a histogram of the raw data along with the two density curves determined by the model estimates.
plot(mod1, density=TRUE, cex.axis=1.4, cex.lab=1.4, cex.main=1.8, xlab2="Head Circumferences")
mclust
packageA more commonly used R package for mixture modeling is the
mclust
package. Here you do not need to supply the initial
values for the model parameters.
library(mclust)
mod4 <- densityMclust(ME_dat$y, modelName = "V", G = 2)
## fitting ...
##
|
| | 0%
|
|================================ | 50%
|
|================================================================| 100%
summary(mod4)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling
## -------------------------------------------------------
##
## Mclust V (univariate, unequal variance) model with 2 components:
##
## log-likelihood n df BIC ICL
## -6501.72 2028 5 -13041.51 -13044.59
mod4$parameters
## $pro
## [1] 0.2434123 0.7565877
##
## $mean
## 1 2
## 18.25592 45.66181
##
## $variance
## $variance$modelName
## [1] "V"
##
## $variance$d
## [1] 1
##
## $variance$G
## [1] 2
##
## $variance$sigmasq
## [1] 5.708087 14.889938
##
## $variance$scale
## [1] 5.708087 14.889938
It is not required for you to specify the number of groups
(G =
) with mclust
. If you do not specify the
number of groups, mclust
will run a series of models with
different numbers of groups, and determine how many clusters to keep
based on fit indices. For example:
mod5 <- densityMclust(faithful$waiting)
## fitting ...
##
|
| | 0%
|
|=== | 5%
|
|======= | 11%
|
|========== | 16%
|
|============= | 21%
|
|================= | 26%
|
|==================== | 32%
|
|======================== | 37%
|
|=========================== | 42%
|
|============================== | 47%
|
|================================== | 53%
|
|===================================== | 58%
|
|======================================== | 63%
|
|============================================ | 68%
|
|=============================================== | 74%
|
|=================================================== | 79%
|
|====================================================== | 84%
|
|========================================================= | 89%
|
|============================================================= | 95%
|
|================================================================| 100%
plot(mod5, what = "BIC")
Here V
represents the models that allow
varying variances across groups, while E
represents the models that contrain equal variances
across groups.
You may also note the mclust
package determines the best
model by maximizing the BIC, instead of minimizing it as we have learned
in class. It is because mclust
uses a slightly different
function to compute BIC compared to the general definition of it.
The general definition of BIC is: \[BIC = -2 \times ln(L(\mathbf{\Theta} | \mathbf{X})) + k \times ln(n),\] where k is the number of model parameters to be estimated.
In mclust
, however, the formula for BIC is (Scrucca,
Fop, Murphy, & Raftery, 2016):
\[BIC = 2 \times ln(L(\mathbf{\Theta} |
\mathbf{X})) - k \times ln(n)\] Therefore, it makes sense that
you need to maximize BIC if using mclust
.
tidyLPA
packageYou may notice that unlike Mplus, mclust
does not report
entropy for the mixture models. If you need entropy, you could use the
tidyLPA
R package that provides a tidy and concise wrapper
function for fitting mixture models using mclust
.
library(tidyLPA)
mod6 <- estimate_profiles(ME_dat[, "y"], 1:2, variances = "varying")
mod6$model_2_class_2$estimates
## Category Parameter Estimate se p Class Model
## 1 Means df 18.255900 0.11203300 0.000000e+00 1 2
## 2 Variances df 5.707914 0.96558599 3.393213e-09 1 2
## 3 Means df 45.661800 0.09795117 0.000000e+00 2 2
## 4 Variances df 14.890084 0.67651556 2.311695e-107 2 2
## Classes
## 1 2
## 2 2
## 3 2
## 4 2
data("faithful")
psych::describe(faithful)
## vars n mean sd median trimmed mad min max range skew
## eruptions 1 272 3.49 1.14 4 3.53 0.95 1.6 5.1 3.5 -0.41
## waiting 2 272 70.90 13.59 76 71.50 11.86 43.0 96.0 53.0 -0.41
## kurtosis se
## eruptions -1.51 0.07
## waiting -1.16 0.82
with(faithful, plot(eruptions, waiting))
# Two clusters
out.faith <- mvnormalmixEM(faithful, arbvar = T, k = 2)
## number of iterations= 12
summary(out.faith)
## summary of mvnormalmixEM object:
## comp 1 comp 2
## lambda 0.355873 0.644127
## mu1 2.036389 54.478518
## mu2 4.289662 79.968117
## loglik at estimate: -1130.264
out.faith$sigma
## [[1]]
## [,1] [,2]
## [1,] 0.06916781 0.4351691
## [2,] 0.43516911 33.6972922
##
## [[2]]
## [,1] [,2]
## [1,] 0.1699682 0.9406068
## [2,] 0.9406068 36.0461825
head(out.faith$posterior)
## comp.1 comp.2
## [1,] 2.592412e-09 1.000000e+00
## [2,] 1.000000e+00 1.907899e-09
## [3,] 8.422465e-06 9.999916e-01
## [4,] 9.999893e-01 1.066832e-05
## [5,] 1.001461e-21 1.000000e+00
## [6,] 9.926681e-01 7.331888e-03
plot(out.faith, which = 2)
pred <- apply(out.faith$posterior, 1, function(row) which.max(row))
head(pred)
## [1] 2 1 2 1 2 1
table(pred)
## pred
## 1 2
## 97 175
# Three clusters
out.faith3 <- mvnormalmixEM(faithful, arbvar = T, k = 3)
## number of iterations= 172
summary(out.faith3)
## Warning in rbind(x$lambda, matrix(unlist(x$mu), byrow = TRUE, nrow =
## length(x$lambda))): number of columns of result is not a multiple of
## vector length (arg 1)
## summary of mvnormalmixEM object:
## comp 1 comp 2
## lambda 0.332768 0.0903426
## mu1 1.996646 54.3829144
## mu2 3.568137 70.2600748
## mu3 4.335336 80.5227071
## loglik at estimate: -1119.214
out.faith3$sigma
## [[1]]
## [,1] [,2]
## [1,] 0.04390206 0.3440483
## [2,] 0.34404832 33.7411354
##
## [[2]]
## [,1] [,2]
## [1,] 0.5536093 7.849662
## [2,] 7.8496620 134.878891
##
## [[3]]
## [,1] [,2]
## [1,] 0.1359341 0.358164
## [2,] 0.3581640 28.587349
head(out.faith3$posterior)
## comp.1 comp.2 comp.3
## [1,] 1.139900e-13 0.125396216 8.746038e-01
## [2,] 9.984080e-01 0.001591990 7.205475e-14
## [3,] 7.855305e-09 0.593925180 4.060748e-01
## [4,] 9.903475e-01 0.009652477 3.485977e-08
## [5,] 3.293286e-33 0.051445238 9.485548e-01
## [6,] 2.276152e-03 0.997720863 2.984782e-06
plot(out.faith3, which = 2)
pred3 <- apply(out.faith3$posterior, 1, function(row) which.max(row))
head(pred3)
## [1] 3 1 2 1 3 2
table(pred3)
## pred3
## 1 2 3
## 92 15 165
data(NOdata)
psych::describe(NOdata)
## vars n mean sd median trimmed mad min max range skew
## NO 1 88 1.96 1.13 1.75 1.91 1.44 0.37 4.03 3.66 0.33
## Equivalence 2 88 0.93 0.20 0.93 0.93 0.26 0.54 1.23 0.70 -0.14
## kurtosis se
## NO -1.30 0.12
## Equivalence -1.24 0.02
with(NOdata, plot(y = Equivalence, x = NO))
reg.out <- with(NOdata, regmixEM(y = Equivalence, x = NO, verb = F, epsilon = 1e-04, k = 2))
## number of iterations= 18
summary(reg.out)
## summary of regmixEM object:
## comp 1 comp 2
## lambda 0.4897277 0.5102723
## sigma 0.0433292 0.0241435
## beta1 0.5649610 1.2471261
## beta2 0.0850408 -0.0830369
## loglik at estimate: 122.0383
reg.out[3:6]
## $lambda
## [1] 0.4897277 0.5102723
##
## $beta
## comp.1 comp.2
## beta.0 0.56496100 1.24712608
## beta.1 0.08504084 -0.08303687
##
## $sigma
## [1] 0.04332924 0.02414351
##
## $loglik
## [1] 122.0383
plot(reg.out, which = 2, xlab2 = "NO", ylab2 = "Equivalence")
The R script for this tutorial can be found here.
References
Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8, 289-317.