Finding the best LCA model in poLCA R package

fritzz Source

I am applying LCA analysis with PoLCA R package, but the analysis not resulted since three days (it did not find the best model yet) and occasionally it gives the following error: "ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND". So i cancelled the process at 35 latent class. I am analyzing 16 variables (all of them categorical) and 36036 rows of data. When I test the variable importance for 16 variables in Boruta package, all the 16 variables resulted as important, so i used all 16 variables in LCA analysis with poLCA. Which path should i follow? Should I use another clustering method such as k-modes for clustering categorical variables in this dataset? I use the parameters with 500 iterations and nrep=10 model estimation number. The R script i use to find the best model in LCA and one of the outputs is as follows:

    for(i in 2:50){
    lc <- poLCA(f, data, nclass=i, maxiter=500, 
                tol=1e-5, na.rm=FALSE,  
                nrep=10, verbose=TRUE, calc.se=TRUE)
    if(lc$bic < min_bic){
        min_bic <- lc$bic
        LCA_best_model<-lc
    }
}

========================================================= Fit for 35 latent classes: ========================================================= number of observations: 36036
number of estimated parameters: 2029 residual degrees of freedom: 34007
maximum log-likelihood: -482547.1
AIC(35): 969152.2 BIC(35): 986383 G^2(35): 233626.8 (Likelihood ratio/deviance statistic)
X^2(35): 906572555 (Chi-square goodness of fit)
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND

rfor-loopcluster-analysisdata-sciencecategorical-data

Answers

answered 3 months ago mpaladino #1

The script you are using sequentially tests every model from 2 to 50 classes and keeps the one with the lowest BIC. BIC is not the only one or the best way to select "the best" model, but fair enough.

The problem is, you are estimating a LOT of parameters, especially in the last steps. The more classes you fit, the more time consuming the process is. Also, in this cases convergence problems are to be expected because you are fitting so many classes. That's what the error message reports, it can't find the maximum likelihood for a model with 35 classes.

I don't know what problem you are trying to solve, but models with over 10 classes are unusual in LCA. You do LCA to reduce the complexity of your data as much as possible. If you NEED to fit models with many -over 10- classes:

  • fit them one by one, so RAM consumption will be less of a problem.
  • increase the nrep= argument in the call, so you the probability of the model not finding maximum likelihood by chance -bad random initial numbers- is reduced. Also increases computing time.

Alternatively you can reduce computing time running models in parallel. Almost every modern PC has 2 or more cores. The function acl() in the next block does this with foreach() and %dopar%, so is OS independent.

library(poLCA)
library(foreach)
library(doParallel)
registerDoParallel(cores=2) #as many physical cores as available.

acl <- function(datos,   #a data.frame with your data 
                k,       #the maximum number of classes to fit
                formula) {  
  foreach(i=1:k, .packages="poLCA") %dopar% poLCA(formula, datos, nclass=i
  )
}

acm() returns a list of models, you can pick "the best" later. The next function will retrieve the quantities of intrest from the list and create a nicely formatted data.frame with usefull information to select the right number of classes.

comparar_clases_acl <- function(modelo) {
  entropy<-function (p) sum(-p*log(p)) #to asses the quality of classification
    tabla_LCA <- data.frame(Modelo=0, BIC=0, Lik_ratio=0, Entropia=0, MenorClase=0)   #empty data.frame to prealocate memory. 
for(i in 1:length(modelo)){
  tabla_LCA [i,1] <- paste("Modelo", i)
  tabla_LCA [i,2] <- modelo[[i]]$bic
  tabla_LCA [i,3] <- modelo[[i]]$Gsq
  error_prior <- entropy(modelo[[i]]$P)
  error_post <- mean(apply(modelo[[i]]$posterior,1, entropy),na.rm = TRUE)
  tabla_LCA [i,4]<-round(((error_prior-error_post) / error_prior),3)
  tabla_LCA [i,5] <- min(modelo[[i]]$P)*100
  }
return(tabla_LCA)
}

It takes only one argument: an object with a list of LCA models, exactly what acl() returns.

This parallel approach should reduce computing time. Still 50 classes are to much and you are probably getting the smallest BIC way before 50 classes. Remember, BIC penalices models as the number of estimated parameters increases, helping you find the point of diminishing returns of an extra class in your model.

comments powered by Disqus