I am using the `mice` package to impute missing data. I'm new to imputation, so I've made some progress but have run into a steep learning curve. To give a toy example:

```
library(mice)
# Using nhanes dataset as example
df1 <- mice(nhanes, m=10)
```

As you can see, I imputed `nhanes` 10 times using mostly default settings (the result is stored in `df1`), and I am comfortable using this result in regression models, pooling results, etc. However, my real-life data are survey data from different countries. The amount of missing data differs by country, as do the values of specific variables such as age and education level. I would therefore like to impute the missing values while allowing for clustering by country. To illustrate, I will create a grouping variable that has no missing values. (Of course, in this toy example the grouping variable is uncorrelated with the other variables, but in my real data such correlations exist.)

```
# Create a grouping variable
nhanes$country <- sample(c("A", "B"), size=nrow(nhanes), replace=TRUE)
```

So how do I tell `mice()` that this variable is different from the others, i.e. that it is a level in a multilevel dataset?

answered 2 years ago helper #1

You have to set up a predictor matrix to tell `mice` which variables to use when imputing each of the others. A quick way to do so is `predictorM <- quickpred(nhanes)`.

Then you change the 1s in the matrix to 2 for ordinary predictors and to -2 for the level-two variable that identifies the countries (in `mice`, 1 means a fixed effect, 2 a fixed and random effect, and -2 marks the cluster variable), and pass the matrix to `mice()` as `predictorMatrix = predictorM`. In the `method` argument you then set the methods to `2l.norm` for a metric variable or `2l.binom` for a binary variable. For the latter you need the function written by Sabine Zinn (https://www.neps-data.de/Portals/0/Working%20Papers/WP_XXXI.pdf). Unfortunately, I do not know of any methods for imputing two-level count data.
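The setup described above can be sketched as follows. This is a minimal illustration, not a tested recipe: the `countryID` column is a made-up cluster variable added to `nhanes`, `targets` is a name chosen here, and only the two metric variables are switched to `2l.norm`. The sketch stops at building the predictor matrix and method vector; the final `mice()` call is left as a comment.

```r
library(mice)

dat <- nhanes
# Hypothetical cluster variable for illustration only;
# the 2l methods expect the cluster ID to be an integer.
set.seed(1)
dat$countryID <- sample(1:2, nrow(dat), replace = TRUE)

# quickpred() returns a 0/1 matrix: rows are imputation targets,
# columns are the predictors used for each target.
pred <- quickpred(dat)

# Metric variables to impute with a two-level model (assumed choice).
targets <- c("bmi", "chl")

# Turn every selected predictor (1) into a fixed + random effect (2).
pred[targets, ] <- 2 * pred[targets, ]

# Mark the cluster variable with -2 for the two-level targets.
pred[targets, "countryID"] <- -2

# Start from the default methods and switch the targets to 2l.norm.
meth <- make.method(dat)
meth[targets] <- "2l.norm"

print(pred[targets, ])
print(meth)

# These objects would then be passed on, e.g.
# mice(dat, predictorMatrix = pred, method = meth)
```

Coding every predictor as 2 asks for a random slope per predictor, which can be demanding with few clusters; leaving some entries at 1 keeps them as fixed effects only.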

Be aware that imputing a multilevel dataset will slow down the process a lot. In my experience, resampling methods such as PMM, or those in the `BaBooN` package, work well in keeping the hierarchical structure of the data and are much faster to use.

answered 2 years ago SimonG #2

If you're thinking of clusters as in "mixed-effects" models, then you should use the methods provided by `mice` that are intended for clustered data. These methods can be found in the manual and are usually prefixed like `2l.something`.

The variety of methods for clustered data is somewhat limited in `mice`, but I can recommend using `2l.pan` for missing data in lower-level units and `2l.only.norm` (named `2lonly.norm` in recent `mice` releases) at the cluster level.
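Since the available two-level methods differ between `mice` versions, it can help to list what the installed version actually provides. The snippet below is a small sketch that relies on the convention that imputation methods are implemented as functions named `mice.impute.<method>`; the variable names are arbitrary.

```r
library(mice)

# All objects exported by the attached mice package.
all_fns <- ls("package:mice")

# Keep the imputation functions whose method name starts with "2l".
two_level <- grep("^mice\\.impute\\.2l", all_fns, value = TRUE)

# Strip the prefix to get the names accepted by the `method` argument.
print(sub("^mice\\.impute\\.", "", two_level))
```

Any name printed here can be used in the method vector passed to `mice()`.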

As an alternative to mixed-effects models, you may consider using dummy indicators to represent the cluster structure (i.e., one dummy variable for each cluster). This method is not ideal when you think of the clusters from the perspective of mixed-effects models. So if you want to do mixed-effects analyses, then stick to mixed-effects models when you can.

Below, I show an example for both strategies.

**Preparation:**

```
library(mice)
data(nhanes)
set.seed(123)
nhanes <- within(nhanes,{
country <- factor(sample(LETTERS[1:10], size=nrow(nhanes), replace=TRUE))
countryID <- as.numeric(country)
})
```

**Case 1: Imputation using mixed-effects models**

This section uses `2l.pan` to impute the three variables with missing data. Note that I use `countryID` as the cluster variable by specifying a `-2` in the predictor matrix. To all other variables, I assign fixed effects only (`1`).

```
# "empty" imputation as a template
imp0 <- mice(nhanes, maxit=0)
pred1 <- imp0$predictorMatrix
meth1 <- imp0$method
# set imputation procedures
meth1[c("bmi","hyp","chl")] <- "2l.pan"
# set predictor Matrix (mixed-effects models with random intercept
# for countryID and fixed effects otherwise)
pred1[,"country"] <- 0 # don't use country factor
pred1[,"countryID"] <- -2 # use countryID as cluster variable
pred1["bmi", c("age","hyp","chl")] <- c(1,1,1) # fixed effects (bmi)
pred1["hyp", c("age","bmi","chl")] <- c(1,1,1) # fixed effects (hyp)
pred1["chl", c("age","bmi","hyp")] <- c(1,1,1) # fixed effects (chl)
# impute
imp1 <- mice(nhanes, maxit=20, m=10, predictorMatrix=pred1, method=meth1)
```

**Case 2: Imputation using dummy indicators (DIs) for clusters**

This section uses `pmm` for imputation, and the clustered structure is represented in an "ad hoc" fashion. That is, the clusters aren't represented by random effects but by fixed effects instead. This may exaggerate the cluster-level variability of the variables with missing data, so be sure you know what you are doing when you use it.

```
# create dummy indicator variables
DIs <- with(nhanes, contrasts(country)[country,])
colnames(DIs) <- paste0("country",colnames(DIs))
nhanes <- cbind(nhanes,DIs)
# "empty" imputation as a template
imp0 <- mice(nhanes, maxit=0)
pred2 <- imp0$predictorMatrix
meth2 <- imp0$method
# set imputation procedures
meth2[c("bmi","hyp","chl")] <- "pmm"
# set predictor matrix (dummy indicators for country enter as fixed effects)
pred2[,"country"] <- 0 # don't use country factor
pred2[,"countryID"] <- 0 # don't use countryID
pred2[,colnames(DIs)] <- 1 # use dummy indicators
pred2["bmi", c("age","hyp","chl")] <- c(1,1,1) # fixed effects (bmi)
pred2["hyp", c("age","bmi","chl")] <- c(1,1,1) # fixed effects (hyp)
pred2["chl", c("age","bmi","hyp")] <- c(1,1,1) # fixed effects (chl)
# impute
imp2 <- mice(nhanes, maxit=20, m=10, predictorMatrix=pred2, method=meth2)
```
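The dummy-coding step at the top of the block above, `contrasts(country)[country,]`, is compact, so here it is in isolation on a made-up five-row factor. It uses base R only. The comparison with `model.matrix()` is added for illustration and assumes R's default treatment contrasts.

```r
# Small made-up factor with three clusters.
country <- factor(c("A", "B", "C", "A", "C"))

# contrasts() gives one row per level; indexing by the factor
# repeats the matching row for every observation.
DIs <- contrasts(country)[country, ]
colnames(DIs) <- paste0("country", colnames(DIs))
print(DIs)

# The same coding obtained from model.matrix(), dropping the intercept column.
mm <- model.matrix(~ country)[, -1]
print(mm)
```

With treatment contrasts this gives one indicator per non-reference level, so ten countries produce nine dummy columns, with the first level acting as the reference.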

If you want to read up on what to think of these methods, have a look at one or two of these papers.