hurdle model prediction - count vs response

user2602640 Source

I'm working on a hurdle model and ran into a question I can't quite figure out. It was my understanding that the overall response prediction of the hurdle is the multiplication of the count prediction by the probability prediction. I.e., the overall response has to be smaller or equal to the count prediction. However, in my data, the response prediction is higher than the count prediction, and I can't figure out why.

Here's a similar result for a toy model (code adapted from here):

data("RecreationDemand", package = "AER") 

## model 
m <- hurdle(trips ~ quality | ski, data = RecreationDemand, dist = "negbin") 
nd <- data.frame(quality = 0:5, ski = "no")
predict(m, newdata = nd, type = "count")
predict(m, newdata = nd, type = "response")

Why is it that the counts are higher than the responses?

added comparison to glm.nb

Also - I was under the impression that the count part of the hurdle model should give identical predictions to a count-model of only positive values. When I try that, I get completely different values. What am I missing??

m.nb <- glm.nb(trips ~ quality, data = RecreationDemand[RecreationDemand$trips > 0,]) 
predict(m, newdata = nd, type = "count") ## hurdle
predict(m.nb, newdata = nd, type = "response") ## positive counts only


answered 6 months ago Achim Zeileis #1

The last question is the easiest to answer. The "count" part of the hurdle modle is not simply a standard count model (including a positive probability for zeros) but a zero-truncated count model (where zeros cannot occur).

Using the countreg package from R-Forge you can fit the model you attempted to fit with glm.nb in your example. (Alternatively, VGAM or gamlss could also be used to fit the same model.)

m.truncnb <- zerotrunc(trips ~ quality, data = RecreationDemand,
  subset = trips > 0, dist = "negbin")
cbind(hurdle = coef(m, model = "count"), zerotrunc = coef(m.truncnb), negbin = coef(m.nb))
##                 hurdle  zerotrunc     negbin
## (Intercept) 0.08676189 0.08674119 1.75391028
## quality     0.02482553 0.02483015 0.01671314

Up to small numerical differences the first two models are exactly equivalent. The non-truncated model, however, has to compensate the lack of zeros by increasing the intercept and dampening the slope parameter, which is clearly not appropriate here.

As for the predictions, one can distinguish three quantities:

  1. The expectation from the untruncated count part, i.e., simply exp(x'b).
  2. The conditional/truncated expectation from the count part, i.e., accounting for the zero trunctation: exp(x'b)/(1 - f(0)) where f(0) is the probability for 0 in that count part.
  3. The overall expectation for the complete hurdle model, i.e., the probability for crossing the hurdle times the conditional expectation from 2.: exp(x'b)/(1 - f(0)) * (1 - g(0)) where g(0) is the probability for 0 in the zero hurdle part of the model.

See also Section 2.2 and Appendix C in vignette("countreg", package = "pscl") for more details and formulas. predict(..., type = "count") computes item 1 from above where predict(..., type = "response") computes item 3 for a hurdle model and item 2 for a zerotrunc model.

comments powered by Disqus