The Poisson regression model arises naturally when we want to model the average number of occurrences per unit of time or space. Examples include the incidence of a rare cancer, the number of cars passing through an intersection, or the number of earthquakes.

One feature of the Poisson distribution is that the mean equals the variance. However, Poisson models can exhibit overdispersion or underdispersion, where the variance is larger or smaller than the mean, respectively. In practice, overdispersion is encountered more often, especially when data are limited.
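To make the mean-equals-variance property concrete, here is a minimal sketch on simulated data. The sample size, the mean of 4, and the `size` parameter are arbitrary choices for illustration, not values from this post. It compares the variance-to-mean ratio of Poisson draws with that of negative binomial draws, which are overdispersed by construction.

```r
## R code
set.seed(1)                               # arbitrary seed, for reproducibility
n <- 10000                                # illustrative sample size
y_pois <- rpois(n, lambda = 4)            # equidispersed counts
y_nb   <- rnbinom(n, size = 2, mu = 4)    # overdispersed: variance = mu + mu^2 / size

ratio_pois <- var(y_pois) / mean(y_pois)  # should be close to 1
ratio_nb   <- var(y_nb) / mean(y_nb)      # theory: 1 + 4 / 2 = 3
c(poisson = ratio_pois, negbin = ratio_nb)
```

A ratio well above 1 in raw counts is only a hint; in a regression setting the comparison has to be made conditional on the predictors.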

Overdispersion affects the interpretation of the model. It does not change the coefficient estimates themselves, but it makes their standard errors too small, so it must be addressed to avoid overstating the significance of the coefficients.

In this post, I discuss some basic methods to adjust for overdispersion in the Poisson regression model, with the implementation shown in R. I hope this article is helpful.

Suppose we want to model count responses Yi using a vector of predictors xi. We assume that the response variable Yi follows a Poisson distribution with parameter μi,

$$Y_i \sim \text{Poisson}(\mu_i)$$

where the probability mass function is

$$P(Y_i = y_i) = \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots$$

The log link function connects the linear combination of the predictors xi to the Poisson parameter μi, which gives the Poisson regression model

$$\log \mu_i = x_i^\top \beta$$

where β is the vector of regression coefficients.
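As a quick illustration of the log link, the sketch below simulates counts from a known model and checks that `glm` recovers the coefficients. The data, the names `beta_true` and `sim_mod`, and the coefficient values are hypothetical and are not related to the example used later in this post.

```r
## R code
set.seed(42)
n <- 2000
x <- rnorm(n)
beta_true <- c(0.5, 0.3)                    # hypothetical intercept and slope
mu <- exp(beta_true[1] + beta_true[2] * x)  # log link: log(mu) = b0 + b1 * x
y <- rpois(n, mu)

sim_mod <- glm(y ~ x, family = poisson)     # log is the default link for poisson
beta_hat <- unname(coef(sim_mod))
beta_hat                                    # should be close to beta_true
exp(beta_hat[2])                            # multiplicative change in the mean per unit of x
```

Because of the log link, coefficients act multiplicatively on the mean: a one-unit increase in `x` multiplies the expected count by `exp(beta)`.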

Let’s build a simple model with the example introduced in Faraway’s book.


## R code
library(faraway)
data(gala)
gala = gala[,-2]   # drop the second column (Endemics)
pois_mod = glm(Species ~ ., family=poisson, gala)
summary(pois_mod)

This is the summary of the Poisson model.


Call:
glm(formula = Species ~ ., family = poisson, data = gala)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-8.2752  -4.4966  -0.9443   1.9168  10.1849  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.155e+00  5.175e-02  60.963  < 2e-16 ***
Area        -5.799e-04  2.627e-05 -22.074  < 2e-16 ***
Elevation    3.541e-03  8.741e-05  40.507  < 2e-16 ***
Nearest      8.826e-03  1.821e-03   4.846 1.26e-06 ***
Scruz       -5.709e-03  6.256e-04  -9.126  < 2e-16 ***
Adjacent    -6.630e-04  2.933e-05 -22.608  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  716.85  on 24  degrees of freedom
AIC: 889.68

Number of Fisher Scoring iterations: 5

Don’t be fooled by the highly significant coefficients. Those tiny p-values are a consequence of overdispersion, which makes the Poisson standard errors too small. We can check for overdispersion either visually or quantitatively.

Let’s first plot out the estimated variance against the mean.


## R code
plot(log(fitted(pois_mod)), log((gala$Species-fitted(pois_mod))^2),
     xlab=expression(log(hat(mu))), ylab=expression(log((y-hat(mu))^2)),
     pch=20, col="blue")
abline(0,1) ## 'variance = mean' line

Figure: Overdispersion diagnosis

We can see that most of the points lie above the line, meaning the estimated variance exceeds the mean for the majority of observations, which is a warning sign of overdispersion.


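Quantitatively, the dispersion parameter φ can be estimated by dividing Pearson’s Chi-squared statistic by the residual degrees of freedom:

$$\hat{\phi} = \frac{X^2}{n - p} = \frac{1}{n - p}\sum_{i=1}^{n}\frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$$

where n is the number of observations and p is the number of estimated coefficients.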

A value of φ larger than 1 indicates overdispersion. To calculate the parameter manually, we use the code below.

## R code
dp = sum(residuals(pois_mod,type ="pearson")^2)/pois_mod$df.residual
dp

which gives us 31.74914 and confirms this simple Poisson model has the overdispersion problem.
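To calibrate what this estimate looks like with and without overdispersion, the sketch below wraps the same calculation in a helper and applies it to simulated data. The helper name `pearson_dispersion` and the simulated data are hypothetical, introduced only for illustration.

```r
## R code
# Pearson chi-squared statistic divided by the residual degrees of freedom
pearson_dispersion <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

set.seed(7)
n <- 1000
x <- rnorm(n)
mu <- exp(1 + 0.5 * x)
y_pois <- rpois(n, mu)                   # counts that really are Poisson
y_nb   <- rnbinom(n, size = 1, mu = mu)  # overdispersed counts with the same mean

phi_pois <- pearson_dispersion(glm(y_pois ~ x, family = poisson))
phi_nb   <- pearson_dispersion(glm(y_nb ~ x, family = poisson))
c(poisson = phi_pois, negbin = phi_nb)   # first should be near 1, second well above 1
```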


Alternatively, we can apply a significance test directly on the fitted model to check the overdispersion.


## R code
library(AER)
dispersiontest(pois_mod)

which yields,


    Overdispersion test

data:  pois_mod
z = 3.3759, p-value = 0.0003678
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  25.39503 

The small p-value rejects the null hypothesis that the dispersion equals 1, so the overdispersion in this model is statistically significant.
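For intuition about what such a test computes, here is a hand-rolled sketch in the style of the Cameron and Trivedi test of Var(y) = μ against Var(y) = (1 + α)μ. The function name `ct_dispersion_test` and the simulated data are hypothetical, and I am assuming, not asserting, that this matches what `dispersiontest` does with its default arguments. If it does, that would explain why its dispersion estimate differs from the Pearson-based one: this version averages over all n observations instead of dividing by the residual degrees of freedom.

```r
## R code
# Hypothetical helper: tests H0: Var(y) = mu against H1: Var(y) = (1 + alpha) * mu
ct_dispersion_test <- function(model) {
  y  <- model$y                        # response stored in the glm object
  mu <- fitted(model)
  n  <- length(y)
  aux <- ((y - mu)^2 - y) / mu         # mean 0 under H0, mean alpha under H1
  z <- sqrt(n) * mean(aux) / sd(aux)   # approximately standard normal under H0
  list(z = z,
       p_value = pnorm(z, lower.tail = FALSE),  # one-sided: overdispersion only
       dispersion = mean(aux) + 1)              # estimate of 1 + alpha
}

set.seed(99)
n <- 1000
x <- rnorm(n)
y <- rnbinom(n, size = 1, mu = exp(1 + 0.5 * x))  # overdispersed by construction
res <- ct_dispersion_test(glm(y ~ x, family = poisson))
res
```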


We can check how much the inference on the coefficients is affected by overdispersion by passing the estimated dispersion parameter to the model summary.


## R code
summary(pois_mod,dispersion = dp)

which yields,


Call:
glm(formula = Species ~ ., family = poisson, data = gala)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-8.2752  -4.4966  -0.9443   1.9168  10.1849  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.1548079  0.2915897  10.819  < 2e-16 ***
Area        -0.0005799  0.0001480  -3.918 8.95e-05 ***
Elevation    0.0035406  0.0004925   7.189 6.53e-13 ***
Nearest      0.0088256  0.0102621   0.860    0.390    
Scruz       -0.0057094  0.0035251  -1.620    0.105    
Adjacent    -0.0006630  0.0001653  -4.012 6.01e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 31.74914)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  716.85  on 24  degrees of freedom
AIC: 889.68

Number of Fisher Scoring iterations: 5

Now two of the five predictors (Nearest and Scruz) are no longer significant, which changes the entire interpretation of the model.


Alright, let’s address the problem in the following two ways.


Allow Dispersion Estimation

A simple way to adjust for overdispersion is to estimate the dispersion parameter within the model. This can be done via the quasi-families in R.

## R code
qpoi_mod = glm(Species ~ ., family=quasipoisson, gala)
summary(qpoi_mod)

which yields,


Call:
glm(formula = Species ~ ., family = quasipoisson, data = gala)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-8.2752  -4.4966  -0.9443   1.9168  10.1849  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.1548079  0.2915901  10.819 1.03e-10 ***
Area        -0.0005799  0.0001480  -3.918 0.000649 ***
Elevation    0.0035406  0.0004925   7.189 1.98e-07 ***
Nearest      0.0088256  0.0102622   0.860 0.398292    
Scruz       -0.0057094  0.0035251  -1.620 0.118380    
Adjacent    -0.0006630  0.0001653  -4.012 0.000511 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 31.74921)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  716.85  on 24  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

We can see that the dispersion parameter is estimated to be 31.74921, which is very close to our manual calculation of 31.74914. The coefficient estimates are unchanged from the Poisson fit, but with the larger standard errors only three of the predictors (Area, Elevation, and Adjacent) remain significant.
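The link between the two summaries is simple: the quasi-Poisson standard errors are the Poisson standard errors multiplied by the square root of the dispersion estimate. For the intercept in the outputs above, 5.175e-02 × √31.749 ≈ 0.2916. The sketch below checks this relationship on simulated data; the data and object names are hypothetical.

```r
## R code
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, size = 1.5, mu = exp(1 + 0.4 * x))  # overdispersed counts

pois_fit  <- glm(y ~ x, family = poisson)
quasi_fit <- glm(y ~ x, family = quasipoisson)

phi_hat  <- summary(quasi_fit)$dispersion            # estimated dispersion
se_pois  <- summary(pois_fit)$coefficients[, "Std. Error"]
se_quasi <- summary(quasi_fit)$coefficients[, "Std. Error"]

all.equal(coef(pois_fit), coef(quasi_fit))           # same point estimates
all.equal(se_quasi, se_pois * sqrt(phi_hat))         # SEs scaled by sqrt(phi)
```

This is why the quasi-family changes the p-values but not the fitted means.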

Replace Poisson with Negative Binomial

Another way to address overdispersion is to change the distributional assumption to the negative binomial distribution, in which the variance is larger than the mean.
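For reference, under the parameterization used by `glm.nb` in the MASS package, with mean μi and shape parameter θ (the symbol θ is taken from that package's output, not from the text above), the variance is

$$\operatorname{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{\theta}$$

so the variance always exceeds the mean, and the model approaches the Poisson as θ grows large.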


Let’s implement the negative binomial model in R.


## R code
library(MASS)
nb_mod = glm.nb(Species ~ ., data = gala)
summary(nb_mod)

which yields,


Call:
glm.nb(formula = Species ~ ., data = gala, init.theta = 1.674602286, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1344  -0.8597  -0.1476   0.4576   1.8416  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.9065247  0.2510344  11.578  < 2e-16 ***
Area        -0.0006336  0.0002865  -2.211 0.027009 *  
Elevation    0.0038551  0.0006916   5.574 2.49e-08 ***
Nearest      0.0028264  0.0136618   0.207 0.836100    
Scruz       -0.0018976  0.0028096  -0.675 0.499426    
Adjacent    -0.0007605  0.0002278  -3.338 0.000842 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(1.6746) family taken to be 1)

    Null deviance: 88.431  on 29  degrees of freedom
Residual deviance: 33.196  on 24  degrees of freedom
AIC: 304.22

Number of Fisher Scoring iterations: 1

              Theta:  1.675 
          Std. Err.:  0.442 

 2 x log-likelihood:  -290.223

It is a better fit to the data: the ratio of residual deviance to degrees of freedom (33.196 / 24 ≈ 1.38) is only slightly larger than 1, and the AIC drops from 889.68 for the Poisson model to 304.22.
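The same comparison can be reproduced on data where the truth is known. The sketch below simulates negative binomial counts and fits both models; the data, the true θ of 2, and the object names are hypothetical choices for illustration.

```r
## R code
library(MASS)                                     # provides glm.nb

set.seed(2024)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, size = 2, mu = exp(1 + 0.5 * x))  # true theta = 2 (hypothetical)

pois_fit <- glm(y ~ x, family = poisson)
nb_fit   <- glm.nb(y ~ x)

dev_ratio_pois <- deviance(pois_fit) / df.residual(pois_fit)
dev_ratio_nb   <- deviance(nb_fit) / df.residual(nb_fit)
c(poisson = dev_ratio_pois, negbin = dev_ratio_nb)  # negbin ratio should be nearer 1
c(poisson = AIC(pois_fit), negbin = AIC(nb_fit))    # lower AIC is better
nb_fit$theta                                        # estimate of theta
```

The deviance ratio is an informal check, not a formal goodness-of-fit test, so it is best read alongside the AIC and residual plots.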


Conclusions

A. Overdispersion can affect the interpretation of the Poisson model, because it makes the standard errors too small.


B. To account for overdispersion in our model, we can use a quasi-family to estimate the dispersion parameter.


C. We can also use the negative binomial model instead of the Poisson model.


Translated from: https://towardsdatascience.com/adjust-for-overdispersion-in-poisson-regression-4b1f52baa2f1



