Generalized Linear Modelling in R (part 1)

In classical linear modelling we are assuming that the response variable (Y) is normally distributed, however for certain type of data like count data or presence/absence data this is not the case.

There is in statistic an ensemble of technique called Generalized Linear Modelling (GLM in short) where the reponse variable follow one known distribution, these are for example: gaussian, poisson, Bernoulli or binomial.

A GLM is made of three parts: the assumptions of the distribution of the response variable, the predictors used to build the model (the X part) and the link between the predictors and the response variable.

Let’s first see an exmple of GLM with a gaussian distribution (which is basically a linear model)

#load the dataset airquality measuring various climatic variables

#you can compare the two output which gives similar estimates of the slope and intercept

#model validation plot, residuals vs fitted values, residuals vs predictors and histogramm of #residuals

#add a plot with the estimated intercept and slope (which are both significantly different from #0)

Model Validation Plots

From this example we see that linear models are GLM with a gaussian distribution. The model validation plot show that there is no pattern in the residuals which could violate the assumption of the homogeneity of the variance and the residuals are normally distributed. The conclusion that we can draw from this are: the wind values significantly affect the temperature value in a negative way.

Know let’s look at some data which are not normally distributed, count data which follow a poisson distribution:

#a poisson example with the count of Amphibians road kills (Zuur, 2009)
rk<-data.frame(death=c(22,14,65,55,88,104,49, 66, 26, 47, 35, 55, 44, 30, 33, 29, 34, 64, 76, 32, 34, 32, 35, 22, 34, 25, 18,14, 14, 7, 7, 17, 10, 3, 6, 5, 2, 3, 2, 2, 7, 3, 5, 4, 7, 12, 7, 14, 10, 4, 11, 3)
,distance=250.214 , 741.179, 1240.080, 1739.885, 2232.130, 2724.089, 3215.511, 3709.401, 4206.477, 4704.176,5202.328, 5700.669, 6199.342, 6698.151, 7187.762, 7668.833, 8152.155, 8633.224, 
9101.411, 9573.578,10047.630, 10523.939, 11002.496, 11482.896, 11976.232, 12470.968, 12968.285, 13465.914, 13961.321, 14432.954,14904.995, 15377.983, 15854.389, 16335.936, 16810.109, 
17235.045, 17673.064, 18167.269, 18656.949, 19149.507,19645.717, 20141.987, 20640.729, 21138.903, 21631.542, 22119.102 ,22613.647, 23113.450, 23606.088, 24046.886,24444.874, 24884.803)

#do a poisson model

#model validation plot with the fitted data
plot(r~f,ylab="Residuals",xlab="Fitted values")

#add the predicted values with the 95% confidence interval

Model Validation Plots

The summary command on a glm object give similar information as in a lm object. We have the formula call, informations about the distribution of the residuals, the estimated coefficiants; in this case they are on a logarithmic scale since the default link between the predictors and the response variable for a poisson distribution is the log function (see ?family). Then information on the total and residuals sum of squares, and an AIC coefficient which can be usefull when comparing different models.


Zuur et al books,

Zeilis et al,



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s