Generalised linear models for aggregate claims; to Tweedie or not?
This web-page is a complement to the article to Tweedie or not (the published version can be found here and a preprint can be found here). The purpose is to show and comment the R code used for the simulations, graphs and tables shown in the article.
An R package has been created with functions that are used in the code
shown in this page. The name of the package is ToTweedieOrNot
. The
package and the documentation of the package can be downloaded from
the links at the top of the page.
In order to properly run the code from this page, not only the
ToTweedieOrNot
is necessary, but also the CRAN packages tweedie
and mgcv
. It is necessary then, to put this at the beginning of the
R script.
library(statmod); library(tweedie); library(mgcv); library(ToTweedieOrNot);
Some functions from the package use auxiliary functions for simulation
(see the documentation of generate.simulated.glm.data
). These
functions define how to generate simulations for 1 observation or how
to generate weights for a specific simulation. The following block
defines the auxiliary functions used for the article.
sim.poisson.function <- function(mean=lambda,phi=1){ rpois(1,lambda=mean/phi) } sim.gamma.function <- function(mean,phi){ rgamma(1,shape=1/phi,scale=mean*phi); } sim.igauss.function <- function(mean,phi){ rig(1,mean=mean,scale=phi) } generate.igauss.weights <- function(n){ rep(1,n); } generate.gamma.weights <- function(n){ rep(1,n); } generate.weights <- function(n){ p <- 0.7 bern <- rbinom(n,size=1,prob=p); bern+(1-bern)*runif(n) }
The examples from the last section used a data-set from the site
http://www.statsci.org/data/general/carinsca.html. The data-set can be
downloaded and loaded into R by running the following code. Notice
that The file carinsca.txt
will be downloaded into your working
directory after you run this code.
download.file("http://www.statsci.org/data/general/carinsca.txt",destfile="carinsca.txt") data <- read.csv("carinsca.txt",head=TRUE,sep="\t");
1 Figure 1
Figure 1 contains an example of a lift chart with a good fit. This graph was generated by simulating a gamma GLM with many observations. Then a GLM with gamma responses and a log-link function was fitted to this simulated data.
# Define first the coefficients vector gamma.beta.vector.gfe <- c(3,-1.2,0.8,-1.35,1.4,-0.55,0.68,-0.77,1.46,.38,-1.11,1.13,1.14); # Then, a vector of observations. The entries correspond to how many # values are simulated for each class gamma.observations.gfe <- sample(100:200,60,replace=TRUE); fake.gamma.data.gfe <- generate.simulated.glm.data(classes.vector=c(2,3,10), beta=gamma.beta.vector.gfe, phi=1000, inverse.link.function=exp, observations.per.class=gamma.observations.gfe, names=c("Gender","Driver","Make"), sim.function=sim.gamma.function, response.variable.name="Payments", weights.function=generate.gamma.weights, is.tweedie=FALSE); gamma.glm.gfe <- glm(Payments/Exposure~Gender+Driver+Make, weights=Exposure, family=Gamma(link=log), data=fake.gamma.data.gfe) lift.curve.glm(gamma.glm.gfe)
2 Figure 2
Figure 2 exemplifies a lift chart with a bad fit. This graph was generated by simulating a GLM with inverse Gaussian responses and log-link, but then fitting a GLM with normal responses and a log-link to the simulated values. This simulations does give sometimes good fit, so it might be needed to run it a few times in order to get a bad fit.
igauus.neta.bfe <- c(3,-1.2,0.8,-1.35,1.4,-0.55,0.68,-0.77,1.46,.38,-1.11,1.13,1.14); observations.for.bfe <- sample(100:200,60,replace=TRUE); fake.igauss.data.bfe <- generate.simulated.glm.data(classes.vector=c(2,3,10), beta=igauus.neta.bfe, phi=500, inverse.link.function=exp, observations.per.class=observations.for.bfe, names=c("Gender","Driver","Make"), sim.function=sim.igauss.function, response.variable.name="Payments", weights.function=generate.igauss.weights, is.tweedie=FALSE); normal.glm.bfe <- glm(Payments/Exposure~Gender+Driver+Make, weights=Exposure, family=gaussian(link=log), data=fake.igauss.data.bfe, control=list(maxit=50000) ); # For figure 2(a), the lift chart: lift.curve.glm(normal.glm.bfe,title="Normal") # For figure 2(b), the qq plot; qqplot(predict(normal.glm.bfe,type="response"), fake.igauss.data.bfe$Payments/fake.igauss.data.bfe$Exposure, xlab="Predicted values", ylab="Simulated values") abline(0,1)
3 Figure 3
In these graphs, the limitations of the Tweedie induced models for frequency and severity is shown. Poisson and gamma distributed data-sets were simulated in such a way that for the classes with higher pure premium, there is also a higher claim frequency but a smaller claim severity. A Tweedie GLM was then fitted for the pure premium from which the induced predicted means for the claim frequency and severity were obtained.
For this example the function tweedie.profile
for finding the mle of
the power of the variance function did not converge. 1.5 has been used
as the power of the variance function in the fitted GLM.
This simulation takes sometime to finish, a way to make it go faster
us to reduce the number of observations, i.e. in the fourth line
change sample(200:300,60,replace=TRUE);
for something like sample(50:100,60,replace=TRUE);
gamma.beta.vector.imbf <- c(5,-0.1,-0.2,-0.3,-0.4,-0.5,-0.6,-0.7,-0.8,-0.9,-0.10,-0.11,-0.12); poisson.beta.vector.imbf <- -2*gamma.beta.vector.imbf; poisson.beta.vector.imbf[1] <- 3; observations.for.poisson.classes.imbf <- sample(200:300,60,replace=TRUE); aggregated.data.imbf <- generate.aggregated.loss.data( classes.vector=c(2,3,10), beta.vector.freq=poisson.beta.vector.imbf, beta.vector.sev=gamma.beta.vector.imbf, phi.freq=1, phi.sev=100, inverse.link.function.freq=exp, inverse.link.function.sev=exp, observations.per.class.freq=observations.for.poisson.classes.imbf, names=c("Gender","Driver","Make"), sim.function.freq=sim.poisson.function, sim.function.sev=sim.gamma.function, response.variable.name="Loss", weights.function.freq=generate.weights ) tweedie.glm.imbf <- glm(Loss/Duration~Gender+Driver+Make, weights=Duration, data=aggregated.data.imbf, family=tweedie(var.power=1.5,link.power=0)) induced.models.imbf <- tweedie.fs(tweedie.glm.imbf,1.5,summary(tweedie.glm.imbf)$dispersion) induced.freq.imbf <- induced.models.imbf$mean.poisson induced.sev.imbf <- induced.models.imbf$mean.gamma par(mfrow=c(2,1)) # Lift curve for the induced frequency lift.curve(induced.freq.imbf,aggregated.data.imbf$Claims/aggregated.data.imbf$Duration,title="Claim frequency induced by Tweedie model",xtitle="",ytitle="") # Lift curve for the induced severity lift.curve(induced.sev.imbf,aggregated.data.imbf$Loss/aggregated.data.imbf$Claims,xtitle="",ytitle="",title="Claim severity induced by Tweedie model")
4 Tables and graphs for the frequency model in the Example section
The following code can be used to generate Table 4, Figure 4 and Table 5
data$Merit <- factor(data$Merit) data$Class <- factor(data$Class) data$C1M3 <-factor(data$Class == 1 & data$Merit == 3 ) data$C3M3 <-factor(data$Class == 3 & data$Merit == 3 ) data$C4M3 <-factor(data$Class == 4 & data$Merit == 3 ) data$C1M2 <-factor(data$Class == 1 & data$Merit == 2 ) frequency.model <- glm(Claims ~ Merit + Class + C1M3 + C3M3 + C4M3 +C1M2, offset=log(Insured), family=poisson(link=log), data=data ); # Table 4 summary(frequency.model) # Figure 5 resids.plot(frequency.model) # Table 5 drop1(frequency.model)
5 Tables and graphs for the severity model in the Example section
The following code generates Table 6, Table 7 and Figure 5
severity.model <- glm(Cost/Claims ~ Merit + Class,weights=Claims,family=Gamma(link=log),data=data); #Table 6 summary(severity.model) #Table 7 drop1(severity.model,test="F") #Figure 4 resids.plot(severity.model)
6 Tables and graphs for the Tweedie GLM in the Examples section
The following code generates Table 8, Figure 6 and Table 9
#To find the mle of the power of the variance tweedie.mle <- tweedie.profile(Cost/Insured~Class+Merit + C1M3 + C4M3, weights=Insured, p.vec=seq(1.1,1.9,by=0.01), data=data, do.ci=F, do.smooth=F); tweedie.model <- glm(Cost/Insured~Class+Merit+ C1M3 + C4M3, weights=Insured, family= tweedie( var.power=tweedie.mle$p.max, link.power=0), data=data); #Table 8 summary(tweedie.model); #Figure 6 resids.plot(tweedie.model); #Table 9 drop1(tweedie.model,test="F");
7 Figure 7
The lift charts from the SPGA and Tweedie GLM can be obtained by running
severity.mean <- predict(severity.model,type="response"); spga.mean <- (predict(frequency.model,type="response")/data$Insured) * severity.mean; par(mfrow=c(1,2)); lift.curve(spga.mean,data$Cost/data$Insured,weights=data$Insured, title="SPGA lift chart",ytitle="Aggregated loss in thousands of dollars."); lift.curve.glm(tweedie.model,title="Tweedie lift chart", ytitle="Aggregated loss in thousands of dollars") ;
8 Errata
In the previous version of this website sim.gamma.function
was defined as
sim.gamma.function <- function(mean,phi){ rgamma(1,shape=mean,rate=phi); }
instead of the definition given at the beginning of this document. This was a typo. Note that the graphs in the article were produced with the correct version of this function.