Generalised linear models for aggregate claims; to Tweedie or not?

Download R package | Download package documentation

This web-page is a complement to the article to Tweedie or not (the published version can be found here and a preprint can be found here). The purpose is to show and comment the R code used for the simulations, graphs and tables shown in the article.

An R package has been created with functions that are used in the code shown in this page. The name of the package is ToTweedieOrNot. The package and the documentation of the package can be downloaded from the links at the top of the page.

In order to properly run the code from this page, not only the ToTweedieOrNot is necessary, but also the CRAN packages tweedie and mgcv. It is necessary then, to put this at the beginning of the R script.

library(statmod);
library(tweedie);
library(mgcv);
library(ToTweedieOrNot);

Some functions from the package use auxiliary functions for simulation (see the documentation of generate.simulated.glm.data). These functions define how to generate simulations for 1 observation or how to generate weights for a specific simulation. The following block defines the auxiliary functions used for the article.

sim.poisson.function <- function(mean=lambda,phi=1){
    rpois(1,lambda=mean/phi)
}

sim.gamma.function <- function(mean,phi){
  rgamma(1,shape=1/phi,scale=mean*phi);
}

sim.igauss.function <- function(mean,phi){
  rig(1,mean=mean,scale=phi)
}

generate.igauss.weights <- function(n){
  rep(1,n);
 }

generate.gamma.weights <- function(n){
  rep(1,n);
 }

generate.weights <- function(n){
  p <- 0.7
    bern <- rbinom(n,size=1,prob=p);
    bern+(1-bern)*runif(n)
 }

The examples from the last section used a data-set from the site http://www.statsci.org/data/general/carinsca.html. The data-set can be downloaded and loaded into R by running the following code. Notice that The file carinsca.txt will be downloaded into your working directory after you run this code.

download.file("http://www.statsci.org/data/general/carinsca.txt",destfile="carinsca.txt")
data <- read.csv("carinsca.txt",head=TRUE,sep="\t");

1 Figure 1

Figure 1 contains an example of a lift chart with a good fit. This graph was generated by simulating a gamma GLM with many observations. Then a GLM with gamma responses and a log-link function was fitted to this simulated data.

# Define first the coefficients vector
gamma.beta.vector.gfe <- c(3,-1.2,0.8,-1.35,1.4,-0.55,0.68,-0.77,1.46,.38,-1.11,1.13,1.14);

# Then, a vector of observations. The entries correspond to how many
# values are simulated for each class
gamma.observations.gfe <- sample(100:200,60,replace=TRUE);


fake.gamma.data.gfe <- generate.simulated.glm.data(classes.vector=c(2,3,10),
					       beta=gamma.beta.vector.gfe,
					       phi=1000,
					       inverse.link.function=exp,
					       observations.per.class=gamma.observations.gfe,
					       names=c("Gender","Driver","Make"),
					       sim.function=sim.gamma.function,
					       response.variable.name="Payments",
					       weights.function=generate.gamma.weights,
					       is.tweedie=FALSE);

gamma.glm.gfe <- glm(Payments/Exposure~Gender+Driver+Make,
		     weights=Exposure,
		     family=Gamma(link=log),
		     data=fake.gamma.data.gfe)

lift.curve.glm(gamma.glm.gfe)

2 Figure 2

Figure 2 exemplifies a lift chart with a bad fit. This graph was generated by simulating a GLM with inverse Gaussian responses and log-link, but then fitting a GLM with normal responses and a log-link to the simulated values. This simulations does give sometimes good fit, so it might be needed to run it a few times in order to get a bad fit.

igauus.neta.bfe <- c(3,-1.2,0.8,-1.35,1.4,-0.55,0.68,-0.77,1.46,.38,-1.11,1.13,1.14);
observations.for.bfe <- sample(100:200,60,replace=TRUE);

fake.igauss.data.bfe <- generate.simulated.glm.data(classes.vector=c(2,3,10),
						beta=igauus.neta.bfe,
						phi=500,
						inverse.link.function=exp,
						observations.per.class=observations.for.bfe,
						names=c("Gender","Driver","Make"),
						sim.function=sim.igauss.function,
						response.variable.name="Payments",
						weights.function=generate.igauss.weights,
						is.tweedie=FALSE);

normal.glm.bfe <- glm(Payments/Exposure~Gender+Driver+Make,
		      weights=Exposure,
		      family=gaussian(link=log),
		      data=fake.igauss.data.bfe, 
		      control=list(maxit=50000) );


# For figure 2(a), the lift chart:
lift.curve.glm(normal.glm.bfe,title="Normal")

# For figure 2(b), the qq plot;
qqplot(predict(normal.glm.bfe,type="response"),
       fake.igauss.data.bfe$Payments/fake.igauss.data.bfe$Exposure,
       xlab="Predicted values",
       ylab="Simulated values")
abline(0,1)

3 Figure 3

In these graphs, the limitations of the Tweedie induced models for frequency and severity is shown. Poisson and gamma distributed data-sets were simulated in such a way that for the classes with higher pure premium, there is also a higher claim frequency but a smaller claim severity. A Tweedie GLM was then fitted for the pure premium from which the induced predicted means for the claim frequency and severity were obtained.

For this example the function tweedie.profile for finding the mle of the power of the variance function did not converge. 1.5 has been used as the power of the variance function in the fitted GLM.

This simulation takes sometime to finish, a way to make it go faster us to reduce the number of observations, i.e. in the fourth line change sample(200:300,60,replace=TRUE); for something like sample(50:100,60,replace=TRUE);

gamma.beta.vector.imbf <- c(5,-0.1,-0.2,-0.3,-0.4,-0.5,-0.6,-0.7,-0.8,-0.9,-0.10,-0.11,-0.12);
poisson.beta.vector.imbf <- -2*gamma.beta.vector.imbf;
poisson.beta.vector.imbf[1] <- 3;
observations.for.poisson.classes.imbf <- sample(200:300,60,replace=TRUE);

aggregated.data.imbf <- generate.aggregated.loss.data(
			  classes.vector=c(2,3,10),
			  beta.vector.freq=poisson.beta.vector.imbf,
			  beta.vector.sev=gamma.beta.vector.imbf,
			  phi.freq=1,
			  phi.sev=100,
			  inverse.link.function.freq=exp,
			  inverse.link.function.sev=exp,
			  observations.per.class.freq=observations.for.poisson.classes.imbf,
			  names=c("Gender","Driver","Make"),
			  sim.function.freq=sim.poisson.function,
			  sim.function.sev=sim.gamma.function,
			  response.variable.name="Loss",
			  weights.function.freq=generate.weights
			  )

tweedie.glm.imbf <- glm(Loss/Duration~Gender+Driver+Make,
			weights=Duration,
			data=aggregated.data.imbf,
			family=tweedie(var.power=1.5,link.power=0))

induced.models.imbf <- tweedie.fs(tweedie.glm.imbf,1.5,summary(tweedie.glm.imbf)$dispersion)
induced.freq.imbf <- induced.models.imbf$mean.poisson
induced.sev.imbf <- induced.models.imbf$mean.gamma

par(mfrow=c(2,1))
# Lift curve for the induced frequency
lift.curve(induced.freq.imbf,aggregated.data.imbf$Claims/aggregated.data.imbf$Duration,title="Claim frequency induced by Tweedie model",xtitle="",ytitle="")

# Lift curve for the induced severity
lift.curve(induced.sev.imbf,aggregated.data.imbf$Loss/aggregated.data.imbf$Claims,xtitle="",ytitle="",title="Claim severity induced by Tweedie model")

4 Tables and graphs for the frequency model in the Example section

The following code can be used to generate Table 4, Figure 4 and Table 5

data$Merit <- factor(data$Merit)
data$Class <- factor(data$Class)
data$C1M3 <-factor(data$Class == 1 & data$Merit == 3 )
data$C3M3 <-factor(data$Class == 3 & data$Merit == 3 )
data$C4M3 <-factor(data$Class == 4 & data$Merit == 3 )
data$C1M2 <-factor(data$Class == 1 & data$Merit == 2 )

frequency.model <- glm(Claims ~ Merit + Class + C1M3 + C3M3 + C4M3 +C1M2,
		       offset=log(Insured),
		       family=poisson(link=log),
		       data=data );

# Table 4
summary(frequency.model)

# Figure 5
resids.plot(frequency.model)

# Table 5
drop1(frequency.model)

5 Tables and graphs for the severity model in the Example section

The following code generates Table 6, Table 7 and Figure 5

severity.model <- glm(Cost/Claims ~ Merit + Class,weights=Claims,family=Gamma(link=log),data=data);

#Table 6
summary(severity.model)

#Table 7
drop1(severity.model,test="F")

#Figure 4
resids.plot(severity.model)

6 Tables and graphs for the Tweedie GLM in the Examples section

The following code generates Table 8, Figure 6 and Table 9

#To find the mle of the power of the variance
tweedie.mle <- tweedie.profile(Cost/Insured~Class+Merit + C1M3 + C4M3, 
			       weights=Insured,
			       p.vec=seq(1.1,1.9,by=0.01),
			       data=data,
			       do.ci=F,
			       do.smooth=F);

tweedie.model <- glm(Cost/Insured~Class+Merit+ C1M3 +  C4M3,
		     weights=Insured,
		     family= tweedie( var.power=tweedie.mle$p.max,
		     link.power=0),
		     data=data);

#Table 8
summary(tweedie.model);

#Figure 6
resids.plot(tweedie.model);

#Table 9
drop1(tweedie.model,test="F");

7 Figure 7

The lift charts from the SPGA and Tweedie GLM can be obtained by running

severity.mean <- predict(severity.model,type="response");

spga.mean <- (predict(frequency.model,type="response")/data$Insured) * severity.mean;

par(mfrow=c(1,2));

lift.curve(spga.mean,data$Cost/data$Insured,weights=data$Insured,
	   title="SPGA lift chart",ytitle="Aggregated loss in thousands of dollars.");

lift.curve.glm(tweedie.model,title="Tweedie lift chart",
	       ytitle="Aggregated loss in thousands of dollars") ;

8 Errata

In the previous version of this website sim.gamma.function was defined as

sim.gamma.function <- function(mean,phi){
   rgamma(1,shape=mean,rate=phi);
 }

instead of the definition given at the beginning of this document. This was a typo. Note that the graphs in the article were produced with the correct version of this function.