Predicting the removal of customers' assets from the bank


Master’s Thesis

Stochastics and Financial Mathematics

Supervisors: Stan Alink (Van Lanschot) Peter Spreij (UvA)

Author: Bas van Schriek

July 8, 2013

Abstract

The purpose of this thesis is to investigate whether it is possible to model and predict the removal of customers' assets from the bank. Specifically, we would like to identify the risky customers, i.e. customers who are likely to remove their assets from the bank. The methodology consists of a variable transformation using a specially designed algorithm and a logistic regression model. The significance of potential variables is reviewed twice: first, the Gini coefficient has to be high enough; second, chi-square tests on the remaining variables are performed during a stepwise selection procedure for the logistic regression model. For this approach, a partition of the dataset into four subsets is required: a construction set and a stop set, for the algorithm and the determination of the transformed values of the explanatory variables; a coefficients set, used for selecting the significant variables and for estimating the coefficients of the logistic regression model; and an out-of-sample set, on which the performance of the models is tested. Three different models have been examined and it turned out that all three sufficiently met the criteria for a new model within Van Lanschot Bankiers. Moreover, it appears that one of them can fairly well identify risky customers. This prototype is currently being tested in practice.

KEYWORDS: Weight of Evidence, Gini Coefficient, Logistic Regression.

Korteweg de Vries Institute for Mathematics University of Amsterdam Science Park 904, 1098 XH Amsterdam The Netherlands http://www.science.uva.nl/math


Preface

This thesis is the result of half a year of research at Van Lanschot Bankiers in 's-Hertogenbosch. During my studies, both the bachelor's in Mathematics at the Radboud University in Nijmegen and the master's in Stochastics and Financial Mathematics at the University of Amsterdam, I mainly learned theory without any application in practice. Therefore, I preferred to finish my master's at a financial institution. I am glad that Van Lanschot Bankiers offered me this opportunity. During the past months, I have learned a lot, and I especially enjoyed working on the project.

My special thanks go first of all to my supervisors. I would like to thank my daily supervisor at Van Lanschot Bankiers, Stan Alink, for his time and effort in the recent months. Furthermore, I would like to thank my supervisor at the University of Amsterdam, Peter Spreij, with whom I could discuss all my issues and other questions. Moreover, I would like to thank the whole Financial Risk Management department for their willingness to help, especially John de Kroon for his help with programming. Last but not least, I would like to thank the account managers of the office in Eindhoven, who gave me insight into my results and some potential practical applications of my model.

I hope you will enjoy reading this thesis as much as I enjoyed producing it.

Bas van Schriek
's-Hertogenbosch, July 2013


Contents

Abstract
Preface
List of Tables
List of Figures

1 Introduction
  1.1 Background
  1.2 Problem statement
    1.2.1 Dependent variable
    1.2.2 Model selection
  1.3 Models
    1.3.1 Model 1: Customer-based
    1.3.2 Model 2: Assets-based
  1.4 Thesis overview

2 Generalized linear models
  2.1 Linear Regression
    2.1.1 Estimation of β
    2.1.2 Example
  2.2 Generalization
  2.3 Logistic Regression
  2.4 Asymptotic behavior

3 Predictive powers
  3.1 Example
  3.2 Weight of Evidence
  3.3 Gini coefficient
    3.3.1 Background
    3.3.2 Customer-based Gini coefficient
    3.3.3 Assets-based Gini coefficient
  3.4 Chi-square tests
    3.4.1 Wald chi-square test
    3.4.2 Score chi-square test

4 Methodology
  4.1 Data tree
  4.2 Bucket algorithm
    4.2.1 Design
    4.2.2 Outliers
    4.2.3 Analysis of boundaries
  4.3 Selecting the explanatory variables

5 Model analysis
  5.1 Customer-based model
    5.1.1 Transformation
    5.1.2 Logistic regression
    5.1.3 Out-of-sample performance
  5.2 Assets-based model
    5.2.1 Transformation
    5.2.2 Logistic regression
    5.2.3 Out-of-sample performance
  5.3 Modified Assets-based model
    5.3.1 Transformation
    5.3.2 Logistic regression
    5.3.3 Out-of-sample performance

6 Conclusion

7 Suggestions for future research

Appendix A Additional theorems

Bibliography

List of Tables

1.1  Data of two fictional customers for Model 1.
1.2  Data of customer K1 (from Example 1.3.2) for Model 2.
2.1  Characteristics of some distribution functions of the exponential family.
3.1  Fictional data of 'default' and 'non-default' customers at a bank.
3.2  WoE and LOR added to each category of Table 3.1.
4.1  Realizations of X for 9 different customers, including one default.
4.2  Values of the Ginis, considering the 25th, 50th and 75th percentile as boundary.
4.3  Values of the Ginis, considering the boundaries of step 2.
5.1  Gini and Weight of Evidence for every bucket regarding Model 1.
5.2  Summary of the stepwise selection for Model 1.
5.3  Analysis of the coefficients for the significant variables for Model 1.
5.4  Odds Ratio Estimates for the significant variables for Model 1.
5.5  Gini and Weight of Evidence for every bucket regarding Model 2.
5.6  Summary of the stepwise selection for Model 2.
5.7  Analysis of the coefficients for the significant variables for Model 2.
5.8  Odds Ratio Estimates for the significant variables for Model 2.
5.9  Gini and Weight of Evidence for every bucket regarding Model 2'.
5.10 Summary of the stepwise selection for Model 2'.
5.11 Analysis of the coefficients for the significant variables for Model 2'.

List of Figures

3.1  An example of a Lorenz curve.
3.2  A graph for the Customer-based Gini coefficient.
3.3  Graphical representation of the Wald and Score chi-square tests.
4.1  Overview of the splitting of the dataset.
4.2  The evolution of the Gini coefficient on the construction set and stop set.
5.1  The graph for the out-of-sample Gini coefficient of Model 1.
5.2  The graph for the out-of-sample Gini coefficient of Model 2.
5.3  The graph for the out-of-sample Gini coefficient of Model 2'.


Chapter 1

Introduction

In this first chapter, we motivate and introduce the subject of this thesis, starting with some background from the literature and similar studies in other branches. Thereafter, we present a short discussion of the dependent variable and a motivation of the choice of modelling technique. Finally, the developed models are described and discussed.

1.1 Background

In this thesis, we will model the behavior of customers of banks. The economic value of customer retention is widely recognised in the literature [11]. Long-term customers buy more and become less costly to serve, due to the bank's greater knowledge of the existing customer and due to decreased servicing costs. Moreover, they tend to be less sensitive to competitive marketing activities. Also, losing customers not only leads to opportunity costs because of reduced sales, but to an increased need for attracting new customers as well. It turns out that attracting a new customer is five to six (!) times more expensive than retaining an existing one [11]. By modelling customer behavior, we can make a prediction for the current customers and identify the risky customers, i.e. the customers who are likely to leave the bank. These customers can be monitored, and once they exhibit runaway behavior, the bank can please them by, for example, sending flowers or offering a certain discount.

Prediction tools are already applied in the European financial services industry [11], for instance in retail banking [12], but especially in telecommunications and (health) insurance [13]. There it is called churn prediction [8]. 'Churn' is derived from 'change' and 'turn'; it means the discontinuation of a contract. In the paper of D. Lazarov et al. [8], six churn prediction methods are presented, using four sets of data variables:

• Customer behaviour identifies which parts of the service a customer is using and how often he is using them.
• Customer perceptions are defined as the way a customer apprehends the service.
• Customer demographics include age, gender, level of education, social status, geographical data, etc.
• Macroenvironment variables identify changes in the world, different experiences of customers, which can affect the way they use a service.

One of the prediction methods used is regression analysis, a technique we will use in this study as well (more details in Chapter 2).
Churn prediction is also applied in the Netherlands, by health insurer Agis in 2008 [13]. In their research, a model for tracing customers in the 'risk zone', i.e. the churners, was developed. They focused on indications in the customers' past and the differences between customers, churners and non-churners. In this branch, a churner is more easily defined, i.e. a customer who switched to another insurer.

D. Popović et al. [12] have developed a churn prediction model in retail banking using a so-called fuzzy C-means algorithm. We shall not delve into this algorithm. In that paper, a customer is treated as a churner if he had N > 0 products (savings account, credit card, cash loan, etc.) at time t_n and had no product at time t_{n+1}, meaning that he cancelled all his products between t_n and t_{n+1}.

In my study, the emphasis is more on customers' assets (Definition 1.2.1) than on churning. Indeed, the balance sheet of the bank mainly consists of the assets of the customers, so it is important to understand the behavior of customers' assets. There is practically no difference between two customers each having 50 000 in assets and one customer having 100 000 in assets. Moreover, churning, or customer attrition, is a prolonged procedure, whereas removing assets is quick and easy. Therefore, we will predict whether a customer will remove a certain amount of assets in a given time period. For more details about the constructed models, please refer to Section 1.3.

1.2 Problem statement

So far, researchers have only studied churn prediction, i.e. customers running away. However, for financial institutions such as banks, it is more interesting to study the removal of the assets of customers rather than the customers themselves, as mentioned in Section 1.1. In view of this, we define the dependent variable for our models (Section 1.2.1). Next, we discuss several statistical approaches and motivate the choice of logistic regression (Section 1.2.2).

1.2.1 Dependent variable

Definition 1.2.1 (Assets). The value of the assets A_t of a customer K at time t is defined as the sum of the positive balances on each product at time t, plus the value of the securities portfolio. The value of the securities portfolio is defined as Σ_{i=1}^m N_i R_i, where

• N_i is the number of shares of security i at time t,
• R_i is the price of security i at time t,
• m is the number of securities.

In this thesis, we study the evolution of A during a certain time period. Therefore, we will define several dependent variables Y as a function of the assets A. For example,

Y := 1 if the assets A of a certain customer have been halved within a month, and Y := 0 otherwise.

Finally, we want to identify the customers who will remove (a major part of) their assets from the bank. The associated models are defined in the next section (Section 1.3).
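As an illustration, the asset value A_t of Definition 1.2.1 can be computed as follows; the balances and securities below are made-up numbers, not data from the thesis:

```python
# Sketch of Definition 1.2.1: assets A_t of a customer at time t.
# Balances and securities are fictional example values.

def asset_value(balances, portfolio):
    """Sum of positive product balances plus the securities portfolio value.

    balances  -- list of balances per product (negative balances are ignored)
    portfolio -- list of (N_i, R_i): number of shares and price of security i
    """
    positive_balances = sum(b for b in balances if b > 0)
    portfolio_value = sum(n * r for n, r in portfolio)
    return positive_balances + portfolio_value

A_t = asset_value(balances=[5000.0, -200.0, 1500.0],
                  portfolio=[(10, 50.0), (20, 2.5)])
print(A_t)  # 6500 + 550 = 7050.0
```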

1.2.2 Model selection

There are many possible statistical approaches available. Here, we will discuss a few of them and motivate why we have chosen logistic regression. The model descriptions in this subsection are based on a paper of B. Baesens et al. [2].



Linear programming

Linear programming is a method for determining a way to achieve the best outcome (such as maximum profit or lowest cost) in a given mathematical model for some list of requirements represented as linear relationships. One of the most common and popular formulations of a linear program is the following:

max_x  c^T x,

subject to

Ax ≤ b,   x ≥ 0,

where x represents the vector of variables (to be determined), c and b are vectors of known coefficients and A is a known matrix of coefficients. The expression to be maximized is called the objective function. The inequalities Ax ≤ b are the constraints, which specify a convex polytope over which the objective function is to be optimized.

Support vector machines

Given a training set of N data points D = {(x_i, y_i)}_{i=1}^N with input data x_i ∈ R^n and corresponding binary class labels y_i ∈ {−1, +1}, the support vector machine classifier satisfies the following conditions:

w^T φ(x_i) + b ≥ +1   if y_i = +1,
w^T φ(x_i) + b ≤ −1   if y_i = −1,

which is equivalent to

y_i (w^T φ(x_i) + b) ≥ 1,   i = 1, ..., N.

The non-linear function φ(·) maps the input space to a high-dimensional feature space. In this feature space, the above inequalities basically construct a hyperplane w^T φ(x) + b = 0 discriminating between both classes. In the primal weight space, the classifier then takes the form y(x) = sign[w^T φ(x) + b] but, on the other hand, is never evaluated in this form. One defines the convex optimisation problem

min_{w,ξ}  (1/2) w^T w + C Σ_{i=1}^N ξ_i,

subject to

y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,   i = 1, ..., N,
ξ_i ≥ 0,   i = 1, ..., N.

The slack variables ξ_i are needed in order to allow misclassification in the set of inequalities. The first part of the objective function tries to maximise the margin between both classes in the feature space, whereas the second part minimises the misclassification error. The positive real constant C should be considered as a tuning parameter in the algorithm.



Neural networks

Neural networks are mathematical representations inspired by the functioning of the human brain. There are several types of neural networks. We will discuss the multilayer perceptron neural network (MLP) in more detail, because it is the most popular neural network for classification. An MLP is typically composed of an input layer, one or more hidden layers and an output layer, each consisting of several neurons. Each neuron processes its inputs and generates one output value that is transmitted to the neurons in the subsequent layer. In the case of one hidden layer and one output neuron, the output of hidden neuron i is computed by processing the weighted inputs and its bias term b_i^(1) as follows:

h_i = f^(1)( b_i^(1) + Σ_{j=1}^n W_{ij} x_j ),

where W is the weight matrix and W_{ij} denotes the weight connecting input j to hidden unit i. In an analogous way, the output of the output layer is computed as follows:

y = f^(2)( b^(2) + Σ_{j=1}^{n_h} v_j h_j ),

with n_h being the number of hidden neurons and v the weight vector, where v_j represents the weight connecting hidden unit j to the output neuron. The bias inputs play a similar role as the intercept term in a classical linear regression model. A threshold function is then typically applied to map the network output y to a classification label. The transfer functions f^(1) and f^(2) allow the network to model non-linear relationships in the data. For a binary classification problem, it is convenient to use the logistic transfer function in the output layer (f^(2)), since its output is limited to a value within the range [0, 1]. This allows the output y of the MLP to be interpreted as a conditional probability.

Logistic regression

Given a training set of N data points D = {(x_i, y_i)}_{i=1}^N with input data x_i ∈ R^n and corresponding binary class labels y_i ∈ {0, 1}, the logistic regression approach to classification models the probability P(y = 1|x) as follows:

P(y = 1|x) = 1 / (1 + exp(−(w_0 + w^T x))),

where x ∈ R^n is an n-dimensional input vector, w is the parameter vector and the scalar w_0 is the intercept. The parameters w_0 and w are then typically estimated using the maximum likelihood procedure.
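A minimal sketch of this model and of a maximum-likelihood fit by gradient ascent; the one-dimensional data and learning rate below are made up for illustration and are not from the thesis:

```python
import math

# Sketch of logistic regression: P(y=1|x) = 1 / (1 + exp(-(w0 + w*x))),
# fitted by gradient ascent on the log-likelihood. Made-up 1-D data.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [0,   0,   0,   1,   0,   1,   1,   1]   # overlapping labels, MLE exists

def p(w0, w, x):
    return 1.0 / (1.0 + math.exp(-(w0 + w * x)))

w0, w = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    # Gradient of the log-likelihood: sum_i (y_i - p_i) * (1, x_i)
    g0 = sum(y - p(w0, w, x) for x, y in zip(xs, ys))
    g1 = sum((y - p(w0, w, x)) * x for x, y in zip(xs, ys))
    w0 += lr * g0
    w += lr * g1

print(p(w0, w, 0.0), p(w0, w, 3.5))  # low probability left, high right
```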

The paper of B. Baesens et al. studied the performance of various classification techniques for credit scoring. This is the opposite side of the subject of this thesis, which is on the debit side. However, the comparison is certainly relevant, since nearly identical datasets are used. It was found in that paper that the neural network classifier yields a very good performance. However, it had to be noted that simple, linear classifiers such as logistic regression also gave very good performances. The experiments in the paper also indicated that many classification techniques yield performances which are quite competitive with each other. Taking into account that logistic regression is intuitively easy to understand, in contrast to neural networks, which are more of a 'black box', we have chosen logistic regression in this thesis.


1.3 Models

As discussed in the previous section, we have chosen logistic regression. Therefore, we have to define an indicator variable, i.e. a variable Y such that Y = 1_E for some event E. Of course, there are plenty of ways to define this event E. In this thesis, two choices are made; they will be discussed in this section. The first model is based on customers, the second model focuses on the money.

1.3.1 Model 1: Customer-based

Through this model, we would like to predict a certain behavior of customers. To be more precise, the dependent variable of this model has the following structure:

Definition 1.3.1 (Dependent variable Model 1). The dependent variable for Model 1 has the following structure: Y = 1_{E_1}, where

E_1 := {Customer will remove at least M_1 from the bank during the coming period T_1},

where M_1 is a fixed amount of money (in euros) and T_1 is a certain time period.

Here, we assume that the customers are independent of each other, i.e. a customer acts independently of other customers. This is necessary, since logistic regression assumes experiments to be independent. A customer in the set E_1 is called a default.

We could have chosen a fraction of the customer's assets instead of a fixed amount as well. However, from the perspective of Van Lanschot Bankiers, a fraction is less attractive because of the large differences in the assets of customers. Indeed, consider a customer A with assets of 500 000 euro and a customer B with assets of 10 000 000 euro. Suppose that customer A has removed 50% of his assets during a period T_1 and customer B has removed only 5%. This is equivalent to a removal of 250 000 euro for customer A and 500 000 euro for customer B. The removal of customer B takes priority, since it implies a higher loss in the assets of Van Lanschot Bankiers, even though customer A has removed a much greater fraction.

Still, there are quite a few possibilities to define the dependent variable for Model 1, based on the structure of Definition 1.3.1. For internal reasons, the choice of the values for M_1 and T_1 will not be mentioned. The following example illustrates this model, using the structure of Definition 1.3.1.

Example 1.3.2. The following table supposes that there are k covariates and two customers at time t_0. Moreover, this example assumes that M_1 = 30 000 and T_1 = 12 months.

X_1   X_2   ···   X_k   Y = 1_{E_1}   A_{t_0}    A_{t_0+T_1}
3     0.3   ···   20    0             60 000     40 000
1     0.8   ···   14    1             100 000    50 000

Table 1.1: Data of two fictional customers for Model 1.

Every customer has his own row of realizations of the covariates X_1, ..., X_k. The first customer of Example 1.3.2, say K_1, has removed 20 000 in one year, which is less than 30 000. Therefore K_1 ∉ E_1, hence Y = 0. The second customer, K_2, has removed more than 30 000, hence K_2 ∈ E_1 and Y = 1.
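The labelling of Example 1.3.2 can be sketched as follows; M_1 = 30 000 and the asset values are the fictional ones of Table 1.1:

```python
# Sketch of the Model 1 label: Y = 1 iff the customer removed at least
# M1 euro between t0 and t0 + T1. Values taken from Example 1.3.2
# (fictional; the real M1 and T1 are not disclosed in the thesis).
M1 = 30_000

def label_model1(assets_t0, assets_t1, m1=M1):
    removed = assets_t0 - assets_t1
    return 1 if removed >= m1 else 0

customers = [(60_000, 40_000),    # K1: removed 20 000 -> Y = 0
             (100_000, 50_000)]   # K2: removed 50 000 -> Y = 1
print([label_model1(a0, a1) for a0, a1 in customers])  # [0, 1]
```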


In practice, we have access to a huge dataset, consisting of many customers¹ and numerous realizations of several covariates X_i. Using logistic regression, we can find estimators for the coefficients β_i, resulting in an estimator for the probability P[Y = 1|X = x]. Thus, we can predict whether a customer will remove at least M_1 from the bank during the coming period T_1. Specifically, if x is the vector of realizations of the covariates X_1, ..., X_k, the estimated probability will equal

P[Y = 1|X = x] = 1 / (1 + exp(−x^T β)).

Here, we add x_0 = 1 to the vector x and β_0 to the vector β, such that β_0 is the intercept. We take a deeper look into this estimated probability in the next chapter.

The drawback of this model is that it only distinguishes whether a customer removes M_1 euro or not. A better scenario would be that we can predict the amount of money a customer will remove from the bank. Therefore, we introduce the second model, based on assets.

1.3.2 Model 2: Assets-based

In a nutshell, this second model lumps all the assets together and puts them into so-called 'units'. Herewith, we estimate the probability that such a unit will be removed during a certain coming period. Therefore, this model is not based on customers, but on the assets. Let us make this more precise, starting with units.

Definition 1.3.3 (Unit). Suppose we have a customer K with assets A. A unit is a bucket consisting of M_2 euro, where M_2 is a fixed positive number. For Model 2, we divide A into ⌊A/M_2 + 1/2⌋ units consisting of M_2 euro each.

In this way, we have transformed the dataset from 'customers' to 'units'. Note that this transformation enlarges the dataset tremendously, at least when M_2 is not too big. Now we can define the dependent variable for this model in a similar way as in Definition 1.3.1:

Definition 1.3.4 (Dependent variable Model 2). The dependent variable for Model 2 has the following structure: Y = 1_{E_2}, where

E_2 := {The unit will be removed during the coming period T_2},

where a unit consists of M_2 euro and T_2 is a fixed time period.

Similar to Model 1, we assume independence of the units, i.e. the removal of one unit does not depend on the other units. A unit in E_2 is called a default. These defaults are different from the defaults in Section 1.3.1; from the context, it will be clear which defaults are meant. Again, there are quite a few possibilities to define the dependent variable for Model 2, based on the structure of Definition 1.3.4, and again, for internal reasons, the chosen values for M_2 and T_2 will not be mentioned. The following example illustrates the transformation from Definition 1.3.3 and the structure of Definition 1.3.4.
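The division into units of Definition 1.3.3 amounts to rounding A/M_2 to the nearest integer; a minimal sketch, where M_2 = 10 000 is the fictional value of Example 1.3.5, not the value used in the actual model:

```python
import math

# Sketch of Definition 1.3.3: number of units = floor(A / M2 + 1/2),
# i.e. A / M2 rounded to the nearest integer. M2 = 10 000 is the
# fictional value from Example 1.3.5.
def number_of_units(assets, m2=10_000):
    return math.floor(assets / m2 + 0.5)

print(number_of_units(60_000))  # 6
print(number_of_units(15_000))  # 2  (1.5 rounds up)
print(number_of_units(4_999))   # 0  (0.4999 rounds down)
```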

¹ For internal reasons, the exact number will not be mentioned.



Example 1.3.5. The following table supposes that there are k covariates, M_2 = 10 000 and T_2 = 12 months. The customer in this example is customer K_1 from Example 1.3.2, at time t_0.

X_1   X_2   ···   X_k   Y = 1_{E_2}   A_{t_0}    A_{t_0+T_2}
3     0.3   ···   20    1             60 000     40 000
3     0.3   ···   20    1
3     0.3   ···   20    0
3     0.3   ···   20    0
3     0.3   ···   20    0
3     0.3   ···   20    0

Table 1.2: Data of customer K_1 (from Example 1.3.2) for Model 2.

At t_0, the assets of K_1 were worth 60 000 euro. Therefore, we divided his assets into ⌊60 000/10 000 + 1/2⌋ = 6 units: each row corresponds to one unit. One year later, at t_0 + T_2, his assets were worth 40 000, a reduction of 20 000. In other words, two units had been removed. Hence, two rows get Y = 1 and the remaining rows get Y = 0.

This resembles a binomial distribution. Indeed, every unit has a 'probability of success' p(x), depending on the realizations x = (x_1, ..., x_k) of the covariates X_1, ..., X_k, and we have n = ⌊A/M_2 + 1/2⌋ i.i.d. experiments. Hence, with Model 2, we can give a prediction for the amount of money a customer will remove. Specifically, the logistic regression yields a probability p̂(x), i.e. the probability of removing a unit. Therefore, the predicted number of removed units is n · p̂(x), i.e. an amount of M_2 · n · p̂(x) euro, where n = ⌊A/M_2 + 1/2⌋ is the number of units (experiments).

Note that major customers have a greater impact than small customers, since major customers have more units and therefore more rows. This in itself is not bad at all, since major customers are more important for Van Lanschot Bankiers.
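Continuing Example 1.3.5, the binomial prediction n · p̂(x) can be sketched as follows, where p̂ = 1/3 is a made-up fitted probability, not an estimate from the thesis:

```python
import math

# Sketch of the Model 2 prediction: with n = floor(A/M2 + 1/2) units and an
# estimated per-unit removal probability p_hat(x), the expected number of
# removed units is n * p_hat, i.e. M2 * n * p_hat euro.
# p_hat = 1/3 is a made-up value for illustration.
A, M2 = 60_000, 10_000
n = math.floor(A / M2 + 0.5)           # 6 units
p_hat = 1 / 3
expected_units = n * p_hat             # about 2 units
expected_euro = M2 * expected_units    # about 20 000 euro
print(n, expected_units, expected_euro)
```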

1.4 Thesis overview

The rest of this thesis is organized as follows. In Chapter 2, we give mathematical background information about generalized linear models, especially logistic regression. Moreover, we look at the asymptotic behavior of the coefficients in a logistic regression model. Next, in Chapter 3, we introduce some tools for measuring the predictive power of explanatory variables and a useful transformation for these variables. Subsequently, in Chapter 4, we describe the methodology we use in this thesis, including a newly developed algorithm. Thereafter, we analyse the results of the models (described in Section 1.3) in Chapter 5. Finally, we give a conclusion and evaluate and compare the predictive power of the models in Chapter 6. Suggestions for future research are given in Chapter 7. Additional theorems can be found in Appendix A.



Chapter 2

Generalized linear models

The models described in this thesis are based on the theory of generalized linear models. Therefore, in this chapter we give some mathematical background on generalized linear models, in particular logistic regression. Moreover, a few theorems will be proved to motivate the advantages of generalized linear models. Finally, we discuss the asymptotics of the coefficients.

2.1 Linear Regression

Before looking at generalized linear models, we introduce the linear regression model, based on the theory of Gauss and Legendre [9]. Suppose we have a dataset {y_i, x_{i1}, ..., x_{ik}}_{i=1}^n consisting of n samples.

Definition 2.1.1 (Linear Regression). Assuming that the relationship between the dependent variable Y and the k-dimensional vector of explanatory variables X ∈ R^k is linear, a linear regression model takes the form

( y_1 )   ( 1  x_11  ···  x_1k ) ( β_0 )   ( ε_1 )
( y_2 ) = ( 1  x_21  ···  x_2k ) ( β_1 ) + ( ε_2 )
(  ⋮  )   ( ⋮    ⋮           ⋮ ) (  ⋮  )   (  ⋮  )
( y_n )   ( 1  x_n1  ···  x_nk ) ( β_k )   ( ε_n )

which will be written as y = xβ + ε, where

• y_i is the ith realization of the dependent variable Y;
• x^(i) := x_i^T = (1, x_{i1}, ..., x_{ik}) is the ith (k+1)-dimensional vector of realizations of the explanatory variables X_1, X_2, ..., X_k;
• β is an (unknown!) (k+1)-dimensional parameter vector;
• ε_i ∼ N(0, σ²) is a random error. The ε_i are i.i.d. with mean 0 and variance σ² > 0. Moreover, they are assumed to be independent of the explanatory variables X_1, X_2, ..., X_k.

We have added a unit column to the matrix, because usually a constant is included as one of the explanatory variables. The corresponding value β_0 is called the intercept. Note that the conditional mean

E[Y|X = x^(i)] = x^(i)β = β_0 + β_1 x_{i1} + ··· + β_k x_{ik}

is a function of the explanatory variables x_{i1}, ..., x_{ik} and the parameter β. Hence, if the parameter β can be estimated using the n samples, this model can predict the value of the random variable Y for any realization x = (x_1, x_2, ..., x_k) of the explanatory variables X_1, X_2, ..., X_k by E[Y|X = x].



2.1.1 Estimation of β

Generally, there are two methods to estimate the parameter β. The first one is Least Squares Estimation (LSE), the other one is Maximum Likelihood Estimation (MLE). LSE can be used regardless of the distribution function of the random error ε. MLE makes use of the probability density function (pdf) of Y, which depends on the distribution function of ε. Fortunately, it will be shown that these two methods give the same results under our assumptions on the distribution of ε.

Least Squares Estimation

This method estimates β such that the sum of the squares of the differences between y_i and the linear combination x^(i)β, the residual sum of squares, is minimal. In other words:

Definition 2.1.2 (LSE). Suppose we have a linear regression model y = xβ + ε. By means of Least Squares Estimation, β is chosen such that

RSS(β) = Σ_{i=1}^n (y_i − x^(i)β)²

is minimised over β.

Note that this method totally disregards (the distribution of) ε_i. This method looks for the best straight line through the data points {y_i, x_{i1}, ..., x_{ik}}_{i=1}^n; it does not treat Y as a random variable at all. In order to find a solution, we need to ensure that

∂RSS/∂β_j (β) = −2 Σ_{i=1}^n (y_i − x^(i)β) x_{ij} = 0   for j = 0, ..., k,

where x_{i0} := 1. Thus, the estimator β̂ is the solution of the following system of linear equations:

Σ_{i=1}^n y_i        = Σ_{i=1}^n (β_0 + β_1 x_{i1} + ··· + β_k x_{ik})
Σ_{i=1}^n y_i x_{i1} = Σ_{i=1}^n (β_0 + β_1 x_{i1} + ··· + β_k x_{ik}) x_{i1}
  ⋮
Σ_{i=1}^n y_i x_{ik} = Σ_{i=1}^n (β_0 + β_1 x_{i1} + ··· + β_k x_{ik}) x_{ik}.

Since this system contains k + 1 linear equations and k + 1 variables, there exists (at least) one solution βˆ = (βˆ0 , βˆ1 , · · · , βˆk ). We can rewrite this system of equations to x> y = x> xβ since ∂RSS > > > −1 > b x y. ∂β (β) = 2x xβ − 2x y. Hence, if x is of full rank, the unique solution is β = (x x) Next, calculating the second derivative yields: ∂ 2 RSS (β) = 2x> x. ∂β 2 Since this matrix is nonnegative definite, our solution is indeed a minimum. Maximum Likelihood Estimation Considering Y as a random variable, as is the case in our linear regression model, we can use MLE as well. Within the model it is assumed that the εi (for i ∈ {1, . . . , k}) are independent, of each other and of the explanatory variables X1 , . . . , Xn , and identically distributed as N (0, σ 2 )



for some $\sigma^2 > 0$. Thus, for fixed data points $\{x_{i1}, \dots, x_{ik}\}_{i=1}^n$, the $y_i$ are independent with distribution $N(x^{(i)}\beta, \sigma^2)$. The joint pdf of $y_1, \dots, y_n$ is¹
\[ f(y \mid \beta, x, \sigma^2) = \prod_{i=1}^n f(y_i \mid \beta, x^{(i)}, \sigma^2) = \Big(\frac{1}{\sqrt{2\pi\sigma^2}}\Big)^n \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n \big(y_i - x^{(i)}\beta\big)^2\Big). \]
The likelihood function is the same expression as this joint pdf, except that the roles of the arguments are switched: the likelihood function expresses the values of $\beta$ and $\sigma^2$ for fixed values $y, x$.

Definition 2.1.3 (MLE). Suppose we have a linear regression model $y = x\beta + \varepsilon$. By means of Maximum Likelihood Estimation, $\beta$ is chosen such that
\[ L(\beta, \sigma^2 \mid y, x) = \prod_{i=1}^n f(y_i \mid \beta, x^{(i)}, \sigma^2) \]
is maximised over $\beta$.

Since the log-function is strictly increasing, maximizing the log of the likelihood function (the log-likelihood function) gives the same solution $\hat\beta$ as maximizing the likelihood function. Since it is easier to work with, we will use the log-likelihood function:
\[ \log L(\beta, \sigma^2 \mid y, x) = \log\Big[\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}\big(y_i - x^{(i)}\beta\big)^2\Big)\Big] = -\frac{n}{2}\log(\sigma^2) - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^n \big(y_i - x^{(i)}\beta\big)^2. \]
Again, we have to ensure that the first derivatives are $0$, thus
\begin{align*}
\frac{\partial \log L(\beta, \sigma^2 \mid y, x)}{\partial \beta_0} &= \frac{1}{\sigma^2}\sum_{i=1}^n \big[y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\big] = 0 \\
\frac{\partial \log L(\beta, \sigma^2 \mid y, x)}{\partial \beta_1} &= \frac{1}{\sigma^2}\sum_{i=1}^n \big[y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\big]\,x_{i1} = 0 \\
&\;\;\vdots \\
\frac{\partial \log L(\beta, \sigma^2 \mid y, x)}{\partial \beta_k} &= \frac{1}{\sigma^2}\sum_{i=1}^n \big[y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\big]\,x_{ik} = 0 \\
\frac{\partial \log L(\beta, \sigma^2 \mid y, x)}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n \big(y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\big)^2 = 0.
\end{align*}
Now we have $k+2$ equations and $k+2$ variables, hence there exists (at least) one solution $(\hat\beta, \hat\sigma^2) = (\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k, \hat\sigma^2)$. Can we compare this solution with the previously mentioned solution of the LSE? The answer is yes: multiplying the first $k+1$ equations above by $\sigma^2$ leads to the same system of equations as for the LSE, with $k+1$ equations and $k+1$ variables $(\beta_0, \dots, \beta_k)$. Hence, it gives the same solution $\hat\beta$. The last equation above can then be used to find the solution for $\hat\sigma^2$. We have now proved the following (little) theorem:

¹$X \sim N(\mu, \sigma^2) \Rightarrow f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\big(-\frac{(x-\mu)^2}{2\sigma^2}\big)$.



Theorem 2.1.4. Let $\{y_i, x_{i1}, \dots, x_{ik}\}_{i=1}^n$ be a dataset and $y = x\beta + \varepsilon$ the corresponding linear regression model. The estimator $\hat\beta$ of $\beta$ can be found using LSE or MLE; both methods give the same solution. Specifically, $\hat\beta$ solves the system of linear equations $x^\top y = x^\top x\beta$, i.e.
\begin{align}
\sum_{i=1}^n y_i &= \sum_{i=1}^n (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}) \nonumber \\
\sum_{i=1}^n y_i x_{i1} &= \sum_{i=1}^n (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\,x_{i1} \tag{2.1} \\
&\;\;\vdots \nonumber \\
\sum_{i=1}^n y_i x_{ik} &= \sum_{i=1}^n (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\,x_{ik}. \nonumber
\end{align}

Remark. The advantage of LSE is that one does not have to specify the distribution of the $\varepsilon_i$. For MLE, it is crucial that the $\varepsilon_i$ are normally distributed.

2.1.2 Example

In this section we consider an example about the relationship between the number of complaints in one year ($X$) and the amount of money taken from the bank during that year ($Y$, in 1000 euros).

Example 2.1.5. Suppose that the relationship is linear and we want to predict the value of $Y$ for customer A, who had 6 complaints in one year, on the basis of the following (fictional) data points:

X   1     3     7
Y   0.5   4.0   8.5

The corresponding regression equations are $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$. According to the previous theorem, we need to solve the following system of linear equations in order to find the estimators of $\beta_0$ and $\beta_1$:
\[ \sum_{i=1}^3 y_i = \sum_{i=1}^3 (\beta_0 + \beta_1 x_i) \;\Leftrightarrow\; 13 = 3\beta_0 + 11\beta_1, \]
\[ \sum_{i=1}^3 y_i x_i = \sum_{i=1}^3 (\beta_0 + \beta_1 x_i)\,x_i \;\Leftrightarrow\; 72 = 11\beta_0 + 59\beta_1. \]
These equations lead to the solution $\hat\beta = (\hat\beta_0, \hat\beta_1) = (-\frac{25}{56}, \frac{73}{56})$. Hence, since $E[Y \mid X = 6] = \beta_0 + 6\beta_1 = \frac{59}{8}$, customer A is predicted to take 7,375 euros from the bank during the concerned year, according to this linear regression model.
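The calculation in Example 2.1.5 can be verified with a short script. The sketch below (assuming NumPy is available; the variable names are ours, not the thesis's) solves the normal equations $x^\top y = x^\top x\beta$ from Theorem 2.1.4 for the three data points and reproduces the prediction for customer A.

```python
import numpy as np

# Fictional data from Example 2.1.5: complaints (x) and withdrawals (y, in 1000 euros)
x_raw = np.array([1.0, 3.0, 7.0])
y = np.array([0.5, 4.0, 8.5])

# Design matrix with a unit column for the intercept
X = np.column_stack([np.ones_like(x_raw), x_raw])

# Solve the normal equations X^T X beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [-25/56, 73/56] ≈ [-0.4464, 1.3036]

# Prediction for customer A, who had 6 complaints
print(beta[0] + 6 * beta[1])  # 59/8 = 7.375, i.e. 7,375 euros
```

The same solution is obtained from the closed form $(x^\top x)^{-1} x^\top y$ of Section 2.1.1; `np.linalg.solve` is simply the numerically preferable way to evaluate it.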

2.2 Generalization

A linear regression model assumes that a constant change in an explanatory variable leads to a constant change in the dependent variable. Moreover, the dependent variable has a normal distribution, i.e. it can vary in either direction with equal probability. Of course, many dependent variables do not fulfill these requirements. For example, in many cases this variable must be positive, or it can take only finitely many values (in particular $\{0, 1\}$), so that the dependent variable


is bounded on both sides. Therefore, the concept of generalized linear models was introduced by Nelder and Wedderburn in 1972 [10]. This type of model allows the dependent variable to have an arbitrary distribution from the exponential family, and a non-trivial relationship between the linear predictor and $E[Y \mid X = x]$ through a so-called link function. Summarizing:

Definition 2.2.1 (Generalized linear model). A generalized linear model consists of:
• a probability distribution function from the exponential family;
• a parameter vector $\beta$ for the linear predictor $x^\top\beta$;
• a monotone and continuous link function $g$, such that $E[Y \mid X = x] = g^{-1}(x^\top\beta)$.

As the distribution of the dependent variable is known, maximum likelihood estimation will be used in order to find the best $\hat\beta$. A few probability distributions from the exponential family, with their corresponding link functions, are given in the following table, where $\mu(x) := E[Y \mid X = x]$:

Distribution   Range                  Link name   Link function                Mean
Normal         $(-\infty, +\infty)$   Identity    $g(x) = x$                   $\mu(x) = x^\top\beta$
Poisson        $\mathbb{N}$           Log         $g(x) = \log(x)$             $\mu(x) = \exp(x^\top\beta)$
Bernoulli      $\{0, 1\}$             Logit       $g(x) = \log\frac{x}{1-x}$   $\mu(x) = \frac{\exp(x^\top\beta)}{1+\exp(x^\top\beta)}$

Table 2.1: Characteristics of some distribution functions of the exponential family.

As can be seen, linear regression is a special case of a generalized linear model. Since the models described in this thesis are based on logistic regression, logistic regression will be treated separately in the next section.
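The rows of Table 2.1 can be illustrated with a minimal sketch (plain Python; the function names are ours, not standard terminology). Each inverse link $g^{-1}$ maps a linear predictor $\eta = x^\top\beta$ onto the range of the corresponding mean $\mu(x)$.

```python
import math

# Inverse links g^{-1} from Table 2.1, mapping eta = x^T beta to mu(x) = E[Y | X = x]
def identity_inv(eta):   # Normal: mean anywhere on (-inf, +inf)
    return eta

def log_inv(eta):        # Poisson: mean on (0, +inf)
    return math.exp(eta)

def logit_inv(eta):      # Bernoulli: mean (a probability) in (0, 1)
    return math.exp(eta) / (1.0 + math.exp(eta))

eta = 0.7
print(identity_inv(eta), log_inv(eta), logit_inv(eta))
```

Note how only the logit inverse clamps the mean into $(0, 1)$, which is exactly why the Bernoulli row is the relevant one for the models in this thesis.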

2.3 Logistic Regression

Logistic regression is a special form of a generalized linear model.

Definition 2.3.1 (Logistic Regression). A logistic regression model is a GLM with a Bernoulli distribution and link function $g(x) = \log\frac{x}{1-x}$.

The link function of a Bernoulli variable does not come out of thin air. Note that the expected value of a Bernoulli variable $Y$ is $\mu = P[Y = 1]$. Thus, the linear predictor is postulated to be equal to the log odds $\log\frac{P[Y=1]}{P[Y=0]}$ of $Y$. Hence, we can predict $P[Y = 1 \mid X = x]$ by $\frac{\exp(x^\top\beta)}{1+\exp(x^\top\beta)}$. Note that we indeed have $P[Y = 1 \mid X = x] = \frac{\exp(x^\top\beta)}{1+\exp(x^\top\beta)} \in [0, 1]$.

Estimation of β with MLE

Since the distribution of our dependent variable $Y$ is known, MLE is used to find the best fitting $\beta$. Let $p(x, \beta) := P[Y = 1 \mid X = x] = \frac{\exp(x^\top\beta)}{1+\exp(x^\top\beta)} = \frac{1}{1+\exp(-x^\top\beta)}$; then we can write the density of $Y$ in the following convenient way:
\[ f(y \mid \beta, x) = p(x, \beta)^y \big(1 - p(x, \beta)\big)^{1-y} \quad \text{for } y \in \{0, 1\}. \]


Suppose we have $n$ data points; then the joint pdf of $y_1, y_2, \dots, y_n$ is
\[ f(y \mid \beta, x) = \prod_{i=1}^n f(y_i \mid \beta, x^{(i)}) = \prod_{i=1}^n p(x^{(i)}, \beta)^{y_i}\big(1 - p(x^{(i)}, \beta)\big)^{1-y_i}. \]
As mentioned in the previous section, the likelihood function is the same expression as this joint pdf with the roles of the arguments switched, and we will maximise the log-likelihood function since it is easier to work with:
\[ \log L(\beta \mid y, x) = \log\Big[\prod_{i=1}^n p(x^{(i)}, \beta)^{y_i}\big(1 - p(x^{(i)}, \beta)\big)^{1-y_i}\Big] = \sum_{i=1}^n \big[y_i \log p(x^{(i)}, \beta) + (1 - y_i)\log\big(1 - p(x^{(i)}, \beta)\big)\big]. \]
Now, since
\[ \frac{\partial p(x, \beta)}{\partial \beta_k} = \frac{x_k \exp(-x^\top\beta)}{\big(1 + \exp(-x^\top\beta)\big)^2} = x_k\,\frac{\exp(-x^\top\beta)}{1 + \exp(-x^\top\beta)}\cdot\frac{1}{1 + \exp(-x^\top\beta)} = x_k\, p(x, \beta)\big(1 - p(x, \beta)\big), \]
we can differentiate the log-likelihood function with respect to $\beta_k$:
\begin{align*}
\frac{\partial \log L(\beta \mid y, x)}{\partial \beta_k} &= \sum_{i=1}^n \Big[\frac{y_i}{p(x^{(i)}, \beta)}\frac{\partial p(x^{(i)}, \beta)}{\partial \beta_k} - \frac{1 - y_i}{1 - p(x^{(i)}, \beta)}\frac{\partial p(x^{(i)}, \beta)}{\partial \beta_k}\Big] \\
&= \sum_{i=1}^n \big[y_i x_{ik}\big(1 - p(x^{(i)}, \beta)\big) - (1 - y_i)\,x_{ik}\, p(x^{(i)}, \beta)\big] \\
&= \sum_{i=1}^n x_{ik}\big(y_i - p(x^{(i)}, \beta)\big).
\end{align*}
Equating to zero leads to the system of $k+1$ equations in $k+1$ variables $x^\top y = x^\top p(\beta)$, where $p(\beta) = \big(p(x^{(1)}, \beta), \dots, p(x^{(n)}, \beta)\big)^\top$. In other words,
\begin{align}
\sum_{i=1}^n y_i &= \sum_{i=1}^n p(x^{(i)}, \beta) \nonumber \\
\sum_{i=1}^n y_i x_{i1} &= \sum_{i=1}^n p(x^{(i)}, \beta)\,x_{i1} \tag{2.2} \\
&\;\;\vdots \nonumber \\
\sum_{i=1}^n y_i x_{ik} &= \sum_{i=1}^n p(x^{(i)}, \beta)\,x_{ik}. \nonumber
\end{align}
However, in contrast to the linear regression model, these equations are nonlinear. Hence, the solution cannot be derived analytically; it must be approximated numerically using an iterative process. The most popular and most common method is Newton's method [7].
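The Newton iteration for the score equations (2.2) can be sketched as follows (Python with NumPy assumed; this is our own minimal implementation, not the bank's production code). Each step solves a linear system with the second-derivative matrix of the log-likelihood.

```python
import numpy as np

def logistic_mle_newton(x, y, tol=1e-8, max_iter=100):
    """Newton's method for the score equations x^T y = x^T p(beta).

    x: (n, k+1) design matrix including the unit column; y: (n,) array of 0/1 labels.
    """
    beta = np.zeros(x.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-x @ beta))      # p(x^{(i)}, beta)
        score = x.T @ (y - p)                    # gradient of the log-likelihood
        hessian = -(x.T * (p * (1.0 - p))) @ x   # second-derivative matrix
        step = np.linalg.solve(hessian, score)
        beta = beta - step                       # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Intercept-only sanity check: the MLE of beta_0 is the empirical log odds
y = np.array([1.0, 0.0, 0.0, 0.0])
x = np.ones((4, 1))
print(logistic_mle_newton(x, y))  # ≈ log((1/4)/(3/4)) = log(1/3)
```

The intercept-only check anticipates the proof of Theorem 3.2.4, where solving the score equation within one bucket gives exactly the empirical default fraction.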



2.4 Asymptotic behavior

In this section, we assume that the covariates $X = (X_1, \dots, X_k)$, $X : \Omega \to \mathbb{R}^k$, have probability distribution function $\mu(x)$ with expectation $\mu = (\mu_1, \dots, \mu_k)$. Furthermore, $\beta_0 \in \Theta$ is assumed to be the true parameter value from the parameter set $\Theta \subset \mathbb{R}^m$. The results of the theorems in this section are useful later on in this thesis, in particular for Section 3.4, where we discuss several tests for significance. The theory in this section is derived from A.W. van der Vaart [15]; some modifications are made for the purposes of this thesis.

Definition 2.4.1 (Fisher Information). Suppose we have a probability density function $f(y \mid \beta, x)$ on $\{0,1\} \times \Theta \times \Omega$ of a variable $Y$, such that $f$ is a continuously differentiable function of $\beta$ for all $(y, x) \in \{0,1\} \times \Omega$ and a measurable function of $(y, x)$ for all $\beta \in \Theta$. Moreover, assume we have $m$ coefficients $\beta = (\beta_1, \dots, \beta_m)$ and $k$ covariates $(X_1, \dots, X_k)$ with probability density function $\mu$. Let $l(y; \beta, x) := \log f(y \mid \beta, x)$ and
\[ l'(y; \beta, x) := \nabla \log f(y \mid \beta, x) = \Big(\frac{\partial}{\partial \beta_1}\log f(y \mid \beta, x), \dots, \frac{\partial}{\partial \beta_m}\log f(y \mid \beta, x)\Big) \]
its gradient with respect to $\beta$. The Fisher information matrix is defined as the $m \times m$ matrix
\[ I_\beta = E_{\beta_0, \mu}\big[l'(Y; \beta, X)^\top \cdot l'(Y; \beta, X)\big]. \]
Hence, the entries $(i, j)$, for $1 \le i, j \le m$, are
\[ (I_\beta)_{i,j} = E_{\beta_0, \mu}\Big[\frac{\partial}{\partial \beta_i} l(Y; \beta, X)\,\frac{\partial}{\partial \beta_j} l(Y; \beta, X)\Big]. \]

Lemma 2.4.2. Let $f$ be the function from Definition 2.4.1, and assume in addition that $f$ is twice continuously differentiable as a function of $\beta$. Let
\[ l''(y; \beta, x) := D^2 \log f(y \mid \beta, x) \quad \text{(with respect to } \beta\text{)}. \]
An alternative formula for the Fisher information matrix at $\beta_0$, under the same assumptions as above, is
\[ I_{\beta_0} = -E_{\beta_0, \mu}\big[l''(Y; \beta_0, X)\big]. \]

Proof. Fix $(i, j)$ with $1 \le i, j \le m$. The $(i, j)$th entry of the right-hand side equals $-E_{\beta_0, \mu}\big[\frac{\partial}{\partial \beta_i}\frac{\partial}{\partial \beta_j} l(Y; \beta_0, X)\big]$. Now,
\[ \frac{\partial}{\partial \beta_i}\frac{\partial}{\partial \beta_j} l(y; \beta, x) = \frac{\partial}{\partial \beta_i}\,\frac{\frac{\partial}{\partial \beta_j} f(y \mid \beta, x)}{f(y \mid \beta, x)} = \frac{\frac{\partial}{\partial \beta_i}\frac{\partial}{\partial \beta_j} f(y \mid \beta, x)}{f(y \mid \beta, x)} - \frac{\frac{\partial}{\partial \beta_i} f(y \mid \beta, x)\,\frac{\partial}{\partial \beta_j} f(y \mid \beta, x)}{f^2(y \mid \beta, x)}. \]
Since $f$ is a probability density function, $\iint f(y \mid \beta, x)\,dy\,\mu(dx) = 1$. By continuous differentiability of $f$ we can interchange derivative and integral, hence $\iint \frac{\partial}{\partial \beta_i} f(y \mid \beta, x)\,dy\,\mu(dx) = 0$ and $\iint \frac{\partial}{\partial \beta_i}\frac{\partial}{\partial \beta_j} f(y \mid \beta, x)\,dy\,\mu(dx) = 0$.



Whence,
\begin{align*}
E_{\beta_0, \mu}\Big[\frac{\partial}{\partial \beta_i}\frac{\partial}{\partial \beta_j} l(Y; \beta_0, X)\Big] &= \iint \frac{\frac{\partial}{\partial \beta_i}\frac{\partial}{\partial \beta_j} f(y \mid \beta_0, x)}{f(y \mid \beta_0, x)}\, f(y \mid \beta_0, x)\,dy\,\mu(dx) - \iint \frac{\frac{\partial}{\partial \beta_i} f(y \mid \beta_0, x)\,\frac{\partial}{\partial \beta_j} f(y \mid \beta_0, x)}{f^2(y \mid \beta_0, x)}\, f(y \mid \beta_0, x)\,dy\,\mu(dx) \\
&= 0 - E_{\beta_0, \mu}\Big[\frac{\partial}{\partial \beta_i} l(Y; \beta_0, X)\,\frac{\partial}{\partial \beta_j} l(Y; \beta_0, X)\Big] \\
&= -\big(I_{\beta_0}\big)_{i,j}.
\end{align*}

Lemma 2.4.3. Let $f$ be the function from Definition 2.4.1 and let $L(\beta) := E_{\beta_0, \mu}\big[l(Y; \beta, X)\big]$. Then $L(\beta) \le L(\beta_0)$, with equality if and only if $f(y \mid \beta_0, x) = f(y \mid \beta, x)$, $P_{\beta_0} \times P_\mu$-a.s.

Proof. Consider the difference
\[ L(\beta) - L(\beta_0) = E_{\beta_0, \mu}\big[l(Y; \beta, X) - l(Y; \beta_0, X)\big] = E_{\beta_0, \mu}\Big[\log\frac{f(Y \mid \beta, X)}{f(Y \mid \beta_0, X)}\Big]. \]
Since $\log(x) \le x - 1$ for positive $x$,
\[ E_{\beta_0, \mu}\Big[\log\frac{f(Y \mid \beta, X)}{f(Y \mid \beta_0, X)}\Big] \le E_{\beta_0, \mu}\Big[\frac{f(Y \mid \beta, X)}{f(Y \mid \beta_0, X)} - 1\Big] = \iint \Big(\frac{f(y \mid \beta, x)}{f(y \mid \beta_0, x)} - 1\Big) f(y \mid \beta_0, x)\,dy\,\mu(dx) = 1 - 1 = 0. \]

Let $L_n(\beta) := \sum_{i=1}^n l(Y_i; \beta, X^{(i)})$. Since $\hat\beta_n$ maximises $L_n(\beta)$, $\beta_0$ maximises $L(\beta)$ and $\frac{1}{n}L_n(\beta) \xrightarrow{p} L(\beta)$ by the Law of Large Numbers, it would be reasonable to think that $\hat\beta_n \xrightarrow{p} \beta_0$. Here, $\xrightarrow{p}$ means convergence in probability under $\beta_0$. However, we need some reasonable regularity conditions to ensure consistency:

(1) $\beta \in \Theta \subset \mathbb{R}^m$, where $\Theta$ is a compact space.
(2) $|l(y; \beta, x)| \le h_1(x)$ for all $(y, \beta, x) \in \{0,1\} \times \Theta \times \Omega$, with $E[|h_1(X)|] < \infty$.
(3) For all $\varepsilon > 0$: $\sup_{\beta \in \Theta : \|\beta - \beta_0\| > \varepsilon} L(\beta) < L(\beta_0)$.
(4) $|l''(y; \beta, x)| \le h_2(x)$ for all $(y, \beta, x) \in \{0,1\} \times \Theta \times \Omega$, with $E[|h_2(X)|] < \infty$.

The norm $\|\cdot\|$ is the Euclidean norm. Under conditions 1 and 2 (respectively 1 and 4) we can apply the Uniform Law of Large Numbers (Theorem A.5) to $l(y; \beta, x)$ (respectively $l''(y; \beta, x)$). Condition 3 is a stronger assumption than the result of Lemma 2.4.3: it requires that $L$ attains its maximum at the unique point $\beta_0$, and that only coefficients close to $\beta_0$ may yield a value close to the maximum value $L(\beta_0)$. The additional theorems used in the proofs of the following theorems can be found in Appendix A.


Theorem 2.4.4. Let $f$ be the function from Definition 2.4.1. Under the regularity conditions (1)–(3), the MLE is consistent. In other words, $\hat\beta_n \xrightarrow{p} \beta_0$.

Proof. By definition of $\hat\beta_n$, we have $L_n(\hat\beta_n) \ge L_n(\beta_0)$. By the Law of Large Numbers (A.2), we have $\frac{1}{n}L_n(\beta_0) \xrightarrow{p} L(\beta_0)$, whence² $L(\beta_0) = \frac{1}{n}L_n(\beta_0) + o_p(1)$. Therefore, using the Uniform Law of Large Numbers (A.5),
\[ L(\beta_0) - L(\hat\beta_n) \le \frac{1}{n}L_n(\hat\beta_n) - L(\hat\beta_n) + o_p(1) \le \sup_{\beta \in \Theta}\Big|\frac{1}{n}L_n(\beta) - L(\beta)\Big| + o_p(1) \xrightarrow{p} 0. \]
Fix $\varepsilon > 0$. By condition 3, there exists $\delta > 0$ such that $\|\beta - \beta_0\| > \varepsilon$ implies $L(\beta) < L(\beta_0) - \delta$. Hence,
\[ P\big[\|\beta_0 - \hat\beta_n\| > \varepsilon\big] \le P\big[L(\beta_0) - L(\hat\beta_n) > \delta\big] \to 0. \]

Denote $L'_n(\beta) := \sum_{i=1}^n l'(Y_i; \beta, X^{(i)})$ and likewise $L''_n(\beta) := \sum_{i=1}^n l''(Y_i; \beta, X^{(i)})$. Using the results of this section, we derive the following theorem about the asymptotics of $\hat\beta_n$:

Theorem 2.4.5. Let $f$ be the function from Lemma 2.4.2 and assume nonsingularity of $I_{\beta_0}$. Under the regularity conditions (1)–(4), the sequence $\sqrt{n}(\hat\beta_n - \beta_0)$ is asymptotically normal with mean $0$ and variance $I_{\beta_0}^{-1}$.

Proof. The Taylor series for $L'_n(\beta)$ at $\beta_0$ yields $L'_n(\beta) = L'_n(\beta_0) + (\beta - \beta_0)^\top L''_n(\xi)$ for a $\xi$ such that $\|\xi - \beta_0\| \le \|\beta - \beta_0\|$. Fix $n \in \mathbb{N}$. Since $\hat\beta_n$ is the maximizer of $L_n(\beta)$, we have $L'_n(\hat\beta_n) = 0$. These facts and rearranging the terms yield $\hat\beta_n - \beta_0 = -\frac{1}{n}L'_n(\beta_0)\big(\frac{1}{n}L''_n(\xi_n)\big)^{-1}$, hence $\sqrt{n}(\hat\beta_n - \beta_0) = -\sqrt{n}\,\frac{1}{n}L'_n(\beta_0)\big(\frac{1}{n}L''_n(\xi_n)\big)^{-1}$, for a $\xi_n$ such that $\|\xi_n - \beta_0\| \le \|\hat\beta_n - \beta_0\|$.

We consider the two factors separately, the numerator first. From Lemma 2.4.3 we know that $\beta_0$ is the maximizer of $L(\beta)$, hence $L'(\beta_0) = 0$. Therefore, using the Central Limit Theorem (A.3),
\[ \sqrt{n}\,\frac{1}{n}L'_n(\beta_0) = \sqrt{n}\Big(\frac{1}{n}L'_n(\beta_0) - L'(\beta_0)\Big) \rightsquigarrow N\big(0, \mathrm{Var}\big[l'(Y; \beta_0, X)\big]\big), \]
where $\mathrm{Var}\big[l'(Y; \beta_0, X)\big] = E_{\beta_0, \mu}\big[l'(Y; \beta_0, X)^\top l'(Y; \beta_0, X)\big] - L'(\beta_0)^2 = I_{\beta_0} - 0 = I_{\beta_0}$.

For the other factor, we use that (eventually) $\xi_n$ will be close to $\beta_0$, by consistency of $\hat\beta_n$ (Theorem 2.4.4). Therefore, we can apply the Uniform Law of Large Numbers, using conditions 1 and 4: $\frac{1}{n}L''_n(\xi_n) \xrightarrow{p} L''(\beta_0)$. Note that $L''(\beta_0) = E_{\beta_0, \mu}\big[l''(Y; \beta_0, X)\big] = -I_{\beta_0}$ by Lemma 2.4.2.

Finally, we use Slutsky's Lemma (A.4) to end the proof. Indeed,
\[ \sqrt{n}\,\frac{1}{n}L'_n(\beta_0) \rightsquigarrow N(0, I_{\beta_0}) \quad \text{and} \quad \frac{1}{n}L''_n(\xi_n) \xrightarrow{p} -I_{\beta_0}, \]
hence
\[ \sqrt{n}(\hat\beta_n - \beta_0) = -\sqrt{n}\,\frac{1}{n}L'_n(\beta_0)\Big(\frac{1}{n}L''_n(\xi_n)\Big)^{-1} \rightsquigarrow I_{\beta_0}^{-1}\,N(0, I_{\beta_0}) = N\big(0, I_{\beta_0}^{-1}\big). \]

²Recall that $X_n = o_p(R_n)$ means $X_n = Y_n R_n$ with $Y_n \xrightarrow{p} 0$ [15].


Chapter 3

Predictive powers

In order to judge whether a model performs well, a few tools will be introduced in this chapter. The Weight of Evidence measures the ability to distinguish two events from each other. Next, some Gini coefficients are derived from the original Gini coefficient postulated by Corrado Gini [4]. Finally, we introduce two chi-square tests. All the tools will be illustrated using the following fictional example.

3.1 Example

Example 3.1.1. The following data consist of 370 fictional customers with money at bank X at the beginning of year T. At the end of the year we examined whether a customer had removed more than 50% of his assets from bank X. If he/she did, he/she is marked as default; otherwise he/she is given the mark non-default. We made a distinction between the types of jobs, observed at the beginning of year T. In the last column, the percentage of defaults with respect to the total number of defaults is given.

Index  Name       Observations  Defaults  Non-defaults  % Defaults
1      Full-time  140           4         136           8.70 %
2      Part-time  100           10        90            21.74 %
3      Jobless    50            19        31            41.30 %
4      Retired    80            13        67            28.26 %
       Total:     370           46        324           100.00 %

Table 3.1: Fictional data of 'default' and 'non-default' customers at a bank.

As can be seen in the table above, 'Jobless' appears to be the most risky category, while full-timers have the lowest risk. In order of decreasing risk:
1. Jobless
2. Retired
3. Part-time
4. Full-time
However, there is no monotonic relationship between 'work' and the risk of going into default. This is a crucial point for logistic regression, since it assumes a form of monotonicity: increasing an explanatory variable $X_i$ leads to an increase in $p(x, \beta) = P[Y = 1 \mid X = x] = \frac{\exp(x^\top\beta)}{1+\exp(x^\top\beta)}$.


The WoE methodology solves this problem for categorical explanatory variables. It has other features as well; these will be discussed in the next section.

3.2 Weight of Evidence

The Weight of Evidence (WoE) is a Bayesian method to indicate how frequently a certain characteristic occurs given an event $A$, relative to an event $B$ [1]. Recall that we will use logistic regression; thus our dependent variable $Y$ has a discrete probability distribution with
\[ p(x, \beta) = P[Y = 1 \mid X = x] = 1 - P[Y = 0 \mid X = x] = \frac{1}{1 + \exp(-x^\top\beta)}. \]

Definition 3.2.1 (Weight of Evidence). Given a characteristic $X$ and two events $A$ and $B$, the Weight of Evidence is an estimator of $\log\frac{P[X \mid A]}{P[X \mid B]}$, defined as
\[ \mathrm{WoE} = \log\frac{|X \cap A|/|A|}{|X \cap B|/|B|} = \log\frac{|X \cap A|}{|A|} - \log\frac{|X \cap B|}{|B|}. \]

A higher WoE implies a better ratio between the events $A$ and $B$, relative to the characteristic $X$. In this thesis, the event $A$ will be stated as default and the event $B = A^c$ as non-default. In particular, $A$ will be the event $E_1$ or $E_2$ from Definition 1.3.1 respectively Definition 1.3.4, and $B$ the event $E_1^c$ or $E_2^c$. The characteristic $X$ corresponds to a so-called bucket:

Definition 3.2.2 (Bucket). Fix an arbitrary explanatory variable $X : \Omega \to R$, where $R$ is the range of $X$. We say that $X$ is divided into $m$ buckets if there exists a finite partition $\{R_i\}_{i=1}^m$ of $R$. The subset $R_i$ is referred to as bucket $i$. Hence,
• a bucket of a discrete variable consists of a set of numbers;
• a bucket of a categorical variable consists of a category, or a set of categories;
• a bucket of a continuous variable consists of an interval, or a set of intervals.

The division of every explanatory variable into buckets is performed using a newly developed algorithm, defined in Section 4.2. Recall that under these assumptions, the WoE is an estimator of
\[ \log\frac{P[\text{bucket } i \mid \text{default}]}{P[\text{bucket } i \mid \text{non-default}]}. \]
Within these specific events and characteristics, the WoE is closely related to the log-odds ratio, since
\[ \frac{P[\text{bucket } i \mid \text{default}]}{P[\text{bucket } i \mid \text{non-default}]} = \frac{P[\text{default} \mid \text{bucket } i]}{P[\text{non-default} \mid \text{bucket } i]} \times \frac{P[\text{non-default}]}{P[\text{default}]}, \]
using Bayes' rule: $P[A \mid B] = \frac{P[B \mid A]\,P[A]}{P[B]}$.



In the terminology of $p(x, \beta) = P[Y = 1 \mid X = x]$, we define the probability of default as follows:

Definition 3.2.3. The probability of default, given bucket $j$, is defined as $P[\text{default} \mid \text{bucket } j] := P[Y = 1 \mid X = e_j] = p(e_j, \beta)$, where $e_j$ is the unit vector $(0, \cdots, 1, \cdots, 0)$, with the $1$ in the $j$th place.

Next, we have the following theorem:

Theorem 3.2.4. Let $A$, $B$ and $R_j$ be defined as above, and let $\widehat{\mathrm{LOR}}_j$ be the MLE of the log-odds ratio of bucket $j$. Then there exists $C \in \mathbb{R}$ such that
\[ \mathrm{WoE}_j = \widehat{\mathrm{LOR}}_j + C, \]
where $\mathrm{WoE}_j$ is the Weight of Evidence of the $j$th bucket.

Proof. Firstly, we derive the maximum likelihood estimator of $\mathrm{LOR}_j$. Recall that
\[ \mathrm{LOR}_j = \log\frac{P[\text{default} \mid \text{bucket } j]}{P[\text{non-default} \mid \text{bucket } j]} = \log\frac{p(e_j, \beta)}{1 - p(e_j, \beta)}. \]
Maximizing the likelihood over $\mathrm{LOR}_j$ is equivalent to maximizing it over $p(e_j, \beta)$ (left to the reader). We will find this maximizer by finding the MLE of $\beta$, i.e. solving $\sum_{i=1}^{n_j}\frac{\partial}{\partial \beta}\log f(y_i, \beta, x^{(i)}) = 0$. (We condition on 'bucket $j$', so $x^{(i)} = e_j$ for all $1 \le i \le n_j$, where $n_j$ is the size of bucket $j$.) Recall that $f(y_i, \beta, x^{(i)}) = p(e_j, \beta)^{y_i}\big(1 - p(e_j, \beta)\big)^{1-y_i}$ and $p(e_j, \beta) = \frac{1}{1 + \exp(-\beta_j)}$. Hence, $p(e_j, \beta)$ only depends on $\beta_j$: $p(e_j, \beta) = p(e_j, \beta_j)$. Moreover, $\frac{\partial}{\partial \beta_j}p(e_j, \beta_j) = \frac{\exp(-\beta_j)}{(1 + \exp(-\beta_j))^2} = p(e_j, \beta_j)\big(1 - p(e_j, \beta_j)\big)$. Therefore,
\[ \frac{\partial}{\partial \beta_j}\log f(y_i, \beta_j, x^{(i)}) = \frac{\partial}{\partial \beta_j}\big[y_i \log p(e_j, \beta) + (1 - y_i)\log\big(1 - p(e_j, \beta)\big)\big] = y_i\big(1 - p(e_j, \beta)\big) - (1 - y_i)\,p(e_j, \beta) = y_i - p(e_j, \beta). \]
Hence,
\[ \sum_{i=1}^{n_j}\frac{\partial}{\partial \beta}\log f(y_i, \beta, x^{(i)}) = 0 \;\Leftrightarrow\; \sum_{i=1}^{n_j}\big(y_i - p(e_j, \beta)\big) = 0 \;\Leftrightarrow\; p(e_j, \beta) = \frac{\sum_{i=1}^{n_j} y_i}{n_j} \;\Leftrightarrow\; p(e_j, \beta) = \frac{|R_j \cap A|}{|R_j|}. \]
Thus
\[ \widehat{\mathrm{LOR}}_j = \log\Big(\frac{|R_j \cap A|/|R_j|}{1 - |R_j \cap A|/|R_j|}\Big) = \log\frac{|R_j \cap A|}{|R_j| - |R_j \cap A|} = \log\frac{|R_j \cap A|}{|R_j \cap B|}. \]


Finally,
\[ \mathrm{WoE}_j - \widehat{\mathrm{LOR}}_j = \log\frac{|R_j \cap A|/|A|}{|R_j \cap B|/|B|} - \log\frac{|R_j \cap A|}{|R_j \cap B|} = \log\frac{|R_j \cap A|}{|R_j \cap B|} + \log\frac{|B|}{|A|} - \log\frac{|R_j \cap A|}{|R_j \cap B|} = \log\frac{|B|}{|A|}. \]
This completes the proof, since $\log\frac{|B|}{|A|}$ is indeed independent of bucket $j$. We can conclude that $\mathrm{WoE}_j = \widehat{\mathrm{LOR}}_j + C$, for $C = \log\frac{|B|}{|A|}$.

Thus, to put the buckets in order of riskiness, it does not make a difference whether we use WoE or LOR, because they only differ by a constant. However, in line with the philosophy of Van Lanschot Bankiers, the WoE methodology will be used. The LOR explains the ratio of defaults to non-defaults within one specific bucket; the WoE of a bucket, on the other hand, will be high if the bucket contains a large proportion of all the defaults. The LOR will be positive if there are more defaults than non-defaults in the associated bucket; a positive WoE occurs when the proportion of all defaults within the bucket is higher than the proportion of all non-defaults. This last statement is more in line with our view on risk: a positive WoE implies a 'risky' bucket.

In the following table both the WoE and the LOR are calculated for each category. For instance, the $\widehat{\mathrm{LOR}}$ of category 'Full-time', $\widehat{\mathrm{LOR}}_1$, is $\log\frac{4}{136} = -3.526$ and the related $\mathrm{WoE}_1$ is $\log\frac{4/46}{136/324} = -1.574$.

Index  Name       % Defaults  $\mathrm{WoE}_i$  $\widehat{\mathrm{LOR}}_i$
1      Full-time  8.696 %     -1.574            -3.526
2      Part-time  21.739 %    -0.245            -2.197
3      Jobless    41.304 %    1.463             -0.490
4      Retired    28.261 %    0.312             -1.640

Table 3.2: WoE and LOR added to each category of Table 3.1.

The constant $C$ is equal to $C = \mathrm{WoE}_i - \widehat{\mathrm{LOR}}_i = \log\frac{324}{46} = 1.952$. Furthermore, we see that the categories 'Jobless' and 'Retired' are labelled as 'risky'. Putting the categories in decreasing order of WoE, or equivalently in decreasing order of LOR, yields:
1. Jobless
2. Retired
3. Part-time
4. Full-time
This is exactly the same sequence as in Section 3.1; thus sorting by WoE, or LOR, yields the preferred order of buckets: a high WoE corresponds to a relatively high risk. Moreover, we


can use these transformed variables for logistic regression, since they imply monotonicity: increasing the transformed variable (i.e. switching to a riskier bucket) leads to an increase in $p(x, \beta) = P[Y = 1 \mid X = x]$, hence an increased probability of default. This is consistent with our intuition. Hence, supported by the above arguments, we will transform all explanatory variables through the WoE methodology.
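The WoE and LOR values of Table 3.2 can be reproduced with a short script (plain Python; the data layout is ours). Each bucket's WoE is the log of its share of all defaults divided by its share of all non-defaults, and the constant $C = \log(|B|/|A|)$ from Theorem 3.2.4 is the same for every bucket.

```python
import math

# Buckets from Table 3.1: (name, defaults, non-defaults)
buckets = [("Full-time", 4, 136), ("Part-time", 10, 90),
           ("Jobless", 19, 31), ("Retired", 13, 67)]

D = sum(d for _, d, _ in buckets)   # 46 defaults in total
N = sum(b for _, _, b in buckets)   # 324 non-defaults in total

for name, d, b in buckets:
    woe = math.log((d / D) / (b / N))   # Weight of Evidence of the bucket
    lor = math.log(d / b)               # estimated log-odds ratio of the bucket
    print(f"{name:9s}  WoE = {woe:+.3f}  LOR = {lor:+.3f}  C = {woe - lor:.3f}")
```

The printed constant is log(324/46) ≈ 1.952 in every row, so sorting the buckets by WoE or by LOR gives the same order, exactly as the theorem states.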

3.3 Gini coefficient

To evaluate and compare every explanatory variable, we introduce a few coefficients, derived from the Gini coefficient, in order to give those variables a rating between 0 and 1.

3.3.1 Background

The Italian statistician Corrado Gini published his Gini coefficient in 1912 [4], measuring the inequality in income within a society. In order to calculate the Gini coefficient, the cumulative percentage of income, $Y$, is plotted against the cumulative percentage of the population, $X$, ordered from rich to poor; the resulting curve is the Lorenz curve. An example of a Lorenz curve is given in Figure 3.1. The surfaces $A$ and $B$ are used for the calculation of $G$.

Figure 3.1: An example of a Lorenz curve.²

Definition 3.3.1 (Gini coefficient). Given a Lorenz curve and the corresponding surfaces $A$ and $B$, the Gini coefficient is calculated as
\[ G = \frac{A}{A+B} \in [0, 1]. \]

$G$ is well defined since $A + B > 0$. In fact, $A + B = \frac{1}{2}$, since it is half of a unit square. Hence, $G = \frac{A}{A+B} = 2A = 1 - 2B$. Let us consider the edge cases:
• $G \approx 0$ if $A \approx 0$. Then the Lorenz curve is (close to) a straight line, i.e. the income is distributed proportionally over the population. Thus, there is no inequality.
• $G \approx 1$ occurs if $B \approx 0$, i.e. almost all income is earned by the rich. This yields the largest possible inequality.

Based on this coefficient, we will construct two look-alikes of the Gini coefficient for rating our explanatory variables. The aim of an explanatory variable is to distinguish the defaults from the non-defaults. Therefore, the variable will be judged on its ability to distinguish these two groups from each other.

²http://people.hofstra.edu/geotrans/eng/methods/gini.html


3.3.2 Customer-based Gini coefficient

This coefficient is motivated by our first model, which is based on customers. Recall that the buckets with a high WoE are the risky buckets. Hence, there are more defaults, i.e. customers in event $E_1$, in the first buckets than in the last ones. In fact, an ideal explanatory variable puts all defaults in one bucket: it has all defaults in the first bucket and none in the remainder. We want to give such a variable the highest possible rating $G_c$. On the other hand, the worst case scenario is that the defaults are spread proportionally over the buckets. In this case, we want to give such a variable the lowest possible rating. The following coefficient satisfies these criteria:

Definition 3.3.2. Let $f_{\mathrm{ideal}}(x) = \big(\frac{C}{D}\cdot x\big) \wedge 1$ represent the ideal situation, where $D$ is the total number of defaults, i.e. the customers in event $E_1$, and $C$ the total number of observations. The customer-based Gini coefficient, $G_c$, is then calculated from the graph whose axes are chosen in the following way:
• The $x$-axis contains the cumulative percentage of customers, divided into buckets. The buckets are put in a certain order.
• The $y$-axis contains the cumulative percentage of defaults.
Subsequently, the coefficient is defined as $G_c = \frac{A}{A+B}$, where
• $A$ is the surface enclosed by the plotted curve and the straight line $y = x$;
• $B$ is the surface enclosed by the plotted curve and the line $f_{\mathrm{ideal}}$.

Indeed, every default in the first bucket yields the smallest possible surface $B$, hence the highest possible rating $G_c = 1$. Furthermore, $G_c$ is consistent with the worst case scenario as well: evenly distributed defaults yield the straight line $y = x$. Hence, the surface $A$ attains the smallest possible value, thus the lowest possible rating $G_c = 0$.

Consider the example of Section 3.1. We want to calculate the customer-based Gini coefficient, thus we need the graph described in the definition above. It is given in Figure 3.2.

Figure 3.2: The graph for the customer-based Gini coefficient of Example 3.1.1.


In order to calculate the surfaces $A$ and $B$, we divide the graph into pieces. To compute $A$, we divide the area under the curve into the trapezia between the data points. Every trapezium is calculated as a triangle plus a rectangle. Adding them up and subtracting $\frac{1}{2}$, corresponding to the area under the line $y = x$, yields:
\begin{align*}
A = {} & \tfrac{1}{2}\cdot 0.1351\cdot 0.4130 + (0.3514 - 0.1351)\big(0.4130 + \tfrac{1}{2}(0.6957 - 0.4130)\big) \\
& + (0.6216 - 0.3514)\big(0.6957 + \tfrac{1}{2}(0.9130 - 0.6957)\big) + (1 - 0.6216)\big(0.9130 + \tfrac{1}{2}(1 - 0.9130)\big) - \tfrac{1}{2} \\
= {} & 0.22708.
\end{align*}
For $B$, we divide the area under $f_{\mathrm{ideal}}$ into a triangle and a rectangle. Then we add their surfaces up and subtract the area $A$ and again $\frac{1}{2}$:
\[ B = \tfrac{1}{2}\,(1 \cdot 0.1243) + 1\cdot(1 - 0.1243) - 0.22708 - \tfrac{1}{2} = 0.21077. \]
Hence, the customer-based Gini coefficient is
\[ G_c = \frac{A}{A+B} = \frac{0.22708}{0.22708 + 0.21077} = 0.51862. \]
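The calculation above can be automated with a short sketch (plain Python; the bucket layout is ours). It applies the same trapezium rule to the buckets of Example 3.1.1, ordered by decreasing WoE; with exact instead of rounded cumulative percentages the result is G_c ≈ 0.5187, matching the figures above up to rounding.

```python
# Buckets of Example 3.1.1, ordered by decreasing WoE: (defaults, customers)
buckets = [(19, 50), (13, 80), (10, 100), (4, 140)]

D = sum(d for d, _ in buckets)   # 46 defaults
C = sum(c for _, c in buckets)   # 370 customers

# Trapezium rule for the area under the Gini curve, then subtract the 1/2
# that lies under the line y = x
area, cum_d = 0.0, 0.0
for d, c in buckets:
    d_frac, c_frac = d / D, c / C
    area += c_frac * (cum_d + 0.5 * d_frac)
    cum_d += d_frac
A = area - 0.5
B = 0.5 - 0.5 * (D / C) - A      # area between the curve and f_ideal
Gc = A / (A + B)
print(round(Gc, 4))  # ≈ 0.5187
```

Reordering the buckets any other way only lowers the computed area, so this script also illustrates why the WoE ordering is the preferred one.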

After these calculations, we can construct an algorithm for computing $G_c$ in general cases as well:

Algorithm 3.3.3 (Customer-based Gini coefficient). Fix an explanatory variable $X$, divided into buckets $R_1, \dots, R_m$. Put the buckets in a certain order $R_{i_1}, R_{i_2}, \dots, R_{i_m}$ ($i_j \in \{1, \dots, m\}$), for instance such that the WoE of bucket $R_{i_j}$ is greater than the WoE of bucket $R_{i_{j+1}}$. Let $D$ be the total number of defaults, $C$ the total number of customers, $d_j$ the percentage of defaults in the $j$th bucket ($d_{i_0} := 0$) and $c_j$ the percentage of customers in the $j$th bucket ($c_{i_0} := 0$). Then $G_c(X)$, the customer-based Gini coefficient of variable $X$, is computed as $G_c(X) = \frac{A}{A+B}$, where:
• $A = \sum_{j=1}^m c_{i_j}\big(\sum_{k=1}^{j-1} d_{i_k} + \frac{1}{2}d_{i_j}\big) - \frac{1}{2}$;
• $B = \frac{1}{2}\cdot\frac{D}{C} + \big(1 - \frac{D}{C}\big) - \frac{1}{2} - A = \frac{1}{2} - \frac{1}{2}\cdot\frac{D}{C} - A$.

3.3.3 Assets-based Gini coefficient

We can define a similarly modified Gini coefficient for our second model, called the assets-based Gini coefficient:

Definition 3.3.4. Let $f_{\mathrm{ideal}}(x) = \big(\frac{M}{D}\cdot x\big) \wedge 1$ represent the ideal situation, where $D$ is the total number of defaults, i.e. the units in event $E_2$, and $M$ the total amount of euros. The assets-based Gini coefficient, $G_a$, is then calculated from the graph whose axes are chosen in the following way:
• The $x$-axis contains the cumulative percentage of assets, divided into buckets. The buckets are put in a certain order.
• The $y$-axis contains the cumulative percentage of defaults.
Subsequently, the coefficient is defined as $G_a = \frac{A}{A+B}$, where
• $A$ is the surface enclosed by the plotted curve and the straight line $y = x$;


• B is the surface enclosed by the plotted curve and the line $f_{\text{ideal}}$.

In general this coefficient $G_a$ will be lower than $G_c$: a default in $G_c$-terminology is a customer in event $E_1$, whereas a default in $G_a$-terminology is a unit in event $E_2$. However, such a unit corresponds to a customer, and every unit of that customer lies in the same bucket, since the characteristics of a unit are those of the customer! Hence, every customer who has removed at least one unit from the bank causes defaults, even if it is just one single unit. Thus, the ideal situation is rather unlikely: in that situation, every customer who has removed assets from the bank would have to be in the first bucket, and they would have to have removed all their assets. The other buckets would only consist of customers who have not removed assets from the bank relative to one year ago.

Lastly, similar to Algorithm 3.3.3 for $G_c$, we can define an algorithm for $G_a$ as well.

Algorithm 3.3.5 (Assets-based Gini coefficient). Fix an explanatory variable $X$, divided into buckets $R_1, \ldots, R_m$. Put the buckets in a certain order $R_{i_1}, R_{i_2}, \ldots, R_{i_m}$ ($i_j \in \{1, \ldots, m\}$), for instance such that the WoE of bucket $R_{i_j}$ is greater than the WoE of bucket $R_{i_{j+1}}$. Let $D$ be the total amount of defaults, $M$ the worth of all assets, $d_j$ the percentage of defaults in the $j$th bucket ($d_{i_0} := 0$) and $a_j$ the percentage of (value of) assets in the $j$th bucket ($a_{i_0} := 0$). Then $G_a(X)$, i.e. the assets-based Gini coefficient of variable $X$, is computed by $G_a(X) = \frac{A}{A+B}$, where:
• $A = \sum_{j=1}^{m} a_{i_j} \left( \sum_{k=1}^{j-1} d_{i_k} + \frac{1}{2} d_{i_j} \right) - \frac{1}{2}$,
• $B = \frac{1}{2} \cdot \frac{D}{M} + \left( 1 - \frac{D}{M} \right) - \frac{1}{2} - A = \frac{1}{2} - \frac{1}{2} \cdot \frac{D}{M} - A$.
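The algorithm above can be turned into a short computational sketch. This is my own illustrative code (the function and variable names are not from the thesis); it assumes the buckets are already put in the chosen order and that the percentages sum to one.

```python
def assets_based_gini(a, d, defaults_over_assets):
    """Assets-based Gini G_a as in Algorithm 3.3.5.

    a[j]: percentage of (value of) assets in the j-th bucket, already ordered,
    d[j]: percentage of defaults in the j-th bucket,
    defaults_over_assets: D / M, the defaulted fraction of the total asset value.
    """
    area_under_curve, cum_d = 0.0, 0.0
    for a_j, d_j in zip(a, d):
        area_under_curve += a_j * (cum_d + 0.5 * d_j)
        cum_d += d_j
    A = area_under_curve - 0.5                   # surface between the curve and y = x
    A_plus_B = 0.5 - 0.5 * defaults_over_assets  # surface between ideal line and y = x
    return A / A_plus_B

# three buckets, ordered by decreasing WoE; numbers are purely illustrative
print(assets_based_gini([0.2, 0.3, 0.5], [0.7, 0.3, 0.0], 0.1))  # 0.325 / 0.45
```

Note that only the bucket percentages and the ratio $D/M$ enter the computation; the individual units are not needed once the buckets are fixed.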

Remark. The definition of default in this algorithm is not the same as the definition of default in Algorithm 3.3.3: here, defaults are related to units (recall Definition 1.3.3), whereas defaults in Algorithm 3.3.3 are related to customers.

The following two lemmas are useful for upcoming calculations.

Lemma 3.3.6. Let $D_i$ be the number of defaults in bucket $i$ and $N_i$ the (positive) number of observations in bucket $i$. Then
$$\mathrm{WoE}_i < \mathrm{WoE}_j \quad \Longleftrightarrow \quad \frac{D_i}{N_i} < \frac{D_j}{N_j}.$$
In other words, the ordering by WoE is identical to the ordering by $\frac{D_i}{N_i}$.

Proof.
$$\mathrm{WoE}_i < \mathrm{WoE}_j \iff \log\!\left( \frac{D_i/N_i}{(N_i - D_i)/N_i} \right) < \log\!\left( \frac{D_j/N_j}{(N_j - D_j)/N_j} \right) \iff \frac{D_i/N_i}{1 - D_i/N_i} < \frac{D_j/N_j}{1 - D_j/N_j}$$
$$\iff \frac{D_i}{N_i}\left(1 - \frac{D_j}{N_j}\right) < \frac{D_j}{N_j}\left(1 - \frac{D_i}{N_i}\right) \iff \frac{D_i}{N_i} < \frac{D_j}{N_j}. \qquad \square$$


Note that $\frac{D_i}{N_i}$ is the slope of the Gini curve. Therefore, ordering by WoE always yields a convex curve.

Lemma 3.3.7. Suppose we have $k$ buckets. If $|R_i| = 1$ for every $i \in \{1, \ldots, k\}$, then $G = 1$.

Proof. In this case, $\frac{D_i}{N_i} \in \{0, 1\}$. By the previous Lemma 3.3.6, ordering by WoE implies that the first $D$ buckets each consist of a default, and the last several each consist of a non-default. Hence, moving to the next bucket, i.e. $\frac{1}{C}$ to the right, the plotted Gini curve goes $\frac{1}{D}$ upwards for each of the first $D$ buckets. Hereafter, the line remains constant. This is identical to the ideal line $f_{\text{ideal}}(x) = \left(\frac{C}{D} x\right) \wedge 1$. Hence, $B = 0$, so $G = \frac{A}{A+B} = 1$. $\square$

3.4 Chi-square tests

In this section, we assume that we have a probability density function $f(y|\beta, x)$ on $\{0,1\} \times \Theta \times \Omega$ of a variable $Y$, such that $f$ is a continuously differentiable function of $\beta$ for all $(y, x) \in \{0,1\} \times \Omega$, and a measurable function of $(y, x)$ for all $\beta \in \Theta$, as in Section 2.4. Furthermore, we assume nonsingularity of $I_{\beta_0}$ and the regularity conditions (1-4) to be satisfied. Under these assumptions, we can use the Uniform Law of Large Numbers (A.5).

Definition 3.4.1 (Standard Error). The Standard Error of the $i$th coefficient $\beta_{0,i}$ is defined as the square root of the $i$th diagonal element of the inverse of the estimator $-L''_n(\hat\beta_n)$ (excluding the factor $\frac{1}{n}$) for the Fisher Information Matrix $I_{\beta_0}$, i.e.
$$\mathrm{SE}^2_n(\hat\beta_{n,i}) = \left[ \left( -L''_n(\hat\beta_n) \right)^{-1} \right]_{i,i}.$$

Thus, we have approximately $\mathrm{SE}_n(\hat\beta_n) \approx \frac{1}{\sqrt{n}} \sigma$, where $\sigma^2 = I_{\beta_0}^{-1}$ is the asymptotic variance of $\hat\beta_n$. Standard errors are used for inferences about individual coefficients of the logistic regression models, in particular for chi-square tests. We cover two of them in this section: the Wald chi-square test (Section 3.4.1) and the Score chi-square test (Section 3.4.2).

Remark. For logistic regression, we have
$$\left( -L''_n(\hat\beta_n) \right)_{j,k} = \sum_{i=1}^{n} -\frac{\partial^2 \log f(y_i | \beta, x^{(i)})}{\partial \beta_j \, \partial \beta_k} \bigg|_{\beta = \hat\beta_n} = \sum_{i=1}^{n} x_{ij} x_{ik} \, p(x^{(i)}, \hat\beta_n) \left( 1 - p(x^{(i)}, \hat\beta_n) \right).$$

Example 3.4.2. Consider the following dataset for a logistic regression model ($n = 3$):

X | 1 | 2 | 3
Y | 1 | 0 | 1

The system of equations (2.2) yields $\hat\beta_n = (\hat\beta_{n,0}, \hat\beta_{n,1}) = (0.6931, 0)$, i.e. $p(x, \hat\beta_n) = \frac{1}{1 + \exp(-0.6931)} = \frac{2}{3}$ and $p(x, \hat\beta_n)\left(1 - p(x, \hat\beta_n)\right) = \frac{2}{9}$. Therefore, we have
$$-L''_3(\hat\beta_n) = \frac{2}{9} \begin{pmatrix} 1+1+1 & 1+2+3 \\ 1+2+3 & 1+4+9 \end{pmatrix} = \frac{2}{9} \begin{pmatrix} 3 & 6 \\ 6 & 14 \end{pmatrix}.$$
Hence,
$$\left( -L''_3(\hat\beta_n) \right)^{-1} = \frac{9}{2 \, (3 \cdot 14 - 6 \cdot 6)} \begin{pmatrix} 14 & -6 \\ -6 & 3 \end{pmatrix} = \begin{pmatrix} 10.5 & -4.5 \\ -4.5 & 2.25 \end{pmatrix}.$$
Thus we can conclude that the Standard Errors for $\beta_{0,0}$, the intercept, and $\beta_{0,1}$, corresponding to $X$, are $\mathrm{SE}_3(\hat\beta_{3,0}) = \sqrt{10.5} \approx 3.24$ and $\mathrm{SE}_3(\hat\beta_{3,1}) = \sqrt{2.25} = 1.5$, based on this dataset.
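The matrix computations of this example can be checked mechanically. The following pure-Python sketch (the variable names are mine) recomputes the standard errors from the data of Example 3.4.2:

```python
import math

# Data of Example 3.4.2: X = (1, 2, 3), Y = (1, 0, 1); (2.2) gives beta = (0.6931, 0)
x = [1.0, 2.0, 3.0]
beta = (math.log(2.0), 0.0)

p = [1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * xi))) for xi in x]
w = [pi * (1.0 - pi) for pi in p]        # p(1 - p) = 2/9 for every observation

# -L''_3(beta) = sum_i w_i * [[1, x_i], [x_i, x_i^2]]
h00 = sum(w)
h01 = sum(wi * xi for wi, xi in zip(w, x))
h11 = sum(wi * xi * xi for wi, xi in zip(w, x))

# diagonal of the inverse 2x2 matrix = squared standard errors
det = h00 * h11 - h01 * h01
se_intercept = math.sqrt(h11 / det)      # sqrt(10.5), approx. 3.24
se_slope = math.sqrt(h00 / det)          # sqrt(2.25) = 1.5
print(se_intercept, se_slope)
```

The printed values match the example: $\sqrt{10.5} \approx 3.24$ and $\sqrt{2.25} = 1.5$.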


3.4.1 Wald chi-square test

Definition 3.4.3 (Wald statistic). The Wald statistic for coefficient $\beta_{0,i}$ is defined as
$$T_w(\beta_{n,i}) := \frac{(\hat\beta_{n,i})^2}{\mathrm{SE}^2_n(\hat\beta_{n,i})}.$$

Using the Wald statistic, we can test whether a parameter is significant by means of the chi-square distribution. Indeed (Theorem 3.4.4), the Wald statistic is asymptotically chi-square distributed.

Theorem 3.4.4. The Wald statistic $T_w(\beta_{n,i})$ has an asymptotic $\chi^2_1$-distribution under the null hypothesis $H_0: \beta_{0,i} = 0$.

Proof. Let $\beta_0$ be the real parameter vector. Recall Theorem 2.4.5: $\sqrt{n}(\hat\beta_n - \beta_0) \xrightarrow{d} N(0, I_{\beta_0}^{-1})$. Multiplying by the unit vector $e_i$ yields
$$\sqrt{n} \left( \hat\beta_{n,i} - \beta_{0,i} \right) = e_i \cdot \sqrt{n} (\hat\beta_n - \beta_0) \xrightarrow{d} N\left( 0, \, e_i \cdot I_{\beta_0}^{-1} \cdot e_i^\top \right).$$
Furthermore, by the Uniform Law of Large Numbers (A.5), using conditions 1 and 4, we know that $-\frac{1}{n} L''_n(\hat\beta_n) \xrightarrow{p} I_{\beta_0}$. Indeed, noting that $L''(\beta_0) = \mathbb{E}_{\beta_0, \mu} \, \ell''(y; \beta_0, x) = -I_{\beta_0}$ (Lemma 2.4.2),
$$\left| \tfrac{1}{n} L''_n(\hat\beta_n) - L''(\beta_0) \right| \le \left| \tfrac{1}{n} L''_n(\hat\beta_n) - L''(\hat\beta_n) \right| + \left| L''(\hat\beta_n) - L''(\beta_0) \right| \le \sup_{\beta \in \Theta} \left| \tfrac{1}{n} L''_n(\beta) - L''(\beta) \right| + \left| L''(\hat\beta_n) - L''(\beta_0) \right| \xrightarrow{p} 0.$$
Here, we used that $\hat\beta_n \xrightarrow{p} \beta_0$, by consistency of $\hat\beta_n$ (Theorem 2.4.4), and the continuity of $L''(\beta)$. Applying the Continuous Mapping Theorem (A.1) gives us $n \left( -L''_n(\hat\beta_n) \right)^{-1} \xrightarrow{p} I_{\beta_0}^{-1}$, so in particular $e_i \cdot n \left( -L''_n(\hat\beta_n) \right)^{-1} \cdot e_i^\top \xrightarrow{p} e_i \cdot I_{\beta_0}^{-1} \cdot e_i^\top$.

Note that, under $H_0$, $\beta_{0,i} = 0$. Now we can conclude by Slutsky's Lemma that
$$\sqrt{T_w(\beta_{n,i})} = \frac{\hat\beta_{n,i}}{\mathrm{SE}_n(\hat\beta_{n,i})} = \frac{\hat\beta_{n,i}}{\sqrt{e_i \left( -L''_n(\hat\beta_n) \right)^{-1} e_i^\top}} = \frac{\sqrt{n} \left( \hat\beta_{n,i} - \beta_{0,i} \right)}{\sqrt{e_i \, n \left( -L''_n(\hat\beta_n) \right)^{-1} e_i^\top}} \xrightarrow{d} \frac{1}{\sqrt{e_i \cdot I_{\beta_0}^{-1} \cdot e_i^\top}} \cdot N\left( 0, \, e_i \cdot I_{\beta_0}^{-1} \cdot e_i^\top \right) = N(0, 1).$$
Finally, since $N(0,1)^2 \sim \chi^2_1$, we obtain $T_w(\beta_{n,i}) \xrightarrow{d} \chi^2_1$. $\square$


3.4.2 Score chi-square test

The other chi-square test in this thesis is the Score chi-square test.

Definition 3.4.5 (Score statistic). The Score statistic is defined as
$$T_s(\beta_{n,i}) := S_i^2(\tilde\beta_n) \cdot \mathrm{SE}^2_n(\tilde\beta_{n,i}),$$
where $\tilde\beta_n$ is the MLE subject to $\tilde\beta_{n,i} = 0$, and $S_i(\beta) := \sum_{j=1}^{n} \frac{\partial}{\partial \beta_i} \ell(Y_j; \beta, X^{(j)})$ is called the sample score.

This statistic is also asymptotically chi-square distributed, as we prove in the next theorem.

Theorem 3.4.6. The Score statistic $T_s(\beta_{n,i})$ has an asymptotic $\chi^2_1$-distribution under the null hypothesis $H_0: \beta_{0,i} = 0$.

Proof. Note that eventually $\tilde\beta_n$ will be close to $\hat\beta_n$, since both $\hat\beta_n \xrightarrow{p} \beta_0$ and $\tilde\beta_n \xrightarrow{p} \beta_0$ (under $H_0$). Therefore, using the Taylor series for $L'_n(\beta)$, we have $L'_n(\tilde\beta_n) = L'_n(\hat\beta_n) + L''_n(\tilde\beta_n)(\tilde\beta_n - \hat\beta_n) + o_p(1)$. Moreover, $\hat\beta_n$ is the MLE, so $L'_n(\hat\beta_n) = 0$. Now we have
$$\frac{1}{\sqrt{n}} L'_n(\tilde\beta_n) = \frac{1}{n} L''_n(\tilde\beta_n) \, \sqrt{n} \left( \tilde\beta_n - \hat\beta_n \right) + o_p\!\left( \frac{1}{\sqrt{n}} \right).$$
So in particular,
$$\frac{1}{\sqrt{n}} S_i(\tilde\beta_n) = \frac{1}{\sqrt{n}} \, e_i \cdot L'_n(\tilde\beta_n) = e_i \, \frac{1}{n} L''_n(\tilde\beta_n) \, e_i^\top \cdot \sqrt{n} \left( \tilde\beta_{n,i} - \hat\beta_{n,i} \right) + o_p\!\left( \frac{1}{\sqrt{n}} \right).$$
Consider the right-hand side. The first factor converges by the Uniform Law of Large Numbers, using conditions 1 and 4:
$$e_i \, \tfrac{1}{n} L''_n(\tilde\beta_n) \, e_i^\top \xrightarrow{p} -e_i I_{\beta_0} e_i^\top.$$
The second factor converges under $H_0$, since $\tilde\beta_{n,i} = 0$:
$$\sqrt{n} \left( \tilde\beta_{n,i} - \hat\beta_{n,i} \right) = -\sqrt{n} \, \hat\beta_{n,i} = -\sqrt{n} \left( \hat\beta_{n,i} - \beta_{0,i} \right) \xrightarrow{d} N\left( 0, \, e_i I_{\beta_0}^{-1} e_i^\top \right).$$
Hence, the right-hand side converges (in distribution) to $-e_i I_{\beta_0} e_i^\top \cdot N\left( 0, \, e_i I_{\beta_0}^{-1} e_i^\top \right) = N\left( 0, \, e_i I_{\beta_0} e_i^\top \right)$. Next, $\sqrt{n} \, \mathrm{SE}_n(\tilde\beta_{n,i}) = \sqrt{e_i \left( -\tfrac{1}{n} L''_n(\tilde\beta_n) \right)^{-1} e_i^\top}$ converges (in probability) to $\sqrt{e_i \, I_{\beta_0}^{-1} \, e_i^\top}$. Therefore, by Slutsky's Lemma,
$$\sqrt{T_s(\beta_{n,i})} = \frac{1}{\sqrt{n}} S_i(\tilde\beta_n) \cdot \sqrt{n} \, \mathrm{SE}_n(\tilde\beta_{n,i}) \xrightarrow{d} \sqrt{e_i \, I_{\beta_0}^{-1} \, e_i^\top} \cdot N\left( 0, \, e_i I_{\beta_0} e_i^\top \right) = N(0, 1).$$
Finally, we can conclude that $T_s(\beta_{n,i}) \xrightarrow{d} \chi^2_1$. $\square$


Having introduced two test statistics, we pose the question: why do we need two different chi-square tests? This question is answered partially in this section; we come back to it in the next chapter, Section 4.3. To form a clearer idea of these chi-square tests, both are pictured in the following graph.

Figure 3.3: Graphical representation of the Wald and Score chi-square test.³

Along the x-axis are possible values of the coefficient $\beta_i$. Along the y-axis are the values of the log likelihood corresponding to those values of $\beta_i$. The Wald test compares the coefficient estimate $\hat\beta_{n,i}$ to $\beta_{0,i}$. In this thesis, we always have $\beta_{0,i} = 0$. In Figure 3.3, this is shown as the distance between $\hat\beta_{n,i}$ and $\beta_{0,i}$, i.e. $|\hat\beta_{n,i}|$, on the x-axis. The Score test, on the other hand, looks at the slope of the log likelihood when $\beta_{0,i}$ is constrained to zero. That is, it looks at how quickly the likelihood is changing at the hypothesized value $\beta_i = 0$. In Figure 3.3, this is shown as the tangent line at $\beta_0$. Moreover, checking the definitions of the Wald and Score statistic, these statistics are defined using different models: the Wald statistic uses the model including the parameter to be tested, and the Score statistic uses the model without the parameter to be tested. Therefore,
• the Wald statistic is used to determine whether a variable has to remain in the model against a certain significance level,
• the Score statistic is used to determine whether a parameter can be added to the model against a certain significance level.
As mentioned before, we will come back to this in Section 4.3.
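To make the contrast concrete, here is an illustrative sketch of both statistics (my own code, not the implementation used at Van Lanschot Bankiers) for a logistic regression fitted by Newton-Raphson; it assumes non-separable data so that the MLE exists, and all function names are mine.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Newton-Raphson MLE for logistic regression (X includes an intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hess = X.T * (p * (1.0 - p)) @ X          # -L''_n(beta)
        beta += np.linalg.solve(hess, X.T @ (y - p))
    return beta

def wald_stat(X, y, i):
    """Wald statistic for coefficient i: uses the model *including* the parameter."""
    beta = fit_logit(X, y)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    cov = np.linalg.inv(X.T * (p * (1.0 - p)) @ X)   # SE^2 on the diagonal
    return beta[i] ** 2 / cov[i, i]

def score_stat(X, y, j):
    """Score statistic for adding column j: fits the model *without* the parameter."""
    kept = [k for k in range(X.shape[1]) if k != j]
    beta_tilde = np.zeros(X.shape[1])
    beta_tilde[kept] = fit_logit(X[:, kept], y)      # constrained MLE, beta_j = 0
    p = 1.0 / (1.0 + np.exp(-X @ beta_tilde))
    score = X.T @ (y - p)                            # sample score S(beta_tilde)
    cov = np.linalg.inv(X.T * (p * (1.0 - p)) @ X)
    return score[j] ** 2 * cov[j, j]
```

On the data of Example 3.4.2, both statistics for the slope are numerically zero, because the unconstrained MLE already has $\hat\beta_{n,1} = 0$.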

³ http://www.ats.ucla.edu/stat/mult_pkg/faq/general/nested_tests.htm (adapted)


Chapter 4

Methodology

The methodology described in this chapter has been applied to both models, customer-based and assets-based. The complete dataset is split into smaller, disjoint sets; this is described in the first section. As is apparent from the previous chapter, defining the buckets is essential for the Weight of Evidence, and hence for the Gini coefficient. Since the significance of each explanatory variable is based on this Gini coefficient, it is crucial to find the optimum buckets, i.e. the best number of buckets and their boundaries. Usually, within Van Lanschot Bankiers, the boundaries distinguishing the buckets are defined by experts. For the purpose of this thesis, I prefer a more mathematical approach. Therefore, I designed a bucket algorithm which determines the optimum boundaries for the buckets. Finally, the coefficients of the model are determined by stepwise logistic regression: the best-fitting explanatory variables are added to the model, and the insignificant ones are removed step by step.

Intermezzo: Overfitting

Adding as many explanatory variables as possible is not appropriate. Recall that the aim of this thesis is to predict the value of a certain stochastic variable $Y$ by potential explanatory variables $X = (X_1, \ldots, X_k)$. In other words, we want to fit the function $f$ such that $Y = f(X)$. For this, we use a dataset consisting of realizations of $Y$ with associated values of $X$. Overfitting occurs when the model starts to capture the points of this particular dataset instead of the function $f$. For example, consider a dataset consisting of 4 data points. Using a polynomial of degree three, we can draw a smooth curve exactly through these 4 data points. However, this does not imply that the generating function $f$ is equal to this curve as well. Moreover, using a polynomial of an even higher degree, totally nonsensical curves may arise. Therefore, we have to avoid overfitting by not adding too many variables to the model. This is discussed in detail in Section 4.3.
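The four-point example can be made concrete. Below, four noisy observations of the true line $f(x) = x$ are fitted with a cubic, which interpolates them exactly, and with a straight line; the data values are purely illustrative.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.1, 1.9, 3.2])        # noisy samples of the true line f(x) = x

cubic = np.polyfit(x, y, 3)               # 4 coefficients: interpolates all 4 points
line = np.polyfit(x, y, 1)                # 2 coefficients: close to the true f

print(np.polyval(cubic, x) - y)           # ~0: a perfect fit on this dataset ...
print(abs(np.polyval(cubic, 4.0) - 4.0))  # ... but a large error (1.8) at a new point
print(abs(np.polyval(line, 4.0) - 4.0))   # the straight line errs by only 0.15
```

The cubic has zero error on the dataset but a far worse error on a point outside it: exactly the behaviour the stop set is designed to detect.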


4.1 Data tree

The following graphic illustrates the division of the complete dataset. Each part is randomly determined. Note that the portions are disjoint.

Figure 4.1: Overview of the splitting of the dataset.

Primarily, the complete dataset is split into two parts: 30% of all data points is used for estimating the number of buckets and their boundaries, and 70% is used for fitting the logistic regression and, optionally, removing the insignificant explanatory variables. The choice of these percentages is based on the fact that estimating the coefficients is more important than estimating the boundaries of the buckets. We could have chosen a slightly different ratio as well, but this would not affect the results because of the enormous size of our dataset. Next, stratified sampling¹ ensures that it cannot occur that all defaults are contained in the 'Logistic Regression' part: without defaults, the estimation of buckets is useless. Subsequently, both parts are split into two parts as well. The splitting of the 'Bucket algorithm' part is needed for the bucket algorithm, defined in the next section. The 'Logistic Regression' part is split because we also want to perform an out-of-sample test to determine the accuracy of the model. Again, the 70%-30% split is made because estimating the coefficients is more important than testing the accuracy of the model. Both splits are also performed by stratified sampling.

¹ Stratified sampling is used to avoid the possibility that a portion contains none of the defaults. It means that the non-defaults and defaults are split separately using the given ratio. Therefore, every part contains at least one non-default and one default. Especially in the case of bucket estimation, it is crucial that a part contains defaults.
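The data tree can be sketched as follows (illustrative code with made-up sizes; the function name and the rounding rule are my own choices, not the thesis's):

```python
import numpy as np

def stratified_split(indices, y, frac, rng):
    """Split `indices` into (frac, 1 - frac) parts, defaults and non-defaults
    separately, so that each part keeps at least one observation of each class."""
    part_a, part_b = [], []
    for label in (0, 1):
        members = rng.permutation(indices[y[indices] == label])
        k = max(1, int(round(frac * len(members))))
        part_a.extend(members[:k])
        part_b.extend(members[k:])
    return np.array(part_a), np.array(part_b)

rng = np.random.default_rng(0)
y = np.zeros(1000, dtype=int)
y[:50] = 1                                # 50 defaults among 1000 customers (made up)

bucket_part, logreg_part = stratified_split(np.arange(len(y)), y, 0.30, rng)
construction, stop = stratified_split(bucket_part, y, 0.70, rng)
coefficients, out_of_sample = stratified_split(logreg_part, y, 0.70, rng)
# roughly 21% / 9% / 49% / 21% of the data; every part contains defaults
```

The two-level split reproduces the tree of Figure 4.1: 30%/70% first, then 70%/30% within each branch, always stratified on the default flag.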


4.2 Bucket algorithm

The algorithm is constructed in several steps. Its construction and processing are explained below; some particular choices within the algorithm are motivated in the subsequent subsections.

4.2.1 Design

The algorithm is built up in the following way:
• Firstly, the algorithm seeks the optimum single boundary, i.e. the boundary corresponding to the highest Gini, calculated using the WoE, in decreasing order, of the two buckets. The dataset used is called the construction set, which is a subset of the dataset.
• Next, the algorithm calculates the Gini on a different subset of the dataset, the stop set, such that the construction set has an empty intersection with the stop set.
• Subsequently, these two steps are repeated for two boundaries rather than one. If the Gini on the stop set has decreased relative to the previous step, the algorithm stops. Otherwise, it continues to the next step.
• The first two steps are repeated for three boundaries, i.e. four buckets. Again, the algorithm stops if the Gini on the stop set has decreased; otherwise, the next step is executed.
• The first two steps are repeated for four boundaries, i.e. five buckets.
• At last, the boundaries creating the highest Gini on the stop set are defined as the optimum set of boundaries.

The algorithm can easily be extended to more boundaries. However, adding one more boundary increases the computational time tremendously (for more details, please refer to Section 4.2.3). Moreover, based on the experience of experts within Van Lanschot Bankiers, a useful number of buckets greater than 5 will rarely occur. The stop set exists to avoid overfitting. This is motivated by Figure 4.2:

Figure 4.2: The evolution of the Gini coefficient on the construction set and stop set.


As the number of buckets increases, the Gini on the construction set always increases as well (for a proof, see Theorem 4.2.3). Eventually, new buckets are only chosen to fit the data points, whereas we want to increase the number of buckets only if this improves the predictive power of the explanatory variable: this is overfitting. By introducing the stop set, the algorithm stops when overfitting occurs. Adding a new bucket to fit the data of the construction set will increase the Gini on the construction set, but the Gini on the stop set will not increase, or will even decrease; at that moment, the algorithm stops. Since the construction set is more important than the stop set, the construction set consists of 70% of the dataset and the stop set of the other 30%.

The determination of the optimum boundary is achieved by iteration. This concept is explained by the following example. Two input variables for the algorithm are $n$ and $steps$; in this example, $n = 4$ and $steps = 2$.

Example 4.2.1. Consider the following situation. We have an explanatory variable $X \in \mathbb{Z}$ and 9 customers with different realizations of $X$. One of these customers is a default.

X | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6
Y |  0 |  0 | 0 | 0 | 0 | 0 | 1 | 0 | 0

Table 4.1: Realizations of X for 9 different customers, including one default.

After ordering the customers by $X$, the $100 \cdot \frac{1}{n}$th up to the $100 \cdot \frac{n-1}{n}$th percentiles are calculated. Next, the algorithm treats these $n - 1$ values as possible boundaries and calculates the $n - 1$ corresponding Ginis. In our example, the possible boundaries are the 25th, 50th and 75th percentiles: $\{0, 2, 4\}$.

boundary | $\alpha_1 = 0$ | $\alpha_2 = 2$ | $\alpha_3 = 4$
Gini     | 0.375          | 0.625          | 0.25

Table 4.2: Values of the Ginis, considering the 25th, 50th and 75th percentiles as boundary.

As can be seen in Table 4.2, the boundary with the highest Gini is 2, the 50th percentile. For the next step, the algorithm concentrates on the interval $(0, 4]$, between the 25th and 75th percentiles. In general, if the highest Gini is attained at the $100 \cdot \frac{i}{n}$th percentile, the algorithm concentrates in the next step on the interval between the $100 \cdot \frac{i-1}{n}$th and the $100 \cdot \frac{i+1}{n}$th percentiles. On that interval, the $100 \cdot \frac{1}{n}$th up to the $100 \cdot \frac{n-1}{n}$th percentiles are calculated again, and subsequently the $n - 1$ corresponding Ginis. In our example, the possible boundaries in the next step are $\{1, 2, 3\}$.


boundary | $\alpha_1 = 1$ | $\alpha_2 = 2$ | $\alpha_3 = 3$
Gini     | 0.5            | 0.625          | 0.75

Table 4.3: Values of the Ginis, considering the boundaries of step 2.

Now, 3 is the optimum boundary of the second step, i.e. $R_1 = (-\infty, 3]$ and $R_2 = (3, \infty)$. In fact, this is the optimum boundary in this case: it is not possible to have more non-defaults in a bucket without any defaults. By increasing $n$, the possibility of converging to a local maximum rather than the global maximum can be minimised; on the other hand, the computational time will increase as well, since the number of Ginis calculated is approximately $n \cdot steps$. Increasing $steps$ increases the accuracy of the boundary. If $L$ is the length of the range of $X$, the deviation from the true value after the first step is at most approximately $\frac{L}{n}$, since that is the distance between two possible boundaries. After the second step, the accuracy is approximately $\frac{2L}{n^2}$, since an interval of length $\frac{2L}{n}$ has been cut into $n$ pieces. In general, the deviation after the $t$-th step, which we call the accuracy from now on, is approximately $\frac{2^{t-1} L}{n^t}$. Thus, in our example, the accuracy after the first step is $\approx \frac{8}{4} = 2$, and $\approx \frac{2^1 \cdot 8}{4^2} = 1$ after the second step.
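Example 4.2.1 can be reproduced in code. The sketch below (my own code) orders buckets by default rate, which by Lemma 3.3.6 is equivalent to ordering by WoE; the exact percentile convention is my own assumption, so intermediate candidate sets need not match the thesis exactly, although on this example they do.

```python
def gini(buckets):
    """Customer-based Gini; buckets = [(n_observations, n_defaults), ...]."""
    C = sum(n for n, _ in buckets)
    D = sum(d for _, d in buckets)
    ordered = sorted(buckets, key=lambda b: b[1] / b[0], reverse=True)  # WoE order
    area, cum_d = 0.0, 0.0
    for n, d in ordered:
        area += (n / C) * (cum_d + 0.5 * d / D)
        cum_d += d / D
    return (area - 0.5) / (0.5 - 0.5 * D / C)     # A / (A + B)

def split_gini(xs, ys, b):
    """Gini of the two-bucket split (-inf, b], (b, inf)."""
    left = [y for x, y in zip(xs, ys) if x <= b]
    right = [y for x, y in zip(xs, ys) if x > b]
    return gini([(len(left), sum(left)), (len(right), sum(right))])

def best_boundary(xs, ys, n=4, steps=2):
    """Iterative percentile search for the optimum single boundary."""
    lo, hi = min(xs), max(xs)
    best_g, best_b = -1.0, None
    for _ in range(steps):
        window = sorted(x for x in xs if lo <= x <= hi)
        cands = sorted({window[len(window) * k // n] for k in range(1, n)})
        cands = [c for c in cands if any(x > c for x in xs)]  # both buckets non-empty
        g, b = max((split_gini(xs, ys, c), c) for c in cands)
        if g > best_g:
            best_g, best_b = g, b
        k = cands.index(b)                                    # zoom in around winner
        lo = cands[k - 1] if k > 0 else lo
        hi = cands[k + 1] if k + 1 < len(cands) else hi
    return best_b, best_g

xs = [-2, -1, 0, 1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 0, 0, 0, 1, 0, 0]
print(best_boundary(xs, ys))  # boundary 3 with Gini 0.75 (up to float rounding)
```

Step 1 evaluates the candidates {0, 2, 4} and step 2 the candidates {1, 2, 3}, exactly the tables of the example.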

4.2.2 Outliers

This algorithm takes percentiles into account. Initially, the algorithm chose the possible boundaries slightly differently: it cut the interval into $n$ equal pieces, so the boundaries would have been $\{M + \frac{L}{n}, M + 2 \cdot \frac{L}{n}, \ldots, M + (n-1) \cdot \frac{L}{n}\}$, where $M$ is the minimum value of the considered interval. In the previous example, this would not have made any difference, but in some specific cases it really does. For instance, consider a fictive explanatory variable with a huge gap: every customer has a value between 0 and 1 000, but there is one 'big spender' with a value of 10 000 000. Fix $n = 10$. Initially, the possible boundaries would have been {1 000 000, 2 000 000, . . . , 9 000 000}. These boundaries would not have made any sense, since they all define the same buckets: the 'big spender' is separated from the rest. Only after the 6th iteration do the boundaries become non-trivial (accuracy $\approx \frac{2^5 \cdot 10\,000\,000}{10^6} = 320$). Therefore, the algorithm considers percentiles to avoid this. Using percentiles, we have the guarantee that every bucket contains an $n$-th proportion of the customers.

4.2.3 Analysis of boundaries

In the ideal situation, the algorithm would only stop when the Gini on the stop set starts to decline, so the limit of 5 buckets would not be there. However, as mentioned in the previous section, the computational time of the algorithm is quite high. That is the main reason why it is impossible to remove the limit on the number of buckets: specifically, let $n$ be the number of cuts; then the computational time for 6 buckets is already approximately $n$ times the computational time for 5 buckets.


If the optimum set of boundaries $B_k$ in the case of $k$ buckets were contained in the optimum set of boundaries in the case of $k + 1$ buckets ($B_{k+1}$), the computational time would be extremely low relative to the current algorithm. For instance, consider the case where $k = 2$ and $n = 10$ and we perform just one iteration step. The optimum boundary is selected from 10 possible boundaries. For $k = 3$, the optimum boundaries are selected from $\binom{10}{2} = \frac{1}{2} \cdot 10 \cdot (10 - 1)$ possible couples of boundaries (every couple where the first boundary is strictly lower than the second boundary). In general, the boundaries for $k + 1$ buckets are selected from $\binom{n}{k}$ possible sets of boundaries. If $B_k \subseteq B_{k+1}$, the optimum boundary would still be selected from 10 possible boundaries, but for $k = 3$ the optimum set of boundaries would also be selected from only 10 possible boundaries, since the first boundary is already chosen. Hence, in general, the boundaries for $k + 1$ buckets would be selected from $n$ possible sets of boundaries. This difference only increases as $k$ rises and as we increase the number of iteration steps. For the first iteration step in our algorithm, where the limit is 5 buckets, $S_n := \sum_{k=1}^{5} \binom{n}{k}$ Ginis are calculated in total. If $B_k \subseteq B_{k+1}$, only $S'_n := \sum_{k=1}^{5} n$ Ginis would be calculated. For $n = 10$ this means 637 respectively 50 calculations. This difference would become much bigger if the limit of 5 buckets were raised and if we iterated more than once. Unfortunately, $B_k \subseteq B_{k+1}$ does not hold in general, as the following counterexample shows.

Example 4.2.2. Consider a discrete variable $X \in \{1, 2, 3, 4, 5\}$ with the following data points:

X | observations | defaults
1 | 10 | 6
2 | 10 | 4
3 | 10 | 3
4 | 10 | 2
5 | 10 | 1

Calculating the Ginis for the 4 possible boundaries for two buckets yields:

boundary | bucket 1 | bucket 2 | Gini
2 | $(-\infty, 2)$ | $[2, \infty)$ | 0.2573529412
3 | $(-\infty, 3)$ | $[3, \infty)$ | 0.3308823529
4 | $(-\infty, 4)$ | $[4, \infty)$ | 0.3125
5 | $(-\infty, 5)$ | $[5, \infty)$ | 0.2022058824

The Ginis for 3 buckets are, according to the algorithm:

boundary 1 | boundary 2 | bucket 1 | bucket 2 | bucket 3 | Gini
2 | 3 | $(-\infty, 2)$ | $[2, 3)$ | $[3, \infty)$ | 0.3676470588
2 | 4 | $(-\infty, 2)$ | $[2, 4)$ | $[4, \infty)$ | 0.4044117647
2 | 5 | $(-\infty, 2)$ | $[2, 5)$ | $[5, \infty)$ | 0.3676470588
3 | 4 | $(-\infty, 3)$ | $[3, 4)$ | $[4, \infty)$ | 0.3860294118
3 | 5 | $(-\infty, 3)$ | $[3, 5)$ | $[5, \infty)$ | 0.3860294118
4 | 5 | $(-\infty, 4)$ | $[4, 5)$ | $[5, \infty)$ | 0.3308823529

So $B_2 = \{3\}$ and $B_3 = \{2, 4\}$, hence $B_2 \not\subset B_3$.
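The tables of Example 4.2.2 can be verified by brute force. This is my own illustrative code: it enumerates all one- and two-boundary sets and compares the optima, reusing the customer-based Gini in the style of Algorithm 3.3.3.

```python
from itertools import combinations

def gini(buckets):
    """Customer-based Gini; buckets = [(n_observations, n_defaults), ...]."""
    C = sum(n for n, _ in buckets)
    D = sum(d for _, d in buckets)
    ordered = sorted(buckets, key=lambda b: b[1] / b[0], reverse=True)
    area, cum_d = 0.0, 0.0
    for n, d in ordered:
        area += (n / C) * (cum_d + 0.5 * d / D)
        cum_d += d / D
    return (area - 0.5) / (0.5 - 0.5 * D / C)

# (value, observations, defaults) as in Example 4.2.2
data = [(1, 10, 6), (2, 10, 4), (3, 10, 3), (4, 10, 2), (5, 10, 1)]

def bucketize(bounds):
    """Buckets [b_i, b_{i+1}) as in the tables of Example 4.2.2."""
    edges = [float("-inf"), *bounds, float("inf")]
    return [(sum(n for x, n, _ in data if lo <= x < hi),
             sum(d for x, _, d in data if lo <= x < hi))
            for lo, hi in zip(edges, edges[1:])]

B2 = max([(b,) for b in (2, 3, 4, 5)], key=lambda bs: gini(bucketize(bs)))
B3 = max(combinations((2, 3, 4, 5), 2), key=lambda bs: gini(bucketize(bs)))
print(B2, B3)  # (3,) and (2, 4): the optimum single boundary is not kept
```

This also illustrates Theorem 4.2.3 below: every optimum two-boundary Gini in the table exceeds the optimum one-boundary Gini.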


This example is a special case where the number of defaults decreases as $X$ increases: there is a monotone relationship between $X$ and the number of defaults. Note that the Gini for two boundaries (3 buckets) is strictly higher than the Gini for one of those boundaries (2 buckets). The following theorem shows that the optimum Gini for $k + 1$ buckets ($G_{k+1}$) is always strictly higher than the optimum Gini for $k$ buckets ($G_k$), under the assumption that $G_k < 1$.

Theorem 4.2.3. Let $G_k$ be the optimum Gini for $k$ buckets. The following implication holds:
$$G_k < 1 \;\Rightarrow\; G_{k+1} > G_k.$$
In other words, if the optimum Gini for $k$ buckets is not equal to 1, i.e. we are not in the ideal situation yet, the optimum Gini for $k + 1$ buckets is always strictly greater than $G_k$.

Before we prove this theorem, two lemmas are introduced to achieve and maintain a better overview.

Lemma 4.2.4. If $G_k < 1$, then there exists a bucket $R_m$ ($m \in \{1, \ldots, k\}$), having percentage of defaults $d_m$ and percentage of customers $c_m$, with a partition $\{R_{m_1}, R_{m_2}\}$ (obtained by adding one boundary), such that
$$\frac{d_{m_1}}{c_{m_1}} > \frac{d_m}{c_m} > \frac{d_{m_2}}{c_{m_2}}.$$

Proof. Note that if we can divide $R_m$ into buckets $R_{m_1}$ and $R_{m_2}$ such that $\frac{d_{m_1}}{c_{m_1}} < \frac{d_{m_2}}{c_{m_2}}$, switching the two buckets yields the result as well. We prove the lemma by contradiction. Suppose the contrary, i.e. for all buckets $m$ and all partitions $\{R_{m_1}, R_{m_2}\}$, $\frac{d_{m_1}}{c_{m_1}} = \frac{d_{m_2}}{c_{m_2}}$. From Lemma 3.3.7 we know of the existence of a bucket containing at least two observations. Choose $R_{m_1}$ inside one of those buckets, such that it contains exactly one customer.
• If this customer is a non-default, then $0 = \frac{d_{m_1}}{c_{m_1}} = \frac{d_{m_2}}{c_{m_2}} \Rightarrow d_{m_2} = 0$. Hence, the whole bucket consists of non-defaults.
• If this customer is a default, then $\frac{1/D}{1/C} = \frac{d_{m_2}}{c_{m_2}} \Rightarrow d_{m_2} = \frac{C \cdot c_{m_2}}{D} = \frac{N_{m_2}}{D}$. Since $d_{m_2} := \frac{D_{m_2}}{D}$, it follows that $D_{m_2} = N_{m_2}$. Hence, the whole bucket consists of defaults.
This holds for every bucket $R_i$, $i \in \{1, \ldots, k\}$. Thus, $\frac{D_i}{N_i} \in \{0, 1\}$ for every bucket $R_i$. Reasoning as in the proof of Lemma 3.3.7 yields $G = 1$, a contradiction. $\square$


The following lemma is almost trivial, but the result is convenient.

Lemma 4.2.5. Let $A$ be the surface enclosed by the red and black lines and $B$ the surface enclosed by the blue and black lines. If $\frac{d_2}{c_2} > \frac{d_1}{c_1}$, then $A > B$. In other words, as long as $\frac{d_i}{c_i} > \frac{d_{i-1}}{c_{i-1}}$, swapping these buckets only enlarges the surface.

Proof. Left to the reader.

The results of Lemma 4.2.4 and Lemma 4.2.5 simplify the proof of Theorem 4.2.3, to which we can now proceed.

Proof of Theorem 4.2.3. Fix the set of boundaries $B_k = \{b_1, \ldots, b_{k-1}\}$. We will add a new boundary $b'$ such that the Gini $G'$, with boundaries $\{b_1, \ldots, b_{k-1}, b'\}$, is a strict improvement with respect to $G_k$. Since $G_{k+1}$ is the optimum Gini over every combination of $k + 1$ boundaries, we can then conclude that $G_{k+1} \ge G' > G_k$, which ends the proof.

Since $G = \frac{A}{A+B} = \frac{A}{\frac{1}{2} - \frac{1}{2} \cdot \frac{D}{C}}$, improving the surface $A$ is sufficient for improving the Gini coefficient $G$. Let $A_k := \sum_{i=1}^{k} c_i \left( \sum_{j=1}^{i-1} d_j + \frac{1}{2} d_i \right) - \frac{1}{2}$ be the surface $A$ corresponding to $G_k$. By Lemma 4.2.4 we know of the existence of a bucket $R_m$ with partition $\{R_{m_1}, R_{m_2}\}$ such that $\frac{d_{m_1}}{c_{m_1}} > \frac{d_{m_2}}{c_{m_2}}$. Denote by $A_k^{(m)}$ the surface after exchanging the bucket $R_m$ for the buckets $R_{m_1}$ and $R_{m_2}$. Now,
$$A_k^{(m)} - A_k = c_{m_1} \left( \sum_{j=1}^{m-1} d_j + \tfrac{1}{2} d_{m_1} \right) + c_{m_2} \left( \sum_{j=1}^{m-1} d_j + d_{m_1} + \tfrac{1}{2} d_{m_2} \right) - c_m \left( \sum_{j=1}^{m-1} d_j + \tfrac{1}{2} d_m \right)$$
$$= (c_{m_1} + c_{m_2}) \sum_{j=1}^{m-1} d_j + \tfrac{1}{2} c_{m_1} d_{m_1} + c_{m_2} d_{m_1} + \tfrac{1}{2} c_{m_2} d_{m_2} - c_m \left( \sum_{j=1}^{m-1} d_j + \tfrac{1}{2} d_m \right)$$
$$= \tfrac{1}{2} c_{m_1} d_{m_1} + c_{m_2} d_{m_1} + \tfrac{1}{2} c_{m_2} d_{m_2} - \tfrac{1}{2} c_m d_m$$
$$= \tfrac{1}{2} c_{m_1} d_{m_1} + c_{m_2} d_{m_1} + \tfrac{1}{2} c_{m_2} d_{m_2} - \tfrac{1}{2} (c_{m_1} + c_{m_2})(d_{m_1} + d_{m_2})$$
$$= \tfrac{1}{2} \left( c_{m_2} d_{m_1} - c_{m_1} d_{m_2} \right) > 0.$$
Here we used that $c_{m_1} + c_{m_2} = c_m$ and $d_{m_1} + d_{m_2} = d_m$. Now we have an improvement of $A_k$. However, this is not yet the surface $A'$ for the Gini $G'$, since we have to re-order the buckets due to the new buckets $R_{m_1}$ and $R_{m_2}$. Actually, only $R_{m_1}$ can shift to the left and $R_{m_2}$ to the right, as long as $\frac{d_{m_1}}{c_{m_1}} > \frac{d_i}{c_i}$ for $i < m_1$, respectively $\frac{d_{m_2}}{c_{m_2}} < \frac{d_i}{c_i}$ for $i > m_2$. Indeed, by convexity of the Gini curve and Lemma 4.2.4, we have $\frac{d_{m_1}}{c_{m_1}} > \frac{d_m}{c_m} \ge \frac{d_j}{c_j}$ for $j > m$, so $R_{m_1}$ cannot shift to the right. Similarly, we have $\frac{d_{m_2}}{c_{m_2}} < \frac{d_m}{c_m} \le \frac{d_j}{c_j}$ for $j < m$, so $R_{m_2}$ cannot shift to the left. Lemma 4.2.5 tells us that these shifts only enlarge the surface $A_k^{(m)}$. We can apply it iteratively until the buckets $R_{m_1}$ and $R_{m_2}$ are at their right places. The resulting surface is $A'$, which is strictly greater than $A_k$. Hence, $G' > G_k$. $\square$

This explains the need for a stop set that is disjoint from the construction set: if the algorithm stopped only when the Gini on the construction set decreased, it would only stop once $G = 1$, which is the case where every customer is contained in a unique bucket (overfitting!).

4.3 Selecting the explanatory variables

After finding the optimum buckets for each explanatory variable, the variables are transformed by the Weight of Evidence transformation: each value of every explanatory variable is mapped to the Weight of Evidence of the corresponding bucket. Hereafter, the transformed explanatory variables are ready to be selected for the logistic regression model. This can be done in many different ways; therefore, the major advantages and disadvantages of the main methods are weighed against each other below, and it turns out that stepwise logistic regression is the most favourable one. Recall Theorem 3.4.4 and Theorem 3.4.6: under $H_0: \beta_{0,i} = 0$, $T_w(\beta_{n,i})$ and $T_s(\beta_{n,i})$ have an asymptotic $\chi^2_1$-distribution. Therefore, when deciding whether to add a new variable, the p-value is $P[X > T_s(\beta_{n,i})]$ for $X \sim \chi^2_1$, and when deciding whether to eliminate an included variable, the p-value is $P[X > T_w(\beta_{n,i})]$ for $X \sim \chi^2_1$.

Full model fitted
The full model fitted method is the default and provides no model selection capability: all of the explanatory variables are contained in the model.

Forward selection
The forward selection method begins with no variables in the model and adds variables by comparing the p-values of the Score statistics to a significance level. In this manner, the best explanatory variable among the non-selected ones, i.e. the one with the lowest p-value, is added to the model at each step. However, the p-value of an included explanatory variable, based on the Wald statistic, may increase after including a new one, due to the dependency between those variables. Hence, after this selection method, an included explanatory variable can still be insignificant, i.e. have a p-value, corresponding to the Wald statistic, greater than a given significance level.
Backward elimination
The backward elimination method begins by including all of the explanatory variables and then deletes variables until all of the variables that remain produce Wald statistics significant at a given significance level. In this way, the model ends up consisting of only significant explanatory variables. However, it is possible that an explanatory variable is deleted during the process while, in hindsight, it would have been better to retain it. For instance, if two variables are strongly related, one of them will be insignificant and hence deleted from the model, even though including the deleted variable on its own might be better, i.e. more significant, than including the current one.

Best subset
The best subset method finds a specified number of models with the highest Wald statistic for all possible model sizes: 1-, 2-, 3-effect models, and so on, up to the single model containing all of the explanatory variables. Hence, the full model fitted method is contained in this best subset method. The drawback of this method is that it is hard to determine which subset is the best one. Since the chi-square statistic never falls after adding a variable, the conclusion of this method will be that the full model is the 'best' one. However, the problem of overfitting occurs within this reasoning: specifically, if we have more variables than defaults to be explained, it can be the case that every default is explained by a unique combination of variables. This problem could be solved by adding a new stop set, just like in the foregoing algorithm, but this reduces the number of data points for estimating the coefficients, which is undesirable. The major issue, however, is the computational time. Since this method considers every possible subset of the set of variables, $2^{\#\text{variables}}$ models have to be computed, which is very inefficient and very time consuming.

Stepwise selection
The stepwise selection method is a modification of forward selection that differs in that variables already in the model do not necessarily stay there. The stepwise process ends when no variable outside the model has a p-value, corresponding to the Score statistic, significant at a given significance level and every variable in the model is significant at that level, or when the variable to be added to the model is the variable that was just deleted from it (otherwise, the method could continue indefinitely). At each step, the most significant variable is added to the model (if it satisfies the significance level) and, after that, we check whether any included variable has become insignificant. The advantage of this method is that it does not have the problems of forward selection and backward elimination: after adding the best explanatory variable, the variables which have become insignificant are removed from the model, but they can be added again in a subsequent step. Therefore, it is impossible to end up with an insignificant explanatory variable after performing this method, and it is impossible to lose a significant variable during the process as well.

As argued above, we use the stepwise logistic regression method, since it results in the best collection of explanatory variables without overfitting. Moreover, the computational time is acceptable.
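The stepwise loop described above can be sketched compactly (illustrative code, not the implementation used within Van Lanschot Bankiers): candidates enter on the Score chi-square p-value and leave on the Wald chi-square p-value, with a guard that stops when a just-added variable is removed again. The Newton-Raphson fit and all names are my own assumptions.

```python
import math
import numpy as np

def fit(X, y, cols, n_iter=25):
    """Newton-Raphson MLE for a logistic model on the columns `cols` of X."""
    Xs = X[:, cols]
    beta = np.zeros(len(cols))
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xs @ beta))
        beta += np.linalg.solve(Xs.T * (p * (1.0 - p)) @ Xs, Xs.T @ (y - p))
    return beta

def chi2_pvalue(t):
    """P[T > t] for T ~ chi^2_1, via the standard normal tail."""
    return math.erfc(math.sqrt(max(t, 0.0) / 2.0))

def info_inv(X, cols, beta):
    """Inverse observed information; its diagonal holds the squared standard errors."""
    p = 1.0 / (1.0 + np.exp(-X[:, cols] @ beta))
    return np.linalg.inv(X[:, cols].T * (p * (1.0 - p)) @ X[:, cols]), p

def stepwise(X, y, alpha=0.05):
    included = [0]                                  # column 0: intercept, always kept
    while True:
        # entry step: Score chi-square test for every variable outside the model
        entry = []
        for j in range(1, X.shape[1]):
            if j in included:
                continue
            cols = included + [j]
            beta_t = np.append(fit(X, y, included), 0.0)  # constrained MLE, beta_j = 0
            cov, p = info_inv(X, cols, beta_t)
            s = X[:, cols].T @ (y - p)                    # sample score
            entry.append((chi2_pvalue(s[-1] ** 2 * cov[-1, -1]), j))
        if not entry or min(entry)[0] >= alpha:
            return included
        added = min(entry)[1]
        included.append(added)
        # removal step: Wald chi-square test for every variable inside the model
        while True:
            beta = fit(X, y, included)
            cov, _ = info_inv(X, included, beta)
            pvals = [(chi2_pvalue(beta[k] ** 2 / cov[k, k]), k)
                     for k in range(1, len(included))]
            if not pvals or max(pvals)[0] < alpha:
                break
            included.pop(max(pvals)[1])
        if added not in included:                   # just-added variable dropped: stop
            return included
```

On synthetic data where only the first explanatory variable carries signal, the loop reliably selects it; noise variables may occasionally enter at level $\alpha$, which is exactly the behaviour the significance level controls.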


CHAPTER 5. MODEL ANALYSIS

Chapter 5

Model analysis

In the previous chapters, the tools and the methodology were described and motivated. In this chapter, we discuss and analyse the results of this methodology. We consider, respectively, Model 1, i.e. the Customer-based model, and Model 2, i.e. the Assets-based model. We started modelling with a dataset consisting of 1026 potential explanatory variables and two dependent variables (one for each model). The four sets of data variables of Section 1.1 are all included except the macroenvironment variables, so only customer behaviour, perceptions and demographic variables are used. The majority of these variables were already available from existing tables; the remaining variables were added manually. After consultation with experts within Van Lanschot Bankiers, the number of variables was reduced to 26. These potential variables were used for both models. We performed Model 2 twice, in two different ways; this is motivated in Section 5.3. The data were taken from a fixed date in the past.¹

5.1 Customer-based model

In accordance with the data tree of the previous chapter (Figure 4.1), the data points, i.e. customers, are split into four subsets:

1. Construction set
2. Stop set
3. Coefficients set
4. Out-of-sample set

Since the number of observations and the number of defaults are market-sensitive information, they will not be mentioned. The first two sets are used by the algorithm to determine the bucket boundaries (recall Figure 4.2). Subsequently, the WoE of each bucket and the Gini coefficient (Gc, recall Definition 3.3.2) of the variable are calculated on the construction set. The order of the buckets for Gc is such that the WoE of bucket Ri, on the construction set, is greater than the WoE of bucket Ri+1. Next, we want to exclude variables that have no distinctive power at all; therefore, variables with a Gini coefficient lower than 0.1 are eliminated. After transforming the coefficients set into these new values, stepwise logistic regression is performed on the transformed coefficients set. Finally, the model is tested on the out-of-sample set.
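To make the WoE and Gini computations concrete, the sketch below uses a common credit-scoring form of the Weight of Evidence and a CAP-curve-based bucket-level Gini. This is an assumed variant for illustration; the thesis' own definitions in Chapter 3 may differ in sign convention and normalisation.

```python
import math

# Illustrative sketch only: per-bucket Weight of Evidence and a bucket-level
# Gini computed from the CAP curve, with buckets ordered riskiest-first.

def woe_per_bucket(buckets):
    """buckets: list of (n_defaults, n_non_defaults) tuples, one per bucket."""
    tot_d = sum(d for d, g in buckets)
    tot_g = sum(g for d, g in buckets)
    return [math.log((d / tot_d) / (g / tot_g)) for d, g in buckets]

def bucket_gini(buckets):
    """Twice the area between the CAP curve and the diagonal, after sorting
    the buckets by their default rate in decreasing order."""
    tot_d = sum(d for d, g in buckets)
    tot_n = sum(d + g for d, g in buckets)
    order = sorted(buckets, key=lambda b: b[0] / (b[0] + b[1]), reverse=True)
    area = cum_d = 0.0
    for d, g in order:
        step_n = (d + g) / tot_n     # share of observations in this bucket
        step_d = d / tot_d           # share of defaults captured here
        area += step_n * (cum_d + step_d / 2)   # trapezoid under the CAP
        cum_d += step_d
    return 2 * area - 1
```

With two equally risky buckets this Gini is 0; a first bucket containing all defaults pushes it towards 1 minus the overall default rate.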

¹ The date will not be mentioned in this thesis.


5.1.1 Transformation

The algorithm of Section 4.2 yielded the following Gini coefficients, Gc, and Weights of Evidence. Note that the algorithm automatically stops after 5 buckets; there is, however, one variable with 6 buckets, whose sixth bucket was added manually afterwards, for internal reasons. Every value (Gini, WoE) is calculated on the construction set.

Variable   Gini
X1    0.0378
X2    0.1667
X3    0.1390
X4    0.1353
X5    0.2443
X6    0.2429
X7    0.1264
X8    0.0984
X9    0.2633
X10   0.2645
X11   0.1760
X12   0.2163
X13   0.1533
X14   0.2701
X15   0.1842
X16   0.1555
X17   0.1097
X18   0.2205
X19   0.1640
X20   0.3500
X21   0.3567
X22   0.2076
X23   0.4190
X24   0.2573
X25   0.3737
X26   0.2129

WoE values, listed per bucket in variable order (variables with fewer buckets have no entry in the higher buckets):

Bucket 1: -2.118 -2.521 -2.110 -2.501 -1.460 -1.689 -1.594 -2.103 -1.559 -1.642 -2.096 -1.938 -2.055 -1.616 -1.929 -1.874 -1.866 -1.334 -1.660 -2.320 -1.027 -1.613 -2.249 -0.842 -2.952 -1.475
Bucket 2: -2.007 -2.039 -2.448 -1.327 -2.176 -2.386 -2.185 -2.203 -2.506 -2.839 -2.805 -3.651 -2.537 -2.698 -2.500 -2.283 -2.344 -2.134 -2.308 -2.879 -2.688 -3.044 -3.067 -2.228 -2.688 -2.741
Bucket 3: -2.190 -2.309 -1.714 -2.286 -2.653 -2.617 -2.074 -2.003 -2.666 -1.861 -2.014 -1.609 -2.040 -1.859 -1.658 -1.237 -2.047 -2.540 -2.051 -1.353 -2.446 -1.205 -4.111 -2.263 -2.433
Bucket 4: -1.657 -2.228 -1.934 -2.339 -2.305 -2.419 -2.553 -2.291 -2.241 -3.073 -2.757 -2.532 -2.366 -2.828 -2.584 -2.024 -2.509 -1.781 -2.290 -0.953 -2.248
Bucket 5: -1.923 -1.472 -1.434 -1.456 -2.290 -1.825 -1.179 -1.193 -1.484 -2.197 -1.020 -1.050 -2.102 -1.946 -1.705 -2.439 -2.085 -1.910
Bucket 6: -2.192

Table 5.1: Gini and Weight of Evidence for every bucket of the potential explanatory variables.

As a consequence of the 0.1 cut-off, X1 and X8 are eliminated. Hereafter, the values in the coefficients set are transformed to the above WoE values corresponding to their own buckets.

5.1.2 Logistic regression

To summarize, each row of the coefficients set consists of one dependent variable and 24 potential explanatory variables. The significance level to enter the model is αin = 0.05; this level is commonly used, since it roughly corresponds to a two-standard-deviation criterion. In order to increase the accuracy, a stricter significance level of αout = 0.025 is chosen to stay in the model. This yields the following selection, using the stepwise logistic regression of Section 4.3:


Summary of Stepwise Selection

Step  Entered  Removed  DF  Number In  Score Chi-Square  Wald Chi-Square  Pr>ChiSq
1     X21      -        1   1          210.3942          -                < .0001
2     X20      -        1   2          33.5447           -                < .0001
3     X26      -        1   3          19.0019           -                < .0001
4     X6       -        1   4          13.4211           -                0.0002
5     X23      -        1   5          11.8252           -                0.0006
6     X4       -        1   6          10.3731           -                0.0013
7     X25      -        1   7          6.7326            -                0.0095
8     -        X21      1   6          -                 0.8005           0.3710
9     -        X20      1   5          -                 3.9857           0.0459
10    X24      -        1   6          4.3657            -                0.0367
11    -        X24      1   5          -                 4.3617           0.0368

Table 5.2: Summary of the stepwise selection for Model 1.

After the last step, the model building terminated because the last variable entered was removed immediately; otherwise, the procedure would enter an infinite loop. It seems that X21 depends strongly on X25. Indeed, before adding X25, pout for X21 was lower than 0.025 (otherwise, it would have been removed), but after adding X25, pout increased to 0.3710. This selection procedure reduced the number of explanatory variables to five:

• X26
• X6
• X23
• X4
• X25

Next, the coefficients for the logistic regression are calculated by the Maximum Likelihood method (recall Section 2.3). This yielded the following coefficients β̂i corresponding to Xi (β0 is called the intercept):

Analysis of Maximum Likelihood Estimates

Parameter  DF  Estimate  Standard Error  Wald Chi-Square  Pr>ChiSq
β0         1   2.9154    0.4678          38.8395          < .0001
β4         1   0.5361    0.1625          10.8831          0.0010
β6         1   0.4459    0.1079          17.0636          < .0001
β23        1   0.4954    0.0724          46.8293          < .0001
β25        1   0.3316    0.0789          17.6622          < .0001
β26        1   0.5282    0.1227          18.5208          < .0001

Table 5.3: Analysis of the coefficients for the significant variables for Model 1.
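As an illustration of the Maximum Likelihood step recalled from Section 2.3, a minimal Newton–Raphson fit of a one-variable logistic regression (with intercept) can be sketched as follows. This is an educational sketch, not the production routine used in the thesis.

```python
import math

# Newton-Raphson maximum likelihood for logistic regression with one
# explanatory variable: beta <- beta + I^{-1} * gradient, where I is the
# Fisher information (the negative Hessian of the log-likelihood).

def fit_logistic(xs, ys, iters=25):
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p                # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w                   # Fisher information entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += ( h11 * g0 - h01 * g1) / det   # 2x2 Newton step
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1
```

On a two-group sample the fit reproduces the empirical log-odds of each group exactly, which is a convenient sanity check.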


Hence, the conclusion of this methodology is the following estimate for the probability of default, i.e. the probability of removing at least M1 from the bank during the coming period T1 (recall Definition 1.3.1):

P[Y = 1 | X = x] = 1 / (1 + exp(−2.915 − 0.536 x4 − 0.446 x6 − 0.495 x23 − 0.332 x25 − 0.528 x26))

It is also interesting to mention the odds ratios of the estimates in Table 5.3. Recall the odds ratio, OR = P[Y = 1 | X = x] / P[Y = 0 | X = x] = exp(x⊤β). Now, what happens when a customer switches to another bucket of a specific variable Xj? Suppose a customer switches from bucket q to bucket r of variable Xj, and let ∆j = WoEr − WoEq be the corresponding difference in WoE of variable Xj. Then the new odds ratio of this customer is equal to his/her old odds ratio multiplied by a factor

exp(x⊤β + ∆j βj) / exp(x⊤β) = exp(∆j βj) = exp(βj)^∆j.

These factors exp(βj) can be found in Table 5.4 below. Moreover, the 95% confidence intervals are calculated. Recall the rule of thumb for a 95% confidence interval: [β̂j − 2 SE(β̂j), β̂j + 2 SE(β̂j)]. Consequently, the confidence interval for the odds ratio estimate of βj is

[ exp(β̂j − 2 SE(β̂j)), exp(β̂j + 2 SE(β̂j)) ].

The following table shows these odds ratio estimates and confidence intervals for the significant variables of Model 1:

Odds Ratio Estimates

Effect  Point Estimate  95% Confidence Limits
β4      1.709           (1.243, 2.350)
β6      1.562           (1.264, 1.930)
β23     1.641           (1.424, 1.891)
β25     1.393           (1.194, 1.626)
β26     1.696           (1.333, 2.157)

Table 5.4: Odds Ratio Estimates for the significant variables for Model 1.
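The odds ratio estimates can be recomputed directly from the Table 5.3 estimates and standard errors; the short script below does so with 1.96·SE, which reproduces the printed limits slightly more precisely than the 2·SE rule of thumb.

```python
import math

# Recompute Table 5.4 from the (estimate, standard error) pairs of
# Table 5.3. Point estimate exp(beta); limits exp(beta -/+ z * SE).

estimates = {
    "b4":  (0.5361, 0.1625),
    "b6":  (0.4459, 0.1079),
    "b23": (0.4954, 0.0724),
    "b25": (0.3316, 0.0789),
    "b26": (0.5282, 0.1227),
}

def odds_ratio(beta, se, z=1.96):
    point = math.exp(beta)
    return point, math.exp(beta - z * se), math.exp(beta + z * se)

for name, (b, se) in estimates.items():
    pt, lo, hi = odds_ratio(b, se)
    print(f"{name}: {pt:.3f} ({lo:.3f}, {hi:.3f})")
```

For β4 this prints 1.709 with limits (1.243, 2.350), matching the first row of Table 5.4.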

5.1.3 Out-of-sample performance

To test whether this model also performs well on the out-of-sample set, we can construct a derivative of the earlier mentioned Gini coefficient (Definition 3.3.2). Since every transformed variable takes at most 5 (in one case 6) values, one per bucket, the total number of possible estimated probabilities is finite; in fact, there are 5 ∗ 6 ∗ 3 ∗ 5 ∗ 5 = 2250 different values. Considering each of them as a bucket, we can order these buckets by their probability of default, in decreasing order, and calculate the corresponding out-of-sample Gini coefficient. This yields the graph in Figure 5.1.
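The finiteness of the set of predicted probabilities can be illustrated as follows. The bucket counts (5, 6, 3, 5, 5) follow the text; the WoE grids below are placeholders, not the actual Table 5.1 values.

```python
import itertools
import math

# Every selected variable contributes one of its bucket WoE values, so the
# fitted Model 1 equation can only produce finitely many PD values.

betas = [2.9154, 0.5361, 0.4459, 0.4954, 0.3316, 0.5282]  # b0, b4, b6, b23, b25, b26

woe_grids = [                                 # placeholder WoE values
    [-2.5, -2.0, -1.5, -1.0, -0.5],           # X4:  5 buckets
    [-2.4, -2.0, -1.6, -1.2, -0.8, -0.4],     # X6:  6 buckets
    [-2.2, -1.5, -0.8],                       # X23: 3 buckets
    [-3.0, -2.4, -1.8, -1.2, -0.6],           # X25: 5 buckets
    [-2.7, -2.1, -1.5, -0.9, -0.3],           # X26: 5 buckets
]

def pd_hat(x):
    """PD from the fitted Model 1 equation for one WoE combination x."""
    lin = betas[0] + sum(b * v for b, v in zip(betas[1:], x))
    return 1.0 / (1.0 + math.exp(-lin))

combos = list(itertools.product(*woe_grids))
print(len(combos))   # 2250 = 5 * 6 * 3 * 5 * 5 possible combinations
```

Ordering the resulting PD values in decreasing order gives exactly the bucket ranking used for the out-of-sample Gini curve.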


Figure 5.1: The graph for the out-of-sample Gini coefficient of Model 1.

Note that there are not 2250 buckets in practice: some of them are empty, since the corresponding combination of buckets does not appear in the out-of-sample set. The associated Gini coefficient is 0.395. This is sufficient for a new model, in line with the philosophy of Van Lanschot Bankiers. The numbers of observations and defaults originate from the out-of-sample set; since this information is again market-sensitive, the exact numbers cannot be mentioned.

5.2 Assets-based model

For our second model, we have to split the data points into four subsets again. Now, these data points correspond to units (Definition 1.3.3) instead of customers. Therefore, we replicate each row of the original dataset, i.e. the dataset of Model 1, as described in Definition 1.3.3. This results in a larger dataset than for Model 1. After this replication, the dataset is split into four subsets following the proportions of Figure 4.1. Recall that the four subsets are:

1. Construction set
2. Stop set
3. Coefficients set
4. Out-of-sample set

Again, because of market-sensitivity, the number of observations and the number of defaults will not be mentioned. After constructing these subsets, the procedure is similar to that of the previous model. Specifically, the bucket boundaries are determined on the upper two sets (recall Figure 4.2) and the WoE of each bucket and the Gini coefficient (Ga, recall Definition 3.3.4) of the variable are calculated on the construction set. The order of the buckets for Ga is such that the WoE of bucket Ri, on the construction set, is greater than the WoE of bucket Ri+1. The same Gini cut-off is used, i.e. variables with a Gini coefficient lower than 0.1 are eliminated. Subsequently, the coefficients set is transformed into the WoE values and stepwise logistic regression is performed on this transformed coefficients set. Finally, out-of-sample performance is evaluated on the out-of-sample set to test the model and to compare it with the previous model.

5.2.1 Transformation

The algorithm of Section 4.2 yielded the following Gini coefficients and Weights of Evidence. Again, there is one variable with a sixth bucket, added manually afterwards for internal reasons. As for the previous model, every value (Gini, WoE) is calculated on the construction set.

Variable   Gini
X1    0.0767
X2    0.1391
X3    0.1646
X4    0.0932
X5    0.1975
X6    0.1883
X7    0.1118
X8    0.1178
X9    0.1700
X10   0.1588
X11   0.1221
X12   0.1824
X13   0.0981
X14   0.1721
X15   0.1841
X16   0.1193
X17   0.0859
X18   0.1349
X19   0.1094
X20   0.2432
X21   0.2691
X22   0.1481
X23   0.2829
X24   0.2263
X25   0.2523
X26   0.2128

WoE values, listed per bucket in variable order (variables with fewer buckets have no entry in the higher buckets):

Bucket 1: -1.363 -1.870 -1.889 -2.282 0.095 0.174 -1.113 -1.451 -0.241 -1.157 -1.452 -1.706 -1.458 -1.168 -1.707 -1.538 -1.633 -1.605 -1.167 0.441 -0.935 -1.574 0.622 -1.086 0.133 -0.739
Bucket 2: -1.758 -1.741 -0.898 -1.759 -1.704 -1.716 -1.599 -1.738 -1.625 -1.808 -1.943 -1.098 -1.882 -1.995 -1.169 -1.892 -1.854 -1.913 -1.564 -2.044 -1.690 -1.722 -1.859 -1.914 -2.056 -1.727
Bucket 3: -1.628 -1.525 -1.909 -1.342 -1.923 -2.006 -1.719 -1.916 -1.777 -2.000 -1.654 -2.147 -1.483 -1.537 -2.131 -1.565 -0.797 0.895 -1.819 -1.525 -1.922 -1.916 -2.107 -0.376 -1.617 -2.051
Bucket 4: -1.850 -0.856 -0.413 -1.787 -1.527 -1.874 -1.851 -1.634 -1.934 -1.500 -1.867 -1.822 -1.812 -1.845 -1.842 -1.767 -1.721 -1.853 -1.361 -2.054 -1.781 -1.606 -2.014 -1.099 -1.780
Bucket 5: -1.596 -1.962 -1.539 -1.517 -0.820 -1.411 -1.625 -1.263 -1.320 -0.868 -1.235 -0.695 -1.472 -1.175 -0.714 -1.064 -1.503 -1.459 -1.008 0.695 -0.087 -0.947 -1.616 -1.537 -1.560
Bucket 6: -1.575

Table 5.5: Gini and Weight of Evidence for every bucket of the potential explanatory variables.

As a consequence of the 0.1 cut-off, X1, X4, X13 and X17 are eliminated. Hereafter, the values in the coefficients set are transformed to the above WoE values corresponding to their own buckets.

5.2.2 Logistic regression

To summarize, each row of the coefficients set consists of one dependent variable and 22 potential explanatory variables. The significance levels for Model 2 are chosen as for Model 1, i.e. the significance level to enter the model is αin = 0.05 and the significance level to stay in the model is αout = 0.025. This yields the following selection, using the stepwise logistic regression of Section 4.3:

Summary of Stepwise Selection

Step  Entered  DF  Number In  Score Chi-Square  Pr>ChiSq
1     X21      1   1          29 473.1085       < .0001
2     X5       1   2          3 975.3660        < .0001
3     X23      1   3          2 394.8830        < .0001
4     X3       1   4          1 994.2795        < .0001
5     X15      1   5          1 457.5246        < .0001
6     X25      1   6          769.9251          < .0001
7     X2       1   7          766.5375          < .0001
8     X7       1   8          595.4548          < .0001
9     X8       1   9          557.0462          < .0001
10    X16      1   10         361.1400          < .0001
11    X26      1   11         325.0290          < .0001
12    X22      1   12         201.9734          < .0001
13    X24      1   13         225.4249          < .0001
14    X20      1   14         190.4513          < .0001
15    X12      1   15         173.0791          < .0001
16    X18      1   16         104.0607          < .0001
17    X19      1   17         47.4746           < .0001
18    X9       1   18         47.9938           < .0001
19    X6       1   19         52.1776           < .0001
20    X10      1   20         9.6499            0.0019

Table 5.6: Summary of the stepwise selection for Model 2.

In contrast to the selection for Model 1 (Table 5.2), the model building terminated because no additional variables met the significance level for entry into the model. Hence, the variables X11 and X14 did not enter the model and were eliminated, leaving 20 variables in the model. Next, the coefficients for the logistic regression are calculated by the Maximum Likelihood method (recall Section 2.3) for these 20 variables. This yielded the following coefficients β̂i corresponding to Xi (again, β0 is called the intercept):


Analysis of Maximum Likelihood Estimates

Parameter  DF  Estimate  Standard Error  Wald Chi-Square  Pr>ChiSq
β0         1   3.3280    0.0661          2 535.3547       < .0001
β2         1   0.6069    0.0242          627.4126         < .0001
β3         1   0.4502    0.0128          1 236.4115       < .0001
β5         1   0.2382    0.0198          144.9817         < .0001
β6         1   0.1327    0.0192          47.9055          < .0001
β7         1   -0.7326   0.0269          742.2494         < .0001
β8         1   0.5154    0.0218          558.0924         < .0001
β9         1   -0.1585   0.0200          63.0922          < .0001
β10        1   0.0660    0.0213          9.6507           0.0019
β12        1   0.2579    0.0220          137.3631         < .0001
β15        1   0.2433    0.0222          120.2898         < .0001
β16        1   0.3592    0.0188          364.6035         < .0001
β18        1   0.1668    0.0173          92.8419          < .0001
β19        1   -0.2495   0.0329          57.5685          < .0001
β20        1   -0.2426   0.0162          225.0537         < .0001
β21        1   0.4885    0.0143          1 174.7073       < .0001
β22        1   -0.2359   0.0187          159.6159         < .0001
β23        1   0.4352    0.0141          954.4685         < .0001
β24        1   0.2105    0.0124          287.9267         < .0001
β25        1   0.2077    0.0112          343.5358         < .0001
β26        1   0.2458    0.0129          362.7547         < .0001

Table 5.7: Analysis of the coefficients for the significant variables for Model 2.

Hence, the conclusion of this methodology is the following estimate for the probability of default, i.e. the probability of removing at least M2 from the bank during the coming period T2 (recall Definition 1.3.4):

P[Y = 1 | X = x] = 1 / (1 + exp(−x⊤β)),   with β = (β0, . . . , β26)⊤ as in Table 5.7.

As described in Section 5.1.2, it is also interesting to mention the odds ratios of the estimates in Table 5.7, together with their 95% confidence intervals, for Model 2.


Odds Ratio Estimates

Effect  Point Estimate  95% Confidence Limits
β2      1.835           (1.750, 1.924)
β3      1.569           (1.530, 1.609)
β5      1.269           (1.221, 1.319)
β6      1.142           (1.100, 1.186)
β7      0.481           (0.456, 0.507)
β8      1.674           (1.604, 1.747)
β9      0.853           (0.821, 0.887)
β10     1.068           (1.025, 1.114)
β12     1.294           (1.240, 1.351)
β15     1.276           (1.221, 1.332)
β16     1.432           (1.380, 1.486)
β18     1.182           (1.142, 1.222)
β19     0.779           (0.731, 0.831)
β20     0.785           (0.760, 0.810)
β21     1.630           (1.585, 1.676)
β22     0.790           (0.761, 0.819)
β23     1.545           (1.503, 1.589)
β24     1.234           (1.205, 1.265)
β25     1.231           (1.204, 1.258)
β26     1.279           (1.247, 1.311)

Table 5.8: Odds Ratio Estimates for the significant variables for Model 2.

5.2.3 Out-of-sample performance

As in Figure 5.1, we can construct an out-of-sample Gini coefficient, derived from the Gini coefficient of Definition 3.3.4. For Model 2, there are 5^21 ∗ 3 ≈ 1.43 ∗ 10^15 compositions of buckets (one bucket for each of the 22 variables). However, the vast majority of these compositions do not appear in the out-of-sample set. Each composition that does occur yields a probability value, and we can consider it as a bucket. Next, we order these buckets by their probability of default, in decreasing order, and calculate the corresponding Gini coefficient. This yields Figure 5.2, the analogue of Figure 5.1.


Figure 5.2: The graph for the out-of-sample Gini coefficient of Model 2.

The associated Gini coefficient is 0.351. Note that this value is lower than the Gini coefficient of Model 1 (Section 5.1.3), but still high enough for a new model within Van Lanschot Bankiers. Moreover, Figure 5.2 yields an interesting result: at the start, the Gini curve is almost equal to the ideal situation. This implies that the first few buckets consist of only defaults; in other words, the money in these buckets is almost certain to be removed. This is extremely useful information for the bank. The numbers of observations and defaults originate from the out-of-sample set. Indeed, there are quite a few buckets consisting almost exclusively of defaults: it turns out that 83% of the riskiest 3% of units is indeed removed within the time period T2.

5.3 Modified Assets-based model

Using the splitting procedure of Section 5.2, some customers can be contained in multiple sets, for instance in the coefficients set and in the out-of-sample set. Therefore, one could doubt the value of the out-of-sample performance, since customers in the out-of-sample set can also be contained in other sets. We therefore repeated the procedure of Section 5.2 with the following modified splitting procedure. Recall Section 5.2, where we split the units proportionally over the four subsets after duplicating the customers as described in Definition 1.3.3. For the model in this section, called the Modified assets-based model (Model 2'), we split the dataset slightly differently: we use the subsets of Model 1, where the customers are split proportionally over the subsets, and only then duplicate the customers as in Definition 1.3.3. Hence, a customer cannot be contained in more than one subset. Subsequently, we proceed as in the previous section. We only report the results of each step; details are omitted, since they are similar to those of the previous section. The exact numbers of defaults and observations will not be mentioned because of market-sensitivity.

5.3.1 Transformation

Variable   Gini
X1    0.1658
X2    0.2783
X3    0.2504
X4    0.1069
X5    0.2714
X6    0.2442
X7    0.1246
X8    0.1118
X9    0.2618
X10   0.2444
X11   0.2650
X12   0.3008
X13   0.2038
X14   0.2627
X15   0.2528
X16   0.2157
X17   0.2224
X18   0.2432
X19   0.1584
X20   0.3015
X21   0.2763
X22   0.2176
X23   0.3528
X24   0.2337
X25   0.3618
X26   0.2981

WoE values, listed per bucket in variable order (variables with fewer buckets have no entry in the higher buckets):

Bucket 1: -1.848 -2.088 -1.928 -2.336 0.896 0.690 -1.525 -1.645 0.680 -1.855 -1.976 -0.800 -1.541 -1.828 -0.504 -1.933 -1.800 -0.637 -1.169 -0.415 -0.675 -1.619 1.037 -0.530 0.361 0.080
Bucket 2: -1.206 1.755 2.404 -1.284 -1.968 -1.930 -1.380 -1.789 -1.736 1.708 -1.511 -2.652 -2.058 2.464 -1.790 -0.451 -1.336 -2.040 -1.846 -2.260 -1.964 -1.811 -1.957 -1.632 -2.640 -1.697
Bucket 3: -1.893 -1.990 -1.995 -1.695 -1.769 -1.319 -1.894 -1.282 -2.068 -1.940 -1.895 -1.594 -1.777 -1.969 -2.022 -1.977 -2.212 -1.820 -1.995 -0.039 -1.937 -2.188 -2.019 -1.956 -2.104
Bucket 4: -1.364 -1.581 -0.094 -2.025 -0.543 -1.193 -1.445 -1.039 -1.458 -1.983 -0.718 -1.581 0.016 -1.654 -1.618 -1.594 -1.574 -1.848 -1.309 -0.882 -1.864
Bucket 5: -1.057 -0.574 -1.671 -1.413 -1.441 0.101 0.062 -2.428 -0.578 -0.765 -0.765 0.607 -0.323 0.107 -0.517 -1.295 -1.533
Bucket 6: -1.482

Table 5.9: Gini and Weight of Evidence for every bucket of the potential explanatory variables.

None of the 26 variables were eliminated, since they all met the 0.1 cut-off. We proceed to the logistic regression.


5.3.2 Logistic regression

Summary of Stepwise Selection

Step  Entered  DF  Number In  Score Chi-Square  Pr>ChiSq
1     X25      1   1          7 651.2352        < .0001
2     X15      1   2          2 276.3215        < .0001
3     X23      1   3          1 896.2813        < .0001
4     X8       1   4          1 942.4861        < .0001
5     X11      1   5          818.0979          < .0001
6     X4       1   6          753.8284          < .0001
7     X22      1   7          391.9289          < .0001
8     X26      1   8          670.8292          < .0001
9     X21      1   9          475.5891          < .0001
10    X19      1   10         314.4434          < .0001
11    X20      1   11         208.0004          < .0001
12    X17      1   12         184.8396          < .0001
13    X10      1   13         106.4905          < .0001
14    X16      1   14         97.2573           < .0001
15    X3       1   15         84.6692           < .0001
16    X2       1   16         70.6319           < .0001
17    X1       1   17         49.8075           < .0001
18    X7       1   18         38.0772           < .0001
19    X14      1   19         36.1624           < .0001
20    X6       1   20         28.5486           < .0001
21    X5       1   21         122.4905          < .0001
22    X9       1   22         54.5414           < .0001
23    X24      1   23         22.2079           < .0001

Table 5.10: Summary of the stepwise selection for Model 2'.

Three variables (X12, X13 and X18) did not enter the model, as they did not meet the significance level for entry. The coefficients of the remaining 23 variables can be found in Table 5.11.


Analysis of Maximum Likelihood Estimates

Parameter  DF  Estimate  Standard Error  Wald Chi-Square  Pr>ChiSq
β0         1   1.6990    0.0709          573.5742         < .0001
β1         1   -0.1043   0.0149          48.9542          < .0001
β2         1   -0.1112   0.0136          66.6761          < .0001
β3         1   0.0900    0.0102          78.6062          < .0001
β4         1   0.4049    0.0175          536.8597         < .0001
β5         1   0.1740    0.0128          184.2185         < .0001
β6         1   -0.1517   0.0143          112.8124         < .0001
β7         1   0.1075    0.0186          33.2611          < .0001
β8         1   0.9030    0.0214          1 773.5288       < .0001
β9         1   -0.0898   0.0116          60.0366          < .0001
β10        1   0.2239    0.0159          197.6031         < .0001
β11        1   -0.2381   0.0138          298.0895         < .0001
β14        1   -0.1391   0.0186          55.6806          < .0001
β15        1   0.3256    0.0095          1 176.6910       < .0001
β16        1   -0.1245   0.0136          83.2700          < .0001
β17        1   -0.1371   0.0128          114.4760         < .0001
β19        1   0.3600    0.0190          360.5192         < .0001
β20        1   0.1495    0.0104          206.6763         < .0001
β21        1   0.2333    0.0112          437.3042         < .0001
β22        1   -0.3891   0.0119          1 069.3971       < .0001
β23        1   0.1528    0.0103          219.6373         < .0001
β24        1   -0.0513   0.0109          22.2054          < .0001
β25        1   0.2898    0.0070          1 705.8961       < .0001
β26        1   0.1976    0.0098          409.3936         < .0001

Table 5.11: Analysis of the coefficients for the significant variables for Model 2'.

Now we can assess the out-of-sample performance as in the previous sections. The major difference relative to the out-of-sample performance of Model 2 (Section 5.2.3) is that customers in the out-of-sample set are not contained in any other subset.


5.3.3 Out-of-sample performance

Figure 5.3: The graph for the out-of-sample Gini coefficient of Model 2'.

The associated Gini coefficient is 0.327, again still high enough for a new model within Van Lanschot Bankiers. Although we split our dataset differently than in Section 5.2, the Gini curve in Figure 5.3 is quite similar to that in Figure 5.2: at the start, the Gini curve is almost equal to the ideal situation. Hence, the first few buckets consist of only defaults, which is extremely useful information for the bank. It turns out that 87% of the riskiest 4% of units is removed within the time period T2.


CHAPTER 6. CONCLUSION

Chapter 6

Conclusion

The purpose of this thesis was to investigate whether it is possible to model and predict the removal of customers' assets from the bank; specifically, to identify the risky customers, the defaults. The methodology is composed of a variable transformation using a bucket algorithm and a logistic regression model. The algorithm determined the best bucket boundaries, i.e. those with the highest Gini coefficient, and how many buckets were needed for each potential explanatory variable. Currently, the bucket algorithm has a limit of 5 buckets per variable, due to computational time. It takes no effort to eliminate this limit; however, the computational time would then increase tremendously. Once the algorithm is implemented more efficiently, for instance by programmers, the limit can be raised, or even removed.

Three different models have been examined. The first model, Model 1, was based on customers and predicted the probability of removing a certain amount, M1, within a given time period T1. The customers in the dataset were spread over the four subsets needed for this approach: a construction set, a stop set, a coefficients set and an out-of-sample set. After reviewing the significance of the potential variables, 5 variables were labelled as significant and used for the logistic regression model. The out-of-sample performance of this model yielded an out-of-sample Gini coefficient of G = 0.395.

The second model, Model 2, was based on assets. The dataset is transformed into a dataset consisting of units instead of customers. Subsequently, this transformed dataset is divided into the four subsets; note that units in two different subsets can come from the same customer. It turned out that 20 variables were significant. The out-of-sample performance of the corresponding logistic regression model yielded an out-of-sample Gini coefficient of G = 0.351. The drawback of this method is that the four subsets are not necessarily disjoint with respect to customers.
Therefore, we developed a third model, Model 2', by constructing the four subsets differently. First, we divided the customers into the four subsets, as in Model 1; thereafter, we transformed the four subsets separately into sets of units. Thus, the four subsets were disjoint with respect to customers and consisted of units, as in Model 2. Working as for Model 2 with these different subsets yielded a logistic regression model with 23 variables. The out-of-sample performance of this modified model produced an out-of-sample Gini coefficient of G = 0.327.

Hence, it turned out that all three models sufficiently met the criteria for a new model within Van Lanschot Bankiers. Moreover, according to this valuation, Model 1 would be the most powerful one and Model 2' the weakest one. However, as illustrated in Figure 5.2 and Figure 5.3, the Gini curves of Models 2 and 2' start almost equal to the ideal situation. In other words, units in the most risky bucket(s) are very likely to be removed, so the customers corresponding to these units are very likely to remove the majority of their assets. Thus, it seems that Model 2 and Model 2' can identify defaults. Model 2' is even slightly more powerful


than Model 2: 87% of the riskiest 4% of units in the out-of-sample set is indeed removed within the time period T2, against 83% of the riskiest 3% of units for Model 2. This is a very useful result for Van Lanschot Bankiers. Indeed, by identifying the current risky customers, they can be monitored, and once they exhibit runaway behaviour, the bank can pay extra attention to the relationship between the bank, or the account manager, and the customer. To conclude, despite having the lowest out-of-sample Gini coefficient, Model 2' is the most powerful in terms of identifying the risky customers. This prototype, Model 2', is being tested in practice right now.


CHAPTER 7. SUGGESTIONS FOR FUTURE RESEARCH

Chapter 7

Suggestions for future research

As stated in the conclusion (Chapter 6), all models sufficiently met the criteria for a new model within Van Lanschot Bankiers. Nevertheless, there is scope for improvement. During my internship I encountered several items for improvement that I was unable to pursue because of time constraints. The major items are recapitulated below.

Seasonality

First of all, the data used for this thesis was a snapshot at the end of one month. It would be better to use data obtained from each month of the year: this greatly reduces seasonal effects and increases the dataset tremendously as well. Moreover, the data used in this thesis originates from the past; if we use the model in the future, this data would be outdated. It may be the case that macroeconomic effects have changed, which may influence the model. Therefore, it would be better to involve macroeconomic variables as well. However, it is hard to model macroeconomic variables, because a dataset over a prolonged period is needed. Note that these improvements are purely data issues.

Efficiency of the bucket algorithm

As explained in the design (Section 4.2.1), the bucket algorithm repeats similar calculations quite frequently. If this calculation can be built up more efficiently, i.e. shorter and in fewer steps, the algorithm would work significantly faster. This can certainly be done by someone with more experience in programming. Once the computational time is reduced, the limit of 5 buckets can be removed and more variables can be tested in less time. Removing the bucket limit yields a higher explanatory power for the variables, since it can increase their distinctiveness, and the ability to test more variables means we no longer need to reduce the number of variables using expert opinions. Both will increase the performance of the prototype.

Also, after analysing the bucket boundaries, it seems that a mild form of overfitting can still occur. Specifically, there are a few variables with a very small bucket consisting almost only of defaults, which resembles overfitting. It may be the case that adding such a small bucket accidentally increases the Gini on the stop set as well, so that the algorithm does not stop. This can be prevented by requiring a minimum improvement of the Gini on the stop set: an 'accidental increase' of the Gini on the stop set is then no longer sufficient for adding the bucket boundary, so the event described above will not occur.
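The suggested minimum-improvement stop rule could be sketched as follows. `candidate_boundaries` and `gini_on_stop_set` are hypothetical hooks into the bucket algorithm of Section 4.2, and the threshold value is illustrative.

```python
# A candidate boundary is only accepted if the Gini on the stop set
# improves by at least `min_gain`, so tiny overfitted buckets that cause
# only an 'accidental increase' are rejected.

def grow_buckets(boundaries, candidate_boundaries, gini_on_stop_set,
                 min_gain=0.005, max_buckets=5):
    best_gini = gini_on_stop_set(boundaries)
    while len(boundaries) + 1 < max_buckets:
        scored = [(gini_on_stop_set(boundaries + [b]), b)
                  for b in candidate_boundaries(boundaries)]
        if not scored:
            break
        gain, b = max(scored)            # best candidate on the stop set
        if gain < best_gini + min_gain:
            break                        # improvement too small: stop
        boundaries = sorted(boundaries + [b])
        best_gini = gain
    return boundaries
```

Setting `min_gain=0` recovers the current behaviour, where any increase on the stop set is accepted.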

Variable selection

As motivated in Section 4.3, we have chosen the stepwise selection method, in which variables enter or leave the model based on their chi-square scores (recall Definition 3.4.4 and Definition 3.4.6). However, it might be interesting to select the variables based on a Gini coefficient instead, since this coefficient measures the power of distinctiveness. For instance, one could apply the following construction, which needs two sets, a construction set and a stop set, just like the bucket algorithm of Section 4.2.1. Firstly, calculate the out-of-sample Gini coefficient (recall Section 5.1.3) on the construction set for the logistic regression models consisting of one variable, and fix the variable with the highest Gini coefficient. Subsequently, calculate the Gini coefficient, on the construction set and on the stop set, for the logistic models consisting of this fixed variable and one of the remaining variables. Continue this process until the Gini on the stop set starts to decrease. In this way, the best variables, i.e. those with the highest explanatory power, are added to the model, and overfitting is avoided by the stop set. This approach is similar to the bucket algorithm. The drawback of this construction is that the existing stepwise selection tool would have to be disassembled and re-programmed completely; however, it is definitely possible.

Furthermore, after consultation with several account managers, it seems that there is scope for improvement in the definition of the explanatory variables. Adapting existing variables or adding new variables requested by these account managers may improve the predictive power of our models as well.
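The proposed Gini-driven forward selection could be sketched like this. `gini_of_model` is a hypothetical hook: it fits a logistic regression on the given variables and returns the model's Gini coefficient on the given dataset.

```python
# Greedy selection: add the variable that maximises the construction-set
# Gini, and stop as soon as the stop-set Gini no longer improves.

def gini_forward_select(variables, gini_of_model, construction, stop):
    selected = []
    best_stop = gini_of_model(selected, stop)
    while True:
        remaining = [v for v in variables if v not in selected]
        if not remaining:
            return selected
        # candidate with the highest construction-set Gini
        _, best = max((gini_of_model(selected + [v], construction), v)
                      for v in remaining)
        if gini_of_model(selected + [best], stop) <= best_stop:
            return selected          # stop-set Gini starts to decrease
        selected.append(best)
        best_stop = gini_of_model(selected, stop)
```

As with the bucket algorithm, the stop set acts as the guard against overfitting while the construction set drives the greedy choice.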

Social Media

Currently, another intern at Van Lanschot Bankiers, Marianne van Rosmalen, is working on her thesis, Social Media and credit risk prediction models: a golden formula?. The purpose of her research is to explore whether the integration of social media into the current credit risk models of banks will increase the accuracy of default prediction. It might be interesting to explore whether integrating social media into the models in this thesis would increase their predictive accuracy as well.


Appendix A

Additional theorems

Theorem A.1 (Continuous Mapping Theorem). Let g : R^k → R^m be continuous at every point of a set C such that P[X ∈ C] = 1.
1. If Xn →d X, then g(Xn) →d g(X);
2. If Xn →p X, then g(Xn) →p g(X);
3. If Xn →a.s. X, then g(Xn) →a.s. g(X).
Proof. A.W. van der Vaart [15].

Theorem A.2 (Law of Large Numbers). Let Y1, . . . , Yn be an i.i.d. random sample with mean µ and variance σ², such that E|Y1| < ∞. Then the sequence of sample means converges almost surely to µ: Ȳn →a.s. µ. Proof. Lee J. Bain et al. [3].

Theorem A.3 (Multivariate Central Limit Theorem). Let Y1, Y2, . . . be i.i.d. random vectors in R^k with mean vector µ = E[Y1] and covariance matrix Σ = E[(Y1 − µ)(Y1 − µ)ᵀ]. Then

  (1/√n) ∑_{i=1}^n (Yi − µ) = √n (Ȳn − µ) →d N_k(0, Σ).

Proof. A.W. van der Vaart [15].

Lemma A.4 (Slutsky). Let Xn, X and Yn be random vectors or variables. If Xn →d X and Yn →d c for a constant c, then
1. Xn + Yn →d X + c;
2. Yn Xn →d cX;
3. Yn⁻¹ Xn →d c⁻¹ X, provided c ≠ 0.
Proof. A.W. van der Vaart [15].
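As an illustration of how these results are typically combined (a standard textbook derivation, not part of the thesis itself), consider the studentised sample mean of an i.i.d. sample Y1, . . . , Yn with mean µ and finite variance σ² > 0:

```latex
\begin{align*}
\sqrt{n}\,(\overline{Y}_n - \mu)
  &\xrightarrow{d} N(0, \sigma^2)
  && \text{(CLT, Theorem A.3 with } k = 1\text{)} \\
S_n = \Big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} (Y_i - \overline{Y}_n)^2\Big)^{1/2}
  &\xrightarrow{p} \sigma
  && \text{(LLN, Theorem A.2, and CMT, Theorem A.1)} \\
\frac{\sqrt{n}\,(\overline{Y}_n - \mu)}{S_n}
  &\xrightarrow{d} \frac{1}{\sigma}\, N(0, \sigma^2) = N(0, 1)
  && \text{(Slutsky, Lemma A.4, part 3)}
\end{align*}
```

This is the usual pattern: the CLT supplies the limiting distribution, the LLN and CMT supply a consistent scale estimate, and Slutsky's lemma joins the two.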


Theorem A.5 (Uniform Law of Large Numbers). Let g be a function on Ω × Θ, where Ω is a Euclidean space and Θ is a compact subset of a Euclidean space. Let g(x, β) be a continuous function of β for all x ∈ Ω and a measurable function of x for all β ∈ Θ. Assume also that |g(x, β)| ≤ h(x) for all x ∈ Ω, β ∈ Θ, with E|h(X)| < ∞. If X1, . . . , Xn is a random sample of X, then (1/n) ∑_{i=1}^n g(Xi, β) →p E[g(X, β)] uniformly for all β ∈ Θ. In other words,

  sup_{β∈Θ} | (1/n) ∑_{i=1}^n g(Xi, β) − E[g(X, β)] | →p 0.

Proof. Robert I. Jennrich [6].

Theorem A.6. Let A : R^n → R^k be a linear map, let c ∈ R^k and suppose Y ∼ N_n(µ, Σ). Then AY + c ∼ N_k(Aµ + c, AΣAᵀ). So a linear transformation of a normally distributed vector is again normally distributed. Proof. Marno Verbeek [16].


Bibliography

[1] S. Alink, I. Snihir. PD-Segmentatiedocument, 2010.
[2] B. Baesens, T. van Gestel, S. Viaene, M. Stepanova, J. Suykens, J. Vanthienen. Benchmarking state-of-the-art classification algorithms for credit scoring, 2003.
[3] Lee J. Bain, M. Engelhardt. Introduction to Probability and Mathematical Statistics, 1992.
[4] I. Buchan. Calculating the Gini coefficient of inequality, 2002.
[5] Scott A. Czepiel. Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation, 2002.
[6] Robert I. Jennrich. Asymptotic Properties of Non-Linear Least Squares Estimators, 1969.
[7] C.T. Kelley. Solving Nonlinear Equations with Newton's Method, 1987.
[8] V. Lazarov, M. Capota. Churn Prediction, 2007.
[9] P. McCullagh, J.A. Nelder. Generalized Linear Models, 1983.
[10] Douglas C. Montgomery, Elizabeth A. Peck. Introduction to Linear Regression Analysis, 1992.
[11] D. van den Poel, B. Larivière. Customer Attrition Analysis for Financial Services Using Proportional Hazard Models, 2003.
[12] D. Popović, B. D. Bačić. Churn Prediction Model in Retail Banking Using Fuzzy C-Means Algorithm, 2009.
[13] A. van Rensen. Zorgen voor morgen, 2008.
[14] Kattamuri S. Sarma. Combining Decision Trees with Regression in Predictive Modeling with SAS Enterprise Miner.
[15] A.W. van der Vaart. Asymptotic Statistics, 1998.
[16] Marno Verbeek. A Guide to Modern Econometrics, 2008.

