An empirical investigation of alternative semi-supervised segmentation methodologies

Segmentation of data for the purpose of enhancing predictive modelling is a well-established practice in the banking industry. Unsupervised and supervised approaches are the two main types of segmentation and examples of improved performance of predictive models exist for both approaches. However, both focus on a single aspect – either target separation or independent variable distribution – and combining them may deliver better results. This combination approach is called semi-supervised segmentation. Our objective was to explore four new semi-supervised segmentation techniques that may offer alternative strengths. We applied these techniques to six data sets from different domains, and compared the model performance achieved. The original semi-supervised segmentation technique was the best for two of the data sets (as measured by the improvement in validation set Gini), but others outperformed for the other four data sets.

Significance:

We propose four newly developed semi-supervised segmentation techniques that can be used as additional tools for segmenting data before fitting a logistic regression.
In all comparisons, using semi-supervised segmentation before fitting a logistic regression improved the modelling performance (as measured by the Gini coefficient on the validation data set) compared to using unsegmented logistic regression.



Introduction
According to Thomas [3], the goal of segmentation is to achieve more accurate, robust and transparent predictive models that allow lenders to better serve the segments identified. The origins of segmentation lie in marketing survey analysis, with the first application by Belson [4] when studying the effects of BBC broadcasts in England. (For more information on the history of segmentation, refer to Morgan et al. [5]) The only early approach still in broad use today is chi-squared automatic interaction detection [6], which was developed initially by Kass [7].
Predictive modelling refers to the use of statistical methods to construct formulae to estimate a target variable based on various explanatory variables. For this paper the target variables are binary, i.e. there are only two outcomes. The basis for model comparison is 'lift', i.e. the ability of the models to distinguish between the two outcomes compared to a naïve estimate [17]. There are several ways to measure lift, and for this paper the Gini coefficient was chosen.
In this paper, the focus is segmentation when developing predictive models, irrespective of the application – credit risk, marketing, financial risk management, fraud detection, process monitoring, health and medicine, environmental analysis, etc. The results should therefore be of interest to researchers in any scientific or other field in which such models are applied. More specifically, this paper increases the number of available segmentation techniques by proposing four alternative semi-supervised segmentation techniques.
There are two main types of segmentation – supervised and unsupervised – and the former is favoured in predictive modelling. Supervised techniques are used to identify cases that act alike, i.e. where 'independent' predictors display similar predictive patterns relative to a 'dependent' target variable. Separate segments are required to address interactions in which predictive patterns change with the values of other predictors, especially when developing generalised linear models. Interactions are often related to the target's value, and the focus is typically on maximising target separation – or impurity – between segments [6]. The most obvious examples are decision trees derived using recursive partitioning algorithms, which identify homogeneous risk groups on the assumption that they will display the greatest interactions. This is not always the case. By contrast, unsupervised segmentation [8] identifies subjects that look alike, i.e. have variables with similar values. It maximises segments' dissimilarities based on a distance function, with no dependent variable (one does not need a target). The most obvious examples are cluster and factor analysis, most commonly used in marketing.
The choice between supervised and unsupervised segmentation depends on the application and requirements of the models developed [9], and many examples of improved model performance exist for both [10]. However, both focus on a single aspect (i.e. act or look alike) and so using them together may deliver better results.
This combination approach is called semi-supervised segmentation (SSS) [12][13]. It has many similarities with semi-supervised clustering [14], supervised clustering [15] and semi-supervised semantic segmentation [16] (used more in image processing). For more detail on the differences and similarities see Breed [11]. In this paper, we explore four newly developed variations of an existing technique to see whether they can provide further benefits.
Six data sets from different disciplines were used, each of which was split into a training and a validation set. The five different SSS approaches were applied to each to see which worked best, with models built per segment using logistic regression. A further model was developed on the unsegmented data. Of the five approaches, four are new alternatives and form the main contribution of this paper. They were inspired by an existing technique, semi-supervised segmentation using k-means clustering and information value (SSSKMIV), which was explored in Breed et al. [13] and described in more detail in a recent PhD thesis [11]. K-means clustering is used to measure the independent variable distribution, and information value for target separation. A 'supervised weight' controls the balance between the two aspects [13]. The algorithm is quite complex and calculation intensive, so alternatives were sought. The four variations are:

Variation 1:
We replaced the information value with the chi-squared test statistic and call this technique SSSKMCSQ (semi-supervised segmentation as applied to k-means using chi-squared). The chi-squared calculation has similarities with the Hosmer-Lemeshow statistic, and further information can be found in Hand [6].

Variation 2:
We developed a density-based semi-supervised technique using Wong's density-clustering algorithm [18]. We call this the SSSWong technique (semi-supervised segmentation applied to Wong's density clustering methodology).

Variation 3:
We developed a semi-supervised technique with segment size equality (SSE). We call this the SSSKMIV SSE technique (semi-supervised segmentation applied to the k-means algorithm using information value as supervised component, with the addition of segment size equality).

Variation 4:
These techniques (SSSKMIV, SSSKMCSQ, SSSKMIV SSE) have some similarities with the k-means semi-supervised segmentation algorithm proposed in Peralta et al. [19], which is called LK-Means. This methodology has many similarities to SSS techniques, but also a number of clear differences [11]. Our fourth variation augments this existing semi-supervised technique [11] to make its results comparable to the others. It is thus not really new, but an existing technique adapted to be comparable with the other SSS techniques.

Semi-supervised techniques
Both unsupervised and supervised segmentation make intuitive sense depending on the application and the requirements of the models developed [9], and many examples exist in which the use of either improved model performance [10]. However, both focus on a single aspect (i.e. either target separation or independent variable distribution) and using them in tandem might deliver better results. Five semi-supervised techniques are described here, four of which are new.

Semi-supervised segmentation: SSSKMIV
This approach is explored in Breed et al. [20] and described in more detail in a recent PhD thesis [11], and will be used as the first (original) segmentation method. It is called SSSKMIV, an abbreviation for semi-supervised segmentation using k-means clustering and information values, where k-means is used to assess independent variable distributions, and information values for target separation.
The implementation of this approach is quite complex and calculation intensive [11,20]. Further, the information value formula demands that there be at least one event and one non-event each time (to avoid division by zero), and results can be distorted by small numbers. A general rule is that each bin and segment combination must have at least five events and five non-events.
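The division-by-zero constraint mentioned above can be illustrated with a minimal sketch. It assumes the information value definition that is common in credit scoring (the sum over segments of the difference in class shares times the log of their ratio); the paper's own formula is not reproduced in this section, so the function name `information_value` and its exact form are assumptions made for illustration.

```python
import math

def information_value(events, non_events):
    """Information value across segments (assumed standard scorecard form).

    events[j] and non_events[j] are the counts of events and non-events
    in segment j. Every count must be positive, otherwise the log ratio
    is undefined (the division-by-zero issue described in the text).
    """
    if any(e <= 0 for e in events) or any(n <= 0 for n in non_events):
        raise ValueError("each segment needs at least one event and one non-event")
    total_events = sum(events)
    total_non_events = sum(non_events)
    iv = 0.0
    for e, n in zip(events, non_events):
        share_events = e / total_events          # share of all events in this segment
        share_non_events = n / total_non_events  # share of all non-events in this segment
        iv += (share_non_events - share_events) * math.log(share_non_events / share_events)
    return iv
```

Under this definition, segments with identical event and non-event shares contribute nothing, and a segment with no events (or no non-events) cannot be scored at all, which is why a minimum count per segment is imposed.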

Semi-supervised segmentation: SSSKMCSQ (Variation 1)
In this variation the information value is replaced with the chi-squared [6] test statistic for the supervised part, and we call this SSSKMCSQ (semi-supervised segmentation as applied to k-means using chi-squared).
The chi-square statistic is often used as a measure of separation. A good example is chi-squared automatic interaction detection, which is a recursive partitioning algorithm used to construct decision trees [6]. It is used here to compare observed target values for each segment against naïve estimates (i.e. counts per class proportional to those for the population).
Using the chi-squared for the supervised part has two main advantages:

• It is always defined within a segmentation scheme (no division by zero). Our techniques do have the option that a user can set a minimum number of cases. A popular rule of thumb is to have at least 5% of cases of the sample in each segment [2].
• It works for both binary and continuous variables – which allows its application to a broader range of problems.
Details of the k-means clustering technique are provided below, followed by a formal definition of chi-squared.
Consider a data set with $n$ observations and $m$ characteristics and let $x_i = \{x_{i1}, x_{i2}, \ldots, x_{im}\}$ denote a single observation in the data set. The $n \times m$ matrix comprising all characteristics for all observations is denoted by $X$. Let $X_p = \{X_{1p}, X_{2p}, \ldots, X_{np}\}$ denote a vector of all observations for a specific characteristic $p$.
On completion of the k-means clustering algorithm all observations $x_i$, with $i = \{1, 2, \ldots, n\}$, will have been assigned to one of the segments $S_1, S_2, \ldots, S_K$, where each $S_j$ denotes an index set containing the indices of all the observations assigned to it. That is, if observation $x_i$ is assigned to segment $S_j$, then $i \in S_j$.
Further, let $u_j = \{u_{j1}, u_{j2}, \ldots, u_{jm}\}$ denote the mean (centroid) of segment $S_j$; for example, $u_{j1}$ will be the mean of characteristic $X_1$. The distance from each observation $x_i$ to the segment mean $u_j$ is given by the distance function $\lVert x_i - u_j \rVert_2$, where $\lVert \cdot \rVert_2$ defines the distance. Note that the double vertical bars indicate distance and hence imply that a square root is used.
The objective of ordinary k-means clustering is to minimise within-segment distances. For notational purposes, we introduce $c \in C$ as an index of an assignment of all the observations to different segments, with $C$ the set of all combinations of possible assignments. The notation $S_{cj}$ is now introduced to reference all the observations for a given assignment $c \in C$ and for a given segment index $j$. In addition, $u_{cj}$ is the centroid of segment $S_{cj}$. The objective function of the ordinary k-means clustering algorithm can now be stated in generic form as equation (1). Note that the notation used for the k-means clustering is the same as that used in Breed et al. [20]
For the newly proposed SSSKMCSQ technique, a function is required to inform the segmentation process. For the supervised component, we will use the chi-squared value (rather than the information value).
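For reference, the ordinary k-means objective referred to as equation (1) can be written in the notation introduced above. The expression below is the textbook form of that objective, given here as an assumption about what (1) states: it takes the squared Euclidean distance of each observation to its segment centroid and minimises the total over all possible assignments.

```latex
\min_{c \in C} \; \sum_{j=1}^{K} \; \sum_{i \in S_{cj}} \lVert x_i - u_{cj} \rVert_2^{2}
```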
The chi-square statistic $X_c^2$ is calculated as in equation (2), where $n$ is the number of observations in the input data set; $K$ is the number of segments over which $X_c^2$ is calculated; and $y$ is the target variable, which can be either binary or continuous. The term $|S|$ is used to represent the number of observations in segment $S$.
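As an illustration of comparing observed target counts per segment against naïve estimates, the sketch below computes a Pearson chi-square for a binary target. It is an assumption that this matches the form used for SSSKMCSQ; it covers the binary case only, and the function name `segment_chi_square` is hypothetical.

```python
def segment_chi_square(targets, segments):
    """Pearson chi-square of observed vs naive expected class counts per segment.

    targets:  list of 0/1 target values, one per observation.
    segments: list of segment labels, one per observation.
    The naive estimate for a segment is its size times the overall event
    rate (and non-event rate), i.e. counts proportional to the population.
    """
    n = len(targets)
    overall_rate = sum(targets) / n
    if overall_rate in (0.0, 1.0):
        raise ValueError("the target must contain both events and non-events")

    # Accumulate (segment size, number of events) per segment label.
    counts = {}
    for y, s in zip(targets, segments):
        size, events = counts.get(s, (0, 0))
        counts[s] = (size + 1, events + y)

    chi2 = 0.0
    for size, events in counts.values():
        expected_events = size * overall_rate
        expected_non_events = size * (1.0 - overall_rate)
        chi2 += (events - expected_events) ** 2 / expected_events
        chi2 += ((size - events) - expected_non_events) ** 2 / expected_non_events
    return chi2
```

Because the expected counts depend only on the segment size and the overall event rate, the statistic stays defined even when a segment contains no events, unlike the information value.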
If chi-square is used in semi-supervised segmentation, then the supervised component $\rho(c)$ for each observation $x_i$ and segment $S_{cj}$ (with $i$ assigned to $S_{cj}$ in each case) can be defined as in equation (3). Let $0 \le w \le 1$ be a weight that controls how much the clustering function is penalised by the chi-square statistic. The proposed optimisation problem for the SSSKMCSQ technique, taking within-segment distances into account, is given in equation (4). In this paper, a heuristic approach is followed for the purpose of generating solutions to the optimisation problem in (4). This includes determining the optimal weight $w$ for the supervised portion, using an algorithm that consists broadly of 10 steps similar to those of SSSKMIV. For details of the steps, see Breed et al. [20]

Semi-supervised segmentation: SSSWong (Variation 2)
Next, we propose a density-based semi-supervised technique using Wong's density clustering algorithm [18]. We call this the SSSWong technique (semi-supervised segmentation applied to Wong's density clustering methodology).
Predictive models are often developed for relatively large data sets (>1000 observations and 20 or more characteristics), and more common kernel-based density methods (like k-nearest neighbours [21]) are not viable because of their complexity. Wong's methodology combines the speed of k-means with the advantages of density-based clustering. It consists of two stages [18,21]. Note that these two stages are in essence an iterative process.
Stage 1: A preliminary clustering analysis is performed using a k-means algorithm with k much larger than the number of final clusters required.
Stage 2: The k clusters formed in stage one are analysed and combined based on density-clustering dissimilarities until the required number of clusters is formed, or only a single cluster remains.
Preliminary clusters $s_{cr}$ and $s_{ct}$ are considered adjacent if the midpoint between the centroids $u_{cr}$ and $u_{ct}$ is closer to either of them than to any other preliminary-cluster mean, based on Euclidean distance. Each thus has only one potential cluster with which it can be combined (with ties typically dealt with based on the order of the observations in the data set). The pair combined each time is that with the minimum density-based dissimilarity measure given in equation (5) (see Wong [18] for further detail and the derivation), where $|s|$ represents the number of observations in segment $s$ and $s_{cr} \,\|\, s_{ct}$ indicates that $s_{cr}$ is adjacent to $s_{ct}$.
Wong's clustering was incorporated in the original semi-supervised technique (SSSKMIV). We adjusted Wong's second step to incorporate the target variable. Thus, the algorithm optimises both cluster density and target rate differences. Let $c \in C$ denote an index of an assignment of all the preliminary segments $s_{c1}, s_{c2}, \ldots, s_{cq}$ to the final segments $S_{c1}, S_{c2}, \ldots, S_{cK}$, with $K < q$ and with $C$ the set of all combinations of possible assignments. In this case, $q$ denotes the number of preliminary segments. Note that each will contain at least one observation, but is likely to contain a larger number, which reduces computational complexity on large data sets.
The conglomeration of the preliminary segments into the final set of segments is done in a binary fashion, as illustrated by Figure 1. The final segments for the example are $S_1 = \{S'_1, S'_2\} = \{s_1, s_2, s_3, s_4\}$ and $S_2 = \{S'_3, S'_4\} = \{s_5, s_6, s_7, s_8\}$. This example covers only one possible combination of assignments. We use the notation $S_{cj}$ to represent any set of segments assigned to it for a given combination $c \in C$. In order to evaluate the density dissimilarity between two segments or nodes, we make use of the notation $d(S_{cj})$. For example, to calculate the dissimilarity between nodes $S'_1$ and $S'_2$, we can calculate $d(S'_{c1}) = d(s_1, s_2)$.
The proposed optimisation problem for the SSSWong algorithm is given in equation (6). Note that the values of $\rho(c)$ and $d(S_{cj})$ are standardised for the same reasons as when using SSSKMIV [20]. For a single segmentation analysis using SSSWong, there are five steps:
1. Preliminary segmentation: Similar to Wong's method, the first step creates the preliminary segments that will be iteratively combined using formula (6).
2. Preliminary segment inspection: Preliminary segments are investigated to identify any with no events or non-events (which the information value calculation cannot handle), which are combined using Wong's standard density measure.
3. Determine preliminary segment adjacency: Adjacent segments are identified for each preliminary segment using a k-nearest neighbour type approach.
4. Combine segments until K left: Segments are iteratively combined until the required number of segments remains.
5. Calculate data set statistics: Statistics like information value obtained per segment are calculated and stored for further use.
The details of these steps can be found in Chapter 7 of a recent PhD thesis. 11
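The fourth step above (combining segments until K remain) follows a generic agglomeration pattern, sketched below. This is not the SSSWong implementation: the dissimilarity is passed in as a function, and the `centroid_distance` stand-in used here is a plain Euclidean distance between centroids, chosen only for illustration. It ignores both the adjacency restriction and the target variable, and all names are hypothetical.

```python
import math

def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def centroid_distance(cluster_a, cluster_b):
    """Stand-in dissimilarity: Euclidean distance between centroids."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

def merge_until(clusters, k, dissimilarity=centroid_distance):
    """Greedily merge the least dissimilar pair until k clusters remain.

    clusters: list of preliminary clusters, each a list of points.
    k:        required number of final clusters (at least 1).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [list(c) for c in clusters]  # work on a copy
    while len(clusters) > k:
        best = None  # (dissimilarity, index_a, index_b) of the best pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dissimilarity(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # combine the chosen pair
        del clusters[b]
    return clusters
```

Swapping in a dissimilarity that blends density and target-rate differences, and restricting the pair search to adjacent preliminary segments, would bring the loop closer to the procedure described in the text.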

Semi-supervised segmentation: SSSKMIV SSE (Variation 3)
For the third variation we developed a semi-supervised technique with segment-size equality (SSE). We call this the SSSKMIV SSE technique (semi-supervised segmentation applied to the k-means algorithm using information value as supervised component, with the addition of segment size equality). Its purpose is to discourage the formation of small segments, or rather to encourage segments of similar (or more equal) size.
Only minor adjustments were needed to the SSSKMIV objective function [11], by introducing $v$ as the SSE weight and $\delta$ as the SSE function.
We define $\delta$ as in equation (7), where $n_c$ is the total number of assigned observations for $c \in C$. The weights in the resulting objective function satisfy $w + v \le 1$.

Semi-supervised segmentation: LK-Means (Variation 4)
The SSSKMIV-based techniques have some similarities with the k-means semi-supervised segmentation algorithm proposed in Peralta et al. [19], which is called LK-Means. This methodology has many similarities to SSS algorithms, but also some clear differences [11]. For our fourth variation we augmented the LK-Means methodology. All four variations of semi-supervised segmentation methods (as well as the original SSSKMIV) were implemented in SAS software (Version 9.4, SAS Institute Inc., Cary, NC, USA). The detail of the technical specifications (e.g. the optimal number of segments, the weight parameters in SSS, the optimal value of k in the k-means algorithm, and a heuristic example) can be found in Breed [11].
To facilitate representing the objective function of the LK-Means algorithm mathematically, we expand the $S_{cj}$ notation to $S_{cjl}$, to reference all the observations for a given assignment $c \in C$, for a given segment index $j$ and a given label $l$. Similarly, $u_{cjl}$ represents the mean, or centroid, of $S_{cjl}$. For this algorithm, the assumption is that the labels (or target variable values) take on $L$ discrete values and are not continuous. The objective function of the LK-Means algorithm to be minimised is given in equation (9), where $v_{cjl}$ is the ratio of the number of observations assigned to cluster $j$ with label $l$ to the total number of observations assigned to cluster $j$. This ratio represents the proportion of label $l$ in cluster $j$.
The distortion weight, $w$, is similar to the weight in SSSKMIV and again adjusts the supervised element with values between 0 and 1. More details of these steps can be found in Chapter 7 of the PhD thesis [11].
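The ratio described above (observations in cluster j carrying label l, divided by all observations in cluster j) can be computed directly from the cluster assignments and labels. The sketch below shows only this ratio, not the LK-Means objective itself; the function name `label_proportions` is hypothetical.

```python
from collections import Counter

def label_proportions(cluster_ids, labels):
    """Proportion of each label within each cluster.

    Returns a dict mapping (cluster j, label l) to the number of
    observations in cluster j with label l divided by the size of cluster j.
    """
    cluster_sizes = Counter(cluster_ids)
    pair_counts = Counter(zip(cluster_ids, labels))
    return {
        (j, l): count / cluster_sizes[j]
        for (j, l), count in pair_counts.items()
    }
```

By construction, the proportions for any one cluster sum to 1 across its labels.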

How to measure model performance: Data splitting and Gini coefficient
In order to compare model performance, each data set was divided randomly into equally sized development and validation sets. Data splitting is the dividing of a sample into two parts and then developing a hypothesis using one part and testing it on the other [22]. Picard and Berk [23] review it in the context of regression and provide specific guidelines for the validation of regression models, i.e. 25% to 50% of the data is recommended for validation. Faraway [24] illustrates that split-data analysis is preferred to a full-data analysis for predictions, with some exceptions.
We used the development set (i.e. training data) to develop the predictive models, whilst the validation set (i.e. hold-out data) was used to assess model performance (hereafter the 'lift'). Lift was measured by calculating Gini coefficients [2], to quantify a model's ability to discriminate between two possible values of a binary target variable [17]. Cases are ranked according to the predictions, and the Gini then provides a measure of correctness. It is one of the most popular measures used in retail credit scoring [1][2][3][25], and has the added advantage that it is a single number [17]. For this paper, values are calculated for the combined validation data sets.
Although we used only Gini in this paper, more measures were used in the original PhD thesis. 11
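The split-and-score procedure described in this section can be sketched as follows. The sketch assumes the common credit-scoring convention that the Gini coefficient equals twice the area under the ROC curve minus one, computed here by direct pair counting; the section does not state the formula used, and the names `split_half` and `gini` are hypothetical.

```python
import random

def split_half(rows, seed=0):
    """Randomly split rows into equally sized development and validation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    half = len(rows) // 2
    return rows[:half], rows[half:]

def gini(scores, targets):
    """Gini coefficient as 2 * AUC - 1 (assumed convention) for a binary target.

    AUC is the share of (event, non-event) pairs in which the event has the
    higher score, counting ties as half. Quadratic in the worst case, which
    is acceptable for a sketch but not for large data sets.
    """
    event_scores = [s for s, y in zip(scores, targets) if y == 1]
    non_event_scores = [s for s, y in zip(scores, targets) if y == 0]
    if not event_scores or not non_event_scores:
        raise ValueError("both events and non-events are required")
    concordant = 0.0
    for e in event_scores:
        for n in non_event_scores:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    auc = concordant / (len(event_scores) * len(non_event_scores))
    return 2.0 * auc - 1.0
```

In a segmented setting, the per-segment model predictions for the validation cases would be pooled before calling `gini`, in line with the combined validation data sets mentioned above.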

Description of data sets
The above segmentation techniques were compared on six different data sets, described below. All explanatory variables were standardised by transforming them into z-scores, i.e. subtracting the mean and dividing by the standard deviation of each based on the full development data set. Weights of evidence or dummy variables would have been preferable, but were not considered because of the added complexity of binning each predictor – especially if done per segment. We cannot say whether or how the transformation methodology might have affected the results.
The data sets are the same as those used in the previous study [12]. A short summary of the data used is given in Table 1. Details on the data sets can be found in Breed [11].
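The z-score transformation described above takes its mean and standard deviation from the development data and applies them to any other data. A minimal sketch for a single explanatory variable follows; the use of the sample standard deviation is an assumption, as the text does not say which estimator was used, and the function names are hypothetical.

```python
import statistics

def fit_zscore(development_values):
    """Return (mean, standard deviation) estimated on the development set."""
    mean = statistics.fmean(development_values)
    std_dev = statistics.stdev(development_values)  # sample estimator (assumed)
    if std_dev == 0:
        raise ValueError("a constant variable cannot be standardised")
    return mean, std_dev

def apply_zscore(values, mean, std_dev):
    """Standardise values using parameters estimated on the development set."""
    return [(v - mean) / std_dev for v in values]
```

Estimating the parameters once on the development set and reusing them for the validation set keeps the hold-out data from influencing the transformation.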

1. Direct marketing
This data set contains information about a bank's customers, the products they have with the bank, and their utilisation of and behaviour with those products. The target variable is binary and indicates whether the customer responded to a direct marketing campaign for a personal loan or not.
24 explanatory variables and 4720 observations

2. Protein structures [26,27]
This data set contains results of experiments performed by the Protein Structure Prediction Centre [27] on the latest protein structure prediction algorithms.

3. Credit applications [2]
This data set contains 10 characteristics of customers who applied for credit. The target variable is binary, indicating whether or not the customer experienced a delinquency of 90 days or worse.
10 explanatory variables and 150 000 observations

4. Wine quality [26,29]
This data set contains physicochemical properties of wines that are extracted through analytical tests that can be easily performed on most wines. The target variable is derived from a score between 0 and 10 which indicates the quality of the wine as scored by tasting experts. The binary target variable that is used for this analysis indicates whether the score is greater than 6, thereby indicating a great quality wine (only 20% of the wines score greater than 6).
11 explanatory variables and 6497 observations

5. Chess king-rook vs king [26,30,31]
The data set is an 'Endgame database', which is a table of stored game-theoretic values for the legal positions of the pieces on a chessboard. This data set was first described by Clarke [32].
18 explanatory variables and 28 056 observations

6. Insurance claims [28]
The data set was used in a competition named 'Claim Prediction Challenge (Allstate)', concluded in 2011. The binary target that was used in this data set indicates whether or not a claim payment was made. The independent variables have been hidden but, according to the website, the data set contains information about the vehicle to which the insurance applies as well as some particulars about the policy itself.

Empirical results
The five semi-supervised segmentation techniques described above were applied to all six data sets, with performance assessed on the validation data. Results for the six data sets are presented in Tables 2-7, respectively, with Table 8 providing a summary of the results. Also, while our focus was to compare different semi-supervised segmentation techniques, we have included an unsegmented logistic regression in each table as a further baseline.
Table 2 summarises the performance of the modelling techniques when applied to the direct marketing data set (as measured by the Gini coefficient calculated on the validation set). SSSKMCSQ achieved the best result, with SSSKMIV second.
Table 3 summarises the results for the protein tertiary structures data set, where the ranking order is completely different from that in Table 2. As a start, SSSKMCSQ ranks fourth of five. Best is SSSKMIV, with SSSKMIV SSE second. The Gini coefficients are between 65% and 70%, which are quite high values.
Table 4 shows results for the credit application data set, where SSSKMIV again outperforms the other techniques. Note that strong bureau data as well as internal data were available on this credit application data set, hence the relatively high Gini values. The large difference between the unsegmented and segmented results is highly unusual, and may be related to the use of z-scores (i.e. standardisation of variables). It may be that the variables that predict credit risk (delinquency) best are the least normally distributed.
For the wine quality data set, Table 5 shows that one of our new variations takes top position: LK-Means.
Table 6 shows results for the chess king-rook vs. king data set, where LK-Means again dominates. It is interesting that the Gini coefficients achieved are very high, from 75% to almost 88%. It seems that it is easier to obtain efficient ranking in this data set, which relates to a highly structured game.
Table 7 shows the results for the last data set, which is for insurance claim prediction. In this case, SSSKMIV SSE works best. The Gini coefficients are very low (ranging between 12% and 16%), which raises the question of whether predictive models can provide any value in this domain.
And finally, Table 8 provides a summary of the median and average ranks for all five semi-supervised segmentation techniques.

Comments on using Gini as an absolute value
The analysis above illustrates the problem of using Gini as an absolute value [27]. The best was 87.33% for LK-Means on the chess data set, but for the insurance data the best was SSSKMIV SSE with a Gini of 15.24%. Such results are not a reflection of the techniques being used, but of the data under consideration [33]. It is unreasonable to have a minimum Gini that is broadly applied [34]. Using Gini coefficients for comparison makes sense only if the data are comparable – in this instance different models applied to the same data.

Concluding remarks
We proposed four newly developed semi-supervised segmentation techniques and provided their mathematical notation. Additionally, we evaluated our four variations against the original semi-supervised technique, SSSKMIV, on six different data sets, with Gini coefficients derived using combined validation data for each segment. The original SSSKMIV technique performed best overall and was the outright winner for two of the data sets, but other variations dominated elsewhere. Best performers were SSSKMIV in the protein and credit data sets, LK-Means in the wine quality and chess data sets, SSSKMIV SSE in the insurance prediction data set and SSSKMCSQ in the direct marketing data set.
The SSSWong technique produced the worst overall results, perhaps because some of k-means' weaknesses were already addressed by SSSKMIV [11] and the additional complexity of SSSWong adds no further benefit.
We conclude that the four alternatives provide additional tools for segmenting data before fitting a logistic regression. Of the four, SSSWong is quickest to perform on a standard PC, but performs worst (as per the results observed). SSSKMCSQ is most versatile (as it can be performed on both binary and continuous variables) and achieves reasonable results. The optimal variation will, however, depend on the characteristics of the data set being analysed.
The benefit of segmentation was also clearly illustrated in the six data sets used in previous work [12], although the impact of the transformation methodology is not known. In this study, we have also clearly highlighted the danger of using an absolute Gini coefficient to evaluate the performance of any predictive model. The relative Gini value is more appropriate. Future research could include investigating which properties of data sets contribute to the differences in performance between the techniques. Another extension of the research could be to use measures other than Gini and information value; many other measures exist that could be alternatives to these values. Further comparisons could be done using an array of such alternative measures. It would also be valuable to investigate transformation methodologies other than the z-score when doing such research.
The SSE function $\delta$ is at its maximum when all segment sizes $(|S_{c1}|, \ldots, |S_{cK}|)$ are equal. Incorporating $v$ and $\delta$ into the SSSKMIV technique results in a new objective function, equation (8).

Table 2: Direct marketing data set: comparison of performance of techniques

Table 3: Protein tertiary structures data set: comparison of performance of techniques

Table 4: Credit application data set: comparison of performance of techniques

Table 5: Wine quality data set: comparison of performance of techniques

Table 6: Chess king-rook vs. king data set: comparison of performance of techniques

Table 8: Median and average rank of the semi-supervised segmentation (with logistic regression) techniques across all data sets

Three of the four variations (SSSKMCSQ, SSSKMIV SSE, LK-Means) achieved a median rank of 3, while LK-Means achieved an average rank of 2.67 (only slightly higher than SSSKMCSQ and SSSKMIV SSE). The overall loser was SSSWong, which came in last across the board.