High-speed detection of emergent market clustering via an unsupervised parallel genetic algorithm

We implement a master-slave parallel genetic algorithm (PGA) with a bespoke log-likelihood fitness function to identify emergent clusters within price evolutions. We use graphics processing units (GPUs) to implement a PGA and visualise the results using disjoint minimal spanning trees (MSTs). We demonstrate that our GPU PGA, implemented on a commercially available general purpose GPU, is able to recover stock clusters in sub-second speed, significantly faster than a test case implementation of a comparable serial genetic algorithm, based on a subset of stocks in the South African market. This, combined with fast on-line intraday correlation matrix estimation from high frequency data for cluster identification, offers cost-effective, near-real-time risk assessment for financial practitioners.


I. INTRODUCTION
Research in exact algorithms, heuristics and meta-heuristics for solving combinatorial optimisation problems is increasingly relevant as data science grapples with larger and more complex data sets. The main advantage of exact methods is the guarantee of finding the global optimum, but the critical disadvantage when solving real (NP-hard) problems is the exponential growth of execution time with the size of the problem space. Heuristics, on the other hand, tend to be efficient, but the solutions obtained are generally not of high quality and are difficult to transfer to similar problems. Meta-heuristics, in contrast, aim to consolidate the two approaches and find a middle ground by delivering an acceptable solution in a reasonable time frame. A large number of meta-heuristics for solving complex problems exist in the literature, the most prominent being the genetic algorithm (GA). GAs are intensive global search heuristics that explore a search space intelligently to solve optimisation problems. Although the algorithms must traverse large spaces, the computationally intensive calculations can be performed independently. Compute Unified Device Architecture (CUDA) is Nvidia's parallel computing platform, well suited to computational tasks that can be parallelised. Implementing a GA to perform cluster analysis on vast data sets using this platform allows one to mine through the data relatively quickly, and at a fraction of the cost of large data centres or computational grids.
Giada and Marsili propose an unsupervised, parameter-free approach to finding data clusters, based on the maximum likelihood principle [7]. They derive a log-likelihood function with which a given cluster configuration can be assessed to determine whether it represents the inherent structure of the dataset: cluster configurations which approach the maximum log-likelihood are better representatives of the data structure. This log-likelihood function is thus a natural candidate for the fitness function in a GA implementation, where the population continually evolves to produce a cluster configuration which maximises the log-likelihood. The optimal number of clusters is a free parameter, unlike in traditional techniques where the number of clusters needs to be specified a priori. Unsupervised approaches have been explored elsewhere (see [15] and references therein); the advantage of the Giada and Marsili approach is that it has a natural interpretation for clustering in the application domain [7].
In this paper, we develop a maintainable and scalable Master-Slave parallel genetic algorithm (PGA) framework for unsupervised cluster analysis on the CUDA platform, which is able to isolate residual clusters using the Giada and Marsili likelihood function [7]. In addition, we investigate the most appropriate genetic operators for maximising the quality and performance of the algorithm. Applying the proposed cluster analysis approach and examining the clustering behaviour of financial instruments allows one to gain greater insight into the behaviour of the stock market. This in turn may enable the conception of early-detection techniques that anticipate turbulence in the market before it happens. The novel implementation presented in this paper is based on work done by [1].
This paper proceeds as follows: Section 2 introduces cluster analysis, focusing on the maximum likelihood approach proposed by Giada and Marsili [7]. Section 3 discusses the Master-Slave PGA. Section 4 discusses the CUDA computational platform. Section 5 provides an overview of our specific implementation. Section 6 discusses data and results from this analysis, before concluding in Section 7.

II. CLUSTER ANALYSIS
Cluster analysis groups objects according to metadata describing the objects or their associations [4]. The goal is to ensure that objects in a group are similar or related to one another and unrelated to objects in other groups.
The greater the homogeneity within a group, and the greater the heterogeneity between groups, the more pronounced the clustering. In order to isolate clusters of similar objects, one needs a data clustering approach that recovers inherent structures efficiently.

A. The correlation measure of similarity
The correlation measure standardises the data by using the statistical interdependence between variables. The correlation indicates the direction (positive or negative) and the degree or strength of the relationship between two variables. The most common correlation coefficient is the Pearson correlation coefficient, which is sensitive only to a linear relationship: it is +1 in the case of a perfect positive linear relationship, -1 in the case of a perfect negative linear relationship, and takes some value between -1 and +1 in all other cases, with values close to 0 signalling negligible interdependence.
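As a concrete illustration, the following is a minimal sketch of estimating the Pearson correlation matrix from a panel of asset prices; the log-return step and all names are our own assumptions, not taken from the text:

```python
import numpy as np

def pearson_correlation(prices):
    """Pearson correlation matrix of normalised returns.

    prices: array of shape (n_assets, n_days). Returns C with
    C[i, j] in [-1, 1] and ones on the diagonal.
    """
    returns = np.diff(np.log(prices), axis=1)         # log returns per asset
    z = returns - returns.mean(axis=1, keepdims=True) # zero mean per asset
    z /= returns.std(axis=1, keepdims=True)           # unit variance per asset
    D = z.shape[1]
    return z @ z.T / D                                # sample correlation
```

Perfectly co-moving price series then yield an off-diagonal entry of 1, and each asset is perfectly correlated with itself.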

B. Clustering procedures
Any specific clustering procedure entails optimising some kind of criterion, such as minimising the within-cluster variance or maximising the distance between the objects or clusters.
1) Cluster analysis based on the maximum likelihood principle: Maximum likelihood estimation is a method of estimating the parameters of a statistical model. Data clustering, on the other hand, deals with the problem of classifying a set of N objects, so that the objects within a group or cluster are more similar to one another than to objects belonging to different groups. If each object is identified by D observable features, then object i = 1, ..., N can be represented as a point x_i = (x_i^1, ..., x_i^D) in a D-dimensional space. Data clustering then tries to identify clusters as more densely populated regions in this vector space. Thus, a configuration of clusters is represented by a set S = {s_1, ..., s_N} of integer labels, where s_i denotes the cluster that object i belongs to and N is the number of objects [7]. Allowing s_i to take values from 1 to M with M = N admits the configuration in which every cluster is a singleton, constituting one object only. If s_i = s_j = s, then objects i and j reside in the same cluster.
2) Giada and Marsili clustering technique: Following Giada and Marsili [6], the essence of maximum likelihood data clustering is that objects belonging to the same cluster should share a common component:

  x_i^d = g_{s_i} eta_{s_i}^d + (1 - g_{s_i}^2)^{1/2} epsilon_i^d.   (1)

Here x_i represents the features of object i and s_i is the label of the cluster that the object belongs to. The data has been normalised to have zero mean and unit variance. eta_s is the common component shared by the objects in cluster s; epsilon_i is a vector describing the deviation of object i from the features of its cluster, including measurement errors; and g_s is a loading factor that emphasises the similarity or difference between objects in cluster s. In this research the data set refers to a set of N objects, denoting assets or stocks, and their features are prices across D days. The index i runs over stocks or assets, whilst d runs over days.
If g_s = 1, all objects with s_i = s are identical, whilst if g_s = 0, all objects are different. The range of the cluster index is from 1 to N in order to allow for singleton clusters of one object or asset each.
If one takes Equation 1 as a statistical hypothesis and assumes that both eta_s and epsilon_i are Gaussian vectors with zero mean and unit variance, for i, s = 1, ..., N, it is possible to compute the probability density P({x_i} | G, S) for any given set of parameters (G, S) = ({g_s}, {s_i}), regarding the observed data set {x_i} as a realisation of the model of Equation 1, as follows [7]:

  P({x_i} | G, S) = < prod_{i=1}^{N} delta( x_i - g_{s_i} eta_{s_i} - (1 - g_{s_i}^2)^{1/2} epsilon_i ) >,   (2)

where delta is the Dirac delta function and <...> denotes the mathematical expectation. For a given cluster structure S, the likelihood is maximal when the parameter g_s takes the value

  g_s* = sqrt( (c_s - n_s) / (n_s^2 - n_s) )  if c_s >= n_s,  and g_s* = 0 otherwise,   (3)

where n_s = sum_{i=1}^{N} delta_{s_i, s} in Equation 3 denotes the number of objects in cluster s.
The variable c_s is the internal correlation of the s-th cluster:

  c_s = sum_{i,j=1}^{N} C_{i,j} delta_{s_i, s} delta_{s_j, s},   (4)

where C_{i,j} is the Pearson correlation coefficient of the data:

  C_{i,j} = (1/D) sum_{d=1}^{D} x_i^d x_j^d.   (5)

The maximum likelihood of structure S can be written as

  P(G*, S | {x_i}) proportional to exp[ D L_c(S) ],   (6)

where the resulting likelihood function per feature L_c is

  L_c(S) = (1/2) sum_{s : n_s > 1} [ log(n_s / c_s) + (n_s - 1) log( (n_s^2 - n_s) / (n_s^2 - c_s) ) ].   (7)

From Equation 7, it follows that L_c = 0 for clusters of uncorrelated objects, i.e. where g_s* = 0 (c_s = n_s), or when the objects are grouped in singleton clusters (n_s = 1) for all cluster indexes. Equation 7 illustrates that the resulting maximum likelihood function for S depends only on the Pearson correlation coefficient C_{i,j} and hence exhibits the following advantages in comparison to conventional clustering methods:
• It is unsupervised: the optimal number of clusters is unknown a priori and not fixed at the beginning.
• The interpretation of results is transparent in terms of the model, namely Equation 1.
Giada and Marsili state that max_S L_c(S) provides a measure of the structure inherent in the cluster configuration represented by the set S = {s_1, ..., s_N} [7]. The higher the value, the more pronounced the structure.
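The likelihood of Equation 7 is straightforward to evaluate for a candidate configuration. A minimal Python sketch follows (variable names are ours; the paper's actual implementation targets CUDA and MATLAB):

```python
import numpy as np

def log_likelihood_per_feature(C, S):
    """Giada-Marsili likelihood L_c for cluster labels S, given the
    Pearson correlation matrix C. Only clusters with n_s > 1 and
    internal correlation c_s > n_s contribute; singletons add zero.
    (A perfectly correlated cluster, c_s -> n_s^2, diverges.)
    """
    L = 0.0
    for s in np.unique(S):
        members = np.flatnonzero(S == s)
        n_s = len(members)
        if n_s <= 1:
            continue                                 # singleton: no structure
        c_s = C[np.ix_(members, members)].sum()      # internal correlation
        if c_s <= n_s:
            continue                                 # g_s* = 0: no structure
        L += 0.5 * (np.log(n_s / c_s)
                    + (n_s - 1) * np.log((n_s**2 - n_s) / (n_s**2 - c_s)))
    return L
```

Grouping two highly correlated objects yields a positive L_c, while the all-singleton configuration scores exactly zero, matching the properties noted below Equation 7.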

III. PARALLEL GENETIC ALGORITHMS
In order to localise clusters of normalised stock returns in financial data, Giada and Marsili made use of a simulated annealing algorithm [7]. They utilised -L_c as the cost function in their application of the log-likelihood function to real-world data sets to substantiate their approach. This was then compared to other clustering algorithms, such as K-means, single linkage, centroid linkage, average linkage, merging and deterministic maximisation [7]. The technique was successfully applied to South African financial data by Mbambiso et al., using a serial implementation of a simulated annealing algorithm (see [11] and [5]).
Simulated annealing and deterministic maximisation provided acceptable approximations to the maximum likelihood structure, but were inherently computationally expensive. This motivates the use of a more efficient approach. We propose PGAs as a viable approach for approximating the maximum likelihood structure: L_c is used as the fitness function and the PGA searches for its maximum, in order to efficiently isolate clusters in correlated financial data.

A. GA principle and genetic operators
One of the key advantages of GAs is that they are conceptually simple. The core algorithm can be summarised in the following steps: initialise population, evolve individuals, evaluate fitness, select individuals to survive to the next generation. GAs exhibit the trait of broad applicability [17], as they can be applied to any problem whose solution domain can be quantified by a function which needs to be optimised.
Specific genetic operators are applied to the parents, in the process of reproduction, which then give rise to offspring.The genetic operators can be classified as the following: selection, crossover, mutation, elitism and replacement.
1) Elitism: Elitism is the process of preserving the fittest individuals by inherent promotion to the next generation, without undergoing any of the genetic transformations of crossover or mutation [17]. Coley states that fitness-proportional selection does not necessarily favour the selection of any particular individual, even if it is the fittest [3]. This means that the fittest individuals might not survive an evolutionary cycle. Thus one can specify the number of fittest individuals which will be preserved and promoted to the next generation.
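An elitism step can be sketched in a few lines; the list-of-arrays population layout and the n_elite parameter are our own illustrative assumptions:

```python
import numpy as np

def apply_elitism(population, fitness, n_elite):
    """Copy the n_elite fittest individuals unchanged into the next
    generation, bypassing crossover and mutation entirely."""
    order = np.argsort(fitness)[::-1]              # indices, best first
    return [population[i].copy() for i in order[:n_elite]]
```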
2) Scaling: Scaling is applied to the fitness values of the remainder of the population in order to return values which are within the range of the selection operator to follow.

3) Selection:
The purpose of selection is to isolate fitter individuals in the population and allow them to propagate in order to give rise to new offspring with higher fitness values. Selection has to be kept in balance with the crossover and mutation operators. The convergence rate of a GA is largely determined by selection pressure: higher selection pressures result in higher convergence rates, while lower selection pressure leads to longer search times. A selection pressure that is too high, however, may result in convergence towards a local optimum instead of the global optimum.
Various techniques exist to govern the selection operator, namely roulette wheel selection, rank selection, tournament selection, random selection and stochastic universal sampling [17].We implemented the stochastic universal sampling selection operator, where individuals are mapped to contiguous segments on a line in proportion to their fitness values [2].Individuals are then selected by sampling the line at uniformly spaced intervals.While fitter individuals have a higher probability of being selected, this technique improves the chances that weaker individuals will be selected, allowing diversity to enter the population and reducing the probability of convergence to a local optimum.
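The sampling scheme described above can be sketched as follows: individuals occupy contiguous fitness-proportional segments of a line, which is then sampled with equally spaced pointers from a single random start (the function name and non-negative-fitness assumption are ours):

```python
import numpy as np

def sus_select(fitness, n_select, rng=None):
    """Stochastic universal sampling. Assumes non-negative fitness
    values (apply scaling beforehand). Returns indices of the
    selected individuals.
    """
    rng = np.random.default_rng() if rng is None else rng
    fitness = np.asarray(fitness, dtype=float)
    step = fitness.sum() / n_select
    start = rng.uniform(0.0, step)                 # one random offset
    pointers = start + step * np.arange(n_select)  # equally spaced pointers
    cum = np.cumsum(fitness)                       # segment right edges
    return np.searchsorted(cum, pointers)          # segment hit by each pointer
```

With uniform fitness every individual is selected exactly once, illustrating how SUS keeps weaker individuals in play and preserves diversity.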
4) Crossover: Crossover is the process of mating two individuals, with the expectation that they can produce a fitter individual [17]. The crossover operation involves the selection of random loci to mark a cross site within the two parent chromosomes, copying the genes to the offspring. Thus, a child that contains both parents' genetic information can be interrogated for a fitness value. A bespoke knowledge-based crossover operator was developed for our implementation, in order to incorporate domain knowledge and improve the rate of convergence. For details on its specification, please see [1].
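For orientation, a plain one-point crossover on an integer-encoded chromosome is shown below. This is a generic stand-in for illustration only, not the bespoke knowledge-based operator, whose specification is in [1]:

```python
import numpy as np

def one_point_crossover(p1, p2, rng=None):
    """Cut both parents at one random locus and splice: the child
    takes p1's genes before the cut and p2's genes after it."""
    rng = np.random.default_rng() if rng is None else rng
    cut = rng.integers(1, len(p1))                 # random cross site
    return np.concatenate([p1[:cut], p2[cut:]])
```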
5) Mutation: Mutation is the key driver of diversity in the candidate solution set or search space [17]. It is usually applied after crossover and aims to ensure that genetic information is randomly distributed, preventing the algorithm from being trapped in local minima. It introduces new genetic structures into the population by randomly modifying some of its building blocks and enables the algorithm to traverse the whole search space. The mutation probability, P_m, determines how often parts of the chromosome will be mutated. A mutation probability of 0% means that the artefacts generated by the crossover operation are the final offspring after mating; a mutation probability of 100% means the whole chromosome is subjected to the mutation operator [17].
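For the integer encoding used here, a simple mutation reassigns each gene to a random cluster label with probability P_m. The uniform-relabelling rule below is our own assumption, not necessarily the paper's exact operator:

```python
import numpy as np

def mutate(individual, n_clusters, p_m, rng=None):
    """Reassign each gene (an object's cluster label) with
    probability p_m to a uniformly random label in 1..n_clusters."""
    rng = np.random.default_rng() if rng is None else rng
    child = individual.copy()
    mask = rng.random(child.shape) < p_m           # genes chosen for mutation
    child[mask] = rng.integers(1, n_clusters + 1, mask.sum())
    return child
```

With p_m = 0 the crossover artefact passes through unchanged; with p_m = 1 every gene is resampled, matching the two limiting cases described above.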
6) Replacement: Replacement is the last stage of any evolution cycle, where the algorithm needs to replace old members of the current population with new members [17]. This mechanism ensures that the population size remains constant, eliminating the weakest individuals in each generation.
Although GAs are very effective for solving complex problems, this positive trait can unfortunately be offset by long execution times, due to the traversal of the search space. GAs do, however, lend themselves to parallelisation, since the fitness values can be determined independently for each candidate solution. While a number of schemes have been proposed in the literature to achieve this parallelisation (see [8], [17] and [16]), we have chosen to implement the Master-Slave model.

B. Master-slave parallelisation
Master-slave GAs, also known as global PGAs, involve a single population that is distributed amongst multiple processing units for the determination of fitness values and the consequent application of genetic operators. They are suited to shared-memory processing entities as well as to any type of distributed system topology, for example grid computing [16].
Ismail provides a summary of the key features of the Master-Slave PGA [8]: the algorithm uses a single population (stored by the master) and the fitness evaluation of all of the individuals is performed in parallel (by the slaves). Communication occurs only as each slave receives its individual (or subset of individuals) to evaluate and when the slaves return the fitness values, sometimes after mutation has been applied with the given probability. The particular algorithm we implemented is synchronous, i.e. the master waits until it has received the fitness values for all individuals in the population before proceeding with selection and mutation. The synchronous Master-Slave PGA thus has the same properties as a conventional GA, except that evaluation of the fitness of the population is achieved at a faster rate. The algorithm is relatively easy to implement, and a significant speedup can be expected if the communication cost does not dominate the computation cost. The whole process has to wait for the slowest processor to finish its fitness evaluations before the selection operator can be applied.
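The synchronous scatter-evaluate-gather step can be sketched in a few lines; here CPU threads stand in for the slaves (the paper's actual slaves are GPU streaming multiprocessors):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_population(population, fitness_fn, n_workers=4):
    """Synchronous master-slave fitness evaluation: the master
    scatters individuals to the workers and blocks until every
    fitness value has been returned, before selection proceeds."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves order and waits for all results (synchronous)
        return list(pool.map(fitness_fn, population))
```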
For our implementation, we made use of the Nvidia CUDA platform to achieve massive parallelism by utilising the Graphical Processing Unit (GPU) cores as slaves, and the CPU as master.
IV. COMPUTATIONAL PLATFORM
Compute Unified Device Architecture (CUDA) is Nvidia's platform for massively parallel high-performance computing on Nvidia GPUs. After developing the technology for the gaming market, Nvidia introduced the Tesla range of GPU products, which caters for simulations and computations in commercial applications, such as the energy and computational finance industries. At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronisation. These are provided to the developer as a minimal set of language extensions, which provide a platform for fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism [12]. Details of the execution environment, thread hierarchy, memory hierarchy and thread synchronisation are omitted here; we refer the reader to [12] and [13] for a comprehensive discussion.

A. Specific computational environment
The CUDA algorithm and the respective testing tools were developed on the Linux platform, and then ported to the Windows environment. The configurations listed in Table I were tested to determine the versatility of the CUDA clustering algorithms across platforms.

V. IMPLEMENTATION
Two objectives were considered in this research. The first was to investigate and tune the behaviour of the PGA implementation using a pre-defined set of 40 simulated stocks featuring 4 distinct disjoint clusters. The second was to identify clusters in a real-world dataset, viz. high-frequency price evolutions of stocks.

A. Representation
We used integer-based encoding for the representation of individuals in the genetic algorithm, i.e.

  individual = (c_1, c_2, ..., c_N),   (8)

where c_i ∈ {1, ..., K} for i = 1, ..., N. Here, c_i is the cluster that object i belongs to. In GA terminology, this means that the i-th gene denotes the cluster that the i-th object or asset belongs to; the number of objects or assets is N. This representation was implemented by Gebbie et al. in their serial GA and was adopted in this research [5].
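Initialising a population under this encoding might look as follows; the number of clusters K and the choice of random generator are illustrative assumptions:

```python
import numpy as np

def random_individual(n_objects, n_clusters, rng=None):
    """One chromosome of the integer encoding: gene i holds the
    cluster label (1..n_clusters) of object i."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.integers(1, n_clusters + 1, n_objects)
```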

B. Fitness function
The fitness function is the Giada and Marsili maximum log-likelihood function L_c, as shown in Equation 7. It is used to determine whether a cluster configuration represents the inherent structure of the data set: the GA converges to the fittest individual, which represents the configuration of distinct residual clusters of correlated assets or objects in the data set.

C. Master-slave PGA implementation
The unparallelised MATLAB GA implementation of the likelihood function by Gebbie, Wilcox and Mbambiso [5] served as a starting point. In order to maximise the performance of the GA, the application of genetic operators and the evaluation of the fitness function were parallelised for the CUDA framework [1]. A summarised exposition is presented here.
Emphasis was placed on offloading as much of the GA execution as possible to the GPU and making extensive use of GPU memory [19]. The Master-Slave PGA uses a single population, where evaluation of the individuals and successive application of genetic operators are conducted in parallel. The global parallelisation model does not predicate anything about the underlying computer architecture, so it can be implemented efficiently on both shared-memory and distributed-memory platforms [17]. Delegating these tasks to the GPU and making extensive use of GPU memory minimises data transfers between the host and the device, which have a significantly lower bandwidth than transfers between shared or global memory and the kernel executing on the GPU. It is recommended that accesses to global memory be minimised [14], but this is not a viable option in our case, since global memory is used to store the population. The algorithm in [5] was modified to maximise the performance of the Master-Slave PGA and to maintain a clear distinction between the master node (CPU), which controls the evolutionary process by issuing the commands for the GA operations, and the slave nodes (GPU streaming multiprocessors), which perform them. The pseudo-code for the algorithm implemented is shown in Algorithm 1. To achieve data parallelism and make use of the CUDA thread hierarchy, we mapped individual genes onto a two-dimensional grid. Using the representation shown in Equation 8, and assuming a population of 400 individuals and 16 stocks, the genes would be mapped to grid cells as illustrated in Figure 1.
The data grid cells are mapped to threads, where each thread executes a kernel processing the data cell at the respective xy-coordinate.
For details on the full implementation, as well as specific choices regarding initialisation, block sizes and threads per block, please see [1].

D. Data pre-processing
To generate the N-stock correlation matrices used to demonstrate the viability of the algorithm on real-world test data, correlations were computed on data where missing values were filled using zero-order-hold interpolation [18]. The market mode was removed using the method suggested by Giada and Marsili [6], via a recursive averaging algorithm. A covariance matrix was then computed using an iterative online EWMA filter with a default forgetting factor of λ = 0.98. The correlation matrix was computed from the covariance matrix and cleaned using random matrix theory methods. In particular, Gaussian noise effects were reduced by eliminating eigenvalues in the Wishart range in a trace-preserving manner [18]. This enhanced the clusters and improved the stability of the estimated sequence of correlation matrices.
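A sketch of the iterative EWMA covariance update and the covariance-to-correlation step follows. Initialising from the first outer product is our own choice, and the market-mode removal and RMT cleaning steps are not reproduced here:

```python
import numpy as np

def ewma_covariance(returns, lam=0.98):
    """Online EWMA covariance over a return series (rows are
    observations); lam is the forgetting factor quoted in the text."""
    cov = np.outer(returns[0], returns[0])         # seed from first sample
    for r in returns[1:]:
        cov = lam * cov + (1 - lam) * np.outer(r, r)  # recursive update
    return cov

def correlation_from_covariance(cov):
    """Rescale a covariance matrix to a correlation matrix."""
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)
```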

E. Data post-processing
Computed cluster configurations are read from the CUDA output flat file. Subsequently, an adjacency matrix is constructed using values from the correlation matrix in conjunction with the computed cluster configuration of the respective data set. The adjacency matrix is then used to construct a disjoint set of Minimal Spanning Trees (MSTs), each tree capturing the interconnectedness of one cluster. Each MST has n_s - 1 edges, connecting the n_s stocks of the cluster in such a manner that the sum of the edge weights is a minimum. Kruskal's algorithm was used to generate the MSTs, which depict the linkages between highly correlated stocks, providing a graphical visualisation of the resultant set of disjoint clusters [10].
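A minimal union-find implementation of Kruskal's algorithm over one cluster's nodes is sketched below. Deriving edge weights from correlations via d = sqrt(2(1 - C_ij)) is a common choice for correlation-based MSTs, assumed here rather than taken from the paper:

```python
import numpy as np

def kruskal_mst(nodes, dist):
    """MST edges for one cluster: sort candidate edges by weight and
    greedily add any edge that joins two different components,
    yielding exactly len(nodes) - 1 edges."""
    parent = {v: v for v in nodes}

    def find(v):                       # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted((dist[i, j], i, j)
                   for a, i in enumerate(nodes) for j in nodes[a + 1:])
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                   # edge joins two components: keep it
            parent[ri] = rj
            mst.append((i, j, w))
    return mst
```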

VI. DATA AND RESULTS
A. Data
This investigation used two sets of data: a training set and a test set. The training set consisted of a simulated time series of 40 stocks which exhibit distinct, disjoint clusters. This was used to tune the PGA parameters. The test set consisted of real quoted stock midprice ticks, observed every 3 seconds, from 28 September 2012 to 10 October 2012. The stocks chosen represent the 18 most liquid stocks on the JSE, according to traded volumes.

B. Results
We show a sample set of results here.For a comprehensive discussion, please see [1].
1) Optimal algorithm settings: Various investigations were undertaken to identify optimal adjoint parameters for the PGA. In each case, the algorithm was successively applied to the training set, with known disjoint clusters. Settings were varied until the rate of convergence was maximised. Once the optimal value for each adjoint parameter had been determined from the training set, the optimal algorithm configuration was deployed on the test set. Details of the effect of various adjoint parameter choices on the rate of convergence can be found in [1].
The resulting optimal configuration for the PGA, given a population size of 1000, was then deployed on the test set.
2) Benchmark timing results: Table III illustrates the efficiency of the CUDA PGA implementation compared to the MATLAB serial GA. The performance improvement for the training-set cluster analysis run is in the region of 13000% on Linux and 6500% on Windows. This can be attributed to the utilisation of a parallel computation platform, a novel genetic operator and the algorithm tuning techniques employed. The Linux CUDA PGA takes 0.305 seconds to identify the residual clusters inherent in a single correlation matrix of 18 real-world stocks, demonstrating its potential as a near-real-time risk assessment tool. The one caveat of these results is that we assume correlation matrices are readily available as inputs for the cluster analysis algorithm. Further research should investigate computationally efficient correlation estimation for high-frequency data to develop a robust and practical near-real-time risk assessment tool.
3) Real-world test set results: In this section, we illustrate a sample of the resultant cluster configurations generated by our model, represented graphically as MSTs. A more detailed discussion can be found in [1]. The thickness of the edges connecting nodes gives an indication of the strength of the correlation between stocks.
The South African equity market is often characterised by diverging behaviour between financial/industrial stocks and resource stocks and strong coupling with global market trends.
In Figure 2, we see 4 distinct clusters emerge as a result of the early-morning trading patterns, just after market open; most notably, a 6-node financial/industrial cluster (SLM, SBK, ASA, SHF, GFI, OML) and a 3-node resource cluster (BIL, SOL, AGL). At face value, these configurations would be expected; however, we notice that GFI, a gold mining company, appears in the financial cluster, while FSR, a banking company, does not. These are potentially examples of short-term decoupling behaviour of individual stocks due to idiosyncratic factors.
Figure 3 illustrates the effect of the UK market open on local trading patterns. We see a clear emergence of a single large cluster, indicating that trading activity by UK investors has a significant impact on the local market. Examining the large single cluster, all of the stocks have either primary or secondary listings in the US and UK. In particular, SAB and ANG have secondary listings on the London Stock Exchange (LSE), whereas BIL and AGL have primary listings on the LSE [9]. It is also unusual to see such a strong link (correlation) between AGL, a mining company, and CFR, a luxury goods company. This could be further evidence that significant UK trading in these 2 stocks can cause a short-term elevated correlation, which may not be sustained throughout the day.
Figure 5 illustrates the effect of the US market open on local trading patterns. Similar to what we observed in Figure 3, we see the emergence of a large single cluster, driven by elevated short-term correlations amongst constituent stocks. This provides further evidence that significant trading by foreign investors in local stocks can have a material impact on stock market dynamics.
Detecting cluster anomalies and measuring persistence of effects may provide financial practitioners with useful information to support local trading strategies.

VII. CONCLUSION
This paper verifies that the Giada and Marsili likelihood function [7] is a viable, parallelisable approach for isolating residual clusters in datasets on a CUDA platform. Key advantages compared to conventional clustering methods are: 1.) the method is unsupervised and 2.) the interpretation of results is transparent in terms of the model.
The implementation of the Master-Slave PGA showed that efficiency depends on various algorithm settings. The type of mutation operator utilised has a significant effect on the algorithm's efficiency in isolating the optimal solution in the search space, whilst the other adjoint parameter settings primarily impact the convergence rate. According to the benchmark test results, the CUDA PGA implementation runs 100-130 times faster than the serial GA implementation in MATLAB. Thus, an efficient cluster analysis approach has been demonstrated.
Provided intraday correlation matrices can be estimated from high frequency data, this significantly reduced computation time suggests that intraday cluster identification can be practical for near-real-time risk assessment by financial practitioners.

Fig. 1: Mapping of individuals onto the CUDA thread hierarchy

TABLE I: Development, testing and benchmarking environments
Algorithm 1: Master-Slave PGA for cluster identification

TABLE II: Development, testing and benchmarking environment

TABLE III: Benchmark computational speed results