Machine learning and Fourier transform to speed up training
Machine learning is a fast-growing field in computational science and data analysis. Its recent growth has been driven by big data: the study of data sets that are too large or complex for conventional analysis methods.
In particle physics experiments it is necessary to analyze a huge number of collisions to find events of interest. In the search for supersymmetric particles, Higgs bosons and new particles of unknown mass, Monte Carlo simulations have been used to produce vast amounts of data to train learning machines to distinguish signal processes, which produce particles of interest, from background processes. Machine learning techniques have also been employed in the classification of data to aid the discovery of pulsar candidates, and even to reduce noise in LIGO's gravitational-wave detectors.
One does not need to go that far to find applications for ML algorithms. For example, in computational condensed matter systems the distributions of the bond order parameters can be smeared out by thermal fluctuations, making it difficult to determine which phase the system is in. The averaged order parameters (see Measuring crystallinity with spherical harmonics) were proposed in [1] precisely to reduce the overlap between the values of different parameters and help differentiate between phases, but even with the averaged parameters the distributions for some of the solid phases may overlap, at least for the Lennard-Jones and Gaussian core potentials. Clustering algorithms from machine learning can be used to separate the distributions and help differentiate between solid phases in an efficient and statistically principled way.
There is nothing special about any of the examples above; ML algorithms are useful computational tools for data analysis in general. As such, it is worth understanding the principles on which these methods are based. In fact, it may be surprising that familiar concepts like the Lagrangian or the Fourier transform are strongly related to some ML models.
Here we will see how ML algorithms work, using one of the most popular models and a method based on Fourier transforms to speed up training. If you are interested in programming an ML algorithm, a complete tutorial and many subroutines can be found in [2]. The HIGGS, SUSY and other data sets are available in [5-9].
SUPPORT VECTOR MACHINES
Support Vector Machines (SVMs) are a supervised learning model used for classification. The machine is given a set of input-output instances and has to infer a mapping from them, in order to predict the output of future, unclassified inputs.
Say we have N instances (instances can be signals, images, medical diagnoses, etc.), that each instance can be described numerically by d attributes (duration of a signal, level of symmetry, amplitude, level of cholesterol in blood, etc.), and that the instances can be separated into two classes (signal or noise, true or false, benign or malignant, etc.). With this information one can train a learning machine such as an SVM to obtain a function that receives the attributes of an instance as input and classifies it correctly. The process of classification is then automated, and its accuracy does not rely on humans but on a machine that can learn from millions of instances in a short time.
The information used to train the machine can therefore be expressed as N instances, each described by d attributes stored in a d-dimensional vector x, with the classification of each instance expressed by a label y equal to 1 or -1. So we have

{ (x_i, y_i) },  i = 1, ..., N,   with x_i ∈ ℝ^d and y_i ∈ {−1, 1}.
What an SVM does is find a hyperplane that separates the two classes in the most efficient way, so that when the hyperplane's equation is evaluated for a new vector (a new instance), the machine classifies it by determining on which side of the plane it lies.
The figure below shows a two-dimensional case (instances described by only two attributes) for a linearly separable data set. Suppose the white points represent the class with labels 1 and the gray points represent the other class, with labels -1.
There are, of course, many lines that successfully separate the two classes, but the most efficient separation uses the line that leaves the maximum margin to the closest points (called the support vectors), since we only have a finite set of instances and do not know exactly where one class ends and the other begins. We define that line (or, more generally, hyperplane) as

w · x + b = 0,
where w is called the weight vector and b is called the bias; the problem of obtaining the classifier reduces to finding these two parameters.
The left-hand side of the equation above, which is zero for points on the separating line, is positive when evaluated for points on one side and negative for points on the other, with magnitude proportional to the distance to the line. So we can evaluate a new instance with the parameters w and b and classify it by looking at the sign of the result. These parameters can be rescaled in any convenient way, so we define the scaling such that evaluating the points closest to the separating line gives exactly 1 or -1, depending on which side of the line they lie. For all the points in the training set we then have

w · x_i + b ≥ 1   for y_i = 1,
w · x_i + b ≤ −1  for y_i = −1.
Or, compactly,

y_i (w · x_i + b) ≥ 1,   i = 1, ..., N.
To find the weight vector for which the separating hyperplane leaves the largest margin to the support vectors we use the point-line distance equation,

dist(x) = |w · x + b| / ‖w‖,
which, evaluated at a support vector (where |w · x + b| = 1), is

1 / ‖w‖,
the distance from the separating hyperplane to one of the support vectors, so the total size of the margin (considering both sides) is

2 / ‖w‖.
We have to maximize the size of this margin, which is equivalent to minimizing the magnitude of the weight vector.
For convenience we get rid of the square root (minimizing the magnitude is the same as minimizing the magnitude squared) and write the problem as

minimize (1/2) ‖w‖²   subject to   y_i (w · x_i + b) ≥ 1,   i = 1, ..., N.
This is a minimization problem subject to constraints, so we can introduce Lagrange multipliers α_i ≥ 0 and write the Lagrangian:

L = (1/2) ‖w‖² − Σ_i α_i [ y_i (w · x_i + b) − 1 ].
This is a convex function, which means that it has a global extremum, so one doesn’t have to worry about getting stuck on local minima or maxima and the problem reduces to finding that global extremum.
With the condition that the gradient be zero,

∂L/∂w = 0,   ∂L/∂b = 0.
Applying these conditions to the Lagrangian one obtains

w = Σ_i α_i y_i x_i   and   Σ_i α_i y_i = 0.
And substituting these conditions back into the Lagrangian we get the dual form

L = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j).
It is important to notice here that this Lagrangian only depends on the scalar products between the training vectors.
All that is left is to compute the optimized values of the Lagrange multipliers, and this is where all the computational cost of the problem comes in. This is an instance of a more general type of computational problem known as quadratic programming, and there are many numerical algorithms and software packages to solve it. Data sets used to train machines are generally high dimensional and can have ~10^5 to ~10^7 vectors, so training a machine may take a long time even on supercomputers. The data set can also be partitioned into smaller subsets that are optimized in parallel [10].
There are many methods to reduce training times; one of the most used is the Sequential Minimal Optimization (SMO) algorithm, which solves for the multipliers analytically in pairs. In [2], subroutines implementing SVMs with the SMO algorithm are given in C++. Once the optimized values of the multipliers are computed, one can obtain w from the equation derived above from the gradient condition; b follows from the computed multipliers.
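As an illustration (not from the references), here is a minimal sketch of solving the dual problem with MATLAB's quadprog (Optimization Toolbox) and then recovering w and b. The matrix X (d-by-N) and the label vector y (N-by-1, entries +1/-1) are placeholders for a training set.

% Sketch: hard-margin dual solved as a quadratic program, then w and b recovered.
N = size(X, 2);
H = (y*y') .* (X'*X);              % H_ij = y_i y_j (x_i . x_j)
f = -ones(N, 1);                   % minimize (1/2) a'Ha - sum(a)
Aeq = y'; beq = 0;                 % constraint: sum_i a_i y_i = 0
lb = zeros(N, 1); ub = [];         % hard margin: 0 <= a_i, no upper bound
alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);
sv = alpha > 1e-6;                 % support vectors have nonzero multipliers
w  = X * (alpha .* y);             % w = sum_i a_i y_i x_i
b  = mean(y(sv) - X(:, sv)' * w);  % from y_i (w.x_i + b) = 1 at the support vectors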
SOFT MARGIN
Generally the instances are not perfectly separable; some vectors cross the separating hyperplane into the wrong class and have to be treated as noise, as in the figure below.
One can add a new term to the formulation of the problem to allow the separation some error tolerance (while still trying to minimize the misclassifications) by taking into account the distances ξ_i of the misclassified instances to the margin:

minimize (1/2) ‖w‖² + C Σ_i ξ_i   subject to   y_i (w · x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,
where C is a free parameter that modulates the error tolerance. Repeating the process followed before, one finds that this formulation leads to exactly the same dual Lagrangian, but now the multipliers have an upper bound:

0 ≤ α_i ≤ C,
which reduces to the first case when C goes to infinity.
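In the quadprog sketch above, the soft margin only changes the upper bound on the multipliers (C is a user-chosen constant):

% Soft margin: the multipliers are capped at C.
ub = C * ones(N, 1);
alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);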
Using this formulation, once the multipliers are computed, we obtain the 'learned function'

f(x) = sign( w · x + b ).
The sign function is used to obtain a simple 1 or -1 classification for a given input instance x.
Using the equation for w obtained before, one can also write the learned function as

f(x) = sign( Σ_i α_i y_i (x_i · x) + b ),
where again the vectors only appear as scalar products.
The figure below shows the learned function (red line) obtained for two-dimensional, randomly generated data sets with different parameters.
NON-LINEAR SVM
Of course, in general the data isn't linearly separable either, and this is where the real advantage of the SVM algorithm comes in.
One can map the data to a higher-dimensional space where it is separable. Imagine, for example, a two-dimensional data set like the one shown below, where the data is not linearly separable but the classification is clearly radius-dependent. If each vector is mapped into a three-dimensional space where its new component (say, its height) is proportional to its distance to the origin, then the classes can be separated in this higher-dimensional space by a plane, as in the sketch below.
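A minimal sketch of such a lift, assuming a placeholder 2-by-N matrix X of points and a placeholder label vector y:

% Sketch: lift 2-D points into 3-D by appending the squared distance to the
% origin; radially separated classes become separable by a plane.
X3 = [X; sum(X.^2, 1)];                     % third component = x1^2 + x2^2
scatter3(X3(1,:), X3(2,:), X3(3,:), 20, y)  % color points by class label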
The advantage of the SVM algorithm is that one does not need to find a particular mapping φ suited to a specific data set, because the vectors in the formulation of the SVM only appear as scalar products. So one only needs a kernel function, defined as

K(x_i, x_j) = φ(x_i) · φ(x_j),

which simply replaces the scalar products x_i · x_j in the dual Lagrangian and in the learned function.
Some of the most used kernels are:

- the polynomial kernel, K(x_i, x_j) = (x_i · x_j + c)^p,
- the Gaussian (RBF) kernel, K(x_i, x_j) = exp( −‖x_i − x_j‖² / 2σ² ),
- the sigmoid kernel, K(x_i, x_j) = tanh( κ x_i · x_j + c ).
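As an illustration, the Gaussian and polynomial kernels can be written as anonymous functions in MATLAB (sigma, c and p are user-chosen parameters, not values from the references):

% Sketch: common kernels as anonymous functions of two column vectors.
K_gauss = @(xi, xj) exp(-norm(xi - xj)^2 / (2*sigma^2));
K_poly  = @(xi, xj) (xi' * xj + c)^p;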
The figure below shows the classification (on the bottom) obtained for two-dimensional, non-linearly separable, randomly generated data sets (on the top).
One disadvantage is that in the linear SVM the learned function can be evaluated using

f(x) = sign( w · x + b ),
which only takes ~d operations (from the scalar product). In the non-linear SVM, however, it is not possible to compute the weight vector explicitly since, as derived before, it has the form

w = Σ_i α_i y_i φ(x_i),
and we do not know the mapping φ (we only know the kernel). So each time the learned function is evaluated, one has to substitute the complete sum instead, so that only scalar products (kernels) appear:

f(x) = sign( Σ_i α_i y_i K(x_i, x) + b ),
which takes ~N*d operations (from the N evaluations of the kernel).
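A sketch of this dual-form evaluation, assuming the multipliers alpha, the labels y, the training matrix X (d-by-N), the bias b and a kernel such as the K_gauss defined above are already available (all names are placeholders):

% Sketch: evaluating the non-linear learned function for a new instance xnew
% requires a kernel evaluation against every training vector (~N*d operations).
f = 0;
for i = 1:N
    f = f + alpha(i) * y(i) * K_gauss(X(:, i), xnew);
end
label = sign(f + b);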
RANDOM FOURIER FEATURES
There is a method that significantly reduces the number of operations needed as a consequence of not knowing the form of the mapping function in non-linear SVMs. It is based on using the Fourier transform of the kernel as a probability distribution from which to sample an approximate mapping.
According to Bochner's theorem, a continuous shift-invariant kernel, i.e. a kernel of the form

K(x, y) = k(x − y),
is positive definite if and only if its Fourier transform corresponds to a finite positive measure. This guarantees that if properly normalized, the Fourier transform of this kernel can be used as a probability distribution.
We write the Fourier transform of a shift-invariant kernel, such as the Gaussian kernel, as p(ω), to denote that (properly normalized) it is a genuine probability distribution:

p(ω) = (2π)^(−d) ∫ k(δ) e^(−i ω · δ) dδ.
For the Gaussian kernel k(δ) = exp( −‖δ‖² / 2σ² ), the Fourier transform is again a Gaussian,

p(ω) = (σ² / 2π)^(d/2) exp( −σ² ‖ω‖² / 2 ),

i.e. ω is distributed as a d-dimensional normal with zero mean and variance 1/σ² in each component.
As a consequence of Bochner's theorem, the inverse transform expresses the kernel as an expectation value E over p:

k(x − y) = ∫ p(ω) e^(i ω · (x − y)) dω = E_ω[ e^(i ω · x) ( e^(i ω · y) )* ].
This means that one can estimate the integral above by sampling D independent and identically distributed values ω_1, ..., ω_D from the probability distribution p and taking the average:

k(x − y) ≈ (1/D) Σ_{j=1..D} e^(i ω_j · x) ( e^(i ω_j · y) )*.
Notice that this average can be written as a product of two vectors,

k(x − y) ≈ z(y)† z(x),
where the dagger stands for the conjugate transpose. So we can write the kernel as a scalar product by defining a function that contains the samples from the probability distribution,

z(v) = (1/√D) [ e^(i ω_1 · v), e^(i ω_2 · v), ..., e^(i ω_D · v) ]ᵀ.
Since the kernel and p are both real, the inverse transform still converges to the kernel when the complex exponentials are replaced by cosines. In this way one obtains a real mapping z with components

z_j(x) = √(2/D) cos( ω_j · x + b_j ),
where each ω_j is sampled from p and each b_j is a phase sampled from the uniform distribution on [0, 2π].
So this results in

k(x − y) ≈ z(x) · z(y).
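As a quick numerical check (a sketch, assuming the Gaussian kernel with width sigma; all variable names and dimensions are placeholder choices), one can compare the random-feature approximation with the exact kernel value:

% Sketch: approximate the Gaussian kernel exp(-||x-y||^2/(2*sigma^2)) with
% D random Fourier features and compare against the exact value.
d = 5; D = 500; sigma = 1.0;
x = randn(d, 1); y2 = randn(d, 1);
W   = randn(D, d) / sigma;              % rows are omegas sampled from N(0, I/sigma^2)
bph = 2*pi*rand(D, 1);                  % phases sampled uniformly from [0, 2*pi]
z   = @(v) sqrt(2/D) * cos(W*v + bph);  % real feature map z(v)
k_exact  = exp(-norm(x - y2)^2 / (2*sigma^2));
k_approx = z(x)' * z(y2);               % close to k_exact for large D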
This means that we have obtained an explicit (approximate) mapping,

φ(x) ≈ z(x),
and as a consequence we can now compute a weight vector,

w = Σ_i α_i y_i z(x_i),
as in the linear SVM, and using

f(x) = sign( w · z(x) + b ),
the evaluation of the learned function takes only ~d+D operations.
Training and testing a machine using Random Fourier Features only takes four lines [4] of MATLAB code:
% Training using the Gaussian kernel (X is d-by-N, y is N-by-1, xtest is d-by-1)
w = randn(D, size(X,1));                    % D random directions, one per feature
Z = exp(1i*w*X);                            % D-by-N matrix of random features
alpha = (eye(D)*lambda + Z*Z') \ (Z*y(:));  % regularized least squares (note eye(D) so the dimensions match Z*Z')
% Testing
ztest = alpha(:)' * exp(1i*w*xtest(:));     % prediction for the test point
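A possible way to try the snippet above on synthetic data (the dimensions and the regularization parameter lambda are placeholder choices, not values from [4]); the predicted class is the sign of the real part of ztest:

% Sketch: placeholder driver for the lines above.
d = 2; N = 1000; D = 100; lambda = 1e-3;
X = randn(d, N);                 % random two-dimensional instances
y = sign(sum(X.^2, 1).' - d);    % radially separated labels (+1/-1)
xtest = randn(d, 1);
% ...run the training and testing lines above, then classify with
label = sign(real(ztest));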
References:
Bond order parameters:
[1] W. Lechner and C. Dellago (2008). "Accurate determination of crystal structures based on averaged local bond order parameters". http://dx.doi.org/10.1063/1.2977970

For a complete tutorial and subroutines on SVM and the SMO algorithm:
[2] Vasant Honavar's notes on SVM and SMO: http://web.cs.iastate.edu/~honavar/smo-svm.pdf (see also the SVM lecture from the Artificial Intelligence course at MIT: https://www.youtube.com/watch?v=_PwhiWxHK8o)

Fourier transforms to speed up training:
[3] A. Rahimi and B. Recht (2008). "Random Fourier features for large-scale kernel machines". https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf
[4] MATLAB code: https://keysduplicated.com/~ali/random-features/

Physics-related data sets in UCI's Machine Learning Repository:
[5] HIGGS data set: https://archive.ics.uci.edu/ml/datasets/HIGGS (ML to detect Higgs bosons)
[6] SUSY data set: https://archive.ics.uci.edu/ml/datasets/SUSY (ML to detect SUSY particles)
[7] HEPMASS data set: https://archive.ics.uci.edu/ml/datasets/HEPMASS (ML to detect particles with unknown masses)
[8] HTRU2 data set: https://archive.ics.uci.edu/ml/datasets/HTRU2 (ML to classify pulsar candidates)
[9] MiniBooNE data set: https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification (ML to detect electron neutrinos)

Parallel computing with SVMs:
[10] N. K. Alham et al. (2011). "A MapReduce-based distributed SVM algorithm for automatic image annotation". https://doi.org/10.1016/j.camwa.2011.07.046