The optimization begins with an initial guess supplied by the user and searches for an X which locally minimizes target(X). Since this problem can have many local minima, the quality of the starting point can significantly influence the results.
Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems by Chris Biemann.
In particular, this is a method for automatically clustering the nodes in a graph into groups. The method is able to automatically determine the number of clusters.
A New Discriminant Principal Component Analysis Method with Partial Supervision (2009) by Dan Sun and Daoqiang Zhang
This algorithm is basically a straightforward generalization of the classical PCA technique to handle partially labeled data. It is useful if you want to learn a linear dimensionality reduction rule from data that is only partially labeled.
This object represents a map from objects of sample_type (the kind of object a kernel function operates on) to finite dimensional column vectors which represent points in the kernel feature space defined by whatever kernel is used with this object.
To use the empirical_kernel_map you supply it with a particular kernel and a set of basis samples. After that you can present it with new samples and it will project them into the part of kernel feature space spanned by your basis samples.
This means the empirical_kernel_map is a tool you can use to very easily kernelize any algorithm that operates on column vectors. All you have to do is select a set of basis samples and then use the empirical_kernel_map to project all your data points into the part of kernel feature space spanned by those basis samples. Then just run your normal algorithm on the output vectors and it will be effectively kernelized.
Regarding methods to select a set of basis samples, if you are working with only a few thousand samples then you can just use all of them as basis samples. Alternatively, the linearly_independent_subset_finder often works well for selecting a basis set. I also find that picking a random subset typically works well.
An example use of this object is as an online algorithm for recursively estimating the centroid of a sequence of training points. This object then allows you to compute the distance between the centroid and any test points. So you can use this object to predict how similar a test point is to the data this object has been trained on (larger distances from the centroid indicate dissimilarity/anomalous points).
The object internally keeps a set of "dictionary vectors" that are used to represent the centroid. It manages these vectors using the sparsification technique described in the paper The Kernel Recursive Least Squares Algorithm by Yaakov Engel. This technique allows us to keep the number of dictionary vectors down to a minimum. In fact, the object has a user selectable tolerance parameter that controls the trade off between accuracy and number of stored dictionary vectors.
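The distance between a test point and a centroid in kernel feature space can be written entirely in terms of kernel evaluations. The following is a minimal sketch of that computation, not dlib's implementation: it stores every training point rather than a sparsified dictionary, and the names `squared_distance_to_centroid`, `kernel_fn`, and `linear_kernel` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using sample = std::vector<double>;
using kernel_fn = double (*)(const sample&, const sample&);

// Plain dot product, i.e. the linear kernel. Any other kernel can be swapped in.
double linear_kernel(const sample& a, const sample& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Squared feature-space distance between x and the centroid of `points`:
//   k(x,x) - (2/n) * sum_i k(p_i,x) + (1/n^2) * sum_ij k(p_i,p_j)
// This naive version keeps all points, so it costs O(n^2) kernel evaluations.
double squared_distance_to_centroid(const std::vector<sample>& points,
                                    const sample& x, kernel_fn k) {
    assert(!points.empty());
    const double n = static_cast<double>(points.size());
    double cross = 0.0;  // sum_i k(p_i, x)
    double self = 0.0;   // sum_ij k(p_i, p_j)
    for (const sample& p : points) {
        cross += k(p, x);
        for (const sample& q : points) self += k(p, q);
    }
    return k(x, x) - 2.0 * cross / n + self / (n * n);
}
```

With the linear kernel and training points (0,0) and (2,0), the centroid is (1,0), so the test point (1,1) has squared distance 1. A sparsified dictionary, as described above, trades some accuracy for a much smaller set of stored points.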
If you want to use the linear kernel (i.e. do a normal k-means clustering) then you should use the find_clusters_using_kmeans routine.
The long and short of this algorithm is that it is an online kernel based regression algorithm. You give it samples (x,y) and it learns the function f(x) == y. For a detailed description of the algorithm read the above paper.
Note that if you want to use the linear kernel then you would be better off using the rls object as it is optimized for this case.
Performs kernel ridge regression and outputs a decision_function that represents the learned function.
The implementation is done using the empirical_kernel_map and linearly_independent_subset_finder to kernelize the rr_trainer object. Thus it allows you to run the algorithm on large datasets and obtain sparse outputs. It is also capable of automatically estimating its regularization parameter using leave-one-out cross-validation.
This function is an implementation of the algorithm described in the following papers:
Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods by John C. Platt. March 26, 1999
A Note on Platt's Probabilistic Outputs for Support Vector Machines by Hsuan-Tien Lin, Chih-Jen Lin, and Ruby C. Weng
This function is the tool used to implement the train_probabilistic_decision_function routine.
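In Platt's method, a raw decision value is mapped to a probability by a two-parameter sigmoid. The sketch below only evaluates that sigmoid for given parameters; it does not fit them, and the names `platt_probability`, `A`, and `B` are chosen here for illustration.

```cpp
#include <cassert>
#include <cmath>

// Platt's sigmoid: P(y = +1 | f) = 1 / (1 + exp(A*f + B)),
// where f is the raw decision value and A, B are the fitted parameters.
// A is normally negative, so larger decision values give larger probabilities.
double platt_probability(double decision_value, double A, double B) {
    assert(std::isfinite(decision_value));
    return 1.0 / (1.0 + std::exp(A * decision_value + B));
}
```

For example, with A = -2 and B = 0, a decision value of 0 maps to a probability of 0.5, and large positive decision values map to probabilities close to 1.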
This is an implementation of an online algorithm for recursively finding a set (aka dictionary) of linearly independent vectors in a kernel induced feature space. To use it you decide how large you would like the dictionary to be and then you feed it sample points.
The implementation uses the Approximately Linearly Dependent metric described in the paper The Kernel Recursive Least Squares Algorithm by Yaakov Engel to decide which points are more linearly independent than others. The metric is simply the squared distance between a test point and the subspace spanned by the set of dictionary vectors.
Each time you present this object with a new sample point it calculates the projection distance and if it is sufficiently large then this new point is included into the dictionary. Note that this object can be configured to have a maximum size. Once the max dictionary size is reached each new point kicks out a previous point. This is done by removing the dictionary vector that has the smallest projection distance onto the others. That is, the "least linearly independent" vector is removed to make room for the new one.
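The projection distance used above can be computed from kernel evaluations alone. The following is a rough sketch under simplifying assumptions, not dlib's code: it re-solves a small linear system on every call instead of updating incrementally, it omits the maximum-size and removal logic, and all names (`projection_distance`, `maybe_add`, `solve`) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using sample = std::vector<double>;
using kernel_fn = double (*)(const sample&, const sample&);

double linear_kernel(const sample& a, const sample& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Solve A*x = b by Gaussian elimination with partial pivoting.
std::vector<double> solve(std::vector<std::vector<double>> A, std::vector<double> b) {
    const std::size_t n = b.size();
    for (std::size_t col = 0; col < n; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::abs(A[r][col]) > std::abs(A[pivot][col])) pivot = r;
        std::swap(A[col], A[pivot]);
        std::swap(b[col], b[pivot]);
        assert(std::abs(A[col][col]) > 1e-12);  // dictionary must be independent
        for (std::size_t r = col + 1; r < n; ++r) {
            const double f = A[r][col] / A[col][col];
            for (std::size_t c = col; c < n; ++c) A[r][c] -= f * A[col][c];
            b[r] -= f * b[col];
        }
    }
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = b[i];
        for (std::size_t c = i + 1; c < n; ++c) s -= A[i][c] * x[c];
        x[i] = s / A[i][i];
    }
    return x;
}

// Squared distance between x and the span of the dictionary in feature space:
//   delta = k(x,x) - k_x^T * K^{-1} * k_x
// where K is the dictionary's kernel matrix and k_x[i] = k(dict[i], x).
double projection_distance(const std::vector<sample>& dict, const sample& x, kernel_fn k) {
    const std::size_t n = dict.size();
    if (n == 0) return k(x, x);
    std::vector<std::vector<double>> K(n, std::vector<double>(n));
    std::vector<double> kx(n);
    for (std::size_t i = 0; i < n; ++i) {
        kx[i] = k(dict[i], x);
        for (std::size_t j = 0; j < n; ++j) K[i][j] = k(dict[i], dict[j]);
    }
    const std::vector<double> a = solve(K, kx);
    double delta = k(x, x);
    for (std::size_t i = 0; i < n; ++i) delta -= kx[i] * a[i];
    return delta;
}

// Add x to the dictionary only if it is sufficiently far from the current span.
bool maybe_add(std::vector<sample>& dict, const sample& x, kernel_fn k, double tolerance) {
    if (projection_distance(dict, x, k) <= tolerance) return false;
    dict.push_back(x);
    return true;
}
```

With the linear kernel, a dictionary holding (1,0) rejects (2,0) because it lies in the span already, while (1,1) has projection distance 1 and is accepted.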
Many learning algorithms attempt to minimize a function that, at a high level, looks like this:
f(w) == complexity + training_set_error
The idea is to find the set of parameters, w, that gives low error on your training data but also is not "complex" according to some particular measure of complexity. This strategy of penalizing complexity is usually called regularization.
In the above setting, all the training data consists of labeled samples. However, it would be nice to be able to benefit from unlabeled data. The idea of manifold regularization is to extract useful information from unlabeled data by first defining which data samples are "close" to each other (perhaps by using their 3 nearest neighbors) and then adding a term to the above function that penalizes any decision rule which produces different outputs on data samples which we have designated as being close.
It turns out that it is possible to transform these manifold regularized learning problems into the normal form shown above by applying a certain kind of preprocessing to all our data samples. Once this is done we can use a normal learning algorithm, such as the svm_c_linear_trainer, on just the labeled data samples and obtain the same output as the manifold regularized learner would have produced.
The linear_manifold_regularizer is a tool for creating this preprocessing transformation. In particular, the transformation is linear. That is, it is just a matrix you multiply with all your samples. For a more detailed discussion of this topic you should consult the following paper. In particular, see section 4.2. This object computes the inverse T matrix described in that section.
Linear Manifold Regularization for Large Scale Semi-supervised Learning by Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin
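As a rough illustration of the extra term described above, a manifold regularized objective for a linear decision rule can be written as follows. The squared-difference form of the penalty, the weight \gamma, and the edge set E (the pairs of samples designated as close) are notation assumed here for the sketch.

```latex
f(w) = \text{complexity}(w) + \text{training\_set\_error}(w)
       + \gamma \sum_{(i,j) \in E} \left( w^T x_i - w^T x_j \right)^2
```

The sum runs over labeled and unlabeled samples alike, which is how the unlabeled data influences the learned w.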
This object represents a multilayer perceptron network that is trained using the back propagation algorithm. The training algorithm also incorporates the momentum method. That is, each round of back propagation training also adds a fraction of the previous update. This fraction is controlled by the momentum term set in the constructor.
It is worth noting that an MLP is, in general, very inferior to modern kernel algorithms such as the support vector machine. So if you haven't tried any other techniques with your data you really should.
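The momentum rule mentioned above can be sketched for a flat weight vector as follows. This is a generic illustration rather than the code used by this object; `momentum_update`, `learning_rate`, and `previous_step` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One weight update with momentum:
//   step = -learning_rate * gradient + momentum * previous_step
// `previous_step` is overwritten so it can be passed to the next round.
void momentum_update(std::vector<double>& weights,
                     const std::vector<double>& gradient,
                     std::vector<double>& previous_step,
                     double learning_rate, double momentum) {
    assert(weights.size() == gradient.size());
    assert(weights.size() == previous_step.size());
    for (std::size_t i = 0; i < weights.size(); ++i) {
        const double step = -learning_rate * gradient[i] + momentum * previous_step[i];
        weights[i] += step;
        previous_step[i] = step;
    }
}
```

With a learning rate of 0.1, momentum of 0.5, and a constant gradient of 2, a weight starting at 1 moves to 0.8 on the first round and to 0.5 on the second, since the second step adds half of the first one.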
mlp_kernel_1:
This is implemented in the obvious way.
kernel_1a is a typedef for mlp_kernel_1
kernel_1a_c is a typedef for kernel_1a that checks its preconditions.
Modularity and community structure in networks by M. E. J. Newman.
In particular, this is a method for automatically clustering the nodes in a graph into groups. The method is able to automatically determine the number of clusters and does not have any parameters. In general, it is a very good clustering technique.
dlib contains a few "training post processing" algorithms (e.g. reduced and reduced2). These tools take in a trainer object, tell it to perform training, and then they take the output decision function and do some kind of post processing to it. The null_trainer_type object is useful because you can use it to run an already learned decision function through the training post processing algorithms by turning a decision function into a null_trainer_type and then giving it to a post processor.
Support Vector Machine Active Learning with Applications to Text Classification by Simon Tong and Daphne Koller.
This is a batch trainer object that is meant to wrap other batch trainer objects that create decision_function objects. It performs post processing on the output decision_function objects with the intent of representing the decision_function with fewer basis vectors.
It begins by performing the same post processing as the reduced_decision_function_trainer object but it also performs a global gradient based optimization to further improve the results. The gradient based optimization is implemented using the approximate_distance_function routine.
find w minimizing: 0.5*dot(w,w) + C*sum_i(y_i - trans(x_i)*w)^2
where (x_i,y_i) are training pairs, x_i is a vector, and y_i is a scalar target value.
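To make the objective concrete, the sketch below evaluates it for arbitrary vectors and gives the closed-form minimizer for the special case where each x_i is a single number. It is a batch illustration only, and the function names `objective` and `minimize_1d` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using sample = std::vector<double>;

double dot(const sample& a, const sample& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Evaluates 0.5*dot(w,w) + C*sum_i (y_i - dot(x_i,w))^2.
double objective(const std::vector<sample>& xs, const std::vector<double>& ys,
                 const sample& w, double C) {
    assert(xs.size() == ys.size());
    double error = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        const double r = ys[i] - dot(xs[i], w);
        error += r * r;
    }
    return 0.5 * dot(w, w) + C * error;
}

// Closed-form minimizer when every x_i is a single number.
// Setting the derivative  w - 2*C*sum_i x_i*(y_i - x_i*w)  to zero gives
//   w = 2*C*sum_i(x_i*y_i) / (1 + 2*C*sum_i(x_i^2))
double minimize_1d(const std::vector<double>& xs, const std::vector<double>& ys, double C) {
    assert(xs.size() == ys.size());
    double sxy = 0.0, sxx = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        sxy += xs[i] * ys[i];
        sxx += xs[i] * xs[i];
    }
    return 2.0 * C * sxy / (1.0 + 2.0 * C * sxx);
}
```

For the pairs (1,2) and (2,4) with C = 1 this gives w = 20/11, slightly below the unregularized slope of 2 because of the 0.5*dot(w,w) term.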
So for example, suppose you wanted to set the bias term so that the accuracy of your decision function on +1 labeled samples was 99%. To do this you would use an instance of this object declared as follows: roc_trainer_type<trainer_type>(your_trainer, 0.99, +1);
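One simple way to pick such a bias is to look at the decision values the function produces on the samples of the target class and choose a cut-off that the desired fraction of them clears. The sketch below shows that idea in isolation; it is an assumption about the general approach rather than a description of this object's internals, and `threshold_for_accuracy` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Returns a threshold t such that at least `desired_accuracy` of the given
// scores satisfy score >= t. `scores` holds the decision values of the
// samples from the class whose accuracy is being fixed.
double threshold_for_accuracy(std::vector<double> scores, double desired_accuracy) {
    assert(!scores.empty());
    assert(desired_accuracy > 0.0 && desired_accuracy <= 1.0);
    std::sort(scores.begin(), scores.end(), std::greater<double>());
    // Number of samples that must land on the correct side of the threshold.
    const std::size_t needed =
        static_cast<std::size_t>(std::ceil(desired_accuracy * scores.size()));
    return scores[needed - 1];
}
```

Subtracting the returned threshold from the decision function's output then makes the chosen fraction of those samples score at or above zero.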
Performs linear ridge regression and outputs a decision_function that represents the learned function. In particular, this object can only be used with the linear_kernel. It is optimized for the linear case where the number of features in each sample vector is small (i.e. on the order of 1000 or less, since the algorithm is cubic in the number of features). If you want to use a nonlinear kernel then you should use the krr_trainer.
This object is capable of automatically estimating its regularization parameter using leave-one-out cross-validation.
Trains a relevance vector machine for solving regression problems. Outputs a decision_function that represents the learned regression function.
The implementation of the RVM training algorithm used by this library is based on the following paper:
Tipping, M. E. and A. C. Faul (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop and B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, Jan 3-6.
Trains a relevance vector machine for solving binary classification problems. Outputs a decision_function that represents the learned classifier.
The implementation of the RVM training algorithm used by this library is based on the following paper:
Tipping, M. E. and A. C. Faul (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop and B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, Jan 3-6.
A Nonlinear Mapping for Data Structure Analysis (1969) by J.W. Sammon
Hidden Markov Support Vector Machines by Y. Altun, I. Tsochantaridis, T. Hofmann
Shallow Parsing with Conditional Random Fields by Fei Sha and Fernando Pereira
Internally, the sequence_segmenter uses the BIO (Begin, Inside, Outside) or BILOU (Begin, Inside, Last, Outside, Unit) sequence tagging model. Moreover, it is implemented using a sequence_labeler object and therefore sequence_segmenter objects are examples of chain structured conditional random field style sequence taggers.
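To illustrate the BIO scheme, the sketch below converts a list of segments into one tag per token. It is a standalone example with hypothetical names (`to_bio`, `segment`), not the sequence_segmenter's internal representation, and it assumes segments are half-open index ranges that do not overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Each segment is a half-open range [begin, end) of token indices.
using segment = std::pair<std::size_t, std::size_t>;

// Returns one character per token: 'B' for the first token of a segment,
// 'I' for the remaining tokens inside it, and 'O' for tokens outside any segment.
std::string to_bio(std::size_t num_tokens, const std::vector<segment>& segments) {
    std::string tags(num_tokens, 'O');
    for (const segment& s : segments) {
        assert(s.first < s.second && s.second <= num_tokens);
        tags[s.first] = 'B';
        for (std::size_t i = s.first + 1; i < s.second; ++i) tags[i] = 'I';
    }
    return tags;
}
```

For six tokens with segments [1,3) and [3,4), the tags are O B I B O O. The explicit B tag is what lets two adjacent segments be told apart. BILOU adds separate tags for the last token of a segment and for single-token segments.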
To elaborate, a graph labeling problem is a task to learn a binary classifier which predicts the label of each node in a graph. Additionally, we have information in the form of edges between nodes where edges are present when we believe the linked nodes are likely to have the same label. Therefore, part of a graph labeling problem is to learn to score each edge in terms of how strongly the edge should enforce labeling consistency between its two nodes.
Note that this is just a convenience wrapper around the structural_svm_graph_labeling_problem to make it look similar to all the other trainers in dlib. You might also consider reading the book Structured Prediction and Learning in Computer Vision by Sebastian Nowozin and Christoph H. Lampert since it contains a good introduction to machine learning methods such as the algorithm implemented by the structural_graph_labeling_trainer.
Note that this is just a convenience wrapper around the structural_svm_object_detection_problem to make it look similar to all the other trainers in dlib.
Note that this is just a convenience wrapper around the structural_svm_sequence_labeling_problem to make it look similar to all the other trainers in dlib.
This object internally uses the structural_sequence_labeling_trainer to solve the learning problem.
It learns the parameter vector by formulating the problem as a structural SVM problem. The general approach is similar to the method discussed in Learning to Localize Objects with Structured Output Regression by Matthew B. Blaschko and Christoph H. Lampert. However, the method has been extended to datasets with multiple, potentially overlapping, objects per image and the measure of loss is different from what is described in the paper.
Predicting Structured Objects with Support Vector Machines by Thorsten Joachims, Thomas Hofmann, Yisong Yue, and Chun-nam Yu
For a more detailed discussion of the particular algorithm implemented by this object see the following paper:
T. Joachims, T. Finley, Chun-Nam Yu, Cutting-Plane Training of Structural SVMs, Machine Learning, 77(1):27-59, 2009.
Note that this object is essentially a tool for solving the 1-Slack structural SVM with margin-rescaling. Specifically, see Algorithm 3 in the above referenced paper.
Structured Prediction and Learning in Computer Vision by Sebastian Nowozin and Christoph H. Lampert
Hidden Markov Support Vector Machines by Y. Altun, I. Tsochantaridis, T. Hofmann
While the particular optimization strategy used is the method from:
T. Joachims, T. Finley, Chun-Nam Yu, Cutting-Plane Training of Structural SVMs, Machine Learning, 77(1):27-59, 2009.
A Dual Coordinate Descent Method for Large-scale Linear SVM by Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin
This trainer has the ability to disable the bias term and also to force the last element of the learned weight vector to be 1. Additionally, it can be warm-started from the solution to a previous training run.
Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization by Vojtech Franc, Soren Sonnenburg; Journal of Machine Learning Research, 10(Oct):2157-2192, 2009.
This trainer has the ability to restrict the learned weights to non-negative values.
Trains a C support vector machine for solving binary classification problems and outputs a decision_function. It is implemented using the SMO algorithm.
The implementation of the C-SVM training algorithm used by this library is based on the following paper:
Trains a nu support vector machine for solving binary classification problems and outputs a decision_function. It is implemented using the SMO algorithm.
The implementation of the nu-svm training algorithm used by this library is based on the following excellent papers:
Trains a one-class support vector classifier and outputs a decision_function. It is implemented using the SMO algorithm.
The implementation of the one-class training algorithm used by this library is based on the following paper:
The implementation of the Pegasos algorithm used by this object is based on the following excellent paper:
Pegasos: Primal estimated sub-gradient solver for SVM (2007) by Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro. In ICML.
This SVM training algorithm has two interesting properties. First, the pegasos algorithm itself converges to the solution in an amount of time unrelated to the size of the training set (in addition to being quite fast to begin with). This makes it an appropriate algorithm for learning from very large datasets. Second, this object uses the kcentroid object to maintain a sparse approximation of the learned decision function. This means that the number of support vectors in the resulting decision function is also unrelated to the size of the dataset (in normal SVM training algorithms, the number of support vectors grows approximately linearly with the size of the training set).
However, if you are considering using svm_pegasos, you should also try the svm_c_linear_trainer for linear kernels or svm_c_ekm_trainer for non-linear kernels, since these other trainers are usually faster and easier to use than svm_pegasos.
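For intuition about why the cost per step does not depend on the dataset size, here is a minimal sketch of the basic Pegasos update for a linear SVM, processing one example per step. It follows the sub-gradient step from the cited paper but leaves out the optional projection step, the kernelized form, and the kcentroid-based sparsification used by svm_pegasos; the name `pegasos_step` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Pegasos step on a single example (x, y) with y in {-1, +1}.
// `t` is the 1-based iteration count and `lambda` the regularization strength.
// The step size is eta = 1 / (lambda * t).
void pegasos_step(std::vector<double>& w, const std::vector<double>& x, int y,
                  double lambda, std::size_t t) {
    assert(w.size() == x.size());
    assert((y == 1 || y == -1) && lambda > 0.0 && t >= 1);
    const double eta = 1.0 / (lambda * static_cast<double>(t));

    // Margin is measured with the weights from before this update.
    double margin = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) margin += w[i] * x[i];
    margin *= y;

    // Shrink the weights (the regularization part of the sub-gradient).
    for (double& wi : w) wi *= (1.0 - eta * lambda);

    // If the example violates the margin, move the weights toward it.
    if (margin < 1.0) {
        for (std::size_t i = 0; i < w.size(); ++i) w[i] += eta * y * x[i];
    }
}
```

Each step touches only one example, so the amount of work needed to reach a given accuracy is governed by the number of steps rather than by how many samples are available.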
Optimizing Search Engines using Clickthrough Data by Thorsten Joachims
Finally, note that the implementation of this object is done using the oca optimizer and count_ranking_inversions method. This means that it runs in O(n*log(n)) time, making it suitable for use with large datasets.
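The O(n*log(n)) bound comes from the fact that out-of-order pairs can be counted without comparing every pair directly. The sketch below shows the classic merge sort technique for counting inversions within a single sequence of scores; it illustrates the general idea only and is not the count_ranking_inversions routine, and the names `count_inversions` and `merge_count` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts v[lo, hi) and returns the number of pairs (i, j) with i < j and v[i] > v[j].
std::size_t merge_count(std::vector<double>& v, std::vector<double>& buf,
                        std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= v.size());
    if (hi - lo < 2) return 0;
    const std::size_t mid = lo + (hi - lo) / 2;
    std::size_t count = merge_count(v, buf, lo, mid) + merge_count(v, buf, mid, hi);
    std::size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        if (v[j] < v[i]) {
            // v[j] is smaller than every remaining element of the left half.
            count += mid - i;
            buf[k++] = v[j++];
        } else {
            buf[k++] = v[i++];
        }
    }
    while (i < mid) buf[k++] = v[i++];
    while (j < hi) buf[k++] = v[j++];
    for (k = lo; k < hi; ++k) v[k] = buf[k];
    return count;
}

// Counts inversions in O(n*log(n)) time. Takes a copy because it sorts in place.
std::size_t count_inversions(std::vector<double> v) {
    std::vector<double> buf(v.size());
    return merge_count(v, buf, 0, v.size());
}
```

For the sequence 3, 1, 2 there are two inversions: (3,1) and (3,2).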
This object implements a trainer for performing epsilon-insensitive support vector regression. It is implemented using the SMO algorithm, allowing the use of non-linear kernels. If you are interested in performing support vector regression with a linear kernel and you have a lot of training data then you should use the svr_linear_trainer which is highly optimized for this case.
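The term "epsilon-insensitive" refers to the loss function: prediction errors smaller than epsilon cost nothing, and larger errors are penalized linearly. A small sketch of that loss, with the hypothetical name `eps_insensitive_loss`:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Epsilon-insensitive loss: max(0, |target - prediction| - eps).
double eps_insensitive_loss(double target, double prediction, double eps) {
    assert(eps >= 0.0);
    return std::max(0.0, std::abs(target - prediction) - eps);
}
```

With eps = 0.1, a prediction of 1.05 for a target of 1.0 incurs no loss, while a prediction of 1.5 incurs a loss of about 0.4.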
The implementation of the eps-SVR training algorithm used by this object is based on the following paper:
Trains a probabilistic_function using some sort of binary classification trainer object such as the svm_nu_trainer or krr_trainer.
The probability model is created by using the technique described in the following papers:
Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods by John C. Platt. March 26, 1999
A Note on Platt's Probabilistic Outputs for Support Vector Machines by Hsuan-Tien Lin, Chih-Jen Lin, and Ruby C. Weng