A brief introduction and summary of MLlib
MLlib is the machine learning library for Apache Spark. It is a convenient tool that combines many commonly used machine learning and data analysis algorithms.
Let’s go through the algorithms MLlib currently provides.
1. Regression
Linear regression is one of the most common methods for regression, predicting the output variable as a linear combination of the features.
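To make the "linear combination of the features" concrete, here is a minimal sketch in plain Python. It does not use the MLlib API; the function names, step size, and toy data are assumptions chosen for illustration, and the fit uses simple batch gradient descent on squared error.

```python
def predict(weights, intercept, features):
    """Return intercept + sum_i weights[i] * features[i]."""
    return intercept + sum(w * x for w, x in zip(weights, features))


def fit_linear_regression(points, step_size=0.05, iterations=2000):
    """Fit weights by batch gradient descent on mean squared error.

    `points` is a list of (label, features) pairs.
    """
    n_features = len(points[0][1])
    weights = [0.0] * n_features
    intercept = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for label, features in points:
            error = predict(weights, intercept, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - step_size * g / len(points) for w, g in zip(weights, grad_w)]
        intercept -= step_size * grad_b / len(points)
    return weights, intercept


# Toy data generated from y = 2*x + 1 (hypothetical example values).
data = [(1.0, [0.0]), (3.0, [1.0]), (5.0, [2.0]), (7.0, [3.0])]
weights, intercept = fit_linear_regression(data)
```

On this toy data the learned weight and intercept should approach 2 and 1, the values used to generate the labels.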
2. Classification (Supervised)
2.1 Logistic regression (Binary Classification)
Logistic regression is a binary classification method that identifies a linear separating plane between positive and negative examples. In MLlib, it takes LabeledPoints with label 0 or 1 and returns a LogisticRegressionModel that can predict new points.
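The idea can be sketched in plain Python without Spark. This is only an illustration of the technique, not MLlib's implementation: the function names, the gradient-descent settings, and the toy data are assumptions. Points are (label, features) pairs with labels 0 or 1, mirroring the LabeledPoint convention described above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, intercept, features):
    """Probability that the point belongs to the positive class (label 1)."""
    return sigmoid(intercept + sum(w * x for w, x in zip(weights, features)))


def train_logistic_regression(points, step_size=0.5, iterations=500):
    """Batch gradient descent on the logistic (log) loss."""
    n_features = len(points[0][1])
    weights = [0.0] * n_features
    intercept = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for label, features in points:
            error = predict_probability(weights, intercept, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - step_size * g / len(points) for w, g in zip(weights, grad_w)]
        intercept -= step_size * grad_b / len(points)
    return weights, intercept


def predict_label(weights, intercept, features, threshold=0.5):
    """Apply a probability threshold to get a 0/1 prediction."""
    return 1 if predict_probability(weights, intercept, features) >= threshold else 0


# Toy one-dimensional data: negatives on the left, positives on the right.
training = [(0, [-2.0]), (0, [-1.0]), (1, [1.0]), (1, [2.0])]
weights, intercept = train_logistic_regression(training)
```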
2.2 Linear SVM (Binary Classification)
Support Vector Machines, or SVMs, are another binary classification method with linear separating planes, again expecting labels of 0 or 1. They are available through the SVMWithSGD class, with similar parameters to linear and logistic regression. The returned SVMModel uses a threshold for prediction like LogisticRegressionModel.
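As a rough illustration of how a linear SVM can be trained, the sketch below runs subgradient descent on the regularized hinge loss in plain Python. It is not the SVMWithSGD code; the function names, hyperparameters, and data are assumptions. Labels 0/1 are mapped to -1/+1 internally, and prediction thresholds the raw score at 0.

```python
def score(weights, intercept, features):
    return intercept + sum(w * x for w, x in zip(weights, features))


def train_linear_svm(points, step_size=0.1, reg_param=0.01, iterations=200):
    """Subgradient descent on hinge loss with L2 regularization."""
    n_features = len(points[0][1])
    weights = [0.0] * n_features
    intercept = 0.0
    for _ in range(iterations):
        grad_w = [reg_param * w for w in weights]
        grad_b = 0.0
        for label, features in points:
            y = 2 * label - 1  # map label 0/1 to -1/+1
            if y * score(weights, intercept, features) < 1:
                # The point violates the margin, so it contributes to the subgradient.
                for i, x in enumerate(features):
                    grad_w[i] -= y * x / len(points)
                grad_b -= y / len(points)
        weights = [w - step_size * g for w, g in zip(weights, grad_w)]
        intercept -= step_size * grad_b
    return weights, intercept


def predict_svm(weights, intercept, features, threshold=0.0):
    return 1 if score(weights, intercept, features) >= threshold else 0


# Toy two-dimensional data that is linearly separable.
training = [(0, [-2.0, -1.0]), (0, [-1.0, -2.0]), (1, [1.0, 2.0]), (1, [2.0, 1.0])]
weights, intercept = train_linear_svm(training)
```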
2.3 Naive Bayes (Multiclass Classification)
Naive Bayes is a multiclass classification algorithm that scores how well each point belongs in each class based on a linear function of the features. It is commonly used in text classification with TF-IDF features, among other applications. MLlib implements Multinomial Naive Bayes, which expects nonnegative frequencies (e.g., word frequencies) as input features.
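The "linear function of the features" is easiest to see in code. The following plain-Python sketch of multinomial Naive Bayes with additive smoothing is illustrative only; the function names and the tiny word-count data set are assumptions, and it expects every class to appear at least once in the training data.

```python
import math


def train_multinomial_nb(points, num_classes, smoothing=1.0):
    """Estimate log priors and per-class log feature probabilities."""
    n_features = len(points[0][1])
    class_counts = [0] * num_classes
    feature_sums = [[0.0] * n_features for _ in range(num_classes)]
    for label, features in points:
        class_counts[label] += 1
        for i, x in enumerate(features):
            feature_sums[label][i] += x
    log_priors = [math.log(count / len(points)) for count in class_counts]
    log_likelihoods = []
    for sums in feature_sums:
        total = sum(sums) + smoothing * n_features
        log_likelihoods.append([math.log((s + smoothing) / total) for s in sums])
    return log_priors, log_likelihoods


def predict_nb(model, features):
    """Score each class with a linear function of the features; return the best."""
    log_priors, log_likelihoods = model
    scores = [
        prior + sum(ll * x for ll, x in zip(likelihoods, features))
        for prior, likelihoods in zip(log_priors, log_likelihoods)
    ]
    return max(range(len(scores)), key=lambda c: scores[c])


# Hypothetical word counts over a three-word vocabulary, three classes.
training = [
    (0, [3, 0, 0]), (0, [2, 1, 0]),
    (1, [0, 3, 0]), (1, [0, 2, 1]),
    (2, [0, 0, 3]), (2, [1, 0, 2]),
]
model = train_multinomial_nb(training, num_classes=3)
```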
3. Regression and Classification (Decision Trees and Ensemble Methods)
3.1 Decision trees
Decision trees are a flexible model that can be used for both classification and regression. They represent a tree of nodes, each of which makes a binary decision based on a feature of the data (e.g., is a person’s age greater than 20?), and where the leaf nodes in the tree contain a prediction (e.g., is the person likely to buy a product?). Decision trees are attractive because the models are easy to inspect and because they support both categorical and continuous features.
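A small sketch may help show what "a tree of nodes" means at prediction time. The tree below is built by hand rather than trained, and the dictionary layout, the feature names (age, income), and the thresholds are assumptions for illustration only; MLlib's own model classes are structured differently.

```python
def tree_predict(node, features):
    """Walk from the root to a leaf, taking one binary decision per node."""
    while "prediction" not in node:
        if features[node["feature"]] > node["threshold"]:
            node = node["right"]
        else:
            node = node["left"]
    return node["prediction"]


# Hand-built tree: is age > 20? If so, is income > 30000?
# Leaves hold the prediction (1 = likely to buy, 0 = not likely).
tree = {
    "feature": "age",
    "threshold": 20,
    "left": {"prediction": 0},
    "right": {
        "feature": "income",
        "threshold": 30000,
        "left": {"prediction": 0},
        "right": {"prediction": 1},
    },
}

likely_buyer = tree_predict(tree, {"age": 35, "income": 50000})
unlikely_buyer = tree_predict(tree, {"age": 18, "income": 90000})
```

Because each node is a single readable test, printing or inspecting such a structure shows exactly why a prediction was made.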
3.2 Random forests
Random forests are ensembles of decision trees. In MLlib, training a random forest returns a WeightedEnsembleModel that contains several trees (in the weakHypotheses field, weighted by weakHypothesisWeights) and can predict() an RDD or Vector. It also includes a toDebugString to print all the trees. A single decision tree can be viewed as a random forest that contains just one tree.
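How an ensemble of weighted trees produces one prediction can be sketched as a weighted vote. This plain-Python example is an assumption-laden illustration, not MLlib's code: the trees are hand-built one-split stumps, and the function names and weights are hypothetical.

```python
from collections import defaultdict


def tree_predict(node, features):
    """Follow binary decisions until a leaf with a prediction is reached."""
    while "prediction" not in node:
        if features[node["feature"]] > node["threshold"]:
            node = node["right"]
        else:
            node = node["left"]
    return node["prediction"]


def forest_predict(trees, tree_weights, features):
    """Return the class with the largest total weight across all trees."""
    votes = defaultdict(float)
    for tree, weight in zip(trees, tree_weights):
        votes[tree_predict(tree, features)] += weight
    return max(votes, key=votes.get)


def stump(feature, threshold):
    """Hypothetical helper: a one-split tree predicting 1 above the threshold."""
    return {
        "feature": feature,
        "threshold": threshold,
        "left": {"prediction": 0},
        "right": {"prediction": 1},
    }


trees = [stump(0, 0.5), stump(1, 0.5), stump(0, 2.0)]
tree_weights = [1.0, 1.0, 1.0]
```

With a single tree in the list, the vote reduces to that tree's own prediction.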
3.3 Gradient-Boosted Trees (GBT)
Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.
MLlib supports GBTs for binary classification and for regression, using both continuous and categorical features. MLlib implements GBTs using the existing decision trees implementation. Please see the decision tree guide for more information on trees.
Note: GBTs do not yet support multiclass classification. For multiclass problems, please use decision trees or random forests.
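The iterative training can be sketched for the regression case with squared loss, where each new tree is fit to the residuals of the current ensemble. This is a plain-Python illustration under stated assumptions, not MLlib's implementation: the trees are one-split stumps on a single feature, and the function names, learning rate, and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbt(xs, ys, num_iterations=50, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(num_iterations):
        residuals = [y - p for y, p in zip(ys, predictions)]
        new_stump = fit_stump(xs, residuals)
        stumps.append(new_stump)
        predictions = [
            p + learning_rate * stump_predict(new_stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, stumps, learning_rate


def gbt_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Toy step-shaped data (hypothetical values).
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 2.0, 2.0, 5.0, 5.0]
model = fit_gbt(xs, ys)
```

For squared loss the residuals are the negative gradient of the loss, which is why fitting trees to residuals counts as gradient boosting.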
4. Clustering (Unsupervised)
Clustering is the unsupervised learning task that involves grouping objects into clusters of high similarity. Unlike the supervised tasks seen before, where data is labeled, clustering can be used to make sense of unlabeled data. It is commonly used in data exploration (to find what a new dataset looks like) and in anomaly detection (to identify points that are far from any cluster).
4.1 K-means
MLlib includes the popular K-means algorithm for clustering, as well as a variant called K-means|| that provides better initialization in parallel environments. K-means|| is similar to the K-means++ initialization procedure often used in single-node settings.
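The core K-means loop (assign each point to its closest center, then move each center to the mean of its points) can be sketched in plain Python. This sketch takes the initial centers as an argument and does not implement the K-means|| or K-means++ initialization; the function names and data are assumptions for illustration.

```python
def closest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans(points, initial_centers, max_iterations=20):
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iterations):
        # Assignment step: group points by their closest center.
        clusters = [[] for _ in centers]
        for point in points:
            clusters[closest_center(point, centers)].append(point)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Two well-separated groups of points (hypothetical values).
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers = kmeans(points, initial_centers=[[0, 0], [10, 10]])
```

The anomaly-detection use mentioned earlier follows from the same distance computation: a point whose distance to its closest center is unusually large is a candidate outlier.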
Collaborative Filtering and Recommendation
Collaborative filtering is a technique for recommender systems wherein users’ ratings and interactions with various products are used to recommend new ones. Collaborative filtering is attractive because it only needs to take in a list of user/product interactions: either “explicit” interactions (e.g., ratings on a shopping site) or “implicit” ones (e.g., a user browsed a product page but did not rate the product). Based solely on these interactions, collaborative filtering algorithms learn which products are similar to each other (because the same users interact with them) and which users are similar to each other, and can make new recommendations. While the MLlib API talks about “users” and “products,” you can also use collaborative filtering for other applications, such as recommending users to follow on a social network, tags to add to an article, or songs to add to a radio station.
MLlib implements collaborative filtering with Alternating Least Squares (ALS). ALS works by determining a feature vector for each user and product, such that the dot product of a user’s vector and a product’s vector is close to their score.
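To show the alternating idea, here is a deliberately simplified plain-Python sketch in which each feature vector has a single component (rank 1), so the dot product is just a product of two numbers. It is not MLlib's ALS; the function name, regularization value, and ratings are assumptions. Each half-step holds one side fixed and solves the least-squares problem for the other side in closed form.

```python
def als_rank1(ratings, num_users, num_products, iterations=50, reg=1e-6):
    """Rank-1 alternating least squares on a dict {(user, product): rating}."""
    user_factors = [1.0] * num_users
    product_factors = [1.0] * num_products
    for _ in range(iterations):
        # Fix the product factors and solve for each user factor.
        for u in range(num_users):
            rated = [(p, r) for (user, p), r in ratings.items() if user == u]
            numerator = sum(r * product_factors[p] for p, r in rated)
            denominator = reg + sum(product_factors[p] ** 2 for p, r in rated)
            user_factors[u] = numerator / denominator
        # Fix the user factors and solve for each product factor.
        for p in range(num_products):
            rated = [(u, r) for (u, product), r in ratings.items() if product == p]
            numerator = sum(r * user_factors[u] for u, r in rated)
            denominator = reg + sum(user_factors[u] ** 2 for u, r in rated)
            product_factors[p] = numerator / denominator
    return user_factors, product_factors


# Hypothetical ratings built as (user strength) * (product strength),
# with user strengths [1, 2] and product strengths [1, 2, 3].
# The rating for user 1 and product 2 is held out.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
user_factors, product_factors = als_rank1(ratings, num_users=2, num_products=3)
predicted = user_factors[1] * product_factors[2]
```

On this toy data the held-out prediction should come out close to 6, the value implied by the pattern in the observed ratings.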
5. Dimensionality Reduction
5.1 PCA
The main technique for dimensionality reduction used by the machine learning community is principal component analysis (PCA). In this technique, the mapping to the lower-dimensional space is done such that the variance of the data in the low-dimensional representation is maximized, thus ignoring non-informative dimensions. To compute the mapping, the normalized correlation matrix of the data is constructed and the singular vectors and values of this matrix are used. The singular vectors that correspond to the largest singular values are used to reconstruct a large fraction of the variance of the original data.
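A small plain-Python sketch can illustrate the idea of finding the direction of maximum variance. It makes two simplifying assumptions that differ from the description above: it uses the sample covariance matrix of the centered data, and it finds only the top component, using power iteration instead of a full decomposition. The function names and data are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the mean-centered data."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(matrix, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    vector = [1.0] * len(matrix)
    for _ in range(iterations):
        product = [sum(a * v for a, v in zip(row, vector)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return vector


# Points scattered close to the line y = x (hypothetical values).
rows = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
component = top_component(covariance_matrix(rows))
# Project each point onto the component to get a one-dimensional representation.
projected = [sum(x * c for x, c in zip(row, component)) for row in rows]
```

Since the points lie near the line y = x, the component should point roughly along (1, 1), and the one-dimensional projection keeps most of the variance.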
5.2 SVD
MLlib also provides the lower-level singular value decomposition (SVD) primitive. The SVD factorizes an m × n matrix A into three matrices A ≈ UΣVᵀ, where:
• U is an orthonormal matrix, whose columns are called left singular vectors.
• Σ is a diagonal matrix with nonnegative diagonals in descending order, whose diagonals are called singular values.
• V is an orthonormal matrix, whose columns are called right singular vectors.
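The factorization can be checked by hand on a small matrix. The plain-Python sketch below does not compute an SVD; it takes a 2 × 2 example whose factors were worked out analytically (an assumption of this illustration, as are the helper names) and verifies that the three matrices multiply back to A and that the columns of U are orthonormal.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


s2 = math.sqrt(2)
s10 = math.sqrt(10)

A = [[3.0, 0.0], [4.0, 5.0]]

# Columns of U are the left singular vectors.
U = [[1 / s10, 3 / s10], [3 / s10, -1 / s10]]
# Singular values on the diagonal, in descending order.
Sigma = [[math.sqrt(45), 0.0], [0.0, math.sqrt(5)]]
# Columns of V are the right singular vectors.
V = [[1 / s2, 1 / s2], [1 / s2, -1 / s2]]

reconstructed = matmul(matmul(U, Sigma), transpose(V))
gram_u = matmul(transpose(U), U)  # should be the identity matrix
```

Keeping only the largest singular value and its two vectors would give the best rank-1 approximation of A, which is the sense in which SVD supports dimensionality reduction.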
References:
1. https://spark.apache.org/docs/1.2.1/mllib-guide.html
2. Holden Karau et al., Learning Spark: Lightning-Fast Big Data Analysis, O’Reilly, 2014.