Study Apache Spark MLlib on IPython—Linear Regression

Hi guys,

In the previous lecture, we got a basic understanding of which algorithms MLlib provides. Now I will gradually share my practice with the algorithms in each category of functionality (regression, classification, clustering…) in MLlib.

Let’s start with linear regression and set the goals for this mini project.

a). Analyze the bill-tips dataset with seaborn’s linear-fit plot function

b). Use the scikit-learn linear regression model to train on the dataset and predict from it

c). Apply three different MLlib linear regression models to the sample dataset and calculate the MSE (mean squared error)
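Since the MSE is the yardstick for the whole project, here is a minimal sketch of how it is computed: the average of the squared differences between the true and the predicted values. The function name and the three sample values are made up for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Illustrative values only: the errors are -0.5, 0.0 and 1.0.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(mse)
```

A lower MSE means the predictions sit closer to the observed values, and because the errors are squared, a few large misses weigh more than many small ones.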

1. Introducing several commonly used Python packages

I. Data analysis

(1) NumPy

NumPy is the fundamental package for scientific computing with Python. It provides:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities
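To make two of these bullets concrete, here is a small sketch of broadcasting and of the linear algebra routines. The bill and tip numbers are invented so that the answer is easy to check by eye; they are not taken from the dataset.

```python
import numpy as np

# Illustrative values only: every tip is exactly 15% of its bill.
bills = np.array([10.0, 20.0, 30.0, 40.0])
tips = np.array([1.5, 3.0, 4.5, 6.0])

# Broadcasting: element-wise division, and a scalar subtracted from every element.
tip_rate = tips / bills
centered = bills - bills.mean()

# Linear algebra: least-squares fit of tip = slope * bill + intercept.
design = np.column_stack([bills, np.ones_like(bills)])
(slope, intercept), *_ = np.linalg.lstsq(design, tips, rcond=None)
print(tip_rate, slope, intercept)
```

This is the same least-squares idea that the regression models later in this post are built on, only solved directly on a tiny array.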

(2) Pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas has the concept of a DataFrame, which makes it a natural fit for Spark’s DataFrame and RDD (Resilient Distributed Dataset) abstractions.
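As a quick taste of the DataFrame idea, the sketch below builds a tiny table and derives a new column from two existing ones. The column names and the four rows are placeholders I chose for illustration, not rows from the real dataset.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.5, 3.0, 4.0, 6.5],
})

# Column arithmetic is vectorized: no explicit loop is needed.
df["tip_pct"] = df["tip"] / df["total_bill"] * 100

print(df)
print(df.describe())
```

On the Spark side, a pandas DataFrame can be handed to `createDataFrame`, and a Spark DataFrame can be pulled back with `toPandas()`, which is what makes the two convenient to use together on small data.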

II. Visualization and Plotting

(3) matplotlib

matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

(4) Seaborn

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

III. Machine Learning

(5) Scikit-learn

Machine learning in Python. You can regard it as the single-machine counterpart of MLlib, although MLlib has fewer algorithms integrated for now. Since our focus is MLlib, we will use scikit-learn only as a baseline for comparison.

IV. Import the packages

If any of them is missing, install it with the Synaptic package manager or with sudo pip install <package_name>.
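If you are not sure which ones are missing, a small check like the one below can tell you before any import fails. It is only a sketch: the `REQUIRED` list and the helper name are my own, and note that the import name can differ from the pip name (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names of the packages introduced above (assumed list).
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn"]

def missing_packages(names):
    """Return the names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All packages found.")
```

Anything reported as not installed can then be installed with the pip command above, using the pip name of the package.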

2. Programming Guide for linear regression in MLlib

3. We are using the seaborn dataset

4. Let’s analyze it using pure Python packages first.

I will post my IPython notebook here. The source code can be found on my GitHub.
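As a rough sketch of the scikit-learn step (goal b), the snippet below fits a linear model and scores it on a held-out split. It uses synthetic bill and tip values as a stand-in so that it runs without downloading anything; the coefficients 1.0 and 0.1, the noise level, and the 75/25 split are assumptions for illustration, not results from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): tip = 1.0 + 0.1 * bill + noise.
rng = np.random.default_rng(0)
total_bill = rng.uniform(5.0, 50.0, size=200)
tip = 1.0 + 0.1 * total_bill + rng.normal(0.0, 0.5, size=200)

# scikit-learn expects a 2D feature matrix, so reshape the single feature.
X = total_bill.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, tip, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

slope = model.coef_[0]
intercept = model.intercept_
mse = mean_squared_error(y_test, y_pred)
print(slope, intercept, mse)
```

With the real data, the two arrays would simply be replaced by the bill and tip columns of the DataFrame.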





A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate. It looks like our trained model is good and that a linear regression model suits this data.
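To show what goes into such a plot, here is a minimal sketch that computes the residuals of a straight-line fit by hand. The data is synthetic (an assumed linear relation plus noise), so the numbers are illustrative only.

```python
import numpy as np

# Synthetic stand-in data (assumption): a linear trend plus random noise.
rng = np.random.default_rng(1)
x = rng.uniform(5.0, 50.0, size=100)
y = 1.0 + 0.1 * x + rng.normal(0.0, 0.5, size=100)

# np.polyfit returns the coefficients from the highest degree down.
slope, intercept = np.polyfit(x, y, deg=1)

# Residual = observed value minus fitted value.
residuals = y - (slope * x + intercept)

mean_resid = residuals.mean()
corr = np.corrcoef(x, residuals)[0, 1]
print(mean_resid, corr)
```

For a least-squares line with an intercept, the residuals always average to zero and are uncorrelated with x, so those two numbers alone cannot confirm that the model is right. What the plot adds is the shape: a curve or a funnel in the residuals is the sign that a straight line is not enough. seaborn's `residplot` draws this kind of plot directly.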

Now let’s start on the linear regression model in PySpark. (refer to here)
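A minimal sketch of that step, assuming the RDD-based `pyspark.mllib` API (`LabeledPoint` and `LinearRegressionWithSGD`, which is deprecated since Spark 2.0) and an already running SparkContext. The helper name, the `step` value and the iteration count are my assumptions; unscaled bill amounts usually need a small step size or SGD diverges.

```python
def train_and_score(sc, bills, tips, step=0.001, iterations=100):
    """Train MLlib's SGD linear regression on (bill, tip) pairs.

    Returns the fitted model and its MSE on the training data.
    `sc` is an existing SparkContext; `step` and `iterations` are guesses.
    """
    # Imported inside the function so the definition loads without Spark installed.
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    # Each LabeledPoint holds the label (tip) and a feature vector ([bill]).
    points = sc.parallelize(
        [LabeledPoint(float(t), [float(b)]) for b, t in zip(bills, tips)]
    ).cache()

    model = LinearRegressionWithSGD.train(
        points, iterations=iterations, step=step, intercept=True
    )

    # Compare each label with the model's prediction and average the squared errors.
    labels_and_preds = points.map(lambda p: (p.label, model.predict(p.features)))
    mse = labels_and_preds.map(lambda lp: (lp[0] - lp[1]) ** 2).mean()
    return model, mse
```

In a PySpark shell or notebook, where `sc` is already defined, a call could look like `model, mse = train_and_score(sc, tips_df["total_bill"], tips_df["tip"])`, assuming a pandas DataFrame named `tips_df` with those two columns. Note that this scores the model on the same rows it was trained on.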


Let’s compare three different linear regression models with the regularization set differently.
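To see what the regularization setting changes, here is a plain NumPy sketch, not MLlib's implementation. It assumes the three variants are no penalty, an L1 penalty and an L2 penalty, fits each with batch gradient descent on synthetic standardized data, and reports the training MSE. The data, the step size and the penalty strength 0.1 are all made up for illustration.

```python
import numpy as np

def fit_gd(x, y, reg_type=None, reg_param=0.0, step=0.1, iterations=2000):
    """Fit y ~ w * x + b by batch gradient descent, optionally penalizing w."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        err = w * x + b - y
        grad_w = np.mean(err * x)
        grad_b = np.mean(err)
        if reg_type == "l2":
            grad_w += reg_param * w            # gradient of (reg_param / 2) * w**2
        elif reg_type == "l1":
            grad_w += reg_param * np.sign(w)   # subgradient of reg_param * |w|
        w -= step * grad_w
        b -= step * grad_b
    return w, b

# Synthetic stand-in data (assumption), standardized so one step size works.
rng = np.random.default_rng(0)
raw = rng.uniform(5.0, 50.0, size=100)
x = (raw - raw.mean()) / raw.std()
y = 3.0 + 1.2 * x + rng.normal(0.0, 0.5, size=100)

results = {}
for name, reg_type in [("none", None), ("l1", "l1"), ("l2", "l2")]:
    w, b = fit_gd(x, y, reg_type=reg_type, reg_param=0.1)
    mse = float(np.mean((w * x + b - y) ** 2))
    results[name] = (w, b, mse)
    print(name, results[name])
```

Both penalties pull the weight toward zero, so the unpenalized fit always has the lowest training MSE. The penalized fits only pay off when they generalize better to unseen data.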


The final results are pretty similar and all three models fit well, most likely because the dataset is very small, with only a few hundred rows. Meanwhile, we have to notice that the MLlib linear regression model here simply took the entire Y dataset (the tips) as input, without partitioning it into training and testing sets, so the MSE above is a training error.

Congratulations! We have coded our first machine learning program in Spark with MLlib! Next time, let’s move on to some classification models.



