Scikit-learn Complete Walk Through (for R-Users)

January 2021 · 9 minute read

Python is one of the most used languages, if not the most used, among data scientists interested in machine learning and deep learning techniques.

One reason for Python’s popularity is without a doubt scikit-learn. scikit-learn does not need an introduction, but in case you are new to the machine learning space in Python: scikit-learn is an all-in-one machine learning library for Python that implements most machine learning techniques you will need in daily use and provides a consistent API and framework to work with.

As the name of the post implies, this walk-through is partially intended for users coming from R who already have experience with one or both of scikit-learn’s counterparts in the R world, namely caret and mlr3.

So without further ado, let’s get started :)

Survival Chances on the Titanic

For this walk-through I will use one of the most widely used datasets in machine learning: the Titanic dataset.

Firstly, because it’s a classic, but more importantly, it is small, readily available, and contains numeric features, string features and missing values. If you want to follow along, you can find the dataset here.
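
If you prefer to skip the manual download, scikit-learn can also fetch a copy from OpenML. This is just a sketch, assuming the OpenML "titanic" dataset (version 1) matches the CSV used below:

# Assumption: the OpenML "titanic" dataset (version 1) is equivalent to the local CSV
from sklearn.datasets import fetch_openml
titanic_bunch = fetch_openml("titanic", version=1, as_frame=True)
df_openml = titanic_bunch.frame  # pandas DataFrame including the 'survived' target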

# Load autoreload module 
# autoreload 1: reload modules imported with %aimport every time before execution
# autoreload 2: reload all modules
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
# Import Libs
import os
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
# Import dataset
df_titanic = pd.read_csv("../../data/titanic.csv", sep=",", na_values="?")
df_titanic.head()
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135.0 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON

So, now that we have our dataframe let’s take a quick look at some summary statistics and plots:

df_titanic.dtypes
pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object
# Note: by default, only numeric attributes are described and NaN values excluded!
df_titanic.describe()
pclass survived age sibsp parch fare body
count 1309.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000 121.000000
mean 2.294882 0.381971 29.881135 0.498854 0.385027 33.295479 160.809917
std 0.837836 0.486055 14.413500 1.041658 0.865560 51.758668 97.696922
min 1.000000 0.000000 0.166700 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 21.000000 0.000000 0.000000 7.895800 72.000000
50% 3.000000 0.000000 28.000000 0.000000 0.000000 14.454200 155.000000
75% 3.000000 1.000000 39.000000 1.000000 0.000000 31.275000 256.000000
max 3.000000 1.000000 80.000000 8.000000 9.000000 512.329200 328.000000
df_titanic.describe(include=[object])
name sex ticket cabin embarked boat home.dest
count 1309 1309 1309 295 1307 486 745
unique 1307 2 929 186 3 27 369
top Kelly, Mr. James male CA. 2343 C23 C25 C27 S 13 New York, NY
freq 2 843 11 6 914 39 64
# Check for NaN values per column
df_titanic.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64
px.histogram(df_titanic, x="age", color="survived", title="Histogram of Age by Survival")

(Figure: histogram of age by survival)

# Since pandas 0.25 you can aggregate and rename columns in one go (named aggregation):
df_gd = df_titanic.groupby(["sex"], as_index=False). \
                    agg(
                        n_survived=("survived", "sum"),
                        n_sex=("sex", "count")
                       )
# I really like plotly.express, but unfortunately it is not as powerful as ggplot yet
# Converting between discrete and numeric colors for example can only be done by recoding the
# underlying column (https://plotly.com/python/discrete-color/)
df_gd['prop_survived'] = df_gd.n_survived/df_gd.n_sex
px.bar(df_gd, x = "sex", y="prop_survived", title="Survival Probability by Gender")

(Figure: survival probability by gender)

So, from these initial plots it seems that being male does not really help your survival chances. Now that we have a rough idea of what the dataset looks like, let’s start preparing the data for modeling.

Data Preparation

Unlike R, Python does not have a dataframe as a native data structure; dataframes were added later by the pandas library and nowadays also by datatable. However, when scikit-learn was originally designed in 2007, pandas was not available yet (it was first released in 2008). This is probably one of the main reasons why scikit-learn takes some getting used to for R converts who are used to working with dataframes: scikit-learn expects all inputs in the form of numpy arrays (though it will automatically convert dataframes as long as they contain only numeric data).
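
To make this concrete, here is a minimal sketch (not part of the original workflow, reusing the LogisticRegression import from above): an estimator happily accepts a purely numeric dataframe, while raw string columns have to be encoded first.

# Minimal sketch: a purely numeric DataFrame is converted to a numpy array internally
X_numeric = df_titanic[["pclass", "age", "fare"]].fillna(0)  # numeric columns only
LogisticRegression(max_iter=1000).fit(X_numeric, df_titanic["survived"])
# Passing raw string columns such as "sex" without encoding them first would raise a ValueError.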

Practically, this used to be a bit of a headache when you had to work with categorical string data, as you needed to transform strings into a numeric representation manually. As a result, there are lots and lots of different tutorials online on how to best convert strings to work with scikit-learn.

I will show one way based on pandas and one using scikit-learn. But let’s get rid of some columns first. Passenger name, ticket number and home.dest might be slightly correlated with survival, but probably not. I don’t know what “boat” and “body” represent and could not find them in the docs, so to make sure we are not accidentally including feature leaks, such as which lifeboat a person was on, it is better to exclude them.

df_titanic.drop(["name", "ticket", "boat", "body", "home.dest"], axis=1, inplace=True)
# Let's grab only the deck part of the cabin to simplify things:
# (Note: missing cabins become the string "nan" and thus end up in their own "n" category.)
df_titanic['cabin'] = df_titanic.cabin.astype(str)
df_titanic['cabin'] = df_titanic.cabin.apply(lambda s: s[0])

# cat_vars = df_titanic.columns[df_titanic.dtypes == object].tolist()
cat_vars = ['pclass', 'sex', 'sibsp', 'parch', 'cabin', 'embarked']
num_vars = ['fare', 'age']
# You can also use DataFrame.select_dtypes(...)
# list(df_titanic.select_dtypes(include=['object']).columns)
# 1) get_dummies() from pandas
# Attention: dummy_na=True adds a dummy NaN column even for columns that contain no NaN values!
df_m = pd.get_dummies(df_titanic, columns=cat_vars, drop_first=False, dummy_na=True)

Using get_dummies() on a pandas dataframe can be dangerous when features change between training/test and prediction time, because scikit-learn does not know about column names; it only considers column positions!
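
One way to guard against this (a sketch, not from the original post; new_batch is just a stand-in for unseen prediction-time data) is to reindex the dummy columns of new data to the columns seen during training:

# Sketch: align dummy columns between training data and new data
train_part = df_titanic.iloc[:1000]
new_batch = df_titanic.iloc[1000:]   # hypothetical stand-in for prediction-time data

train_dummies = pd.get_dummies(train_part, columns=cat_vars)
new_dummies = pd.get_dummies(new_batch, columns=cat_vars)

# Reindex to the training columns: dummies missing in the new batch become 0,
# dummies that only appear in the new batch are dropped.
new_dummies = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)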

Let’s use scikit-learn for our encoding:

oh_enc = OneHotEncoder(sparse=False, categories="auto", drop=None, handle_unknown='error')
X_oh = oh_enc.fit_transform(df_titanic[cat_vars])

# OneHotEncoder only includes a dummy NaN column for columns that actually contain NaN!
oh_enc.categories_

# Combine numeric and categorical features:
X = np.hstack([df_titanic[num_vars], X_oh])

Using the OneHotEncoder was a bit more work than using get_dummies(), but one huge benefit we get is that it integrates neatly into scikit-learn’s API. Using pipelines, we can do the following:

num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='median')),
                         ('std_scaler', StandardScaler())
                        ])
# Alternatively, a shorter way to construct a pipeline is:
num_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
# Take a look at the pipeline steps:
num_pipeline.steps
# You can access steps by name:
num_pipeline.named_steps["simpleimputer"]

cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                         ('onehot', OneHotEncoder(sparse=False, handle_unknown="ignore"))
                        ])

preprocessing = ColumnTransformer([('num', num_pipeline, num_vars),
                                   ('cat', cat_pipeline, cat_vars)
                                  ])
# Again, you can use the shortcut make_column_transformer(), similar to make_pipeline() (see the sketch below)

model_logreg = make_pipeline(preprocessing, LogisticRegression())
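
For completeness, the make_column_transformer() shortcut mentioned in the comment above would look roughly like this (a sketch; it is equivalent to the explicit ColumnTransformer except that the transformer names are auto-generated):

from sklearn.compose import make_column_transformer

# Equivalent to the ColumnTransformer above, but with auto-generated step names
# ('pipeline-1', 'pipeline-2') instead of 'num' and 'cat':
preprocessing_short = make_column_transformer((num_pipeline, num_vars),
                                              (cat_pipeline, cat_vars))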

A quick note here: if you use a OneHotEncoder in cross-validation, you need to either specify the categories beforehand or allow it to ignore unknown values. Otherwise you will quite likely run into errors, because your training folds might not be big enough to contain all categories found in the corresponding validation folds.
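
Specifying the categories up front could look like this (a sketch, not part of the original workflow): derive the full category list per column from the complete dataset and pass it to the encoder, so every fold encodes the same columns.

# Sketch: fix the categories so all CV folds produce identical dummy columns
categories = [sorted(df_titanic[col].dropna().unique()) for col in cat_vars]
oh_enc_fixed = OneHotEncoder(sparse=False, categories=categories, handle_unknown="error")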

If you are interested in running feature selection in a pipeline, check out this article from the scikit-learn docs.
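
As a small taste of what that looks like, a feature-selection step can simply be added as another stage of the pipeline. Here is a sketch using SelectKBest (not part of the original workflow; k=10 is an arbitrary choice):

from sklearn.feature_selection import SelectKBest, f_classif

# Sketch: preprocessing -> univariate feature selection -> classifier
model_fs = make_pipeline(preprocessing,
                         SelectKBest(score_func=f_classif, k=10),
                         LogisticRegression())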

y = df_titanic.survived.values
X = df_titanic.drop(["survived"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Fit the model including the pipeline:
model_logreg.fit(X_train, y_train)
print("Logistic Regression score: {}".format(model_logreg.score(X_test, y_test)))
Logistic Regression score: 0.801829268292683

Pipelines really shine when they are combined with, for example, grid search, because we can then search for the optimal parameters over the entire pipeline:

# By default make_pipeline() uses the lowercased class name as name.
# Again, we can use model_logreg.steps to check the names we have to use:
param_grid = {'columntransformer__num__simpleimputer__strategy': ['median', 'mean'],
              'logisticregression__C': [0.01, 0.1, 1, 10, 100]
             }

grid = GridSearchCV(model_logreg, param_grid, cv=10, scoring='roc_auc', verbose=1, error_score="raise")
grid.fit(X_train, y_train)
grid.best_score_
# Display results table
# pd.DataFrame(grid.cv_results_).T
Fitting 10 folds for each of 10 candidates, totalling 100 fits

0.8416663071613459

Let’s take a quick look at the results from our grid-search:

print("Best estimator:\n{}".format(grid.best_estimator_))
Best estimator:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['fare', 'age']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['pclass', 'sex', 'sibsp',
                                                   'parch', 'cabin',
                                                   'embarked'])])),
                ('logisticregression', LogisticRegression(C=1))])
print("Best parameters:\n{}".format(grid.best_params_))
Best parameters:
{'columntransformer__num__simpleimputer__strategy': 'mean', 'logisticregression__C': 1}
print("Best cross-validation score (AUC):{}".format(grid.best_score_))
print("Test set AUC: {}".format(roc_auc_score(y_true=y_test, y_score=grid.best_estimator_.predict(X_test))))
print("Test set best score (default=accuracy): {}".format(grid.best_estimator_.score(X_test, y_test)))
Best cross-validation score (AUC):0.8416663071613459
Test set AUC: 0.775
Test set best score (default=accuracy): 0.801829268292683
# Run the model on our test data with the best pipeline:
print(grid.best_estimator_.predict(X_test)[0:5])
# Return confidence scores for samples:
# From Sklearn docs:
# The confidence score for a sample is the signed distance of that sample to the hyperplane.
print(grid.best_estimator_.decision_function(X_test)[0:5])

[0 1 0 0 0]
[-2.51798632  1.75714031 -1.92149984 -2.26651451 -1.26756091]
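
Note that roc_auc_score above was computed on the hard 0/1 predictions; passing the decision_function (or predict_proba) scores would give the usual rank-based AUC. If you prefer probabilities over signed distances, logistic regression also exposes predict_proba() (a quick sketch):

# Probability of survival (class 1) for the first five test passengers
print(grid.best_estimator_.predict_proba(X_test)[0:5, 1])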

Last, but not least, I want to quickly show you how to easily perform nested cross-validation:

scores = cross_val_score(
            GridSearchCV(
                model_logreg, 
                param_grid, 
                cv=10, scoring='roc_auc', 
                verbose=1, error_score="raise"), 
            X, y)
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
print("Nested cross-validation scores: {}".format(scores))
print("Mean nested-cv-score: {}".format(scores.mean()))
Nested cross-validation scores: [0.8687963  0.78845679 0.70944444 0.74746914 0.73950311]
Mean nested-cv-score: 0.7707339544513457

So, that is about it. I hope this quick walk-through was helpful.

In the beginning I had quite some trouble deciding on a ‘canonical’ way to do things, because there are quite a number of different ways to get to the same end result. I am quite happy with this workflow at the moment, but if you have any suggestions to improve it, feel free to open a GitHub issue :)

References

My post follows many recommendations from the book “Introduction to Machine Learning with Python” and only deviates from it significantly where the edition I had was no longer up to date (e.g. OneHotEncoder could not handle strings when the book was published).