## Scikit-learn Pipeline

### Pipeline 1

In most machine learning projects, the data you have to work with is unlikely to be in the ideal format for producing the best-performing model. There are often a number of transformation steps, such as encoding categorical variables, feature scaling and normalisation, that need to be performed first. Scikit-learn has built-in functions for most of these commonly used transformations in its preprocessing package.

However, in a typical machine learning workflow you will need to apply all these transformations at least twice: once when training the model, and again on any new data you want to predict on. You could of course write a function to apply them and reuse it, but you would still need to run it first and then call the model separately. Scikit-learn pipelines are a tool to simplify this process. They have several key benefits:

• They enforce the implementation and ordering of the steps in your project.
• This in turn makes your work much more reproducible.

Before building the pipeline, I split the training data into a train and test set so that I can validate the performance of the model.
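A minimal sketch of that split, using a small synthetic DataFrame (the column names `age`, `fare`, `embarked`, `sex` and the target `survived` are placeholders, not the article's actual dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real training set.
df = pd.DataFrame({
    'age': [22, 38, 26, 35, 28, 54, 2, 27],
    'fare': [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 21.08, 11.13],
    'embarked': ['S', 'C', 'S', 'S', 'S', 'S', 'S', 'S'],
    'sex': ['male', 'female', 'female', 'male', 'male', 'male', 'female', 'female'],
    'survived': [0, 1, 1, 0, 0, 0, 1, 1],
})

X = df.drop('survived', axis=1)
y = df['survived']

# Hold out 25% of the rows for validating the fitted pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```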

The first step in building the pipeline is to define each transformer type. The convention here is generally to create transformers for the different variable types.
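One common way to define these per-type transformers is as small pipelines of their own, for example imputation followed by scaling for numeric columns, and imputation followed by one-hot encoding for categorical ones. The exact steps and strategies below are illustrative choices, not the only option:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Transformer for numeric columns: fill missing values, then standardise.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Transformer for categorical columns: fill missing values, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Quick check on a tiny numeric column with a missing value.
scaled = numeric_transformer.fit_transform(np.array([[1.0], [2.0], [np.nan]]))
```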

Next we use the ColumnTransformer to apply the transformations to the correct columns in the dataframe. Before building it, I stored lists of the numeric and categorical columns using the pandas select_dtypes method.
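This can look something like the following (the two-column DataFrame is a stand-in, and the scaler/encoder here are plain transformers rather than the imputation pipelines, purely to keep the sketch short):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'age': [22, 38, 26], 'sex': ['male', 'female', 'female']})

# Select column names by dtype so the right transformer hits the right columns.
numeric_features = df.select_dtypes(include='number').columns.tolist()
categorical_features = df.select_dtypes(include='object').columns.tolist()

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])
```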

The next step is to create a pipeline that combines the preprocessor created above with a classifier. In this case I have used a simple RandomForestClassifier to start with.
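A sketch of that combined pipeline (the feature lists and the inline preprocessor are placeholders; note the classifier step is named `'classifier'`, which matters for the grid search later):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column lists for illustration.
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

# Preprocessing and model combined into a single estimator.
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=0)),
])
```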

Fitting the classifier is now a single call to the pipeline's fit() method, which applies each transformation before training the model.
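End to end, fitting and scoring the pipeline looks something like this (the data here is randomly generated stand-in data, so the score itself is meaningless):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic placeholder data; column names are illustrative only.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'age': rng.randint(1, 80, 40),
    'fare': rng.uniform(5, 100, 40),
    'sex': rng.choice(['male', 'female'], 40),
    'survived': rng.randint(0, 2, 40),
})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('survived', axis=1), df['survived'], random_state=0)

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=0)),
])

# One call fits the transformers on the training data, then the classifier.
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
```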

A pipeline can also be used during the model selection process. The following example code loops through a number of scikit-learn classifiers applying the transformations and training the model.
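A sketch of that loop, again on synthetic stand-in data; the three classifiers are arbitrary examples and any scikit-learn estimator could be swapped in:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic placeholder data; column names are illustrative only.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'age': rng.randint(1, 80, 40),
    'fare': rng.uniform(5, 100, 40),
    'sex': rng.choice(['male', 'female'], 40),
    'survived': rng.randint(0, 2, 40),
})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('survived', axis=1), df['survived'],
    random_state=0, stratify=df['survived'])

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])

# Reuse the same preprocessing with each candidate model.
scores = {}
for classifier in [LogisticRegression(max_iter=1000),
                   KNeighborsClassifier(n_neighbors=3),
                   RandomForestClassifier(random_state=0)]:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', classifier)])
    pipe.fit(X_train, y_train)
    scores[type(classifier).__name__] = pipe.score(X_test, y_test)
```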

The pipeline can also be used in a grid search to find the best-performing parameters. To do this, you first need to create a parameter grid for your chosen model. One important thing to note is that you need to prefix each parameter name with the name you gave the classifier step of your pipeline. In my code above I called this step 'classifier', so I added classifier__ to each parameter. Next, I created a grid search object that includes the original pipeline. When I then call fit, the transformations are applied to the data before a cross-validated grid search is performed over the parameter grid.
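A sketch of that grid search; the parameter values are arbitrary examples, and the data is again a synthetic placeholder:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic placeholder data; column names are illustrative only.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'age': rng.randint(1, 80, 40),
    'fare': rng.uniform(5, 100, 40),
    'sex': rng.choice(['male', 'female'], 40),
    'survived': rng.randint(0, 2, 40),
})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('survived', axis=1), df['survived'],
    random_state=0, stratify=df['survived'])

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=0)),
])

# Parameter names are prefixed with the pipeline step name 'classifier__'.
param_grid = {
    'classifier__n_estimators': [10, 30],
    'classifier__max_depth': [3, None],
}
grid_search = GridSearchCV(clf, param_grid, cv=2)
grid_search.fit(X_train, y_train)  # transforms, then cross-validated search
```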

### Pipeline 2

The example below demonstrates a pipeline defined with four steps:

• Feature Extraction with Principal Component Analysis (3 features)
• Feature Extraction with Statistical Selection (6 features)
• Feature Union
• Learn a Logistic Regression Model
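The four steps above can be sketched as follows. The data here is randomly generated with eight numeric features purely for illustration; FeatureUnion concatenates the 3 PCA components and the 6 statistically selected features into a 9-column feature matrix before the logistic regression:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic placeholder data: 100 samples, 8 numeric features.
rng = np.random.RandomState(0)
X = rng.rand(100, 8)
y = rng.randint(0, 2, 100)

# Two parallel feature-extraction branches, concatenated by FeatureUnion.
features = FeatureUnion([
    ('pca', PCA(n_components=3)),
    ('select_best', SelectKBest(k=6)),
])

model = Pipeline([
    ('feature_union', features),
    ('logistic', LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

combined = model.named_steps['feature_union'].transform(X)
```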

### Building Scikit-Learn transformers

Scikit-Learn’s API uses duck typing: if you want to write your own custom estimators (including transformers and predictors), you only need to implement the right methods, you don’t have to inherit from any particular class.

For example, all estimators must implement a fit() method, as well as get_params() and set_params() methods. All transformers must also implement transform() and fit_transform() methods. All predictors must implement a predict() method. And so on.

The most basic implementation of the fit_transform() method is just this:
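A sketch of that one-liner, shown inside a hypothetical transformer (the class and its doubling behaviour are made up purely to give the method a home):

```python
class DoubleTransformer:
    """Hypothetical transformer that doubles every value."""

    def fit(self, X, y=None):
        return self  # nothing to learn; return self so calls can be chained

    def transform(self, X):
        return [x * 2 for x in X]

    def fit_transform(self, X, y=None):
        # The most basic implementation: fit, then transform the same data.
        return self.fit(X, y).transform(X)
```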

You don’t have to inherit from the TransformerMixin class, but here is what you get if you do: implement the fit() method and the transform() method, and it gives you the fit_transform() method for free, just like the implementation above.

Similarly, the BaseEstimator class gives you the get_params() and set_params() methods for free. By default, get_params() uses introspection to read the parameters of the constructor __init__(), and it assumes that the class has corresponding instance variables. For example:
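A small hypothetical estimator illustrating this: the constructor parameter `factor` is stored under the same name, so get_params() can find it, and inheriting TransformerMixin also supplies fit_transform():

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ScaleBy(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that multiplies its input by a fixed factor."""

    def __init__(self, factor=1.0):
        # get_params() inspects __init__ and expects a matching attribute.
        self.factor = factor

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X):
        return np.asarray(X) * self.factor

scaler = ScaleBy(factor=2.0)
params = scaler.get_params()              # {'factor': 2.0}
scaler.set_params(factor=3.0)             # updates scaler.factor
doubled = ScaleBy(2.0).fit_transform([1.0, 2.0])  # fit_transform via the mixin
```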

Let’s say I have a lot of text and I want to extract certain data from it. I’m going to build a featurizer that takes a list of functions, calls each function with our text, and returns the results of all functions as a feature vector.
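A sketch of such a featurizer, under the assumption that each function maps one text to one number (the class name, the example functions and the sample texts are all invented for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class FunctionFeaturizer(BaseEstimator, TransformerMixin):
    """Apply each function to each text; stack the results as a feature matrix."""

    def __init__(self, featurizers):
        self.featurizers = featurizers

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        # One row per text, one column per featurizer function.
        return np.array([[f(text) for f in self.featurizers] for text in X])

# Example featurizer functions.
def n_words(text):
    return len(text.split())

def n_chars(text):
    return len(text)

featurizer = FunctionFeaturizer([n_words, n_chars])
features = featurizer.fit_transform(['hello world', 'a b c'])
```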