Machine Learning in Python with Scikit-Learn (94/100 Days of Python)

Martin Mirakyan
4 min read · Apr 5, 2023

Day 94 of the “100 Days of Python” blog post series covering machine learning with scikit-learn

Scikit-learn is an open-source Python library that provides a consistent, user-friendly interface to a wide range of machine-learning algorithms. With an extensive set of tools for data preprocessing, model selection, and evaluation, it is a go-to library for both novice and experienced data scientists. In this tutorial, we will explore the main components of scikit-learn, walk through implementing machine learning models step by step, and share a few tips for getting the most out of the library.

To install scikit-learn using pip, simply run:

pip install scikit-learn
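A quick way to confirm the installation worked is to import the package and print its version. Note that the package is installed as scikit-learn but imported as sklearn:

```python
import sklearn

# The import name (sklearn) differs from the pip package name (scikit-learn)
print(sklearn.__version__)
```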

Understanding Scikit-Learn’s Key Components

Scikit-learn is organized into several key components:

  • Datasets: Built-in datasets to practice and learn from
  • Preprocessing: Functions for data preprocessing and feature engineering
  • Model Selection: Tools for splitting data, cross-validation, and parameter tuning
  • Supervised Learning: Algorithms for classification, regression, and ensemble methods
  • Unsupervised Learning: Algorithms for clustering, dimensionality reduction, and anomaly detection
  • Model Evaluation: Metrics for assessing model performance and diagnosing issues
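As a small illustration of the Datasets component, here is a sketch that loads the built-in Iris dataset. The choice of Iris is just an assumption for the example; any of the bundled datasets would work the same way:

```python
from sklearn.datasets import load_iris

# return_X_y=True gives the features and labels as two NumPy arrays
X, y = load_iris(return_X_y=True)

print(X.shape)  # (150, 4): 150 samples, 4 features each
print(y.shape)  # (150,): one class label per sample
```

The resulting X and y are in the shape that the preprocessing and modeling tools expect: one row per sample, one column per feature.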

Preprocessing Data with Scikit-Learn

Preprocessing is an essential step in machine learning as it prepares raw data for modeling. Scikit-learn offers a variety of preprocessing functions, including:

  • StandardScaler: Standardize features by removing the mean and scaling to unit variance
  • MinMaxScaler: Scale features to a given range, usually between 0 and 1
  • LabelEncoder: Encode categorical labels as integers
  • OneHotEncoder: Convert categorical features to binary one-hot vectors
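  • SimpleImputer: Fill in missing values using a strategy such as the mean, median, or most frequent value

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder

# X (numeric features), X_categorical and y (labels) are assumed to be loaded already

# Standard Scaler: zero mean and unit variance for each feature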
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# MinMax Scaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Encode labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# Categorical to One-Hot (returns a SciPy sparse matrix by default)
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_categorical)

# Impute values (fill in missing ones)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
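The snippets above assume that X, y, and X_categorical already exist. To see the transformers in action, here is a self-contained sketch on a tiny made-up dataset (the values are hypothetical and chosen only so that the results are easy to check by hand):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric features with one missing value
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

# Impute first: the missing value becomes the column mean, (10 + 30) / 2 = 20
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Scale the imputed data in two different ways
X_standard = StandardScaler().fit_transform(X_imputed)  # each column: mean 0, std 1
X_minmax = MinMaxScaler().fit_transform(X_imputed)      # each column: range [0, 1]

# Labels are mapped to integers in sorted order: cat -> 0, dog -> 1
y = ["cat", "dog", "cat"]
y_encoded = LabelEncoder().fit_transform(y)

# One column per category, in sorted order: green, red
X_categorical = np.array([["red"], ["green"], ["red"]])
X_onehot = OneHotEncoder().fit_transform(X_categorical).toarray()

print(X_imputed)
print(y_encoded)  # [0 1 0]
print(X_onehot)
```

Imputation comes first here because the scalers are easier to reason about once the missing values are filled in.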

Implementing Supervised Learning Models

Scikit-learn supports various supervised learning algorithms. This section provides an overview of how to implement some popular models:

  • Linear Regression
  • Logistic Regression
  • Support Vector Machines (SVM)
  • Decision Trees
  • Random Forests
  • Gradient Boosting

All models in scikit-learn share the same interface for training and predicting. To train a model, call its fit method on the training data. After training, call the predict method to get the model's predictions on new data:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Or to train a Logistic Regression model (a classifier, so y_train holds class labels)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
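Because every estimator exposes the same fit and predict methods, several models can be trained and compared in one loop. The following is a minimal sketch, assuming the built-in Iris dataset and a handful of classifiers from the list above. The hyperparameter values (max_iter, n_estimators, random_state) are illustrative choices, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Three different algorithms, one shared interface
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)        # train on the training split
    y_pred = model.predict(X_test)     # predict on unseen data
    accuracies[name] = accuracy_score(y_test, y_pred)
    print(f"{name}: {accuracies[name]:.3f}")
```

Swapping in another classifier only requires adding one more entry to the dictionary.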

Implementing Unsupervised Learning Models

Scikit-learn also provides a range of unsupervised learning algorithms. Some examples include:

  • K-Means Clustering
  • DBSCAN
  • Hierarchical Clustering
  • Principal Component Analysis (PCA)
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)

from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)
model.fit(X)
labels = model.labels_

# For DBSCAN
from sklearn.cluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
labels = model.labels_
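Here is a self-contained sketch that combines two of the techniques listed above: K-Means for clustering and PCA for dimensionality reduction. The synthetic data from make_blobs and all parameter values are assumptions made for the sake of the example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data: 300 points in 4 dimensions, grouped around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)

# Cluster the points; no labels are given to the model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Project the 4 features down to 2 components, e.g. for plotting
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(labels[:10])
print(X_2d.shape)  # (300, 2)
print(pca.explained_variance_ratio_)
```

Unsupervised estimators follow the same conventions as supervised ones, except that fit takes only X. Clustering models expose the assignments through labels_ (or fit_predict), while transformers such as PCA expose fit_transform.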

Model Evaluation and Hyperparameter Tuning

To assess and improve model performance, scikit-learn offers several tools, such as:

  • Train-test split: Split the dataset into training and testing sets
  • Cross-validation: Partition the dataset into multiple folds for validation
  • Grid search: Exhaustively search over specified hyperparameter values
  • Random search: Sample from a distribution of hyperparameter values
  • Model-specific scoring functions: Evaluate model performance using various metrics

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 5-fold cross-validation: returns one score per fold
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)
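Grid search is listed above but not shown, so here is a minimal sketch using GridSearchCV. The dataset (Iris) and the candidate values for the regularization strength C are assumptions picked for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate hyperparameter values to try (illustrative, not tuned)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Every candidate is evaluated with 5-fold cross-validation on the training set
search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # the best value of C found
print(search.best_score_)            # its mean cross-validated accuracy
test_score = search.score(X_test, y_test)
print(test_score)                    # accuracy on the held-out test set
```

Tuning on the training split and scoring once on the test split keeps the final estimate independent of the hyperparameter search. RandomizedSearchCV has a very similar interface for random search.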

Pipelines and Feature Union

Scikit-learn’s Pipeline and FeatureUnion classes streamline complex workflows, allowing you to chain multiple steps together:

  • Pipeline: A sequence of data processing steps and a final estimator. Pipelines ensure that the entire workflow is treated as a single entity, simplifying cross-validation and hyperparameter tuning.
  • FeatureUnion: Combine the output of multiple transformer objects into a single new feature space. This is particularly useful when working with heterogeneous data sources or combining multiple feature extraction mechanisms.

Example of a simple pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Define the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Train the pipeline
pipe.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = pipe.predict(X_test)
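FeatureUnion can be sketched in a similar way. The example below is one possible setup, not the only one: it assumes the Iris dataset and combines two PCA components with the single best original feature (chosen by SelectKBest) before passing everything to a classifier. The whole pipeline is then cross-validated as one estimator:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Concatenate the outputs of two transformers side by side
features = FeatureUnion([
    ("pca", PCA(n_components=2)),   # 2 derived features
    ("kbest", SelectKBest(k=1)),    # 1 original feature
])

combined = features.fit_transform(X, y)
print(combined.shape)  # (150, 3): 2 PCA components + 1 selected feature

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("features", features),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# The scaler and feature steps are re-fit inside every fold,
# so no information leaks from the validation folds
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Passing the pipeline to cross_val_score, rather than preprocessing the data up front, is what makes the entire workflow behave as a single entity.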

Tips and Tricks

Here are some tips and tricks to help you get the most out of scikit-learn:

  • Use built-in functions for common tasks: Scikit-learn offers many utility functions to simplify your code and reduce errors.
  • Keep your data in NumPy arrays or pandas DataFrames: Scikit-learn is designed to work seamlessly with these data structures.
  • Familiarize yourself with scikit-learn’s API conventions: Understand how to instantiate models, fit them, make predictions, and evaluate their performance.
  • Leverage the extensive documentation and examples: Scikit-learn’s documentation is thorough and includes many examples to learn from.

Written by Martin Mirakyan

Software Engineer | Machine Learning | Founder of Profound Academy (https://profound.academy)