Machine Learning in Python with Scikit-Learn (94/100 Days of Python)
Scikit-learn is an open-source Python library that provides a versatile and user-friendly interface for a variety of machine-learning algorithms. With an extensive set of tools for data preprocessing, model selection, and evaluation, scikit-learn is a go-to library for both novice and experienced data scientists. In this comprehensive tutorial, we will explore the main components of scikit-learn, walk you through a step-by-step process of implementing machine learning models, and show you some useful tips and tricks to get the most out of this powerful library.
To install scikit-learn using pip, simply run:
pip install scikit-learn
Understanding Scikit-Learn’s Key Components
Scikit-learn is organized into several key components:
- Datasets: Built-in datasets to practice and learn from
- Preprocessing: Functions for data preprocessing and feature engineering
- Model Selection: Tools for splitting data, cross-validation, and parameter tuning
- Supervised Learning: Algorithms for classification, regression, and ensemble methods
- Unsupervised Learning: Algorithms for clustering, dimensionality reduction, and anomaly detection
- Model Evaluation: Metrics for assessing model performance and diagnosing issues
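For example, the built-in datasets make it easy to grab practice data in a couple of lines. Here is a minimal sketch that loads the classic iris dataset; the feature matrix X and label vector y used in the snippets below can come from a dataset like this:
from sklearn.datasets import load_iris
# Load the iris dataset as a feature matrix X and a label vector y
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)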
Preprocessing Data with Scikit-Learn
Preprocessing is an essential step in machine learning as it prepares raw data for modeling. Scikit-learn offers a variety of preprocessing functions, including:
- StandardScaler: Standardize features by removing the mean and scaling to unit variance
- MinMaxScaler: Scale features to a given range, usually between 0 and 1
- LabelEncoder: Encode categorical labels as integers
- OneHotEncoder: Convert categorical features to binary one-hot vectors
- SimpleImputer: Fill in missing values using various strategies (mean, median, most frequent, or a constant)
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
# Standard Scaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# MinMax Scaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Encode labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
# Categorical to One-Hot
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_categorical)
# Impute values (fill in missing ones)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
Implementing Supervised Learning Models
Scikit-learn supports various supervised learning algorithms. This section provides an overview of how to implement some popular models:
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVM)
- Decision Trees
- Random Forests
- Gradient Boosting
All models in scikit-learn share the same interface for training and predicting. To train a model on a dataset, call the fit method on the model object; after training, call the predict method to get the model's predictions on new data:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Or to train a Logistic Regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
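The same fit/predict pattern carries over to the other algorithms listed above. As a minimal sketch (assuming X_train, y_train, and X_test are defined as in the previous snippet), a random forest or gradient boosting classifier can be trained in exactly the same way:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# Random forest: an ensemble of decision trees trained on bootstrap samples
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Gradient boosting: trees built sequentially, each correcting the previous ones
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)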
Implementing Unsupervised Learning Models
Scikit-learn also provides a range of unsupervised learning algorithms. Some examples include:
- K-Means Clustering
- DBSCAN
- Hierarchical Clustering
- Principal Component Analysis (PCA)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
from sklearn.cluster import KMeans
# K-Means: partition the data into 3 clusters
model = KMeans(n_clusters=3)
model.fit(X)
labels = model.labels_
# For DBSCAN
from sklearn.cluster import DBSCAN
model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
labels = model.labels_
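Dimensionality reduction follows the same conventions. As a minimal sketch (assuming a feature matrix X as above), PCA and t-SNE can be applied like this:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# PCA: project the data onto its first 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# t-SNE: non-linear embedding in 2 dimensions, mainly used for visualization
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)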
Model Evaluation and Hyperparameter Tuning
To assess and improve model performance, scikit-learn offers several tools, such as:
- Train-test split: Split the dataset into training and testing sets
- Cross-validation: Partition the dataset into multiple folds for validation
- Grid search: Exhaustively search over specified hyperparameter values
- Random search: Sample from a distribution of hyperparameter values
- Model-specific scoring functions: Evaluate model performance using various metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
# Train-test split: hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Cross-validation: 5-fold CV returns one score per fold
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)
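For hyperparameter tuning, grid search wraps any estimator and tries candidate parameter values with cross-validation. A minimal sketch, assuming the logistic regression model and train/test split above (the parameter grid here is purely illustrative):
from sklearn.model_selection import GridSearchCV
# Exhaustively try a few regularization strengths with 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1, 10]}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)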
Pipelines and Feature Union
Scikit-learn’s Pipeline and FeatureUnion classes streamline complex workflows, allowing you to chain multiple steps together:
- Pipeline: A sequence of data processing steps and a final estimator. Pipelines ensure that the entire workflow is treated as a single entity, simplifying cross-validation and hyperparameter tuning.
- FeatureUnion: Combine the output of multiple transformer objects into a single new feature space. This is particularly useful when working with heterogeneous data sources or combining multiple feature extraction mechanisms.
Example of a simple pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Define the pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
# Train the pipeline
pipe.fit(X_train, y_train)
# Make predictions and evaluate
y_pred = pipe.predict(X_test)
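A FeatureUnion fits naturally inside a pipeline. The sketch below combines two transformers into a single feature space before the classifier; the choice of 2 PCA components and 3 selected features is just for illustration:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
# Combine PCA components with the 3 highest-scoring original features
features = FeatureUnion([
    ('pca', PCA(n_components=2)),
    ('select_best', SelectKBest(k=3)),
])
pipe = Pipeline([
    ('features', features),
    ('classifier', LogisticRegression()),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)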
Tips and Tricks
Here are some tips and tricks to help you get the most out of scikit-learn:
- Use built-in functions for common tasks: Scikit-learn offers many utility functions to simplify your code and reduce errors.
- Keep your data in NumPy arrays or pandas DataFrames: Scikit-learn is designed to work seamlessly with these data structures.
- Familiarize yourself with scikit-learn’s API conventions: Understand how to instantiate models, fit them, make predictions, and evaluate their performance.
- Leverage the extensive documentation and examples: Scikit-learn’s documentation is thorough and includes many examples to learn from.
What’s next?
- If you found this story valuable, please consider clapping multiple times (this really helps a lot!)
- Hands-on Practice: Free Python Course
- Full series: 100 Days of Python
- Previous topic: Mastering Data Analysis with Pandas
- Next topic: Creating an Interactive Website with Streamlit