Machine learning with imbalanced data
What you'll learn
👉 Random under- and over-sampling.
👉 Cleaning under-sampling methods.
👉 Create synthetic data using SMOTE.
👉 Cost-sensitive learning
👉 Ensemble methods for imbalanced data.
👉 Performance evaluation metrics for imbalanced data.
👉 Apply the methods using Python open-source libraries.
What you'll get
Lifetime access
Instructor support
Certificate of completion
💬 English subtitles
Instructor
Soledad Galli, PhD
Sole is a lead data scientist, instructor, and developer of open-source software. She created and maintains Feature-engine, a Python library for feature engineering that allows us to impute data, encode categorical variables, and transform, create, and select features. Sole is also the author of the "Python Feature Engineering Cookbook," published by Packt.
Can't afford it? Get in touch.
30 days money back guarantee
If you're disappointed for whatever reason, you'll get a full refund.
So you can buy with confidence.
Course description
Welcome to Machine Learning with Imbalanced Datasets. In this course, you will learn multiple techniques to improve the performance of machine learning models trained with imbalanced datasets.
What are imbalanced datasets?
Imbalanced datasets typically arise in classification problems where one of the target classes is extremely under-represented. When this happens, we talk about class imbalance. The class with a small number of samples is called the minority class, and the class or classes with plenty of data are called the majority class or classes.
Imbalanced datasets are a common occurrence in data science. Examples of imbalanced datasets are those used for fraud detection or medical diagnosis.
Why is class imbalance a problem?
Most machine learning algorithms assume balanced class distributions. Thus, training classifiers on imbalanced data will naturally bias the model towards the majority class.
In addition, because the number of samples for the minority class is small, rules to accurately predict these classes are hard to find. Thus, observations belonging to the minority class most often end up being misclassified by the classification models.
Fortunately, there are various ways in which we can improve the performance of classifiers trained on data with imbalanced classes, including resampling, cost-sensitive learning, and ensemble methods.
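To see the problem in action, here is a minimal sketch (the synthetic dataset and its 99:1 class ratio are illustrative) showing how a model that completely ignores the minority class can still look excellent on plain accuracy:

```python
# Illustrative 99:1 dataset: a classifier that always predicts the
# majority class scores ~99% accuracy yet never detects the minority class.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

print(accuracy_score(y, y_pred))  # ~0.99, looks great
print(recall_score(y, y_pred))    # 0.0, minority class never found
```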
What will you learn in this online course?
In this course, you will learn multiple methods to improve the performance of machine learning models trained on imbalanced data and decrease the misclassification of the minority class or classes.
The course is divided into the following sections:
- Evaluation metrics
- Resampling methods
- Cost-sensitive learning
- Ensemble algorithms
Evaluation metrics
You will learn suitable metrics to assess classification models trained with imbalanced datasets. You will learn about the ROC curve and the ROC-AUC. You will create a confusion matrix; find true positives, true negatives, false positives, and false negatives; and then use them to calculate other metrics like precision, recall, and the F1-score. You will also learn about performance metrics designed specifically for imbalanced classification, like the balanced accuracy and the index of imbalanced accuracy, among others.
Some of these metrics are geared toward binary classification problems. Other metrics can handle multi-class targets out of the box. You will learn when, and why, to use each metric in your classification tasks.
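As a preview, here is a minimal sketch of these metrics computed with scikit-learn (the synthetic dataset, its 95:5 class ratio, and the logistic regression model are illustrative):

```python
# Compute the main metrics covered in this section with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score,
    f1_score, balanced_accuracy_score, roc_auc_score,
)

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# The confusion matrix gives the four counts the other metrics build on
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)

print(precision_score(y_test, y_pred))          # tp / (tp + fp)
print(recall_score(y_test, y_pred))             # tp / (tp + fn)
print(f1_score(y_test, y_pred))                 # harmonic mean of the two
print(balanced_accuracy_score(y_test, y_pred))  # mean of per-class recalls

# The ROC-AUC is computed from predicted probabilities, not hard labels
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```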
Resampling techniques
Next, you will learn about resampling methods, including under-sampling and over-sampling.
Among the under-sampling methods, you will learn random under-sampling and cleaning methods based on k-nearest neighbors, like Tomek links and NearMiss.
Among the over-sampling techniques, you will learn random over-sampling and methods that create new data points, like the synthetic minority over-sampling technique (SMOTE) and its variations. SMOTE creates synthetic data, that is, new data, and therefore avoids the mere duplication of samples introduced by random over-sampling.
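Both flavours are implemented in imbalanced-learn; here is a minimal sketch (the synthetic dataset and its class ratio are illustrative):

```python
# Under- and over-sampling with imbalanced-learn (imblearn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=0)
print(Counter(y))  # roughly 1800 vs 200

# Random under-sampling: discards majority-class samples
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Random over-sampling: duplicates existing minority-class samples
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)

# SMOTE: interpolates between minority neighbours to create new samples
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)

print(Counter(y_rus), Counter(y_ros), Counter(y_sm))  # all balanced
```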
Resampling methods are usually classified as data preprocessing methods because they change the distribution of the training dataset. In particular, the aim of resampling techniques is to create balanced datasets with a similar distribution across the different classes.
You will also learn how to set up the resampling strategy correctly: resample the training dataset only, and leave the test set untouched with its original class distribution, so that model validation reflects the conditions the model will face in the real world.
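A minimal sketch of that setup, assuming SMOTE as the resampler and an illustrative synthetic dataset:

```python
# Correct setup: split first, resample the training set only,
# and evaluate on a test set with the original class distribution.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Resampling happens after the split, and only on the training data
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# The test set still shows the real-world 95:5 class distribution
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```

For cross-validation, imblearn's own Pipeline applies the resampler inside each fold, on the training portion only.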
Cost-sensitive learning
Next, you will learn how to introduce class weights to perform cost-sensitive learning. Cost-sensitive learning trains the models on the original dataset, without changing the class distribution. It aims to compensate for the misclassification of the minority class by penalizing more heavily the mistakes the classifier makes on these observations.
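In scikit-learn, this is typically done through the class_weight parameter; a minimal sketch, with an illustrative dataset and illustrative costs:

```python
# Cost-sensitive learning via class weights in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=0)

# 'balanced' sets weights inversely proportional to class frequencies
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Or pass explicit costs: here, mistakes on class 1 are penalized 9x harder
clf = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000).fit(X, y)
```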
Ensemble methods
Finally, we will cover bagging and boosting algorithms specifically designed to handle imbalanced data.
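Several of these are available in imbalanced-learn's ensemble module; a minimal sketch with default hyperparameters and an illustrative dataset:

```python
# Ensemble methods for imbalanced data from imbalanced-learn.
from sklearn.datasets import make_classification
from imblearn.ensemble import (
    BalancedBaggingClassifier,
    BalancedRandomForestClassifier,
    RUSBoostClassifier,
)

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=0)

# Bagging in which each bootstrap sample is randomly under-sampled
bag = BalancedBaggingClassifier(random_state=0).fit(X, y)

# Random forest grown on balanced bootstrap samples
forest = BalancedRandomForestClassifier(random_state=0).fit(X, y)

# Boosting combined with random under-sampling at each iteration
boost = RUSBoostClassifier(random_state=0).fit(X, y)
```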
By the end of the course, you will be able to decide which technique is suitable for your dataset, and/or apply and compare the boost in performance returned by the different methods on multiple datasets.
Machine learning with Python
Throughout the tutorials, we will use Python as the main language. We will implement the resampling methods with the open-source library imbalanced-learn (imblearn) and the cost-sensitive techniques with scikit-learn (sklearn).
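Both libraries are available on PyPI under these package names:

```
pip install scikit-learn imbalanced-learn
```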
Who is this course for?
If you are working with imbalanced datasets right now and want to boost the performance of your classifiers, or you simply want to learn more about how to handle imbalanced data, this course will show you how.
Course prerequisites
To get the most out of this course, you need to have basic knowledge of machine learning and familiarity with the most common predictive models, like linear and logistic regression, decision trees, and random forests. You also need to be familiar with the Python open-source libraries pandas, NumPy, and scikit-learn.
To wrap up
This comprehensive machine learning course includes over 50 lectures spanning more than 10 hours of video, and ALL topics include hands-on Python code examples which you can use for reference and practice, and reuse in your own projects.
Course Curriculum
- Introduction to Performance Metrics (3:22)
- Accuracy (4:21)
- Accuracy - Demo (5:39)
- Precision, Recall and F-measure (13:32)
- Install Yellowbrick
- Precision, Recall and F-measure - Demo (10:04)
- Confusion tables, FPR and FNR (6:03)
- Confusion tables, FPR and FNR - Demo (7:32)
- Balanced Accuracy (3:49)
- Balanced accuracy - Demo (2:43)
- Geometric Mean, Dominance, Index of Imbalanced Accuracy (4:29)
- Geometric Mean, Dominance, Index of Imbalanced Accuracy - Demo (9:28)
- ROC-AUC (7:26)
- ROC-AUC - Demo (4:46)
- Precision-Recall Curve (11:19)
- Precision-Recall Curve - Demo (3:53)
- Additional reading resources
- Probability (4:32)
- Tuning the probability threshold with sklearn (8:24)
- Bringing it all together - credit risk (13:14)
- Quiz - binary classification
- Metrics for Multiclass (11:04)
- Metrics for Multiclass - Demo (8:55)
- PR and ROC Curves for Multiclass (5:16)
- PR Curves in Multiclass - Demo (8:40)
- ROC Curve in Multiclass - Demo (7:13)
- Quiz - multiclass classification
- How are we doing? (0:26)
- Cost-sensitive Learning (7:27)
- Types of Cost (10:55)
- Obtaining the Cost (4:28)
- Cost Sensitive Approaches (1:52)
- Misclassification Cost in Logistic Regression (3:35)
- Misclassification Cost in Decision Trees (3:50)
- Cost Sensitive Learning with Scikit-learn (7:13)
- Find Optimal Cost with hyperparameter tuning (3:33)
- Cost sensitive learning - credit risk (6:44)
- Cost sensitive learning in ensemble methods
- CSL: before or after feature engineering?
- Cost-sensitive pipelines (2:38)
- Quiz - cost sensitive learning
- Bayes Conditional Risk (13:44)
- MetaCost (8:03)
- MetaCost - Demo (3:40)
- Optional: MetaCost Base Code (6:39)
- Wrapping up (2:57)
- How are we doing? (0:24)
- Additional Reading Resources
- Under-Sampling Methods - Introduction (5:21)
- Random Under-Sampling - Intro (4:23)
- Random Under-Sampling - Demo (10:11)
- Condensed Nearest Neighbours - Intro (8:03)
- Condensed Nearest Neighbours - Demo (7:25)
- Tomek Links - Intro (4:43)
- Tomek Links - Demo (3:05)
- One Sided Selection - Intro (4:38)
- One Sided Selection - Demo (3:00)
- Edited Nearest Neighbours - Intro (5:01)
- Edited Nearest Neighbours - Demo (4:02)
- Repeated Edited Nearest Neighbours - Intro (4:39)
- Repeated Edited Nearest Neighbours - Demo (3:00)
- All KNN - Intro (6:16)
- All KNN - Demo (5:50)
- Neighbourhood Cleaning Rule - Intro (6:14)
- Neighbourhood Cleaning Rule - Demo (1:55)
- NearMiss - Intro (3:47)
- NearMiss - Demo (3:53)
- Instance Hardness Threshold - Intro (9:20)
- Instance Hardness Threshold - Demo (16:21)
- Instance Hardness Threshold Multiclass Demo (7:44)
- Undersampling Method Comparison (7:44)
- Quiz - undersampling comparison
- Setting up a classifier with under-sampling and cross-validation (10:54)
- Quiz - comparison with cross-validation
- Undersampling methods comparison with hyperparameter tuning
- Wrapping up the section (5:18)
- How are we doing? (0:24)
- Summary Table
- Added Treat: A Movie We Recommend 🍿
- Over-Sampling Methods - Introduction (3:41)
- Random Over-Sampling (5:00)
- Random Over-Sampling - Demo (4:55)
- ROS with smoothing - Intro (6:39)
- ROS with smoothing - Demo (4:36)
- SMOTE (9:26)
- SMOTE - Demo (2:35)
- SMOTE-NC (9:02)
- SMOTE-NC - Demo (2:56)
- SMOTE-N (19:25)
- SMOTE-N Demo (7:20)
- ADASYN (7:11)
- ADASYN - Demo (3:17)
- Borderline SMOTE (7:47)
- Borderline SMOTE - Demo (3:13)
- SVM SMOTE (16:40)
- Resources on SVMs
- SVM SMOTE - Demo (4:32)
- K-Means SMOTE (13:01)
- K-Means SMOTE - Demo (3:29)
- Over-Sampling Method Comparison (5:50)
- Quiz - oversampling methods comparison
- Oversampling method comparison - take 2
- Wrapping up the section (9:30)
- SMOTE in 2024
- How to Correctly Set Up a Classifier with Over-sampling (5:24)
- Summary Table
- Extra Treat: Our Reading Suggestion 📕
- Ensemble methods with Imbalanced Data (4:33)
- Foundations of Ensemble Learning (3:12)
- Bagging (3:04)
- Bagging with over- or undersampling (5:38)
- Boosting (10:03)
- Boosting with resampling (7:05)
- Hybrid Methods (4:48)
- Ensemble Methods - Demo (9:59)
- Comparison of ensemble methods
- Wrapping up (5:31)
- Additional Reading Resources
- More Wisdom: Our Chosen Podcast Episode 🎧
- Probability Calibration (6:41)
- Probability Calibration Curves (5:56)
- Probability Calibration Curves - Demo (9:37)
- Brier Score (3:06)
- Brier Score - Demo (7:07)
- Under- and Over-sampling and Cost-sensitive learning on Probability Calibration (5:10)
- Calibrating a Classifier (5:25)
- Calibrating a Classifier - Demo (6:20)
- Calibrating a Classifier after SMOTE or Under-sampling (8:05)
- Calibrating a Classifier with Cost-sensitive Learning (3:31)
- Quiz
- Additional reading resources
Frequently Asked Questions
When does the course begin and end?
You can start taking the course from the moment you enroll. The course is self-paced, so you can watch the tutorials and apply what you learn whenever you find it most convenient.
For how long can I access the course?
The course comes with lifetime access. This means that once you enroll, you will have unlimited access to the course for as long as you like.
What if I don't like the course?
There is a 30-day money back guarantee. If you don't find the course useful, contact us within the first 30 days of purchase and you will get a full refund.
Will I get a certificate?
Yes, you'll get a certificate of completion after completing all lectures, quizzes and assignments.