Machine learning with imbalanced data
What you'll learn
👉 Random under- and over-sampling.
👉 Cleaning under-sampling methods.
👉 Create synthetic data using SMOTE.
👉 Cost-sensitive learning
👉 Ensemble methods for imbalanced data.
👉 Performance evaluation metrics for imbalanced data.
👉 Apply the methods using Python open-source libraries.
What you'll get
Lifetime access
Instructor support
Certificate of completion
💬 English subtitles
Instructor
Soledad Galli, PhD
Sole is a lead data scientist, instructor, and developer of open-source software. She created and maintains Feature-engine, a Python library for feature engineering that allows us to impute data, encode categorical variables, and transform, create, and select features. Sole is also the author of the "Python Feature Engineering Cookbook," published by Packt.
Can't afford it? Get in touch.
30 days money back guarantee
If you're disappointed for whatever reason, you'll get a full refund.
So you can buy with confidence.
Course description
Welcome to Machine Learning with Imbalanced Datasets. In this course, you will learn multiple techniques to improve the performance of machine learning models trained with imbalanced datasets.
What are imbalanced datasets?
Imbalanced datasets typically arise in classification problems where one of the target classes is extremely under-represented. When this happens, we talk about class imbalance. The class with a small number of samples is called the minority class, and the class or classes with plenty of data are called the majority class or classes.
Imbalanced datasets are a common occurrence in data science. Examples of imbalanced datasets are those used for fraud detection or medical diagnosis.
Why is class imbalance a problem?
Most machine learning algorithms assume balanced class distributions. Thus, training classifiers on imbalanced data will naturally bias the model towards the majority class.
In addition, because the number of samples for the minority class is small, rules to accurately predict these classes are hard to find. Thus, observations belonging to the minority class most often end up being misclassified by the classification models.
Fortunately, there are various ways in which we can improve the performance of classifiers trained on data with imbalanced classes, including resampling, cost-sensitive learning, and ensemble methods.
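To see the problem in action, here is a minimal sketch (the synthetic dataset and its 99:1 class ratio are illustrative) showing how a model that completely ignores the minority class can still look excellent on plain accuracy:

```python
# Illustrative 99:1 dataset: a classifier that always predicts the
# majority class scores ~99% accuracy yet never detects the minority class.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

print(accuracy_score(y, y_pred))  # ~0.99, looks great
print(recall_score(y, y_pred))    # 0.0, minority class never found
```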
What will you learn in this online course?
In this course, you will learn multiple methods to improve the performance of machine learning models trained on imbalanced data and decrease the misclassification of the minority class or classes.
The course is divided into the following sections:
- Evaluation metrics
- Resampling methods
- Cost-sensitive learning
- Ensemble algorithms
Evaluation metrics
You will learn suitable metrics to assess classification models trained with imbalanced datasets. You will learn about the ROC curve and the ROC-AUC. You will create a confusion matrix; find true positives, true negatives, false positives, and false negatives; and then use them to calculate other metrics like precision, recall, and the F1-score. You will also learn about performance metrics designed specifically for imbalanced classification, like the balanced accuracy and the index of imbalanced accuracy, among others.
Some of these metrics are geared toward binary classification problems. Other metrics can handle multi-class targets out of the box. You will learn when, and why, to use each metric in your classification tasks.
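As a preview, here is a minimal sketch of these metrics computed with scikit-learn (the synthetic dataset, its 95:5 class ratio, and the logistic regression model are illustrative):

```python
# Compute the main metrics covered in this section with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score,
    f1_score, balanced_accuracy_score, roc_auc_score,
)

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# The confusion matrix gives the four counts the other metrics build on
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)

print(precision_score(y_test, y_pred))          # tp / (tp + fp)
print(recall_score(y_test, y_pred))             # tp / (tp + fn)
print(f1_score(y_test, y_pred))                 # harmonic mean of the two
print(balanced_accuracy_score(y_test, y_pred))  # mean of per-class recalls

# The ROC-AUC is computed from predicted probabilities, not hard labels
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```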
Resampling techniques
Next, you will learn about resampling methods, including under-sampling and over-sampling.
Among the under-sampling methods, you will learn random under-sampling and cleaning methods based on k-nearest neighbors, like Tomek links and NearMiss.
Among the over-sampling techniques, you will learn random over-sampling and methods that create new data points, like the synthetic minority over-sampling technique (SMOTE) and its variations. SMOTE creates synthetic data, that is, new data, and therefore avoids the mere duplication of samples introduced by random over-sampling.
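Both flavours are implemented in imbalanced-learn; here is a minimal sketch (the synthetic dataset and its class ratio are illustrative):

```python
# Under- and over-sampling with imbalanced-learn (imblearn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=0)
print(Counter(y))  # roughly 1800 vs 200

# Random under-sampling: discards majority-class samples
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Random over-sampling: duplicates existing minority-class samples
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)

# SMOTE: interpolates between minority neighbours to create new samples
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)

print(Counter(y_rus), Counter(y_ros), Counter(y_sm))  # all balanced
```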
Resampling methods are usually classified as data preprocessing methods because they change the distribution of the training dataset. In particular, the aim of resampling techniques is to create balanced datasets with a similar distribution across the different classes.
You will also learn how to set up the resampling strategy correctly: resample the training dataset only, and leave the test set untouched with its original class distribution, so that model validation reflects the conditions the model will face in the real world.
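A minimal sketch of that setup, assuming SMOTE as the resampler and an illustrative synthetic dataset:

```python
# Correct setup: split first, resample the training set only,
# and evaluate on a test set with the original class distribution.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Resampling happens after the split, and only on the training data
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# The test set still shows the real-world 95:5 class distribution
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```

For cross-validation, imblearn's own Pipeline applies the resampler inside each fold, on the training portion only.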
Cost-sensitive learning
Next, you will learn how to introduce class weights to perform cost-sensitive learning. Cost-sensitive learning trains the models on the original dataset, without changing the class distribution. It aims to compensate for the misclassification of the minority class by penalizing more heavily the mistakes the classifier makes on these observations.
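In scikit-learn, this is typically done through the class_weight parameter; a minimal sketch, with an illustrative dataset and illustrative costs:

```python
# Cost-sensitive learning via class weights in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=0)

# 'balanced' sets weights inversely proportional to class frequencies
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Or pass explicit costs: here, mistakes on class 1 are penalized 9x harder
clf = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000).fit(X, y)
```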
Ensemble methods
Finally, we will cover bagging and boosting algorithms specifically designed to handle imbalanced data.
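Several of these are available in imbalanced-learn's ensemble module; a minimal sketch with default hyperparameters and an illustrative dataset:

```python
# Ensemble methods for imbalanced data from imbalanced-learn.
from sklearn.datasets import make_classification
from imblearn.ensemble import (
    BalancedBaggingClassifier,
    BalancedRandomForestClassifier,
    RUSBoostClassifier,
)

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=0)

# Bagging in which each bootstrap sample is randomly under-sampled
bag = BalancedBaggingClassifier(random_state=0).fit(X, y)

# Random forest grown on balanced bootstrap samples
forest = BalancedRandomForestClassifier(random_state=0).fit(X, y)

# Boosting combined with random under-sampling at each iteration
boost = RUSBoostClassifier(random_state=0).fit(X, y)
```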
By the end of the course, you will be able to decide which technique is suitable for your dataset, and/or apply and compare the boost in performance returned by the different methods on multiple datasets.
Machine learning with Python
Throughout the tutorials, we will use Python as the main language. We will implement the resampling methods with the open-source library imbalanced-learn (imblearn) and the cost-sensitive techniques with scikit-learn (sklearn).
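Both libraries are available on PyPI under these package names:

```
pip install scikit-learn imbalanced-learn
```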
Who is this course for?
If you are working with imbalanced datasets right now and want to boost the performance of your classifiers, or you simply want to learn more about how to handle imbalanced data, this course will show you how.
Course prerequisites
To get the most out of this course, you need to have basic knowledge of machine learning and familiarity with the most common predictive models, like linear and logistic regression, decision trees, and random forests. You also need to be familiar with the Python open-source libraries pandas, NumPy, and scikit-learn.
To wrap up
This comprehensive machine learning course includes over 50 lectures spanning more than 10 hours of video, and ALL topics include hands-on Python code examples which you can use for reference and practice, and reuse in your own projects.
Course Curriculum
- Introduction to Performance Metrics (3:22)
- Accuracy (4:21)
- Accuracy - Demo (5:39)
- Precision, Recall and F-measure (13:32)
- Install Yellowbrick
- Precision, Recall and F-measure - Demo (10:04)
- Confusion tables, FPR and FNR (6:03)
- Confusion tables, FPR and FNR - Demo (7:32)
- Balanced Accuracy (3:49)
- Balanced accuracy - Demo (2:43)
- Geometric Mean, Dominance, Index of Imbalanced Accuracy (4:29)
- Geometric Mean, Dominance, Index of Imbalanced Accuracy - Demo (9:28)
- ROC-AUC (7:26)
- ROC-AUC - Demo (4:46)
- Precision-Recall Curve (11:19)
- Precision-Recall Curve - Demo (3:53)
- Additional reading resources
- Probability (4:32)
- Tuning the probability threshold with sklearn (8:24)
- Bringing it all together - credit risk (13:14)
- Quiz - binary classification
- Metrics for Multiclass (11:04)
- Metrics for Multiclass - Demo (8:55)
- PR and ROC Curves for Multiclass (5:16)
- PR Curves in Multiclass - Demo (8:40)
- ROC Curve in Multiclass - Demo (7:13)
- Quiz - multiclass classification
- How are we doing? (0:26)
- Cost-sensitive Learning (7:27)
- Types of Cost (10:55)
- Obtaining the Cost (4:28)
- Cost Sensitive Approaches (1:52)
- Misclassification Cost in Logistic Regression (3:35)
- Misclassification Cost in Decision Trees (3:50)
- Cost Sensitive Learning with Scikit-learn (7:13)
- Find Optimal Cost with hyperparameter tuning (3:33)
- Cost sensitive learning - credit risk (6:44)
- Cost sensitive learning in ensemble methods
- CSL: before or after feature engineering?
- Cost-sensitive pipelines (2:38)
- Quiz - cost sensitive learning
- Bayes Conditional Risk (13:44)
- MetaCost (8:03)
- MetaCost - Demo (3:40)
- Optional: MetaCost Base Code (6:39)
- Wrapping up (2:57)
- How are we doing? (0:24)
- Additional Reading Resources
- Under-Sampling Methods - Introduction (5:21)
- Random Under-Sampling - Intro (4:23)
- Random Under-Sampling - Demo (10:11)
- Condensed Nearest Neighbours - Intro (8:03)
- Condensed Nearest Neighbours - Demo (7:25)
- Tomek Links - Intro (4:43)
- Tomek Links - Demo (3:05)
- One Sided Selection - Intro (4:38)
- One Sided Selection - Demo (3:00)
- Edited Nearest Neighbours - Intro (5:01)
- Edited Nearest Neighbours - Demo (4:02)
- Repeated Edited Nearest Neighbours - Intro (4:39)
- Repeated Edited Nearest Neighbours - Demo (3:00)
- All KNN - Intro (6:16)
- All KNN - Demo (5:50)
- Neighbourhood Cleaning Rule - Intro (6:14)
- Neighbourhood Cleaning Rule - Demo (1:55)
- NearMiss - Intro (3:47)
- NearMiss - Demo (3:53)
- Instance Hardness Threshold - Intro (9:20)
- Instance Hardness Threshold - Demo (16:21)
- Instance Hardness Threshold Multiclass Demo (7:44)
- Undersampling Method Comparison (7:44)
- Quiz - undersampling comparison
- Setting up a classifier with under-sampling and cross-validation (10:54)
- Quiz - comparison with cross-validation
- Undersampling methods comparison with hyperparameter tuning
- Wrapping up the section (5:18)
- How are we doing? (0:24)
- Summary Table
- Added Treat: A Movie We Recommend 🍿
- Over-Sampling Methods - Introduction (3:41)
- Random Over-Sampling (5:00)
- Random Over-Sampling - Demo (4:55)
- ROS with smoothing - Intro (6:39)
- ROS with smoothing - Demo (4:36)
- SMOTE (9:26)
- SMOTE - Demo (2:35)
- SMOTE-NC (9:02)
- SMOTE-NC - Demo (2:56)
- SMOTE-N (19:25)
- SMOTE-N Demo (7:20)
- ADASYN (7:11)
- ADASYN - Demo (3:17)
- Borderline SMOTE (7:47)
- Borderline SMOTE - Demo (3:13)
- SVM SMOTE (16:40)
- Resources on SVMs
- SVM SMOTE - Demo (4:32)
- K-Means SMOTE (13:01)
- K-Means SMOTE - Demo (3:29)
- Over-Sampling Method Comparison (5:50)
- Quiz - oversampling methods comparison
- Oversampling method comparison - take 2
- Wrapping up the section (9:30)
- SMOTE in 2024
- How to Correctly Set Up a Classifier with Over-sampling (5:24)
- Summary Table
- Extra Treat: Our Reading Suggestion 📕
- Ensemble methods with Imbalanced Data (4:33)
- Foundations of Ensemble Learning (3:12)
- Bagging (3:04)
- Bagging with over- or undersampling (5:38)
- Boosting (10:03)
- Boosting with resampling (7:05)
- Hybrid Methods (4:48)
- Ensemble Methods - Demo (9:59)
- Comparison of ensemble methods
- Wrapping up (5:31)
- Additional Reading Resources
- More Wisdom: Our Chosen Podcast Episode 🎧
- Probability Calibration (6:41)
- Probability Calibration Curves (5:56)
- Probability Calibration Curves - Demo (9:37)
- Brier Score (3:06)
- Brier Score - Demo (7:07)
- Under- and Over-sampling and Cost-sensitive learning on Probability Calibration (5:10)
- Calibrating a Classifier (5:25)
- Calibrating a Classifier - Demo (6:20)
- Calibrating a Classifier after SMOTE or Under-sampling (8:05)
- Calibrating a Classifier with Cost-sensitive Learning (3:31)
- Quiz
- Additional reading resources
Frequently Asked Questions
When does the course begin and end?
You can start taking the course from the moment you enroll. The course is self-paced, so you can watch the tutorials and apply what you learn whenever you find it most convenient.
For how long can I access the course?
The course comes with lifetime access. This means that once you enroll, you will have unlimited access to the course for as long as you like.
What if I don't like the course?
There is a 30-day money back guarantee. If you don't find the course useful, contact us within the first 30 days of purchase and you will get a full refund.
Will I get a certificate?
Yes, you'll get a certificate of completion after completing all lectures, quizzes and assignments.