2nd Edition. — O’Reilly, 2023. — 413 p. — ISBN 978-1-098-13572-0.
This practical guide provides more than 200 self-contained recipes to help you solve machine learning challenges you may encounter in your work. If you're comfortable with Python and its libraries, including pandas and scikit-learn, you'll be able to address specific problems, from loading data to training models and leveraging neural networks.
Each recipe in this updated edition includes code that you can copy, paste, and run with a toy dataset to ensure that it works. From there, you can adapt these recipes according to your use case or application. Recipes include a discussion that explains the solution and provides meaningful context.
Go beyond theory and concepts by learning the nuts and bolts you need to construct working machine learning applications. You'll find recipes for:
Vectors, matrices, and arrays
Working with data from CSV, JSON, SQL databases, cloud storage, and other sources
Handling numerical and categorical data, text, images, and dates and times
Dimensionality reduction using feature extraction or feature selection
Model evaluation and selection
Linear and logistic regression, trees and forests, and k-nearest neighbors
Support vector machines (SVM), naive Bayes, clustering, and tree-based models
Saving, loading, and serving trained models from multiple frameworks
True PDF
Preface
Working with Vectors, Matrices, and Arrays in NumPy
Introduction
Creating a Vector
Creating a Matrix
Creating a Sparse Matrix
Preallocating NumPy Arrays
Selecting Elements
Describing a Matrix
Applying Functions over Each Element
Finding the Maximum and Minimum Values
Calculating the Average, Variance, and Standard Deviation
Reshaping Arrays
Transposing a Vector or Matrix
Flattening a Matrix
Finding the Rank of a Matrix
Getting the Diagonal of a Matrix
Calculating the Trace of a Matrix
Calculating Dot Products
Adding and Subtracting Matrices
Multiplying Matrices
Inverting a Matrix
Generating Random Values
Loading Data
Introduction
Loading a Sample Dataset
Creating a Simulated Dataset
Loading a CSV File
Loading an Excel File
Loading a JSON File
Loading a Parquet File
Loading an Avro File
Querying a SQLite Database
Querying a Remote SQL Database
Loading Data from a Google Sheet
Loading Data from an S3 Bucket
Loading Unstructured Data
Data Wrangling
Introduction
Creating a DataFrame
Getting Information about the Data
Slicing DataFrames
Selecting Rows Based on Conditionals
Sorting Values
Replacing Values
Renaming Columns
Finding the Minimum, Maximum, Sum, Average, and Count
Finding Unique Values
Handling Missing Values
Deleting a Column
Deleting a Row
Dropping Duplicate Rows
Grouping Rows by Values
Grouping Rows by Time
Aggregating Operations and Statistics
Looping over a Column
Applying a Function over All Elements in a Column
Applying a Function to Groups
Concatenating DataFrames
Merging DataFrames
Handling Numerical Data
Introduction
Rescaling a Feature
Standardizing a Feature
Normalizing Observations
Generating Polynomial and Interaction Features
Transforming Features
Detecting Outliers
Handling Outliers
Discretizing Features
Grouping Observations Using Clustering
Deleting Observations with Missing Values
Imputing Missing Values
Handling Categorical Data
Introduction
Encoding Nominal Categorical Features
Encoding Ordinal Categorical Features
Encoding Dictionaries of Features
Imputing Missing Class Values
Handling Imbalanced Classes
Handling Text
Introduction
Cleaning Text
Parsing and Cleaning HTML
Removing Punctuation
Tokenizing Text
Removing Stop Words
Stemming Words
Tagging Parts of Speech
Performing Named-Entity Recognition
Encoding Text as a Bag of Words
Weighting Word Importance
Using Text Vectors to Calculate Text Similarity in a Search Query
Using a Sentiment Analysis Classifier
Handling Dates and Times
Introduction
Converting Strings to Dates
Handling Time Zones
Selecting Dates and Times
Breaking Up Date Data into Multiple Features
Calculating the Difference Between Dates
Encoding Days of the Week
Creating a Lagged Feature
Using Rolling Time Windows
Handling Missing Data in Time Series
Handling Images
Introduction
Loading Images
Saving Images
Resizing Images
Cropping Images
Blurring Images
Sharpening Images
Enhancing Contrast
Isolating Colors
Binarizing Images
Removing Backgrounds
Detecting Edges
Detecting Corners
Creating Features for Machine Learning
Encoding Color Histograms as Features
Using Pretrained Embeddings as Features
Detecting Objects with OpenCV
Classifying Images with PyTorch
Dimensionality Reduction Using Feature Extraction
Introduction
Reducing Features Using Principal Components
Reducing Features When Data Is Linearly Inseparable
Reducing Features by Maximizing Class Separability
Reducing Features Using Matrix Factorization
Reducing Features on Sparse Data
Dimensionality Reduction Using Feature Selection
Introduction
Thresholding Numerical Feature Variance
Thresholding Binary Feature Variance
Handling Highly Correlated Features
Removing Irrelevant Features for Classification
Recursively Eliminating Features
Model Evaluation
Introduction
Cross-Validating Models
Creating a Baseline Regression Model
Creating a Baseline Classification Model
Evaluating Binary Classifier Predictions
Evaluating Binary Classifier Thresholds
Evaluating Multiclass Classifier Predictions
Visualizing a Classifier’s Performance
Evaluating Regression Models
Evaluating Clustering Models
Creating a Custom Evaluation Metric
Visualizing the Effect of Training Set Size
Creating a Text Report of Evaluation Metrics
Visualizing the Effect of Hyperparameter Values
Model Selection
Introduction
Selecting the Best Models Using Exhaustive Search
Selecting the Best Models Using Randomized Search
Selecting the Best Models from Multiple Learning Algorithms
Selecting the Best Models When Preprocessing
Speeding Up Model Selection with Parallelization
Speeding Up Model Selection Using Algorithm-Specific Methods
Evaluating Performance After Model Selection
Linear Regression
Introduction
Fitting a Line
Handling Interactive Effects
Fitting a Nonlinear Relationship
Reducing Variance with Regularization
Reducing Features with Lasso Regression
Trees and Forests
Introduction
Training a Decision Tree Classifier
Training a Decision Tree Regressor
Visualizing a Decision Tree Model
Training a Random Forest Classifier
Training a Random Forest Regressor
Evaluating Random Forests with Out-of-Bag Errors
Identifying Important Features in Random Forests
Selecting Important Features in Random Forests
Handling Imbalanced Classes
Controlling Tree Size
Improving Performance Through Boosting
Training an XGBoost Model
Improving Real-Time Performance with LightGBM
K-Nearest Neighbors
Introduction
Finding an Observation’s Nearest Neighbors
Creating a K-Nearest Neighbors Classifier
Identifying the Best Neighborhood Size
Creating a Radius-Based Nearest Neighbors Classifier
Finding Approximate Nearest Neighbors
Evaluating Approximate Nearest Neighbors
Logistic Regression
Introduction
Training a Binary Classifier
Training a Multiclass Classifier
Reducing Variance Through Regularization
Training a Classifier on Very Large Data
Handling Imbalanced Classes
Support Vector Machines
Introduction
Training a Linear Classifier
Handling Linearly Inseparable Classes Using Kernels
Creating Predicted Probabilities
Identifying Support Vectors
Handling Imbalanced Classes
Naive Bayes
Introduction
Training a Classifier for Continuous Features
Training a Classifier for Discrete and Count Features
Training a Naive Bayes Classifier for Binary Features
Calibrating Predicted Probabilities
Clustering
Introduction
Clustering Using K-Means
Speeding Up K-Means Clustering
Clustering Using Mean Shift
Clustering Using DBSCAN
Clustering Using Hierarchical Merging
Tensors with PyTorch
Introduction
Creating a Tensor
Creating a Tensor from NumPy
Creating a Sparse Tensor
Selecting Elements in a Tensor
Describing a Tensor
Applying Operations to Elements
Finding the Maximum and Minimum Values
Reshaping Tensors
Transposing a Tensor
Flattening a Tensor
Calculating Dot Products
Multiplying Tensors
Neural Networks
Introduction
Using Autograd with PyTorch
Preprocessing Data for Neural Networks
Designing a Neural Network
Training a Binary Classifier
Training a Multiclass Classifier
Training a Regressor
Making Predictions
Visualizing Training History
Reducing Overfitting with Weight Regularization
Reducing Overfitting with Early Stopping
Reducing Overfitting with Dropout
Saving Model Training Progress
Tuning Neural Networks
Visualizing Neural Networks
Neural Networks for Unstructured Data
Introduction
Training a Neural Network for Image Classification
Training a Neural Network for Text Classification
Fine-Tuning a Pretrained Model for Image Classification
Fine-Tuning a Pretrained Model for Text Classification
Saving, Loading, and Serving Trained Models
Introduction
Saving and Loading a scikit-learn Model
Saving and Loading a TensorFlow Model
Saving and Loading a PyTorch Model
Serving scikit-learn Models
Serving TensorFlow Models
Serving PyTorch Models in Seldon
Index