Code Repository for Machine Learning with PyTorch and Scikit-Learn

Overview

To be published January 2022

Paperback: TBD pages
Publisher: Packt Publishing
Language: English

ISBN-10: TBD
ISBN-13: 978-1801819312
Kindle ASIN: TBD

Links

Table of Contents and Code Notebooks

Helpful installation and setup instructions can be found in the README.md file of Chapter 1.

Please note that these are only the code examples accompanying the book, which we uploaded for your convenience; the notebooks may not be useful without the book's formulae and descriptive text.

  1. Machine Learning – Giving Computers the Ability to Learn from Data
  2. Training Machine Learning Algorithms for Classification
  3. A Tour of Machine Learning Classifiers Using Scikit-Learn
  4. Building Good Training Sets – Data Pre-Processing
  5. Compressing Data via Dimensionality Reduction
  6. Learning Best Practices for Model Evaluation and Hyperparameter Optimization
  7. Combining Different Models for Ensemble Learning
  8. Applying Machine Learning to Sentiment Analysis
  9. Predicting Continuous Target Variables with Regression Analysis
  10. Working with Unlabeled Data – Clustering Analysis
  11. Implementing a Multi-layer Artificial Neural Network from Scratch
  12. Parallelizing Neural Network Training with PyTorch
  13. Going Deeper – The Mechanics of PyTorch
  14. Classifying Images with Deep Convolutional Neural Networks
  15. Modeling Sequential Data Using Recurrent Neural Networks
  16. Transformers – Improving Natural Language Processing with Attention Mechanisms
  17. Generative Adversarial Networks for Synthesizing New Data
  18. Graph Neural Networks for Capturing Dependencies in Graph Structured Data
  19. Reinforcement Learning for Decision Making in Complex Environments



Sebastian Raschka, Yuxi (Hayden) Liu, and Vahid Mirjalili. Machine Learning with PyTorch and Scikit-Learn. Packt Publishing, 2022.

@book{mlbook2022,
address = {Birmingham, UK},
author = {Sebastian Raschka and Yuxi (Hayden) Liu and Vahid Mirjalili},
isbn = {978-1801819312},
publisher = {Packt Publishing},
title = {{Machine Learning with PyTorch and Scikit-Learn}},
year = {2022}
}

Comments
  • Loss functions for classification - logits/probabilities (page 472)

    Hi Sebastian,

    In the figure on page 472, y_pred has the same value, 0.8, in both the probabilities column (BCELoss) and the logits column (BCEWithLogitsLoss). Should the value in the first column (BCELoss) be 0.69, which equals sigmoid(0.8)?

    Thank you.
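The relationship this question relies on can be checked numerically. The following is a minimal plain-Python sketch of the math, not the book's code: it assumes a true label of 1, and the helper names (sigmoid, bce_from_probability, bce_from_logit) are made up for illustration. It shows that a logit of 0.8 corresponds to a probability of about 0.69, and that the binary cross-entropy computed from that probability matches the one computed directly from the logit.

```python
import math


def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(p, y):
    """Binary cross-entropy when the prediction is already a probability."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_from_logit(z, y):
    """Binary cross-entropy computed directly from a logit.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


logit = 0.8
target = 1.0  # assumed label, chosen only for this illustration

prob = sigmoid(logit)
loss_prob = bce_from_probability(prob, target)
loss_logit = bce_from_logit(logit, target)

print(round(prob, 2))                       # 0.69
print(abs(loss_prob - loss_logit) < 1e-12)  # True
```

If the two columns of the figure are meant to describe the same prediction, the probability column would therefore hold roughly 0.69 where the logit column holds 0.8.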

    opened by labdmitriy 5
  • ch13 pag 438 no softmax needed

    Hello, kindly clarify this: the nn.Sequential model has no softmax at the end. Is that because we use the cross-entropy loss, which does not need one since it is equivalent to combining LogSoftmax and NLLLoss? Yet the text below the code says that the output layer is activated by the softmax.

    See the note in the documentation: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss
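A minimal plain-Python sketch of the point raised here, assuming (as the question does) that the loss applies a log-softmax internally and then takes the negative log-likelihood of the true class. The helper names and the example logits are hypothetical. It compares the loss on raw logits with the loss obtained when a softmax has already been applied to the model output.

```python
import math


def log_softmax(scores):
    """Log-softmax with the usual max-shift for numerical stability."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def softmax(scores):
    return [math.exp(v) for v in log_softmax(scores)]


def cross_entropy(scores, target_index):
    """Log-softmax followed by the negative log-likelihood of the target."""
    return -log_softmax(scores)[target_index]


logits = [2.0, 0.5, -1.0]  # hypothetical raw model outputs
target = 0                 # hypothetical true class index

loss_on_logits = cross_entropy(logits, target)
# Softmax applied in the model, then softmax again inside the loss:
loss_on_probs = cross_entropy(softmax(logits), target)

print(loss_on_logits < loss_on_probs)  # True
```

In this sketch the extra softmax leaves the predicted class unchanged but flattens the distribution the loss sees, which is one reason the final softmax is commonly left out when this kind of loss is used.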

    opened by GianniGi 4
  • LogSoftmax in the output but not in the description/code (page 532)

    Hi Sebastian,

    On page 532, the printed output of the created RNN model includes a log softmax as its last layer:

    (softmax): LogSoftmax(dim=1)
    

    But judging by the model code and the steps that follow, we do not need this layer, because we use nn.CrossEntropyLoss(), whose input is expected to contain raw, unnormalized scores for each class. Is that correct?

    Thank you.

    opened by labdmitriy 4
  • Different code between book and notebook for NN implementation

    ## code in notebook
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split 
    
    iris = load_iris()
    X = iris['data']
    y = iris['target']
     
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1./3, random_state=1)
    
    from torch.utils.data import TensorDataset
    from torch.utils.data import DataLoader
    import numpy as np 
    import torch
    X_train_norm = (X_train - np.mean(X_train)) / np.std(X_train)
    X_train_norm = torch.from_numpy(X_train_norm).float()
    y_train = torch.from_numpy(y_train) 
    
    train_ds = TensorDataset(X_train_norm, y_train)
    
    torch.manual_seed(1)
    batch_size = 2
    train_dl = DataLoader(train_ds, batch_size, shuffle=True)
    
    import torch.nn as nn
    class Model(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super(Model, self).__init__()
            self.layer1 = nn.Linear(input_size, hidden_size)  
            self.layer2 = nn.Linear(hidden_size, output_size)  
    
        def forward(self, x):
            x = self.layer1(x)
            x = nn.Sigmoid()(x)
            x = self.layer2(x)
            x = nn.Softmax(dim=1)(x)
            return x
        
    input_size = X_train_norm.shape[1]
    hidden_size = 16
    output_size = 3
     
    model = Model(input_size, hidden_size, output_size)
    
    learning_rate = 0.001
    
    loss_fn = nn.CrossEntropyLoss()
     
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    
    num_epochs = 100
    loss_hist = [0] * num_epochs
    accuracy_hist = [0] * num_epochs
    
    for epoch in range(num_epochs):
    
        for x_batch, y_batch in train_dl:
            pred = model(x_batch)
            loss = loss_fn(pred, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        
            loss_hist[epoch] += loss.item()*y_batch.size(0)
            is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
            accuracy_hist[epoch] += is_correct.sum()
            
        loss_hist[epoch] /= len(train_dl.dataset)
        accuracy_hist[epoch] /= len(train_dl.dataset)
    import matplotlib.pyplot as plt 
    fig = plt.figure(figsize=(12, 5))
    ax = fig.add_subplot(1, 2, 1)
    ax.plot(loss_hist, lw=3)
    ax.set_title('Training loss', size=15)
    ax.set_xlabel('Epoch', size=15)
    ax.tick_params(axis='both', which='major', labelsize=15)
    
    ax = fig.add_subplot(1, 2, 2)
    ax.plot(accuracy_hist, lw=3)
    ax.set_title('Training accuracy', size=15)
    ax.set_xlabel('Epoch', size=15)
    ax.tick_params(axis='both', which='major', labelsize=15)
    plt.tight_layout()
    
    #plt.savefig('figures/12_09.pdf')
     
    plt.show()
    
    ## code in book
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split 
    
    iris = load_iris()
    X = iris['data']
    y = iris['target']
     
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1./3, random_state=1)
    
    
    from torch.utils.data import TensorDataset
    from torch.utils.data import DataLoader
    
    X_train_norm = (X_train - np.mean(X_train)) / np.std(X_train)
    X_train_norm = torch.from_numpy(X_train_norm).float()
    y_train = torch.from_numpy(y_train) 
    
    train_ds = TensorDataset(X_train_norm, y_train)
    
    torch.manual_seed(1)
    batch_size = 2
    train_dl = DataLoader(train_ds, batch_size, shuffle=True)
    
    class Model(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            ## in the book this is written without Model, self, but I added them
            super().__init__()
            self.layer1 = nn.Linear(input_size, hidden_size)  
            self.layer2 = nn.Linear(hidden_size, output_size)  
    
        def forward(self, x):
            x = self.layer1(x)
            x = nn.Sigmoid()(x)
            x = self.layer2(x)
            x = nn.Softmax(dim=1)(x)
            return x
        
    input_size = X_train_norm.shape[1]
    hidden_size = 16
    output_size = 3
     
    model = Model(input_size, hidden_size, output_size)
    
    learning_rate = 0.001
    
    loss_fn = nn.CrossEntropyLoss()
     
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    
    num_epochs = 100
    loss_hist = [0] * num_epochs
    accuracy_hist = [0] * num_epochs
    ## got error here
    for epoch in range(num_epochs):
    
        for x_batch, y_batch in train_dl:
            pred = model(x_batch)
            loss = loss_fn(pred, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        
            loss_hist[epoch] += loss.item()*y_batch.size(0)
            is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
            ## changed mean() from the book's code to sum(), but it still doesn't work
            accuracy_hist[epoch] += is_correct.sum()
            
        loss_hist[epoch] /= len(train_dl.dataset)
        accuracy_hist[epoch] /= len(train_dl.dataset)
    
    
    fig = plt.figure(figsize=(12, 5))
    ax = fig.add_subplot(1, 2, 1)
    ax.plot(loss_hist, lw=3)
    ax.set_title('Training loss', size=15)
    ax.set_xlabel('Epoch', size=15)
    ax.tick_params(axis='both', which='major', labelsize=15)
    
    ax = fig.add_subplot(1, 2, 2)
    ax.plot(accuracy_hist, lw=3)
    ax.set_title('Training accuracy', size=15)
    ax.set_xlabel('Epoch', size=15)
    ax.tick_params(axis='both', which='major', labelsize=15)
    plt.tight_layout()
    
    #plt.savefig('figures/12_09.pdf')
     
    plt.show()
    
    ## Note: I wrote the code step by step in a local notebook but got the error below. However, the code works when running the notebook on Google Colab. Is it due to the Python version?
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-145-4bceac91f560> in <module>
          7     for x_batch, y_batch in train_dl:
          8         pred = model(x_batch)
    ----> 9         loss = loss_fn(pred, y_batch)
         10         loss.backward()
         11         optimizer.step()
    
    ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
       1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1109                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1110             return forward_call(*input, **kwargs)
       1111         # Do not call functions when jit is used
       1112         full_backward_hooks, non_full_backward_hooks = [], []
    
    ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\torch\nn\modules\loss.py in forward(self, input, target)
       1161 
       1162     def forward(self, input: Tensor, target: Tensor) -> Tensor:
    -> 1163         return F.cross_entropy(input, target, weight=self.weight,
       1164                                ignore_index=self.ignore_index, reduction=self.reduction,
       1165                                label_smoothing=self.label_smoothing)
    
    ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\torch\nn\functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
       2994     if size_average is not None or reduce is not None:
       2995         reduction = _Reduction.legacy_get_string(size_average, reduce)
    -> 2996     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
       2997 
       2998 
    
    RuntimeError: expected scalar type Long but found Int
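One possible cause, not confirmed in this thread: the labels returned by load_iris are NumPy integers, and on some platforms (for example Windows with older NumPy releases) the default NumPy integer is 32-bit. torch.from_numpy then yields an Int tensor, while nn.CrossEntropyLoss expects Long class indices. The sketch below assumes that explanation, simulates 32-bit labels with a hypothetical array and dummy model outputs, and casts the labels with .long() before computing the loss.

```python
import numpy as np
import torch
import torch.nn as nn

# Simulated labels stored as 32-bit integers (an assumption about the
# reporter's platform; on other platforms the default may be 64-bit).
y_int32 = np.array([0, 2, 1, 1], dtype=np.int32)

# Dummy model outputs: 4 samples, 3 classes, all scores equal.
logits = torch.zeros(4, 3)

y_bad = torch.from_numpy(y_int32)           # dtype torch.int32 ("Int")
y_good = torch.from_numpy(y_int32).long()   # dtype torch.int64 ("Long")

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, y_good)  # works because the targets are Long

print(y_bad.dtype, y_good.dtype)
```

Applied to the code above, the corresponding change would be y_train = torch.from_numpy(y_train).long(). This would also be consistent with the same code running on Google Colab, where the default integer is typically 64-bit.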
    
    opened by naiborhujosua 3
  • CH 13: Changing the order of the cells gives different results. Pg 420 - 422

    In the "Solving an XOR classification problem" section, the author defines the model, then the loss function and the optimizer, and then creates the data loader. Finally, he defines the training function and plots the results. If I follow this same sequence, I get a figure that differs from the one shown in the book.

    In the notebook, however, the author defines the data loader first, then the model, then the loss function and the optimizer, followed by the training and plotting procedures. That is, the data loader is defined at the start rather than just before the training procedure.

    Can anyone please explain why changing the order of the cells causes this difference?
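One plausible explanation, offered as an assumption rather than a confirmed diagnosis: steps such as weight initialization and batch shuffling draw from a shared, seeded random number generator, so the order in which they run (and where the seed is reset) determines which random values each step receives. The sketch below imitates this with Python's random module as a stand-in for PyTorch's global generator; init_weights and shuffle_indices are hypothetical placeholders for creating the model and shuffling the data.

```python
import random


def init_weights(n):
    """Placeholder for model creation: draws n random initial weights."""
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


def shuffle_indices(n):
    """Placeholder for a shuffling data loader: draws a random sample order."""
    idx = list(range(n))
    random.shuffle(idx)
    return idx


def model_then_loader(seed=1):
    random.seed(seed)
    weights = init_weights(3)
    order = shuffle_indices(8)
    return weights, order


def loader_then_model(seed=1):
    random.seed(seed)
    order = shuffle_indices(8)
    weights = init_weights(3)
    return weights, order


weights_a, order_a = model_then_loader()
weights_b, order_b = loader_then_model()

# Same seed, same order of steps: identical results on every run.
print(model_then_loader() == model_then_loader())  # True
# Same seed, different order of steps: different initial weights.
print(weights_a == weights_b)                      # False
```

Under this assumption, neither cell order is wrong; the two orders simply start training from different initial weights and batch orders, so the resulting curves differ.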

    opened by OmarAlmighty 2
  • Visualizing Transformer based on your notebook

    Dear Prof. Sebastian Raschka, I published a blog post with an accompanying 3D interactive website, based on your published notebook, that visualizes the inner workings of the Transformer. I hope you can check it out!

    opened by jackli777 2
  • Typo in page 80 - logical_or

    The last paragraph on page 80 says "Using the following code, we will create a simple dataset that has the form of an XOR gate using the logical_or function".

    It should be logical_xor, as we can deduce from the preceding explanation and from the code immediately below the text.
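The difference between the two functions is easy to see on the four possible input pairs. This is a small plain-Python sketch, not the book's code; the two helper functions are written out here for illustration, and NumPy's logical_or and logical_xor apply the same truth tables elementwise to arrays.

```python
from itertools import product


def logical_or(a, b):
    """True if at least one input is true."""
    return bool(a) or bool(b)


def logical_xor(a, b):
    """True if exactly one input is true."""
    return bool(a) != bool(b)


# Print the full truth table: a, b, OR, XOR.
for a, b in product([False, True], repeat=2):
    print(a, b, logical_or(a, b), logical_xor(a, b))

# Input pairs on which OR and XOR disagree.
differing = [
    (a, b)
    for a, b in product([False, True], repeat=2)
    if logical_or(a, b) != logical_xor(a, b)
]
print(differing)  # [(True, True)]
```

The two functions disagree only when both inputs are true, which is exactly the case that distinguishes an XOR gate from an OR gate.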

    opened by pablo-sampaio 2
  • Possible error in ch14_part2.ipynb of GitHub

    Cell In[6] contains the code 'get_smile = lambda attr: attr[18]'. According to the 'list_attr_celeba' text file in the celeba folder, it should be 'get_smile = lambda attr: attr[31]'.

    opened by Unamu7simure 2
  • Downloading CelebA dataset from book's download link.

    On page 483, one way to download the CelebA dataset is with the book's download link. In the instructions, you mentioned that we must unzip the downloaded file. But one step that is missing is that we have to unzip the img_align_celeba.zip too; otherwise, PyTorch will throw an error complaining the dataset is corrupt, which is caused by this line of code:

    https://github.com/pytorch/vision/blob/22400011d6a498ecf77797a56dfe13bc94c426ca/torchvision/datasets/celeba.py#L142

    So, I think it's better to mention that explicitly too.

    P.S: Thanks for this excellent book!
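The extra step could be scripted roughly as follows. This is a sketch under assumptions: the function name extract_with_nested_zips is made up for illustration, and the archive built in the demo is a tiny stand-in whose layout is not taken from the book's download. It extracts an archive and then extracts any .zip files found inside it, such as img_align_celeba.zip, next to where they were unpacked.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_with_nested_zips(archive_path, dest_dir):
    """Extract archive_path into dest_dir, then unzip any .zip files inside."""
    archive = Path(archive_path).resolve()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as outer:
        outer.extractall(dest)

    extracted = []
    # sorted() materializes the list first, so newly extracted files
    # are not picked up while iterating.
    for inner_path in sorted(dest.rglob("*.zip")):
        if inner_path.resolve() == archive:
            continue  # skip the outer archive if it lives inside dest_dir
        with zipfile.ZipFile(inner_path) as inner:
            inner.extractall(inner_path.parent)
        extracted.append(inner_path)
    return extracted


# Demo with a tiny stand-in archive (hypothetical layout and file names).
with tempfile.TemporaryDirectory() as tmp:
    inner_bytes = io.BytesIO()
    with zipfile.ZipFile(inner_bytes, "w") as z:
        z.writestr("img_align_celeba/000001.jpg", b"not a real image")
    outer_path = Path(tmp) / "download.zip"
    with zipfile.ZipFile(outer_path, "w") as z:
        z.writestr("celeba/img_align_celeba.zip", inner_bytes.getvalue())

    nested = extract_with_nested_zips(outer_path, Path(tmp) / "data")
    image = Path(tmp) / "data" / "celeba" / "img_align_celeba" / "000001.jpg"
    demo_nested_count = len(nested)
    demo_image_exists = image.exists()

print(demo_nested_count, demo_image_exists)  # 1 True
```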

    opened by Mahyar24 2
  • Missing text chunk (page 530)

    Hi Sebastian,

    On page 530, there is a code snippet for preprocessing the text for the language model:

    text_chunks = [text_encoded[i:i+chunk_size]
                   for i in range(len(text_encoded)-chunk_size)]
    

    The last text chunk is probably not included. To include all text chunks, we need the following code:

    text_chunks = [text_encoded[i:i+chunk_size]
                   for i in range(len(text_encoded)-chunk_size+1)]
    

    Then, for the last value of i (len(text_encoded)-chunk_size), we get the text chunk text_encoded[len(text_encoded)-chunk_size:len(text_encoded)], which has size chunk_size and, I suppose, can be included as an additional text chunk.

    Thank you.
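The off-by-one can be checked on a toy sequence. In this sketch the values of text_encoded and chunk_size are made up for illustration; only the two list comprehensions come from the issue.

```python
# Toy stand-ins for the encoded text and the chunk length.
text_encoded = list(range(10))
chunk_size = 4

# Version quoted from the book.
chunks_book = [text_encoded[i:i + chunk_size]
               for i in range(len(text_encoded) - chunk_size)]

# Version proposed in the issue.
chunks_fixed = [text_encoded[i:i + chunk_size]
                for i in range(len(text_encoded) - chunk_size + 1)]

print(len(chunks_book), len(chunks_fixed))  # 6 7
print(chunks_book[-1])                      # [5, 6, 7, 8]
print(chunks_fixed[-1])                     # [6, 7, 8, 9]
```

Every chunk in both lists has length chunk_size; the proposed version adds only the final full-length chunk that ends at the last element.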

    opened by labdmitriy 2
  • Embedding matrix dimension (page 519)

    Hi Sebastian,

    The following statement appears on page 519: "The output will have the dimensionality batchsize × input_length × embedding_dim, where embedding_dim is the size of the embedding features (here, set to 3). The other argument provided to the embedding layer, num_embeddings, corresponds to the unique integer values that the model will receive as input (for instance, n + 2, set here to 10). Therefore, the embedding matrix in this case has the size 10×6."

    Based on this, is there a typo in the last sentence, and is the embedding matrix dimension actually 10×3?

    Thank you.
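The shapes in question can be illustrated with a plain-Python stand-in for an embedding layer. This is only a sketch: the lookup table is a list of lists with random values, the embed helper and the example batch are hypothetical, and only num_embeddings = 10 and embedding_dim = 3 are taken from the quoted passage.

```python
import random

num_embeddings = 10  # number of distinct integer indices, from the quote
embedding_dim = 3    # size of each embedding vector, from the quote

random.seed(1)
# One row per index, one column per embedding feature.
embedding_matrix = [
    [random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(num_embeddings)
]


def embed(batch):
    """Replace every integer index with its row of the embedding matrix."""
    return [[embedding_matrix[idx] for idx in sequence] for sequence in batch]


# Hypothetical input: batchsize 2, input_length 4.
batch = [[1, 2, 4, 5], [4, 3, 2, 0]]
output = embed(batch)

matrix_shape = (len(embedding_matrix), len(embedding_matrix[0]))
output_shape = (len(output), len(output[0]), len(output[0][0]))
print(matrix_shape)  # (10, 3)
print(output_shape)  # (2, 4, 3)
```

With one row per index and one column per embedding feature, the table has num_embeddings × embedding_dim entries, which is 10×3 for the values in the quote.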

    opened by labdmitriy 2
  • Label `losses_` docstring as being log loss, not mean squared error

    The docstring for LogisticRegressionGD.losses_ states that it is composed of the mean squared error, but I think it is composed of the log loss. I changed the docstring to reflect that.

    Thanks for the book!

    opened by paw-lu 1
  • chapter 16, page 547

    "the columns in this attention matrix should sum to 1"

    Since the sum is taken over all the elements of each row, should this instead be:

    "Each row in this attention matrix should sum to 1"?

    It is probably worth noting that all the diagonal values have the maximum value, since no word is repeated. If, for example, the same word appeared twice, there would be two identical values in the corresponding row.
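A small plain-Python sketch of why the rows, rather than the columns, sum to 1, assuming the attention weights are obtained by applying a softmax to each row of a score matrix. The score values are invented for illustration.

```python
import math


def softmax(row):
    """Softmax over one row, shifted by the row maximum for stability."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3x3 score matrix with the largest score on each diagonal.
scores = [
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.4],
    [0.0, 0.6, 3.0],
]

# Normalize each row independently.
attention = [softmax(row) for row in scores]

row_sums = [sum(row) for row in attention]
col_sums = [sum(col) for col in zip(*attention)]

print([round(s, 6) for s in row_sums])  # [1.0, 1.0, 1.0]
print([round(s, 2) for s in col_sums])
```

Because each row is normalized on its own, nothing constrains the column sums, and in this example they all differ from 1.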

    opened by GianniGi 0
  • chapter 14 pag.489 transforms

    Hello, I don't know if I was the only one who missed this on first reading, but I didn't notice that the transforms in transform_train are applied to the full dataset each time it is reloaded, that is, once per epoch. That is probably because the line that reloads the dataset looks quite different from the usual, more legible one.

    I would replace the dataset-reloading line with the lines marked as new in the code below, so that the code becomes more legible and familiar:

    from torch.utils.data import DataLoader
    
    celeba_train_dataset = torchvision.datasets.CelebA(image_path, 
                                                       split='train', 
                                                       target_type='attr', 
                                                       download=False, 
                                                       transform=transform_train,
                                                       target_transform=get_smile)
    
    torch.manual_seed(1)
    data_loader = DataLoader(celeba_train_dataset, batch_size=2)
    
    fig = plt.figure(figsize=(15, 6))
    
    num_epochs = 5
    for j in range(num_epochs):
        for img_batch, label_batch in data_loader: # new line
            img = img_batch[0]
            ax = fig.add_subplot(2, 5, j + 1)
            ax.set_xticks([])
            ax.set_yticks([])
            ax.set_title(f'Epoch {j}:', size=15)
            ax.imshow(img.permute(1, 2, 0))
        
            img = img_batch[1]
            ax = fig.add_subplot(2, 5, j + 6)
            ax.set_xticks([])
            ax.set_yticks([])
            ax.imshow(img.permute(1, 2, 0))
            break #new break
            
          
        
    #plt.savefig('figures/14_16.png', dpi=300)
    plt.show()
    
    opened by GianniGi 2
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug, CVE-2007-4559, a 15-year-old bug in the Python tarfile package. When extract() or extractall() is called on a tarfile object without sanitizing the input, a maliciously crafted .tar file can perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.
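The kind of check described above can be sketched as follows. This is not the patch from the pull request, whose contents are not shown here; safe_extractall, is_within_directory, and make_tar are illustrative names, and the demo archives are built in memory. The sketch refuses to extract when any member would resolve to a path outside the target directory. It does not cover every attack (for example, symbolic-link members), and recent Python versions also offer an extraction filter argument on extractall.

```python
import io
import os
import tarfile
import tempfile


def is_within_directory(directory, target):
    """Return True if target resolves to a path inside directory."""
    abs_directory = os.path.realpath(directory)
    abs_target = os.path.realpath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extractall(tar, path="."):
    """Extract all members, but only if none would escape path."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Blocked path traversal in member: {member.name}")
    tar.extractall(path)


def make_tar(member_name):
    """Build a one-file tar archive in memory (demo helper)."""
    buffer = io.BytesIO()
    data = b"demo"
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


with tempfile.TemporaryDirectory() as target_dir:
    # A well-behaved archive is extracted normally.
    with tarfile.open(fileobj=make_tar("ok.txt")) as tar:
        safe_extractall(tar, target_dir)
    safe_ok = os.path.exists(os.path.join(target_dir, "ok.txt"))

    # An archive with a "../" member is rejected before anything is written.
    try:
        with tarfile.open(fileobj=make_tar("../evil.txt")) as tar:
            safe_extractall(tar, target_dir)
        blocked = False
    except ValueError:
        blocked = True

print(safe_ok, blocked)  # True True
```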

    opened by TrellixVulnTeam 0
Releases(v1.1)
Owner
Sebastian Raschka
Machine Learning researcher & passionate open source contributor. Author of the "Python Machine Learning" book.