The tutorial is a collection of many other resources and my own notes

Last update: Sep 19, 2022

Related tags

Overview

1. Why we need CTC? ---> looking back on history
- 1.1. About CRNN
- 1.2. from Cross Entropy Loss to CTC Loss
1. Details about CTC
- 2.1. intuition: forward algorithm
- 2.2. implementation: forward algorithm with dynamic programming
  - 2.2.1. dummy input ---> what the input looks like
  - 2.2.2. expand the target ---> what the target matrix look like
- 2.3. Alpha Matrix
- 2.4. Dynamic programming based on 3 conditions

# TOC

Before reading
the tutorial is a collection of many other resources and my own notes. Note that the ref if any in the tutorial means the whole passage. And part to be referred if any means the part has been summarized or detailed by me. Feel free to click the [the part to be referred] to read the original.

CTC_pytorch

1. Why we need CTC? ---> looking back on history

Feel free to skip it if you already know the purpose of CTC coming into being.

1.1. About CRNN

We need to learn CRNN because in the context we need an output to be a sequence.

ref: the overview from CRNN to CTC !! highly recommended !!

part to be referred

multi-digit sequence recognition

Characted-based
word-based
sequence-to-sequence
CRNN = CNN + RNN
- CNN --> relationship between pixel
- (the small fonts) Specifially, each feature vec of a feature seq is generated from left to right on the feature maps. That means the i-th feature vec is the concatenation of the columns of all the maps. So the shape of the tensor can be reshaped as e.g. (batch_size, 32, 256)

image1

1.2. from Cross Entropy Loss to CTC Loss

Usually, CE is applied to compute loss as the following way. And gt(also target) can be encoded as a stable matrix or vector.

image2

However, in OCR or audio recognition, each target input/gt has various forms. e.g. "I like to play piano" can be unpredictable in handwriting.

image3

Some stroke is longer than expected. Others are short.
Assume that the above example is encoded as number sequence [5, 3, 8, 3, 0].

image4

Tips: blank(the blue box symbol here) is introduced because we allow the model to predict a blank label due to unsureness or the end comes, which is similar with human when we are not pretty sure to make a good prediction. ref:lihongyi lecture starting from 3:45

Therefore, we see that this is an one-to-many question where e.g. "I like to play piano" has many target forms. But we not just have one sequence. We might also have other sequence e.g. "I love you", "Not only you but also I like apple" etc, none of which have a same sentence length. And this is what cross entropy cannot achieve in one batch. But now we can encode all sequences/sentences into a new sequence with a max length of all sequences.

e.g.
"I love you" --> len = 10
"How are you" --> len = 11
"what's your name" --> len = 16

In this context the input_length should be >= 16.

For dealing with the expanded targets, CTC is introduced by using the ideas of (1) HMM forward algorithm and (2) dynamic programing.

2. Details about CTC

2.1. intuition: forward algorithm

image5

image6

Tips: the reason we have - inserted between each two token is because, for each moment/horizontal(Note) position we allow the model to predict a blank representing unsureness.

Note that moment is for audio recognition analogue. horizontal position is for OCR analogue.

2.2. implementation: forward algorithm with dynamic programming

the complete code is CTC.py

given 3 samples, they are
"orange" :[15, 18, 1, 14, 7, 5] len = 6
"apple" :[1, 16, 16, 12, 5] len = 5
"watermelon" :[[23, 1, 20, 5, 18, 13, 5, 12, 15, 14] len = 10

{0:blank, 1:A, 2:B, ... 26:Z}

2.2.1. dummy input ---> what the input looks like

# ------------ a dummy input ----------------
log_probs = torch.randn(15, 3, 27).log_softmax(2).detach().requires_grad_()# 15:input_length  3:batchsize  27:num of token(class)
# targets = torch.randint(0, 27, (3, 10), dtype=torch.long)
targets = torch.tensor([[15, 18, 1,  14, 7, 5,  0, 0,  0,  0],
                        [1,  16, 16, 12, 5, 0,  0, 0,  0,  0],
                        [23, 1,  20, 5, 18, 13, 5, 12, 15, 14]]
                        )

# assume that the prediction vary within 15 input_length.But the target length is still the true length.
""" 
e.g. [a,0,0,0,p,0,p,p,p, ...l,e] is one of the prediction
 """
input_lengths = torch.full((3,), 15, dtype=torch.long)
target_lengths = torch.tensor([6,5,10], dtype = torch.long)

2.2.2. expand the target ---> what the target matrix look like

Recall that one target can be encoded in many different forms. So we introduce a targets mat to represent it as follows.

"-d-o-g-" ">

target_prime = targets.new_full((2 * target_length + 1,), blank) # create a targets_prime full of zero

target_prime[1::2] = targets[i, :target_length] # equivalent to insert blanks in targets. e.g. targets = "dog" --> "-d-o-g-"

Now we got target_prime(also expanded target) for e.g. "apple"
target_prime is
tensor([ 0, 1, 0, 16, 0, 16, 0, 12, 0, 5, 0]) which is visualized as the red part(also t1)

image7

Note that the t8 is only for illustration. In the example, the width of target matrix should be 15(input_length).

probs = log_probs[:input_length, i].exp()

Then we convert original inputs from log-space like this, referring to "In practice, the above recursion ..." in original paper https://www.cs.toronto.edu/~graves/icml_2006.pdf

2.3. Alpha Matrix

image8

# alpha matrix init at t1 indicated by purple boxes.
alpha_col = log_probs.new_zeros((target_length * 2 + 1,))
alpha_col[0] = probs[0, blank] # refers to green box
alpha_col[1] = probs[0, target_prime[1]]

blank is the index of blank(here it's 0)
target_prime[1] refers to the 1-st index of the token. e.g. "apple": "a", "orange": "o"

2.4. Dynamic programming based on 3 conditions

refer to the details in CTC.py

The tutorial is a collection of many other resources and my own notes

Related tags

Overview

CTC_pytorch

1. Why we need CTC? ---> looking back on history

1.1. About CRNN

image1

1.2. from Cross Entropy Loss to CTC Loss

image2

image3

image4

2. Details about CTC

2.1. intuition: forward algorithm

image5

image6

2.2. implementation: forward algorithm with dynamic programming

2.2.1. dummy input ---> what the input looks like

2.2.2. expand the target ---> what the target matrix look like

image7

2.3. Alpha Matrix

image8

2.4. Dynamic programming based on 3 conditions

reference:

Owner

手写AI

Generate a backend and frontend stack using Python and json-ld, including interactive API documentation.

My solutions to the Advent of Code 2021 problems in Go and Python 🎄

Soccerdata - Efficiently scrape soccer data from various sources

freeCodeCamp Scientific Computing with Python Project for Certification.

Data-science-on-gcp - Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

Autolookup GUI Plugin for Plover

CoderByte | Practice, Tutorials & Interview Preparation Solutions|

Some of the best ways and practices of doing code in Python!

This program has been coded to allow the user to rename all the files in the entered folder.

Mozilla Campus Club CCEW is a student committee working to spread awareness on Open Source software.

A course-planning, course-map rendering and GPA-calculation web service, designed for the SFU (Simon Fraser University) student.

Documentation for the lottie file format

🏆 A ranked list of awesome python developer tools and libraries. Updated weekly.

Generate a single PDF file from MkDocs repository.

Python Eacc is a minimalist but flexible Lexer/Parser tool in Python.

A hack to run custom shell commands when building documentation on Read the Docs.

This is a small project written to help build documentation for projects in less time.

FxBuzzly - Buzzly.art links do not embed in Discord, this fixes them (rudimentarily)

The mitosheet package, trymito.io, and other public Mito code.

Collections of Beautiful Latex Snippets