Ruixiang's blog

work harder, study better, do faster, become stronger


Implementation of ML & DL & NLP algorithms (WIP)

This blog aims to implement some important Machine Learning, Deep Learning, Natural Language Processing, and Large Language Model algorithms from scratch, to help readers gain a deeper understanding of how these algorithms work in real code.

Classic Machine Learning algorithms

K-means

  • Steps:

    1. Decide on the number of clusters k (predefined).
    2. Select k points at random as the initial cluster centers.
    3. Assign each object to its closest cluster center according to the Euclidean distance function.
    4. Calculate the centroid (mean) of all objects in each cluster.
    5. Repeat steps 3 and 4 until the assignments no longer change between consecutive rounds.
  • Implementation:

    import numpy as np

    def cal_dist(vector1, vector2):
        # Euclidean distance
        return np.sqrt(np.sum((vector1 - vector2) ** 2))

    def k_means(data, k):
        n_samples, n_features = data.shape
        # randomly initialize the centroids
        centroid_index = np.random.choice(range(n_samples), k, replace=False)
        centroid_coords = data[centroid_index].astype(float)
        # label of each data point
        labels = [-1] * n_samples
        # iterate until the centroids no longer change
        while True:
            for i in range(n_samples):
                min_dist = float("inf")
                min_cluster_idx = -1
                for j in range(k):
                    dist = cal_dist(data[i], centroid_coords[j])
                    if dist < min_dist:
                        min_dist = dist
                        min_cluster_idx = j
                # update the cluster index of each point
                labels[i] = min_cluster_idx
            # flag indicating whether any centroid changed: stop condition for the while loop
            centroid_update = False
            # recalculate the centroids
            for i, centroid in enumerate(centroid_coords):
                # gather all data points of this cluster
                cluster_points = [data[m] for m in range(n_samples) if labels[m] == i]
                if not cluster_points:
                    continue
                cluster_points = np.stack(cluster_points, axis=0)
                # calculate the new centroid
                new_centroid_coord = np.mean(cluster_points, axis=0)
                # check whether the centroid of this cluster changed
                if cal_dist(centroid, new_centroid_coord) > 1e-5:
                    centroid_coords[i] = new_centroid_coord
                    centroid_update = True
            if not centroid_update:
                break
        return labels

KNN implementation

import numpy as np
from collections import Counter

class kNN():
    '''k-Nearest Neighbours'''
    # Initialise
    def __init__(self, k=3, metric='euclidean', p=None):
        self.k = k
        self.metric = metric
        self.p = p

    # Euclidean distance (l2 norm)
    def euclidean(self, v1, v2):
        return np.sqrt(np.sum((v1 - v2)**2))

    # Manhattan distance (l1 norm)
    def manhattan(self, v1, v2):
        return np.sum(np.abs(v1 - v2))

    # Minkowski distance (lp norm)
    def minkowski(self, v1, v2, p=2):
        return np.sum(np.abs(v1 - v2)**p)**(1/p)

    # Store train set
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    # Make predictions
    def predict(self, X_test):
        preds = []
        # Loop over rows in test set
        for test_row in X_test:
            nearest_neighbours = self.get_neighbours(test_row)
            # majority class prediction for a test instance
            majority = Counter(nearest_neighbours).most_common(1)[0][0]
            preds.append(majority)
        return np.array(preds)

    # Get nearest neighbours
    def get_neighbours(self, test_row):
        distances = list()
        # Calculate distance to all points in X_train
        for (train_row, train_class) in zip(self.X_train, self.y_train):
            if self.metric == 'euclidean':
                dist = self.euclidean(train_row, test_row)
            elif self.metric == 'manhattan':
                dist = self.manhattan(train_row, test_row)
            elif self.metric == 'minkowski':
                dist = self.minkowski(train_row, test_row, self.p if self.p is not None else 2)
            else:
                raise NameError('Supported metrics are euclidean, manhattan and minkowski')
            distances.append((dist, train_class))
        # Sort distances
        distances.sort(key=lambda x: x[0])
        # Identify k nearest neighbours
        neighbours = list()
        for i in range(self.k):
            neighbours.append(distances[i][1])
        return neighbours

Logistic regression

NumPy implementation:

# https://blog.51cto.com/u_15661962/5531479
import numpy as np

# trainable weights and bias init
def weight_init(n_features):
    # n_features: size of each data point
    w = np.zeros((1, n_features))
    b = 0
    return w, b

# activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# loss function: binary cross entropy
def bce_loss(pred, y, n):
    # n is the number of samples
    # pred, y shape: [n, 1]
    loss = (-1/n) * (np.sum(y * np.log(pred) + (1 - y) * np.log(1 - pred)))
    return loss

def logistic_regression(X, Y):
    # X: input data, shape (n_samples, n_features)
    # Y: labels, shape (n_samples, 1)
    n_samples, n_features = X.shape
    w, b = weight_init(n_features)   # w: (1, n_features), b: scalar
    # hyperparameters
    iteration = 10
    lr = 1e-5
    for i in range(iteration):
        # np.dot(X, w.T) == X @ w.T, shape (n_samples, 1)
        pred = sigmoid(np.dot(X, w.T) + b)
        loss = bce_loss(pred, Y, n_samples)
        # gradients of the BCE loss w.r.t. w and b
        # reference: https://laobadao.github.io/2017/10/27/logistic-cost/index.html
        # https://gist.github.com/golamSaroar/1a1ce33139cf77c37cd6bd123f0fa9cb
        w_grad = (1 / n_samples) * np.dot((pred - Y).T, X)   # (1, n_features)
        b_grad = (1 / n_samples) * np.sum(pred - Y)
        # update w, b
        w -= lr * w_grad
        b -= lr * b_grad
    return w, b

Linear Regression

NumPy implementation that covers both multiple linear regression (two or more features per data point) and simple linear regression (a single feature per data point).
reference:

import numpy as np

# mean-squared error loss
def mse_loss(y_pred, y):
    n = y.shape[0]  # number of samples
    loss = (1/n) * np.sum((y - y_pred)**2)
    return loss

# trainable weights and bias init
# n_features >= 2 for multiple linear regression
# n_features = 1 for simple linear regression
def weight_init(n_features):
    w = np.zeros((1, n_features))
    b = 0
    return w, b

def gradientDescent(n_iteration, x, y, lr=1e-5):
    # x: (n_samples, n_features), y: (n_samples, 1)
    n_samples, n_features = x.shape[0], x.shape[1]
    w, b = weight_init(n_features)
    for i in range(n_iteration):
        y_pred = x @ w.T + b          # (n_samples, 1)
        loss = mse_loss(y_pred, y)
        # gradients of the MSE loss w.r.t. w and b
        w_grad = (1 / n_samples) * np.dot((y_pred - y).T, x)   # (1, n_features)
        b_grad = (1 / n_samples) * np.sum(y_pred - y)
        w -= lr * w_grad
        b -= lr * b_grad
    return w, b

Activation Functions

activation_functions

  • sigmoid (value from 0 to 1)

    import numpy as np
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
  • ReLU (from 0 to infinity)

    def relu(x):
        if x < 0:
            return 0
        else:
            return x
  • Leaky ReLU

    def leaky_relu(x, alpha=0.01):
        """Compute the Leaky ReLU activation."""
        return max(alpha * x, x)
  • tanh (from -1 to 1)

    import numpy as np
    def tanh(x):
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
  • Softmax (from 0 to 1)

    import numpy as np
    def softmax(x):
        # assumes x is a vector
        x = x - np.max(x)  # to avoid overflow
        return np.exp(x) / np.sum(np.exp(x))
  • GLU (Gated Linear Units) activation function
    glu

    import torch.nn as nn

    class GLU(nn.Module):
        def __init__(self, in_size):
            super().__init__()
            self.linear1 = nn.Linear(in_size, in_size)
            self.linear2 = nn.Linear(in_size, in_size)

        def forward(self, x):
            return self.linear1(x) * self.linear2(x).sigmoid()
  • Sigmoid Linear Unit (SiLU/swish) activation function
    silu

import numpy as np

def silu(x):
    """Compute the SiLU activation for numpy arrays."""
    return x / (1 + np.exp(-x))

Loss Functions

loss_func

Binary Cross Entropy Loss

import numpy as np
def binary_cross_entropy(actual, predicted):
    sum_score = 0.0
    for i in range(len(actual)):
        # both terms of the BCE loss; 1e-10 avoids log(0)
        sum_score += actual[i] * np.log(1e-10 + predicted[i]) + (1 - actual[i]) * np.log(1e-10 + 1 - predicted[i])
    mean_sum_score = -1.0 / len(actual) * sum_score
    return mean_sum_score

Categorical Cross Entropy Loss

import numpy as np
def categorical_cross_entropy(actual, predicted):
    sum_score = 0.0
    for i in range(len(actual)):
        for j in range(len(actual[i])):
            sum_score += actual[i][j] * np.log(1e-10 + predicted[i][j])
    mean_sum_score = -1.0 / len(actual) * sum_score
    return mean_sum_score

# class CrossEntropyLoss():
#     def forward(self, in_probs, labels):
#         self.in_probs = in_probs
#         self.labels = labels
#         return np.sum(-labels * np.log(self.in_probs))
#
#     def backward(self):
#         return -self.labels / self.in_probs

KL divergence

kl_diver

import numpy as np
def KL(P, Q):
    """
    Epsilon is used here to avoid conditional code for
    checking that neither P nor Q is equal to 0.
    """
    epsilon = 0.00001
    P = P + epsilon
    Q = Q + epsilon
    divergence = np.sum(P * np.log(P / Q))
    return divergence

Evaluation metrics

Recall, Precision, F1, accuracy

rpf1

import numpy as np

ground_truth = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]
prediction = [0.1, 0.3, 0.2, 0.6, 0.8, 0.05, 0.9, 0.5, 0.3, 0.66, 0.3, 0.2, 0.85, 0.15, 0.99]
thresholds = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.99, 1.0]

# True Positive
def true_positive(ground_truth, prediction):
    tp = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 1 and pred == 1:
            tp += 1
    return tp

# True Negative
def true_negative(ground_truth, prediction):
    tn = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 0 and pred == 0:
            tn += 1
    return tn

# False Positive
def false_positive(ground_truth, prediction):
    fp = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 0 and pred == 1:
            fp += 1
    return fp

# False Negative
def false_negative(ground_truth, prediction):
    fn = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 1 and pred == 0:
            fn += 1
    return fn

# True Positive Rate (Sensitivity or Recall) = TP / (TP + FN)
def tpr(ground_truth, prediction):
    tp = true_positive(ground_truth, prediction)
    fn = false_negative(ground_truth, prediction)
    return tp / (tp + fn)

# False Positive Rate (Fall-out, i.e. 1 - Specificity) = FP / (TN + FP)
def fpr(ground_truth, prediction):
    fp = false_positive(ground_truth, prediction)
    tn = true_negative(ground_truth, prediction)
    return fp / (tn + fp)

# ROC and AUC
true_positive_rate = []
false_positive_rate = []

for threshold in thresholds:
    # binarize predictions at this threshold
    value_pred = [1 if x >= threshold else 0 for x in prediction]
    value_tpr = tpr(ground_truth, value_pred)
    value_fpr = fpr(ground_truth, value_pred)
    true_positive_rate.append(value_tpr)
    false_positive_rate.append(value_fpr)

# ROC curve: X-axis => False Positive Rate (FPR), Y-axis => True Positive Rate (TPR)
# AUC: area under the ROC curve
def _roc_auc_score(true_positive_rate, false_positive_rate):
    tpr, fpr = np.array(true_positive_rate), np.array(false_positive_rate)
    # compute AUC using the trapezoidal rule;
    # appending an extra 0 just makes the diff arrays the same length as tpr/fpr
    zero = np.array([0])
    tpr_diff = np.hstack((np.diff(tpr), zero))
    fpr_diff = np.hstack((np.diff(fpr), zero))
    auc = np.dot(tpr, fpr_diff) + np.dot(tpr_diff, fpr_diff) / 2
    return auc

Macro & Micro average recall/precision

A macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally), whereas a micro-average will aggregate the contributions of all classes to compute the average metric. In macro averaging, we compute the performance for each class, and then average over classes. In micro averaging, we collect the decisions for all classes into a single confusion matrix, and then compute precision and recall from that table.
In a multi-class classification setup, micro-average is preferable if you suspect there might be class imbalance (i.e you may have many more examples of one class than of other classes).
micro_marco
micr0_marco_2
Macro-averaged metrics are used when we want to evaluate system performance across different datasets.
Micro-averaged metrics should be used when the sizes of the datasets are variable.
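
Below is a minimal sketch contrasting the two averaging schemes, assuming per-class TP/FP/FN counts have already been computed (the function name and example counts are illustrative).

import numpy as np

def macro_micro_precision_recall(tp, fp, fn):
    # tp, fp, fn: per-class counts, each of shape (num_classes,)
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    # macro: compute the metric per class, then average over classes
    macro_p = np.mean(tp / (tp + fp))
    macro_r = np.mean(tp / (tp + fn))
    # micro: pool all decisions into one confusion matrix, then compute the metric
    micro_p = tp.sum() / (tp.sum() + fp.sum())
    micro_r = tp.sum() / (tp.sum() + fn.sum())
    return macro_p, macro_r, micro_p, micro_r

# e.g. two classes: a frequent class and a rare class
print(macro_micro_precision_recall(tp=[90, 5], fp=[10, 5], fn=[10, 5]))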

ROC and AUC

roc

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier model (can be used for multi class classification as well) at varying threshold values. The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold setting.

tpr_fpr

AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

BLEU score

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. BLEU’s output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts.

bleu

belu_cal
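
A minimal from-scratch sketch of sentence-level BLEU for a single reference, using modified (clipped) n-gram precision for n = 1..4 and the brevity penalty; the function name and example sentences are illustrative.

import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    # candidate, reference: lists of tokens
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
        # clipped counts: each candidate n-gram is credited at most as often as it appears in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # geometric mean of the n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

print(bleu("the cat sat on the mat".split(), "the cat sat on the mat .".split()))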

Rouge score

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. ROUGE metrics range between 0 and 1, with higher scores indicating higher similarity between the automatically produced summary and the reference.

  • ROUGE-N: Overlap of n-grams between the system and reference summaries.
  • ROUGE-L: Longest Common Subsequence (LCS) based statistics. Longest common subsequence problem takes into account sentence-level structure similarity naturally and identifies longest co-occurring in sequence n-grams automatically.

rouge

rouge_p_r
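
A minimal sketch of ROUGE-N (n-gram overlap) for one candidate/reference pair; this is an illustrative helper in the spirit of the definition above, not the official ROUGE package.

from collections import Counter

def rouge_n(candidate, reference, n=1):
    cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    recall = overlap / max(sum(ref.values()), 1)      # overlap / n-grams in the reference
    precision = overlap / max(sum(cand.values()), 1)  # overlap / n-grams in the candidate
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(rouge_n("the cat sat on the mat".split(), "the cat is on the mat".split(), n=1))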

Perplexity

Perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models. Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. Intuitively, it can be thought of as an evaluation of the model’s ability to predict uniformly among the set of specified tokens in a corpus. A lower perplexity means the model performs better on the given corpus.
log_ppl
ppl

import torch
import torch.nn.functional as F
num_classes = 10
batch_size = 1

# your model outputs / logits
output = torch.rand(batch_size, num_classes)
# your targets
target = torch.randint(num_classes, (batch_size,))

loss = F.cross_entropy(output, target)
# calculating perplexity
perplexity = torch.exp(loss)

MRR, MAP, NDCG

  • MRR
    Mean Reciprocal Rank (MRR) is one of the metrics that help evaluate the quality of recommendation and information retrieval systems. It measures how good a retrieval system is at returning a relevant result as the top result. To calculate MRR, multiple queries are issued; if every query returns a relevant document at the top, the score is 1.0.
    mrr

  • MAP
    The disadvantage of MRR is that only the top document is considered. Other use cases like comparative questions in Question Answering scenarios require multiple relevant documents. Mean Average Precision (MAP) addresses this shortcoming by considering the order of the returned items. To calculate this score, three steps are taken:

    1. Calculate precision
    2. Calculate average precision for K
    3. Calculate mean average precision for sample of queries
      As for MRR, multiple queries are run. Instead of calculating the Recall, Precision is computed by using relevant and non-relevant documents from the returned lists (not the complete dataset).
      map

Average Precision (AP) is a widely used metric in information retrieval that quantifies the quality of a retrieval system’s ranked results for a single query. Unlike precision, which considers the entire ranked list of retrieved items, AP focuses on how well relevant items are ranked within that list. It measures the area under the precision-recall curve for a single query.
AP is calculated as follows:

  1. First, consider a query that retrieves a set of items.
  2. Compute the precision for each position where a relevant item is retrieved in the ranked results list.
  3. Average these precision values, but only for the positions where relevant items were retrieved.

ap
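
Below is a minimal sketch of MRR and Average Precision, assuming each query's results are given as a binary relevance list ordered by the system's ranking (1 = relevant, 0 = not relevant); the function names are illustrative.

import numpy as np

def mean_reciprocal_rank(ranked_relevance_per_query):
    # reciprocal rank = 1 / position of the first relevant item (0 if none is relevant)
    rrs = []
    for rel in ranked_relevance_per_query:
        hits = np.flatnonzero(np.asarray(rel))
        rrs.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(rrs))

def average_precision(rel):
    # precision@k is accumulated only at ranks where a relevant item appears
    rel = np.asarray(rel)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float(np.sum(precisions * rel) / rel.sum())

print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1/1) / 2 = 0.75
print(average_precision([1, 0, 1, 0]))               # (1/1 + 2/3) / 2 ≈ 0.833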

  • CG, DCG, NDCG
    Discounted cumulative gain (DCG) is a measure of ranking quality which is often presented in its normalized and more easily interpretable form, known as Normalized DCG (nDCG or NDCG). In information retrieval, NDCG is often used to measure effectiveness of search engine algorithms and related applications.

ndcg

nDCG implementation:

import numpy as np

def ndcg(golden, current, n=-1):
    log2_table = np.log2(np.arange(2, 102))

    def dcg_at_n(rel, n):
        rel = np.asfarray(rel)[:n]
        dcg = np.sum(np.divide(np.power(2, rel) - 1, log2_table[:rel.shape[0]]))
        return dcg

    ndcgs = []
    for i in range(len(current)):
        k = len(current[i]) if n == -1 else n
        idcg = dcg_at_n(sorted(golden[i], reverse=True), n=k)
        dcg = dcg_at_n(current[i], n=k)
        tmp_ndcg = 0 if idcg == 0 else dcg / idcg
        ndcgs.append(tmp_ndcg)
    return 0. if len(ndcgs) == 0 else sum(ndcgs) / (len(ndcgs))

Cross Validation

K-fold cross validation is a resampling technique used to evaluate the performance of machine learning models. It works by splitting the dataset into k folds, and then training the model on k-1 folds and testing it on the remaining fold. This process is repeated k times, with each fold being used as the test set once. The results are typically averaged to obtain a more robust estimate of your model’s performance.

# k-fold cross validation implementation
import numpy as np

# split data into k (train, test) index folds
def kfold_indices(data, k):
    fold_size = len(data) // k
    indices = np.arange(len(data))
    folds = []
    for i in range(k):
        test_indices = indices[i * fold_size : (i + 1) * fold_size]
        train_indices = np.concatenate([indices[: i * fold_size], indices[(i + 1) * fold_size:]])
        folds.append((train_indices, test_indices))
    return folds


# perform k-fold validation
# (assumes X, y, k, `model`, and `accuracy_score` are already defined,
# e.g. a classifier and accuracy_score from scikit-learn)
fold_indices = kfold_indices(X, k)
scores = []
# Iterate through each fold
for train_indices, test_indices in fold_indices:
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    # Train the model on the training data
    model.fit(X_train, y_train)
    # Make predictions on the test data
    y_pred = model.predict(X_test)
    # Calculate the accuracy score for this fold
    fold_score = accuracy_score(y_test, y_pred)
    # Append the fold score to the list of scores
    scores.append(fold_score)

# Calculate the mean accuracy across all folds
mean_accuracy = np.mean(scores)

Deep Learning algorithms

Batch Normalization

Why normalization?
The phenomenon by which the distributions of internal nodes (neurons) of a neural network change during training is referred to as Internal Covariate Shift. We want to avoid it because it makes training slower: the neurons are forced to drastically re-adjust their weights in one direction or another because of drastic changes in the outputs of the previous layers.
norm

batch_norm

  • PyTorch implementation

    import torch
    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        N, D = x.shape  # N is the batch size, D is the hidden size of each sample in the batch
        sample_mean = x.mean(axis=0)  # shape [D]
        sample_var = x.var(axis=0)    # shape [D]
        std = np.sqrt(sample_var + eps)
        x_centered = x - sample_mean
        x_norm = x_centered / std
        out = gamma * x_norm + beta
        return out

    # with moving_mean and moving_var for inference
    def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
        '''
        # X: (N, D) N is the batch size, D is the hidden size of each sample in the batch
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        gamma = nn.Parameter(torch.ones(D,))
        beta = nn.Parameter(torch.zeros(D,))
        # The variables that are not model parameters are initialized to 0 and 1
        moving_mean = torch.zeros(shape)
        moving_var = torch.ones(shape)
        '''
        # test stage
        if not torch.is_grad_enabled():
            X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
        else:  # training
            mean = X.mean(dim=0)
            var = ((X - mean)**2).mean(dim=0)  # var = X.var(dim=0)
            X_hat = (X - mean) / torch.sqrt(var + eps)
            # update the mean and variance using a moving average
            moving_mean = (1.0 - momentum) * moving_mean + momentum * mean
            moving_var = (1.0 - momentum) * moving_var + momentum * var
        Y = gamma * X_hat + beta  # scale and shift
        return Y, moving_mean.data, moving_var.data
  • NumPy implementation

    import numpy as np
    def batchnorm_forward(x, scales, bias, eps=1e-5):
        mean = x.mean(axis=0)  # shape = (w, h, c)
        var = x.var(axis=0)
        std = np.sqrt(var + eps)  # shape = (w, h, c)
        # eps is used to avoid divisions by zero
        # Compute the normalized input
        x_norm = (x - mean) / std  # shape (batch, w, h, c)
        # Output = scale * x_norm + bias
        if scales is not None:
            x_norm *= scales  # multiply by the scales
        if bias is not None:
            x_norm += bias  # add the bias
        return x_norm

Layer Normalization

layer_norm
Explanation:

A well-known explanation of the success of LayerNorm is its re-centering and re-scaling invariance property. The former enables the model to be insensitive to shift noises on both inputs and weights, and the latter keeps the output representations intact when both inputs and weights are randomly scaled.

  • Numpy implementation

    import numpy as np
    def layer_norm(x, g, b, eps: float = 1e-5):
        mean = np.mean(x, axis=-1, keepdims=True)
        variance = np.var(x, axis=-1, keepdims=True)
        x = (x - mean) / np.sqrt(variance + eps)  # normalize x to have mean=0 and var=1 over the last axis
        return g * x + b  # scale and offset with gamma/beta params
  • Pytorch implementation

    import torch
    import torch.nn as nn

    # feature shape: (batch_size, seq_len, hidden_dim)
    class LayerNorm(nn.Module):
        def __init__(self, features, eps=1e-5):
            super(LayerNorm, self).__init__()
            self.a_2 = nn.Parameter(torch.ones(features))
            self.b_2 = nn.Parameter(torch.zeros(features))
            self.eps = eps

        def forward(self, x):
            mean = x.mean(-1, keepdim=True)  # (batch_size, seq_len, 1)
            std = x.std(-1, keepdim=True)    # (batch_size, seq_len, 1)
            return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

Dropout

dropout

import numpy as np

class Dropout():
    def __init__(self, drop_prob):
        self.drop_prob = drop_prob

    def forward(self, inputs):
        # keep a unit with probability (1 - drop_prob)
        self.dropout = (np.random.rand(*inputs.shape) > self.drop_prob)
        # inverted dropout: rescale the kept units so the expected activation matches at inference time
        return np.where(self.dropout, inputs, 0) / (1 - self.drop_prob)

    def backward(self, grads):
        return np.where(self.dropout, grads, 0) / (1 - self.drop_prob)

Optimizers

SGD (Stochastic Gradient Descent)

Advantages:

  1. Frequent updates of model parameters, so it converges in less time
  2. Requires less memory, as there is no need to store loss values
  3. May find new minima

Disadvantages:

  1. High variance in model parameters
  2. May overshoot even after reaching the global minimum
  3. To get the same convergence as gradient descent, the learning rate needs to be reduced slowly

Mini-Batch Gradient Descent (BGD)

Advantages:

  1. Frequently updates the model parameters and also has less variance
  2. Requires a medium amount of memory

All types of Gradient Descent have some challenges:

  1. May get trapped at local minima
  2. Choosing an optimal value of the learning rate: if the learning rate is too small, gradient descent may take ages to converge
  3. A constant learning rate is used for all parameters, but there may be some parameters we do not want to change at the same rate

A minimal parameter-update sketch for (mini-batch) SGD is given below.
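
The sketch below shows the plain (mini-batch) SGD update in the same style as the optimizer snippets that follow; it assumes `params` is a list of tensors whose `.grad` fields have already been populated by a backward pass.

import torch

def sgd(params, hyperparams):
    # one SGD step: p <- p - lr * grad, then clear the gradient
    for p in params:
        with torch.no_grad():
            p[:] -= hyperparams['lr'] * p.grad
            p.grad.data.zero_()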

Momentum

Used in conjunction with Stochastic Gradient Descent or Mini-Batch Gradient Descent, Momentum takes past gradients into account to smooth out the update.
This is seen in the variable 𝑣, which is an exponentially weighted average of the gradients of previous steps. This results in smaller oscillations and faster convergence.
momentum
moment
Compared with (minibatch) stochastic gradient descent the momentum method needs to maintain a set of auxiliary variables, i.e., velocity. It has the same shape as the gradients (and variables of the optimization problem). In the implementation below we call these variables states.

import torch

def init_momentum_states(feature_dim):
    v_w = torch.zeros((feature_dim, 1))
    v_b = torch.zeros(1)
    return (v_w, v_b)

def sgd_momentum(params, states, hyperparams):
    for p, v in zip(params, states):
        with torch.no_grad():
            v[:] = hyperparams['momentum'] * v + p.grad
            p[:] -= hyperparams['lr'] * v
            p.grad.data.zero_()

Adagrad (Adaptative Gradient)

The Adagrad optimizer changes the learning rate for each parameter at every time step. In particular, it tends to assign higher learning rates to infrequent features, which ensures that the parameter updates rely less on frequency and more on relevance.
adagrad
Advantages:

  1. The learning rate changes for each training parameter
  2. No need to manually tune the learning rate
  3. Able to train on sparse data

Disadvantages:

  1. Computationally expensive, as it needs to accumulate the squared gradients for every parameter
  2. The learning rate is always decreasing, which results in slow training
def init_adagrad_states(feature_dim):
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            s[:] += torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
            p.grad.data.zero_()

Adadelta

Instead of accumulating all previously squared gradients, Adadelta limits the window of accumulated past gradients to some fixed size w.
adadelta

def init_adadelta_states(feature_dim):
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    delta_w, delta_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        with torch.no_grad():
            # In-place updates via [:]
            s[:] = rho * s + (1 - rho) * torch.square(p.grad)
            g = (torch.sqrt(delta + eps) / torch.sqrt(s + eps)) * p.grad
            p[:] -= g
            delta[:] = rho * delta + (1 - rho) * g * g
            p.grad.data.zero_()

RMSProp (Root Mean Square Propagation)

RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale coefficients. RMSProp shares with momentum the leaky averaging. However, RMSProp uses the technique to adjust the coefficient-wise preconditioner. The learning rate needs to be scheduled by the experimenter in practice.
RMSProp

def init_rmsprop_states(feature_dim):
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            s[:] = gamma * s + (1 - gamma) * torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
            p.grad.data.zero_()

Adam (Adaptive Moment Estimation)

Adam can be looked at as a combination of RMSprop and Stochastic Gradient Descent with momentum. Adam computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
adam

def init_adam_states(feature_dim):
    v_w, v_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr) + eps)
            p.grad.data.zero_()
    hyperparams['t'] += 1

Note: AdamW (Adam with decoupled weight decay) applies the weight decay directly to the parameters instead of adding an L2 penalty to the gradients.
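
A hedged sketch of how AdamW would look in the same style as the Adam snippet above; the `weight_decay` hyperparameter name is illustrative, and the decay is decoupled from the gradient-based step.

import torch

def adamw(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    wd = hyperparams.get('weight_decay', 0.01)
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            # decoupled weight decay: shrink the weights directly, not via the gradient
            p[:] -= hyperparams['lr'] * wd * p
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr) + eps)
            p.grad.data.zero_()
    hyperparams['t'] += 1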

CV algorithms

CNN

Components of CNN: Convolution, Pooling, Flatten, Fully connected layer
Basic PyTorch CNN implementation:

import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.layer = nn.Sequential(  # input: (N, 1, 32, 32) N: batch_size, 1: channels, 32x32: height x width
            nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),  # 1: in_channels, 6: out_channels, Convolution -> (N, 6, 28, 28)
            nn.BatchNorm2d(6),  # batch normalization (N, 6, 28, 28)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)  # Max Pooling -> (N, 6, 14, 14)
        )
        self.fc = nn.Linear(14 * 14 * 6, num_classes)

    def forward(self, x):
        out = self.layer(x)
        out = out.reshape(out.size(0), -1)  # Flatten -> (N, 6, 14, 14) => (N, 14*14*6)
        out = self.fc(out)
        return out

PyTorch implementation for Convolution and Pooling:

import numpy as np
import torch

def corr2d(image, kernel, padding_size, stride):  # cross-correlation operation
    """Compute 2D cross-correlation (image and kernel as NumPy arrays)."""
    h, w = kernel.shape
    image = np.pad(image, padding_size)
    Y = torch.zeros(((image.shape[0] - h) // stride + 1, (image.shape[1] - w) // stride + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # top-left corner of the current window
            start_pos_h, start_pos_w = i * stride, j * stride
            Y[i, j] = (image[start_pos_h : start_pos_h + h, start_pos_w : start_pos_w + w] * kernel).sum()
    return Y

def pool2d(X, pool_size, mode="max"):
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == "max":
                Y[i, j] = X[i:i+p_h, j:j+p_w].max()
            elif mode == "avg":
                Y[i, j] = X[i:i+p_h, j:j+p_w].mean()
    return Y

NumPy implementation of Convolution and Pooling

import numpy as np

def conv2d(image, num_filters, stride, kernel_size, padding=0):
    # init filters
    filters = np.random.randn(num_filters, kernel_size, kernel_size) * 0.1
    # image size (num_channels, 28, 28)
    in_dim = image.shape[1]  # 28
    out_dim = (in_dim - kernel_size) // stride + 1  # without padding
    out = np.zeros((num_filters, out_dim, out_dim))  # the convolution output
    for f in range(num_filters):
        temp_y = out_y = 0
        while temp_y + kernel_size <= in_dim:
            temp_x = out_x = 0
            while temp_x + kernel_size <= in_dim:
                patch = image[:, temp_y:temp_y+kernel_size, temp_x:temp_x+kernel_size]
                # filters[f] is broadcast across the channel dimension of the patch
                out[f, out_y, out_x] = np.sum(filters[f] * patch)
                temp_x += stride
                out_x += 1
            temp_y += stride
            out_y += 1
    return out

def max_pooling(image, stride, pooling_size):
    num_channels, height, width = image.shape
    out_h = (height - pooling_size) // stride + 1
    out_w = (width - pooling_size) // stride + 1
    out = np.zeros((num_channels, out_h, out_w))
    for i in range(num_channels):
        temp_y = out_y = 0
        while temp_y + pooling_size <= height:
            temp_x = out_x = 0
            while temp_x + pooling_size <= width:
                patch = image[i, temp_y:temp_y+pooling_size, temp_x:temp_x+pooling_size]
                out[i, out_y, out_x] = np.max(patch)
                temp_x += stride
                out_x += 1
            temp_y += stride
            out_y += 1
    return out

Patch Embedding

import torch
import torch.nn as nn
from torch import Tensor
from einops import rearrange

# 2D image to patch embedding
class PatchEmbed(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_chans=3, embed_dim=768, flatten=True):
        super().__init__()
        image_size = (image_size, image_size)
        patch_size = (patch_size, patch_size)
        self.image_size = image_size
        self.patch_size = patch_size
        self.grid_size = (image_size[0] // patch_size[0], image_size[1] // patch_size[1])
        self.num_patches = self.grid_size[0] * self.grid_size[1]
        self.flatten = flatten
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # Output shape: (batch size, no. of patches, hidden_dim)
        B, C, H, W = x.shape       # (bs, 3, 224, 224)
        x = self.proj(x)           # (bs, 768, 14, 14)
        if self.flatten:
            x = x.flatten(2)       # (bs, 768, 196)
            x = x.transpose(1, 2)  # (bs, 196, 768)
        return x

# fuyu patch
def patch_img(x: Tensor, patches: int):
    return rearrange(
        x, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=patches, p2=patches
    )

def FuyuPatch(image, patch_size):
    # image: (channel, H, W)
    patches = []
    channel, height, width = image.size()
    for i in range(0, height, patch_size):
        for j in range(0, width, patch_size):
            patch = image[:, i:i+patch_size, j:j+patch_size]  # (channel, patch_size, patch_size)
            patches.append(patch)
    assert len(patches) == (width // patch_size) * (height // patch_size)
    # patches: (num_patches, channel, patch_size, patch_size)
    patches = torch.stack(patches, dim=0)
    patches = patches.flatten(1)  # (num_patches, channel * patch_size * patch_size)
    return patches

NLP algorithms

Basic Preprocessing

Here are the steps.

import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# remove URLs
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

# remove html tags
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

# tokenize: splitting the sentence into words
sentence = "Books are on the table"
words = word_tokenize(sentence)

# lower case each token
words = [word.lower() for word in words]

# remove whitespace at head and tail
words = [word.strip() for word in words]

# remove punctuation
puncts = string.punctuation
words = [word for word in words if word not in puncts]

# remove stop words
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# remove emoji
def remove_emoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# stemming or lemmatization
# A more thorough pipeline first lemmatizes and then stems, but stemming alone is fine
# for many problem statements; here we only stem.
stemmer = PorterStemmer()
words = [stemmer.stem(word) for word in words]

Cosine similarity

NumPy Implementation

import numpy as np
# a, b: 1-D vectors
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Python implementation

def dot_product(A, B):
    return sum([a*b for a, b in zip(A, B)])

def cosine_similarity(A, B):
    return dot_product(A, B) / (dot_product(A, A)**0.5 * dot_product(B, B)**0.5)

TF-IDF

Terminology:
TF - Term Frequency, DF - Document Frequency, IDF - Inverse Document Frequency.
t — term (word), d — document (set of words), N — count of corpus, corpus — the total document set.
Formula:
TF(t,d) = count of t in d / number of words in d
DF(t) = the number of documents containing the term t
IDF(t) = N / DF(t)
To prevent the IDF value from exploding, we take the log of IDF ==> IDF(t) = log( N / (DF(t) + 1) )
TF-IDF(t, d) = TF(t,d) * IDF(t)

import numpy as np
from nltk.tokenize import word_tokenize

text = ['Topic sentences are similar to mini thesis statements.\
Like a thesis statement, a topic sentence has a specific \
main point. Whereas the thesis is the main point of the essay',\
'the topic sentence is the main point of the paragraph.\
Like the thesis statement, a topic sentence has a unifying function. \
But a thesis statement or topic sentence alone doesn’t guarantee unity.', \
'An essay is unified if all the paragraphs relate to the thesis,\
whereas a paragraph is unified if all the sentences relate to the topic sentence.']

# preprocessing the text data
sentences = []
word_list = []
for sent in text:
    x = [w.lower() for w in word_tokenize(sent) if w.isalpha()]
    sentences.append(x)
    for word in x:
        if word not in word_list:
            word_list.append(word)

# total number of docs in the corpus
total_doc = len(text)

# calculate term frequency for all documents
term_freq_docs = []  # shape: (num_docs, num_unique_words)
for sent in sentences:
    term_freq = [0] * len(word_list)
    for word in sent:
        idx = word_list.index(word)
        term_freq[idx] += 1
    term_freq = [x / len(sent) for x in term_freq]
    term_freq_docs.append(term_freq)

# calculate inverse document frequency
inverse_doc_freq = [0] * len(word_list)  # shape: num_unique_words
for w in word_list:
    for sent in sentences:
        if w in sent:
            idx = word_list.index(w)
            inverse_doc_freq[idx] += 1
inverse_doc_freq = [np.log(total_doc / (x + 1)) for x in inverse_doc_freq]

# calculate tf-idf
tf_idf_all = []  # shape: (num_docs, num_unique_words)
for term_freq in term_freq_docs:
    tf_idf = []
    for idx, tf_value in enumerate(term_freq):
        tf_idf.append(tf_value * inverse_doc_freq[idx])
    tf_idf_all.append(tf_idf)

Sometimes we use TF-IDF for query-document retrieval tasks; in this case, we calculate TF and IDF for each word in the query with respect to the different documents. Here is the example: blog. A small scoring sketch based on the code above is shown below.
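
A small follow-up sketch (reusing word_list and tf_idf_all computed above) that scores each document for a query by summing the TF-IDF weights of the query terms; the function name is illustrative, and real retrieval systems typically use more refined weighting.

def score_documents(query, word_list, tf_idf_all):
    query_terms = [w.lower() for w in query.split()]
    scores = []
    for doc_tf_idf in tf_idf_all:
        # sum the document's TF-IDF weight for every query term it contains
        score = sum(doc_tf_idf[word_list.index(w)] for w in query_terms if w in word_list)
        scores.append(score)
    return scores

print(score_documents("topic sentence", word_list, tf_idf_all))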

FastText

fastText is usually used for classification tasks; it has two main improvements:

  • uses char-level n-grams as additional word embedding features
  • uses hierarchical softmax to reduce the computation cost
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FastText(nn.Module):
        def __init__(self, config):
            super(FastText, self).__init__()
            self.embedding_word = nn.Embedding(
                config.vocab_size,
                config.embed_size,
                padding_idx=config.vocab_size - 1
            )
            self.embedding_bigram = nn.Embedding(
                config.ngram_vocab_size,
                config.embed_size,
                padding_idx=config.ngram_vocab_size - 1
            )
            self.embedding_trigram = nn.Embedding(
                config.ngram_vocab_size,
                config.embed_size,
                padding_idx=config.ngram_vocab_size - 1
            )
            self.dropout = nn.Dropout(config.dropout_rate)
            self.fc1 = nn.Linear(config.embed_size * 3, config.hidden_size)
            self.fc2 = nn.Linear(config.hidden_size, config.num_classes)

        def forward(self, x):
            '''
            :param x: x[0] (words), x[1] (bigram), x[2] (trigram): all have shape [batch_size, seq_length]
            :return: [batch_size, num_classes] classification task
            '''
            embed_bow = self.embedding_word(x[0])
            embed_bigram = self.embedding_bigram(x[1])
            embed_trigram = self.embedding_trigram(x[2])
            input = torch.cat((embed_bow, embed_bigram, embed_trigram), -1)
            out = input.mean(dim=1)
            out = self.dropout(out)
            out = self.fc1(out)
            out = F.relu(out)
            out = self.fc2(out)
            return out

RNN implementation

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.num_inputs = num_inputs  # dimension of each step's input
        self.num_hiddens = num_hiddens
        self.W_xh = nn.Parameter(torch.randn(num_inputs, num_hiddens) * sigma)
        self.W_hh = nn.Parameter(torch.randn(num_hiddens, num_hiddens) * sigma)
        self.b_h = nn.Parameter(torch.randn(num_hiddens))

    def forward(self, inputs, state=None):
        # shape of inputs: (num_steps, batch_size, num_inputs)
        # in NLP, num_steps equals seq_len and num_inputs equals the token embedding size
        num_steps, batch_size, _ = inputs.shape
        if state is None:
            state = torch.zeros((batch_size, self.num_hiddens))  # the state of the previous step
        outputs = []  # shape: (num_steps, batch_size, num_hiddens)
        for x_step in inputs:
            state = torch.tanh(torch.matmul(x_step, self.W_xh) + torch.matmul(state, self.W_hh) + self.b_h)  # (batch_size, num_hiddens)
            outputs.append(state)
        return outputs, state

Absolute position embedding

absolute

import math
import torch
import torch.nn as nn

class PositionEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super().__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length).unsqueeze(1)  # (max_seq_length, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # even dimensions use sin, odd dimensions use cos
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_seq_length, d_model)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]

Greedy search decoding

import numpy as np

# greedy decoder
def greedy_decoder(data):
    # index of the largest probability in each row
    return [np.argmax(s) for s in data]

# define a sequence of 3 words over a vocab of 5 words
data = [[0.1, 0.2, 0.3, 0.4, 0.5],
        [0.5, 0.4, 0.3, 0.2, 0.1],
        [0.1, 0.2, 0.3, 0.4, 0.5]]
# decode the sequence
result = greedy_decoder(data)

Beam search decoding

Beam Search details: here
beam_search

import numpy as np

# beam search
def beam_search_decoder(data, k):
    # [[seq1, score1], [seq2, score2], ...]: k sequence candidates with their scores.
    # seq: a list of token indices. score: accumulated negative log probability (lower is better).
    sequences = [[list(), 0.0]]
    # walk through each step of data
    for row in data:
        all_candidates = []
        # expand each current candidate
        for i in range(len(sequences)):
            seq, score = sequences[i]
            for j in range(len(row)):
                candidate = [seq + [j], score - np.log(row[j])]  # [new_seq, new_score]
                all_candidates.append(candidate)
        # order all candidates by score (ascending, since lower negative log prob is better)
        ordered = sorted(all_candidates, key=lambda x: x[1])
        # select top k
        sequences = ordered[:k]
    return sequences

# PyTorch implementation
import torch
def beam_search_decoder(pred, k):
    """Beam Search Decoder
    Parameters:
        pred (Tensor) – the prediction of the network.
        k (int) – beam size of the decoder.
    Outputs:
        indices (Tensor) – a beam of index sequences.
        log_prob (Tensor) – a beam of log likelihoods of the sequences.
    Shape:
        pred: (batch_size, seq_length, vocab_size).
        indices: (batch_size, beam_size, seq_length).
        log_prob: (batch_size, beam_size).

    Examples:
        >>> pred = torch.softmax(torch.randn([32, 20, 1000]), -1)
        >>> indices, log_prob = beam_search_decoder(pred, 3)
    """
    batch_size, seq_length, vocab_size = pred.shape
    log_pred = pred.log()
    log_prob, indices = log_pred[:, 0, :].topk(k, sorted=True)  # first token: top-k candidates
    indices = indices.unsqueeze(-1)
    for i in range(1, seq_length):
        log_prob = log_prob.unsqueeze(-1) + log_pred[:, i, :].unsqueeze(1).repeat(1, k, 1)
        log_prob, index = log_prob.view(batch_size, -1).topk(k, sorted=True)
        beam_id = torch.div(index, vocab_size, rounding_mode='floor')  # which beam each candidate extends
        token_id = index % vocab_size                                  # which token is appended
        # reorder the kept beams to match the new ranking, then append the new token
        indices = torch.gather(indices, 1, beam_id.unsqueeze(-1).expand(-1, -1, indices.size(-1)))
        indices = torch.cat([indices, token_id.unsqueeze(-1)], dim=-1)
    return indices, log_prob

LLM algorithms

LoRA (Low Rank Adaption) Implementation

lora

import math
import torch
import torch.nn as nn

class LinearLoRA(nn.Module):
    '''
    A low-rank adapted linear layer.
    Args:
        in_dim: An integer representing the input dimension of the linear layer
        out_dim: An integer representing the output dimension of the linear layer
        r: An integer representing the rank of the low-rank approximated matrices
        lora_alpha: An integer representing the numerator of the scaling constant alpha / r
        lora_dropout: A float between 0 and 1 representing the dropout probability
    '''
    def __init__(self, in_dim: int, out_dim: int, r: int = 8, lora_alpha: int = 16, lora_dropout: float = 0.0):
        super().__init__()
        self.r = r  # must be >= 1
        self.lora_alpha = lora_alpha
        self.lora_dropout = nn.Dropout(lora_dropout)
        # recreate the linear layer and freeze it (the actual weight values will be copied in outside of this class)
        self.pretrained = nn.Linear(in_dim, out_dim, bias=True)
        self.pretrained.weight.requires_grad = False
        # create the low-rank A matrix and initialize with the same method as in Hugging Face PEFT
        self.lora_A = nn.Linear(in_dim, r, bias=True)
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))  # or nn.init.normal_(self.lora_A.weight, mean=0, std=1)
        # another way to define the LoRA A matrix
        # self.A = nn.Parameter(torch.randn(in_dim, r) * math.sqrt(r))

        # create the low-rank B matrix and initialize to zero
        self.lora_B = nn.Linear(r, out_dim, bias=True)
        nn.init.constant_(self.lora_B.weight, 0)
        # another way to define the LoRA B matrix
        # self.B = nn.Parameter(torch.zeros(r, out_dim))
        self.scaling = self.lora_alpha / self.r

    def forward(self, x):
        pretrained_out = self.pretrained(x)
        lora_out = self.lora_A(x)
        lora_out = self.lora_B(lora_out)
        # if we used the other way to define A and B:
        # lora_out = x @ self.A @ self.B
        lora_out = self.lora_dropout(lora_out)
        lora_out = lora_out * self.scaling
        return pretrained_out + lora_out

## Another simple way to implement LoRA
class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        out = self.alpha * (x @ self.A @ self.B)
        return out

Autoregressive decoding

Decoder models such as GPT, Llama, BART, and T5 generate tokens one by one during inference. We call this autoregressive decoding, and it is the main bottleneck of inference speed for decoder models.
Here is a sample implementation of this autoregressive decoding process:

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    '''
    Take a conditioning sequence of indices idx (Tensor: batch_size, input_length) and complete the sequence max_new_tokens times,
    feeding the predictions back into the model each time.
    '''
    model.eval()
    for _ in range(max_new_tokens):
        # forward the model to get the logits for the next index in the sequence
        logits, _ = model(idx)  # logits: (batch_size, input_length, vocab_size)
        # get the logits at the final step and scale by the desired temperature
        # While applying temperature can make a distribution less random, in its limit, when setting temperature -> 0, temperature-scaled sampling becomes equal to greedy decoding
        logits = logits[:, -1, :] / temperature
        # optionally crop the logits to only the top k options
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float("-inf")
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1)
        # append the sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

Speculative Decoding

This technique is used at the LLM inference stage to speed up autoregressive decoding.
Main idea: (source)

In speculative sampling, we have two models:

  1. A smaller, faster draft model
  2. A larger, slower target model
    The idea is that the draft model speculates what the output is K steps into the future, while the target model determines how many of those tokens we should accept.

The intuition behind speculative sampling is that certain strings of tokens (common phrases, pronouns, punctuation, etc …) are fairly easy to predict, so a smaller, less powerful, but faster draft model should be able to quickly predict these instead of having our slower target model doing all the work.

The algorithm is following:

  1. The draft model decodes K tokens in the regular autoregressive fashion.
  2. We get the probability outputs of the target and draft model on the new predicted sequence.
  3. We compare the target and draft model probabilities to determine how many of the K tokens we want to keep based on some rejection criteria. If a token is rejected, we resample it using a combination of the two distributions and don’t accept any more tokens.
  4. If all K tokens are accepted, we can sample an additional final token from the target model probability output.

As such, instead of decoding a single token at each iteration, speculative sampling decodes between 1 and K+1 tokens per iteration. If no tokens are accepted, we resample, guaranteeing that at least 1 token is decoded. If all K tokens are accepted, we can also sample a final token from the target model's probability distribution, giving us a total of K+1 tokens decoded.

sepcu_decoding
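
Below is a simplified, illustrative sketch of one speculative-sampling iteration following the accept/reject rule described above. It assumes hypothetical draft_model and target_model callables that map a 1-D tensor of token ids to next-token probability distributions (the target model returning one distribution per position); these names and signatures are assumptions, not a specific library API.

import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix, K):
    # 1. the draft model proposes K tokens autoregressively
    draft_tokens, draft_probs = [], []
    seq = prefix
    for _ in range(K):
        q = draft_model(seq)                      # (vocab_size,) distribution for the next token
        t = torch.multinomial(q, 1)
        draft_tokens.append(t)
        draft_probs.append(q)
        seq = torch.cat([seq, t])
    # 2. the target model scores the prefix plus all K drafted tokens in one pass
    p_all = target_model(seq)                     # (len(seq), vocab_size): per-position next-token distributions
    # 3. accept/reject each drafted token
    accepted = prefix
    for i, t in enumerate(draft_tokens):
        p = p_all[len(prefix) + i - 1]            # target distribution at this position
        q = draft_probs[i]
        if torch.rand(1) < torch.clamp(p[t] / q[t], max=1.0):
            accepted = torch.cat([accepted, t])   # accept the drafted token
        else:
            # resample from the adjusted distribution max(0, p - q), then stop accepting
            residual = torch.clamp(p - q, min=0)
            t_new = torch.multinomial(residual / residual.sum(), 1)
            return torch.cat([accepted, t_new])
    # 4. all K tokens accepted: sample one extra token from the target model
    t_extra = torch.multinomial(p_all[-1], 1)
    return torch.cat([accepted, t_extra])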

Mathematics

Different distributions

dist
