This blog aims to implement important Machine Learning, Deep Learning, Natural Language Processing, and Large Language Model algorithms from scratch, to help readers gain a deeper understanding of how these algorithms work in real code.
- Classic Machine Learning algorithms
- Deep Learning algorithms
- CV algorithms
- NLP algorithms
- LLM algorithms
- Mathematics
- References
Classic Machine Learning algorithms
K-means
Steps:
- Cluster the data into k groups, where k is predefined.
- Select k points at random as the initial cluster centers.
- Assign each object to its closest cluster center according to the Euclidean distance function.
- Calculate the centroid, or mean, of all objects in each cluster.
- Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
Implementation:
```python
import numpy as np

def cal_dist(vector1, vector2):
    # Euclidean distance
    return np.sqrt(np.sum((vector1 - vector2) ** 2))

def k_means(data, k):
    n_samples, n_features = data.shape
    # randomly initialize centroids (without replacement so they are distinct)
    centroid_index = np.random.choice(range(n_samples), k, replace=False)
    centroid_coords = data[centroid_index].astype(float)  # float so centroid updates are not truncated
    # label of each data point
    labels = [-1] * n_samples
    # iterate until the centroids no longer change
    while True:
        # assign each point to its closest centroid
        for i in range(n_samples):
            min_dist = float("inf")
            min_cluster_idx = -1
            for j in range(k):
                dist = cal_dist(data[i], centroid_coords[j])
                if dist < min_dist:
                    min_dist = dist
                    min_cluster_idx = j
            # update the cluster index for this point
            labels[i] = min_cluster_idx
        # flag indicating whether any centroid changed: stop condition for the while loop
        centroid_updated = False
        # recalculate the centroids
        for i, centroid in enumerate(centroid_coords):
            # gather all data points assigned to this cluster
            cluster_points = [data[m] for m in range(n_samples) if labels[m] == i]
            if not cluster_points:
                continue  # keep the old centroid if the cluster is empty
            cluster_points = np.array(cluster_points)
            # calculate the new centroid
            new_centroid_coord = cluster_points.mean(axis=0)
            # check whether this centroid moved
            if cal_dist(centroid, new_centroid_coord) > 1e-5:
                centroid_coords[i] = new_centroid_coord
                centroid_updated = True
        if not centroid_updated:
            break
    return labels
```
KNN implementation
1 | class kNN(): |
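For reference, a minimal NumPy sketch of a kNN classifier could look like the following; the `fit`/`predict` interface, the default `k`, and the variable names are illustrative assumptions rather than the original snippet.

```python
import numpy as np
from collections import Counter

class kNN():
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # kNN is a lazy learner: just memorize the training data
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)

    def predict(self, X):
        preds = []
        for x in np.asarray(X):
            # Euclidean distance from x to every training point
            dists = np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))
            # indices of the k nearest neighbors
            nearest = np.argsort(dists)[:self.k]
            # majority vote among the neighbors' labels
            label = Counter(self.y_train[nearest]).most_common(1)[0][0]
            preds.append(label)
        return np.array(preds)
```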
Logistic regression
NumPy implementation:
1 | # https://blog.51cto.com/u_15661962/5531479 |
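As a rough sketch of the idea (not necessarily the linked blog's exact code), logistic regression can be trained with plain gradient descent on the binary cross-entropy loss; the function names and hyperparameters below are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    # X: (n_samples, n_features), y: (n_samples,) with values in {0, 1}
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        y_pred = sigmoid(X @ w + b)                  # predicted probabilities
        grad_w = X.T @ (y_pred - y) / n_samples      # gradient of BCE w.r.t. the weights
        grad_b = np.mean(y_pred - y)                 # gradient w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b, threshold=0.5):
    return (sigmoid(X @ w + b) >= threshold).astype(int)
```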
Linear Regression
NumPy implementation for multiple linear regression (two or more features per data point) and simple linear regression (a single feature per data point).
reference:
1 | import numpy as np |
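For reference, here is a minimal sketch covering both a closed-form normal-equation solver and a gradient-descent variant; function names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def linear_regression_normal_eq(X, y):
    # X: (n_samples, n_features); closed-form least squares with a bias column appended
    X_b = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
    return w[:-1], w[-1]  # weights, bias

def linear_regression_gd(X, y, lr=0.01, n_iters=1000):
    # gradient descent on the mean squared error
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(n_iters):
        y_pred = X @ w + b
        w -= lr * (2 / n_samples) * (X.T @ (y_pred - y))
        b -= lr * (2 / n_samples) * np.sum(y_pred - y)
    return w, b
```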
Activation Functions
sigmoid (value from 0 to 1)

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```

ReLU (from 0 to infinity)

```python
def relu(x):
    # scalar version
    if x < 0:
        return 0
    else:
        return x
```

Leaky ReLU

```python
def leaky_relu(x, alpha=0.01):
    """Compute the Leaky ReLU activation."""
    return max(alpha * x, x)
```

tanh (from -1 to 1)

```python
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
```

Softmax (from 0 to 1)

```python
import numpy as np

def softmax(x):
    # assumes x is a vector
    x = x - np.max(x)  # subtract the max to avoid overflow
    return np.exp(x) / np.sum(np.exp(x))
```

GLU (Gated Linear Units) activation function

```python
import torch.nn as nn

class GLU(nn.Module):
    def __init__(self, in_size):
        super().__init__()
        self.linear1 = nn.Linear(in_size, in_size)
        self.linear2 = nn.Linear(in_size, in_size)

    def forward(self, x):
        # element-wise product of a linear projection and a sigmoid gate
        return self.linear1(x) * self.linear2(x).sigmoid()
```

Sigmoid Linear Unit (SiLU/swish) activation function
1 | import numpy as np |
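A minimal NumPy sketch of SiLU/swish, using the identity x * sigmoid(x) = x / (1 + e^(-x)):

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1 + np.exp(-x))
```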
Loss Functions
Binary Cross Entropy Loss
1 | import numpy as np |
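A minimal NumPy sketch of binary cross-entropy over predicted probabilities; the clipping epsilon is an implementation detail added here to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: 0/1 labels, y_pred: predicted probabilities
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```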
Categorical Cross Entropy Loss
1 | import numpy as np |
KL divergence
1 | import numpy as np |
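A minimal NumPy sketch of the KL divergence between two discrete distributions; the small epsilon is an assumption added to guard against zero probabilities:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions p and q
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))
```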
Evaluation metrics
Recall, Precision, F1, accuracy
1 | ground_truth = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1] |
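For reference, the four metrics can be computed directly from confusion-matrix counts; the function below is a sketch rather than the exact snippet above:

```python
def classification_metrics(ground_truth, predictions):
    # binary labels: 1 = positive, 0 = negative
    tp = sum(1 for g, p in zip(ground_truth, predictions) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(ground_truth, predictions) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(ground_truth, predictions) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(ground_truth, predictions) if g == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(ground_truth)
    return precision, recall, f1, accuracy
```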
Macro & Micro average recall/precision
A macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally), whereas a micro-average will aggregate the contributions of all classes to compute the average metric. In macro averaging, we compute the performance for each class, and then average over classes. In micro averaging, we collect the decisions for all classes into a single confusion matrix, and then compute precision and recall from that table.
In a multi-class classification setup, micro-average is preferable if you suspect there might be class imbalance (i.e you may have many more examples of one class than of other classes).
Macro-averaged metrics are used when we want to evaluate system performance across different datasets.
Micro-averaged metrics should be used when the sizes of the datasets are variable.
ROC and AUC
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier model (it can also be used for multi-class classification) at varying threshold values. The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold setting.
AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
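Since AUC also equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example, a small sketch can compute it directly from that pairwise definition (the function name and interface are my own):

```python
def auc_score(y_true, y_score):
    # AUC = probability that a random positive is scored above a random negative (ties count 0.5)
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    if not pos or not neg:
        return 0.0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```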
BLEU score
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. BLEU’s output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts.
Rouge score
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. ROUGE metrics range between 0 and 1, with higher scores indicating higher similarity between the automatically produced summary and the reference.
- ROUGE-N: Overlap of n-grams between the system and reference summaries.
- ROUGE-L: Longest Common Subsequence (LCS) based statistics. The longest common subsequence problem naturally takes sentence-level structure similarity into account and automatically identifies the longest co-occurring in-sequence n-grams.
Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. Intuitively, it can be thought of as an evaluation of the model's ability to predict uniformly among the set of specified tokens in a corpus. A lower perplexity means the model performs better on the specific corpus.
1 | import torch |
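Independently of the torch snippet above, the definition can be illustrated directly from per-token log-likelihoods; this helper is a hedged sketch, not a library API:

```python
import numpy as np

def perplexity(token_log_probs):
    # token_log_probs: log-probability the model assigned to each actual token in the sequence
    # PPL = exp( - (1/N) * sum_t log p(x_t | x_<t) )
    return float(np.exp(-np.mean(token_log_probs)))

# example: a model that assigns probability 0.25 to every token has perplexity 4
print(perplexity(np.log([0.25, 0.25, 0.25])))  # ~4.0
```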
MRR, MAP, NDCG
MRR
Mean Reciprocal Rank (MRR) is one of the metrics that help evaluate the quality of recommendation and information retrieval systems. MRR measures how good a retrieval system is at returning a relevant result as the top result. To calculate MRR, multiple queries are invoked; if all of these queries return a relevant document at the top, the score will be 1.0 (a small code sketch of MRR and Average Precision appears after the AP steps below).
MAP
The disadvantage of MRR is that only the top document is considered. Other use cases, such as comparative questions in Question Answering scenarios, require multiple relevant documents. Mean Average Precision (MAP) addresses this shortcoming by considering the order of the returned items. To calculate this score, three steps are taken:
- Calculate precision
- Calculate average precision at K
- Calculate mean average precision over a sample of queries
As with MRR, multiple queries are run. Instead of calculating recall, precision is computed using the relevant and non-relevant documents from the returned lists (not the complete dataset).
Average Precision (AP) is a widely used metric in information retrieval that quantifies the quality of a retrieval system’s ranked results for a single query. Unlike precision, which considers the entire ranked list of retrieved items, AP focuses on how well relevant items are ranked within that list. It measures the area under the precision-recall curve for a single query.
AP is calculated as follows:
- First, consider a query that retrieves a set of items.
- Compute the precision for each position where a relevant item is retrieved in the ranked results list.
- Average these precision values, but only for the positions where relevant items were retrieved.
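To make the two definitions concrete, here is a small sketch of MRR and (mean) Average Precision over binary relevance lists; the input format and function names are illustrative assumptions:

```python
def mean_reciprocal_rank(ranked_relevance):
    # ranked_relevance: one list per query, each a 0/1 relevance list ordered by rank
    rr = []
    for rels in ranked_relevance:
        score = 0.0
        for rank, rel in enumerate(rels, start=1):
            if rel:
                score = 1.0 / rank
                break
        rr.append(score)
    return sum(rr) / len(rr)

def average_precision(rels):
    # precision is accumulated only at ranks where a relevant item appears
    hits, precisions = 0, []
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_relevance):
    return sum(average_precision(r) for r in ranked_relevance) / len(ranked_relevance)

# example: two queries with relevance judgments for their top-4 results
queries = [[1, 0, 1, 0], [0, 0, 1, 1]]
print(mean_reciprocal_rank(queries))    # (1/1 + 1/3) / 2
print(mean_average_precision(queries))  # ((1 + 2/3)/2 + (1/3 + 1/2)/2) / 2
```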
CG, DCG, NDCG
Discounted cumulative gain (DCG) is a measure of ranking quality which is often presented in its normalized and more easily interpretable form, known as Normalized DCG (nDCG or NDCG). In information retrieval, NDCG is often used to measure effectiveness of search engine algorithms and related applications.
nDCG implementation:
1 | import numpy as np |
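As a rough sketch of the formula DCG = sum(rel_i / log2(i + 1)) and its normalization by the ideal ordering (the names and the optional cutoff k are my own choices):

```python
import numpy as np

def dcg(relevances, k=None):
    relevances = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, len(relevances) + 1)
    # DCG = sum(rel_i / log2(i + 1))
    return float(np.sum(relevances / np.log2(ranks + 1)))

def ndcg(relevances, k=None):
    # normalize by the DCG of the ideally ordered list
    ideal = sorted(relevances, reverse=True)
    idcg = dcg(ideal, k)
    return dcg(relevances, k) / idcg if idcg > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2]))  # < 1.0, since the list is not ideally ordered
```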
Cross Validation
K-fold cross validation is a resampling technique used to evaluate the performance of machine learning models. It works by splitting the dataset into k folds, and then training the model on k-1 folds and testing it on the remaining fold. This process is repeated k times, with each fold being used as the test set once. The results are typically averaged to obtain a more robust estimate of your model’s performance.
1 | # k-fold cross validation implementation |
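A minimal sketch of the fold-splitting logic described above; the generator interface and the shuffling seed are illustrative assumptions:

```python
import numpy as np

def k_fold_indices(n_samples, k=5, shuffle=True, seed=0):
    # yield (train_idx, test_idx) pairs, one per fold
    indices = np.arange(n_samples)
    if shuffle:
        rng = np.random.default_rng(seed)
        rng.shuffle(indices)
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# usage: average a model's score over the k folds
# scores = [evaluate(X[tr], y[tr], X[te], y[te]) for tr, te in k_fold_indices(len(X), 5)]
```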
Deep Learning algorithms
Batch Normalization
Why normalization?
The phenomenon by which the distributions of internal nodes (neurons) of a neural network change during training is referred to as Internal Covariate Shift. We want to avoid it because it makes training the network slower: the neurons are forced to drastically re-adjust their weights in one direction or another because of drastic changes in the outputs of the previous layers.
PyTorch implementation
```python
import torch
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: NumPy array of shape (N, D); N is the batch size, D is the hidden size of each sample
    N, D = x.shape
    sample_mean = x.mean(axis=0)  # shape [D]
    sample_var = x.var(axis=0)    # shape [D]
    std = np.sqrt(sample_var + eps)
    x_centered = x - sample_mean
    x_norm = x_centered / std
    out = gamma * x_norm + beta
    return out

# with moving_mean and moving_var for inference
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    '''
    # X: (N, D) where N is the batch size and D is the hidden size of each sample in the batch
    # The scale parameter and the shift parameter (model parameters) are
    # initialized to 1 and 0, respectively
    gamma = nn.Parameter(torch.ones(D))
    beta = nn.Parameter(torch.zeros(D))
    # The variables that are not model parameters are initialized to 0 and 1
    moving_mean = torch.zeros(D)
    moving_var = torch.ones(D)
    '''
    # test stage
    if not torch.is_grad_enabled():
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:  # training stage
        mean = X.mean(dim=0)
        var = ((X - mean) ** 2).mean(dim=0)  # equivalent to the biased X.var(dim=0)
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # update the running mean and variance using a moving average
        moving_mean = (1.0 - momentum) * moving_mean + momentum * mean
        moving_var = (1.0 - momentum) * moving_var + momentum * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean.data, moving_var.data
```

NumPy implementation
```python
import numpy as np

def batchnorm_forward(x, scales, bias, eps=1e-5):
    # x has shape (batch, w, h, c)
    mean = x.mean(axis=0)     # shape (w, h, c)
    var = x.var(axis=0)
    std = np.sqrt(var + eps)  # shape (w, h, c); eps avoids division by zero
    # Compute the normalized input
    x_norm = (x - mean) / std  # shape (batch, w, h, c)
    # Output = scale * x_norm + bias
    if scales is not None:
        x_norm *= scales  # multiply by the scale parameters
    if bias is not None:
        x_norm += bias    # add the bias
    return x_norm
```
Layer Normalization
Explanation:
A well-known explanation of the success of LayerNorm is its re-centering and re-scaling invariance property. The former enables the model to be insensitive to shift noises on both inputs and weights, and the latter keeps the output representations intact when both inputs and weights are randomly scaled.
NumPy implementation

```python
import numpy as np

def layer_norm(x, g, b, eps: float = 1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    x = (x - mean) / np.sqrt(variance + eps)  # normalize x to have mean=0 and var=1 over the last axis
    return g * x + b  # scale and offset with the gamma/beta params
```

PyTorch implementation
```python
import torch
import torch.nn as nn

# feature shape: (batch_size, seq_len, hidden_dim)
class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-5):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)  # (batch_size, seq_len, 1)
        std = x.std(-1, keepdim=True)    # (batch_size, seq_len, 1)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
```
Dropout
1 | class Dropout(): |
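For reference, a minimal sketch of inverted dropout as a plain function (rather than the collapsed class above); the `training` flag and the 1/(1-p) scaling convention are standard, but the interface is my own:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    # inverted dropout: zero each activation with probability p and rescale by 1/(1-p)
    # so the expected activation is unchanged; at inference time it is the identity
    if not training or p == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= p).astype(float)
    return x * mask / (1.0 - p)
```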
Optimizers
SGD (Stochastic Gradient Descent)
Advantages:
- Frequent updates of the model parameters, so it converges in less time
- Requires less memory, as there is no need to store values of the loss function
- May find new minima
Disadvantages:
- High variance in the model parameters
- May overshoot even after reaching the global minimum
- To get the same convergence as gradient descent, the learning rate needs to be slowly reduced
Mini-Batch Gradient Descent
Advantages:
- Frequently updates the model parameters and also has less variance
- Requires a medium amount of memory
All types of Gradient Descent have some challenges:
- May get trapped at local minima
- Choosing an optimal value for the learning rate: if the learning rate is too small, gradient descent may take ages to converge
- A constant learning rate is used for all parameters, but there may be some parameters we do not want to change at the same rate
Momentum
Used in conjunction with Stochastic Gradient Descent or Mini-Batch Gradient Descent, Momentum takes past gradients into account to smooth out the updates.
This is seen in the variable v, which is an exponentially weighted average of the gradients of previous steps. This results in reduced oscillations and faster convergence.
Compared with (minibatch) stochastic gradient descent the momentum method needs to maintain a set of auxiliary variables, i.e., velocity. It has the same shape as the gradients (and variables of the optimization problem). In the implementation below we call these variables states.
1 | def init_momentum_states(feature_dim): |
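A minimal sketch following the same init-states pattern as the stub above, assuming the parameters are stored as NumPy arrays (a weight vector plus a bias); the update is v <- momentum * v + grad, then param <- param - lr * v:

```python
import numpy as np

def init_momentum_states(feature_dim):
    # one velocity buffer per parameter tensor (here: a weight vector and a 1-element bias)
    v_w = np.zeros(feature_dim)
    v_b = np.zeros(1)
    return (v_w, v_b)

def sgd_momentum(params, grads, states, lr=0.01, momentum=0.9):
    # params, grads, states: matching sequences of NumPy arrays
    for param, grad, v in zip(params, grads, states):
        v[:] = momentum * v + grad          # accumulate the velocity
        param[:] = param - lr * v           # step in the direction of the velocity
    return params, states
```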
Adagrad (Adaptative Gradient)
The Adagrad optimizer changes the learning rate for each parameter at every time step. In particular, it tends to assign higher learning rates to infrequent features, which ensures that the parameter updates rely less on frequency and more on relevance.
Advantages:
- The learning rate changes for each training parameter
- No need to manually tune the learning rate
- Able to train on sparse data
Disadvantages:
- Computationally more expensive, as it needs to accumulate and apply second-order moments (squared gradients) for every parameter
- The learning rate is always decreasing, which can result in slow training
1 | def init_adagrad_states(feature_dim): |
Adadelta
Instead of accumulating all previously squared gradients, Adadelta limits the window of accumulated past gradients to some fixed size w.
1 | def init_adadelta_states(feature_dim): |
RMSProp (Root Mean Square Propagation)
RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale coefficients. RMSProp shares the leaky averaging with momentum; however, it uses this technique to adjust the coefficient-wise preconditioner. In practice, the learning rate still needs to be scheduled by the experimenter.
1 | def init_rmsprop_states(feature_dim): |
Adam (Adaptive Moment Estimation)
Adam can be looked at as a combination of RMSprop and Stochastic Gradient Descent with momentum. Adam computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
1 | def init_adam_states(feature_dim): |
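A minimal sketch in the same init-states style, assuming NumPy parameter arrays; `t` is the 1-based step index used for bias correction, and the default hyperparameters are the commonly used ones:

```python
import numpy as np

def init_adam_states(feature_dim):
    # first- and second-moment buffers for a weight vector and a 1-element bias
    return [(np.zeros(feature_dim), np.zeros(feature_dim)),
            (np.zeros(1), np.zeros(1))]

def adam(params, grads, states, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    for param, grad, (m, v) in zip(params, grads, states):
        m[:] = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
        v[:] = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)                 # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        param[:] = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, states
```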
Note: AdamW (Adam with decoupled weight decay)
CV algorithms
CNN
Components of CNN: Convolution, Pooling, Flatten, Fully connected layer
Basic PyTorch CNN implementation:
1 | class SimpleCNN(nn.Module): |
PyTorch implementation for Convolution and Pooling:
1 | def corr2d(image, kernel, padding_size, stride): #cross-correlation operation |
NumPy implementation of Convolution and Pooling
1 | def conv2d(image, num_filters, stride, kernel_size, padding=0): |
Patch Embedding
1 | # 2D image to patch embedding |
NLP algorithms
Basic Preprocessing
Here are the steps.
1 | # remove URLs |
Cosine similarity
NumPy Implementation
1 | import numpy as np |
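For reference, a minimal NumPy sketch (the epsilon guard against zero-norm vectors is my own addition):

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```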
Python implementation
1 | def dot_product(A,B): |
TF-IDF
Terminology:
TF - Term Frequency, DF - Document Frequency, IDF - Inverse Document Frequency.
t — term (word), d — document (a set of words), N — number of documents in the corpus, corpus — the total document set.
Formula:
TF(t, d) = (count of t in d) / (number of words in d)
DF(t) = the number of documents containing the term t
IDF(t) = N / DF(t)
To keep the IDF value from exploding, we take the log of IDF ==> IDF(t) = log( N / (DF(t) + 1) )
TF-IDF(t, d) = TF(t, d) * IDF(t)
1 | import numpy as np |
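As an illustrative sketch of the formulas above (not necessarily the collapsed snippet), TF-IDF can be computed per document from token lists; the toy corpus and helper names are made up for the example:

```python
import math
from collections import Counter

def compute_tf_idf(corpus):
    # corpus: list of documents, each a list of tokens
    N = len(corpus)
    df = Counter()                 # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    tf_idf = []
    for doc in corpus:
        counts = Counter(doc)
        scores = {}
        for term, cnt in counts.items():
            tf = cnt / len(doc)
            idf = math.log(N / (df[term] + 1))  # smoothed IDF, as in the formula above
            scores[term] = tf * idf
        tf_idf.append(scores)
    return tf_idf

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "ran"]]
print(compute_tf_idf(docs)[0])
```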
Sometimes we use TF-IDF for query-document retrieval tasks; in this case, we calculate TF and IDF for each word in the query with respect to the different documents. Here is an example: blog
FastText
fastText is usually used for classification tasks. It has two main improvements:
- uses character-level n-grams as additional word embedding features
- uses hierarchical softmax to reduce the computation cost
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastText(nn.Module):
    def __init__(self, config):
        super(FastText, self).__init__()
        self.embedding_word = nn.Embedding(
            config.vocab_size,
            config.embed_size,
            padding_idx=config.vocab_size - 1
        )
        self.embedding_bigram = nn.Embedding(
            config.ngram_vocab_size,
            config.embed_size,
            padding_idx=config.ngram_vocab_size - 1
        )
        self.embedding_trigram = nn.Embedding(
            config.ngram_vocab_size,
            config.embed_size,
            padding_idx=config.ngram_vocab_size - 1
        )
        self.dropout = nn.Dropout(config.dropout_rate)
        self.fc1 = nn.Linear(config.embed_size * 3, config.hidden_size)
        self.fc2 = nn.Linear(config.hidden_size, config.num_classes)

    def forward(self, x):
        '''
        :param x: x[0] (words), x[1] (bigrams), x[2] (trigrams); each has shape [batch_size, seq_length]
        :return: [batch_size, num_classes] logits for the classification task
        '''
        embed_bow = self.embedding_word(x[0])
        embed_bigram = self.embedding_bigram(x[1])
        embed_trigram = self.embedding_trigram(x[2])
        out = torch.cat((embed_bow, embed_bigram, embed_trigram), -1)
        out = out.mean(dim=1)  # average over the sequence dimension
        out = self.dropout(out)
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        return out
```
RNN implementation
1 | class RNN(nn.Module): |
Absolute position embedding
1 | class PositionEncoding(nn.Module): |
Greedy search decoding
1 | # greedy decoder |
Beam search decoding
Beam Search details: here
1 | import numpy as np |
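For reference, a minimal sketch of beam search over a precomputed matrix of per-step token probabilities; scoring by summed negative log-probability and the toy `probs` matrix are illustrative assumptions:

```python
import numpy as np

def beam_search_decoder(probs, beam_width=3):
    # probs: (seq_len, vocab_size) matrix of per-step token probabilities
    # each beam is (token_sequence, negative log-probability score); lower score is better
    beams = [([], 0.0)]
    for step_probs in probs:
        candidates = []
        for seq, score in beams:
            for token, p in enumerate(step_probs):
                candidates.append((seq + [token], score - np.log(p + 1e-12)))
        # keep only the best `beam_width` candidates
        candidates.sort(key=lambda x: x[1])
        beams = candidates[:beam_width]
    return beams

probs = np.array([[0.1, 0.5, 0.4],
                  [0.3, 0.4, 0.3],
                  [0.7, 0.2, 0.1]])
for seq, score in beam_search_decoder(probs, beam_width=2):
    print(seq, score)
```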
LLM algorithms
LoRA (Low Rank Adaption) Implementation
1 | import torch.nn as nn |
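As a hedged sketch of the core idea (freeze the pretrained weight W and learn a low-rank update scaled by alpha/r), here is one possible wrapper around `nn.Linear`; the initialization scheme and class name are my own choices:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # down-projection
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))        # up-projection, zero-initialized
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(64, 64), r=4)
out = layer(torch.randn(2, 64))  # only lora_A and lora_B receive gradients
```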
Autoregressive decoding
Decoder models generate tokens one by one during inference, e.g. GPT, Llama, BART, T5, etc. We call this autoregressive decoding, and it is the main bottleneck of inference speed for decoder models.
Here is a sample implementation of this autoregressive decoding process:
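A minimal sketch of a greedy autoregressive loop, assuming a hypothetical `model` callable that maps token ids to next-token logits; real decoder LLMs additionally cache key/value states so earlier positions are not recomputed:

```python
import torch

@torch.no_grad()
def greedy_autoregressive_decode(model, input_ids, max_new_tokens=32, eos_token_id=None):
    # model: any callable mapping a (1, seq_len) token tensor to (1, seq_len, vocab_size) logits
    for _ in range(max_new_tokens):
        logits = model(input_ids)                     # (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1)  # greedy pick from the last position
        input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break
    return input_ids
```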
Speculative Decoding
This technique is used at the LLM inference stage to speed up autoregressive decoding.
Main idea: (source)
In speculative sampling, we have two models:
- A smaller, faster draft model
- A larger, slower target model
The idea is that the draft model speculates what the output will be several steps into the future, while the target model determines how many of those tokens we should accept.
The intuition behind speculative sampling is that certain strings of tokens (common phrases, pronouns, punctuation, etc …) are fairly easy to predict, so a smaller, less powerful, but faster draft model should be able to quickly predict these instead of having our slower target model doing all the work.
The algorithm is as follows:
- The draft model decodes K tokens in the regular autoregressive fashion.
- We get the probability outputs of the target and draft model on the new predicted sequence.
- We compare the target and draft model probabilities to determine how many of the K tokens we want to keep based on some rejection criteria. If a token is rejected, we resample it using a combination of the two distributions and don’t accept any more tokens.
- If all K tokens are accepted, we can sample an additional final token from the target model probability output.
As such, instead of decoding a single token at each iteration, speculative sampling decodes between 1 and K+1 tokens per iteration. If no tokens are accepted, we resample, guaranteeing that at least 1 token is decoded. If all K tokens are accepted, we can also sample a final token from the target model's probability distribution, giving us a total of K+1 tokens decoded.
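To make the accept/reject step concrete, here is a hedged sketch of the token-level acceptance rule, assuming the draft and target distributions at the K drafted positions are already available; it accepts a drafted token with probability min(1, p_target/p_draft) and otherwise resamples from the normalized residual max(0, p_target - p_draft):

```python
import numpy as np

def speculative_accept(draft_tokens, draft_probs, target_probs, rng=None):
    # draft_tokens: K tokens proposed by the draft model
    # draft_probs, target_probs: (K, vocab_size) distributions at those positions
    rng = rng or np.random.default_rng()
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p[tok] / max(q[tok], 1e-12)):  # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            # on rejection, resample from the residual distribution max(0, p - q), renormalized,
            # and stop accepting further draft tokens
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            return accepted
    # all K drafted tokens accepted; the caller may sample one extra token from the target model
    return accepted
```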
Mathematics
Different distributions
References
- https://www.cnblogs.com/wevolf/p/15195143.html
- https://lilianweng.github.io/posts/2021-05-31-contrastive/
- https://github.com/hkproj/pytorch-llama/tree/main
- https://arxiv.org/pdf/1910.07467.pdf
- https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6
- https://ml-cheatsheet.readthedocs.io/en/latest/optimizers.html#momentum
- https://arxiv.org/pdf/2211.15596.pdf
- https://spotintelligence.com/2023/09/14/mean-average-precision/
- https://www.kaggle.com/code/ajinkyaabhang/implementing-tpr-fpr-roc-auc-from-scratch
- https://ethen8181.github.io/machine-learning/model_selection/auc/auc.html