Loess from scratch in Python + Animation

Introduction

Loess is a class of regression models that allows us to estimate the regression function $f (X)$ by fitting a simple (and different) model to each point in the domain $X$ . This is done by paying attention to the points closest to a particular target point $x_{0} \in X$ .

In the classic simple linear regression we have the following model:

$y_{i} = f (x_{i}) + ϵ_{i}$

where $f (x_{i})$ is:

$f (x_{i}) = β_{0} + β_{1} \cdot x_{i}$ and in the case of loess with approximate $f (x_{i})$ by a polynomial weigthed by a function $w_{i}$ that assigns higher weights to points $x_{i}$ that are closer to $x_{0}$ :

$f (x_{i}) = w_{i} (x_{0}) \cdot [β_{0} + β_{1} \cdot x_{i} + β_{2} \cdot x_{i}^{2} + \dots]$

Procedure:

Let $x_{i}$ denote a set of $n$ values for a particular variable and let $y_{i}$ represent the corresponding response variable.
Find the $k$ closest points to the target point $x_{0}$ . Let’s denote this set $D_{0}$
- For implementation, keep track of these points and their distances $d_{i 0}$
Find the furthest distance among those $k$ points. Let’s denote this quantity as $δ (x_{0})$
Calculate the weight function $w_{i}$ for each $x_{i} \in D_{0}$ using the tricube weight function:

$w_{i} (x_{0}) = W (\frac{d_{i 0}}{δ (x_{0})})$

where $W (\cdot)$ :

$W (u) = {\begin{cases} (1 - u^{3})^{3}, & 0 \leq u < 1 \\ 0, & otherwise \end{cases}$

Calculate the prediction $\hat{f} (x_{0})$ by solving weigthed least squares. The normal equation can be expressed as followed:

$\begin{aligned} W X \hat{β} & = W y \\ X^{'} W X \hat{β} & = X^{'} W y \\ \hat{β} & = (X^{'} W X)^{- 1} X^{'} W y \end{aligned}$

You need to use only the $k$ closest points for each target point $x_{0}$
- In the case of a polynomial of degree 1:
  - W is a 2x2 diagonal matrix
  - X is a kx2 matrix (column 1 it is a column of 1’s for the intercept)
  - y is a kx1 vector
- In the case of a polynomial of degree 2:
  - W is a 3x3 diagonal matrix
  - X is a kx3 matrix
  - y is a kx1 vector
With the estimates of $\hat{β}$ you can calculate $\hat{f} (x_{0})$

Implementation

In the implementation, I included an option to not use all the $x_{i}$ points as $x_{0}$ because when there are many samples, the fit takes long time.

Also, to solve the normal equation above, I did a QR decomposition for stability 1.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.linalg import qr, pinv
from scipy.linalg import solve_triangular


def loess(X, y, alpha, deg, all_x = True, num_points = 100):
    '''
    Parameters
    ----------
    X : numpy array 1D
        Explanatory variable.
    y : numpy array 1D
        Response varible.
    alpha : double
        Proportion of the samples to include in local regression.
    deg : int
        Degree of the polynomial to fit. Option 1 or 2 only.
    all_x : boolean, optional
        Include all x points as target. The default is True.
    num_points : int, optional
        Number of points to include if all_x is false. The default is 100.

    Returns
    -------
    y_hat : numpy array 1D
        Y estimations at each focal point.
    x_space : numpy array 1D
        X range used to calculate each estimation of y.

    '''
    
    assert (deg == 1) or (deg == 2), "Deg has to be 1 or 2"
    assert (alpha > 0) and (alpha <=1), "Alpha has to be between 0 and 1"
    assert len(X) == len(y), "Length of X and y are different"
    
    if all_x:
        X_domain = X
    else:
        minX = min(X)
        maxX = max(X)
        X_domain = np.linspace(start=minX, stop=maxX, num=num_points)
        
    n = len(X)
    span = int(np.ceil(alpha * n))
    #y_hat = np.zeros(n)
    #x_space = np.zeros_like(X)
    
    y_hat = np.zeros(len(X_domain))
    x_space = np.zeros_like(X_domain)
    
    for i, val in enumerate(X_domain):
    #for i, val in enumerate(X):
        distance = abs(X - val)
        sorted_dist = np.sort(distance)
        ind = np.argsort(distance)
        
        Nx = X[ind[:span]]
        Ny = y[ind[:span]]
        
        delx0 = sorted_dist[span-1]
        
        u = distance[ind[:span]] / delx0
        w = (1 - u**3)**3
        
        W = np.diag(w)
        A = np.vander(Nx, N=1+deg)
        
        V = np.matmul(np.matmul(A.T, W), A)
        Y = np.matmul(np.matmul(A.T, W), Ny)
        Q, R = qr(V)
        p = solve_triangular(R, np.matmul(Q.T, Y))
        #p = np.matmul(pinv(R), np.matmul(Q.T, Y))
        #p = np.matmul(pinv(V), Y)
        y_hat[i] = np.polyval(p, val)
        x_space[i] = val
        
    return y_hat, x_space

To test the implementation and create the animation, I replicate the following lecture notes 2. The dataset is located here 3.

import matplotlib.animation as animation
from matplotlib.animation import FuncAnimation, PillowWriter

c4 = pd.read_csv("cd4.data.txt", delim_whitespace=True, header=None)
y_hat2, x_space2 = loess(c4[0], c4[1], 0.05, 1, all_x = True, num_points = 200)

d = {'x': x_space2, 'y': y_hat2}
c4_pred = pd.DataFrame(data = d)
c4_pred = c4_pred.sort_values(by = ['x'])
c4_pred = c4_pred.reset_index(drop=True)

def animate(i):
    x_data = c4_pred.x.values[:i]
    y_data = c4_pred.y.values[:i]
    #sns.scatterplot(c4[0], c4[1], color ='skyblue')
    sns.lineplot(x= x_data, y=y_data, color = '#EE6666')

fig = plt.figure()
sns.scatterplot(c4[0], c4[1], color ='skyblue')
plt.ylim(1, 1550)
plt.xlabel("Time since zeroconversion")
plt.ylabel("CD4")
plt.title("CD4 cell count since zeroconversion for HIV infected men")

writer = animation.PillowWriter()

ani = FuncAnimation(fig, animate, frames = np.arange(0, 2376, 40))
ani.save('ani_cd4.gif', writer=writer)