Eshin Jolly

Vector similarity reference

This is a quick post illustrating, with Python code, how several common vector similarity computations relate to each other. For more details, I highly encourage you to check out Brendan O’Connor’s really nice elaboration.

# Import some stuff
import numpy as np
import pandas as pd
import scipy.spatial.distance as spd
from pymer4.simulate import easy_multivariate_normal
from pymer4.models import Lm
# Simulate some data
# Two 50-dimensional vectors correlated at roughly r = -0.05
X = easy_multivariate_normal(50,2,corrs=-.05)
a, b = X[:,0], X[:,1]

Inner product

Sum of element-wise products

\[\sum_i {x_iy_i} = x \cdot y\]
np.dot(a,b)
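As a quick sanity check, the same inner product can be written a few equivalent ways in NumPy. This is a minimal sketch on small hand-made vectors (`x` and `y` are illustrative values, not the simulated `a` and `b`), so the arithmetic is easy to verify by hand.

```python
import numpy as np

# Small illustrative vectors (assumed here, not from the simulation above)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -5.0, 6.0])

explicit = np.sum(x * y)  # sum of element-wise products
dotted = np.dot(x, y)     # same computation via np.dot
matmul = x @ y            # same computation via the @ operator

# 1*4 + 2*(-5) + 3*6 = 12
print(explicit, dotted, matmul)
```

All three forms compute the same number, so which one to use is mostly a matter of readability.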


Covariance

Average centered inner product

\[\frac{(x - \bar x) \cdot (y - \bar y)}{n}\]
a_centered = a - a.mean()
b_centered = b - b.mean()
np.dot(a_centered,b_centered) / len(a)
# Check our work
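One possible check, sketched here with NumPy's `np.cov` as the reference, is to compare the average centered inner product against the off-diagonal entry of the covariance matrix. The vectors `x` and `y` and the seed are assumptions made so the block runs on its own. Note that `np.cov` divides by n - 1 by default, so `ddof=0` is needed to match the formula above, which divides by n.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, assumed for reproducibility
x = rng.normal(size=50)
y = rng.normal(size=50)

# Average centered inner product (divides by n)
manual_cov = np.dot(x - x.mean(), y - y.mean()) / len(x)

# np.cov returns a 2x2 matrix; [0, 1] is the covariance between x and y.
# ddof=0 makes it divide by n instead of the default n - 1.
numpy_cov = np.cov(x, y, ddof=0)[0, 1]

print(np.isclose(manual_cov, numpy_cov))
```

With the simulated vectors, the equivalent call would be `np.cov(a, b, ddof=0)[0, 1]`.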

Cosine Similarity

Normalized (L2) inner product

\[\frac{x \cdot y}{||x|| \ ||y||}\]
# Euclidean/L2 norm = square root of sum of squared values
# algebra form
a_norm = np.sqrt(np.sum(np.power(a,2)))
# matrix form; transpose is not strictly needed here, just for illustration
b_norm = np.sqrt(np.dot(b,b.T))

# numpy short-cut: np.linalg.norm(a)
np.dot(a,b) / (a_norm * b_norm)
# Check our work (subtract from 1 because scipy returns distances)
1 - spd.cosine(a,b)
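Dividing by both norms makes cosine similarity insensitive to the length of either vector, but not to a constant offset. The sketch below illustrates this on small made-up vectors; the helper `cosine_similarity` is a hypothetical name introduced only for this example.

```python
import numpy as np

def cosine_similarity(x, y):
    """Inner product divided by the product of the L2 norms."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Small illustrative vectors (assumed here, not the simulated a and b)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 4.0])

base = cosine_similarity(x, y)
scaled = cosine_similarity(10 * x, y)  # rescaling x leaves the result unchanged
shifted = cosine_similarity(x + 5, y)  # adding a constant to x changes it

print(np.isclose(base, scaled), np.isclose(base, shifted))
```

The sensitivity to shifts is what centering removes, which is the step that turns cosine similarity into Pearson correlation.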

Pearson Correlation

Centered, normalized (L2) inner product

\[\frac{(x - \bar x) \cdot (y - \bar y)}{||x - \bar x|| \ ||y - \bar y||}\]
# Can think of this as normalized covariance OR centered cosine similarity
a_centered_norm = np.linalg.norm(a_centered)
b_centered_norm = np.linalg.norm(b_centered)
np.dot(a_centered,b_centered) / (a_centered_norm * b_centered_norm)
# Check our work
1 - spd.correlation(a,b)
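Both readings in the comment above (normalized covariance, centered cosine similarity) can be checked numerically. This is a sketch on freshly generated vectors; `x`, `y`, and the seed are assumptions so the block is self-contained.

```python
import numpy as np
import scipy.spatial.distance as spd

rng = np.random.default_rng(1)  # arbitrary seed, assumed for reproducibility
x = rng.normal(size=50)
y = rng.normal(size=50)

xc, yc = x - x.mean(), y - y.mean()

# Centered, normalized inner product
pearson_manual = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Reading 1: cosine similarity of the centered vectors
pearson_as_cosine = 1 - spd.cosine(xc, yc)

# Reading 2: covariance divided by both standard deviations
# (ddof=0 in np.cov matches the default ddof=0 of ndarray.std)
pearson_from_cov = np.cov(x, y, ddof=0)[0, 1] / (x.std() * y.std())

# Reference value from NumPy
pearson_numpy = np.corrcoef(x, y)[0, 1]

print(pearson_manual, pearson_as_cosine, pearson_from_cov, pearson_numpy)
```

All four values should agree up to floating-point error.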

OLS (univariate w/o intercept)

Partially normalized inner product, where partially means using a single vector’s norm

\[\frac{x \cdot y}{||x||^2}\]
# Can think of this as cosine similarity using only one vector's norm
np.dot(a,b) / (a_norm * a_norm)
# Check our work against a regression in pymer4
model = Lm('B ~ 0 + A',data=pd.DataFrame({'A':a,'B':b}))
# Grab the beta value directly from the coefficients table.
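As an independent cross-check that does not rely on pymer4, the same no-intercept slope can be compared against a generic least-squares solver. This sketch swaps in `np.linalg.lstsq`; the data (`x`, `y`, a true slope of 0.5) and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, assumed for reproducibility
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

# Partially normalized inner product: x . y / ||x||^2
beta_manual = np.dot(x, y) / np.dot(x, x)

# lstsq expects a 2-D design matrix; a single column of x and no
# column of ones means the model has no intercept.
beta_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

print(np.isclose(beta_manual, beta_lstsq))
```

Note that `np.dot(x, x)` is the squared L2 norm, the same quantity as `a_norm * a_norm` in the snippet above.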

OLS (univariate w/ intercept)

Centered, partially normalized inner product

\[\frac{(x - \bar x) \cdot y}{||x - \bar x||^2}\]
# In the numerator we could actually center a or b, or both.
np.dot(a_centered,b) / (a_centered_norm * a_centered_norm)
# Check our work
model = Lm('B ~ A',data=pd.DataFrame({'A':a,'B':b}))
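The claim that centering either vector (or both) in the numerator gives the same slope can be verified directly, and the slope can also be tied back to the Pearson correlation. The sketch below uses `np.polyfit` as a reference fit rather than pymer4; the data, the seed, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed, assumed for reproducibility
x = rng.normal(size=50)
y = 1.0 + 0.5 * x + rng.normal(size=50)

xc, yc = x - x.mean(), y - y.mean()
denom = np.dot(xc, xc)  # ||x - mean(x)||^2

# Centering x only, y only, or both gives the same numerator,
# because the centered vector sums to zero.
slope_center_x = np.dot(xc, y) / denom
slope_center_y = np.dot(x, yc) / denom
slope_center_both = np.dot(xc, yc) / denom

# Reference: degree-1 polynomial fit returns [slope, intercept]
slope_polyfit, intercept_polyfit = np.polyfit(x, y, deg=1)

# The slope is the Pearson correlation rescaled by the ratio of standard deviations
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std() / x.std()

print(slope_center_x, slope_center_y, slope_center_both, slope_polyfit, slope_from_r)
```

Seen this way, the regression slope and the correlation differ only in how the centered inner product is normalized: by one vector's squared norm or by the product of both norms.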