Curse of Dimensionality
What is the volume of the hypersphere divided by the volume of the hypercube as $d \rightarrow \infty$?
$V_{HC} = (2r)^d$ and $V_{HS} = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)} r^d$, thus
$$\frac{V_{HS}}{V_{HC}} = \frac{\pi^{d/2}}{2^d \, \Gamma\left(\frac{d}{2}+1\right)} \rightarrow \frac{1}{\sqrt{\pi d}}\left(\frac{\pi e}{2d}\right)^{d/2} \rightarrow 0$$
as $d \rightarrow \infty$. And the distance between the center and corners of the hypercube is $\sqrt{d}\,r$.
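As a quick numerical check, the ratio can be evaluated directly; a minimal sketch using Python's standard math module (the chosen values of $d$ are arbitrary):

```python
import math

def sphere_to_cube_volume_ratio(d):
    """V_HS / V_HC for a d-ball of radius r inscribed in a hypercube of
    side 2r; the factor r**d cancels, so the ratio depends only on d."""
    return math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2 ** d)

for d in (2, 3, 10, 100):
    print(d, sphere_to_cube_volume_ratio(d))
# 2 -> pi/4 ~ 0.785, 3 -> pi/6 ~ 0.524, 10 -> ~2.5e-3, 100 -> ~1.9e-70
```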
Curse of Dimensionality
Nearly all of high-dimensional space in a hypercube is distant from the center and close to the border.
High-dimensional datasets are at risk of being sparse. The average distance between two random points (estimated numerically in the sketch after this list):
in a unit square is roughly 0.52.
in a unit 3-d cube is roughly 0.66.
in a unit 1,000,000-d hypercube is roughly 408.25 ($\approx \sqrt{d/6}$).
As $d$ grows, the distances from a random point to its nearest and farthest neighbors become similar.
Distance-based classification generalizes poorly unless the number of samples grows exponentially with $d$.
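These averages can be checked with a short Monte Carlo sketch (NumPy assumed; the number of sampled pairs is an arbitrary choice). For large $d$ the mean distance approaches $\sqrt{d/6}$, since each coordinate contributes an expected squared difference of $1/6$:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_distance(d, n_pairs=100_000):
    """Monte Carlo estimate of the mean distance between two uniform
    random points in the d-dimensional unit hypercube."""
    a = rng.random((n_pairs, d))
    b = rng.random((n_pairs, d))
    return np.linalg.norm(a - b, axis=1).mean()

print(mean_pairwise_distance(2))   # ~0.52
print(mean_pairwise_distance(3))   # ~0.66
print(np.sqrt(1_000_000 / 6))      # ~408.25 (large-d limit, no simulation needed)
```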
Biological Networks
Highly interconnected with modular structure.
Weakly to strongly scale-free (the fraction of nodes with degree $k$ follows a power law $k^{-\alpha}$).
Subsets of genes, proteins or regulatory elements tend to form highly correlated modules.
Functional genomics datasets tend to (not always!) occupy a low-dimensional subspace of the feature space (e.g., genes, proteins, regulatory elements).
Ideal for dimensional reduction approaches to both visualize and analyze functional genomics data.
Principal Component Analysis (PCA)
PCA
Assume we have $n$ samples and $p$ features in the form of an $n \times p$ centered matrix $X$, where the mean of each feature has been subtracted across samples.
The unbiased sample covariance matrix is then
$$\Sigma_{XX} = \frac{1}{n-1} X^T X$$
PCA finds a linear transformation $Z = XV$ that diagonalizes $\Sigma_{XX}$.
Singular Value Decomposition
$X$ can be decomposed as follows:
$$X = U D V^T$$
where $U$ and $V$ are $n \times n$ and $p \times p$ orthogonal matrices, respectively, and $D$ is an $n \times p$ diagonal matrix. The diagonal elements of $D$ are the singular values of $X$. The columns of $U$ and $V$ are the left-singular vectors and right-singular vectors, respectively.
Singular Value Decomposition
The left-singular vectors and right-singular vectors of $X$ are the eigenvectors of $XX^T$ and $X^TX$, respectively.
The nonzero singular values of $X$ are the square roots of the nonzero eigenvalues of $XX^T$ and $X^TX$.
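A quick numerical confirmation of this relationship; a sketch using NumPy on a random toy matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                    # toy matrix, n = 6, p = 4

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# The squared singular values of X equal the eigenvalues of X^T X (and the
# nonzero eigenvalues of X X^T); the columns of V are the corresponding
# eigenvectors of X^T X, up to sign.
evals = np.linalg.eigvalsh(X.T @ X)            # ascending order
print(np.allclose(np.sort(evals)[::-1], d**2)) # True
```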
PCA
The covariance matrix of $Z = XV$, where the columns of $V$ are the right-singular vectors of $X$, is
$$\Sigma_{ZZ} = \frac{1}{n-1} Z^T Z = \frac{1}{n-1} D^T D = \frac{1}{n-1} \hat{D}^2$$
where $\hat{D}^2$ is a square diagonal matrix (zeros truncated), and we have used the SVD of $X$, $(UDV^T)^T = V D^T U^T$, $V^T V = I_p$, and $U^T U = I_n$.
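Putting the pieces together, PCA can be computed directly from the SVD. The following sketch (NumPy, random toy data) checks that the covariance matrix of $Z = XV$ is numerically diagonal with entries $D^2/(n-1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # toy data: n = 100 samples, p = 5 features
X = X - X.mean(axis=0)                   # center each feature

# Thin SVD: X = U D V^T; the columns of V are the principal axes
U, d, Vt = np.linalg.svd(X, full_matrices=False)

Z = X @ Vt.T                             # principal component scores, Z = XV (= UD)
explained_var = d**2 / (X.shape[0] - 1)  # diagonal of Sigma_ZZ

# The covariance of Z is (numerically) diagonal
print(np.allclose(np.cov(Z, rowvar=False), np.diag(explained_var)))  # True
```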
Non-Negative Matrix Factorization
Factorize a matrix $V$ with all non-negative elements into a product of two matrices $W$ (features matrix) and $H$ (coefficients matrix), subject to the constraint that $W$ and $H$ have no negative elements. Formally,
$$V \approx WH \quad \text{such that} \quad W \geq 0 \text{ and } H \geq 0$$
where $V$, $W$, and $H$ are $m \times n$, $m \times p$, and $p \times n$ matrices, $m$ is the number of samples, $n$ is the number of features, $p \ll n$, and possibly $p \ll m$.
Non-Negative Matrix Factorization
To approximate $V \approx WH$, we need a cost function. We'll focus on a Euclidean-distance-based cost function between $V$ and $WH$:
$$\|V - WH\|^2 = \sum_{ij}\left(V_{ij} - (WH)_{ij}\right)^2$$
Non-Negative Matrix Factorization
Minimize the cost function using Lee and Seung's multiplicative update rules:
Initialize $H$ and $W$ with non-negative values.
Update $H_{ij}$ and $W_{ij}$:
$$H_{ij} \leftarrow H_{ij} \frac{(W^T V)_{ij}}{(W^T W H)_{ij}}$$
$$W_{ij} \leftarrow W_{ij} \frac{(V H^T)_{ij}}{(W H H^T)_{ij}}$$
Stop when $H_{ij}$ and $W_{ij}$ do not change to within a specified tolerance.
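A minimal NumPy sketch of these updates. The small `eps` added to the denominators, the random initialization, and the iteration/tolerance settings are implementation choices of this sketch, not part of the algorithm as stated above:

```python
import numpy as np

def nmf(V, p, n_iter=500, tol=1e-6, eps=1e-10, seed=0):
    """Lee & Seung multiplicative updates for V ~ W H with V (m x n),
    W (m x p), H (p x n), all non-negative."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, p))
    H = rng.random((p, n))
    for _ in range(n_iter):
        H_new = H * (W.T @ V) / (W.T @ W @ H + eps)
        W_new = W * (V @ H_new.T) / (W @ H_new @ H_new.T + eps)
        done = (np.max(np.abs(H_new - H)) < tol) and (np.max(np.abs(W_new - W)) < tol)
        W, H = W_new, H_new
        if done:
            break
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(20, 10)))  # toy non-negative matrix
W, H = nmf(V, p=3)
print(np.linalg.norm(V - W @ H))  # Euclidean (Frobenius) reconstruction error
```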
T-SNE
A non-linear dimensional reduction approach that attempts to map a distribution of pairwise distances among $n$ high-dimensional samples to a distribution of pairwise distances of the same $n$ samples in a low dimension.
Given $n$ high-dimensional samples $x_1, \ldots, x_n$, define
$$p(j \mid i) = \frac{e^{-\|x_j - x_i\|^2 / 2\sigma_i^2}}{\sum_{k \neq i} e^{-\|x_k - x_i\|^2 / 2\sigma_i^2}} \quad \text{for } j \neq i$$
and 0 otherwise. Next, define
$$p_{ij} = \frac{p(j \mid i) + p(i \mid j)}{2n}$$
which ensures that $p_{ij} = p_{ji}$ and $\sum_{i,j} p_{ij} = 1$.
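A NumPy sketch of these affinities. For simplicity it uses one fixed $\sigma$ for every point, whereas t-SNE actually tunes each $\sigma_i$ to match a user-chosen perplexity:

```python
import numpy as np

def joint_probabilities(X, sigma=1.0):
    """Symmetrized affinities p_ij with a single fixed sigma (a simplification)."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                  # p(i|i) = 0
    P = P / P.sum(axis=1, keepdims=True)      # rows -> conditional p(j|i)
    return (P + P.T) / (2.0 * n)              # symmetrize -> p_ij

P = joint_probabilities(np.random.default_rng(0).normal(size=(5, 3)))
print(P.sum())  # ~1.0
```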
T-SNE
The goal of t-SNE is to learn a map $X \mapsto Y$ such that the $y_1, \ldots, y_n$ are mapped to a lower-dimensional space ($d = 2$ or $3$). In the lower-dimensional space, the probability of $y_i$ and $y_j$ being associated/near each other is assumed to follow a t-distribution with one degree of freedom, where for $i \neq j$
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k,l;\, k \neq l}\left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
and 0 otherwise.
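A matching NumPy sketch of the low-dimensional affinities on a toy random embedding:

```python
import numpy as np

def low_dim_affinities(Y):
    """t-distributed (one degree of freedom) affinities q_ij in the embedding."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(Q, 0.0)   # q_ii = 0
    return Q / Q.sum()         # normalize over all pairs k != l

Q = low_dim_affinities(np.random.default_rng(0).normal(size=(5, 2)))
print(Q.sum())  # 1.0
```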
T-SNE
The locations of the $y_i$ in the lower-dimensional space are found by minimizing the Kullback-Leibler divergence between the $p_{ij}$ and $q_{ij}$ distributions
$$KL(p \| q) = \sum_{i,j;\, i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
which is a measure of the difference between distributions $p$ and $q$. The minimization of the Kullback-Leibler divergence with respect to the $y_i$ is performed using gradient descent.
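In practice this optimization is rarely implemented by hand; scikit-learn's `TSNE` performs the perplexity calibration and the gradient descent internally. A minimal usage sketch on random toy data:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(200, 50))  # toy high-dimensional data
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Y.shape)  # (200, 2)
```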
UMAP: Graph Construction
Let $x_1, \ldots, x_n$ be the input data. For each $x_i$ compute the $k$ nearest neighbors $x_{i_1}, \ldots, x_{i_k}$ and
$$\rho_i = \min\{\, d(x_i, x_{i_j}) \mid 1 \leq j \leq k,\ d(x_i, x_{i_j}) > 0 \,\}$$
and set $\sigma_i$ using
$$\sum_{j=1}^{k} e^{-\max\left(0,\, d(x_i, x_{i_j}) - \rho_i\right)/\sigma_i} = \log_2(k)$$
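A sketch of this per-point calibration in NumPy: $\rho_i$ is the smallest nonzero neighbor distance, and $\sigma_i$ is found by simple bisection (the bisection bounds and iteration count are arbitrary choices of this sketch):

```python
import numpy as np

def rho_sigma(knn_dists, k):
    """Given each point's k-nearest-neighbor distances (one array per point),
    return rho_i and the sigma_i solving the log2(k) constraint above."""
    target = np.log2(k)
    rho = np.array([d[d > 0].min() for d in knn_dists])
    sigma = np.empty(len(knn_dists))
    for i, d in enumerate(knn_dists):
        lo, hi = 1e-8, 1e4
        for _ in range(64):                    # bisection: the sum increases with sigma
            mid = (lo + hi) / 2
            s = np.exp(-np.maximum(0.0, d - rho[i]) / mid).sum()
            if s > target:
                hi = mid
            else:
                lo = mid
        sigma[i] = (lo + hi) / 2
    return rho, sigma

knn_dists = [np.array([0.4, 0.5, 0.7, 1.2]),
             np.array([0.3, 0.4, 0.9, 1.0]),
             np.array([0.2, 0.5, 0.6, 1.5])]   # toy k-NN distances, k = 4
print(rho_sigma(knn_dists, k=4))
```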
UMAP: Graph Construction
Define a weighted directed graph $\bar{G} = (V, E, w_h)$ where the vertices $V$ of $\bar{G}$ are the set $X$. Then form the set of directed edges $E$ with weights $w_h$:
$$E = \{\, (x_i, x_{i_j}) \mid 1 \leq j \leq k,\ 1 \leq i \leq n \,\}$$
$$w_h(x_i, x_{i_j}) = e^{-\max\left(0,\, d(x_i, x_{i_j}) - \rho_i\right)/\sigma_i}$$
UMAP: Graph Construction
Combine the edges of $\bar{G}$, with adjacency matrix $A$, into a unified undirected graph $G$ with adjacency matrix $B$:
$$B = A + A^T - A \circ A^T$$
where $\circ$ is the Hadamard (or pointwise) product.
If $A_{ij}$ is the probability that the directed edge from $x_i$ to $x_j$ exists, then $B_{ij}$ is the probability that at least one of the two directed edges (from $x_i$ to $x_j$ and from $x_j$ to $x_i$) exists.
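In NumPy the Hadamard product is elementwise `*`, so the symmetrization is a one-liner (toy weights below):

```python
import numpy as np

# Toy directed weights A_ij = w_h(x_i, x_j), i.e. probability the edge i -> j exists
A = np.array([[0.0, 0.9, 0.0],
              [0.2, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

B = A + A.T - A * A.T   # probability that at least one of the two directed edges exists
print(B)                # e.g. B[0, 1] = 0.9 + 0.2 - 0.9*0.2 = 0.92
```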
UMAP: Graph Layout
Learn a mapping $X \mapsto Y$ by first initializing the low-dimensional representation using spectral embedding.
Spectral embedding is a dimensional reduction approach that maps a connected graph $G$ to a low-dimensional vector space in such a way that two points that are "close" in the graph are "close" in the low-dimensional vector space.
UMAP: Graph Layout
Formally, this is done by calculating the eigenvectors of the graph Laplacian $L = D - B$, where $B$ is the adjacency matrix and $D$ is a diagonal matrix with $D_{ii} = \sum_j B_{ij}$. Sort the eigenvalues $\lambda_0 \leq \lambda_1 \leq \lambda_2 \leq \cdots$; the eigenvectors corresponding to $\lambda_1$ and $\lambda_2$ give the embedding in a 2-d vector space.
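A minimal NumPy sketch of this spectral initialization on a small toy adjacency matrix (it assumes the graph is connected, so only the constant eigenvector for $\lambda_0 = 0$ is skipped):

```python
import numpy as np

def spectral_embedding_2d(B):
    """2-d spectral embedding from a symmetric adjacency matrix B using the
    eigenvectors of the unnormalized graph Laplacian L = D - B for lambda_1, lambda_2."""
    D = np.diag(B.sum(axis=1))
    L = D - B
    evals, evecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return evecs[:, 1:3]               # skip the constant eigenvector (lambda_0 = 0)

B = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.2, 0.0],
              [0.5, 0.2, 0.0, 0.8],
              [0.0, 0.0, 0.8, 0.0]])
print(spectral_embedding_2d(B))        # one 2-d coordinate per node
```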
UMAP: Graph Layout
Fit $w_l$, a differentiable form of the low-dimensional (2-d) weights. Estimate $a$ and $b$ by performing non-linear regression of
$$w_l(x, y) = \left(1 + a\left(\|x - y\|_2^2\right)^b\right)^{-1}$$
to
$$\Psi(x, y) = e^{-\left(\|x - y\|_2 - d_{\min}\right)}$$
if $\|x - y\|_2$ is greater than $d_{\min}$, and 1 otherwise.
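A sketch of this fit using `scipy.optimize.curve_fit`; the evaluation grid and the value of $d_{\min}$ (called `min_dist` here, as in the UMAP software) are arbitrary choices of this sketch:

```python
import numpy as np
from scipy.optimize import curve_fit

min_dist = 0.1                          # d_min above

def w_low(d, a, b):
    return 1.0 / (1.0 + a * d ** (2.0 * b))

def psi(d):
    return np.where(d > min_dist, np.exp(-(d - min_dist)), 1.0)

d = np.linspace(0.0, 3.0, 300)          # grid of low-dimensional distances
(a, b), _ = curve_fit(w_low, d, psi(d))
print(a, b)                             # roughly a ~ 1.6, b ~ 0.9 for min_dist = 0.1
```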
UMAP: Graph Layout
The locations of the $y_i$ in the lower-dimensional space are found by minimizing the binary cross-entropy loss $L$ using stochastic gradient descent, where the loss is
$$L = -\sum_{b \in B}\left[\, w_h(b) \log w_l(b) + \left(1 - w_h(b)\right) \log\left(1 - w_l(b)\right) \right]$$
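In practice the whole pipeline (graph construction, spectral initialization, and stochastic gradient descent on this loss) is implemented in the `umap-learn` package. A minimal usage sketch on random toy data:

```python
import numpy as np
import umap  # pip install umap-learn

X = np.random.default_rng(0).normal(size=(200, 50))   # toy high-dimensional data
Y = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2,
              random_state=0).fit_transform(X)
print(Y.shape)  # (200, 2)
```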