ANNchor API Guide

ANNchor contains two classes: the main class, Annchor, and a brute force class, BruteForce.

Annchor

class annchor.Annchor(X, func, func_kwargs=None, n_anchors=20, n_neighbors=15, n_samples=5000, p_work=0.1, anchor_picker=None, sampler=None, regression=None, error_predictor=None, random_seed=42, locality=5, loc_thresh=1, loc_min=None, verbose=False, is_metric=True, get_exact_ijs=None, backend='loky', niters=2, lookahead=5)

Quickly computes the approximate k-NN graph for slow metrics

X: np.array or list (required)

The data set for which we want to find the k-NN graph.

func: function, numba-jitted function or string (required)

The metric under which the k-NN graph should be evaluated. This can be a user supplied function or a string (see the sketch after the list below). Currently supported string arguments are:

  • euclidean

  • cosine

  • levenshtein

  • wasserstein (requires cost_matrix kwarg)
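
For illustration, a minimal sketch of both options is given below, using a small random data set and a numba-jitted Euclidean metric (the data and the function name are purely illustrative):

    import numpy as np
    from numba import njit
    from annchor import Annchor

    X = np.random.RandomState(0).randn(1000, 10)  # illustrative data

    # Option 1: a supported metric passed by name.
    ann = Annchor(X, 'euclidean', n_neighbors=15)

    # Option 2: a user supplied (here numba-jitted) function taking two
    # elements of X and returning a float.
    @njit
    def my_euclidean(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    ann = Annchor(X, my_euclidean, n_neighbors=15)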

func_kwargs: dict (optional, default None)

Dictionary of keyword arguments for the metric

n_anchors: int (optional, default 20)

The number of anchor points. Increasing the number of anchors increases the k-NN graph accuracy at the expense of speed.

n_neighbors: int (optional, default 15)

The number of nearest neighbors to compute (i.e. the value of k for the k-NN graph).

n_samples: int (optional, default 5000)

The number of sampled distances used to estimate the error distribution E = d - dhat (true distance minus predicted distance).

p_work: float (optional, default 0.1)

The approximate fraction of brute force calculations which we are willing to make.

anchor_picker: AnchorPicker (optional, default MaxMinAnchorPicker())

The anchor picker class which specifies the anchor points.

sampler: Sampler (optional, default SimpleStratifiedSampler())

The sampler class which chooses the sample pairs.

regression: Regression (optional, default SimpleStratifiedLinearRegression())

The regression class which predicts distances from features.

error_predictor: ErrorRegression (optional, default SimpleStratifiedErrorRegression())

The error regression class which predicts error distributions.

locality: int (optional, default 5)

The number of anchor points to use in the permutation based k-NN (dhat) step.

loc_thresh: int (optional, default 1)

The minimum number of anchors in common for which we allow an item into NN_x.

loc_min: int (optional, default 10*n_neighbors)

The minimum number of points in NN_x

verbose: bool (optional, default False)

Set verbose=True for more interim output.

is_metric: bool (optional, default True)

Set is_metric=False if the metric may violate the triangle inequality. With is_metric=True, predictions are clipped between upper/lower bounds, which may not be accurate if triangle inequality is violated.

get_exact_ijs: function (optional, default None)

An optional user supplied function for evaluating the metric on an array of indices. Useful if you wish to supply your own parallelisation. get_exact_ijs(f,X,IJ) should return np.array([f(X[i],X[j]) for i,j in IJ]).
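
As a hedged illustration of this contract, a minimal serial implementation might look as follows (the function name is illustrative; a real replacement would typically swap the loop for its own parallelisation):

    import numpy as np

    # Minimal serial implementation of the get_exact_ijs contract: given
    # the metric f, the data X and an array of index pairs IJ, return the
    # corresponding distances.
    def my_get_exact_ijs(f, X, IJ):
        return np.array([f(X[i], X[j]) for i, j in IJ])

    # ann = Annchor(X, func, get_exact_ijs=my_get_exact_ijs)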

backend: string (optional, default “loky”)

Specifies the joblib Parallel backend. Can be “loky” or “multiprocessing”. In general “loky” seems to be more robust, but occasionally “multiprocessing” can be significantly faster.
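
A minimal end-to-end sketch follows. The random data are illustrative, and it is assumed that the fitted graph is exposed as an (indices, distances) tuple via the neighbor_graph attribute:

    import numpy as np
    from annchor import Annchor

    X = np.random.RandomState(42).randn(1000, 20)  # illustrative data

    ann = Annchor(X,
                  'euclidean',
                  n_anchors=20,
                  n_neighbors=15,
                  p_work=0.1)
    ann.fit()

    # Assumed: the k-NN graph is stored as an (indices, distances) tuple.
    indices, distances = ann.neighbor_graph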

annchor_selective_subset(y, dne=None, alpha=0)

Generates a selective subset for X given labels y.

y: np.array, shape=(len(X),)

Data labels (corresponding to elements in X)

dne: tuple or None (optional, default=None)

The nearest enemy graph for X,y. dne[0] are the indices of the nearest enemies for points in X. dne[1] are the corresponding distances.

alpha: float (default=0)

Value for alpha-ss. The nearest enemy distance is multiplied by 1/(1+alpha), to account for approximate nearest neighbour methods. In practice, larger alpha values make the subset more robust to errors at the expense of increasing the size of the subset.

ss: np.array

The indices of X which form the selective subset.
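
A hedged usage sketch (data and labels are illustrative, and it is assumed that fit() has been called first):

    import numpy as np
    from annchor import Annchor

    rng = np.random.RandomState(0)
    X = rng.randn(500, 10)            # illustrative data
    y = rng.randint(0, 3, size=500)   # illustrative class labels

    ann = Annchor(X, 'euclidean', n_neighbors=15)
    ann.fit()

    # Indices of a selective subset of X with respect to the labels y
    # (dne=None, so the nearest enemy graph is presumably computed internally).
    ss = ann.annchor_selective_subset(y, alpha=0.05)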

fit()

Computes the approx nearest neighbour graph.

get_anchors()

Gets the anchors and distances to anchors. Anchors are stored in self.A, distances in self.D.

self.A: np.array, shape=(n_anchors,)

Array of anchor indices.

self.D: np.array, shape=(nx, n_anchors)

Array of distances to anchor points.

get_features_IJ(IJs, I)

Computes features for distances in IJs.

IJs: np.array, shape=(?,2)

Indices of distances for which we will find features.

I: numba.typed.Dict dict of arrays

I[i] is the array of indices into IJs which contain index i.

feature_names: List

List of feature names (as strings).

features: np.array

Array of features corresponding to the distances in IJs.

not_computed_mask: np.array

Array of bools indicating whether or not a distance in IJs has been explicitly computed.

get_locality()

Uses a basic permutation/set method to find candidate nearest neighbours.

Current Technique (Use something better in future):

For each point i, find the set S_i of its nearest l=locality anchor points. For each other point j, compute the intersection S_i ∩ S_j. Only consider pairs (i, j) where |S_i ∩ S_j| >= loc_thresh.

self.check: Dict, keys=int64, val=int64[:]

check[i] is the array of candidate nearest neighbour indices for index i.
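
The technique above can be sketched in plain NumPy/Python as follows (this illustrates the idea only, and is not the library's internal implementation):

    import numpy as np

    def candidate_pairs(D, locality=5, loc_thresh=1):
        """Illustrative locality filter. D is the (n_points, n_anchors)
        array of distances to anchors. Each point keeps the set of its
        `locality` nearest anchors; a pair (i, j) is kept as a candidate
        when the two sets share at least `loc_thresh` anchors."""
        S = [set(row) for row in np.argsort(D, axis=1)[:, :locality]]
        n = len(S)
        return [(i, j)
                for i in range(n)
                for j in range(i + 1, n)
                if len(S[i] & S[j]) >= loc_thresh]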

get_nearest_enemies(y, nn=3, loc_min=100)

Computes the nearest enemy graph (nearest neighbours with a different label). The result is stored in self.nearest_enemy_graph.

y: np.array, shape=(len(X),)

Data labels (corresponding to elements in X)

nn: int > 0

Desired number of nearest enemies

loc_min: int > 0

Minimum number of points to store in locality (i.e. for each index we will compute features for at least loc_min points, if possible.)

get_sample()

Gets the sample of pairwise distances on which to train dhat/errors.

self.G: np.array, shape=(n_samples,3)

Array storing the sample distances and features (for future regression). G[i,:-1] are the features for sample pair i. G[i,-1] is the true distance for sample pair i.

query(Q, nn=15, p_work=0.3, get_exact_query_ijs=None)

Query new data against the existing index.

Q: np.array, shape=(n_samples, n_features)

The new data to query against X

nn: int > 0

Desired number of nearest neighbours

p_work: 0.0 < float <= 1.0

Fraction of brute force work to perform. A value of p_work=1 will carry out len(Q)*len(X) distance computations (equivalent to brute forcing the solution).

get_exact_query_ijs: function (optional, default None)

An optional user supplied function for evaluating the metric on an array of indices. Useful if you wish to supply your own parallelisation. get_exact_query_ijs(f,X,Z,IJ) should return np.array([f(X[i],Z[j]) for i,j in IJ]).
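
A hedged usage sketch (data are illustrative; since the return format of query is not documented above, the result is left unpacked):

    import numpy as np
    from annchor import Annchor

    rng = np.random.RandomState(1)
    X = rng.randn(1000, 10)   # indexed data
    Q = rng.randn(50, 10)     # new query points

    ann = Annchor(X, 'euclidean', n_neighbors=15)
    ann.fit()

    # Approximate 15-NN of each row of Q against X, using at most roughly
    # 30% of the len(Q)*len(X) brute force distance computations.
    result = ann.query(Q, nn=15, p_work=0.3)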

to_sparse_matrix()

Returns the k-NN graph as a dictionary-of-keys (DOK) sparse distance matrix.
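
A brief sketch, assuming the return value is a scipy dictionary-of-keys sparse matrix and therefore supports the usual scipy conversions:

    import numpy as np
    from annchor import Annchor

    X = np.random.RandomState(2).randn(500, 8)  # illustrative data
    ann = Annchor(X, 'euclidean', n_neighbors=15)
    ann.fit()

    D = ann.to_sparse_matrix()   # assumed: scipy DOK sparse distance matrix
    D_csr = D.tocsr()            # convert for downstream scipy/sklearn use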

class annchor.BruteForce(X, func, func_kwargs=None, verbose=False, get_exact_ijs=None, backend='loky')

Computes the exact k-NN graph by brute force.

X: np.array or list (required)

The data set for which we want to find the k-NN graph.

func: function, numba-jitted function or string (required)

The metric under which the k-NN graph should be evaluated. This can be a user supplied function or a string. Currently supported string arguments are:

  • euclidean

  • cosine

  • levenshtein

func_kwargs: dict (optional, default None)

Dictionary of keyword arguments for the metric

get_exact_ijs: function (optional, default None)

An optional user supplied function for evaluating the metric on an array of indices. Useful if you wish to supply your own parallelisation. get_exact_ijs(f,X,IJ) should return np.array([f(X[i],X[j]) for i,j in IJ]).

backend: string (optional, default “loky”)

Specifies the joblib Parallel backend. Can be “loky” or “multiprocessing”. In general “loky” seems to be more robust, but occasionally “multiprocessing” can be significantly faster.

fit()

Computes the k-NN graph by brute force.

get_neighbor_graph()

Gets the k-NN graph from the all-pairs distance matrix.
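
A minimal usage sketch, assuming (as with Annchor) that the fitted graph is exposed as an (indices, distances) tuple via the neighbor_graph attribute:

    import numpy as np
    from annchor import BruteForce

    X = np.random.RandomState(3).randn(400, 10)  # illustrative data

    bruteforce = BruteForce(X, 'euclidean')
    bruteforce.fit()

    # Assumed: the exact k-NN graph as an (indices, distances) tuple.
    indices, distances = bruteforce.neighbor_graph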

Utility Functions

The annchor package also includes some utility functions, e.g. for measuring neighbor graph accuracy.

annchor.compare_neighbor_graphs(nng_1, nng_2, n_neighbors)

Compares accuracy of k-NN graphs. The second graph is compared against the first. This measure of accuracy accounts for cases where the indices differ but the distances are equivalent.

e.g. if nng_1[0][0]=[0, 1, 2, 3], nng_1[1][0]=[0, 1, 1, 2],

and nng_2[0][0]=[0, 1, 2, 4], nng_2[1][0]=[0, 1, 1, 2],

there would be zero incorrect NN pairs, since both ix=3 and ix=4 are valid 4th nearest neighbors.

nng_1: nearest neighbour graph (tuple of np.array)

The first nearest neighbour graph, (indices, distances).

nng_2: nearest neighbour graph (tuple of np.array)

The second nearest neighbour graph, (indices, distances).

n_neighbors: int

The number of nearest neighbors to consider

err: int

The number of incorrect NN pairs.
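
A hedged sketch comparing an Annchor graph against a brute force ground truth (the neighbor_graph attributes are assumed, as in the earlier sketches):

    import numpy as np
    from annchor import Annchor, BruteForce, compare_neighbor_graphs

    X = np.random.RandomState(4).randn(600, 10)  # illustrative data
    k = 15

    bruteforce = BruteForce(X, 'euclidean')       # exact ground truth
    bruteforce.fit()

    ann = Annchor(X, 'euclidean', n_neighbors=k)  # approximate graph
    ann.fit()

    # Number of incorrect NN pairs in the approximate graph.
    error = compare_neighbor_graphs(bruteforce.neighbor_graph,
                                    ann.neighbor_graph,
                                    k)
    print('%d incorrect NN pairs' % error)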

Data sets

ANNchor comes with various data sets for benchmarking.

annchor.datasets.load_digits()

Loads the UCI OCR digits data test set (1797 8x8 images).

https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

digits_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]

The digits data set.

digits_data[‘X’] is an np.array of ints, shape=(1797,64)

digits_data[‘y’] is an np.array of ints (labels), shape=(1797,)

digits_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data

digits_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. digits_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.

digits_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. digits_data[‘neighbor_graph’][1][i][j] is the jth smallest distance to index i.
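
For example:

    from annchor import Annchor
    from annchor.datasets import load_digits

    digits_data = load_digits()
    X = digits_data['X']                               # shape (1797, 64)
    y = digits_data['y']                               # shape (1797,)
    indices, distances = digits_data['neighbor_graph']

    print(indices[0][:5])    # indices of the five nearest points to item 0
    print(distances[0][:5])  # the corresponding distances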

annchor.datasets.load_digits_large()

Loads the full (train+test) UCI OCR digits data set (5620 8x8 images).

https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

digits_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]

The full digits data set.

digits_data[‘X’] is an np.array of ints, shape=(5620,64)

digits_data[‘y’] is an np.array of ints (labels), shape=(5620,)

digits_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data

digits_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. digits_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.

digits_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. digits_data[‘neighbor_graph’][1][i][j] is the jth smallest distance to index i.

annchor.datasets.load_graph_sp()

Loads the graph shortest path data set (800 vertices from a weighted graph). There are 10 clusters of vertices. The intra-cluster distances are shorter on average than the inter-cluster distances.

graph_sp_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]

The graph shortest path data set.

graph_sp_data[‘X’] is an np.array of vertex indices, shape=(800,)

graph_sp_data[‘y’] is an np.array of ints (labels), shape=(800,)

graph_sp_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data

graph_sp_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. graph_sp_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.

graph_sp_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. graph_sp_data[‘neighbor_graph’][1][i][j] is the jth smallest distance to index i.

annchor.datasets.load_strings()

Loads the string data set (1600 strings, length ~500). There are 8 clusters of two types: clouds (labels 0-4) and filaments (labels 5-7).

strings_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]

The string data set.

strings_data[‘X’] is an np.array of strings, shape=(1600,)

strings_data[‘y’] is an np.array of ints (labels), shape=(1600,)

strings_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data

strings_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. strings_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.

strings_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. strings_data[‘neighbor_graph’][1][i][j] is the jth smallest distance to index i.
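
Putting the pieces together, a hedged end-to-end sketch using this data set (the hyperparameters are illustrative, and the neighbor_graph attribute is assumed as in the earlier sketches):

    from annchor import Annchor, compare_neighbor_graphs
    from annchor.datasets import load_strings

    strings_data = load_strings()
    X = strings_data['X']
    neighbor_graph = strings_data['neighbor_graph']

    k = 15

    # Approximate k-NN graph under the Levenshtein metric.
    ann = Annchor(X, 'levenshtein', n_anchors=23, n_neighbors=k, p_work=0.12)
    ann.fit()

    # Compare against the precomputed 100-NN graph shipped with the data set.
    error = compare_neighbor_graphs(neighbor_graph, ann.neighbor_graph, k)
    print('%d incorrect NN pairs' % error)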