ANNchor API Guide

ANNchor contains two classes: the main class, Annchor, and a brute force class, BruteForce.

Annchor

class annchor.Annchor(X, func, func_kwargs=None, n_anchors=20, n_neighbors=15, n_samples=5000, p_work=0.1, anchor_picker=None, sampler=None, regression=None, error_predictor=None, random_seed=42, locality=5, loc_thresh=1, loc_min=None, verbose=False, is_metric=True, get_exact_ijs=None, backend='loky', niters=2, lookahead=5)

Quickly computes the approximate k-NN graph for slow metrics

X: np.array or list (required)

The data set for which we want to find the k-NN graph.

func: function, numba-jitted function or string (required)

The metric under which the k-NN graph should be evaluated. This can be a user supplied function or a string (see the sketch after the list below). Currently supported string arguments are:

  • euclidean

  • cosine

  • levenshtein

  • wasserstein (requires cost_matrix kwarg)
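
For illustration, a minimal sketch of both options is given below, using a small random data set and a numba-jitted Euclidean metric (the data and the function name are purely illustrative):

    import numpy as np
    from numba import njit
    from annchor import Annchor

    X = np.random.RandomState(0).randn(1000, 10)  # illustrative data

    # Option 1: a supported metric passed by name.
    ann = Annchor(X, 'euclidean', n_neighbors=15)

    # Option 2: a user supplied (here numba-jitted) function taking two
    # elements of X and returning a float.
    @njit
    def my_euclidean(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    ann = Annchor(X, my_euclidean, n_neighbors=15)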

func_kwargs: dict (optional, default None)

Dictionary of keyword arguments for the metric

n_anchors: int (optional, default 20)

The number of anchor points. Increasing the number of anchors increases the k-NN graph accuracy at the expense of speed.

n_neighbors: int (optional, default 15)

The number of nearest neighbors to compute (i.e. the value of k for the k-NN graph).

n_samples: int (optional, default 5000)

The number of sampled distances used to estimate the error distribution E = d - dhat (true distance minus predicted distance).

p_work: float (optional, default 0.1)

The approximate fraction of brute force calculations which we are willing to make.

anchor_picker: AnchorPicker (optional, default MaxMinAnchorPicker())

The anchor picker class which specifies the anchor points.

sampler: Sampler (optional, default SimpleStratifiedSampler())

The sampler class which chooses the sample pairs.

regression: Regression (optional, default SimpleStratifiedLinearRegression())

The regression class which predicts distances from features.

error_predictor: ErrorRegression (optional, default SimpleStratifiedErrorRegression())

The error regression class which predicts error distributions.

locality: int (optional, default 5)

The number of anchor points to use in the permutation based k-NN (dhat) step.

loc_thresh: int (optional, default 1)

The minimum number of anchors in common for which we allow an item into NN_x.

loc_min: int (optional, default 10*n_neighbors)

The minimum number of points in NN_x

verbose: bool (optional, default False)

Set verbose=True for more interim output.

is_metric: bool (optional, default True)

Set is_metric=False if the metric may violate the triangle inequality. With is_metric=True, predictions are clipped between upper/lower bounds, which may not be accurate if triangle inequality is violated.

get_exact_ijs: function (optional, default None)

An optional user supplied function for evaluating the metric on an array of indices. Useful if you wish to supply your own parallelisation. get_exact_ijs(f,X,IJ) should return np.array([f(X[i],X[j]) for i,j in IJ]).
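
As a hedged illustration of this contract, a minimal serial implementation might look as follows (the function name is illustrative; a real replacement would typically swap the loop for its own parallelisation):

    import numpy as np

    # Minimal serial implementation of the get_exact_ijs contract: given
    # the metric f, the data X and an array of index pairs IJ, return the
    # corresponding distances.
    def my_get_exact_ijs(f, X, IJ):
        return np.array([f(X[i], X[j]) for i, j in IJ])

    # ann = Annchor(X, func, get_exact_ijs=my_get_exact_ijs)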

backend: string (optional, default “loky”)

Specifies the joblib Parallel backend. Can be “loky” or “multiprocessing”. In general “loky” seems to be more robust, but occasionally “multiprocessing” can be significantly faster.
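
A minimal end-to-end sketch follows. The random data are illustrative, and it is assumed that the fitted graph is exposed as an (indices, distances) tuple via the neighbor_graph attribute:

    import numpy as np
    from annchor import Annchor

    X = np.random.RandomState(42).randn(1000, 20)  # illustrative data

    ann = Annchor(X,
                  'euclidean',
                  n_anchors=20,
                  n_neighbors=15,
                  p_work=0.1)
    ann.fit()

    # Assumed: the k-NN graph is stored as an (indices, distances) tuple.
    indices, distances = ann.neighbor_graph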

annchor_selective_subset(y, dne=None, alpha=0)

Generates a selective subset for X given labels y.

y: np.array, shape=(len(X),)

Data labels (corresponding to elements in X)

dne: tuple or None (optional, default=None)

The nearest enemy graph for X,y. dne[0] are the indices of the nearest enemies for points in X. dne[1] are the corresponding distances.

alpha: float (default=0)

Value for alpha-ss. The nearest enemy distance is multiplied by 1/(1+alpha), to account for approximate nearest neighbour methods. In practice, larger alpha values make the subset more robust to errors at the expense of increasing the size of the subset.

ss: np.array

The indices of X which form the selective subset.
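
A hedged usage sketch (data and labels are illustrative, and it is assumed that fit() has been called first):

    import numpy as np
    from annchor import Annchor

    rng = np.random.RandomState(0)
    X = rng.randn(500, 10)            # illustrative data
    y = rng.randint(0, 3, size=500)   # illustrative class labels

    ann = Annchor(X, 'euclidean', n_neighbors=15)
    ann.fit()

    # Indices of a selective subset of X with respect to the labels y
    # (dne=None, so the nearest enemy graph is presumably computed internally).
    ss = ann.annchor_selective_subset(y, alpha=0.05)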

fit()

Computes the approx nearest neighbour graph.

get_anchors()

Gets the anchors and distances to anchors. Anchors are stored in self.A, distances in self.D.

self.A: np.array, shape=(n_anchors,)

Array of anchor indices.

self.D: np.array, shape=(nx, n_anchors)

Array of distances to anchor points.

get_features_IJ(IJs, I)

Computes features for distances in IJs.

IJs: np.array, shape=(?,2)

Indices of distances for which we will find features.

I: numba.typed.Dict dict of arrays

I[i] is the array of indices into IJs which contain index i.

feature_names: List

List of feature names (as strings).

features: np.array

Array of features corresponding to the distances in IJs.

not_computed_mask: np.array

Array of bools indicating whether or not a distance in IJs has been explicitly computed.

get_locality()

Uses a basic permutation/set method to find candidate nearest neighbours.

Current Technique (Use something better in future):

For each point i, find the set S_i of its nearest l=locality anchor points. For each other point j, compute the intersection S_i ∩ S_j. Only consider pairs (i, j) where |S_i ∩ S_j| >= loc_thresh.

self.check: Dict, keys=int64, val=int64[:]

check[i] is the array of candidate nearest neighbour indices for index i.
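
The technique above can be sketched in plain NumPy/Python as follows (this illustrates the idea only, and is not the library's internal implementation):

    import numpy as np

    def candidate_pairs(D, locality=5, loc_thresh=1):
        """Illustrative locality filter. D is the (n_points, n_anchors)
        array of distances to anchors. Each point keeps the set of its
        `locality` nearest anchors; a pair (i, j) is kept as a candidate
        when the two sets share at least `loc_thresh` anchors."""
        S = [set(row) for row in np.argsort(D, axis=1)[:, :locality]]
        n = len(S)
        return [(i, j)
                for i in range(n)
                for j in range(i + 1, n)
                if len(S[i] & S[j]) >= loc_thresh]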

get_nearest_enemies(y, nn=3, loc_min=100)

Computes the nearest enemy graph (nearest neighbours with a different label). The result is stored in self.nearest_enemy_graph.

y: np.array, shape=(len(X),)

Data labels (corresponding to elements in X)

nn: int > 0

Desired number of nearest enemies

loc_min: int > 0

Minimum number of points to store in locality (i.e. for each index we will compute features for at least loc_min points, if possible.)

get_sample()

Gets the sample of pairwise distances on which to train dhat/errors.

self.G: np.array, shape=(n_samples,3)

Array storing the sample distances and features (for future regression). G[i,:-1] are the features for sample pair i. G[i,-1] is the true distance for sample pair i.

query(Q, nn=15, p_work=0.3, get_exact_query_ijs=None)

Query new data against the existing index.

Q: np.array, shape=(n_samples, n_features)

The new data to query against X

nn: int > 0

Desired number of nearest neighbours

p_work: 0.0 < float <= 1.0

Fraction of brute force work to perform. A value of p_work=1 will carry out len(Q)*len(X) distance computations (equivalent to brute forcing the solution).

get_exact_query_ijs: function (optional, default None)

An optional user supplied function for evaluating the metric on an array of indices. Useful if you wish to supply your own parallelisation. get_exact_query_ijs(f,X,Z,IJ) should return np.array([f(X[i],Z[j]) for i,j in IJ]).
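
A hedged usage sketch (data are illustrative; since the return format of query is not documented above, the result is left unpacked):

    import numpy as np
    from annchor import Annchor

    rng = np.random.RandomState(1)
    X = rng.randn(1000, 10)   # indexed data
    Q = rng.randn(50, 10)     # new query points

    ann = Annchor(X, 'euclidean', n_neighbors=15)
    ann.fit()

    # Approximate 15-NN of each row of Q against X, using at most roughly
    # 30% of the len(Q)*len(X) brute force distance computations.
    result = ann.query(Q, nn=15, p_work=0.3)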

to_sparse_matrix()

Returns the k-NN graph as a dictionary-of-keys (DOK) sparse distance matrix.
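
A brief sketch, assuming the return value is a scipy dictionary-of-keys sparse matrix and therefore supports the usual scipy conversions:

    import numpy as np
    from annchor import Annchor

    X = np.random.RandomState(2).randn(500, 8)  # illustrative data
    ann = Annchor(X, 'euclidean', n_neighbors=15)
    ann.fit()

    D = ann.to_sparse_matrix()   # assumed: scipy DOK sparse distance matrix
    D_csr = D.tocsr()            # convert for downstream scipy/sklearn use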

class annchor.BruteForce(X, func, func_kwargs=None, verbose=False, get_exact_ijs=None, backend='loky')

Computes the exact k-NN graph by brute force.

X: np.array or list (required)

The data set for which we want to find the k-NN graph.

func: function, numba-jitted function or string (required)

The metric under which the k-NN graph should be evaluated. This can be a user supplied function or a string. Currently supported string arguments are:

  • euclidean

  • cosine

  • levenshtein

func_kwargs: dict (optional, default None)

Dictionary of keyword arguments for the metric

get_exact_ijs: function (optional, default None)

An optional user supplied function for evaluating the metric on an array of indices. Useful if you wish to supply your own parallelisation. get_exact_ijs(f,X,IJ) should return np.array([f(X[i],X[j]) for i,j in IJ]).

backend: string (optional, default “loky”)

Specifies the joblib Parallel backend. Can be “loky” or “multiprocessing”. In general “loky” seems to be more robust, but occasionally “multiprocessing” can be significantly faster.

fit()

Computes the k-NN graph by brute force.

get_neighbor_graph()

Gets the k-NN graph from the all-pairs distance matrix.
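
A minimal usage sketch, assuming (as with Annchor) that the fitted graph is exposed as an (indices, distances) tuple via the neighbor_graph attribute:

    import numpy as np
    from annchor import BruteForce

    X = np.random.RandomState(3).randn(400, 10)  # illustrative data

    bruteforce = BruteForce(X, 'euclidean')
    bruteforce.fit()

    # Assumed: the exact k-NN graph as an (indices, distances) tuple.
    indices, distances = bruteforce.neighbor_graph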

Utility Functions

The annchor package also includes some utility functions, e.g. for measuring neighbor graph accuracy.

annchor.compare_neighbor_graphs(nng_1, nng_2, n_neighbors)

Compares accuracy of k-NN graphs. The second graph is compared against the first. This measure of accuracy accounts for cases where the indices differ but the distances are equivalent.

e.g. if nng_1[0][0]=[0, 1, 2, 3], nng_1[1][0]=[0, 1, 1, 2],

and nng_2[0][0]=[0, 1, 2, 4], nng_2[1][0]=[0, 1, 1, 2],

there would be zero incorrect NN pairs, since both ix=3 and ix=4 are valid 4th nearest neighbors.

nng_1: nearest neighbour graph (tuple of np.array)

The first nearest neighbour graph, (indices, distances).

nng_2: nearest neighbour graph (tuple of np.array)

The second nearest neighbour graph, (indices, distances).

n_neighbors: int

The number of nearest neighbors to consider

err: int

The number of incorrect NN pairs.
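
A hedged sketch comparing an Annchor graph against a brute force ground truth (the neighbor_graph attributes are assumed, as in the earlier sketches):

    import numpy as np
    from annchor import Annchor, BruteForce, compare_neighbor_graphs

    X = np.random.RandomState(4).randn(600, 10)  # illustrative data
    k = 15

    bruteforce = BruteForce(X, 'euclidean')       # exact ground truth
    bruteforce.fit()

    ann = Annchor(X, 'euclidean', n_neighbors=k)  # approximate graph
    ann.fit()

    # Number of incorrect NN pairs in the approximate graph.
    error = compare_neighbor_graphs(bruteforce.neighbor_graph,
                                    ann.neighbor_graph,
                                    k)
    print('%d incorrect NN pairs' % error)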

Data sets

ANNchor comes with various data sets for benchmarking.

annchor.datasets.load_digits()

Loads the UCI OCR digits data test set (1797 8x8 images).

https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

digits_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]

The digits data set.

digits_data[‘X’] is an np.array of ints, shape=(1797,64)

digits_data[‘y’] is an np.array of ints (labels), shape=(1797,)

digits_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data

digits_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. digits_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.

digits_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. digits_data[‘neighbor_graph’][1][i][j] is the jth smallest distance to index i.
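
For example:

    from annchor import Annchor
    from annchor.datasets import load_digits

    digits_data = load_digits()
    X = digits_data['X']                               # shape (1797, 64)
    y = digits_data['y']                               # shape (1797,)
    indices, distances = digits_data['neighbor_graph']

    print(indices[0][:5])    # indices of the five nearest points to item 0
    print(distances[0][:5])  # the corresponding distances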

annchor.datasets.load_digits_large()

Loads the full (train+test) UCI OCR digits data set (5620 8x8 images).

https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

digits_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]

The full digits data set.

digits_data[‘X’] is an np.array of ints, shape=(5620,64)

digits_data[‘y’] is an np.array of ints (labels), shape=(5620,)

digits_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data

digits_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. digits_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.

digits_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. digits_data[‘neighbor_graph’][1][i][j] is the jth smallest distance to index i.

annchor.datasets.load_graph_sp()

Loads the graph shortest path data set (800 vertices from a weighted graph). There are 10 clusters of vertices. The intra-cluster distances are shorter on average than the inter-cluster distances.

graph_sp_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]

The graph shortest path data set.

graph_sp_data[‘X’] is an np.array of vertex indices, shape=(800,)

graph_sp_data[‘y’] is an np.array of ints (labels), shape=(800,)

graph_sp_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data

graph_sp_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. graph_sp_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.

graph_sp_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. graph_sp_data[‘neighbor_graph’][1][i][j] is the jth smallest distance to index i.

annchor.datasets.load_strings()

Loads the string data set (1600 strings, length ~500). There are 8 clusters of two types: clouds (labels 0-4) and filaments (labels 5-7).

strings_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]

The string data set.

strings_data[‘X’] is an np.array of strings, shape=(1600,)

strings_data[‘y’] is an np.array of ints (labels), shape=(1600,)

strings_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data

strings_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. strings_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.

strings_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. strings_data[‘neighbor_graph’][1][i][j] is the jth smallest distance to index i.
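
Putting the pieces together, a hedged end-to-end sketch using this data set (the hyperparameters are illustrative, and the neighbor_graph attribute is assumed as in the earlier sketches):

    from annchor import Annchor, compare_neighbor_graphs
    from annchor.datasets import load_strings

    strings_data = load_strings()
    X = strings_data['X']
    neighbor_graph = strings_data['neighbor_graph']

    k = 15

    # Approximate k-NN graph under the Levenshtein metric.
    ann = Annchor(X, 'levenshtein', n_anchors=23, n_neighbors=k, p_work=0.12)
    ann.fit()

    # Compare against the precomputed 100-NN graph shipped with the data set.
    error = compare_neighbor_graphs(neighbor_graph, ann.neighbor_graph, k)
    print('%d incorrect NN pairs' % error)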