ANNchor API Guide
ANNchor contains two classes, the ANNchor base class Annchor
, and a
brute force class BruteForce
.
Annchor
- class annchor.Annchor(X, func, func_kwargs=None, n_anchors=20, n_neighbors=15, n_samples=5000, p_work=0.1, anchor_picker=None, sampler=None, regression=None, error_predictor=None, random_seed=42, locality=5, loc_thresh=1, loc_min=None, verbose=False, is_metric=True, get_exact_ijs=None, backend='loky', niters=2, lookahead=5)
Quickly computes the approximate k-NN graph for slow metrics
- X: np.array or list (required)
The data set for which we want to find the k-NN graph.
- func: function, numba-jitted function or string (required)
The metric under which the k-NN graph should be evaluated. This can be a user supplied function or a string. Currently supported string arguments are
euclidean
cosine
levenshtein
wasserstein (requires cost_matrix kwarg)
- func_kwargs: dict (optional, default None)
Dictionary of keyword arguments for the metric
- n_anchors: int (optional, default 20)
The number of anchor points. Increasing the number of anchors increases the k-NN graph accuracy at the expense of speed.
- n_neighbors: int (optional, default 15)
The number of nearest neighbors to compute (i.e. the value of k for the k-NN graph).
- n_samples: int (optional, default 5000)
The number of sample distances used to compute the error distribution (E = d-dhat).
- p_work: float (optional, default 0.1)
The approximate percentage of brute force calculations which we are willing to make.
- anchor_picker: AnchorPicker (optional, default MaxMinAnchorPicker())
The anchor picker class which specifies the anchor points.
- sampler: Sampler (optional, default SimpleStratifiedSampler())
The sampler class which chooses the sample pairs.
- regression: Regression
(optional, default SimpleStratifiedLinearRegression()) The regression class which predicts distances from features.
- error_predictor: ErrorRegression
(optional, default SimpleStratifiedErrorRegression()) The error regression class which predicts error distributions.
- locality: int (optional, default 5)
The number of anchor points to use in the permutation based k-NN (dhat) step.
- loc_thresh: int (optional, default 1)
The minimum number of anchors in common for which we allow an item into NN_x.
- loc_min: int (optional, default 10*n_neighbors)
The minimum number of points in NN_x
- verbose: bool (optional, default False)
Set verbose=True for more interim output.
- is_metric: bool (optional, default True)
Set is_metric=False if the metric may violate the triangle inequality. With is_metric=True, predictions are clipped between upper/lower bounds, which may not be accurate if triangle inequality is violated.
- get_exact_ijs: function (optional, default None)
An optional user supplied function for evaluating the metric on an array of indices. Useful if you wish to supply your own parallelisation. get_exact_ijs(f,X,IJ) should return np.array([f(X[i],X[j] for i,j in IJ]).
- backend: string (optional, default “loky”)
Specifies the joblib Parallel backend. Can be “loky” or “multiprocessing” In general “loky” seems to be more robust, but occasionally “multiprocessing” can be significantly faster
- annchor_selective_subset(y, dne=None, alpha=0)
Generates a selective subset for X given labels y.
- y: np.array, shape=(len(X),)
Data labels (corresponding to elements in X)
- dne: tuple or None (optional, default=None)
The nearest enemy graph for X,y. dne[0] are the indices of the nearest enemies for points in X. dne[1] are the corresponding distances.
- alpha: float (default=0)
Value for alpha-ss. The nearest enemy distance is multiplied by 1/(1+alpha), to account for approximate nearest neighbour methods. In practice, larger alpha values make the subset more robust to errors at the expense of increasing the size of the subset.
- ss: np.array
The indices of X which form the selective subset.
- fit()
Computes the approx nearest neighbour graph.
- get_anchors()
Gets the anchors and distances to anchors. Anchors are stored in self.A, distances in self.D.
- self.A: np.array, shape=(n_anchors,)
Array of anchor indices.
- self.D: np.array, shape=(nx, n_anchors)
Array of distances to anchor points.
- get_features_IJ(IJs, I)
Computes features for distances in IJs.
- IJs: np.array, shape=(?,2)
Indices of distances for which we will find features.
- I: numba.typed.Dict dict of arrays
I[i] is the array of indices into IJs which contain index i.
- feature_names: List
List of features names (as strings)
- features: np.array
Array of features corresponding to distances in ijs
- not_computed_mask: np.array
Array of bools indicating whether or not a distance in IJs has been explicitly computed.
- get_locality()
Uses basic permutation/set method to find candidate nearest neighbours.
- Current Technique (Use something better in future):
For each point i, find the set S_i of its nearest l=locality anchor points. For each other point j, calculate (S_i intersect S_j). Only consider pairs ij where |(S_i intersect S_j)|>=loc_thresh.
- self.check: Dict, keys=int64, val=int64[:]
check[i] is the array of candidate nearest neighbour indices for index j.
- get_nearest_enemies(y, nn=3, loc_min=100)
Computes the nearest enemy graph Result is stored in self.nearest_enemy_graph
- y: np.array, shape=(len(X),)
Data labels (corresponding to elements in X)
- nn: int > 0
Desired number of nearest enemies
- loc_min: int > 0
Minimum number of points to store in locality (i.e. for each index we will compute features for at least loc_min points, if possible.)
- get_sample()
Gets the sample of pairwise distances on which to train dhat/errors.
- self.G: np.array, shape=(n_samples,3)
Array storing the sample distances and features (for future regression). G[i,:-1] are the features for sample pair i. G[i,-1] is the true distance for sample pair i.
- query(Q, nn=15, p_work=0.3, get_exact_query_ijs=None)
Query new data against the existing index.
- Q: np.array, shape=(n_samples, n_features)
The new data to query against X
- nn: int > 0
Desired number of nearest neighbours
- p_work: 0.0 < float < 1.0
Fraction of brute force work to perform. A value of p_work=1 will carry out len(Q)*len(X) distance computations (equivalent to brute forcing the solution).
- get_exact_query_ijs: function (optional, default None)
An optional user supplied function for evaluating the metric on an array of indices. Useful if you wish to supply your own parallelisation. get_exact_ijs(f,X,Z,IJ) is the function should return np.array([f(X[i],Z[j]) for i,j in IJ]).
- to_sparse_matrix()
Returns the K-NN graph as a dictionary of keys sparse distance matrix.
- class annchor.BruteForce(X, func, func_kwargs=None, verbose=False, get_exact_ijs=None, backend='loky')
Computes the approximate k-NN graph by brute force
- X: np.array or list (required)
The data set for which we want to find the k-NN graph.
- func: function, numba-jitted function (required) or string.
The metric under which the k-NN graph should be evaluated. This can be a user supplied function or a string. Currently supported string arguments are
euclidean
cosine
levenshtein
- func_kwargs: dict (optional, default None)
Dictionary of keyword arguments for the metric
- get_exact_ijs: function (optional, default None)
An optional user supplied function for evaluating the metric on an array of indices. Useful if you wish to supply your own parallelisation. get_exact_ijs(f,X,IJ) should return np.array([f(X[i],X[j] for i,j in IJ]).
- backend: string (optional, default “loky”)
Specifies the joblib Parallel backend. Can be “loky” or “multiprocessing” In general “loky” seems to be more robust, but occasionally “multiprocessing” can be significantly faster
- fit()
get_neighbor_graph
Gets the k-NN graph from the all pairs distance matrix
Utility Functions
The annchor package also includes some utility functions, e.g. finding neighbor graph accuracy.
- annchor.compare_neighbor_graphs(nng_1, nng_2, n_neighbors)
Compares accuracy of k-NN graphs. The second graph is compared against the first. This measure of accuracy accounts for cases where the indices differ but the distances are equivalent.
e.g. if nng_1[0][0]=[0, 1, 2, 3], nng_1[0][1]=[0, 1, 1, 2],
and nng_2[0][0]=[0, 1, 2, 4], nng_1[0][1]=[0, 1, 1, 2],
There would be zero incorrect NN pairs, since both ix=3 and ix=4 are valid 4th nearest neighbors.
- nng_1: nearest neighbour graph (tuple of np.array)
The first nearest neighbour graph, (indices, distances).
- nng_2: nearest neighbour graph (tuple of np.array)
The second nearest neighbour graph, (indices, distances)..
- n_neighbors: int
The number of nearest neighbors to consider
- err: int
The number of incorrect NN pairs.
Data sets
ANNchor comes with various data sets for benchmarking.
- annchor.datasets.load_digits()
Loads the UCI OCR digits data test set (1797 8x8 images).
https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- digits_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]
The string data set.
digits_data[‘X’] is an np.array ints, shape=(1797,64)
digits_data[‘y’] is an np.array of ints (labels), shape=(1797,)
digits_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data
digits_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. digits_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.
digits_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. digits_data[‘neighbor_graph’][0][i][j] is the jth smallest distance to index i.
- annchor.datasets.load_digits_large()
Loads the full (train+test) UCI OCR digits data set (5620 8x8 images).
https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- digits_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]
The string data set.
digits_data[‘X’] is an np.array ints, shape=(5620,64)
digits_data[‘y’] is an np.array of ints (labels), shape=(5620,)
digits_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data
digits_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. digits_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.
digits_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. digits_data[‘neighbor_graph’][0][i][j] is the jth smallest distance to index i.
- annchor.datasets.load_graph_sp()
Loads the graph shortest path data set (800 vertices from a weighted graph). There are 10 clusters of vertices. The intra-cluster distances are shorter on average than the inter-cluster distances.
- graph_sp_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]
The string data set.
graph_sp_data[‘X’] is an np.array of vertex indices, shape=(800,)
graph_sp_data[‘y’] is an np.array of ints (labels), shape=(800,)
graph_sp_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data
graph_sp_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. graph_sp_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.
graph_sp_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. graph_sp_data[‘neighbor_graph’][0][i][j] is the jth smallest distance to index i.
- annchor.datasets.load_strings()
Loads the string data set (1600 strings, length ~500). There are 8 clusters of two types: clouds (labels 0-4) and filaments (labels 5-7).
- strings_data: dict, keys=[‘X’, ‘y’, ‘neighbor_graph’]
The string data set.
strings_data[‘X’] is an np.array of strings, shape=(1600,)
strings_data[‘y’] is an np.array of ints (labels), shape=(1600,)
strings_data[‘neighbor_graph’] is a tuple containing the 100-NN graph data
strings_data[‘neighbor_graph’][0] are the 100-NN indices, i.e. strings_data[‘neighbor_graph’][0][i][j] is the jth nearest index to index i.
strings_data[‘neighbor_graph’][1] are the 100-NN distances, i.e. strings_data[‘neighbor_graph’][0][i][j] is the jth smallest distance to index i.