Basic Parameters
func_kwargs
Dictionary of keyword arguments required for the metric. For example, the kantorovich function from pynndescent takes a cost keyword:
from pynndescent.distances import kantorovich

M = ...  # your cost matrix

ann = annchor.Annchor(X,
                      kantorovich,
                      func_kwargs={'cost': M})
Alternatively, we could define a new function without keywords, e.g.
from pynndescent.distances import kantorovich
from numba import njit

M = ...  # your cost matrix

@njit()
def wasserstein(x, y):
    return kantorovich(x, y, cost=M)

ann = annchor.Annchor(X, wasserstein)
Or you could even use the string ‘wasserstein’, since this is one of Annchor’s inbuilt metrics:
M = ... # your cost matrix
ann = annchor.Annchor(X, 'wasserstein', func_kwargs={'cost_matrix':M})
n_anchors
The number of anchor points (waypoints) we wish to use. The default value is 20, and it should increase with the complexity of the underlying metric space. Picking an optimal value of n_anchors is more of an art than a science at this stage. For simple metric spaces we can get away with an extremely low number of anchors; for example, n_anchors=5 does pretty well for Euclidean distance in 2D! For most use cases we suggest between 20 and 30 anchor points. Remember that adding more anchor points uses more metric evaluations, so there is a trade-off between picking more anchors and evaluating more potential nearest neighbors. Given a fixed value of p_work (more on this later), the number of calls to the metric is fixed at n_evals = n_anchors*nx + n_samples + n_refine, so increasing n_anchors reduces n_samples and vice versa.
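To make the trade-off concrete, here is some illustrative arithmetic for that budget (hedged: the n_refine value below is a made-up allowance for this example, not a figure Annchor reports). With p_work fixed, the total metric-call budget is p_work * N, so calls spent on anchors (n_anchors * nx) come out of the calls available for sampling:

```python
nx = 10_000                      # number of points
N = nx * (nx - 1) // 2           # brute-force pair count
p_work = 0.1                     # fraction of brute-force work (see p_work below)
budget = int(p_work * N)         # total metric calls available

n_refine = 100_000               # hypothetical refinement allowance
for n_anchors in (20, 30):
    # whatever is left after anchor distances and refinement goes to sampling
    n_samples = budget - n_anchors * nx - n_refine
    print(n_anchors, n_samples)
```

Adding ten anchors here costs 100,000 metric calls that would otherwise have gone towards sampled training pairs.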
n_neighbors
Simply the number of nearest neighbors we are looking to find (i.e. the k in k-NN).
n_samples
ANNchor works by trying to approximate the metric given a variety of features (e.g. lower/upper bounds). Under the hood this uses a standard machine learning approach, which naturally requires training data. The parameter n_samples specifies how many pairwise distances we evaluate in order to create a training set on which to learn the metric. If this parameter is too small, we risk not learning the underlying metric accurately. Too big, and we end up wasting valuable calls to the metric on non-nearest-neighbor pairs.
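The lower/upper bound features mentioned above come from the triangle inequality: for any anchor a, |d(x,a) - d(y,a)| <= d(x,y) <= d(x,a) + d(y,a). A quick numeric sanity check with Euclidean distance (purely illustrative, not Annchor internals):

```python
import numpy as np

rng = np.random.default_rng(2)
x, y, a = rng.random((3, 2))          # two points and one "anchor"
d = lambda u, v: float(np.linalg.norm(u - v))

lower = abs(d(x, a) - d(y, a))        # triangle-inequality lower bound
upper = d(x, a) + d(y, a)             # triangle-inequality upper bound
assert lower <= d(x, y) <= upper
```

With several anchors, the tightest of these bounds per pair become cheap features from which the true distance can be estimated.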
p_work
At its heart, ANNchor wants to make as few calls to an expensive metric as possible. Of course, the more calls to the metric we make, the more accurate we will be (at the expense of run time). The p_work parameter allows us to control this trade-off. In the brute-force case we would make N = nx*(nx-1)//2 calls to the metric; ANNchor will make at most p_work*N calls, i.e. p_work is the proportion of the brute-force work we are willing to do.
locality
and loc_thresh
These parameters specify how we cut down the number of pairwise distances that may be considered throughout the algorithm, via a simple waypointing mechanism: for each point, find the set of its locality nearest anchors, and only consider pairs of points whose sets share at least loc_thresh anchors.
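The waypoint filter can be sketched in a few lines of numpy. This is a minimal illustration, assuming the point-to-anchor distances are available as an array D of shape (n_anchors, nx); the names here are illustrative, not Annchor internals:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 2))
anchors = X[:5]                                   # pretend the first 5 points are anchors
# point-to-anchor distance matrix, shape (n_anchors, nx)
D = np.linalg.norm(X[None, :, :] - anchors[:, None, :], axis=2)

locality, loc_thresh = 3, 1
# for each point, the indices of its `locality` nearest anchors
near = np.argsort(D, axis=0)[:locality].T         # shape (nx, locality)

def candidate(i, j):
    # keep pair (i, j) only if the two points share >= loc_thresh near anchors
    return len(set(near[i]) & set(near[j])) >= loc_thresh
```

Raising loc_thresh (or lowering locality) prunes more pairs, trading recall for speed.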
verbose
A simple boolean flag that determines whether or not we print diagnostic/progress messages.
backend
The backend argument for joblib parallelisation (for non-numba-jitted functions). Can be 'loky' or 'multiprocessing'. Typically 'loky' is more robust (and thus the default), but some set-ups will see performance increases from selecting 'multiprocessing'.
Advanced Parameters
For most use cases the parameters described in the previous section will suffice. However, there are some other parameters available which allow greater control over the ANNchor algorithm.
anchor_picker
The anchor_picker parameter allows you to supply your own anchor picker class. This class determines how anchors are sampled from the data. By default we use a max-min algorithm that picks anchors sequentially, choosing as the next anchor the point whose minimum distance to the existing anchors is largest.
sampler
The sampler
parameter allows you to supply your own sampler class. This tells ANNchor how to decide which pairs of points to pick for the sampling (training) phase.
regression
We can specify the machine learning algorithm to train on the features of the sample points. By default this is a stratified linear regression, but it could in principle be any ML algorithm. Remember that this needs to be much faster than the slow metric in order to succeed.
error_predictor
The error_predictor
class tries to predict the error associated with the regression, which is useful for assigning probabilities to predictions.
is_metric
This boolean argument should be set to False when our metric is not a true metric (in particular, when it violates the triangle inequality). This can sometimes be the case for approximate metrics. If the metric is approximate, ANNchor needs to take greater care with the triangle-inequality upper and lower bounds, since they may no longer represent reality.
get_exact_ijs
This parameter allows the user to specify their own parallelisation for computing large numbers of pairwise distances.
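As a rough sketch of what such a function looks like, the following evaluates a metric over an array of index pairs using a thread pool. Hedged assumptions: the exact signature Annchor expects for get_exact_ijs should be checked against its API reference; the (f, X, IJ) shape below is illustrative, and for genuinely slow metrics a process pool or joblib would usually be preferable to threads:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def get_exact_ijs(f, X, IJ):
    # evaluate f(X[i], X[j]) for each pair (i, j) in IJ, in parallel
    with ThreadPoolExecutor() as ex:
        d = list(ex.map(lambda ij: f(X[ij[0]], X[ij[1]]), IJ))
    return np.array(d)

# tiny usage example with a toy metric
f = lambda a, b: abs(a - b)
X = np.arange(5)
dists = get_exact_ijs(f, X, [(0, 1), (2, 4)])
```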
niters
This parameter defines how many inner loops of recalculating upper/lower bounds occur. There is a time-accuracy trade-off here: higher values of niters will be more accurate at the expense of extra computation.