The implementation was taken from the SciPy project (commit
f82b3f9d3c30cc6e7f906b0d00b8b43bfea02e5c) and modified to drop the
scipy.sparse and scipy._lib dependencies so that moleculekit
does not need to pull in SciPy. See moleculekit.kdtree._ckdtree
for details on the changes.
The public API is KDTree and the lower-level cKDTree.
This class provides an index into a set of k-dimensional points
which can be used to rapidly look up the nearest neighbors of any
point.
Parameters:
data (array_like, shape (n,m)) – The n data points of dimension m to be indexed. This array is
not copied unless this is necessary to produce a contiguous
array of doubles. The data are also copied if the kd-tree is built
with copy_data=True. Concurrently modifying the data while constructing
a KDTree is not well-defined and may lead to crashes or data
corruption. If no copy happens, modifying the data after creating
the KDTree may lead to data corruption or incorrect answers when
searching the tree.
leafsize (positive int, optional) – The number of points at which the algorithm switches over to
brute-force. Default: 10.
compact_nodes (bool, optional) – If True, the kd-tree is built to shrink the hyperrectangles to
the actual data range. This usually gives a more compact tree that
is robust against degenerated input data and gives faster queries
at the expense of longer build time. Default: True.
copy_data (bool, optional) – If True the data is always copied to protect the kd-tree against
data corruption. Default: False.
balanced_tree (bool, optional) – If True, the median is used to split the hyperrectangles instead of
the midpoint. This usually gives a more compact tree and
faster queries at the expense of longer build time. Default: True.
boxsize (array_like or scalar, optional) – Apply an m-d toroidal topology to the KDTree. The topology is generated
by \(x_i + n_i L_i\) where \(n_i\) are integers and \(L_i\)
is the boxsize along i-th dimension. The input data shall be wrapped
into \([0, L_i)\). A ValueError is raised if any of the data is
outside of this bound.
The n data points of dimension m to be indexed. This array is
not copied unless this is necessary to produce a contiguous
array of doubles. The data are also copied if the kd-tree is built
with copy_data=True.
If boxsize is passed to the initializer, this will be set to the bounds
of the periodic box. None if no boxsize is passed.
Type:
None or ndarray, shape (m.)
Notes
The algorithm used is described in [1]_.
The general idea is that the kd-tree is a binary tree, each of whose
nodes represents an axis-aligned hyperrectangle. Each node specifies
an axis and splits the set of points based on whether their coordinate
along that axis is greater than or less than a particular value.
During construction, the axis and splitting point are chosen by the
“sliding midpoint” rule, which ensures that the cells do not all
become long and thin.
The tree can be queried for the r closest neighbors of any given point
(optionally returning only those within some maximum distance of the
point). It can also be queried, with a substantial gain in efficiency,
for the r approximate closest neighbors.
For large dimensions (20 is already large) do not expect this to run
significantly faster than brute force. High-dimensional nearest-neighbor
queries are a substantial open problem in computer science.
Count the number of pairs (x1,x2) can be formed, with x1 drawn
from self and x2 drawn from other, and where
distance(x1,x2,p)<=r.
Data points on self and other are optionally weighted by the
weights argument. (See below)
This is adapted from the “two-point correlation” algorithm described by
Gray and Moore [1]_. See notes for further discussion.
Parameters:
other (KDTree) – The other tree to draw points from, can be the same tree as self.
r (float or one-dimensional array of floats) – The radius to produce a count for. Multiple radii are searched with
a single tree traversal.
If the count is non-cumulative (cumulative=False), r
defines the edges of the bins, and must be non-decreasing.
p (float, optional) – 1<=p<=infinity.
Which Minkowski p-norm to use.
Default 2.0.
A finite large p may cause a ValueError if overflow can occur.
If None, the pair-counting is unweighted.
If given as a tuple, weights[0] is the weights of points in
self, and weights[1] is the weights of points in other;
either can be None to indicate the points are unweighted.
If given as an array_like, weights is the weights of points in
self and other. For this to make sense, self and
other must be the same tree. If self and other are two
different trees, a ValueError is raised.
Default: None
Whether the returned counts are cumulative. When cumulative is set
to False the algorithm is optimized to work with a large number
of bins (>10) specified by r. When cumulative is set to
True, the algorithm is optimized to work with a small number of
r. Default: True
New in version 1.6.0.
Returns:
result – The number of pairs. For unweighted counts, the result is integer.
For weighted counts, the result is float.
If cumulative is False, result[i] contains the counts with
(-infifi==0elser[i-1])<R<=r[i]
Return type:
scalar or 1-D array
Notes
Pair-counting is the basic operation used to calculate the two point
correlation functions from a data set composed of position of objects.
Two point correlation function measures the clustering of objects and
is widely used in cosmology to quantify the large scale structure
in our Universe, but it may be useful for data analysis in other fields
where self-similar assembly of objects also occur.
The Landy-Szalay estimator for the two point correlation function of
D measures the clustering signal in D. [2]_
For example, given the position of two sets of objects,
objects D (data) contains the clustering signal, and
where the brackets represents counting pairs between two data sets
in a finite bin around r (distance), corresponding to setting
cumulative=False, and f=float(len(D))/float(len(R)) is the
ratio between number of objects from data and random.
The algorithm implemented here is loosely based on the dual-tree
algorithm described in [1]_. We switch between two different
pair-cumulation scheme depending on the setting of cumulative.
The computing time of the method we use when for
cumulative==False does not scale with the total number of bins.
The algorithm for cumulative==True scales linearly with the
number of bins, though it is slightly faster when only
1 or 2 bins are used. [5]_.
As an extension to the naive pair-counting,
weighted pair-counting counts the product of weights instead
of number of pairs.
Weighted pair-counting is used to estimate marked correlation functions
([3]_, section 2.2),
or to properly calculate the average of data per distance bin
(e.g. [4]_, section 2.1 on redshift).
Examples
You can count neighbors number between two kd-trees within a distance:
x (array_like, last dimension self.m) – An array of points to query.
k (int or Sequence[int], optional) – Either the number of nearest neighbors to return, or a list of the
k-th nearest neighbors to return, starting from 1.
eps (nonnegative float, optional) – Return approximate nearest neighbors; the kth returned value
is guaranteed to be no further than (1+eps) times the
distance to the real kth nearest neighbor.
p (float, 1<=p<=infinity, optional) – Which Minkowski p-norm to use.
1 is the sum-of-absolute-values distance (“Manhattan” distance).
2 is the usual Euclidean distance.
infinity is the maximum-coordinate-difference distance.
A large, finite p may cause a ValueError if overflow can occur.
distance_upper_bound (nonnegative float, optional) – Return only neighbors within this distance. This is used to prune
tree searches, so if you are doing a series of nearest-neighbor
queries, it may help to supply the distance to the nearest neighbor
of the most recent point.
Number of workers to use for parallel processing. If -1 is given
all CPU threads are used. Default: 1.
New in version 1.6.0.
Returns:
d (float or array of floats) – The distances to the nearest neighbors.
If x has shape tuple+(self.m,), then d has shape
tuple+(k,).
When k == 1, the last dimension of the output is squeezed.
Missing neighbors are indicated with infinite distances.
Hits are sorted by distance (nearest first).
Changed in version 1.9.0: Previously if k=None, then d was an object array of
shape tuple, containing lists of distances. This behavior
has been removed, use query_ball_point instead.
i (int or array of integers) – The index of each neighbor in self.data.
i is the same shape as d.
Missing neighbors are indicated with self.n.
x (array_like, shape tuple + (self.m,)) – The point or points to search for neighbors of.
r (array_like, float) – The radius of points to return, must broadcast to the length of x.
p (float, optional) – Which Minkowski p-norm to use. Should be in the range [1, inf].
A finite large p may cause a ValueError if overflow can occur.
eps (nonnegative float, optional) – Approximate search. Branches of the tree are not explored if their
nearest points are further than r/(1+eps), and branches are
added in bulk if their furthest points are nearer than
r*(1+eps).
Sorts returned indices if True and does not sort them if False. If
None, does not sort single point queries, but does sort
multi-point queries which was the behavior before this option
was added.
Return the number of points inside the radius instead of a list
of the indices.
New in version 1.6.0.
Returns:
results – If x is a single point, returns a list of the indices of the
neighbors of x. If x is an array of points, returns an object
array of shape tuple containing lists of neighbors.
If you have many points whose neighbors you want to find, you may save
substantial amounts of time by putting them in a KDTree and using
query_ball_tree.
Find all pairs of points between self and other whose distance is
at most r.
Parameters:
other (KDTree instance) – The tree containing points to search against.
r (float) – The maximum distance, has to be positive.
p (float, optional) – Which Minkowski norm to use. p has to meet the condition
1<=p<=infinity.
eps (float, optional) – Approximate search. Branches of the tree are not explored
if their nearest points are further than r/(1+eps), and
branches are added in bulk if their furthest points are nearer
than r*(1+eps). eps has to be non-negative.
Returns:
results – For each element self.data[i] of this tree, results[i] is a
list of the indices of its neighbors in other.data.
Find all pairs of points in self whose distance is at most r.
Parameters:
r (positive float) – The maximum distance.
p (float, optional) – Which Minkowski norm to use. p has to meet the condition
1<=p<=infinity.
eps (float, optional) – Approximate search. Branches of the tree are not explored
if their nearest points are further than r/(1+eps), and
branches are added in bulk if their furthest points are nearer
than r*(1+eps). eps has to be non-negative.
Choose the output container, ‘set’ or ‘ndarray’. Default: ‘set’
New in version 1.6.0.
Returns:
results – Set of pairs (i,j), with i<j, for which the corresponding
positions are close. If output_type is ‘ndarray’, an ndarray is
returned instead of a set.
Which container to use for output data. Options: 'dok_array',
'coo_array', 'dict', or 'ndarray'.
Legacy options 'dok_matrix' and 'coo_matrix' are still available.
Default: 'dok_matrix'.
Warning
dok_matrix and coo_matrix are being replaced.
All new code using scipy sparse should use sparse array
types ‘dok_array’ or ‘coo_array’. The default value of
output_type will be deprecated at v1.19 and switch from
‘dok_matrix’ to ‘dok_array’ in v1.21.
The values ‘dok_matrix’ and ‘coo_matrix’ continue
to work, but will go away eventually.
New in version 1.6.0.
Returns:
result – Sparse matrix representing the results in “dictionary of keys”
format. If a dict is returned the keys are (i,j) tuples of indices.
If output_type is 'ndarray' a record array with fields 'i', 'j',
and 'v' is returned,
This class provides an index into a set of k-dimensional points
which can be used to rapidly look up the nearest neighbors of any
point.
Note
cKDTree is functionally identical to KDTree. Prior to SciPy
v1.6.0, cKDTree had better performance and slightly different
functionality but now the two names exist only for
backward-compatibility reasons. If compatibility with SciPy < 1.6 is not
a concern, prefer KDTree.
Parameters:
data (array_like, shape (n,m)) – The n data points of dimension m to be indexed. This array is
not copied unless this is necessary to produce a contiguous
array of doubles, and so modifying this data will result in
bogus results. The data are also copied if the kd-tree is built
with copy_data=True.
leafsize (positive int, optional) – The number of points at which the algorithm switches over to
brute-force. Default: 16.
compact_nodes (bool, optional) – If True, the kd-tree is built to shrink the hyperrectangles to
the actual data range. This usually gives a more compact tree that
is robust against degenerated input data and gives faster queries
at the expense of longer build time. Default: True.
copy_data (bool, optional) – If True the data is always copied to protect the kd-tree against
data corruption. Default: False.
balanced_tree (bool, optional) – If True, the median is used to split the hyperrectangles instead of
the midpoint. This usually gives a more compact tree and
faster queries at the expense of longer build time. Default: True.
boxsize (array_like or scalar, optional) – Apply an m-d toroidal topology to the KDTree. The topology is generated
by \(x_i + n_i L_i\) where \(n_i\) are integers and \(L_i\)
is the boxsize along i-th dimension. The input data shall be wrapped
into \([0, L_i)\). A ValueError is raised if any of the data is
outside of this bound.
The n data points of dimension m to be indexed. This array is
not copied unless this is necessary to produce a contiguous
array of doubles. The data are also copied if the kd-tree is built
with copy_data=True. Concurrently modifying the contents of the
data array while the KDTree initializer is running may lead to data
corruption or crashes. If the data are not copied, modifying the
original data array after the tree is created may lead to crashes or
data corruption when searching the tree.
This attribute exposes a Python view of the root node in the cKDTree
object. A full Python view of the kd-tree is created dynamically
on the first access. This attribute allows you to create your own
query functions in Python.
If boxsize is passed to the initializer, this will be set to the bounds
of the periodic box. None if no boxsize is passed.
Type:
None or ndarray, shape (m.)
Notes
The algorithm used is described in [1]_.
The general idea is that the kd-tree is a binary tree, each of whose
nodes represents an axis-aligned hyperrectangle. Each node specifies
an axis and splits the set of points based on whether their coordinate
along that axis is greater than or less than a particular value.
During construction, the axis and splitting point are chosen by the
“sliding midpoint” rule, which ensures that the cells do not all
become long and thin.
The tree can be queried for the r closest neighbors of any given point
(optionally returning only those within some maximum distance of the
point). It can also be queried, with a substantial gain in efficiency,
for the r approximate closest neighbors.
For large dimensions (20 is already large) do not expect this to run
significantly faster than brute force. High-dimensional nearest-neighbor
queries are a substantial open problem in computer science.
Count the number of pairs (x1,x2) can be formed, with x1 drawn
from self and x2 drawn from other, and where
distance(x1,x2,p)<=r.
Data points on self and other are optionally weighted by the
weights argument. (See below)
This is adapted from the “two-point correlation” algorithm described by
Gray and Moore [1]_. See notes for further discussion.
Parameters:
other (cKDTree instance) – The other tree to draw points from, can be the same tree as self.
r (float or one-dimensional array of floats) – The radius to produce a count for. Multiple radii are searched with
a single tree traversal.
If the count is non-cumulative (cumulative=False), r
defines the edges of the bins, and must be non-decreasing.
p (float, optional) – 1<=p<=infinity.
Which Minkowski p-norm to use.
Default 2.0.
A finite large p may cause a ValueError if overflow can occur.
weights (tuple, array_like, or None, optional) – If None, the pair-counting is unweighted.
If given as a tuple, weights[0] is the weights of points in self, and
weights[1] is the weights of points in other; either can be None to
indicate the points are unweighted.
If given as an array_like, weights is the weights of points in self
and other. For this to make sense, self and other must be the
same tree. If self and other are two different trees, a ValueError
is raised.
Default: None
cumulative (bool, optional) – Whether the returned counts are cumulative. When cumulative is set to False
the algorithm is optimized to work with a large number of bins (>10) specified
by r. When cumulative is set to True, the algorithm is optimized to work
with a small number of r. Default: True
Returns:
result – The number of pairs. For unweighted counts, the result is integer.
For weighted counts, the result is float.
If cumulative is False, result[i] contains the counts with
(-infifi==0elser[i-1])<R<=r[i]
Return type:
scalar or 1-D array
Notes
Pair-counting is the basic operation used to calculate the two point
correlation functions from a data set composed of position of objects.
Two point correlation function measures the clustering of objects and
is widely used in cosmology to quantify the large scale structure
in our Universe, but it may be useful for data analysis in other fields
where self-similar assembly of objects also occur.
The Landy-Szalay estimator for the two point correlation function of
D measures the clustering signal in D. [2]_
For example, given the position of two sets of objects,
objects D (data) contains the clustering signal, and
where the brackets represents counting pairs between two data sets
in a finite bin around r (distance), corresponding to setting
cumulative=False, and f=float(len(D))/float(len(R)) is the
ratio between number of objects from data and random.
The algorithm implemented here is loosely based on the dual-tree
algorithm described in [1]_. We switch between two different
pair-cumulation scheme depending on the setting of cumulative.
The computing time of the method we use when for
cumulative==False does not scale with the total number of bins.
The algorithm for cumulative==True scales linearly with the
number of bins, though it is slightly faster when only
1 or 2 bins are used. [5]_.
As an extension to the naive pair-counting,
weighted pair-counting counts the product of weights instead
of number of pairs.
Weighted pair-counting is used to estimate marked correlation functions
([3]_, section 2.2),
or to properly calculate the average of data per distance bin
(e.g. [4]_, section 2.1 on redshift).
Examples
You can count neighbors number between two kd-trees within a distance:
x (array_like, last dimension self.m) – An array of points to query.
k (list of integer or integer) – The list of k-th nearest neighbors to return. If k is an
integer it is treated as a list of [1, … k] (range(1, k+1)).
Note that the counting starts from 1.
eps (non-negative float) – Return approximate nearest neighbors; the k-th returned value
is guaranteed to be no further than (1+eps) times the
distance to the real k-th nearest neighbor.
p (float, 1<=p<=infinity) – Which Minkowski p-norm to use.
1 is the sum-of-absolute-values “Manhattan” distance
2 is the usual Euclidean distance
infinity is the maximum-coordinate-difference distance
A finite large p may cause a ValueError if overflow can occur.
distance_upper_bound (nonnegative float) – Return only neighbors within this distance. This is used to prune
tree searches, so if you are doing a series of nearest-neighbor
queries, it may help to supply the distance to the nearest neighbor
of the most recent point.
Number of workers to use for parallel processing. If -1 is given
all CPU threads are used. Default: 1.
Changed in version 1.9.0: The “n_jobs” argument was renamed “workers”. The old name
“n_jobs” was deprecated in SciPy 1.6.0 and was removed in
SciPy 1.9.0.
Returns:
d (array of floats) – The distances to the nearest neighbors.
If x has shape tuple+(self.m,), then d has shape tuple+(k,).
When k == 1, the last dimension of the output is squeezed.
Missing neighbors are indicated with infinite distances.
i (ndarray of ints) – The index of each neighbor in self.data.
If x has shape tuple+(self.m,), then i has shape tuple+(k,).
When k == 1, the last dimension of the output is squeezed.
Missing neighbors are indicated with self.n.
Notes
If the KD-Tree is periodic, the position x is wrapped into the
box.
When the input k is a list, a query for arange(max(k)) is performed, but
only columns that store the requested values of k are preserved. This is
implemented in a manner that reduces memory usage.
x (array_like, shape tuple + (self.m,)) – The point or points to search for neighbors of.
r (array_like, float) – The radius of points to return, shall broadcast to the length of x.
p (float, optional) – Which Minkowski p-norm to use. Should be in the range [1, inf].
A finite large p may cause a ValueError if overflow can occur.
eps (nonnegative float, optional) – Approximate search. Branches of the tree are not explored if their
nearest points are further than r/(1+eps), and branches are
added in bulk if their furthest points are nearer than
r*(1+eps).
Number of jobs to schedule for parallel processing. If -1 is given
all processors are used. Default: 1.
Changed in version 1.9.0: The “n_jobs” argument was renamed “workers”. The old name
“n_jobs” was deprecated in SciPy 1.6.0 and was removed in
SciPy 1.9.0.
Sorts returned indices if True and does not sort them if False. If
None, does not sort single point queries, but does sort
multi-point queries which was the behavior before this option
was added.
New in version 1.2.0.
return_length (bool, optional) – Return the number of points inside the radius instead of a list
of the indices.
.. versionadded:: 1.3.0
Returns:
results – If x is a single point, returns a list of the indices of the
neighbors of x. If x is an array of points, returns an object
array of shape tuple containing lists of neighbors.
If you have many points whose neighbors you want to find, you may save
substantial amounts of time by putting them in a cKDTree and using
query_ball_tree.
Find all pairs of points between self and other whose distance is at most r
Parameters:
other (cKDTree instance) – The tree containing points to search against.
r (float) – The maximum distance, has to be positive.
p (float, optional) – Which Minkowski norm to use. p has to meet the condition
1<=p<=infinity.
A finite large p may cause a ValueError if overflow can occur.
eps (float, optional) – Approximate search. Branches of the tree are not explored
if their nearest points are further than r/(1+eps), and
branches are added in bulk if their furthest points are nearer
than r*(1+eps). eps has to be non-negative.
Returns:
results – For each element self.data[i] of this tree, results[i] is a
list of the indices of its neighbors in other.data.
Find all pairs of points in self whose distance is at most r.
Parameters:
r (positive float) – The maximum distance.
p (float, optional) – Which Minkowski norm to use. p has to meet the condition
1<=p<=infinity.
A finite large p may cause a ValueError if overflow can occur.
eps (float, optional) – Approximate search. Branches of the tree are not explored
if their nearest points are further than r/(1+eps), and
branches are added in bulk if their furthest points are nearer
than r*(1+eps). eps has to be non-negative.
output_type (str, optional) – Choose the output container, ‘set’ or ‘ndarray’. Default: ‘set’
Returns:
results – Set of pairs (i,j), with i<j, for which the corresponding
positions are close. If output_type is ‘ndarray’, an ndarray is
returned instead of a set.
Which container to use for output data. Options: 'dok_array',
'coo_array', 'dict', or 'ndarray'.
Legacy options 'dok_matrix' and 'coo_matrix' are still available.
Default: 'dok_matrix'.
Warning
dok_matrix and coo_matrix are being replaced.
All new code using scipy sparse should use sparse array
types ‘dok_array’ or ‘coo_array’. The default value of
output_type will be deprecated at v1.19 and switch from
‘dok_matrix’ to ‘dok_array’ in v1.21.
The values ‘dok_matrix’ and ‘coo_matrix’ continue
to work, but will go away eventually.
Returns:
result – Sparse matrix representing the results in “dictionary of keys”
format. If a dict is returned the keys are (i,j) tuples of indices.
If output_type is 'ndarray' a record array with fields 'i', 'j',
and 'v' is returned,
The dimension along which this node is split. If this value is -1
the node is a leafnode in the kd-tree. Leafnodes are not split further
and scanned by brute force.
The value used to separate split this node. Points with value >= split
in the split_dim dimension are sorted to the ‘greater’ subnode
whereas those with value < split are sorted to the ‘lesser’ subnode.
moleculekit.kdtree.minkowski_distance_p(x, y, p=2.0)#
Compute the pth power of the L**p distance between two arrays.
For efficiency, this function computes the L**p distance but does
not extract the pth root. If p is 1 or infinity, this is equal to
the actual L**p distance.
The last dimensions of x and y must be the same length. Any
other dimensions must be compatible for broadcasting.
Parameters:
x ((..., K) array_like) – Input array.
y ((..., K) array_like) – Input array.
p (float, 1 <= p <= infinity) – Which Minkowski p-norm to use.
Returns:
dist – pth power of the distance between the input arrays.