moleculekit.kdtree package#

Module contents#

KDTree port from SciPy.

The implementation was taken from the SciPy project (commit f82b3f9d3c30cc6e7f906b0d00b8b43bfea02e5c) and modified to drop the scipy.sparse and scipy._lib dependencies so that moleculekit does not need to pull in SciPy. See moleculekit.kdtree._ckdtree for details on the changes.

The public API is KDTree and the lower-level cKDTree.

class moleculekit.kdtree.KDTree(data, leafsize=10, compact_nodes=True, copy_data=False, balanced_tree=True, boxsize=None)#

Bases: cKDTree

kd-tree for quick nearest-neighbor lookup.

This class provides an index into a set of k-dimensional points which can be used to rapidly look up the nearest neighbors of any point.

Parameters:
  • data (array_like, shape (n,m)) – The n data points of dimension m to be indexed. This array is not copied unless this is necessary to produce a contiguous array of doubles. The data are also copied if the kd-tree is built with copy_data=True. Concurrently modifying the data while constructing a KDTree is not well-defined and may lead to crashes or data corruption. If no copy happens, modifying the data after creating the KDTree may lead to data corruption or incorrect answers when searching the tree.

  • leafsize (positive int, optional) – The number of points at which the algorithm switches over to brute-force. Default: 10.

  • compact_nodes (bool, optional) – If True, the kd-tree is built to shrink the hyperrectangles to the actual data range. This usually gives a more compact tree that is robust against degenerated input data and gives faster queries at the expense of longer build time. Default: True.

  • copy_data (bool, optional) – If True the data is always copied to protect the kd-tree against data corruption. Default: False.

  • balanced_tree (bool, optional) – If True, the median is used to split the hyperrectangles instead of the midpoint. This usually gives a more compact tree and faster queries at the expense of longer build time. Default: True.

  • boxsize (array_like or scalar, optional) – Apply an m-d toroidal topology to the KDTree. The topology is generated by \(x_i + n_i L_i\) where \(n_i\) are integers and \(L_i\) is the boxsize along i-th dimension. The input data shall be wrapped into \([0, L_i)\). A ValueError is raised if any of the data is outside of this bound.

data#

The n data points of dimension m to be indexed. This array is not copied unless this is necessary to produce a contiguous array of doubles. The data are also copied if the kd-tree is built with copy_data=True.

Type:

ndarray, shape (n,m)

leafsize#

The number of points at which the algorithm switches over to brute-force.

Type:

positive int

m#

The dimension of a single data-point.

Type:

int

n#

The number of data points.

Type:

int

maxes#

The maximum value in each dimension of the n data points.

Type:

ndarray, shape (m,)

mins#

The minimum value in each dimension of the n data points.

Type:

ndarray, shape (m,)

size#

The number of nodes in the tree.

Type:

int

boxsize#

If boxsize is passed to the initializer, this will be set to the bounds of the periodic box. None if no boxsize is passed.

Type:

None or ndarray, shape (m.)

Notes

The algorithm used is described in [1]_. The general idea is that the kd-tree is a binary tree, each of whose nodes represents an axis-aligned hyperrectangle. Each node specifies an axis and splits the set of points based on whether their coordinate along that axis is greater than or less than a particular value.

During construction, the axis and splitting point are chosen by the “sliding midpoint” rule, which ensures that the cells do not all become long and thin.

The tree can be queried for the r closest neighbors of any given point (optionally returning only those within some maximum distance of the point). It can also be queried, with a substantial gain in efficiency, for the r approximate closest neighbors.

For large dimensions (20 is already large) do not expect this to run significantly faster than brute force. High-dimensional nearest-neighbor queries are a substantial open problem in computer science.

References

count_neighbors(other, r, p=2.0, weights=None, cumulative=True)#

Count how many nearby pairs can be formed.

Count the number of pairs (x1,x2) can be formed, with x1 drawn from self and x2 drawn from other, and where distance(x1, x2, p) <= r.

Data points on self and other are optionally weighted by the weights argument. (See below)

This is adapted from the “two-point correlation” algorithm described by Gray and Moore [1]_. See notes for further discussion.

Parameters:
  • other (KDTree) – The other tree to draw points from, can be the same tree as self.

  • r (float or one-dimensional array of floats) – The radius to produce a count for. Multiple radii are searched with a single tree traversal. If the count is non-cumulative (cumulative=False), r defines the edges of the bins, and must be non-decreasing.

  • p (float, optional) – 1<=p<=infinity. Which Minkowski p-norm to use. Default 2.0. A finite large p may cause a ValueError if overflow can occur.

  • weights (tuple, array_like, or None, optional) –

    If None, the pair-counting is unweighted. If given as a tuple, weights[0] is the weights of points in self, and weights[1] is the weights of points in other; either can be None to indicate the points are unweighted. If given as an array_like, weights is the weights of points in self and other. For this to make sense, self and other must be the same tree. If self and other are two different trees, a ValueError is raised. Default: None

    New in version 1.6.0.

  • cumulative (bool, optional) –

    Whether the returned counts are cumulative. When cumulative is set to False the algorithm is optimized to work with a large number of bins (>10) specified by r. When cumulative is set to True, the algorithm is optimized to work with a small number of r. Default: True

    New in version 1.6.0.

Returns:

result – The number of pairs. For unweighted counts, the result is integer. For weighted counts, the result is float. If cumulative is False, result[i] contains the counts with (-inf if i == 0 else r[i-1]) < R <= r[i]

Return type:

scalar or 1-D array

Notes

Pair-counting is the basic operation used to calculate the two point correlation functions from a data set composed of position of objects.

Two point correlation function measures the clustering of objects and is widely used in cosmology to quantify the large scale structure in our Universe, but it may be useful for data analysis in other fields where self-similar assembly of objects also occur.

The Landy-Szalay estimator for the two point correlation function of D measures the clustering signal in D. [2]_

For example, given the position of two sets of objects,

  • objects D (data) contains the clustering signal, and

  • objects R (random) that contains no signal,

\[\xi(r) = \frac{<D, D> - 2 f <D, R> + f^2<R, R>}{f^2<R, R>},\]

where the brackets represents counting pairs between two data sets in a finite bin around r (distance), corresponding to setting cumulative=False, and f = float(len(D)) / float(len(R)) is the ratio between number of objects from data and random.

The algorithm implemented here is loosely based on the dual-tree algorithm described in [1]_. We switch between two different pair-cumulation scheme depending on the setting of cumulative. The computing time of the method we use when for cumulative == False does not scale with the total number of bins. The algorithm for cumulative == True scales linearly with the number of bins, though it is slightly faster when only 1 or 2 bins are used. [5]_.

As an extension to the naive pair-counting, weighted pair-counting counts the product of weights instead of number of pairs. Weighted pair-counting is used to estimate marked correlation functions ([3]_, section 2.2), or to properly calculate the average of data per distance bin (e.g. [4]_, section 2.1 on redshift).

Examples

You can count neighbors number between two kd-trees within a distance:

>>> import numpy as np
>>> from scipy.spatial import KDTree
>>> rng = np.random.default_rng()
>>> points1 = rng.random((5, 2))
>>> points2 = rng.random((5, 2))
>>> kd_tree1 = KDTree(points1)
>>> kd_tree2 = KDTree(points2)
>>> kd_tree1.count_neighbors(kd_tree2, 0.2)
1

This number is same as the total pair number calculated by query_ball_tree:

>>> indexes = kd_tree1.query_ball_tree(kd_tree2, r=0.2)
>>> sum([len(i) for i in indexes])
1
class innernode(ckdtreenode)#

Bases: node

property children#
property split#
property split_dim#
class leafnode(ckdtree_node=None)#

Bases: node

property children#
property idx#
class node(ckdtree_node=None)#

Bases: object

query(x, k=1, eps=0.0, p=2.0, distance_upper_bound=inf, workers=1)#

Query the kd-tree for nearest neighbors.

Parameters:
  • x (array_like, last dimension self.m) – An array of points to query.

  • k (int or Sequence[int], optional) – Either the number of nearest neighbors to return, or a list of the k-th nearest neighbors to return, starting from 1.

  • eps (nonnegative float, optional) – Return approximate nearest neighbors; the kth returned value is guaranteed to be no further than (1+eps) times the distance to the real kth nearest neighbor.

  • p (float, 1<=p<=infinity, optional) – Which Minkowski p-norm to use. 1 is the sum-of-absolute-values distance (“Manhattan” distance). 2 is the usual Euclidean distance. infinity is the maximum-coordinate-difference distance. A large, finite p may cause a ValueError if overflow can occur.

  • distance_upper_bound (nonnegative float, optional) – Return only neighbors within this distance. This is used to prune tree searches, so if you are doing a series of nearest-neighbor queries, it may help to supply the distance to the nearest neighbor of the most recent point.

  • workers (int, optional) –

    Number of workers to use for parallel processing. If -1 is given all CPU threads are used. Default: 1.

    New in version 1.6.0.

Returns:

  • d (float or array of floats) – The distances to the nearest neighbors. If x has shape tuple+(self.m,), then d has shape tuple+(k,). When k == 1, the last dimension of the output is squeezed. Missing neighbors are indicated with infinite distances. Hits are sorted by distance (nearest first).

    Changed in version 1.9.0: Previously if k=None, then d was an object array of shape tuple, containing lists of distances. This behavior has been removed, use query_ball_point instead.

  • i (int or array of integers) – The index of each neighbor in self.data. i is the same shape as d. Missing neighbors are indicated with self.n.

Examples

>>> import numpy as np
>>> from scipy.spatial import KDTree
>>> x, y = np.mgrid[0:5, 2:8]
>>> tree = KDTree(np.c_[x.ravel(), y.ravel()])

To query the nearest neighbours and return squeezed result, use

>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=1)
>>> print(dd, ii, sep='\n')
[2.         0.2236068]
[ 0 13]

To query the nearest neighbours and return unsqueezed result, use

>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=[1])
>>> print(dd, ii, sep='\n')
[[2.        ]
 [0.2236068]]
[[ 0]
 [13]]

To query the second nearest neighbours and return unsqueezed result, use

>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=[2])
>>> print(dd, ii, sep='\n')
[[2.23606798]
 [0.80622577]]
[[ 6]
 [19]]

To query the first and second nearest neighbours, use

>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=2)
>>> print(dd, ii, sep='\n')
[[2.         2.23606798]
 [0.2236068  0.80622577]]
[[ 0  6]
 [13 19]]

or, be more specific

>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=[1, 2])
>>> print(dd, ii, sep='\n')
[[2.         2.23606798]
 [0.2236068  0.80622577]]
[[ 0  6]
 [13 19]]
query_ball_point(x, r, p=2.0, eps=0.0, workers=1, return_sorted=None, return_length=False)#

Find all points within distance r of point(s) x.

Parameters:
  • x (array_like, shape tuple + (self.m,)) – The point or points to search for neighbors of.

  • r (array_like, float) – The radius of points to return, must broadcast to the length of x.

  • p (float, optional) – Which Minkowski p-norm to use. Should be in the range [1, inf]. A finite large p may cause a ValueError if overflow can occur.

  • eps (nonnegative float, optional) – Approximate search. Branches of the tree are not explored if their nearest points are further than r / (1 + eps), and branches are added in bulk if their furthest points are nearer than r * (1 + eps).

  • workers (int, optional) –

    Number of jobs to schedule for parallel processing. If -1 is given all processors are used. Default: 1.

    New in version 1.6.0.

  • return_sorted (bool, optional) –

    Sorts returned indices if True and does not sort them if False. If None, does not sort single point queries, but does sort multi-point queries which was the behavior before this option was added.

    New in version 1.6.0.

  • return_length (bool, optional) –

    Return the number of points inside the radius instead of a list of the indices.

    New in version 1.6.0.

Returns:

results – If x is a single point, returns a list of the indices of the neighbors of x. If x is an array of points, returns an object array of shape tuple containing lists of neighbors.

Return type:

list or array of lists

Notes

If you have many points whose neighbors you want to find, you may save substantial amounts of time by putting them in a KDTree and using query_ball_tree.

Examples

>>> import numpy as np
>>> from scipy import spatial
>>> x, y = np.mgrid[0:5, 0:5]
>>> points = np.c_[x.ravel(), y.ravel()]
>>> tree = spatial.KDTree(points)
>>> sorted(tree.query_ball_point([2, 0], 1))
[5, 10, 11, 15]

Query multiple points and plot the results:

>>> import matplotlib.pyplot as plt
>>> points = np.asarray(points)
>>> plt.plot(points[:,0], points[:,1], '.')
>>> for results in tree.query_ball_point(([2, 0], [3, 3]), 1):
...     nearby_points = points[results]
...     plt.plot(nearby_points[:,0], nearby_points[:,1], 'o')
>>> plt.margins(0.1, 0.1)
>>> plt.show()
query_ball_tree(other, r, p=2.0, eps=0.0)#

Find all pairs of points between self and other whose distance is at most r.

Parameters:
  • other (KDTree instance) – The tree containing points to search against.

  • r (float) – The maximum distance, has to be positive.

  • p (float, optional) – Which Minkowski norm to use. p has to meet the condition 1 <= p <= infinity.

  • eps (float, optional) – Approximate search. Branches of the tree are not explored if their nearest points are further than r/(1+eps), and branches are added in bulk if their furthest points are nearer than r * (1+eps). eps has to be non-negative.

Returns:

results – For each element self.data[i] of this tree, results[i] is a list of the indices of its neighbors in other.data.

Return type:

list of lists

Examples

You can search all pairs of points between two kd-trees within a distance:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from scipy.spatial import KDTree
>>> rng = np.random.default_rng()
>>> points1 = rng.random((15, 2))
>>> points2 = rng.random((15, 2))
>>> plt.figure(figsize=(6, 6))
>>> plt.plot(points1[:, 0], points1[:, 1], "xk", markersize=14)
>>> plt.plot(points2[:, 0], points2[:, 1], "og", markersize=14)
>>> kd_tree1 = KDTree(points1)
>>> kd_tree2 = KDTree(points2)
>>> indexes = kd_tree1.query_ball_tree(kd_tree2, r=0.2)
>>> for i in range(len(indexes)):
...     for j in indexes[i]:
...         plt.plot([points1[i, 0], points2[j, 0]],
...             [points1[i, 1], points2[j, 1]], "-r")
>>> plt.show()
query_pairs(r, p=2.0, eps=0.0, output_type='set')#

Find all pairs of points in self whose distance is at most r.

Parameters:
  • r (positive float) – The maximum distance.

  • p (float, optional) – Which Minkowski norm to use. p has to meet the condition 1 <= p <= infinity.

  • eps (float, optional) – Approximate search. Branches of the tree are not explored if their nearest points are further than r/(1+eps), and branches are added in bulk if their furthest points are nearer than r * (1+eps). eps has to be non-negative.

  • output_type (str, optional) –

    Choose the output container, ‘set’ or ‘ndarray’. Default: ‘set’

    New in version 1.6.0.

Returns:

results – Set of pairs (i,j), with i < j, for which the corresponding positions are close. If output_type is ‘ndarray’, an ndarray is returned instead of a set.

Return type:

set or ndarray

Examples

You can search all pairs of points in a kd-tree within a distance:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from scipy.spatial import KDTree
>>> rng = np.random.default_rng()
>>> points = rng.random((20, 2))
>>> plt.figure(figsize=(6, 6))
>>> plt.plot(points[:, 0], points[:, 1], "xk", markersize=14)
>>> kd_tree = KDTree(points)
>>> pairs = kd_tree.query_pairs(r=0.2)
>>> for (i, j) in pairs:
...     plt.plot([points[i, 0], points[j, 0]],
...             [points[i, 1], points[j, 1]], "-r")
>>> plt.show()
sparse_distance_matrix(other, max_distance, p=2.0, output_type='dok_matrix')#

Compute a sparse distance matrix.

Computes a distance matrix between two KDTrees, leaving as zero any distance greater than max_distance.

Parameters:
  • other (KDTree) – The other KDTree to compute distances against.

  • max_distance (positive float) – Maximum distance within which neighbors are returned. Distances above this value are returned as zero.

  • p (float, 1<=p<=infinity) – Which Minkowski p-norm to use. A finite large p may cause a ValueError if overflow can occur.

  • output_type (str, optional) –

    Which container to use for output data. Options: 'dok_array', 'coo_array', 'dict', or 'ndarray'. Legacy options 'dok_matrix' and 'coo_matrix' are still available. Default: 'dok_matrix'.

    Warning

    dok_matrix and coo_matrix are being replaced.

    All new code using scipy sparse should use sparse array types ‘dok_array’ or ‘coo_array’. The default value of output_type will be deprecated at v1.19 and switch from ‘dok_matrix’ to ‘dok_array’ in v1.21. The values ‘dok_matrix’ and ‘coo_matrix’ continue to work, but will go away eventually.

    New in version 1.6.0.

Returns:

result – Sparse matrix representing the results in “dictionary of keys” format. If a dict is returned the keys are (i,j) tuples of indices. If output_type is 'ndarray' a record array with fields 'i', 'j', and 'v' is returned,

Return type:

dok_array, coo_array, dict or ndarray

Examples

You can compute a sparse distance matrix between two kd-trees:

>>> import numpy as np
>>> from scipy.spatial import KDTree
>>> rng = np.random.default_rng()
>>> points1 = rng.random((5, 2))
>>> points2 = rng.random((5, 2))
>>> kdtree1 = KDTree(points1)
>>> kdtree2 = KDTree(points2)
>>> sdm = kdtree1.sparse_distance_matrix(kdtree2, 0.3, output_type="dok_array")
>>> sdm.toarray()
array([[0.        , 0.        , 0.12295571, 0.        , 0.        ],
   [0.        , 0.        , 0.        , 0.        , 0.        ],
   [0.28942611, 0.        , 0.        , 0.2333084 , 0.        ],
   [0.        , 0.        , 0.        , 0.        , 0.        ],
   [0.24617575, 0.29571802, 0.26836782, 0.        , 0.        ]])

You can check distances above the max_distance are zeros:

>>> from scipy.spatial import distance_matrix
>>> distance_matrix(points1, points2)
array([[0.56906522, 0.39923701, 0.12295571, 0.8658745 , 0.79428925],
   [0.37327919, 0.7225693 , 0.87665969, 0.32580855, 0.75679479],
   [0.28942611, 0.30088013, 0.6395831 , 0.2333084 , 0.33630734],
   [0.31994999, 0.72658602, 0.71124834, 0.55396483, 0.90785663],
   [0.24617575, 0.29571802, 0.26836782, 0.57714465, 0.6473269 ]])
property tree#
class moleculekit.kdtree.cKDTree(data, leafsize=16, compact_nodes=True, copy_data=False, balanced_tree=True, boxsize=None)#

Bases: object

kd-tree for quick nearest-neighbor lookup.

This class provides an index into a set of k-dimensional points which can be used to rapidly look up the nearest neighbors of any point.

Note

cKDTree is functionally identical to KDTree. Prior to SciPy v1.6.0, cKDTree had better performance and slightly different functionality but now the two names exist only for backward-compatibility reasons. If compatibility with SciPy < 1.6 is not a concern, prefer KDTree.

Parameters:
  • data (array_like, shape (n,m)) – The n data points of dimension m to be indexed. This array is not copied unless this is necessary to produce a contiguous array of doubles, and so modifying this data will result in bogus results. The data are also copied if the kd-tree is built with copy_data=True.

  • leafsize (positive int, optional) – The number of points at which the algorithm switches over to brute-force. Default: 16.

  • compact_nodes (bool, optional) – If True, the kd-tree is built to shrink the hyperrectangles to the actual data range. This usually gives a more compact tree that is robust against degenerated input data and gives faster queries at the expense of longer build time. Default: True.

  • copy_data (bool, optional) – If True the data is always copied to protect the kd-tree against data corruption. Default: False.

  • balanced_tree (bool, optional) – If True, the median is used to split the hyperrectangles instead of the midpoint. This usually gives a more compact tree and faster queries at the expense of longer build time. Default: True.

  • boxsize (array_like or scalar, optional) – Apply an m-d toroidal topology to the KDTree. The topology is generated by \(x_i + n_i L_i\) where \(n_i\) are integers and \(L_i\) is the boxsize along i-th dimension. The input data shall be wrapped into \([0, L_i)\). A ValueError is raised if any of the data is outside of this bound.

data#

The n data points of dimension m to be indexed. This array is not copied unless this is necessary to produce a contiguous array of doubles. The data are also copied if the kd-tree is built with copy_data=True. Concurrently modifying the contents of the data array while the KDTree initializer is running may lead to data corruption or crashes. If the data are not copied, modifying the original data array after the tree is created may lead to crashes or data corruption when searching the tree.

Type:

ndarray, shape (n,m)

leafsize#

The number of points at which the algorithm switches over to brute-force.

Type:

positive int

m#

The dimension of a single data-point.

Type:

int

n#

The number of data points.

Type:

int

maxes#

The maximum value in each dimension of the n data points.

Type:

ndarray, shape (m,)

mins#

The minimum value in each dimension of the n data points.

Type:

ndarray, shape (m,)

tree#

This attribute exposes a Python view of the root node in the cKDTree object. A full Python view of the kd-tree is created dynamically on the first access. This attribute allows you to create your own query functions in Python.

Type:

object, class cKDTreeNode

size#

The number of nodes in the tree.

Type:

int

boxsize#

If boxsize is passed to the initializer, this will be set to the bounds of the periodic box. None if no boxsize is passed.

Type:

None or ndarray, shape (m.)

Notes

The algorithm used is described in [1]_. The general idea is that the kd-tree is a binary tree, each of whose nodes represents an axis-aligned hyperrectangle. Each node specifies an axis and splits the set of points based on whether their coordinate along that axis is greater than or less than a particular value.

During construction, the axis and splitting point are chosen by the “sliding midpoint” rule, which ensures that the cells do not all become long and thin.

The tree can be queried for the r closest neighbors of any given point (optionally returning only those within some maximum distance of the point). It can also be queried, with a substantial gain in efficiency, for the r approximate closest neighbors.

For large dimensions (20 is already large) do not expect this to run significantly faster than brute force. High-dimensional nearest-neighbor queries are a substantial open problem in computer science.

References

boxsize#
count_neighbors(self, other, r, p=2.0, weights=None, cumulative=True)#

Count how many nearby pairs can be formed.

Count the number of pairs (x1,x2) can be formed, with x1 drawn from self and x2 drawn from other, and where distance(x1, x2, p) <= r.

Data points on self and other are optionally weighted by the weights argument. (See below)

This is adapted from the “two-point correlation” algorithm described by Gray and Moore [1]_. See notes for further discussion.

Parameters:
  • other (cKDTree instance) – The other tree to draw points from, can be the same tree as self.

  • r (float or one-dimensional array of floats) – The radius to produce a count for. Multiple radii are searched with a single tree traversal. If the count is non-cumulative (cumulative=False), r defines the edges of the bins, and must be non-decreasing.

  • p (float, optional) – 1<=p<=infinity. Which Minkowski p-norm to use. Default 2.0. A finite large p may cause a ValueError if overflow can occur.

  • weights (tuple, array_like, or None, optional) – If None, the pair-counting is unweighted. If given as a tuple, weights[0] is the weights of points in self, and weights[1] is the weights of points in other; either can be None to indicate the points are unweighted. If given as an array_like, weights is the weights of points in self and other. For this to make sense, self and other must be the same tree. If self and other are two different trees, a ValueError is raised. Default: None

  • cumulative (bool, optional) – Whether the returned counts are cumulative. When cumulative is set to False the algorithm is optimized to work with a large number of bins (>10) specified by r. When cumulative is set to True, the algorithm is optimized to work with a small number of r. Default: True

Returns:

result – The number of pairs. For unweighted counts, the result is integer. For weighted counts, the result is float. If cumulative is False, result[i] contains the counts with (-inf if i == 0 else r[i-1]) < R <= r[i]

Return type:

scalar or 1-D array

Notes

Pair-counting is the basic operation used to calculate the two point correlation functions from a data set composed of position of objects.

Two point correlation function measures the clustering of objects and is widely used in cosmology to quantify the large scale structure in our Universe, but it may be useful for data analysis in other fields where self-similar assembly of objects also occur.

The Landy-Szalay estimator for the two point correlation function of D measures the clustering signal in D. [2]_

For example, given the position of two sets of objects,

  • objects D (data) contains the clustering signal, and

  • objects R (random) that contains no signal,

\[\xi(r) = \frac{<D, D> - 2 f <D, R> + f^2<R, R>}{f^2<R, R>},\]

where the brackets represents counting pairs between two data sets in a finite bin around r (distance), corresponding to setting cumulative=False, and f = float(len(D)) / float(len(R)) is the ratio between number of objects from data and random.

The algorithm implemented here is loosely based on the dual-tree algorithm described in [1]_. We switch between two different pair-cumulation scheme depending on the setting of cumulative. The computing time of the method we use when for cumulative == False does not scale with the total number of bins. The algorithm for cumulative == True scales linearly with the number of bins, though it is slightly faster when only 1 or 2 bins are used. [5]_.

As an extension to the naive pair-counting, weighted pair-counting counts the product of weights instead of number of pairs. Weighted pair-counting is used to estimate marked correlation functions ([3]_, section 2.2), or to properly calculate the average of data per distance bin (e.g. [4]_, section 2.1 on redshift).

Examples

You can count neighbors number between two kd-trees within a distance:

>>> import numpy as np
>>> from scipy.spatial import cKDTree
>>> rng = np.random.default_rng()
>>> points1 = rng.random((5, 2))
>>> points2 = rng.random((5, 2))
>>> kd_tree1 = cKDTree(points1)
>>> kd_tree2 = cKDTree(points2)
>>> kd_tree1.count_neighbors(kd_tree2, 0.2)
1

This number is same as the total pair number calculated by query_ball_tree:

>>> indexes = kd_tree1.query_ball_tree(kd_tree2, r=0.2)
>>> sum([len(i) for i in indexes])
1
data#
indices#
leafsize#
m#
maxes#
mins#
n#
query(self, x, k=1, eps=0.0, p=2.0, distance_upper_bound=np.inf, workers=1)#

Query the kd-tree for nearest neighbors

Parameters:
  • x (array_like, last dimension self.m) – An array of points to query.

  • k (list of integer or integer) – The list of k-th nearest neighbors to return. If k is an integer it is treated as a list of [1, … k] (range(1, k+1)). Note that the counting starts from 1.

  • eps (non-negative float) – Return approximate nearest neighbors; the k-th returned value is guaranteed to be no further than (1+eps) times the distance to the real k-th nearest neighbor.

  • p (float, 1<=p<=infinity) – Which Minkowski p-norm to use. 1 is the sum-of-absolute-values “Manhattan” distance 2 is the usual Euclidean distance infinity is the maximum-coordinate-difference distance A finite large p may cause a ValueError if overflow can occur.

  • distance_upper_bound (nonnegative float) – Return only neighbors within this distance. This is used to prune tree searches, so if you are doing a series of nearest-neighbor queries, it may help to supply the distance to the nearest neighbor of the most recent point.

  • workers (int, optional) –

    Number of workers to use for parallel processing. If -1 is given all CPU threads are used. Default: 1.

    Changed in version 1.9.0: The “n_jobs” argument was renamed “workers”. The old name “n_jobs” was deprecated in SciPy 1.6.0 and was removed in SciPy 1.9.0.

Returns:

  • d (array of floats) – The distances to the nearest neighbors. If x has shape tuple+(self.m,), then d has shape tuple+(k,). When k == 1, the last dimension of the output is squeezed. Missing neighbors are indicated with infinite distances.

  • i (ndarray of ints) – The index of each neighbor in self.data. If x has shape tuple+(self.m,), then i has shape tuple+(k,). When k == 1, the last dimension of the output is squeezed. Missing neighbors are indicated with self.n.

Notes

If the KD-Tree is periodic, the position x is wrapped into the box.

When the input k is a list, a query for arange(max(k)) is performed, but only columns that store the requested values of k are preserved. This is implemented in a manner that reduces memory usage.

Examples

>>> import numpy as np
>>> from scipy.spatial import cKDTree
>>> x, y = np.mgrid[0:5, 2:8]
>>> tree = cKDTree(np.c_[x.ravel(), y.ravel()])

To query the nearest neighbours and return squeezed result, use

>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=1)
>>> print(dd, ii, sep='\n')
[2.         0.2236068]
[ 0 13]

To query the nearest neighbours and return unsqueezed result, use

>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=[1])
>>> print(dd, ii, sep='\n')
[[2.        ]
 [0.2236068]]
[[ 0]
 [13]]

To query the second nearest neighbours and return unsqueezed result, use

>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=[2])
>>> print(dd, ii, sep='\n')
[[2.23606798]
 [0.80622577]]
[[ 6]
 [19]]

To query the first and second nearest neighbours, use

>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=2)
>>> print(dd, ii, sep='\n')
[[2.         2.23606798]
 [0.2236068  0.80622577]]
[[ 0  6]
 [13 19]]

or, be more specific

>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=[1, 2])
>>> print(dd, ii, sep='\n')
[[2.         2.23606798]
 [0.2236068  0.80622577]]
[[ 0  6]
 [13 19]]
query_ball_point(x, r, p=2.0, eps=0.0, workers=None, return_sorted=None, return_length=False, **kwargs)#
query_ball_point(self, x, r, p=2.0, eps=0.0, workers=1, return_sorted=None,

return_length=False)

Find all points within distance r of point(s) x.

Parameters:
  • x (array_like, shape tuple + (self.m,)) – The point or points to search for neighbors of.

  • r (array_like, float) – The radius of points to return, shall broadcast to the length of x.

  • p (float, optional) – Which Minkowski p-norm to use. Should be in the range [1, inf]. A finite large p may cause a ValueError if overflow can occur.

  • eps (nonnegative float, optional) – Approximate search. Branches of the tree are not explored if their nearest points are further than r / (1 + eps), and branches are added in bulk if their furthest points are nearer than r * (1 + eps).

  • workers (int, optional) –

    Number of jobs to schedule for parallel processing. If -1 is given all processors are used. Default: 1.

    Changed in version 1.9.0: The “n_jobs” argument was renamed “workers”. The old name “n_jobs” was deprecated in SciPy 1.6.0 and was removed in SciPy 1.9.0.

  • return_sorted (bool, optional) –

    Sorts returned indices if True and does not sort them if False. If None, does not sort single point queries, but does sort multi-point queries which was the behavior before this option was added.

    New in version 1.2.0.

  • return_length (bool, optional) – Return the number of points inside the radius instead of a list of the indices. .. versionadded:: 1.3.0

Returns:

results – If x is a single point, returns a list of the indices of the neighbors of x. If x is an array of points, returns an object array of shape tuple containing lists of neighbors.

Return type:

list or array of lists

Notes

If you have many points whose neighbors you want to find, you may save substantial amounts of time by putting them in a cKDTree and using query_ball_tree.

Examples

>>> import numpy as np
>>> from scipy import spatial
>>> x, y = np.mgrid[0:4, 0:4]
>>> points = np.c_[x.ravel(), y.ravel()]
>>> tree = spatial.cKDTree(points)
>>> tree.query_ball_point([2, 0], 1)
[4, 8, 9, 12]

Query multiple points and plot the results:

>>> import matplotlib.pyplot as plt
>>> points = np.asarray(points)
>>> plt.plot(points[:,0], points[:,1], '.')
>>> for results in tree.query_ball_point(([2, 0], [3, 3]), 1):
...     nearby_points = points[results]
...     plt.plot(nearby_points[:,0], nearby_points[:,1], 'o')
>>> plt.margins(0.1, 0.1)
>>> plt.show()
query_ball_tree(self, other, r, p=2.0, eps=0.0)#

Find all pairs of points between self and other whose distance is at most r

Parameters:
  • other (cKDTree instance) – The tree containing points to search against.

  • r (float) – The maximum distance, has to be positive.

  • p (float, optional) – Which Minkowski norm to use. p has to meet the condition 1 <= p <= infinity. A finite large p may cause a ValueError if overflow can occur.

  • eps (float, optional) – Approximate search. Branches of the tree are not explored if their nearest points are further than r/(1+eps), and branches are added in bulk if their furthest points are nearer than r * (1+eps). eps has to be non-negative.

Returns:

results – For each element self.data[i] of this tree, results[i] is a list of the indices of its neighbors in other.data.

Return type:

list of lists

Examples

You can search all pairs of points between two kd-trees within a distance:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from scipy.spatial import cKDTree
>>> rng = np.random.default_rng()
>>> points1 = rng.random((15, 2))
>>> points2 = rng.random((15, 2))
>>> plt.figure(figsize=(6, 6))
>>> plt.plot(points1[:, 0], points1[:, 1], "xk", markersize=14)
>>> plt.plot(points2[:, 0], points2[:, 1], "og", markersize=14)
>>> kd_tree1 = cKDTree(points1)
>>> kd_tree2 = cKDTree(points2)
>>> indexes = kd_tree1.query_ball_tree(kd_tree2, r=0.2)
>>> for i in range(len(indexes)):
...     for j in indexes[i]:
...         plt.plot([points1[i, 0], points2[j, 0]],
...             [points1[i, 1], points2[j, 1]], "-r")
>>> plt.show()
query_pairs(self, r, p=2.0, eps=0.0, output_type='set')#

Find all pairs of points in self whose distance is at most r.

Parameters:
  • r (positive float) – The maximum distance.

  • p (float, optional) – Which Minkowski norm to use. p has to meet the condition 1 <= p <= infinity. A finite large p may cause a ValueError if overflow can occur.

  • eps (float, optional) – Approximate search. Branches of the tree are not explored if their nearest points are further than r/(1+eps), and branches are added in bulk if their furthest points are nearer than r * (1+eps). eps has to be non-negative.

  • output_type (str, optional) – Choose the output container, ‘set’ or ‘ndarray’. Default: ‘set’

Returns:

results – Set of pairs (i,j), with i < j, for which the corresponding positions are close. If output_type is ‘ndarray’, an ndarray is returned instead of a set.

Return type:

set or ndarray

Examples

You can search all pairs of points in a kd-tree within a distance:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from scipy.spatial import cKDTree
>>> rng = np.random.default_rng()
>>> points = rng.random((20, 2))
>>> plt.figure(figsize=(6, 6))
>>> plt.plot(points[:, 0], points[:, 1], "xk", markersize=14)
>>> kd_tree = cKDTree(points)
>>> pairs = kd_tree.query_pairs(r=0.2)
>>> for (i, j) in pairs:
...     plt.plot([points[i, 0], points[j, 0]],
...             [points[i, 1], points[j, 1]], "-r")
>>> plt.show()
size#
sparse_distance_matrix(other, max_distance, p=2.0, output_type='dok_matrix')#

Compute a sparse distance matrix

Computes a distance matrix between two cKDTrees, leaving as zero any distance greater than max_distance.

Parameters:
  • other (cKDTree) – The other KDTree to compute distances against.

  • max_distance (positive float) – Maximum distance within which neighbors are returned. Distances above this value are returned as zero.

  • p (float, 1<=p<=infinity) – Which Minkowski p-norm to use. A finite large p may cause a ValueError if overflow can occur.

  • output_type (str, optional) –

    Which container to use for output data. Options: 'dok_array', 'coo_array', 'dict', or 'ndarray'. Legacy options 'dok_matrix' and 'coo_matrix' are still available. Default: 'dok_matrix'.

    Warning

    dok_matrix and coo_matrix are being replaced.

    All new code using scipy sparse should use sparse array types ‘dok_array’ or ‘coo_array’. The default value of output_type will be deprecated at v1.19 and switch from ‘dok_matrix’ to ‘dok_array’ in v1.21. The values ‘dok_matrix’ and ‘coo_matrix’ continue to work, but will go away eventually.

Returns:

result – Sparse matrix representing the results in “dictionary of keys” format. If a dict is returned the keys are (i,j) tuples of indices. If output_type is 'ndarray' a record array with fields 'i', 'j', and 'v' is returned,

Return type:

dok_array, coo_array, dict or ndarray

Examples

You can compute a sparse distance matrix between two kd-trees:

>>> import numpy as np
>>> from scipy.spatial import cKDTree
>>> rng = np.random.default_rng()
>>> points1 = rng.random((5, 2))
>>> points2 = rng.random((5, 2))
>>> kdtree1 = cKDTree(points1)
>>> kdtree2 = cKDTree(points2)
>>> sdm = kdtree1.sparse_distance_matrix(kdtree2, 0.3, output_type="dok_array")
>>> sdm.toarray()
array([[0.        , 0.        , 0.12295571, 0.        , 0.        ],
   [0.        , 0.        , 0.        , 0.        , 0.        ],
   [0.28942611, 0.        , 0.        , 0.2333084 , 0.        ],
   [0.        , 0.        , 0.        , 0.        , 0.        ],
   [0.24617575, 0.29571802, 0.26836782, 0.        , 0.        ]])

You can check distances above the max_distance are zeros:

>>> from scipy.spatial import distance_matrix
>>> distance_matrix(points1, points2)
array([[0.56906522, 0.39923701, 0.12295571, 0.8658745 , 0.79428925],
   [0.37327919, 0.7225693 , 0.87665969, 0.32580855, 0.75679479],
   [0.28942611, 0.30088013, 0.6395831 , 0.2333084 , 0.33630734],
   [0.31994999, 0.72658602, 0.71124834, 0.55396483, 0.90785663],
   [0.24617575, 0.29571802, 0.26836782, 0.57714465, 0.6473269 ]])
tree#
class moleculekit.kdtree.cKDTreeNode#

Bases: object

class cKDTreeNode

This class exposes a Python view of a node in the cKDTree object.

All attributes are read-only.

level#

The depth of the node. 0 is the level of the root node.

Type:

int

split_dim#

The dimension along which this node is split. If this value is -1 the node is a leafnode in the kd-tree. Leafnodes are not split further and scanned by brute force.

Type:

int

split#

The value used to separate split this node. Points with value >= split in the split_dim dimension are sorted to the ‘greater’ subnode whereas those with value < split are sorted to the ‘lesser’ subnode.

Type:

float

children#

The number of data points sorted to this node.

Type:

int

data_points#

An array with the data points sorted to this node.

Type:

ndarray of float64

indices#

An array with the indices of the data points sorted to this node. The indices refer to the position in the data set used to construct the kd-tree.

Type:

ndarray of intp

lesser#

Subnode with the ‘lesser’ data points. This attribute is None for leafnodes.

Type:

cKDTreeNode or None

greater#

Subnode with the ‘greater’ data points. This attribute is None for leafnodes.

Type:

cKDTreeNode or None

children#
data_points#
end_idx#
greater#
indices#
lesser#
level#
split#
split_dim#
start_idx#
moleculekit.kdtree.minkowski_distance(x, y, p=2.0)#

Compute the L**p distance between two arrays.

The last dimensions of x and y must be the same length. Any other dimensions must be compatible for broadcasting.

Parameters:
  • x ((..., K) array_like) – Input array.

  • y ((..., K) array_like) – Input array.

  • p (float, 1 <= p <= infinity) – Which Minkowski p-norm to use.

Returns:

dist – Distance between the input arrays.

Return type:

ndarray

Examples

>>> from scipy.spatial import minkowski_distance
>>> minkowski_distance([[0, 0], [0, 0]], [[1, 1], [0, 1]])
array([ 1.41421356,  1.        ])
moleculekit.kdtree.minkowski_distance_p(x, y, p=2.0)#

Compute the pth power of the L**p distance between two arrays.

For efficiency, this function computes the L**p distance but does not extract the pth root. If p is 1 or infinity, this is equal to the actual L**p distance.

The last dimensions of x and y must be the same length. Any other dimensions must be compatible for broadcasting.

Parameters:
  • x ((..., K) array_like) – Input array.

  • y ((..., K) array_like) – Input array.

  • p (float, 1 <= p <= infinity) – Which Minkowski p-norm to use.

Returns:

dist – pth power of the distance between the input arrays.

Return type:

ndarray

Examples

>>> from scipy.spatial import minkowski_distance_p
>>> minkowski_distance_p([[0, 0], [0, 0]], [[1, 1], [0, 1]])
array([2., 1.])