Welcome to weighted-levenshtein’s documentation!
weighted-levenshtein
Installation
pip install weighted-levenshtein
Usage Example
import numpy as np
from weighted_levenshtein import lev, osa, dam_lev
insert_costs = np.ones(128) # make an array of all 1's
insert_costs[ord('D')] = 1.5 # make inserting the character 'D' have cost 1.5 (instead of 1)
print(lev('BANANAS', 'BANDANAS', insert_costs=insert_costs)) # prints '1.5'
lev, osa, and dam_lev are aliases for levenshtein, optimal_string_alignment, and damerau_levenshtein, respectively.
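The same pattern extends to the other cost parameters. The sketch below builds a delete_costs vector and a substitute_costs matrix with the shapes and dtype described under Functions. The try/except guard is an addition for illustration, so that the block still runs when the package is not installed; the expected values in the comments follow from the documented cost semantics.

```python
import numpy as np

# Deleting 'S' costs 0.5; every other deletion keeps the default cost of 1.
delete_costs = np.ones(128, dtype=np.float64)
delete_costs[ord('S')] = 0.5

# Substituting 'H' with 'B' costs 1.25; every other substitution costs 1.
substitute_costs = np.ones((128, 128), dtype=np.float64)
substitute_costs[ord('H'), ord('B')] = 1.25

try:
    from weighted_levenshtein import lev
except ImportError:
    lev = None  # package not installed; the arrays above are still valid

if lev is not None:
    print(lev('BANANAS', 'BANANA', delete_costs=delete_costs))        # expected: 0.5
    print(lev('HANANA', 'BANANA', substitute_costs=substitute_costs))  # expected: 1.25
```

Note that the substitution matrix is directional: substitute_costs[ord('H'), ord('B')] affects only replacing 'H' with 'B', not the reverse.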
Detailed Documentation
Important Notes
- The cost parameters accept only numpy arrays, since the underlying Cython implementation relies on this for fast lookups. The numpy arrays are indexed using the ord() value of the characters. Thus, only the 128 ASCII characters (ord() values 0..127) are accepted, and dict and list are not accepted. Consequently, the strings must be strictly str objects, not unicode.
- This library was built with only Python 2 in mind. Python 3 compatibility is untested.
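Because the cost arrays are indexed by ord(), it can help to reject non-ASCII input before calling the library. The helper below is a hypothetical convenience function, not part of weighted-levenshtein; it only checks that every character falls in the 0..127 range.

```python
def is_ascii_only(text):
    """Return True if every character has an ord() value in 0..127.

    Hypothetical pre-check; the library itself does not provide it.
    """
    return all(ord(ch) < 128 for ch in text)


print(is_ascii_only('BANANAS'))      # True
print(is_ascii_only(u'caf\u00e9'))   # False: U+00E9 is outside ASCII
```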
Wikipedia links
Levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance and https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm
Optimal String Alignment: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance
Damerau-Levenshtein distance: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions
Use as Cython library
TODO
Distribution
Since not every machine has Cython installed, we distribute the C code that was compiled from Cython. To compile to C, run setup.sh. This generates a .so file and also the .c file, which can be distributed and compiled on any machine with a C compiler. Consequently, the distribution on PyPI contains only the .c file.
Functions
weighted_levenshtein.levenshtein(unsigned char *str1, unsigned char *str2, __Pyx_memviewslice insert_costs=None, __Pyx_memviewslice delete_costs=None, __Pyx_memviewslice substitute_costs=None)
Calculates the Levenshtein distance between str1 and str2, provided the costs of inserting, deleting, and substituting characters. The costs default to 1 if not provided.
For convenience, this function is aliased as clev.lev().
Parameters:
- str1 (str) – first string
- str2 (str) – second string
- insert_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where insert_costs[i] is the cost of inserting ASCII character i
- delete_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where delete_costs[i] is the cost of deleting ASCII character i
- substitute_costs (np.ndarray) – a 2D numpy array of np.float64 (C doubles) of dimensions (128, 128), where substitute_costs[i, j] is the cost of substituting ASCII character i with ASCII character j
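To make the role of each cost array concrete, here is a pure-Python sketch of a weighted Wagner-Fischer recurrence. It is an illustration only, not the library's Cython implementation, and it may differ from it in edge cases. The function name weighted_lev_sketch is hypothetical, and plain lists stand in for numpy arrays (both support the costs[i][j] indexing used here).

```python
def weighted_lev_sketch(s1, s2, insert_costs=None, delete_costs=None,
                        substitute_costs=None):
    """Weighted Levenshtein distance via dynamic programming (sketch)."""
    ones = [1.0] * 128  # every cost defaults to 1
    ins = insert_costs if insert_costs is not None else ones
    dele = delete_costs if delete_costs is not None else ones
    sub = substitute_costs if substitute_costs is not None else [ones] * 128

    a = [ord(c) for c in s1]
    b = [ord(c) for c in s2]
    n, m = len(a), len(b)

    # d[i][j] = cheapest way to turn s1[:i] into s2[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele[a[i - 1]]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins[b[j - 1]]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x, y = a[i - 1], b[j - 1]
            step = 0.0 if x == y else sub[x][y]  # matching characters are free
            d[i][j] = min(d[i - 1][j] + dele[x],      # delete x from s1
                          d[i][j - 1] + ins[y],       # insert y into s1
                          d[i - 1][j - 1] + step)     # substitute x with y
    return d[n][m]


insert_costs = [1.0] * 128
insert_costs[ord('D')] = 1.5
print(weighted_lev_sketch('BANANAS', 'BANDANAS', insert_costs=insert_costs))  # 1.5
```

The single edit needed is inserting 'D', so the result equals insert_costs[ord('D')].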
weighted_levenshtein.optimal_string_alignment(unsigned char *str1, unsigned char *str2, __Pyx_memviewslice insert_costs=None, __Pyx_memviewslice delete_costs=None, __Pyx_memviewslice substitute_costs=None, __Pyx_memviewslice transpose_costs=None)
Calculates the Optimal String Alignment distance between str1 and str2, provided the costs of inserting, deleting, substituting, and transposing characters. The costs default to 1 if not provided.
For convenience, this function is aliased as clev.osa().
Parameters:
- str1 (str) – first string
- str2 (str) – second string
- insert_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where insert_costs[i] is the cost of inserting ASCII character i
- delete_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where delete_costs[i] is the cost of deleting ASCII character i
- substitute_costs (np.ndarray) – a 2D numpy array of np.float64 (C doubles) of dimensions (128, 128), where substitute_costs[i, j] is the cost of substituting ASCII character i with ASCII character j
- transpose_costs (np.ndarray) – a 2D numpy array of np.float64 (C doubles) of dimensions (128, 128), where transpose_costs[i, j] is the cost of transposing ASCII character i with ASCII character j, where character i is followed by character j in the string
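The transposition rule can be sketched in pure Python as well. The function below, with the hypothetical name weighted_osa_sketch, adds one case to the Levenshtein recurrence: swapping two adjacent characters, charged as transpose_costs[i][j] for character i followed by character j in the first string. It illustrates the distance definition under these assumptions and is not the library's Cython code.

```python
def weighted_osa_sketch(s1, s2, insert_costs=None, delete_costs=None,
                        substitute_costs=None, transpose_costs=None):
    """Weighted Optimal String Alignment distance (sketch)."""
    ones = [1.0] * 128  # every cost defaults to 1
    ins = insert_costs if insert_costs is not None else ones
    dele = delete_costs if delete_costs is not None else ones
    sub = substitute_costs if substitute_costs is not None else [ones] * 128
    tra = transpose_costs if transpose_costs is not None else [ones] * 128

    a = [ord(c) for c in s1]
    b = [ord(c) for c in s2]
    n, m = len(a), len(b)

    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele[a[i - 1]]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins[b[j - 1]]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x, y = a[i - 1], b[j - 1]
            step = 0.0 if x == y else sub[x][y]
            best = min(d[i - 1][j] + dele[x],
                       d[i][j - 1] + ins[y],
                       d[i - 1][j - 1] + step)
            # Adjacent swap: s1 ends in (y, x) where s2 ends in (x, y).
            if i > 1 and j > 1 and x == b[j - 2] and a[i - 2] == y:
                best = min(best, d[i - 2][j - 2] + tra[a[i - 2]][x])
            d[i][j] = best
    return d[n][m]


print(weighted_osa_sketch('AB', 'BA'))    # 1.0: one transposition
print(weighted_osa_sketch('CA', 'ABC'))   # 3.0: OSA never edits a swapped pair again
```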
weighted_levenshtein.damerau_levenshtein(unsigned char *str1, unsigned char *str2, __Pyx_memviewslice insert_costs=None, __Pyx_memviewslice delete_costs=None, __Pyx_memviewslice substitute_costs=None, __Pyx_memviewslice transpose_costs=None)
Calculates the Damerau-Levenshtein distance between str1 and str2, provided the costs of inserting, deleting, substituting, and transposing characters. The costs default to 1 if not provided.
For convenience, this function is aliased as clev.dam_lev().
Parameters:
- str1 (str) – first string
- str2 (str) – second string
- insert_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where insert_costs[i] is the cost of inserting ASCII character i
- delete_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where delete_costs[i] is the cost of deleting ASCII character i
- substitute_costs (np.ndarray) – a 2D numpy array of np.float64 (C doubles) of dimensions (128, 128), where substitute_costs[i, j] is the cost of substituting ASCII character i with ASCII character j
- transpose_costs (np.ndarray) – a 2D numpy array of np.float64 (C doubles) of dimensions (128, 128), where transpose_costs[i, j] is the cost of transposing ASCII character i with ASCII character j, where character i is followed by character j in the string
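Unlike Optimal String Alignment, the Damerau-Levenshtein distance allows further edits between two transposed characters. The pure-Python sketch below adapts the textbook unrestricted algorithm to per-character costs, using prefix sums to price the deletions and insertions that fall between the swapped pair. The name weighted_dam_lev_sketch and the prefix-sum bookkeeping are assumptions made for illustration; the library's Cython implementation may be organized differently.

```python
def weighted_dam_lev_sketch(s1, s2, insert_costs=None, delete_costs=None,
                            substitute_costs=None, transpose_costs=None):
    """Weighted Damerau-Levenshtein distance, unrestricted variant (sketch)."""
    ones = [1.0] * 128  # every cost defaults to 1
    ins = insert_costs if insert_costs is not None else ones
    dele = delete_costs if delete_costs is not None else ones
    sub = substitute_costs if substitute_costs is not None else [ones] * 128
    tra = transpose_costs if transpose_costs is not None else [ones] * 128

    a = [ord(c) for c in s1]
    b = [ord(c) for c in s2]
    n, m = len(a), len(b)

    # del_pre[i] = cost of deleting s1[:i]; ins_pre[j] = cost of inserting s2[:j]
    del_pre = [0.0] * (n + 1)
    for i in range(n):
        del_pre[i + 1] = del_pre[i] + dele[a[i]]
    ins_pre = [0.0] * (m + 1)
    for j in range(m):
        ins_pre[j + 1] = ins_pre[j] + ins[b[j]]

    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = del_pre[i]
    for j in range(m + 1):
        d[0][j] = ins_pre[j]

    last_row = {}  # character -> last 1-based position in s1 before row i
    for i in range(1, n + 1):
        x = a[i - 1]
        last_col = 0  # last 1-based position in s2 (before j) that matched x
        for j in range(1, m + 1):
            y = b[j - 1]
            k = last_row.get(y, 0)
            l = last_col
            step = 0.0 if x == y else sub[x][y]
            best = min(d[i - 1][j] + dele[x],
                       d[i][j - 1] + ins[y],
                       d[i - 1][j - 1] + step)
            if k > 0 and l > 0:
                # Delete s1[k:i-1], swap the pair (s1[k-1], x), insert s2[l:j-1].
                best = min(best,
                           d[k - 1][l - 1]
                           + (del_pre[i - 1] - del_pre[k])
                           + tra[a[k - 1]][x]
                           + (ins_pre[j - 1] - ins_pre[l]))
            d[i][j] = best
            if x == y:
                last_col = j
        last_row[x] = i
    return d[n][m]


print(weighted_dam_lev_sketch('CA', 'ABC'))  # 2.0: swap to 'AC', then insert 'B'
```

For comparison, the Optimal String Alignment distance for the same pair is 3, because it cannot insert 'B' between the two characters it has just swapped.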