Welcome to weighted-levenshtein’s documentation!

weighted-levenshtein

Installation

pip install weighted-levenshtein

Usage Example

import numpy as np
from weighted_levenshtein import lev, osa, dam_lev

insert_costs = np.ones(128)  # make an array of all 1's
insert_costs[ord('D')] = 1.5  # make inserting the character 'D' have cost 1.5 (instead of 1)

print(lev('BANANAS', 'BANDANAS', insert_costs=insert_costs))  # prints '1.5'

lev, osa, and dam_lev are aliases for levenshtein, optimal_string_alignment, and damerau_levenshtein, respectively.
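The example above changes a single insertion cost. The same pattern should carry over to the 2D cost tables, so here is a minimal sketch of a helper that builds a (128, 128) array of np.float64 with selected entries overridden. The helper name make_substitute_costs and the 1.25 cost are illustrative assumptions, not part of the library.

```python
import numpy as np


def make_substitute_costs(overrides, default=1.0):
    """Build a (128, 128) float64 table indexed by ord() values.

    `overrides` maps (from_char, to_char) pairs to a cost.
    This helper is hypothetical and only illustrates the array layout.
    """
    costs = np.full((128, 128), default, dtype=np.float64)
    for (src, dst), cost in overrides.items():
        # Row is the character being replaced, column is its replacement.
        costs[ord(src), ord(dst)] = cost
    return costs


# Substituting 'H' with 'B' costs 1.25; the reverse direction keeps the default.
substitute_costs = make_substitute_costs({('H', 'B'): 1.25})
```

The resulting array is meant to be passed as the substitute_costs keyword argument. Note that the table is directional: entry [i, j] and entry [j, i] are set independently.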

Important Notes

  • The cost parameters accept only numpy arrays, because the underlying Cython implementation relies on them for fast lookups. The arrays are indexed by the ord() value of each character, so only the 128 ASCII characters (0..127) are supported, and dict and list arguments are not accepted. For the same reason, the strings must be strictly str objects, not unicode.
  • This library was built with only Python 2 in mind. Python 3 compatibility is untested.
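Because the cost arrays have only 128 entries, it can help to validate input before computing a distance. The following is a minimal sketch of such a check; check_ascii is a hypothetical helper name, not a function provided by the library.

```python
def check_ascii(s):
    """Return s unchanged, or raise ValueError if it cannot index a 128-entry table."""
    for pos, ch in enumerate(s):
        # ord() is the index used for the cost arrays, so it must stay within 0..127.
        if ord(ch) > 127:
            raise ValueError("non-ASCII character %r at index %d" % (ch, pos))
    return s


check_ascii('BANANAS')  # passes: every character is ASCII
```

A string such as 'caf\xe9' would be rejected by this check, since its last character has an ord() value of 233.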

Use as Cython library

TODO

Distribution

Since not every machine has Cython installed, we distribute the C code that Cython generates. To generate it, run setup.sh. This produces a .so file and also the .c file, which can be distributed and compiled on any machine with a C compiler. Consequently, the distribution on PyPI contains only the .c file.

Functions

weighted_levenshtein.levenshtein(unsigned char *str1, unsigned char *str2, __Pyx_memviewslice insert_costs=None, __Pyx_memviewslice delete_costs=None, __Pyx_memviewslice substitute_costs=None)

Calculates the Levenshtein distance between str1 and str2, provided the costs of inserting, deleting, and substituting characters. The costs default to 1 if not provided.

For convenience, this function is aliased as clev.lev().

Parameters:
  • str1 (str) – first string
  • str2 (str) – second string
  • insert_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where insert_costs[i] is the cost of inserting ASCII character i
  • delete_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where delete_costs[i] is the cost of deleting ASCII character i
  • substitute_costs (np.ndarray) – a 2D numpy array of np.float64 (C doubles) of dimensions (128, 128), where substitute_costs[i, j] is the cost of substituting ASCII character i with ASCII character j
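To make the role of the three cost tables concrete, here is a pure-Python reference sketch of a weighted Levenshtein recurrence. It is not the library's Cython implementation: the function name weighted_lev_sketch is hypothetical, and plain lists stand in for the numpy arrays so the sketch has no dependencies.

```python
def weighted_lev_sketch(s1, s2, insert_costs=None, delete_costs=None,
                        substitute_costs=None):
    """Reference sketch only; tables are lists indexed by ord(), defaulting to 1."""
    ins = insert_costs if insert_costs is not None else [1.0] * 128
    dele = delete_costs if delete_costs is not None else [1.0] * 128
    sub = (substitute_costs if substitute_costs is not None
           else [[1.0] * 128 for _ in range(128)])
    a = [ord(c) for c in s1]
    b = [ord(c) for c in s2]
    n, m = len(a), len(b)

    # d[i][j] is the cheapest way to turn s1[:i] into s2[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele[a[i - 1]]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins[b[j - 1]]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Matching characters cost nothing; otherwise look up the substitution.
            step = 0.0 if a[i - 1] == b[j - 1] else sub[a[i - 1]][b[j - 1]]
            d[i][j] = min(
                d[i - 1][j - 1] + step,          # match or substitute
                d[i][j - 1] + ins[b[j - 1]],     # insert a character of s2
                d[i - 1][j] + dele[a[i - 1]],    # delete a character of s1
            )
    return d[n][m]


insert_costs = [1.0] * 128
insert_costs[ord('D')] = 1.5
result = weighted_lev_sketch('BANANAS', 'BANDANAS', insert_costs=insert_costs)
```

With these inputs the sketch returns 1.5, matching the usage example: the cheapest edit is the single insertion of 'D'.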

weighted_levenshtein.optimal_string_alignment(unsigned char *str1, unsigned char *str2, __Pyx_memviewslice insert_costs=None, __Pyx_memviewslice delete_costs=None, __Pyx_memviewslice substitute_costs=None, __Pyx_memviewslice transpose_costs=None)

Calculates the Optimal String Alignment distance between str1 and str2, provided the costs of inserting, deleting, substituting, and transposing characters. The costs default to 1 if not provided.

For convenience, this function is aliased as clev.osa().

Parameters:
  • str1 (str) – first string
  • str2 (str) – second string
  • insert_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where insert_costs[i] is the cost of inserting ASCII character i
  • delete_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where delete_costs[i] is the cost of deleting ASCII character i
  • substitute_costs (np.ndarray) – a 2D numpy array of np.float64 (C doubles) of dimensions (128, 128), where substitute_costs[i, j] is the cost of substituting ASCII character i with ASCII character j
  • transpose_costs (np.ndarray) – a 2D numpy array of np.float64 (C doubles) of dimensions (128, 128), where transpose_costs[i, j] is the cost of transposing ASCII characters i and j when character i is followed by character j in the string
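The sketch below illustrates how a transposition term can be added to the weighted recurrence to obtain an Optimal String Alignment distance. It is a pure-Python illustration, not the library's code: the name weighted_osa_sketch is hypothetical, lists stand in for numpy arrays, and indexing the transposition table by the pair as it appears in str1 is an assumption based on the parameter description above.

```python
def weighted_osa_sketch(s1, s2, insert_costs=None, delete_costs=None,
                        substitute_costs=None, transpose_costs=None):
    """Reference sketch only; tables are lists indexed by ord(), defaulting to 1."""
    def table_1d(t):
        return t if t is not None else [1.0] * 128

    def table_2d(t):
        return t if t is not None else [[1.0] * 128 for _ in range(128)]

    ins, dele = table_1d(insert_costs), table_1d(delete_costs)
    sub, trans = table_2d(substitute_costs), table_2d(transpose_costs)
    a = [ord(c) for c in s1]
    b = [ord(c) for c in s2]
    n, m = len(a), len(b)

    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele[a[i - 1]]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins[b[j - 1]]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0.0 if a[i - 1] == b[j - 1] else sub[a[i - 1]][b[j - 1]]
            best = min(
                d[i - 1][j - 1] + step,
                d[i][j - 1] + ins[b[j - 1]],
                d[i - 1][j] + dele[a[i - 1]],
            )
            # Adjacent pair in s1 appears swapped in s2: allow one transposition.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # Assumption: row is the first character of the pair in s1.
                best = min(best, d[i - 2][j - 2] + trans[a[i - 2]][a[i - 1]])
            d[i][j] = best
    return d[n][m]


cheap_swap = [[1.0] * 128 for _ in range(128)]
cheap_swap[ord('a')][ord('b')] = 0.5  # 'ab' -> 'ba' costs 0.5 instead of 1
swapped = weighted_osa_sketch('ab', 'ba', transpose_costs=cheap_swap)
```

In this sketch a transposition only applies to characters that are adjacent in both strings, so no substring is edited more than once.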

weighted_levenshtein.damerau_levenshtein(unsigned char *str1, unsigned char *str2, __Pyx_memviewslice insert_costs=None, __Pyx_memviewslice delete_costs=None, __Pyx_memviewslice substitute_costs=None, __Pyx_memviewslice transpose_costs=None)

Calculates the Damerau-Levenshtein distance between str1 and str2, provided the costs of inserting, deleting, substituting, and transposing characters. The costs default to 1 if not provided.

For convenience, this function is aliased as clev.dam_lev().

Parameters:
  • str1 (str) – first string
  • str2 (str) – second string
  • insert_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where insert_costs[i] is the cost of inserting ASCII character i
  • delete_costs (np.ndarray) – a numpy array of np.float64 (C doubles) of length 128 (0..127), where delete_costs[i] is the cost of deleting ASCII character i
  • substitute_costs (np.ndarray) – a 2D numpy array of np.float64 (C doubles) of dimensions (128, 128), where substitute_costs[i, j] is the cost of substituting ASCII character i with ASCII character j
  • transpose_costs (np.ndarray) – a 2D numpy array of np.float64 (C doubles) of dimensions (128, 128), where transpose_costs[i, j] is the cost of transposing ASCII characters i and j when character i is followed by character j in the string
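Unlike Optimal String Alignment, the unrestricted Damerau-Levenshtein distance lets a transposed pair be separated by other edits. The following pure-Python sketch shows one common way to extend the recurrence with weights. It is an illustration under assumptions, not the library's implementation: the name weighted_dam_lev_sketch is hypothetical, lists stand in for numpy arrays, and characters skipped between the transposed pair are charged their individual delete and insert costs.

```python
def weighted_dam_lev_sketch(s1, s2, insert_costs=None, delete_costs=None,
                            substitute_costs=None, transpose_costs=None):
    """Reference sketch only; tables are lists indexed by ord(), defaulting to 1."""
    def table_1d(t):
        return t if t is not None else [1.0] * 128

    def table_2d(t):
        return t if t is not None else [[1.0] * 128 for _ in range(128)]

    ins, dele = table_1d(insert_costs), table_1d(delete_costs)
    sub, trans = table_2d(substitute_costs), table_2d(transpose_costs)
    a = [ord(c) for c in s1]
    b = [ord(c) for c in s2]
    n, m = len(a), len(b)

    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele[a[i - 1]]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins[b[j - 1]]

    last_row = [0] * 128  # last 1-based position in s1 where each character was seen
    for i in range(1, n + 1):
        last_col = 0  # last 1-based position in s2 (left of j) equal to a[i - 1]
        for j in range(1, m + 1):
            row, col = last_row[b[j - 1]], last_col
            if a[i - 1] == b[j - 1]:
                step = 0.0
                last_col = j
            else:
                step = sub[a[i - 1]][b[j - 1]]
            best = min(
                d[i - 1][j - 1] + step,
                d[i][j - 1] + ins[b[j - 1]],
                d[i - 1][j] + dele[a[i - 1]],
            )
            if row > 0 and col > 0:
                # Delete what lies between the pair in s1, swap the pair,
                # then insert what lies between the pair in s2.
                skipped = sum(dele[c] for c in a[row:i - 1])
                added = sum(ins[c] for c in b[col:j - 1])
                best = min(best, d[row - 1][col - 1] + skipped
                           + trans[a[row - 1]][a[i - 1]] + added)
            d[i][j] = best
        last_row[a[i - 1]] = i
    return d[n][m]


distance = weighted_dam_lev_sketch('CA', 'ABC')
```

With unit costs, 'CA' to 'ABC' is the classic case where the two measures differ: this sketch returns 2 (transpose, then insert 'B' between the pair), whereas an Optimal String Alignment distance is 3.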