README
jaro_winkler is an implementation of the Jaro-Winkler distance algorithm. It is written as a C extension and falls back to a pure Ruby version on platforms other than MRI/KRI, such as JRuby or Rubinius. Both the C and Ruby implementations support any kind of string encoding, such as UTF-8, EUC-JP, Big5, etc.
Installation
```
gem install jaro_winkler
```
Usage
```ruby
require 'jaro_winkler'

# Jaro Winkler Distance
JaroWinkler.distance "MARTHA", "MARHTA"
# => 0.9611
JaroWinkler.distance "MARTHA", "marhta", ignore_case: true
# => 0.9611
JaroWinkler.distance "MARTHA", "MARHTA", weight: 0.2
# => 0.9778

# Jaro Distance
JaroWinkler.jaro_distance "MARTHA", "MARHTA"
# => 0.9444444444444445
```
There is no `JaroWinkler.jaro_winkler_distance`; the name is tediously long.
Options
Name | Type | Default | Note |
---|---|---|---|
ignore_case | boolean | false | All lower case characters are converted to upper case prior to the comparison. |
weight | number | 0.1 | A constant scaling factor for how much the score is adjusted upwards for having common prefixes. |
threshold | number | 0.7 | The prefix bonus is only added when the compared strings have a Jaro distance above the threshold. |
adj_table | boolean | false | The option is used to give partial credit for characters that may be errors due to known phonetic or character recognition errors. A typical example is to match the letter "O" with the number "0". |
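
To illustrate how `weight` and `threshold` interact, here is a minimal pure-Ruby sketch of the Winkler prefix bonus. The helper name `winkler_adjust` is hypothetical and not part of the gem's API, and the cap of four prefix characters is the conventional Jaro-Winkler choice, assumed here rather than taken from the gem's source.

```ruby
# Hypothetical helper (not part of the gem): applies the Winkler prefix bonus
# to an already computed Jaro score.
def winkler_adjust(jaro, s1, s2, weight: 0.1, threshold: 0.7)
  # The bonus is only added when the Jaro score is above the threshold.
  return jaro if jaro <= threshold

  # Length of the common prefix, capped at 4 characters (assumed convention).
  max_prefix = [4, s1.length, s2.length].min
  prefix = 0
  prefix += 1 while prefix < max_prefix && s1[prefix] == s2[prefix]

  # Scale the remaining distance to 1.0 by prefix length and weight.
  jaro + prefix * weight * (1 - jaro)
end

jaro = 17.0 / 18 # Jaro score of "MARTHA" vs "MARHTA" (0.9444...)
puts winkler_adjust(jaro, "MARTHA", "MARHTA").round(4)              # => 0.9611
puts winkler_adjust(jaro, "MARTHA", "MARHTA", weight: 0.2).round(4) # => 0.9778
```

With the shared prefix "MAR" (length 3), the default weight adds 3 × 0.1 × (1 − 0.9444), which reproduces the 0.9611 shown in the usage section.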
Adjusting Table
Default Table
['A', 'E'], ['A', 'I'], ['A', 'O'], ['A', 'U'], ['B', 'V'], ['E', 'I'], ['E', 'O'], ['E', 'U'], ['I', 'O'], ['I', 'U'],
['O', 'U'], ['I', 'Y'], ['E', 'Y'], ['C', 'G'], ['E', 'F'], ['W', 'U'], ['W', 'V'], ['X', 'K'], ['S', 'Z'], ['X', 'S'],
['Q', 'C'], ['U', 'V'], ['M', 'N'], ['L', 'I'], ['Q', 'O'], ['P', 'R'], ['I', 'J'], ['2', 'Z'], ['5', 'S'], ['8', 'B'],
['1', 'I'], ['1', 'L'], ['0', 'O'], ['0', 'Q'], ['C', 'K'], ['G', 'J'], ['E', ' '], ['Y', ' '], ['S', ' ']
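
As a rough illustration of how such a table could be consulted, the sketch below stores a few of the pairs in a set and checks whether two characters count as "similar". The helper names `build_similarity_index` and `similar?` are hypothetical, and treating each pair as symmetric is an assumption; the gem's internal lookup may differ.

```ruby
require 'set'

# Hypothetical helper: turns a list of character pairs into a lookup set.
# Assumption: similarity is symmetric, so each pair is stored in sorted order.
def build_similarity_index(pairs)
  pairs.map { |x, y| [x, y].sort }.to_set
end

# True when the characters differ but appear together in the table.
def similar?(index, c1, c2)
  c1 != c2 && index.include?([c1, c2].sort)
end

# A few pairs copied from the default table; the full list works the same way.
index = build_similarity_index([['A', 'E'], ['B', 'V'], ['0', 'O'], ['1', 'I'], ['5', 'S']])

puts similar?(index, 'O', '0') # => true
puts similar?(index, '0', 'O') # => true
puts similar?(index, 'A', 'B') # => false
puts similar?(index, 'A', 'A') # => false (identical characters are matches, not "similar")
```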
How it works?
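Original Formula:

$$
d_j = \begin{cases}
0 & \text{if } m = 0 \\
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise}
\end{cases}
$$

where

- $m$ is the number of matching characters.
- $t$ is half the number of transpositions.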
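
The Jaro score averages three ratios: the matches `m` over each string's length, and the share of matches that are not transposed, `(m - t) / m`. The following pure-Ruby sketch is for illustration only and is not the gem's actual implementation. The method name `jaro_sketch` is hypothetical, and the matching window (half the longer length, minus one) is the standard Jaro definition, assumed here.

```ruby
# Illustrative pure-Ruby Jaro distance; not the gem's actual implementation.
def jaro_sketch(s1, s2)
  a = s1.chars
  b = s2.chars
  return 0.0 if a.empty? || b.empty?

  # Two equal characters match only if their positions are within this window.
  window = [[a.length, b.length].max / 2 - 1, 0].max
  a_matched = Array.new(a.length, false)
  b_matched = Array.new(b.length, false)

  # Count matching characters (m), pairing each with the first free candidate.
  m = 0
  a.each_with_index do |ch, i|
    lo = [i - window, 0].max
    hi = [i + window, b.length - 1].min
    (lo..hi).each do |j|
      next if b_matched[j] || b[j] != ch
      a_matched[i] = b_matched[j] = true
      m += 1
      break
    end
  end
  return 0.0 if m.zero?

  # Matched characters that line up out of order; t is half that count.
  a_seq = a.select.with_index { |_, i| a_matched[i] }
  b_seq = b.select.with_index { |_, j| b_matched[j] }
  t = a_seq.zip(b_seq).count { |x, y| x != y } / 2.0

  (m.to_f / a.length + m.to_f / b.length + (m - t) / m) / 3
end

puts jaro_sketch("MARTHA", "MARHTA").round(4) # => 0.9444
```

For "MARTHA" and "MARHTA", all six characters match (`m = 6`) and the swapped "TH"/"HT" gives `t = 1`, so the score is (1 + 1 + 5/6) / 3 ≈ 0.9444, in line with `JaroWinkler.jaro_distance` in the usage section.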
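With Adjusting Table:

$$
d_j = \begin{cases}
0 & \text{if } m = 0 \\
\dfrac{1}{3}\left(\dfrac{\frac{s}{10} + m}{|s_1|} + \dfrac{\frac{s}{10} + m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise}
\end{cases}
$$

where

- $s$ is the number of nonmatching but similar characters.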
Why This?
There is another similar gem named fuzzy-string-match, which also provides both C and Ruby versions.
I reinvented this wheel because the naming in fuzzy-string-match, such as `getDistance`, breaks Ruby convention, and it contains some odd code like `a1 = s1.split( // )` (`s1.chars` would be better). Furthermore, it's bugged (see tables below).
Compare with other gems
Feature | jaro_winkler | fuzzystringmatch | hotwater | amatch |
---|---|---|---|---|
Encoding Support | Yes | Pure Ruby only | No | No |
Windows Support | Yes | ? | No | Yes |
Adjusting Table | Yes | No | No | No |
Native | Yes | Yes | Yes | Yes |
Pure Ruby | Yes | Yes | No | No |
Speed | 1st | 3rd | 2nd | 4th |
The table below compares the accuracy of each gem:
str_1 | str_2 | origin | jaro_winkler | fuzzystringmatch | hotwater | amatch |
---|---|---|---|---|---|---|
"henka" | "henkan" | 0.9667 | 0.9667 | 0.9722 | 0.9667 | 0.9444 |
"al" | "al" | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
"martha" | "marhta" | 0.9611 | 0.9611 | 0.9611 | 0.9611 | 0.9444 |
"jones" | "johnson" | 0.8324 | 0.8324 | 0.8324 | 0.8324 | 0.7905 |
"abcvwxyz" | "cabvwxyz" | 0.9583 | 0.9583 | 0.9583 | 0.9583 | 0.9583 |
"dwayne" | "duane" | 0.84 | 0.84 | 0.84 | 0.84 | 0.8222 |
"dixon" | "dicksonx" | 0.8133 | 0.8133 | 0.8133 | 0.8133 | 0.7667 |
"fvie" | "ten" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
- The "origin" result is from the original C implementation by the author of the algorithm.
- Test data are borrowed from fuzzy-string-match's rspec file.
Benchmark
```
$ bundle exec rake benchmark
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin16]

# C Extension
Rehearsal --------------------------------------------------------------
jaro_winkler (8c16e09)       0.240000   0.000000   0.240000 (  0.241347)
fuzzy-string-match (1.0.1)   0.400000   0.010000   0.410000 (  0.403673)
hotwater (0.1.2)             0.250000   0.000000   0.250000 (  0.254503)
amatch (0.4.0)               0.870000   0.000000   0.870000 (  0.875930)
----------------------------------------------------- total: 1.770000sec

                                 user     system      total        real
jaro_winkler (8c16e09)       0.230000   0.000000   0.230000 (  0.236921)
fuzzy-string-match (1.0.1)   0.380000   0.000000   0.380000 (  0.381942)
hotwater (0.1.2)             0.250000   0.000000   0.250000 (  0.254977)
amatch (0.4.0)               0.860000   0.000000   0.860000 (  0.861207)

# Pure Ruby
Rehearsal --------------------------------------------------------------
jaro_winkler (8c16e09)       0.440000   0.000000   0.440000 (  0.438470)
fuzzy-string-match (1.0.1)   0.860000   0.000000   0.860000 (  0.862850)
----------------------------------------------------- total: 1.300000sec

                                 user     system      total        real
jaro_winkler (8c16e09)       0.440000   0.000000   0.440000 (  0.439237)
fuzzy-string-match (1.0.1)   0.910000   0.010000   0.920000 (  0.920259)
```
Todo
- Custom adjusting word table.