Groovy Text Similarity
Published: 2025-02-03 08:30PM
Introduction
Let’s build a wordle-like word guessing game. But instead of telling you how many correct and misplaced letters you have, we’ll give you more hints, though slightly less obvious ones, to make it a little more challenging!
We won’t (directly) even tell you how many letters are in the word, but we’ll give hints like:
- How close your guess sounds to the hidden word.
- How close your guess is to the meaning of the hidden word.
- Instead of correct and misplaced letters, some distance and similarity measures which give you clues about how many correct letters you have, whether you have the correct letters in order, and so forth.
If you are new to Groovy, consider checking out this Groovy game building tutorial first.
Background
String similarity tries to answer the question: is one word (or phrase, sentence, paragraph, document) the same as, or similar to, another. This is an important capability in many scenarios involving searching or asking questions or invoking commands. It can help answer questions like:
- Are two Jira/GitHub issues duplicates of the same issue?
- Are two (or more) customer records actually for the same customer?
- Is some social media topic trending because multiple posts, even though they contain slightly different words, are really about the same thing?
- Can I understand some natural language customer request even when it contains spelling mistakes?
- As a doctor, can I find a medical journal paper discussing a patient’s medical diagnosis/symptoms/treatment?
- As a programmer, can I find a solution to my coding problem?
When comparing two things to see if they are the same, we often want to incorporate a certain amount of fuzziness:
- Are two words the same except for case? E.g. cat and Cat.
- Are two words the same except for spelling variations? E.g. Clare, Claire, and Clair.
- Are two words the same except for a misspelling or typo? E.g. mathematiks and computter.
- Are two phrases the same except for order and/or inconsequential words? E.g. the red car and the car of colour red.
- Do the words sound the same? E.g. there, their, and they’re.
- Are the words variations meaning the same (or similar) thing? E.g. cow, bull, calf, bovine, cattle, or livestock.
Very simple cases can typically be explicitly catered for by hand, e.g. using String library methods or regular expressions:
assert 'Cat'.equalsIgnoreCase('cat')
assert ['color', 'Colour'].every { it =~ '[Cc]olou?r' }
Handling cases explicitly like this soon becomes tedious. We’ll look at some libraries which can help us handle comparisons in more general ways.
First, we’ll examine two libraries for performing similarity matching using string metrics:
- info.debatty:java-string-similarity
- org.apache.commons:commons-text (Apache Commons Text)
Then we’ll look at some libraries for phonetic matching:
- commons-codec:commons-codec (Apache Commons Codec) for Soundex and Metaphone
- org.openrefine:main (OpenRefine) for Metaphone3
Then we’ll look at some deep learning options for increased semantic matching:
- org.deeplearning4j:deeplearning4j-nlp for Glove, ConceptNet, and FastText models
- ai.djl with PyTorch for an Angle model and TensorFlow for a universal-sentence-encoder model
Simple String Metrics
String metrics provide some sort of measure of the sameness of the characters in words (or phrases).
These algorithms generally compute similarity or distance (inverse similarity). The two are related.
Consider cat vs hat. There is one "edit" to change from one word to the other according to the Levenshtein algorithm (described shortly). So, we give this a distance of 1. We can alternatively produce a normalized similarity value using (word_size - distance) / word_size, where word_size is the size of the larger of the two words. So we’d get a Levenshtein similarity measure of 2/3 (or 0.67) for cat vs hat.
For our game, we’ll sometimes want the distance, other times we’ll use the similarity.
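As a minimal sketch of that calculation (the @Grab version is just for illustration), here is the Levenshtein distance from Apache Commons Text and the derived normalized similarity:
@Grab('org.apache.commons:commons-text:1.10.0')
import org.apache.commons.text.similarity.LevenshteinDistance

// normalized similarity derived from the Levenshtein distance
def levenshteinSimilarity(String a, String b) {
    int distance = LevenshteinDistance.defaultInstance.apply(a, b)
    int wordSize = Math.max(a.size(), b.size())
    (wordSize - distance) / wordSize
}

assert LevenshteinDistance.defaultInstance.apply('cat', 'hat') == 1
printf '%.2f%n', levenshteinSimilarity('cat', 'hat')   // 0.67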
There are numerous tutorials (see further information) that describe various string metric algorithms. We won’t replicate those tutorials but here is a summary of some common ones:
Algorithm | Description |
---|---|
Levenshtein | The minimum number of "edits" (inserts, deletes, or substitutions) required to convert from one word to another. |
Jaccard | Defines a ratio between two sample sets. This could be sets of characters in a word, or words in a sentence. |
Hamming | Similar to Levenshtein but insertions and deletions aren’t allowed, so it only applies to words of the same length. |
Longest common subsequence | The maximum number of characters appearing in order in the two words, not necessarily consecutively. |
Jaro-Winkler | A metric also measuring edit distance but weighting edits to favor words with common prefixes. |
You may be wondering what practical use these algorithms might have. Here are just a few use cases:
- Longest common subsequence is the algorithm behind the popular diff tool
- Hamming distance is an important metric when designing algorithms for error detection, error correction and checksums
- Levenshtein is used in search engines (like Apache Lucene and Apache Solr) for fuzzy matching searches and for spelling correction software
Groovy in fact has a built-in example of using the Damerau-Levenshtein distance metric.
This variant counts transposing two adjacent characters within the original word as one "edit".
The Levenshtein distance of fish and ifsh is 2.
The Damerau-Levenshtein distance of fish and ifsh is 1.
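As a quick check using the java-string-similarity library mentioned above (a minimal sketch; the @Grab version is illustrative):
@Grab('info.debatty:java-string-similarity:2.0.0')
import info.debatty.java.stringsimilarity.Damerau
import info.debatty.java.stringsimilarity.Levenshtein

// the swapped 'f' and 'i' count as two edits for Levenshtein but only one for Damerau
assert new Levenshtein().distance('fish', 'ifsh') == 2
assert new Damerau().distance('fish', 'ifsh') == 1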
As an example, for this code:
'foo'.toUpper()
Groovy will give this error message:
No signature of method: java.lang.String.toUpper() is applicable for argument types: () values: [] Possible solutions: toUpperCase(), toURI(), toURL(), toURI(), toURL(), toSet()
And this code:
'foo'.touppercase()
Gives this error:
No signature of method: java.lang.String.touppercase() is applicable for argument types: () values: [] Possible solutions: toUpperCase(), toUpperCase(java.util.Locale), toLowerCase(), toLowerCase(java.util.Locale)
The values returned as possible solutions are the best methods found for the String class according to the Damerau-Levenshtein distance, but with some smarts added to cater for closest matching parameters. This includes Groovy’s String extension methods and (if using dynamic metaprogramming) any methods added at runtime.
Note: You’ll frequently get feedback about using incorrect methods from your IDE or the Groovy compiler (with static checking enabled). But having this feedback provides a great fallback, especially when using Groovy scripts outside an IDE.
Let’s now look at some examples of running various string metric algorithms.
We’ll use algorithms from Apache Commons Text and the info.debatty:java-string-similarity library.
Both these libraries support numerous string metric classes. Methods to calculate both similarity and distance are provided. We’ll look at both in turn.
First, let’s look at some similarity measures. These typically range from 0 (meaning no similarity) to 1 (meaning they are the same).
We’ll look at the following subset of similarity measures from the two libraries. Note that there is a fair bit of overlap between the two libraries. We’ll do a little cross-checking between the two libraries but won’t compare them exhaustively.
var simAlgs = [
NormalizedLevenshtein: new NormalizedLevenshtein()::similarity,
'Jaccard (debatty k=1)': new Jaccard(1)::similarity,
'Jaccard (debatty k=2)': new Jaccard(2)::similarity,
'Jaccard (debatty k=3)': new Jaccard()::similarity,
'Jaccard (commons text k=1)': new JaccardSimilarity()::apply,
'JaroWinkler (debatty)': new JaroWinkler()::similarity,
'JaroWinkler (commons text)': new JaroWinklerSimilarity()::apply,
RatcliffObershelp: new RatcliffObershelp()::similarity,
SorensenDice: new SorensenDice()::similarity,
Cosine: new Cosine()::similarity,
]
In the sample code, we run these measures for the following pairs:
var pairs = [
['there', 'their'],
['cat', 'hat'],
['cat', 'kitten'],
['cat', 'dog'],
['bear', 'bare'],
['bear', 'bean'],
['pair', 'pear'],
['sort', 'sought'],
['cow', 'bull'],
['winners', 'grinners'],
['knows', 'nose'],
['ground', 'aground'],
['grounds', 'aground'],
['peeler', 'repeal'],
['hippo', 'hippopotamus'],
['elton john', 'john elton'],
['elton john', 'nhoj notle'],
['my name is Yoda', 'Yoda my name is'],
['the cat sat on the mat', 'the fox jumped over the dog'],
['poodles are cute', 'dachshunds are delightful']
]
We can run our algorithms on our pairs as follows:
pairs.each { wordPair ->
var results = simAlgs.collectValues { alg -> alg(*wordPair) }
// display results ...
}
Here is the output from the first pair:
there VS their
JaroWinkler (commons text) 0.91 ██████████████████
JaroWinkler (debatty) 0.91 ██████████████████
Jaccard (debatty k=1) 0.80 ████████████████
Jaccard (commons text k=1) 0.80 ████████████████
RatcliffObershelp 0.80 ████████████████
NormalizedLevenshtein 0.60 ████████████
Cosine 0.33 ██████
Jaccard (debatty k=2) 0.33 ██████
SorensenDice 0.33 ██████
Jaccard (debatty k=3) 0.20 ████
We have color-coded the bars in the chart, with 80% and above colored green, deeming it a "match" in terms of similarity. You could choose a different threshold for matching depending on your use case.
We can see that the different algorithms rank the similarity of the two words differently.
Rather than show the results of all algorithms for all pairs, let’s just show a few highlights that give us insight into which similarity measures might be most useful for our game.
A first observation is the usefulness of Jaccard with k=1 (looking at the set of individual letters).
Here we can imagine that bear might be our guess and bare might be the hidden word.
bear VS bare
Jaccard (debatty k=1) 1.00 ████████████████████
Here we know that we have correctly guessed all the letters!
For another example:
cow VS bull
Jaccard (debatty k=1) 0.00 ▏
We can rule out all letters from our guess!
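Those two results can be reproduced in a couple of lines (a small sketch using the debatty Jaccard class):
import info.debatty.java.stringsimilarity.Jaccard

var jaccard = new Jaccard(1)                       // k=1: sets of individual letters
assert jaccard.similarity('bear', 'bare') == 1.0   // the guess has exactly the right letters
assert jaccard.similarity('cow', 'bull') == 0.0    // no letters in common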
What about Jaccard looking at multi-letter sequences? Well, if you were trying to determine whether a social media account @elton_john might be the same person as the email john.elton@gmail.com, Jaccard with higher values of k would help you out.
elton john VS john elton
Jaccard (debatty k=1) 1.00 ████████████████████
Jaccard (debatty k=2) 0.80 ████████████████
Jaccard (debatty k=3) 0.45 █████████

elton john VS nhoj notle
Jaccard (debatty k=1) 1.00 ████████████████████
Jaccard (debatty k=2) 0.00 ▏
Jaccard (debatty k=3) 0.00 ▏
Note that for "Elton John" backwards, Jaccard with higher values of k quickly drops to zero, but just swapping the words (like our social media account and email with punctuation removed) remains high. So higher values of k for Jaccard definitely have their place, but perhaps aren’t needed for our game. Dealing with k sequential letters (also referred to as n sequential letters) is common. There is in fact a special term, n-grams, for such sequences. While n-grams play an important role in measuring similarity, they don’t add a lot of value for our game. So we’ll just use Jaccard with k=1 in the game.
Let’s now look at JaroWinkler. This measure looks at the number of edits but calculates a weighted score penalising changes at the start, which in turn means that words with common prefixes have higher similarity scores.
If we look at the words 'superstar' and 'supersonic', 5 of the 11 distinct letters are in common (hence the Jaccard score of 5/11 or 0.45), but since they both start with the same 6 letters, it scores a high JaroWinkler value of 0.90.
Swapping to 'partnership' and 'leadership', 7 of the 11 distinct letters are in common hence the higher Jaccard score of 0.64, but even though they both end with the same 6 characters, it gives us a lower JaroWinkler score of 0.73.
superstar VS supersonic
JaroWinkler (debatty) 0.90 ██████████████████
Jaccard (debatty k=1) 0.45 █████████

partnership VS leadership
JaroWinkler (debatty) 0.73 ██████████████
Jaccard (debatty k=1) 0.64 ████████████
Perhaps it could be interesting to know if the start of our guess is close to the hidden word. So we’ll use the JaroWinkler measure in our game.
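The prefix bias is easy to see with a couple of asserts (a small sketch using the debatty JaroWinkler class and the values shown above):
import info.debatty.java.stringsimilarity.JaroWinkler

var jw = new JaroWinkler()
assert jw.similarity('superstar', 'supersonic') > 0.85    // the shared prefix boosts the score
assert jw.similarity('partnership', 'leadership') < 0.80  // a shared suffix doesn't help as much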
Let’s now swap over to distance measures. We’ll explore the range available to us by looking at the following:
var distAlgs = [
NormalizedLevenshtein: new NormalizedLevenshtein()::distance,
'WeightedLevenshtein (t is near r)': new WeightedLevenshtein({ char c1, char c2 ->
c1 == 't' && c2 == 'r' ? 0.5 : 1.0
})::distance,
Damerau: new Damerau()::distance,
OptimalStringAlignment: new OptimalStringAlignment()::distance,
LongestCommonSubsequence: new LongestCommonSubsequence()::distance,
MetricLCS: new MetricLCS()::distance,
'NGram(2)': new NGram(2)::distance,
'NGram(4)': new NGram(4)::distance,
QGram: new QGram(2)::distance,
CosineDistance: new CosineDistance()::apply,
HammingDistance: new HammingDistance()::apply,
JaccardDistance: new JaccardDistance()::apply,
JaroWinklerDistance: new JaroWinklerDistance()::apply,
LevenshteinDistance: LevenshteinDistance.defaultInstance::apply,
]
Not all of these metrics are normalized, so graphing them like before isn’t as useful. Instead, we will have a set of predefined phrases (similar to the index of a search engine), and we will find the closest phrases to some query.
Here are our predefined phrases:
var phrases = [
'The sky is blue',
'The blue sky',
'The blue cat',
'The sea is blue',
'Blue skies following me',
'My ferrari is red',
'Apples are red',
'I read a book',
'The wind blew',
'Numbers are odd or even',
'Red noses',
'Read knows',
'Hippopotamus',
]
Now, let’s use the query The blue car. We’ll find the distance from the query to each of the candidate phrases and return the closest three, as sketched below.
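Here is a minimal sketch of that ranking step (assuming the distAlgs and phrases definitions above; since Hamming only applies to equal-length strings, pairs it can’t handle are simply skipped):
var query = 'The blue car'
distAlgs.each { name, alg ->
    var distances = phrases.collectEntries { phrase ->
        try {
            [phrase, alg(query, phrase)]
        } catch (ignored) {       // e.g. Hamming needs equal-length strings
            [phrase, null]
        }
    }
    var closest = distances.findAll { it.value != null }.sort { it.value }.take(3)
    println "$name: " + closest.collect { k, v -> "$k ($v)" }.join(', ')
}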
Here are the results for The blue car:
NormalizedLevenshtein: The blue cat (0.08), The blue sky (0.25), The wind blew (0.62)
WeightedLevenshtein (t is near r): The blue cat (0.50), The blue sky (3.00), The wind blew (8.00)
Damerau: The blue cat (1.00), The blue sky (3.00), The wind blew (8.00)
OptimalStringAlignment: The blue cat (1.00), The blue sky (3.00), The wind blew (8.00)
LongestCommonSubsequence (debatty): The blue cat (2.00), The blue sky (6.00), The sky is blue (11.00)
MetricLCS: The blue cat (0.08), The blue sky (0.25), The wind blew (0.46)
NGram(2): The blue cat (0.04), The blue sky (0.21), The wind blew (0.58)
NGram(4): The blue cat (0.02), The blue sky (0.13), The wind blew (0.50)
QGram: The blue cat (2.00), The blue sky (6.00), The sky is blue (11.00)
CosineDistance: The blue sky (0.33), The blue cat (0.33), The sky is blue (0.42)
HammingDistance: The blue cat (1), The blue sky (3), Hippopotamus (12)
JaccardDistance: The blue cat (0.18), The sea is blue (0.33), The blue sky (0.46)
JaroWinklerDistance: The blue cat (0.03), The blue sky (0.10), The wind blew (0.32)
LevenshteinDistance: The blue cat (1), The blue sky (3), The wind blew (8)
LongestCommonSubsequenceDistance (commons text): The blue cat (2), The blue sky (6), The sky is blue (11)
As another example, let’s query Red roses:
NormalizedLevenshtein: Red noses (0.11), Read knows (0.50), Apples are red (0.71)
WeightedLevenshtein (t is near r): Red noses (1.00), Read knows (5.00), The blue sky (9.00)
Damerau: Red noses (1.00), Read knows (5.00), The blue sky (9.00)
OptimalStringAlignment: Red noses (1.00), Read knows (5.00), The blue sky (9.00)
MetricLCS: Red noses (0.11), Read knows (0.40), The blue sky (0.67)
NGram(2): Red noses (0.11), Read knows (0.55), Apples are red (0.75)
NGram(4): Red noses (0.11), Read knows (0.53), Apples are red (0.82)
QGram: Red noses (4.00), Read knows (13.00), Apples are red (15.00)
CosineDistance: Red noses (0.50), The sky is blue (1.00), The blue sky (1.00)
HammingDistance: Red noses (1), The sky is blue (-), The blue sky (-)
JaccardDistance: Red noses (0.25), Read knows (0.45), Apples are red (0.55)
JaroWinklerDistance: Red noses (0.04), Read knows (0.20), The sea is blue (0.37)
LevenshteinDistance: Red noses (1), Read knows (5), The blue sky (9)
LongestCommonSubsequenceDistance (commons text): Red noses (2), Read knows (7), The blue sky (13)
Let’s examine these results. Firstly, there are way too many measures for most folks to comfortably reason about. We want to shrink the set down.
We have various Levenshtein values on display. Some are the actual "edit" distance, others are a metric. For our game, since we don’t know the length of the word initially, we thought it might be useful to know the exact number of edits. A normalized value of 0.5 could mean any of 1 letter wrong in a 2-letter word, 2 letters wrong in a 4-letter word, 3 letters wrong in a 6-letter word, and so forth.
We also think the actual distance measure might be something that wordle players could relate to. Once you know the size of the hidden word, the distance indirectly gives you the number of correct letters (which is something wordle players are used to - just here we don’t know which ones are correct).
We also thought about Damerau-Levenshtein. It allows transposition of adjacent characters and, while that adds value in most spell-checking scenarios, for our game it might be harder to visualize what the distance measure means with that additional possible change.
So, we’ll use the standard Levenshtein distance in the game.
We could continue reasoning in this way about the other measures, but we’ll jump to our solution and try to give you examples of why we think they are useful. We added Hamming distance and LongestCommonSubsequence (a similarity measure, not a distance measure) to our list.
Let’s look at some examples to highlight our thinking, starting with Hamming distance. Consider these results:
cat vs dog: LongestCommonSubsequence (0) Hamming (3) Levenshtein (3)
cow vs bull: LongestCommonSubsequence (0) Hamming (-) Levenshtein (4)
For cat vs dog, the LCS value tells us we didn’t have any correct letters in our guess. We’d have at least an LCS value of 1 if we had one correct character somewhere. The fact that Hamming is 3 means that the hidden word (like our guess) must have 3 letters in it - since we can substitute 3 times. The Levenshtein value confirms this.
For cow vs bull, because Hamming didn’t return a result we know that the guess and hidden word are different sizes (remember it’s an algorithm that requires the sizes to be the same). We also know that our guess has no correct letters. But the Levenshtein value of 4 tells us less in this case. If cow was our guess, the hidden word could be bull or cowbell. We do know that the size of the hidden word is between 4 and 7.
Consider now this example:
grounds vs aground: LongestCommonSubsequence (6) Hamming (7) Levenshtein (2)
The fact that Hamming returned 7 means that our guess has the correct number of letters. The fact that the LCS value was 6 means that there is one spurious letter in our guess. If the spurious letter were somewhere in the middle of our guess, Hamming wouldn’t have shown that all letters are incorrect (some positions would have lined up). Hence, we know that the spurious letter is the first or last. The Levenshtein value confirms this (one insert at one end and one delete at the other).
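These three numbers can be checked directly (a minimal sketch using Commons Text; the LCS here is its similarity score, while the other two are distances):
import org.apache.commons.text.similarity.HammingDistance
import org.apache.commons.text.similarity.LevenshteinDistance
import org.apache.commons.text.similarity.LongestCommonSubsequence

assert new LongestCommonSubsequence().apply('grounds', 'aground') == 6
assert new HammingDistance().apply('grounds', 'aground') == 7
assert LevenshteinDistance.defaultInstance.apply('grounds', 'aground') == 2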
Phonetic Algorithms
Phonetic algorithms map words into representations of their pronunciation. They are often used for spell checkers, searching, data deduplication and speech to text systems.
One of the earliest phonetic algorithms was Soundex. The idea is that similar sounding words will have the same soundex encoding despite minor differences in spelling, e.g. Claire, Clair, and Clare, all have the same soundex encoding. A summary of soundex is that (all but leading) vowels are dropped and similar sounding consonants are grouped together. Commons codec has several soundex algorithms. The most commonly used ones for the English language are shown below:
Pair                 Soundex     RefinedSoundex     DaitchMokotoffSoundex
cat|hat              C300|H300   C306|H06           430000|530000
bear|bare            B600|B600   B109|B1090         790000|790000
pair|pare            P600|P600   P109|P1090         790000|790000
there|their          T600|T600   T6090|T609         390000|390000
sort|sought          S630|S230   S3096|S30406       493000|453000
cow|bull             C000|B400   C30|B107           470000|780000
winning|grinning     W552|G655   W08084|G4908084    766500|596650
knows|nose           K520|N200   K3803|N8030        567400|640000
ground|aground       G653|A265   G49086|A049086     596300|059630
peeler|repeal        P460|R140   P10709|R90107      789000|978000
hippo|hippopotamus   H100|H113   H010|H0101060803   570000|577364
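Reproducing the Claire/Clair/Clare example from above (a minimal sketch with the commons-codec Soundex class; the @Grab version is just for illustration):
@Grab('commons-codec:commons-codec:1.16.0')
import org.apache.commons.codec.language.Soundex

var soundex = new Soundex()
assert soundex.encode('Clare') == 'C460'
assert ['Claire', 'Clair'].every { soundex.encode(it) == 'C460' }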
Another common phonetic algorithm is Metaphone. This is similar in concept to Soundex but uses a more sophisticated algorithm for encoding. Various versions are available. Commons codec supports Metaphone and Double Metaphone. The openrefine project includes an early version of Metaphone 3.
Pair                 Metaphone   Metaphone(8)   DblMetaphone(8)   Metaphone3
cat|hat              KT|HT       KT|HT          KT|HT             KT|HT
bear|bare            BR|BR       BR|BR          PR|PR             PR|PR
pair|pare            PR|PR       PR|PR          PR|PR             PR|PR
there|their          0R|0R       0R|0R          0R|0R             0R|0R
sort|sought          SRT|ST      SRT|ST         SRT|SKT           SRT|ST
cow|bull             K|BL        K|BL           K|PL              K|PL
winning|grinning     WNNK|KRNN   WNNK|KRNNK     ANNK|KRNNK        ANNK|KRNNK
knows|nose           NS|NS       NS|NS          NS|NS             NS|NS
ground|aground       KRNT|AKRN   KRNT|AKRNT     KRNT|AKRNT        KRNT|AKRNT
peeler|repeal        PLR|RPL     PLR|RPL        PLR|RPL           PLR|RPL
hippo|hippopotamus   HP|HPPT     HP|HPPTMS      HP|HPPTMS         HP|HPPTMS
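A couple of the table entries can be checked in the same way (a sketch using the commons-codec classes; Metaphone3 from OpenRefine isn’t shown here):
import org.apache.commons.codec.language.DoubleMetaphone
import org.apache.commons.codec.language.Metaphone

assert new Metaphone().encode('knows') == 'NS'     // default max code length of 4
assert new Metaphone().encode('nose') == 'NS'
assert new DoubleMetaphone(maxCodeLen: 8).doubleMetaphone('ground') == 'KRNT'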
Commons Codec includes some additional algorithms including Nysiis and Caverphone. They are shown below for completeness.
Pair                 Nysiis          Caverphone2
cat|hat              CAT|HAT         KT11111111|AT11111111
bear|bare            BAR|BAR         PA11111111|PA11111111
pair|pare            PAR|PAR         PA11111111|PA11111111
there|their          TAR|TAR         TA11111111|TA11111111
sort|sought          SAD|SAGT        ST11111111|ST11111111
cow|bull             C|BAL           KA11111111|PA11111111
winning|grinning     WANANG|GRANAN   WNNK111111|KRNNK11111
knows|nose           N|NAS           KNS1111111|NS11111111
ground|aground       GRAD|AGRAD      KRNT111111|AKRNT11111
peeler|repeal        PALAR|RAPAL     PLA1111111|RPA1111111
hippo|hippopotamus   HAP|HAPAPA      APA1111111|APPTMS1111
The matching of sort with sought by Caverphone2 is useful, but it didn’t match knows with nose. In summary, these algorithms don’t offer anything compelling compared with Metaphone.
For our game, we don’t want users to have to understand the encoding algorithms of the various phonetic algorithms. We want to instead give them a metric that lets them know how closely their guess sounds like the hidden word.
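One way to do this (a sketch of the idea; the exact game code may differ) is to turn the phonetic encodings back into numbers: Soundex.difference counts matching characters (0-4) between the two Soundex codes, and the Metaphone-based scores compare Metaphone encodings (max code length 5) using normalized LCS and Levenshtein similarity:
import org.apache.commons.codec.language.Metaphone
import org.apache.commons.codec.language.Soundex
import info.debatty.java.stringsimilarity.MetricLCS
import info.debatty.java.stringsimilarity.NormalizedLevenshtein

var soundex = new Soundex()
var metaphone = new Metaphone().tap { maxCodeLen = 5 }

var soundexDiff = { a, b -> soundex.difference(a, b) / 4 }
var metaphoneSim = { a, b, measure -> 1 - measure.distance(metaphone.encode(a), metaphone.encode(b)) }

assert soundexDiff('cat', 'hat') == 0.75
printf '%.0f%%%n', 100 * metaphoneSim('ground', 'aground', new MetricLCS())            // 80%
printf '%.0f%%%n', 100 * metaphoneSim('peeler', 'repeal', new NormalizedLevenshtein()) // 33%
The table below shows these kinds of measures for our sample pairs: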
Pair                 SoundexDiff   Metaphone5LCS   Metaphone5Lev
cat|hat              75%           50%             50%
bear|bare            100%          100%            100%
pair|pare            100%          100%            100%
there|their          100%          100%            100%
sort|sought          75%           67%             67%
cow|bull             50%           0%              0%
winning|grinning     25%           60%             60%
knows|nose           25%           100%            100%
ground|aground       0%            80%             80%
peeler|repeal        25%           67%             33%
hippo|hippopotamus   50%           40%             40%
Going Deeper
Using DJL with PyTorch and the Angle model:
cow: bovine (0.86), bull (0.73), Bovines convert grass to milk (0.60), hay (0.59), Bulls consume hay (0.56)
cat: kitten (0.82), bull (0.63), bovine (0.60), One two three (0.59), hay (0.55)
dog: bull (0.69), bovine (0.68), kitten (0.58), Dogs play in the grass (0.58), Dachshunds are delightful (0.55)
grass: The grass is green (0.83), hay (0.68), Dogs play in the grass (0.65), Bovines convert grass to milk (0.61), Bulls trample grass (0.59)
Cows eat grass: Bovines convert grass to milk (0.80), Bulls consume hay (0.69), Bulls trample grass (0.68), Dogs play in the grass (0.65), bovine (0.62)
Poodles are cute: Dachshunds are delightful (0.63), Dogs play in the grass (0.56), bovine (0.49), The grass is green (0.44), kitten (0.44)
The water is turquoise: The sea is blue (0.72), The sky is blue (0.65), The grass is green (0.53), One two three (0.43), bovine (0.39)
Using DJL with Tensorflow and the universal-sentence-encoder (USE) model:
cow: bovine (0.72), bull (0.57), Bulls consume hay (0.46), hay (0.45), kitten (0.44)
cat: kitten (0.75), bull (0.35), hay (0.31), bovine (0.26), Dogs play in the grass (0.22)
dog: kitten (0.54), Dogs play in the grass (0.45), bull (0.39), hay (0.35), Dachshunds are delightful (0.27)
grass: The grass is green (0.61), Bulls trample grass (0.56), Dogs play in the grass (0.52), hay (0.51), Bulls consume hay (0.47)
Cows eat grass: Bovines convert grass to milk (0.60), Bulls trample grass (0.58), Dogs play in the grass (0.56), Bulls consume hay (0.53), bovine (0.44)
Poodles are cute: Dachshunds are delightful (0.54), Dogs play in the grass (0.27), Bulls consume hay (0.19), bovine (0.16), Bulls trample grass (0.15)
The water is turquoise: The sea is blue (0.56), The grass is green (0.39), The sky is blue (0.38), kitten (0.17), One two three (0.17)
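Under the covers, the query and each candidate are converted to embedding vectors, and candidates are then typically ranked by cosine similarity. Here is a minimal sketch of just that ranking step (the embed closure is a hypothetical placeholder for a real model call, e.g. a DJL Predictor<String, float[]>):
// rank candidates by cosine similarity of their embeddings
// NOTE: embed is a placeholder - in practice it would delegate to a loaded model,
// e.g. predictor.predict(sentence) with DJL
var embed = { String sentence -> new float[512] }

double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0
    for (i in 0..<a.length) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

var query = 'cow'
var candidates = ['bovine', 'bull', 'kitten', 'hay', 'Bulls consume hay']
var ranked = candidates
    .collectEntries { [it, cosine(embed(query), embed(it))] }
    .sort { -it.value }
println ranked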
[Chart: using the conceptnet model, the closest terms for each of /c/en/cow, /c/en/bull, /c/en/calf, /c/en/bovine, /c/fr/bovin, /c/fr/vache, /c/fr/taureau, /c/de/kuh, /c/en/kitten, /c/en/cat, and /c/de/katze - equivalent terms across English, French, and German score highly against each other.]
[Chart: for each word in our candidate list (cow, bull, calf, bovine, cattle, livestock, cat, kitten, feline, hippo, bear, bare, milk, water, grass, green), the closest terms according to the angle, use, conceptnet, glove, and fasttext models.]
Playing the game
Round 1
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 1): aftershock
LongestCommonSubsequence 0
Levenshtein Distance: 10, Insert: 0, Delete: 3, Substitute: 7
Jaccard 0%
JaroWinkler PREFIX 0% / SUFFIX 0%
Phonetic Metaphone=AFTRXK 47% / Soundex=A136 0%
Meaning Angle 45% / Use 21% / ConceptNet 2% / Glove -4% / FastText 19%

Possible letters: b d g i j l m n p q u v w x y z
Guess the hidden word (turn 2): fruit
LongestCommonSubsequence 2
Levenshtein Distance: 6, Insert: 2, Delete: 0, Substitute: 4
Jaccard 22%
JaroWinkler PREFIX 56% / SUFFIX 45%
Phonetic Metaphone=FRT 39% / Soundex=F630 0%
Meaning Angle 64% / Use 41% / ConceptNet 37% / Glove 31% / FastText 44%

Possible letters: b d g i j l m n p q u v w x y z
Guess the hidden word (turn 3): buzzing
LongestCommonSubsequence 4
Levenshtein Distance: 3, Insert: 0, Delete: 0, Substitute: 3
Jaccard 50%
JaroWinkler PREFIX 71% / SUFFIX 80%
Phonetic Metaphone=BSNK 58% / Soundex=B252 50%
Meaning Angle 44% / Use 19% / ConceptNet -9% / Glove -2% / FastText 24%

Possible letters: b d g i j l m n p q u v w x y z
Guess the hidden word (turn 4): pulling
LongestCommonSubsequence 5
Levenshtein Distance: 2, Insert: 0, Delete: 0, Substitute: 2
Jaccard 71%
JaroWinkler PREFIX 85% / SUFFIX 87%
Phonetic Metaphone=PLNK 80% / Soundex=P452 75%
Meaning Angle 48% / Use 25% / ConceptNet -8% / Glove 3% / FastText 29%

Possible letters: b d g i j l m n p q u v w x y z
Guess the hidden word (turn 5): pudding
LongestCommonSubsequence 7
Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
Jaccard 100%
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=PTNK 100% / Soundex=P352 100%
Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
Congratulations, you guessed correctly!
Round 2
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 1): bail
LongestCommonSubsequence 1
Levenshtein Distance: 7, Insert: 4, Delete: 0, Substitute: 3
Jaccard 22% (2/9)
JaroWinkler PREFIX 42% / SUFFIX 46%
Phonetic Metaphone=BL 38% / Soundex=B400 25%
Meaning Angle 46% / Use 40% / ConceptNet 0% / Glove 0% / FastText 31%

Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 2): leg
LongestCommonSubsequence 2
Levenshtein Distance: 6, Insert: 5, Delete: 0, Substitute: 1
Jaccard 25% (2/8)
JaroWinkler PREFIX 47% / SUFFIX 0%
Phonetic Metaphone=LK 38% / Soundex=L200 0%
Meaning Angle 50% / Use 18% / ConceptNet 11% / Glove 13% / FastText 37%

Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 3): languish
LongestCommonSubsequence 2
Levenshtein Distance: 8, Insert: 0, Delete: 0, Substitute: 8
Jaccard 15% (2/13)
JaroWinkler PREFIX 50% / SUFFIX 50%
Phonetic Metaphone=LNKX 34% / Soundex=L522 0%
Meaning Angle 46% / Use 12% / ConceptNet -11% / Glove -4% / FastText 25%

Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 4): election
LongestCommonSubsequence 5
Levenshtein Distance: 4, Insert: 0, Delete: 0, Substitute: 4
Jaccard 40% (4/10)
JaroWinkler PREFIX 83% / SUFFIX 75%
Phonetic Metaphone=ELKXN 50% / Soundex=E423 75%
Meaning Angle 47% / Use 13% / ConceptNet -5% / Glove -7% / FastText 26%

Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 5): elevator
LongestCommonSubsequence 8
Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
Jaccard 100% (7/7)
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=ELFTR 100% / Soundex=E413 100%
Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
Congratulations, you guessed correctly!
Round 3
Let’s take a first guess with a 10-letter (all distinct) word.
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 1): aftershock
LongestCommonSubsequence 3
Levenshtein Distance: 8, Insert: 1, Delete: 3, Substitute: 4
Jaccard 33%
JaroWinkler PREFIX 56% / SUFFIX 56%
Phonetic Metaphone=AFTRXK 32% / Soundex=A136 25%
Meaning Angle 41% / Use 20% / ConceptNet -4% / Glove -13% / FastText 11%
This tells us:
- We did two more deletes than inserts, so the hidden word has 8 characters.
- If the hidden word is size 8, why would we ever do inserts, i.e. make it longer? Doing the insert (and subsequent deletes) must have made it possible to get 3 letters into the correct position.
- Soundex tells us that it either starts with A and the other consonant groupings are wrong, or it doesn’t start with A and one consonant grouping is correct. Metaphone of 32% means we probably have two consonant groupings correct.
- Our guess has 10 distinct letters. Jaccard of 33% tells us that we have 4/12 or 5/15 letters correct (see the sanity check below). If we had 5 letters correct there would be at most 3 letters we don’t have, but adding 3 to the 10 in our guess doesn’t give 15. So we have 4 of 12 letters. There must be up to 4 letters we don’t have. Adding those 4 to our 10 gives 14, but we know there are only 12 distinct letters, so the answer has two duplicates or a triple. I.e. the answer has 6 distinct letters.
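As a quick sanity check of that Jaccard reasoning, we can brute-force the feasible intersection/union combinations for a 10-distinct-letter guess against an 8-letter hidden word (a throwaway sketch, not part of the game code), which confirms that 4/12 is the only possibility:
var guessDistinct = 10
var hiddenLength = 8
var options = []
(1..hiddenLength).each { common ->
    // the union is at least our 10 letters, plus at most (hiddenLength - common) new ones
    def maxUnion = guessDistinct + (hiddenLength - common)
    (guessDistinct..maxUnion).each { u ->
        if (common * 3 == u) options << "$common/$u"    // a Jaccard score of exactly 1/3
    }
}
println options   // [4/12]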
The letters e and s are very common. Let’s pick a word with 2 of each that matches what we know from LCS.
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 1): aftershock
LongestCommonSubsequence 3
Levenshtein Distance: 8, Insert: 1, Delete: 3, Substitute: 4
Jaccard 33% (4/12)
JaroWinkler PREFIX 56% / SUFFIX 56%
Phonetic Metaphone=AFTRXK 32% / Soundex=A136 25%
Meaning Angle 41% / Use 20% / ConceptNet -4% / Glove -13% / FastText 11%

Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 2): patriate
LongestCommonSubsequence 2
Levenshtein Distance: 7, Insert: 0, Delete: 0, Substitute: 7
Jaccard 20% (2/10)
JaroWinkler PREFIX 47% / SUFFIX 47%
Phonetic Metaphone=PTRT 38% / Soundex=P363 0%
Meaning Angle 39% / Use 23% / ConceptNet 13% / Glove 0% / FastText 27%

Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 3): tarragon
LongestCommonSubsequence 3
Levenshtein Distance: 5, Insert: 0, Delete: 0, Substitute: 5
Jaccard 71% (5/7)
JaroWinkler PREFIX 68% / SUFFIX 68%
Phonetic Metaphone=TRKN 50% / Soundex=T625 25%
Meaning Angle 46% / Use 4% / ConceptNet -7% / Glove 5% / FastText 26%

Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 4): kangaroo
LongestCommonSubsequence 8
Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
Jaccard 100% (6/6)
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=KNKR 100% / Soundex=K526 100%
Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
Congratulations, you guessed correctly!
- Our Jaccard is now 1/11. That must be the 6 letters we tried plus 5 others in the hidden word, so our correct letter isn’t one of the duplicates. I.e. there is no S or E in the word.
- Our soundex indicates the word doesn’t start with S, which confirms our previously derived fact.
- Our metaphone has dropped markedly. We know the S shouldn’t be there, but with only 10%, only one of F or R is probably correct, and we probably need a K or T from turn 1.
Let’s try duplicates for o and r, and also match LCS from previous guesses.
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 3): motorcar
LongestCommonSubsequence 2
Levenshtein Distance: 8, Insert: 0, Delete: 0, Substitute: 8
Jaccard 33% (3/9)
JaroWinkler PREFIX 47% / SUFFIX 47%
Phonetic Metaphone=MTRKR 43% / Soundex=M362 0%
Meaning Angle 44% / Use 20% / ConceptNet -4% / Glove 6% / FastText 33%
- Soundex indicates that the word doesn’t start with M.
- Our Jaccard is now 3/9. That must mean 3 of the 6 distinct letters in our guess are in the hidden word, and there are 3 more letters in the hidden word that we haven’t found yet.
Round 4
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 1): aftershock
LongestCommonSubsequence 3
Levenshtein Distance: 8, Insert: 0, Delete: 4, Substitute: 4
Jaccard 50%
JaroWinkler PREFIX 61% / SUFFIX 49%
Phonetic Metaphone=AFTRXK 33% / Soundex=A136 25%
Meaning Angle 44% / Use 11% / ConceptNet -7% / Glove 1% / FastText 15%
What do we know?
- We deleted 4 letters, so the hidden word has 6 letters.
- Jaccard of 50% is either 5/10 or 6/12. If the latter, we’d already have all the letters, so there couldn’t be 2 additional letters in the hidden word (which a union of 12 would require); so it’s 5/10. That means we need to pick 5 letters from aftershock, duplicate one of them, and we’ll have all the letters.
- Phonetic clues suggest it probably doesn’t start with A.
In aftershock, F, H, and K are probably the least common letters. Let’s pick a 6-letter word from the remaining 7 letters that abides by our LCS clue. We know this can’t be right because we aren’t duplicating a letter yet, but we just want to narrow down the possibilities.
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 2): coarse
LongestCommonSubsequence 3
Levenshtein Distance: 4, Insert: 0, Delete: 0, Substitute: 4
Jaccard 57% (4/7)
JaroWinkler PREFIX 67% / SUFFIX 67%
Phonetic Metaphone=KRS 74% / Soundex=C620 75%
Meaning Angle 51% / Use 12% / ConceptNet 5% / Glove 23% / FastText 26%
This tells us:
- We now have 4 of the 5 distinct letters (we should discard 2).
- Phonetics indicates we are close but not very close yet; from the Metaphone value of KRS we should drop one letter and keep two.
Let’s assume C and E are wrong and bring in the other common letter, T. We need to find a word that matches the LCS conditions from previous guesses, and we’ll duplicate one letter, S.
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 3): roasts
LongestCommonSubsequence 3
Levenshtein Distance: 6, Insert: 0, Delete: 0, Substitute: 6
Jaccard 67% (4/6)
JaroWinkler PREFIX 56% / SUFFIX 56%
Phonetic Metaphone=RSTS 61% / Soundex=R232 25%
Meaning Angle 54% / Use 25% / ConceptNet 18% / Glove 18% / FastText 31%
We learned:
- Phonetics dropped, so maybe S wasn’t the correct letter to bring in; we want the K (from the letter C) and R from the previous guess.
- Also, the semantic meaning has bumped up to warm (from cold for previous guesses). Maybe the hidden word is related to roasts.
Let’s try a word starting with C, related to roasts.
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 4): carrot
LongestCommonSubsequence 6
Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
Jaccard 100% (5/5)
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=KRT 100% / Soundex=C630 100%
Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
Congratulations, you guessed correctly!
Success!
Further information
Source code for this post:
Other referenced sites:
Related libraries and links: