SimpleSim tests

Testing file for modifications of SimpleSim.

SimpleSim is the original code. When the next found letter was before current offset (eg turnaround of haystack), the score for the letter is 0.

SimpleSim2 adds a score based on the distance of the found letter and the offset. It should better handle spelling disorder.

offset is the next character in haystack

it searches forward to the end

it then searches backward towards the beginning for a second opinion. This seaerch is limited, only to found nearer results. If forward did not find, it searches until 0. If forward was found, it searches until (2 * offset - position + 1) as this is the limit for the distance to be smaller.

If position is > -1 then score += 1 / (Math.abs(position - offset) + 1)

The offset is adjusted to position + 1

The score is divided by the length of the needle

SimpleSim3 is like SimpleSim2 but gives the option to exit if a certain threshold cannot be achieved any more. The goal is here speed.

To create symmetry, SimpleSim functions are run twice in both directions.

Jaro Winkler is an established algorithm.

Levenshtein Distance is an established algorithm. To make it comparable (range 0-1 instead of distance) we calculate 1-levenshtein/maxlength.

Longest Common Sequence LCS is an established algorithm. To make it comparable, we divide it by maxlength.

Test

The test uses a list of 150 000 english words.

Two words are chosen at random. Each function is run. If in average the similarity is above 0.7, the result is retained and displayed. The test runs until it habve 50 results.

Between all functions, CosineSimilarity is calculated for the retained results. As you can see, it is quite high, even if there ware differences for single word pairs.

For the perfomance test, a shortwords (length = 5) and a lomgwords (length = 5) table are used.

Each function is run 100'000 times and the duration is given in miliseconds. All functions use the same sets.

Dividing the results of both tests can give an information of the average complexity. If the complexity is O(n), the indicator is 2.00 (double the length, double the comparison time). If the complexity is O(n^2) the indicator is 4.0. Between, we would have O(n log n).

The test takes 30 seconds, depending on browser and CPU.

Test results

Conclusions

The results diverge per run, probably as javascript environment may be busy elsewhere. But it looks like Jaro Winkler is near O(n^2), Levenshtein, LCS and SimpleSim behave O(n ln n) and SimpleSim3 may have also O(n) behaviour: it has both a linear and a complexity gain compared to SimpleSim3. We can retain SimpleSim3.

2025-03-13 matti@belle-nuit.com