Skip to content
This repository has been archived by the owner on Sep 27, 2022. It is now read-only.

Rank for results #16

Open
shameelabdulla opened this issue Sep 20, 2013 · 17 comments
Open

Rank for results #16

shameelabdulla opened this issue Sep 20, 2013 · 17 comments

Comments

@shameelabdulla
Copy link

I ve been checking out fuzzily gem it greatly helps. It would be great if there is a rank for suggestions returned. I know that the best suggestion is the first result. If there is a way to give point for each suggestion (say 0 => Exact match, 0.2 => deviates to some extend, 0.9 => deviates to a great extend), it would be really great.

@mezis
Copy link
Owner

mezis commented Sep 21, 2013

Hi @shameelabdulla, thanks for the suggestion.

I think the easiest way to do this is after the fact, on the (limited) list of results. You could use an implementation of Levenshtein to calculate the similarity of the result strings and your input string.

If you do, I'd welcome an add-on to Fuzzily that does this!

@shameelabdulla
Copy link
Author

Oh great!! I ll add it

Sent from my iPhone

On 22-Sep-2013, at 2:06 AM, Julien Letessier [email protected] wrote:

Hi @shameelabdulla, thanks for the suggestion.

I think the easiest way to do this is after the fact, on the (limited) list of results. You could use an implementation of Levenshtein to calculate the similarity of the result strings and your input string.

If you do, I'd welcome an add-on to Fuzzily that does this!


Reply to this email directly or view it on GitHub.

@airblade
Copy link

This works for me: in Fuzzily::Model::Rails(2|3):

  def _matches_for_trigrams(trigrams)
    self.
      select('owner_id, owner_type, count(*) AS matches, MAX(score) AS score').
      group('owner_id, owner_type').
      order('matches DESC, score ASC').
      with_trigram(trigrams).
-     map(&:owner)
+     map do |t|
+       t.owner.tap |o|
+         o.instance_eval "def fuzzily_score; #{t.score}; end"
+       end
+     end
  end

Would you like a patch for this?

@shameelabdulla
Copy link
Author

It would be great if you can add as patch

On Mon, Sep 30, 2013 at 3:10 PM, Andy Stewart [email protected]:

This works for me: in Fuzzily::Model::Rails(2|3):

def _matches_for_trigrams(trigrams)
self.
select('owner_id, owner_type, count(*) AS matches, MAX(score) AS score').
group('owner_id, owner_type').
order('matches DESC, score ASC').
with_trigram(trigrams).- map(&:owner)+ map do |t|+ t.owner.tap |o|+ o.instance_eval "def fuzzily_score; #{t.score}; end"+ end+ end
end

Would you like a patch for this?


Reply to this email directly or view it on GitHubhttps://github.com//issues/16#issuecomment-25348138
.

@airblade
Copy link

Here's some code: airblade@8b41888.

It's not as clean as my diff above due to having to work around the problem in #18.

@shameelabdulla
Copy link
Author

Hi Andy,
Thanks. However this score does not seem to work. If you check the score
responses they do not correspond to level of matching.

On Mon, Sep 30, 2013 at 6:30 PM, Andy Stewart [email protected]:

Here's some code: airblade/fuzzily@8b41888airblade@8b41888
.

It's not as clean as my diff above due to having to work around the
problem in #18 #18.


Reply to this email directly or view it on GitHubhttps://github.com//issues/16#issuecomment-25357862
.

@airblade
Copy link

airblade commented Oct 1, 2013

I had assumed that score corresponded to the quality of the match. Can you give a couple of examples? I'm not particularly familiar with how trigram matching works.

@airblade
Copy link

airblade commented Oct 1, 2013

Looking at the code, a trigram's score is simply the length of the word from which it came. The fuzzy finder orders its results by matches DESC, score ASC, where matches is the number of trigrams which match. So the more trigrams which match, the better. And when you have a tie, prefer shorter words.

We need a way to normalise the quality of the matches (so they're comparable across models). How about modifying my code above like this:

o.instance_eval "def fuzzily_score; #{t.matches / t.score.to_f}; end"

– although that doesn't really normalise the results to between 0 and 1.

@airblade
Copy link

airblade commented Oct 1, 2013

For normalising the score, how about:

(number of matches / number of trigrams for search text) / (1 + abs(score - score of search text))

Let's say the search text is vogue. It's score is 5 and it has 6 trigrams ('**v', '*vo', 'vog', 'ogu', 'gue', 'ue*').
For a search text with 0 matches, the normalised score = 0.
For an exact match, the normalised score = (6 / 6) / (1 + 0) = 1.
For two partial hits, each with three trigram matches, with scores 6 and 8:

  • first normalised score = (3 / 6) / (1 + (6 - 5)) = 0.25
  • second normalised score = (3 / 6) / (1 + (8 - 5)) = 0.125.

@airblade
Copy link

airblade commented Oct 1, 2013

Here's another scoring method which I quite like. It keeps the same order in which the results are returned, i.e. the matches DESC, score ASC.

The more matches the better, and the lower the score the better.
The number of matches is an integer, so we make the score adjustment lie between 0 and 1. The bigger the difference between the result's score and the search text's score, the nearer 1 the score adjustment.
And we normalise the overall quality to be between 0 and 1.

delta score = abs(result score - search text score)
score adjustment = 1 / (1 + (search text score / delta score))
quality = (number of trigram matches - score adjustment) / (number of trigrams for seach string)

@shameelabdulla
Copy link
Author

Trying one by one with a data set. Will let you know

Sent from my iPad

On 01-Oct-2013, at 7:31 pm, Andy Stewart [email protected] wrote:

Here's another scoring method which I quite like. It keeps the same order in which the results are returned, i.e. the matches DESC, score ASC.

The more matches the better, and the lower the score the better.
The number of matches is an integer, so we make the score adjustment lie between 0 and 1. The bigger the difference between the result's score and the search text's score, the nearer 1 the score adjustment.
And we normalise the overall quality to be between 0 and 1.

delta score = abs(result score - search text score)
score adjustment = 1 / (1 + (search text score / delta score))
quality = (number of trigram matches - score adjustment) / (number of trigrams for seach string)

Reply to this email directly or view it on GitHub.

@shameelabdulla
Copy link
Author

What is the diff between result score and search text score?

On Tue, Oct 1, 2013 at 7:31 PM, Andy Stewart [email protected]:

Here's another scoring method which I quite like. It keeps the same order
in which the results are returned, i.e. the matches DESC, score ASC.

The more matches the better, and the lower the score the better.
The number of matches is an integer, so we make the score adjustment lie
between 0 and 1. The bigger the difference between the result's score and
the search text's score, the nearer 1 the score adjustment.
And we normalise the overall quality to be between 0 and 1.

delta score = abs(result score - search text score)
score adjustment = 1 / (1 + (search text score / delta score))
quality = (number of trigram matches - score adjustment) / (number of trigrams for seach string)


Reply to this email directly or view it on GitHubhttps://github.com//issues/16#issuecomment-25451707
.

@airblade
Copy link

airblade commented Oct 2, 2013

The score is simply the length of the string. The result score is the length of the result string, and the search text score is the length of the text we're searching for.

@shameelabdulla
Copy link
Author

Does not seem to work for the data I ve. I ll tell you the problem I am
trying to solve. I ve a big sane data base of different products. The sane
data base consists of names of products, classification (bakery, grocery,
household etc), and proper images.

As Input (lets call it insane data :) )I ve names of products - But the
inputs just have names(which may have spelling mistakes), and the products
are not classified.

What I am trying to do is

  1. Pick each insane input (name of product)
  2. Do a fuzzy search on my search database
  3. If search comes back with a respectable score (say >0.9), select the
    record.

However for step 3 I have not yet been able to figure out a proper scoring
algorithm. I tried scoring as you said, tried Levenshtein. However all
these just correspond to edit distance I feel.

The following is the data set I tried:

Insane name entry:
ATTA AASHIRVAAD MG 5KG

Response from fuzzy with scores:
{:t=>"AASHIRVAAD DAL BUKHARA 1",
:q=>"1",
:s=>0.4583333333333333,
:l=>0.6666666666666666},
{:t=>"AASHIRVAAD MULTIGRAIN ATTA 1 kg Pouch",
:q=>"1 kg Pouch",
:s=>0.4864864864864865,
:l=>0.7567567567567568},
{:t=>"ATTA AASHIRVAAD SELECT 5 kg",
:q=>"5 kg",
:s=>0.6666666666666666,
:l=>0.3333333333333333},
{:t=>"ATTA AASHIRVAAD SELECT 1 kg",
:q=>"1 kg",
:s=>0.6666666666666666,
:l=>0.37037037037037035},
{:t=>"ATTA AASHIRVAAD M G 1 kg",
:q=>"1 kg",
:s=>0.8333333333333334,
:l=>0.20833333333333334},
{:t=>"ATTA AASHIRVAAD M G 5 kg",
:q=>"5 kg",
:s=>0.8333333333333334,
:l=>0.16666666666666666},
{:t=>"ATTA AASHIRVAAD 10 kg",
:q=>"10 kg",
:s=>0.8571428571428571,
:l=>0.22727272727272727},
{:t=>"ATTA AASHIRVAAD 2 kg", :q=>"2 kg", :s=>0.9, :l=>0.22727272727272727},
{:t=>"ATTA AASHIRVAAD 5 kg", :q=>"5 kg", :s=>0.9, :l=>0.22727272727272727},
{:t=>"ATTA AASHIRVAAD 1 kg", :q=>"1 kg", :s=>0.9, :l=>0.22727272727272727}]

For my requirement 5th entry from last should ve the highest score. Any
thoughts are welcome :)

On Wed, Oct 2, 2013 at 12:47 PM, Andy Stewart [email protected]:

The score is simply the length of the stringhttps://github.com/mezis/fuzzily/blob/b91323ed8de5aa8872590dfce388d6234dda0e3d/lib/fuzzily/trigram.rb#L14.
The result score is the length of the result string, and the search text
score is the length of the text we're searching for.


Reply to this email directly or view it on GitHubhttps://github.com//issues/16#issuecomment-25517463
.

@mezis
Copy link
Owner

mezis commented Oct 2, 2013

@airblade — while having a normalized "matchiness" metric is a hard problem, it looks like your formula works.

The reason the "score" is the length of the needle is that you want "York" to rank before "New York" when searching for "York"—between two strings that match as well in terms of number of matching trigrams, you want the shortest one, which will be the "best" match.

Implementation-wise, defining extra methods on the fly is a performance killer (it flushes the method cache, so affects an entire application), and it's probably not something you want to use the database to do either.

I haven't had much time last week but I'll try to cobble something together this weekend.

@shameelabdulla
Copy link
Author

@ariblade Tried a combination score suggested by you [(x.matches /
x.score.to_f) => airblade score] and Levenshtein distance in the following
way:
result score = (airblade score + (1-Levenshtein distance))/2

Seems to work. Analysing with results.

On Wed, Oct 2, 2013 at 1:41 PM, Julien Letessier
[email protected]:

@airblade https://github.com/airblade — while having a normalized
"matchiness" metric is a hard problem, it looks like your formula works.

The reason the "score" is the length of the needle is that you want "York"
to rank before "New York" when searching for "York"—between two strings
that match as well in terms of number of matching trigrams, you want the
shortest one, which will be the "best" match.

Implementation-wise, defining extra methods on the fly is a performance
killer (it flushes the method cache, so affects an entire application), and
it's probably not something you want to use the database to do either.

I haven't had much time last week but I'll try to cobble something
together this weekend.


Reply to this email directly or view it on GitHubhttps://github.com//issues/16#issuecomment-25521115
.

@mezis
Copy link
Owner

mezis commented Dec 1, 2013

Here's some code: airblade/fuzzily@8b41888.

As a (late) update to this, I can't use the code directly as it has no test and also has a performance issue—it adds methods on the fly, which kills the method cache in Ruby < 2.1. Working on an alternate solution based on @airblade's formula.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants