
A simple library to compare two phrases through different algorithms, already well implemented in debatty / java-string-similarity. The library compares two texts first in their entirety, then piece by piece, and finally as single words in any combination. Each user can then implement their own strategy to decide whether a comparison is valid.


fulmicotone/fulmicotone-strings-similarity


Strings-similarity

A library to compare two text inputs. Each input is first normalized; then one or more algorithms, already well implemented in debatty / java-string-similarity, perform the comparison; finally the result is evaluated with a custom strategy.

Normalization

To define the rules that prepare the text for comparison, use the class PhraseNormalizerFactory.

The following rules can be set:

  • skip words with length < n

  • lowercase the whole input

  • split the input into chunks using the splitterDelimiter

  • replace one or more words with another

  • discard one or more words

      // word and replace are String values supplied by the caller
      PhraseNormalizerFactory.newOne()
          .withMinWordLength(3)
          .withApplyLowerCase(true)
          .withSplitterDelimiter("/")
          .addReplacement(word, replace)
          .addDiscardWord(word)
          .build()
    

After PhraseNormalizerFactory is applied, the output is an organised object in which the text is subdivided into Phrase, Chunks and Words.
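The rules above can be illustrated with a small self-contained sketch. It does not use the library: NormalizationSketch and its normalize method are hypothetical names, and the order in which the rules are applied (lowercase, split, replace, then filter by length and discard list) is an assumption, since it is not stated here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for the normalization step; NOT the library's implementation.
final class NormalizationSketch {

    /** Splits a phrase into chunks and each chunk into normalized words. */
    static List<List<String>> normalize(String phrase, String delimiter, int minWordLength,
                                        Map<String, String> replacements, Set<String> discarded) {
        List<List<String>> chunks = new ArrayList<>();
        // Assumed rule order: lowercase first, then split on the delimiter.
        String lowered = phrase.toLowerCase(Locale.ROOT);
        for (String chunk : lowered.split(Pattern.quote(delimiter))) {
            List<String> words = new ArrayList<>();
            for (String word : chunk.trim().split("\\s+")) {
                String replaced = replacements.getOrDefault(word, word);
                // Skip words shorter than the minimum length or listed as discarded.
                if (replaced.length() < minWordLength || discarded.contains(replaced)) {
                    continue;
                }
                words.add(replaced);
            }
            chunks.add(words);
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Soccer/Player/a Ball", "/", 3,
                Map.of("ball", "football"), Set.of("player")));
    }
}
```

In this sketch a chunk whose words are all filtered out stays in the result as an empty list, so chunk positions are preserved; the library may handle that case differently.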

Each of these objects belongs to the CharacterSequence family and has three main fields that tell where it comes from, what position it had in its context, and what it contains:

  • parent: the element it was extracted from
  • sort index: its position inside its root context
  • sequence: its content as a string

These fields can be important in a comparison between two CharacterSequence objects.
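A minimal sketch of how the three fields might fit together, assuming a simple tree of phrase, chunks and words. SequenceNode and wordsOf are hypothetical names, not the library's classes, and here the sort index is counted within the direct parent, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of the three fields described above; the real classes may differ.
final class SequenceNode {
    final SequenceNode parent; // element this one was extracted from (null for the phrase)
    final int sortIndex;       // position inside the parent (assumption of this sketch)
    final String sequence;     // textual content

    SequenceNode(SequenceNode parent, int sortIndex, String sequence) {
        this.parent = parent;
        this.sortIndex = sortIndex;
        this.sequence = sequence;
    }

    /** Builds phrase -> chunk -> word nodes and returns the word nodes in order. */
    static List<SequenceNode> wordsOf(String phrase, String delimiter) {
        SequenceNode root = new SequenceNode(null, 0, phrase);
        List<SequenceNode> words = new ArrayList<>();
        String[] chunks = phrase.split(Pattern.quote(delimiter));
        for (int c = 0; c < chunks.length; c++) {
            SequenceNode chunk = new SequenceNode(root, c, chunks[c]);
            String[] tokens = chunks[c].trim().split("\\s+");
            for (int w = 0; w < tokens.length; w++) {
                words.add(new SequenceNode(chunk, w, tokens[w]));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        for (SequenceNode word : wordsOf("soccer and play-shoes-bombastic", "-")) {
            System.out.println(word.sequence + " (word " + word.sortIndex
                    + " of chunk " + word.parent.sortIndex + ")");
        }
    }
}
```

Keeping a reference to the parent lets an evaluation strategy weigh a match differently depending on where the word appeared, for example favouring matches in the first chunk.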

Comparison

This process compares, in every combination, the normalized CharacterSequence objects of the same type: Phrase with Phrase, Chunk with Chunk, Word with Word. Every comparison uses the distance algorithms defined in the StringDistanceAlgorithms class and produces a CharacterSequenceComparison, an object containing the two compared CharacterSequence objects and a score map generated by the algorithms. Through the compared elements it is also possible to access the sort index property and the parent length, to understand the position of each element in its root context and use it for a better evaluation of the result.

Each CharacterSequenceComparison is collected in a ComparingResult, which is then passed to the defined strategies; these decide whether the comparison has passed.
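The pairwise step can be sketched as follows. This is not the library's code: ComparisonSketch, PairScore and compareAll are hypothetical names, a hand-written Levenshtein distance stands in for the algorithms the library takes from java-string-similarity, and representing scores as a map from algorithm name to double is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "compare every element with every element of the same type".
final class ComparisonSketch {

    /** One compared pair plus its scores, keyed by algorithm name. */
    static final class PairScore {
        final String first;
        final String second;
        final Map<String, Double> scores;

        PairScore(String first, String second, Map<String, Double> scores) {
            this.first = first;
            this.second = second;
            this.scores = scores;
        }
    }

    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    /** Compares each element of the first list with each element of the second. */
    static List<PairScore> compareAll(List<String> first, List<String> second) {
        List<PairScore> out = new ArrayList<>();
        for (String a : first) {
            for (String b : second) {
                Map<String, Double> scores = new LinkedHashMap<>();
                scores.put("levenshtein", (double) levenshtein(a, b));
                out.add(new PairScore(a, b, scores));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (PairScore p : compareAll(List.of("soccer", "player", "ball"),
                                      List.of("soccer", "and", "play"))) {
            System.out.println(p.first + " vs " + p.second + " -> " + p.scores);
        }
    }
}
```

With a distance measure a score of 0 means the two strings are identical, which is what a strategy checking for a score equal to 0 relies on.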

Compare Result Evaluation

One or more result evaluation strategies can be defined and related to each other through the logical operators OR and AND. By implementing the ComparationPassedStrategy interface we define how the ComparingResult object is evaluated to decide whether the comparison has passed.

public class OneWordIsEnoughStrategy implements ComparationPassedStrategy {

    // Passes as soon as any word-level comparison has a score of 0 in any algorithm.
    @Override
    public boolean isPassed(ComparingResult r) {
        return r.getByUnit(CharacterSequenceUnit.WORD)
                .stream()
                .flatMap(c -> c.getScoreMap().values().stream())
                .anyMatch(score -> score == 0);
    }
}
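How several strategies might be combined with OR and AND can be sketched with plain predicates. This is an illustration only: StrategyChainSketch is a hypothetical name, and the left-to-right evaluation, with the operator of the first strategy ignored, is an assumption about the semantics rather than the library's documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical chain of evaluation strategies joined by logical operators.
final class StrategyChainSketch<T> {

    enum Operator { AND, OR }

    private final List<Predicate<T>> strategies = new ArrayList<>();
    private final List<Operator> operators = new ArrayList<>();

    /** Registers a strategy together with the operator that joins it to the previous ones. */
    StrategyChainSketch<T> add(Predicate<T> strategy, Operator operator) {
        strategies.add(strategy);
        operators.add(operator);
        return this;
    }

    /** Folds the strategies left to right; the first strategy's operator is ignored. */
    boolean isPassed(T result) {
        if (strategies.isEmpty()) {
            return false;
        }
        boolean passed = strategies.get(0).test(result);
        for (int i = 1; i < strategies.size(); i++) {
            boolean next = strategies.get(i).test(result);
            passed = operators.get(i) == Operator.AND ? (passed && next) : (passed || next);
        }
        return passed;
    }

    public static void main(String[] args) {
        Predicate<List<Double>> anyZero = s -> s.stream().anyMatch(v -> v == 0.0);
        Predicate<List<Double>> allSmall = s -> s.stream().allMatch(v -> v <= 2.0);
        boolean passed = new StrategyChainSketch<List<Double>>()
                .add(anyZero, Operator.OR)
                .add(allSmall, Operator.AND)
                .isPassed(List.of(0.0, 5.0));
        System.out.println(passed);
    }
}
```

A left-to-right fold gives no precedence to AND over OR, so the order in which strategies are added matters in this sketch.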
Example
  Similarity s = Similarity.Builder
      .newOne()
      // rules for the first input: chunks are separated by "/"
      .withFirstFactorNormalizationRules(PhraseNormalizerFactory.newOne()
          .withMinWordLength(3)
          .withApplyLowerCase(true)
          .withSplitterDelimiter("/")
          .build())
      // rules for the second input: chunks are separated by "-"
      .withSecondFactorNormalizationRules(PhraseNormalizerFactory.newOne()
          .withMinWordLength(3)
          .withApplyLowerCase(true)
          .withSplitterDelimiter("-")
          .build())
      .addPassedStrategy(new OneWordIsEnoughStrategy(), Similarity.Builder.LogicalOperator.OR)
      .build();

  boolean willBeTrue = s.compare("soccer/player/ball", "soccer and play-shoes-bombastic");

  boolean willBeFalse = s.compare("tennis/player/field", "rock and base-shoes-bombastic");

TODO

  • add a trim option to PhraseNormalizerFactory
  • handle the case where a replacement replaces the entire input, so that only whitespace remains
