Skip to content

Generalized Suffix Tree implementation with Ukkonen's algorithm in C++.

License

Notifications You must be signed in to change notification settings

lukau2357/gst-cpp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Implementation of the generalized suffix tree data structure in C++, along with modules for testing the correctness of the implementation. The tree is hardcoded to support 8 input strings (8 different terminating characters for each input strings), however this is easily extendable and should be a problem in the domain of bioinformatics where the alphabet size is 5 (considering nitrogenous bases of both DNA and RNA molecules). The following queries are implemented:

  • $substring(S, T)$ - check if $S$ is a substring of at least one sequence in the tree $T$ .
  • $isSuffix(S, T)$ - check if $S$ is a substring of at least one sequence in the tree $T$.
  • $occurrences(S, T)$ - Compute occurrences of $S$ accross all sequences in the tree $T$
  • $lcs(T)$ - Compute the longest common substring of all tree sequences in linear time.
  • $generalizedSubstring(minLength, minString, minOccurrences, T)$ - The most general query, return all substrings of the tree $T$ that are of minimum length $minLength$, occurr in minimum $minString$ sequences of the tree $T$, and in each of them they occurr at least $minOccurrences$ times.

About

Generalized Suffix Tree implementation with Ukkonen's algorithm in C++.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages