Merge pull request #6 from msg-systems/development
Merge v3.0.0 into master
richardpaulhudson committed Sep 10, 2021
2 parents ba17bd2 + 3024df1 commit fc536f3
Showing 70 changed files with 13,523 additions and 11,066 deletions.
129 changes: 129 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other information into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright 2019-2020 msg systems ag
Copyright 2019-2021 msg systems ag

The Holmes library is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
4 changes: 4 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,4 @@
include SHORTREADME.md
global-include *.cfg
global-include *.csv
global-include LICENSE
1,085 changes: 434 additions & 651 deletions README.md

Large diffs are not rendered by default.

32 changes: 17 additions & 15 deletions SHORTREADME.md
@@ -1,49 +1,51 @@
**Holmes** is a Python 3 library (tested with version 3.7.7) that supports a number of
use cases involving information extraction from English and German texts. In all use cases, the information extraction
is based on analysing the semantic relationships expressed by the component parts of each sentence:
**Holmes** is a Python 3 library (tested with version 3.9.5) running on top of
[spaCy](https://spacy.io/) (tested with version 3.1.2) that supports a number of use cases
involving information extraction from English and German texts. In all use cases, the information
extraction is based on analysing the semantic relationships expressed by the component parts of
each sentence:

- In the [chatbot](https://github.com/msg-systems/holmes-extractor/#getting-started) use case, the system is configured using one or more **search phrases**.
- In the [chatbot](https://github.com/msg-systems/holmes-extractor#getting-started) use case, the system is configured using one or more **search phrases**.
Holmes then looks for structures whose meanings correspond to those of these search phrases within
a searched **document**, which in this case corresponds to an individual snippet of text or speech
entered by the end user. Within a match, each word with its own meaning (i.e. that does not merely fulfil a grammatical function) in the search phrase
corresponds to one or more such words in the document. Both the fact that a search phrase was matched and any structured information the search phrase extracts can be used to drive the chatbot.
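The matching idea behind the chatbot use case can be pictured with a toy, self-contained sketch (the triples and the `matches` helper below are invented for illustration and are not the Holmes API): a search phrase matches when every semantic relation it expresses is also present in the user's input, however much extra material the input contains.

```python
# Toy illustration (not the Holmes API): a "search phrase" matches a document
# sentence when every semantic relation in the phrase is also present in the
# sentence, regardless of extra material in the sentence.

def matches(search_phrase_triples, document_triples):
    """Each (head, relation, dependent) triple of the search phrase must
    appear among the document's triples for a match."""
    return all(t in document_triples for t in search_phrase_triples)

# "Somebody requires insurance": one verb-object relation; the indefinite
# subject acts as a wildcard, so no subject triple is required.
search_phrase = {("require", "obj", "insurance")}

# "Richard Hudson and John Doe require health insurance for the next five years"
document = {
    ("require", "subj", "Richard Hudson"),
    ("require", "subj", "John Doe"),
    ("require", "obj", "insurance"),
    ("insurance", "mod", "health"),
    ("require", "for", "five years"),
}

print(matches(search_phrase, document))  # True
```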

- The [structural extraction](https://github.com/msg-systems/holmes-extractor/#structural-extraction) use case uses exactly the same
[structural matching](https://github.com/msg-systems/holmes-extractor/#how-it-works-structural-matching) technology as the chatbot use
- The [structural extraction](https://github.com/msg-systems/holmes-extractor#structural-extraction) use case uses exactly the same
[structural matching](https://github.com/msg-systems/holmes-extractor#how-it-works-structural-matching) technology as the chatbot use
case, but searching takes place with respect to a pre-existing document or documents that are typically much
longer than the snippets analysed in the chatbot use case, and the aim to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to
longer than the snippets analysed in the chatbot use case, and the aim is to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to
take over a second company. The identities of the companies concerned could then be stored in a database.
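A toy sketch of the extraction step (the `bindings` helper and the triples are invented for illustration, not the Holmes API): once a structure such as one company taking over another has been matched, the words filling the subject and object slots can be read off and stored.

```python
# Toy illustration (not the Holmes API): read off who does what to whom
# for a given predicate in a matched semantic structure.

def bindings(document_triples, verb):
    """Return every (subject, object) pair attached to the given predicate."""
    subjects = [d for h, r, d in document_triples if h == verb and r == "subj"]
    objects = [d for h, r, d in document_triples if h == verb and r == "obj"]
    return [(s, o) for s in subjects for o in objects]

# Invented triples for a sentence like "BigCorp plans to take over SmallCorp"
article = {
    ("take over", "subj", "BigCorp"),
    ("take over", "obj", "SmallCorp"),
    ("plan", "subj", "BigCorp"),
}

print(bindings(article, "take over"))  # [('BigCorp', 'SmallCorp')]
```

The extracted pair is exactly the kind of structured information that could then be written to a database.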

- The [topic matching](https://github.com/msg-systems/holmes-extractor/#topic-matching) use case aims to find passages in a document or documents whose meaning
- The [topic matching](https://github.com/msg-systems/holmes-extractor#topic-matching) use case aims to find passages in a document or documents whose meaning
is close to that of another document, which takes on the role of the **query document**, or to that of a **query phrase** entered ad-hoc by the user. Holmes extracts a number of small **phraselets** from the query phrase or
query document, matches the documents being searched against each phraselet, and conflates the results to find the
most relevant passages within the documents. Because there is no strict requirement that every word with its own
meaning in the query document match a specific word or words in the searched documents, more matches are found
query document, matches the documents being searched against each phraselet, and conflates the results to find
the most relevant passages within the documents. Because there is no strict requirement that every
word with its own meaning in the query document match a specific word or words in the searched documents, more matches are found
than in the structural extraction use case, but the matches do not contain structured information that can be
used in subsequent processing. The topic matching use case is demonstrated by [a website allowing searches within
the Harry Potter corpus (for English) and around 350 traditional stories (for German)](http://holmes-demo.xt.msg.team/).
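The phraselet idea can be approximated with a deliberately naive, self-contained sketch (the pair-based `phraselets` and `score` functions are invented here; Holmes derives and weights its phraselets quite differently): decompose the query into two-word combinations and score each document by how many of them it contains.

```python
from itertools import combinations

def phraselets(content_words):
    """Naive stand-in for phraselet extraction: every unordered pair of
    distinct content words in the query."""
    return set(combinations(sorted(set(content_words)), 2))

def score(query_words, document_words):
    """Fraction of query phraselets both of whose words occur in the document."""
    pairs = phraselets(query_words)
    present = set(document_words)
    hits = sum(1 for a, b in pairs if a in present and b in present)
    return hits / len(pairs) if pairs else 0.0

query = ["dog", "chase", "cat"]
doc_about_chase = ["the", "dog", "chase", "the", "cat", "in", "the", "garden"]
doc_about_weather = ["the", "weather", "today", "be", "rain"]

print(score(query, doc_about_chase))    # 1.0
print(score(query, doc_about_weather))  # 0.0
```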

- The [supervised document classification](https://github.com/msg-systems/holmes-extractor/#supervised-document-classification) use case uses training data to
- The [supervised document classification](https://github.com/msg-systems/holmes-extractor#supervised-document-classification) use case uses training data to
learn a classifier that assigns one or more **classification labels** to new documents based on what they are about.
It classifies a new document by matching it against phraselets that were extracted from the training documents in the
same way that phraselets are extracted from the query document in the topic matching use case. The technique is
inspired by bag-of-words-based classification algorithms that use n-grams, but aims to derive n-grams whose component
words are related semantically rather than merely happening to be neighbours in the surface representation of a language.
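A small invented example of why semantically related word pairs generalize better than surface n-grams (the hand-written dependency pairs below stand in for what a parser would recover; they are not Holmes output):

```python
def surface_bigrams(tokens):
    """Classic n-gram features: pairs of words adjacent in the text."""
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

s1 = "the dog quickly chased the cat".split()
s2 = "the cat was chased by the dog".split()  # same event, passive voice

# Adjacency-based features barely overlap, and only on function-word pairs:
print(surface_bigrams(s1) & surface_bigrams(s2))

# Hand-written head-dependent pairs a parser would recover for each sentence:
sem1 = {("chased", "dog"), ("chased", "cat")}
sem2 = {("chased", "dog"), ("chased", "cat")}
print(sem1 == sem2)  # True: identical features despite the different word order
```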

In all four use cases, the **individual words** are matched using a [number of strategies](https://github.com/msg-systems/holmes-extractor/#word-level-matching-strategies).
In all four use cases, the **individual words** are matched using a [number of strategies](https://github.com/msg-systems/holmes-extractor#word-level-matching-strategies).
To work out whether two grammatical structures that contain individually matching words correspond logically and
constitute a match, Holmes transforms the syntactic parse information provided by the [spaCy](https://spacy.io/) library
into semantic structures that allow texts to be compared using predicate logic. As a user of Holmes, you do not need to
understand the intricacies of how this works, although there are some
[important tips](https://github.com/msg-systems/holmes-extractor/#writing-effective-search-phrases) around writing effective search phrases for the chatbot and
[important tips](https://github.com/msg-systems/holmes-extractor#writing-effective-search-phrases) around writing effective search phrases for the chatbot and
structural extraction use cases that you should try and take on board.
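The effect of that syntax-to-semantics transformation can be pictured with a minimal invented sketch (the rules below are placeholders, not the actual Holmes transformation): an active and a passive sentence reduce to the same predicate-argument triple, so they compare as logically equivalent.

```python
def to_predicate(parse):
    """Normalize a toy dependency parse to a (predicate, agent, patient) triple.
    'parse' is a dict with keys 'verb', 'voice', 'subject' and 'object'/'agent'."""
    if parse["voice"] == "active":
        return (parse["verb"], parse["subject"], parse["object"])
    # Passive: the grammatical subject is the patient, the by-phrase the agent.
    return (parse["verb"], parse["agent"], parse["subject"])

# "The dog chased the cat" vs. "The cat was chased by the dog"
active = {"verb": "chase", "voice": "active", "subject": "dog", "object": "cat"}
passive = {"verb": "chase", "voice": "passive", "subject": "cat", "agent": "dog"}

print(to_predicate(active) == to_predicate(passive))  # True
```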

Holmes aims to offer generalist solutions that can be used more or less out of the box with
relatively little tuning, tweaking or training and that are rapidly applicable to a wide range of use cases.
At its core lies a logical, programmed, rule-based system that describes how syntactic representations in each
language express semantic relationships. Although the supervised document classification use case does incorporate a
neural network and although the spaCy library upon which Holmes builds has itself been pre-trained using machine
learning, the essentially rule-based nature of Holmes means that the chatbot, structural matching and topic matching use
learning, the essentially rule-based nature of Holmes means that the chatbot, structural extraction and topic matching use
cases can be put to use out of the box without any training and that the supervised document classification use case
typically requires relatively little training data, which is a great advantage because pre-labelled training data is
not available for many real-world problems.
17 changes: 17 additions & 0 deletions examples/example_chatbot_DE_insurance.py
@@ -0,0 +1,17 @@
import os
import holmes_extractor as holmes

if __name__ in ('__main__', 'example_chatbot_DE_insurance'):
script_directory = os.path.dirname(os.path.realpath(__file__))
ontology = holmes.Ontology(os.sep.join((
script_directory, 'example_chatbot_DE_insurance_ontology.owl')))
holmes_manager = holmes.Manager(model='de_core_news_lg', ontology=ontology, number_of_workers=2)
holmes_manager.register_search_phrase('Jemand benötigt eine Versicherung')
holmes_manager.register_search_phrase('Ein ENTITYPER schließt eine Versicherung ab')
holmes_manager.register_search_phrase('ENTITYPER benötigt eine Versicherung')
holmes_manager.register_search_phrase('Eine Versicherung für einen Zeitraum')
holmes_manager.register_search_phrase('Eine Versicherung fängt an')
holmes_manager.register_search_phrase('Jemand zahlt voraus')

holmes_manager.start_chatbot_mode_console()
# e.g. 'Richard Hudson und Max Mustermann brauchen eine Krankenversicherung für die nächsten fünf Jahre'
20 changes: 20 additions & 0 deletions examples/example_chatbot_EN_insurance.py
@@ -0,0 +1,20 @@
import os
import holmes_extractor as holmes

if __name__ in ('__main__', 'example_chatbot_EN_insurance'):
script_directory = os.path.dirname(os.path.realpath(__file__))
ontology = holmes.Ontology(os.sep.join((
script_directory, 'example_chatbot_EN_insurance_ontology.owl')))
holmes_manager = holmes.Manager(
model='en_core_web_lg', ontology=ontology, number_of_workers=2)
holmes_manager.register_search_phrase('Somebody requires insurance')
holmes_manager.register_search_phrase('An ENTITYPERSON takes out insurance')
holmes_manager.register_search_phrase('A company buys payment insurance')
holmes_manager.register_search_phrase('An ENTITYPERSON needs insurance')
holmes_manager.register_search_phrase('Insurance for a period')
holmes_manager.register_search_phrase('An insurance begins')
holmes_manager.register_search_phrase('Somebody prepays')
holmes_manager.register_search_phrase('Somebody makes an insurance payment')

holmes_manager.start_chatbot_mode_console()
# e.g. 'Richard Hudson and John Doe require health insurance for the next five years'
Original file line number Diff line number Diff line change
@@ -13,10 +13,11 @@ def download_and_register(url, label):
holmes_manager.parse_and_register_document(soup.get_text(), label)

# Start the Holmes Manager with the German model
holmes_manager = holmes.Manager(model='de_core_news_md')
download_and_register('https://www.gesetze-im-internet.de/vvg_2008/BJNR263110007.html', 'VVG_2008')
download_and_register('https://www.gesetze-im-internet.de/vag_2016/BJNR043410015.html', 'VAG')
holmes_manager.start_topic_matching_search_mode_console()
if __name__ in ('__main__', 'example_search_DE_law'):
holmes_manager = holmes.Manager(model='de_core_news_lg', number_of_workers=2)
download_and_register('https://www.gesetze-im-internet.de/vvg_2008/BJNR263110007.html', 'VVG_2008')
download_and_register('https://www.gesetze-im-internet.de/vag_2016/BJNR043410015.html', 'VAG')
holmes_manager.start_topic_matching_search_mode_console(initial_question_word_embedding_match_threshold=0.7)

# Example queries:
#
Original file line number Diff line number Diff line change
@@ -11,19 +11,18 @@
HOLMES_EXTENSION = 'hdc'
flag_filename = os.sep.join((working_directory, 'STORY_PARSING_COMPLETE'))

print('Initializing Holmes...')
print('Initializing Holmes (this may take some time) ...')
# Start the Holmes manager with the German model
holmes_manager = holmes.MultiprocessingManager(
model='de_core_news_md', overall_similarity_threshold=0.85, number_of_workers=4)
# set number_of_workers to prevent memory exhaustion / swapping; it should never be more
# than the number of cores on the machine
holmes_manager = holmes.Manager(
model='de_core_news_lg')

def process_documents_from_front_page(
manager, front_page_uri, front_page_label):
def process_documents_from_front_page(front_page_uri, front_page_label):
""" Download and save all the stories from a front page."""

front_page = urllib.request.urlopen(front_page_uri)
front_page_soup = BeautifulSoup(front_page, 'html.parser')
document_texts = []
labels = []
# For each story ...
for anchor in front_page_soup.find_all('a'):
if not anchor['href'].startswith('/') and not anchor['href'].startswith('https'):
@@ -44,15 +43,16 @@ def process_documents_from_front_page(
this_document_text = ' '.join(this_document_text.split())
# Create a document label from the front page label and the story name
this_document_label = ' - '.join((front_page_label, anchor.contents[0]))
# Parse the document
print('Parsing', this_document_label)
manager.parse_and_register_document(this_document_text, this_document_label)
# Save the document
print('Saving', this_document_label)
output_filename = os.sep.join((working_directory, this_document_label))
output_filename = '.'.join((output_filename, HOLMES_EXTENSION))
with open(output_filename, "w") as file:
file.write(manager.serialize_document(this_document_label))
document_texts.append(this_document_text)
labels.append(this_document_label)
parsed_documents = holmes_manager.nlp.pipe(document_texts)
for index, parsed_document in enumerate(parsed_documents):
label = labels[index]
print('Saving', label)
output_filename = os.sep.join((working_directory, label))
output_filename = '.'.join((output_filename, HOLMES_EXTENSION))
with open(output_filename, "wb") as file:
file.write(parsed_document.to_bytes())

def load_documents_from_working_directory():
serialized_documents = {}
@@ -61,31 +61,31 @@ def load_documents_from_working_directory():
print('Loading', file)
label = file[:-4]
long_filename = os.sep.join((working_directory, file))
with open(long_filename, "r") as file:
with open(long_filename, "rb") as file:
contents = file.read()
serialized_documents[label] = contents
holmes_manager.deserialize_and_register_documents(serialized_documents)
print('Indexing documents (this may take some time) ...')
holmes_manager.register_serialized_documents(serialized_documents)

if os.path.exists(working_directory):
if not os.path.isdir(working_directory):
raise RuntimeError(' '.join((working_directory), 'must be a directory'))
raise RuntimeError(' '.join((working_directory, 'must be a directory')))
else:
os.mkdir(working_directory)

if os.path.isfile(flag_filename):
load_documents_from_working_directory()
else:
normal_holmes_manager = holmes.Manager(model='de_core_news_md')
process_documents_from_front_page(
normal_holmes_manager, "https://maerchen.com/grimm/", 'Gebrüder Grimm')
"https://maerchen.com/grimm/", 'Gebrüder Grimm')
process_documents_from_front_page(
normal_holmes_manager, "https://maerchen.com/grimm2/", 'Gebrüder Grimm')
"https://maerchen.com/grimm2/", 'Gebrüder Grimm')
process_documents_from_front_page(
normal_holmes_manager, "https://maerchen.com/andersen/", 'Hans Christian Andersen')
"https://maerchen.com/andersen/", 'Hans Christian Andersen')
process_documents_from_front_page(
normal_holmes_manager, "https://maerchen.com/bechstein/", 'Ludwig Bechstein')
"https://maerchen.com/bechstein/", 'Ludwig Bechstein')
process_documents_from_front_page(
normal_holmes_manager, "https://maerchen.com/wolf/", 'Johann Wilhelm Wolf')
"https://maerchen.com/wolf/", 'Johann Wilhelm Wolf')
# Generate flag file to indicate files can be reloaded on next run
open(flag_filename, 'a').close()
load_documents_from_working_directory()
@@ -101,8 +101,8 @@ def load_documents_from_working_directory():

class RestHandler():
def on_get(self, req, resp):
resp.body = \
json.dumps(holmes_manager.topic_match_documents_returning_dictionaries_against(
resp.text = \
json.dumps(holmes_manager.topic_match_documents_against(
req.params['entry'][0:200], only_one_result_per_document=True))
resp.cache_control = ["s-maxage=31536000"]

