QuickUMLS v.1.4 (#55)
Release Notes:

- [NEW] Added support for [unqlite](https://github.com/coleifer/unqlite-python) as an alternative to leveldb for storage of CUIs and Semantic Types. This allows creating multiple QuickUMLS matchers from the same installation.
- [NEW] Added support for conversion of all uppercase words ([#48](#48), thank you sandertan@!).
- [NEW] Automatically downloads SpaCy data for selected language if missing.
- [FIX] Mitigated [#52](#52).
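
As a rough illustration of the first note, a wrapper can support two key-value backends by binding backend-specific `put`/`get` callables once, at construction time. The sketch below is hypothetical: `FakeLevelDB` and `FakeUnQLite` are in-memory stand-ins (not the real libraries) that only mimic the `Put`/`Get` and `store`/`fetch` method names used in this commit, and `KeyValueStore` is an illustrative name rather than a QuickUMLS class.

```python
class FakeLevelDB:
    """In-memory stand-in with leveldb-style method names (assumption, not the real library)."""

    def __init__(self):
        self._data = {}

    def Put(self, key, value):
        self._data[key] = value

    def Get(self, key):
        return self._data[key]  # raises KeyError when the key is missing


class FakeUnQLite:
    """In-memory stand-in with unqlite-style method names (assumption, not the real library)."""

    def __init__(self):
        self._data = {}

    def store(self, key, value):
        self._data[key] = value

    def fetch(self, key):
        return self._data[key]  # raises KeyError when the key is missing


class KeyValueStore:
    """Hypothetical wrapper: picks a backend once, then exposes uniform put/get."""

    def __init__(self, database_backend='leveldb'):
        if database_backend == 'unqlite':
            self._db = FakeUnQLite()
            self.put, self.get = self._db.store, self._db.fetch
        elif database_backend == 'leveldb':
            self._db = FakeLevelDB()
            self.put, self.get = self._db.Put, self._db.Get
        else:
            raise ValueError(
                'database_backend {} not recognized'.format(database_backend))

    def has_key(self, key):
        try:
            self.get(key)
            return True
        except KeyError:
            return False


# Callers use the same two methods regardless of the backend selected.
stores = {name: KeyValueStore(name) for name in ('leveldb', 'unqlite')}
for store in stores.values():
    store.put(b'term', b'value')
```

Binding the callables up front keeps the read and write paths free of per-call backend checks, which is the same shape as the `cui_db_put` / `cui_db_get` attributes in the `toolbox.py` diff.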
soldni committed May 13, 2020
1 parent bd58713 commit bad4c50
Showing 6 changed files with 103 additions and 31 deletions.
8 changes: 5 additions & 3 deletions README.md
@@ -1,4 +1,4 @@
[**NEW: v.1.3 is pip-ready!**](https://giphy.com/embed/BlVnrxJgTGsUw) You can now install QuickUMLS through a simple `pip install quickumls`.
[**NEW: v.1.4 supports starting multiple QuickUMLS matchers concurrently!**](https://giphy.com/embed/BlVnrxJgTGsUw) I've finally added support for [unqlite](https://github.com/coleifer/unqlite-python) as an alternative to leveldb for storage of CUIs and Semantic Types (see [here](https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4) for more details). unqlite-backed QuickUMLS installations support multiple matchers running at the same time. Besides better multi-processing support, unqlite should also handle unicode better.

# QuickUMLS

@@ -11,12 +11,12 @@ This project should be compatible with Python 3 (Python 2 is [no longer supporte
## Installation

1. **Obtain a UMLS installation** This tool requires you to have a valid UMLS installation on disk. To install UMLS, you must first obtain a [license](https://uts.nlm.nih.gov/license.html) from the National Library of Medicine; then you should download all UMLS files from [this page](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html); finally, you can install UMLS using the [MetamorphoSys](https://www.nlm.nih.gov/pubs/factsheets/umlsmetamorph.html) tool as [explained in this guide](https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html). The installation can be removed once the system has been initialized.
2. **Install QuickUMLS**: You can do so by either running `pip install quickumls` or `python setup.py install`. On macOS, using anaconda is **strongly recommended**<sup>†</sup>.
3. **Obrain a SpaCy corpus**: After you install QuickUMLS and its dependencies, you should be able to do so by running `python -m spacy download en`.
2. **Install QuickUMLS**: You can do so by either running `pip install quickumls` or `python setup.py install`. On macOS, using anaconda is **strongly recommended**<sup>†</sup>.
3. **Create a QuickUMLS installation** Initialize the system by running `python -m quickumls.install <umls_installation_path> <destination_path>`, where `<umls_installation_path>` is where the installation files are (in particular, we need `MRCONSO.RRF` and `MRSTY.RRF`) and `<destination_path>` is the directory where the QuickUMLS data files should be installed. This process will take between 5 and 30 minutes, depending on the speed of the CPU and of the drive where the UMLS and QuickUMLS files are stored (on a system with an Intel i7 6700K CPU and a 7200 RPM hard drive, initialization takes 8.5 minutes). `python -m quickumls.install` supports the following optional arguments:
- `-L` / `--lowercase`: if used, all concept terms are folded to lowercase before being processed. This option typically increases recall, but it might reduce precision;
- `-U` / `--normalize-unicode`: if used, expressions with non-ASCII characters are converted to the closest combination of ASCII characters.
- `-E` / `--language`: Specify the language to consider for UMLS concepts; by default, English is used. For a complete list of languages, please see [this table provided by NLM](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html#LAT).
- `-d` / `--database-backend`: Specify which database backend to use for QuickUMLS. The two options are `leveldb` and `unqlite`. The latter supports multi-process reading and has better unicode compatibility, and it is used as the default for all new 1.4 installations; the former is still used as the default when instantiating a QuickUMLS client. More info about differences between the two databases and migration info are available [here](https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4).
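
The optional arguments listed above can be pictured with a small `argparse` sketch. This is an approximation reconstructed from the option names in this README and the `install.py` diff, not the installer's actual parser: `build_parser` is a hypothetical helper, the help strings are omitted, the paths are placeholders, and the `--language` choices are left unrestricted here.

```python
import argparse


def build_parser():
    # Hypothetical helper mirroring the documented options of
    # `python -m quickumls.install`; not the project's own code.
    ap = argparse.ArgumentParser(prog='quickumls.install')
    ap.add_argument('umls_installation_path')
    ap.add_argument('destination_path')
    ap.add_argument('-L', '--lowercase', action='store_true')
    ap.add_argument('-U', '--normalize-unicode', action='store_true')
    ap.add_argument('-d', '--database-backend',
                    choices=('leveldb', 'unqlite'), default='unqlite')
    # Assumption: the real installer limits this to its supported languages.
    ap.add_argument('-E', '--language', default='ENG')
    return ap


parser = build_parser()

# Placeholder paths; no files are touched when parsing.
defaults = parser.parse_args(['/path/to/umls', '/path/to/destination'])
explicit = parser.parse_args(
    ['/path/to/umls', '/path/to/destination', '-L', '-d', 'leveldb'])

print(defaults.database_backend, explicit.database_backend)
```

Omitting `-d` selects `unqlite`, matching the default described for new 1.4 installations; passing `-d leveldb` keeps the older backend.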


**†**: If the installation fails on macOS when using Anaconda, install `leveldb` first by running `conda install -c conda-forge python-leveldb`.
@@ -50,6 +50,8 @@ matcher.match(text, best_match=True, ignore_syntax=False)

Set `best_match` to `False` if you want to return overlapping candidates, `ignore_syntax` to `True` to disable all heuristics introduced in (Soldaini and Goharian, 2016).

If the matcher throws a warning during initialization, read [this page](https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4) to learn why and how to stop it from doing so.
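
Judging from the `core.py` changes in this commit, the warning appears when the installation directory has no `database_backend.flag` file, in which case `leveldb` is assumed. A minimal sketch of that lookup follows; `detect_database_backend` is a hypothetical helper name, the warning text is abbreviated, and the temporary directories merely stand in for real QuickUMLS installations.

```python
import os
import sys
import tempfile


def detect_database_backend(quickumls_fp):
    """Hypothetical helper: read the backend flag, falling back to leveldb."""
    flag_fp = os.path.join(quickumls_fp, 'database_backend.flag')
    if os.path.exists(flag_fp):
        with open(flag_fp) as f:
            return f.read().strip()
    # Abbreviated stand-in for the warning emitted by the matcher.
    print('[WARNING] no database_backend.flag found; assuming leveldb',
          file=sys.stderr)
    return 'leveldb'


with tempfile.TemporaryDirectory() as new_install, \
        tempfile.TemporaryDirectory() as old_install:
    # An installation created with v.1.4 carries the flag file.
    with open(os.path.join(new_install, 'database_backend.flag'), 'w') as f:
        f.write('unqlite')

    new_backend = detect_database_backend(new_install)  # no warning
    old_backend = detect_database_backend(old_install)  # warns on stderr
```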


## Server / Client Support

4 changes: 2 additions & 2 deletions quickumls/about.py
@@ -4,9 +4,9 @@
# https://github.com/explosion/spaCy/blob/master/spacy/about.py

__title__ = 'quickumls'
__version__ = '1.3.0r4'
__version__ = '1.4.0r1'
__author__ = 'Luca Soldaini'
__email__ = '[email protected]'
__license__ = 'MIT'
__uri__ = "https://github.com/Georgetown-IR-Lab/QuickUMLS"
__copyright__ = '2014-2019, Georgetown University Information Retrieval Lab'
__copyright__ = '2014-2020, Georgetown University Information Retrieval Lab'
17 changes: 16 additions & 1 deletion quickumls/core.py
@@ -126,6 +126,19 @@ def __init__(
)
spacy_lang = constants.SPACY_LANGUAGE_MAP[self.language_flag]

        database_backend_fp = os.path.join(quickumls_fp, 'database_backend.flag')
        if os.path.exists(database_backend_fp):
            with open(database_backend_fp) as f:
                self._database_backend = f.read().strip()
        else:
            print('[WARNING] This installation was created with QuickUMLS v.1.3 or earlier, '
                  'which does not support multiple database backends. For now, I\'ll '
                  'assume that leveldb was used as default, implicit assumption will '
                  'change in future versions of QuickUMLS. More info here: '
                  'https://github.com/Georgetown-IR-Lab/QuickUMLS/wiki/Migration-QuickUMLS-1.3-to-1.4',
                  file=sys.stderr)
            self._database_backend = 'leveldb'

# domain specific stopwords
self._stopwords = self._stopwords.union(constants.DOMAIN_SPECIFIC_STOPWORDS)

@@ -149,7 +162,9 @@ def __init__(
self.ss_db = toolbox.SimstringDBReader(
simstring_fp, similarity_name, threshold
)
self.cuisem_db = toolbox.CuiSemTypesDB(cuisem_fp)
self.cuisem_db = toolbox.CuiSemTypesDB(
cuisem_fp, database_backend=self._database_backend
)

def get_info(self):
"""Computes a summary of the matcher options.
53 changes: 41 additions & 12 deletions quickumls/install.py
@@ -1,24 +1,28 @@
from __future__ import unicode_literals, division, print_function

# built in modules
import argparse
import codecs
import os
from six.moves import input
import shutil
import sys
import time
import codecs
import shutil
import argparse
from six.moves import input

# project modules
from .toolbox import countlines, CuiSemTypesDB, SimstringDBWriter, mkdir
from .constants import HEADERS_MRCONSO, HEADERS_MRSTY, LANGUAGES

try:
    from unidecode import unidecode
except ImportError:
    pass


# third party-dependencies
import spacy


# project modules
from .toolbox import countlines, CuiSemTypesDB, SimstringDBWriter, mkdir
from .constants import HEADERS_MRCONSO, HEADERS_MRSTY, LANGUAGES, SPACY_LANGUAGE_MAP


def get_semantic_types(path, headers):
sem_types = {}
with codecs.open(path, encoding='utf-8') as f:
@@ -98,13 +102,13 @@ def extract_from_mrconso(
print(status)


def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir):
def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir, database_backend):
# Create destination directories for the two databases
mkdir(simstring_dir)
mkdir(cuisty_dir)

ss_db = SimstringDBWriter(simstring_dir)
cuisty_db = CuiSemTypesDB(cuisty_dir)
cuisty_db = CuiSemTypesDB(cuisty_dir, database_backend=database_backend)

simstring_terms = set()

@@ -116,6 +120,20 @@ def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir):
cuisty_db.insert(term, cui, stys, preferred)


def install_spacy(lang):
    """Tries to create a spacy object; if it fails, downloads the dataset"""

    print(f'Determining if SpaCy for language "{lang}" is installed...')

    if lang in SPACY_LANGUAGE_MAP:
        try:
            spacy.load(SPACY_LANGUAGE_MAP[lang])
            print(f'SpaCy is installed and available for {lang}!')
        except OSError:
            print('SpaCy is not available! Attempting to download and install...')
            spacy.cli.download(SPACY_LANGUAGE_MAP[lang])


def parse_args():
ap = argparse.ArgumentParser()
ap.add_argument(
@@ -135,6 +153,10 @@ def parse_args():
'-U', '--normalize-unicode', action='store_true',
help='Normalize unicode strings to their closest ASCII representation'
)
    ap.add_argument(
        '-d', '--database-backend', choices=('leveldb', 'unqlite'), default='unqlite',
        help='KV database to use to store CUIs and semantic types'
    )
ap.add_argument(
'-E', '--language', default='ENG', choices=LANGUAGES,
help='Extract concepts of the specified language'
@@ -146,6 +168,8 @@ def parse_args():
def main():
opts = parse_args()

install_spacy(opts.language)

if not os.path.exists(opts.destination_path):
msg = ('Directory "{}" does not exist; should I create it? [y/N] '
''.format(opts.destination_path))
@@ -189,6 +213,10 @@ def main():
with open(flag_fp, 'w') as f:
f.write(opts.language)

    flag_fp = os.path.join(opts.destination_path, 'database_backend.flag')
    with open(flag_fp, 'w') as f:
        f.write(opts.database_backend)

mrconso_path = os.path.join(opts.umls_installation_path, 'MRCONSO.RRF')
mrsty_path = os.path.join(opts.umls_installation_path, 'MRSTY.RRF')

@@ -197,7 +225,8 @@ def main():
simstring_dir = os.path.join(opts.destination_path, 'umls-simstring.db')
cuisty_dir = os.path.join(opts.destination_path, 'cui-semtypes.db')

parse_and_encode_ngrams(mrconso_iterator, simstring_dir, cuisty_dir)
parse_and_encode_ngrams(mrconso_iterator, simstring_dir, cuisty_dir,
database_backend=opts.database_backend)


if __name__ == '__main__':
49 changes: 37 additions & 12 deletions quickumls/toolbox.py
@@ -3,6 +3,7 @@
# build-in modules
import re
import os
from functools import wraps
import six
import unicodedata
from string import punctuation
@@ -12,6 +13,11 @@
# installed modules
import numpy
import leveldb
try:
    import unqlite
    UNQLITE_AVAILABLE = True
except ImportError:
    UNQLITE_AVAILABLE = False

# project imports
from quickumls_simstring import simstring
@@ -216,21 +222,37 @@ def append(self, interval):


class CuiSemTypesDB(object):
def __init__(self, path):
def __init__(self, path, database_backend='leveldb'):
if not (os.path.exists(path) or os.path.isdir(path)):
err_msg = (
'"{}" is not a valid directory').format(path)
raise IOError(err_msg)

self.cui_db = leveldb.LevelDB(
os.path.join(path, 'cui.leveldb'))
self.semtypes_db = leveldb.LevelDB(
os.path.join(path, 'semtypes.leveldb'))
        if database_backend == 'unqlite':
            assert UNQLITE_AVAILABLE, (
                'You selected unqlite as database backend, but it is not '
                'installed. Please install it via `pip install unqlite`'
            )
            self.cui_db = unqlite.UnQLite(os.path.join(path, 'cui.unqlite'))
            self.cui_db_put = self.cui_db.store
            self.cui_db_get = self.cui_db.fetch
            self.semtypes_db = unqlite.UnQLite(os.path.join(path, 'semtypes.unqlite'))
            self.semtypes_db_put = self.semtypes_db.store
            self.semtypes_db_get = self.semtypes_db.fetch
        elif database_backend == 'leveldb':
            self.cui_db = leveldb.LevelDB(os.path.join(path, 'cui.leveldb'))
            self.cui_db_put = self.cui_db.Put
            self.cui_db_get = self.cui_db.Get
            self.semtypes_db = leveldb.LevelDB(os.path.join(path, 'semtypes.leveldb'))
            self.semtypes_db_put = self.semtypes_db.Put
            self.semtypes_db_get = self.semtypes_db.Get
        else:
            raise ValueError(f'database_backend {database_backend} not recognized')

def has_term(self, term):
term = prepare_string_for_db_input(safe_unicode(term))
try:
self.cui_db.Get(db_key_encode(term))
self.cui_db_get(db_key_encode(term))
return True
except KeyError:
return
@@ -242,28 +264,31 @@ def insert(self, term, cui, semtypes, is_preferred):
# some terms have multiple cuis associated with them,
# so we store them all
try:
cuis = pickle.loads(self.cui_db.Get(db_key_encode(term)))
cuis = pickle.loads(self.cui_db_get(db_key_encode(term)))
except KeyError:
cuis = set()

cuis.add((cui, is_preferred))
self.cui_db.Put(db_key_encode(term), pickle.dumps(cuis))
self.cui_db_put(db_key_encode(term), pickle.dumps(cuis))

try:
self.semtypes_db.Get(db_key_encode(cui))
self.semtypes_db_get(db_key_encode(cui))
except KeyError:
self.semtypes_db.Put(
self.semtypes_db_put(
db_key_encode(cui), pickle.dumps(set(semtypes))
)

def get(self, term):
term = prepare_string_for_db_input(safe_unicode(term))
try:
cuis = pickle.loads(self.cui_db_get(db_key_encode(term)))
except KeyError:
cuis = set()

cuis = pickle.loads(self.cui_db.Get(db_key_encode(term)))
matches = (
(
cui,
pickle.loads(self.semtypes_db.Get(db_key_encode(cui))),
pickle.loads(self.semtypes_db_get(db_key_encode(cui))),
is_preferred
)
for cui, is_preferred in cuis
3 changes: 2 additions & 1 deletion requirements.txt
@@ -3,4 +3,5 @@ numpy>=1.8.2
spacy>=1.6.0
unidecode>=0.4.19
nltk>=3.3
quickumls_simstring>=1.1.5r1
quickumls_simstring>=1.1.5r1
unqlite>=0.8.1
