Merge pull request #7 from Georgetown-IR-Lab/v1.2
V1.2
soldni committed May 31, 2017
2 parents 2f86356 + a5cfe16 commit 562812c
Showing 8 changed files with 416 additions and 80 deletions.
70 changes: 57 additions & 13 deletions README.md
@@ -1,4 +1,6 @@
**We recommend to download the latest tested version form the [release section](https://github.com/Georgetown-IR-Lab/QuickUMLS/releases)**.
**We recommend downloading the latest tested version from the [releases section](https://github.com/Georgetown-IR-Lab/QuickUMLS/releases)**.

**NEW: v.1.2 now includes client/server support!** Start a QuickUMLS server once and avoid reloading QuickUMLS each time your experiments run. See <a href="#client_server">below</a> for more info.

# QuickUMLS

@@ -13,43 +15,85 @@ This project should be compatible with both Python 2 and 3 and run on any UNIX s
#### Before Starting

1. Make sure that your Python installation includes C headers (e.g., on Ubuntu, make sure `python3-dev` or `python-dev` is installed).
2. This software requires all packages listed in the requirements.txt file. You can install all of them by running `pip install -r requirements.txt`.
2. This software requires all packages listed in the `requirements.txt` file. You can install all of them by running `pip install -r requirements.txt`.
3. Note that, in order to use `spacy`, you are required to download its corpus. You can do that by running `python -m spacy.en.download`.
4. This system requires you to have a valid UMLS installation on disk. The installation can be removed once the system has been initialized.

#### To get the System Running
#### How to Get the System Initialized

1. Download and compile Simstring by running `bash setup_simstring.sh <python_version>`, where `<python_version>` is either "`2`" or "`3`".
2. Initialize the system by running `python install.py <umls_installation_path> <destination_path>`, where `<umls_installation_path>` is where the installation files are (in particular, we need `MRCONSO.RRF` and `MRSTY.RRF`) and `<destination_path>` is the directory where the QuickUmls data files should be installed. This process will take between 5 and 30 minutes depending how fast is the drive where UMLS and QuickUMLS files are stored.
2. Initialize the system by running `python install.py <umls_installation_path> <destination_path>`, where `<umls_installation_path>` is the directory containing the UMLS installation files (in particular, `MRCONSO.RRF` and `MRSTY.RRF` are needed) and `<destination_path>` is the directory where the QuickUMLS data files should be installed. This process takes between 5 and 30 minutes, depending on the speed of the CPU and of the drive where the UMLS and QuickUMLS files are stored (on a system with an Intel i7 6700K CPU and a 7200RPM hard drive, initialization takes 8.5 minutes).

`install.py` supports the following optional arguments:
- `-L` / `--lowercase`: if used, all concept terms are folded to lowercase before being processed. This option typically increases recall, but it might reduce precision;
- `-U` / `--normalize-unicode`: if used, expressions with non-ASCII characters are converted to the closest combination of ASCII characters.
- `-E` / `--language`: Specify the language to consider for UMLS concepts; by default, English is used. For a complete list of languages, please see [this table provided by NLM](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html#LAT).
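
As a rough illustration of what the `-L` and `-U` options do to a concept term, here is a minimal sketch. It is not the code in `install.py`: the function name `normalize_term` is hypothetical, and the standard-library `unicodedata` module is used as a stand-in for `unidecode`, so characters without an ASCII decomposition are simply dropped instead of being transliterated.

```python
import unicodedata


def normalize_term(term, lowercase=False, normalize_unicode=False):
    """Illustrative normalization of a concept term (hypothetical helper)."""
    if normalize_unicode:
        # Decompose accented characters into base character + combining mark,
        # then drop everything that is not ASCII. This only approximates
        # unidecode: characters with no ASCII decomposition are removed.
        decomposed = unicodedata.normalize('NFKD', term)
        term = decomposed.encode('ascii', 'ignore').decode('ascii')
    if lowercase:
        # Fold case so that "Ulna" and "ulna" map to the same string.
        term = term.lower()
    return term


print(normalize_term(u'Sj\u00f6gren Syndrome', lowercase=True, normalize_unicode=True))
```

Folding case and accents makes more surface forms collide with the same dictionary entry, which is why these options tend to raise recall at some cost in precision.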

## APIs

A QuickUMLS object can be instantiated as follows:

```python
>>> matcher = QuickUMLS(quickumls_fp, overlapping_criteria, threshold,
similarity_name, window, accepted_semtypes)
matcher = QuickUMLS(quickumls_fp, overlapping_criteria, threshold,
similarity_name, window, accepted_semtypes)
```

Where:

- `quickumls_fp` is the directory where the QuickUMLS data files are installed.
- `overlapping_criteria` (default: "score") is the criteria used to deal with overlapping concepts; choose "score" if the matching score of the concepts should be consider first, "length" if the longest should be considered first instead.
- `threshold` (default: 0.7) is the minimum similarity value between strings.
- `similarity_name` (default: "jaccard") is the name of similarity to use. Choose between "dice", "jaccard", "cosine", or "overlap".
- `window` (default: 5) is the maximum number of tokens to consider for matching.s
- `accepted_semtypes` (default: see `constants.py`) is the set of UMLS semantic types concepts should belong to. Semantic types are identified by the letter "T" followed by three numbers (e.g., "T131", which identifies the type *"Hazardous or Poisonous Substance"*). See [here](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt) for the full list.
- `overlapping_criteria` (optional, default: "score") is the criterion used to deal with overlapping concepts; choose "score" if the matching score of the concepts should be considered first, "length" if the longest should be considered first instead.
- `threshold` (optional, default: 0.7) is the minimum similarity value between strings.
- `similarity_name` (optional, default: "jaccard") is the name of similarity to use. Choose between "dice", "jaccard", "cosine", or "overlap".
- `window` (optional, default: 5) is the maximum number of tokens to consider for matching.
- `accepted_semtypes` (optional, default: see `constants.py`) is the set of UMLS semantic types concepts should belong to. Semantic types are identified by the letter "T" followed by three numbers (e.g., "T131", which identifies the type *"Hazardous or Poisonous Substance"*). See [here](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt) for the full list.
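
To give a feel for how `threshold` and `similarity_name` interact, the sketch below computes the four measures on sets of character trigrams. This is an assumption-laden illustration, not QuickUMLS or Simstring code: the helper names `char_ngrams` and `similarity` are hypothetical, and plain Python sets are a simplification of the feature sets an approximate string matcher would use.

```python
from __future__ import division

import math


def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces."""
    padded = ' ' * (n - 1) + text + ' ' * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, name='jaccard', n=3):
    """Set-based similarity between the n-gram sets of two strings."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    shared = len(x & y)
    if name == 'jaccard':
        return shared / len(x | y)
    if name == 'dice':
        return 2 * shared / (len(x) + len(y))
    if name == 'cosine':
        return shared / math.sqrt(len(x) * len(y))
    if name == 'overlap':
        return shared / min(len(x), len(y))
    raise ValueError('unknown similarity: {}'.format(name))


# A candidate would be kept only if its similarity reaches the threshold.
threshold = 0.7
for name in ('jaccard', 'dice', 'cosine', 'overlap'):
    score = similarity('humerus', 'humeral', name)
    print(name, round(score, 3), score >= threshold)
```

For the same pair of strings, Jaccard is never larger than Dice, so a given `threshold` is stricter under "jaccard" than under "dice".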

To use the matcher, simply call

```python
>>> text = "The ulna has dislocated posteriorly from the trochlea of the humerus."
>>> matcher.match(text, best_match=True, ignore_syntax=False)
text = "The ulna has dislocated posteriorly from the trochlea of the humerus."
matcher.match(text, best_match=True, ignore_syntax=False)
```

Set `best_match` to `False` if you want to return overlapping candidates, and set `ignore_syntax` to `True` to disable all heuristics introduced in (Soldaini and Goharian, 2016).


<h2 id="client_server">[NEW] Server / Client Support</h2>

Starting with v.1.2, QuickUMLS can be used in a client-server configuration. That is, you can start one QuickUMLS server and query it from multiple scripts using a client.

To start the server, run `server.py`:

```bash
python server.py /path/to/quickumls/files {-P QuickUMLS port} {-H QuickUMLS host} {QuickUMLS options}
```

Host and port are optional; by default, QuickUMLS runs on `localhost:4645`. You can also pass any QuickUMLS option mentioned above to the server. To obtain a list of options for the server, run `python server.py -h`.

To load the client, import `get_quickumls_client` from `client.py`:

```python
from client import get_quickumls_client
matcher = get_quickumls_client()
text = "The ulna has dislocated posteriorly from the trochlea of the humerus."
matcher.match(text, best_match=True, ignore_syntax=False)
```

The API of the client is the same as that of a QuickUMLS object.
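
Because the client depends on a server that is already running, it can help to check that something is listening before creating it. The following is a minimal sketch using only the standard library; `server_is_reachable` is a hypothetical helper that is not part of QuickUMLS, and it only confirms that a TCP port accepts connections, not that a QuickUMLS server is behind it.

```python
import socket


def server_is_reachable(host='localhost', port=4645, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Connection refused, host unreachable, or timed out.
        return False
    conn.close()
    return True


if not server_is_reachable():
    print('No server on localhost:4645; start server.py first.')
```

The defaults mirror the `localhost:4645` address mentioned above; pass the host and port you gave to `server.py` if they differ.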


In case you wish to run the server in the background, you can do so as follows:

```bash
nohup python server.py /path/to/QuickUMLS {server options} > /dev/null 2>&1 & echo $! > nohup.pid

```

When you are done, don't forget to stop the server by running:
```bash
kill -9 `cat nohup.pid`
rm nohup.pid
```

## References

- Okazaki, Naoaki, and Jun'ichi Tsujii. "*Simple and efficient algorithm for approximate dictionary matching.*" COLING 2010.
12 changes: 12 additions & 0 deletions client.py
@@ -0,0 +1,12 @@
try:
    from network import MinimalClient
    from quickumls import QuickUMLS
except ImportError:
    from .network import MinimalClient
    from .quickumls import QuickUMLS


def get_quickumls_client(host='localhost', port=4645):
    '''Return a client for a QuickUMLS server running on host at port'''
    client = MinimalClient(QuickUMLS, host=host, port=port, buffersize=4096)
    return client
28 changes: 28 additions & 0 deletions constants.py
@@ -45,3 +45,31 @@
u'\u3030', u'\u30a0', u'\ufe31', u'\ufe32', u'\ufe58', u'\ufe63',
u'\uff0d'
}

LANGUAGES = {
'BAQ', #Basque
'CHI', #Chinese
'CZE', #Czech
'DAN', #Danish
'DUT', #Dutch
'ENG', #English
'EST', #Estonian
'FIN', #Finnish
'FRE', #French
'GER', #German
'GRE', #Greek
'HEB', #Hebrew
'HUN', #Hungarian
'ITA', #Italian
'JPN', #Japanese
'KOR', #Korean
'LAV', #Latvian
'NOR', #Norwegian
'POL', #Polish
'POR', #Portuguese
'RUS', #Russian
'SCR', #Croatian
'SPA', #Spanish
'SWE', #Swedish
'TUR', #Turkish
}
5 changes: 0 additions & 5 deletions docs/_config.yml

This file was deleted.

54 changes: 0 additions & 54 deletions docs/index.md

This file was deleted.

29 changes: 21 additions & 8 deletions install.py
@@ -10,7 +10,7 @@

# project modules
from toolbox import countlines, CuiSemTypesDB, SimstringDBWriter, mkdir
from constants import HEADERS_MRCONSO, HEADERS_MRSTY
from constants import HEADERS_MRCONSO, HEADERS_MRSTY, LANGUAGES

try:
from unidecode import unidecode
@@ -29,12 +29,12 @@ def get_semantic_types(path, headers):
return sem_types


def get_mrconso_iterator(path, headers):
def get_mrconso_iterator(path, headers, lang='ENG'):
with codecs.open(path, encoding='utf-8') as f:
for i, ln in enumerate(f):
content = dict(zip(headers, ln.strip().split('|')))

if content['lat'] != 'ENG':
if content['lat'] != lang:
continue

yield content
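
The language filter in this hunk can be tried in isolation. The sketch below reproduces the same pattern on in-memory lines; `iter_rows` is a hypothetical name, and the three-column header and the rows (including the concept identifier) are made up for illustration, since real `MRCONSO.RRF` rows carry many more fields.

```python
def iter_rows(lines, headers, lang='ENG'):
    """Yield one dict per pipe-delimited line whose 'lat' field equals lang."""
    for ln in lines:
        content = dict(zip(headers, ln.strip().split('|')))
        if content['lat'] != lang:
            # Skip rows in any language other than the requested one.
            continue
        yield content


# Hypothetical header and toy rows, not real UMLS data.
toy_headers = ['cui', 'lat', 'str']
toy_lines = [
    'C0000001|ENG|example term\n',
    'C0000001|FRE|terme exemple\n',
]

for row in iter_rows(toy_lines, toy_headers, lang='FRE'):
    print(row['str'])
```

Keeping `'ENG'` as the default value preserves the previous behavior for callers that do not pass a language.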
@@ -52,19 +52,23 @@ def extract_from_mrconso(

start = time.time()

mrconso_iterator = get_mrconso_iterator(mrconso_path, mrconso_header)
mrconso_iterator = get_mrconso_iterator(
mrconso_path, mrconso_header, opts.language
)

total = countlines(mrconso_path)

processed = set()
i = 0

for i, content in enumerate(mrconso_iterator, start=1):
for content in mrconso_iterator:
i += 1

if i % 100000 == 0:
delta = time.time() - start
status = (
'{:,} in {:.2f} s ({:.2%}, {:.1e} s / term)'
''.format(i, delta, i / total, delta / i)
''.format(i, delta, i / total, delta / i if i > 0 else 0)
)
print(status)

@@ -85,6 +89,13 @@ def extract_from_mrconso(

yield (concept_text, cui, sem_types[cui], preferred)

delta = time.time() - start
status = (
'\nCOMPLETED: {:,} in {:.2f} s ({:.1e} s / term)'
''.format(i, delta, delta / i if i > 0 else 0)
)
print(status)


def parse_and_encode_ngrams(extracted_it, simstring_dir, cuisty_dir):
# Create destination directories for the two databases
@@ -154,8 +165,6 @@ def driver(opts):

parse_and_encode_ngrams(mrconso_iterator, simstring_dir, cuisty_dir)

print('Completed!')


if __name__ == '__main__':
ap = argparse.ArgumentParser()
@@ -176,6 +185,10 @@ def driver(opts):
'-U', '--normalize-unicode', action='store_true',
help='Normalize unicode strings to their closest ASCII representation'
)
ap.add_argument(
'-E', '--language', default='ENG', choices=LANGUAGES,
help='Extract concepts of the specified language'
)
opts = ap.parse_args()

driver(opts)
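
The new `-E` / `--language` option relies on `argparse`'s `choices` to reject unsupported language codes before any file is read. A minimal, self-contained sketch of that behavior follows; `build_parser` is a hypothetical helper, the `LANGUAGES` set here is an abbreviated stand-in for the one in `constants.py`, and `sorted()` is added only so the help text lists the codes in a stable order.

```python
import argparse

# Abbreviated stand-in for constants.LANGUAGES (assumption for this sketch).
LANGUAGES = {'ENG', 'FRE', 'GER', 'SPA'}


def build_parser():
    """Build a parser that exposes only the language option."""
    ap = argparse.ArgumentParser()
    ap.add_argument(
        '-E', '--language', default='ENG', choices=sorted(LANGUAGES),
        help='Extract concepts of the specified language'
    )
    return ap


opts = build_parser().parse_args(['-E', 'FRE'])
print(opts.language)
```

Passing a code outside `choices` makes `argparse` print a usage error and exit, so `get_mrconso_iterator` never sees an invalid language.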