
Datatypes

Guilherme Passos edited this page May 3, 2018 · 3 revisions

cl-conllu has three central classes: sentence, token, and mtoken. All are defined in the data.lisp file. When a CoNLL-U file is read, its contents are turned into instances of these classes.

Sentences

Every CoNLL-U sentence is turned into an instance of the sentence class by cl-conllu. Each instance has four properties: start, meta, tokens, and mtokens. The start field contains the line number in the file at which the sentence block starts.

The meta field includes the metainformation regarding the sentence. This information may vary, as we have discussed in the previous section, but usually includes the full (raw) sentence and the sentence ID, as required by the CoNLL-U format specification.

CL-USER> (cl-conllu:sentence-meta (first *sents*))
(("text" . "PT no governo")
 ("source" . "CETENFolha n=1 cad=Opinião sec=opi sem=94a")
 ("sent_id" . "CF1-1") ("id" . "1"))
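
Judging by the output above, the meta field is an association list with string keys, so individual entries can be looked up with plain Common Lisp. The following is a minimal sketch: meta-value is a hypothetical helper, not part of cl-conllu, and *meta* simply repeats the alist printed above.

```lisp
;; The alist below is copied from the REPL output above.
(defparameter *meta*
  '(("text" . "PT no governo")
    ("source" . "CETENFolha n=1 cad=Opinião sec=opi sem=94a")
    ("sent_id" . "CF1-1")
    ("id" . "1")))

;; META-VALUE is a hypothetical helper, not a cl-conllu function.
(defun meta-value (key meta)
  "Return the value stored under the string KEY in the alist META, or NIL."
  ;; Keys are strings, so ASSOC needs a string comparison as its test.
  (cdr (assoc key meta :test #'string=)))

(meta-value "sent_id" *meta*) ; => "CF1-1"
(meta-value "text" *meta*)    ; => "PT no governo"
(meta-value "newdoc" *meta*)  ; => NIL
```

With a sentence read by cl-conllu, the second argument would be the result of cl-conllu:sentence-meta instead of *meta*.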

The tokens field holds the list of tokens that together form the sentence; each of them is an instance of the token class.

The mtokens (meta-tokens) are also instances of their own mtoken class, and they are used for multiword tokens (e.g. vámonos = vamos + nos).
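
In a CoNLL-U file, a multiword token is written as a line whose ID is a range (for example 1-2 for vámonos), followed by the lines for the words it covers (1 vamos, 2 nos). The sketch below illustrates that relationship with stand-in structures. The names demo-word and demo-mtoken and their start, end, and form slots are assumptions made for this example; the actual mtoken class is defined in data.lisp and may differ.

```lisp
;; Stand-in structures for illustration only; not the cl-conllu classes.
(defstruct demo-word id form)
(defstruct demo-mtoken start end form)

(defun mtoken-words (mtoken words)
  "Return the elements of WORDS whose IDs fall inside MTOKEN's range."
  (remove-if-not (lambda (word)
                   (<= (demo-mtoken-start mtoken)
                       (demo-word-id word)
                       (demo-mtoken-end mtoken)))
                 words))

(defparameter *words*
  (list (make-demo-word :id 1 :form "vamos")
        (make-demo-word :id 2 :form "nos")))

;; Corresponds to a CoNLL-U line with ID "1-2" and FORM "vámonos".
(defparameter *mtoken*
  (make-demo-mtoken :start 1 :end 2 :form "vámonos"))

(mapcar #'demo-word-form (mtoken-words *mtoken* *words*))
;; => ("vamos" "nos")
```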

Tokens

Instances of the token class have one property for each field (column) of a token line in the CoNLL-U format, that is:

ID
Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
FORM
Word form or punctuation symbol.
LEMMA
Lemma of word form.
UPOSTAG
Universal part-of-speech tag.
XPOSTAG
Language-specific part-of-speech tag; underscore if not available.
FEATS
List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
HEAD
Head of the current word, which is either a value of ID or zero (0) if the token is the root.
DEPREL
Universal dependency relation to the HEAD (root iff HEAD is 0) or a defined language-specific subtype of one.
DEPS
Enhanced dependency graph in the form of a list of head-deprel pairs.
MISC
Any other annotation.
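
To make the ten columns concrete, the sketch below splits one tab-separated token line into a property list keyed by the field names listed above. It is not how cl-conllu parses files (the library may, for instance, convert some fields to other types); split-on-tab, line->plist, and join-with-tabs are hypothetical helpers, and the annotation values in the sample line are invented for illustration.

```lisp
(defparameter *field-names*
  '(:id :form :lemma :upostag :xpostag :feats :head :deprel :deps :misc))

(defun split-on-tab (line)
  "Split LINE on tab characters and return the list of substrings."
  (loop with start = 0
        for pos = (position #\Tab line :start start)
        collect (subseq line start pos)
        while pos
        do (setf start (1+ pos))))

(defun line->plist (line)
  "Pair the ten CoNLL-U field names with the columns of LINE."
  (let ((fields (split-on-tab line)))
    (unless (= (length fields) (length *field-names*))
      (error "Expected ~D fields, got ~D."
             (length *field-names*) (length fields)))
    (loop for name in *field-names*
          for field in fields
          append (list name field))))

(defun join-with-tabs (strings)
  "Build a tab-separated line; used here only to construct the sample."
  (reduce (lambda (left right)
            (concatenate 'string left (string #\Tab) right))
          strings))

;; Invented sample line: every column is kept as a string.
(defparameter *line*
  (join-with-tabs '("1" "PT" "PT" "PROPN" "_" "_" "0" "root" "_" "_")))

(getf (line->plist *line*) :form)   ; => "PT"
(getf (line->plist *line*) :deprel) ; => "root"
```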

In addition, every token object stored in the tokens of a sentence object has a sentence property that refers back to that sentence object.
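
The back-reference means that any token gives access to its whole sentence. The sketch below shows the idea with stand-in classes. The toy-sentence and toy-token classes and the attach-tokens function are assumptions made for this example and are not the cl-conllu definitions.

```lisp
;; Stand-in classes for illustration only; not the cl-conllu classes.
(defclass toy-sentence ()
  ((tokens :initarg :tokens :initform nil :accessor toy-sentence-tokens)))

(defclass toy-token ()
  ((form :initarg :form :accessor toy-token-form)
   (sentence :initarg :sentence :initform nil :accessor toy-token-sentence)))

(defun attach-tokens (sentence tokens)
  "Store TOKENS in SENTENCE and point each token back at SENTENCE."
  (dolist (token tokens)
    (setf (toy-token-sentence token) sentence))
  (setf (toy-sentence-tokens sentence) tokens)
  sentence)

(defparameter *sentence*
  (attach-tokens (make-instance 'toy-sentence)
                 (list (make-instance 'toy-token :form "PT")
                       (make-instance 'toy-token :form "no")
                       (make-instance 'toy-token :form "governo"))))

;; Starting from a single token, walk back to the sentence and list
;; the forms of all of its tokens.
(let ((token (second (toy-sentence-tokens *sentence*))))
  (mapcar #'toy-token-form
          (toy-sentence-tokens (toy-token-sentence token))))
;; => ("PT" "no" "governo")
```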
