Datatypes
cl-conllu has a few central classes: sentence, token, and mtoken. They are all defined in the data.lisp file. When a CoNLL-U file is read, its contents are turned into instances of these classes.
cl-conllu turns every CoNLL-U sentence into an instance of the sentence class. Each instance has four properties: start, meta, tokens, and mtokens. The start field contains the line number in the file where the sentence block starts.
The meta field holds the metadata of the sentence. Its contents may vary, as discussed in the previous section, but they usually include the full (raw) sentence text and the sentence ID, as required by the CoNLL-U format specification.
CL-USER> (cl-conllu:sentence-meta (first *sents*))
(("text" . "PT no governo")
 ("source" . "CETENFolha n=1 cad=Opinião sec=opi sem=94a")
 ("sent_id" . "CF1-1")
 ("id" . "1"))
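The value printed above is an association list with string keys, so individual entries can be looked up with the standard assoc function, provided a string comparison is used as the test. The following is a minimal sketch on a literal copy of that list; meta-value is a hypothetical helper introduced here for illustration and is not part of cl-conllu.

```lisp
;; Literal copy of the metadata shown above; in practice the list would
;; come from (cl-conllu:sentence-meta sentence).
(defparameter *meta*
  '(("text" . "PT no governo")
    ("source" . "CETENFolha n=1 cad=Opinião sec=opi sem=94a")
    ("sent_id" . "CF1-1")
    ("id" . "1")))

;; Hypothetical helper (not part of cl-conllu): return the value stored
;; under KEY, or NIL when the key is absent.
(defun meta-value (key meta)
  ;; Keys are strings, so the default EQL test would not match them.
  (cdr (assoc key meta :test #'string=)))

(meta-value "sent_id" *meta*) ; => "CF1-1"
(meta-value "text" *meta*)    ; => "PT no governo"
```

Returning NIL for a missing key keeps the helper safe to use on sentences whose metadata lacks optional entries such as "source".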
The tokens field is the list of tokens that together form the sentence. Each of them is itself an instance of the token class.
The mtokens (meta-tokens) field holds instances of a separate mtoken class, which represents multiword tokens (e.g. vámonos = vamos + nos).
Instances of the token class have one property for each field (column) of a token line in the CoNLL-U format, that is:
- ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
- FORM: Word form or punctuation symbol.
- LEMMA: Lemma of word form.
- UPOSTAG: Universal part-of-speech tag.
- XPOSTAG: Language-specific part-of-speech tag; underscore if not available.
- FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
- HEAD: Head of the current word, which is either a value of ID or zero (0) if the token is the root.
- DEPREL: Universal dependency relation to the HEAD (root iff HEAD is 0) or a defined language-specific subtype of one.
- DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
- MISC: Any other annotation.
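To make the correspondence between columns and fields concrete, the sketch below splits one tab-separated token line into its ten values and pairs each value with a field name. It uses only standard Common Lisp and is not how cl-conllu itself reads files. The helpers split-on-tab and line->plist are hypothetical, and the annotation values in the sample line are made up for illustration.

```lisp
;; Column names in the order listed above.
(defparameter *field-names*
  '(:id :form :lemma :upostag :xpostag :feats :head :deprel :deps :misc))

;; Hypothetical helper: split LINE at every tab character.
(defun split-on-tab (line)
  (loop with start = 0
        for pos = (position #\Tab line :start start)
        collect (subseq line start pos)
        while pos
        do (setf start (1+ pos))))

;; Hypothetical helper: pair each column name with its raw string value,
;; returning a property list.
(defun line->plist (line)
  (loop for name in *field-names*
        for value in (split-on-tab line)
        collect name
        collect value))

;; Illustrative token line with made-up annotation values, built by
;; joining the ten columns with tab characters.
(defparameter *line*
  (format nil (concatenate 'string "~{~A~^" (string #\Tab) "~}")
          '("1" "PT" "PT" "PROPN" "_" "_" "0" "root" "_" "_")))

(getf (line->plist *line*) :form)   ; => "PT"
(getf (line->plist *line*) :deprel) ; => "root"
(getf (line->plist *line*) :head)   ; => "0" (still an unparsed string)
```

All values stay as raw strings here; turning ID and HEAD into numbers or FEATS into a list is left out on purpose.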
In addition, every token object that belongs to the tokens list of a sentence object has a sentence property, which refers back to that sentence object.
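The back-reference from a token to its sentence can be sketched with two simplified stand-in classes. These are assumptions made for illustration only: demo-sentence, demo-token, and attach-tokens are hypothetical names, not the definitions found in data.lisp, and they carry far fewer slots than the real classes.

```lisp
;; Simplified stand-ins for illustration only; they are NOT the classes
;; defined by cl-conllu.
(defclass demo-sentence ()
  ((meta   :initarg :meta   :initform nil :accessor demo-sentence-meta)
   (tokens :initarg :tokens :initform nil :accessor demo-sentence-tokens)))

(defclass demo-token ()
  ((form     :initarg :form     :accessor demo-token-form)
   (sentence :initarg :sentence :initform nil :accessor demo-token-sentence)))

;; Hypothetical helper: store TOKENS in SENTENCE and point each token
;; back at it, which gives the two-way link described above.
;; Returns SENTENCE.
(defun attach-tokens (sentence tokens)
  (setf (demo-sentence-tokens sentence) tokens)
  (dolist (token tokens sentence)
    (setf (demo-token-sentence token) sentence)))

(defparameter *sentence*
  (attach-tokens
   (make-instance 'demo-sentence :meta '(("sent_id" . "CF1-1")))
   (list (make-instance 'demo-token :form "PT")
         (make-instance 'demo-token :form "no")
         (make-instance 'demo-token :form "governo"))))

;; Starting from any token, the owning sentence is one accessor call away.
(eq (demo-token-sentence (second (demo-sentence-tokens *sentence*)))
    *sentence*) ; => T
```

A link of this kind lets code that receives a single token reach the metadata or the sibling tokens of its sentence without passing the sentence around separately.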