
module__TreeTagger

Robert Bossy edited this page Jul 27, 2017 · 1 revision

# org.bibliome.alvisnlp.modules.treetagger.TreeTagger

Synopsis

Runs tree-tagger.

Description

org.bibliome.alvisnlp.modules.treetagger.TreeTagger applies tree-tagger to the annotations in wordLayerName by generating an appropriate input file. This file contains one line per annotation. The first column, the token surface form, is the value of the formFeature feature. The second column, the token's predefined POS tag, is the value of the posFeature feature. The third column, the token's predefined lemma, is the value of the lemmaFeature feature. If posFeature or lemmaFeature is not defined, the corresponding column is left blank.
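The three-column layout described above can be sketched as follows. This is only an illustration, not the module's actual code: the function name, the representation of annotations as dictionaries, and the use of a tab as column separator are all assumptions.

```python
from typing import Dict, List, Optional


def build_input_lines(
    annotations: List[Dict[str, str]],
    form_feature: str = "form",
    pos_feature: Optional[str] = None,
    lemma_feature: Optional[str] = None,
) -> List[str]:
    """Return one line per annotation: form, POS tag, lemma (tab-separated, an assumption)."""
    lines = []
    for features in annotations:
        form = features.get(form_feature, "")
        # A feature that is not configured, or that the annotation lacks,
        # leaves the corresponding column blank.
        pos = features.get(pos_feature, "") if pos_feature else ""
        lemma = features.get(lemma_feature, "") if lemma_feature else ""
        lines.append("\t".join((form, pos, lemma)))
    return lines


# Hypothetical annotations: the first has a predefined POS tag, the second has none.
words = [{"form": "cats", "pos": "NNS"}, {"form": "sleep"}]
for line in build_input_lines(words, pos_feature="pos"):
    print(repr(line))
```

Every line always has three columns, so a missing predefined tag or lemma shows up as an empty field rather than a shorter line.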

The tree-tagger binary is specified by treeTaggerExecutable, and the language model to use is specified by parFile. Additionally, a lexicon file can be given through lexiconFile.
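As a rough sketch of how these two parameters end up on a command line: tree-tagger takes its options first, then the parameter (language model) file, then the input and output files. The `-token` and `-lemma` options are standard tree-tagger flags, but which flags this module actually passes is not stated here, so the selection below is an assumption, as are the function name and the paths.

```python
from typing import List


def build_command(
    executable: str, par_file: str, input_path: str, output_path: str
) -> List[str]:
    """Assemble a tree-tagger command line (flag selection is an assumption)."""
    # -token and -lemma ask tree-tagger to print the token and its lemma
    # alongside each predicted tag.
    return [executable, "-token", "-lemma", par_file, input_path, output_path]


# Hypothetical paths, for illustration only.
command = build_command(
    "/opt/treetagger/bin/tree-tagger",
    "/opt/treetagger/lib/english.par",
    "corpus.in",
    "corpus.out",
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` to launch the tagger.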

If sentenceLayerName is defined, then org.bibliome.alvisnlp.modules.treetagger.TreeTagger treats the annotations in this layer as sentences. Sentence boundaries are reinforced by giving tree-tagger an additional end-of-sentence marker.
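One way to picture the sentence reinforcement is to emit a marker line after the last word of each sentence. The sketch below assumes that words and sentences are given as character spans and that words are already sorted by position. The marker string is a placeholder, since the actual marker is not specified here.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def with_sentence_markers(
    words: List[Tuple[Span, str]],
    sentences: List[Span],
    marker: str,
) -> List[str]:
    """Return word forms in order, adding `marker` after each sentence's last word."""
    output = []
    for start, end in sorted(sentences):
        # Keep the words fully contained in the sentence span.
        # Simplification: words outside every sentence are dropped.
        inside = [
            form
            for (w_start, w_end), form in words
            if start <= w_start and w_end <= end
        ]
        if inside:
            output.extend(inside)
            output.append(marker)
    return output


# Hypothetical data: two sentences, "Cats sleep." and "Dogs bark".
words = [((0, 4), "Cats"), ((5, 10), "sleep"), ((10, 11), "."),
         ((12, 16), "Dogs"), ((17, 21), "bark")]
sentences = [(0, 11), (12, 21)]
print(with_sentence_markers(words, sentences, marker="<EOS>"))
```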

Once tree-tagger has processed the corpus, org.bibliome.alvisnlp.modules.treetagger.TreeTagger adds the predicted POS tag and lemma to the respective posFeature and lemmaFeature features of the corresponding annotations.
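The write-back step might look like the sketch below. It assumes tree-tagger was run so that each output line holds the token, the POS tag and the lemma separated by tabs, and that any sentence-marker lines have already been filtered out. The `no_unknown_lemma` flag mirrors the parameter that replaces unknown lemmas with the surface form, relying on tree-tagger printing `<unknown>` for tokens missing from its lexicon. The function and variable names are hypothetical.

```python
from typing import Dict, List

UNKNOWN_LEMMA = "<unknown>"  # lemma printed by tree-tagger for out-of-lexicon tokens


def apply_predictions(
    annotations: List[Dict[str, str]],
    output_lines: List[str],
    form_feature: str = "form",
    pos_feature: str = "pos",
    lemma_feature: str = "lemma",
    no_unknown_lemma: bool = False,
) -> None:
    """Store the predicted POS tag and lemma on each annotation, in place."""
    if len(annotations) != len(output_lines):
        raise ValueError("expected exactly one output line per annotation")
    for features, line in zip(annotations, output_lines):
        _token, pos, lemma = line.rstrip("\n").split("\t")
        if no_unknown_lemma and lemma == UNKNOWN_LEMMA:
            # Fall back to the surface form when the lemma is unknown.
            lemma = features[form_feature]
        features[pos_feature] = pos
        features[lemma_feature] = lemma


# Hypothetical annotations and tagger output.
tokens = [{"form": "cats"}, {"form": "zzyx"}]
apply_predictions(
    tokens,
    ["cats\tNNS\tcat", "zzyx\tNN\t<unknown>"],
    no_unknown_lemma=True,
)
print(tokens)
```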

If recordDir and recordFeatures are both defined, then tree-tagger predictions are written to the recordDir directory, one file per section. recordFeatures is an array of the feature names to record. An additional feature, n, is recognized as the ordinal of the annotation in the section.
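A minimal sketch of how one record file's content could be produced from recordFeatures, including the special n feature. The tab separator and the zero-based ordinal are assumptions, as is the function name. The real file layout may differ.

```python
from typing import Dict, List


def record_lines(
    annotations: List[Dict[str, str]], record_features: List[str]
) -> List[str]:
    """Return one line per annotation with the requested features as columns."""
    lines = []
    for ordinal, features in enumerate(annotations):
        # "n" is not a stored feature: it stands for the annotation's
        # ordinal in the section (zero-based here, an assumption).
        values = [
            str(ordinal) if name == "n" else features.get(name, "")
            for name in record_features
        ]
        lines.append("\t".join(values))
    return lines


# Hypothetical section content after tagging.
section = [
    {"form": "cats", "pos": "NNS", "lemma": "cat"},
    {"form": "sleep", "pos": "VBP", "lemma": "sleep"},
]
for line in record_lines(section, ["n", "form", "pos", "lemma"]):
    print(line)
```

Writing the lines of each section to its own file under recordDir, using the configured result-file encoding, would complete the step. How the files are named is not specified here.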

Parameters

parFile

Optional

Type: InputFile

Path to the language model file.

treeTaggerExecutable

Optional

Type: ExecutableFile

Path to the tree-tagger executable file.

constantAnnotationFeatures

Optional

Type: Mapping

Constant features to add to each annotation created by this module.

lexiconFile

Optional

Type: SourceStream

Path to a tree-tagger lexicon file. If set, the lexicon is applied to the corpus before tree-tagger processes it.

recordDir

Optional

Type: OutputDirectory

Path to the directory where to write tree-tagger result files (one file per section).

recordFeatures

Optional

Type: String[]

List of features to write to the result files.

documentFilter

Default value: true

Type: Expression

Only process documents that satisfy this filter.

formFeature

Default value: form

Type: String

Name of the feature denoting the token surface form.

inputCharset

Default value: ISO-8859-1

Type: String

Tree-tagger input corpus character set.

lemmaFeature

Default value: lemma

Type: String

Name of the feature to set with the lemma.

noUnknownLemma

Default value: false

Type: Boolean

Whether to replace unknown lemmas with the surface form.

outputCharset

Default value: ISO-8859-1

Type: String

Tree-tagger output character set.

posFeature

Default value: pos

Type: String

Name of the feature to set with the POS tag.

recordCharset

Default value: UTF-8

Type: String

Character encoding of the result files.

sectionFilter

Default value: true

Type: Expression

Process only sections that satisfy this filter.

sentenceLayerName

Default value: sentences

Type: String

Name of the layer containing the sentence annotations. Sentence boundaries are reinforced when tagging.

wordLayerName

Default value: words

Type: String

Name of the layer containing the word annotations.
