module__org.bibliome.alvisnlp.modules.TextFileReader

Robert Bossy edited this page Jul 27, 2017 · 1 revision

# org.bibliome.alvisnlp.modules.TextFileReader

Synopsis

Reads files and adds a document in the corpus for each file.

Description

org.bibliome.alvisnlp.modules.TextFileReader reads the file(s) at sourcePath and creates one document in the corpus for each file. The identifier of each created document is the absolute path of the corresponding file. Each created document has a single section named section, whose contents are the contents of the corresponding file.

If sourcePath is a path to a file, then org.bibliome.alvisnlp.modules.TextFileReader will read this file. If sourcePath is a path to a directory, then org.bibliome.alvisnlp.modules.TextFileReader will read the files in this directory. If recursive is set to true, then the files in sub-directories will also be read, recursively. org.bibliome.alvisnlp.modules.TextFileReader only reads files whose names match acceptPattern. If acceptPattern is not set, then org.bibliome.alvisnlp.modules.TextFileReader reads all files.
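The file selection rules above can be illustrated with a minimal Python sketch. This is not the AlvisNLP implementation: the function name `select_files` and its arguments are hypothetical, and the sketch assumes that acceptPattern is a regular expression that must match the whole file name.

```python
import re
from pathlib import Path


def select_files(source_path, recursive=False, accept_pattern=None):
    """Return the sorted absolute paths of the files that would be read."""
    root = Path(source_path)
    if root.is_file():
        # A path to a file selects just that file.
        candidates = [root]
    elif recursive:
        # A directory plus all of its sub-directories.
        candidates = root.rglob("*")
    else:
        # Only the files directly inside the directory.
        candidates = root.iterdir()
    files = [p for p in candidates if p.is_file()]
    if accept_pattern is not None:
        # Assumption: the pattern must match the whole file name.
        regex = re.compile(accept_pattern)
        files = [p for p in files if regex.fullmatch(p.name)]
    return sorted(p.resolve() for p in files)
```

For example, `select_files("corpus", recursive=True, accept_pattern=r".*\.txt")` would return every `.txt` file under a hypothetical `corpus` directory, including those in sub-directories.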

If linesLimit is set, then org.bibliome.alvisnlp.modules.TextFileReader creates a new document for each block of at most linesLimit lines. For instance, if linesLimit is set to 10 and a file contains 25 lines, then 3 documents are created: two containing 10 lines each and one containing the last 5 lines.
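The arithmetic of the linesLimit example can be checked with a short Python sketch. The helper `split_by_lines` is hypothetical and only mirrors the behaviour described above; it is not the module's own code.

```python
def split_by_lines(lines, lines_limit=None):
    """Group lines into consecutive blocks of at most lines_limit lines."""
    lines = list(lines)
    if lines_limit is None:
        # No limit: the whole file becomes a single block.
        return [lines]
    if lines_limit < 1:
        raise ValueError("lines_limit must be a positive integer")
    # Step through the file lines_limit lines at a time.
    return [lines[i:i + lines_limit] for i in range(0, len(lines), lines_limit)]


blocks = split_by_lines([f"line {n}" for n in range(1, 26)], lines_limit=10)
print([len(block) for block in blocks])  # prints [10, 10, 5]
```

Each block would correspond to one created document, so a 25-line file with a limit of 10 yields three documents.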

All files are read using the same character encoding, specified by charset.

All created documents will have the features defined in constantDocumentFeatures. The single section of each document will have the features defined in constantSectionFeatures.
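As an illustration of the resulting structure, the following Python sketch builds a plain dictionary for one file. It is a sketch under assumptions, not the AlvisNLP data model: the function `make_document`, its argument names (including `section_name` and `base_name_id`) and the dictionary layout are all hypothetical. The defaults mirror the default values listed under Parameters (UTF-8, contents, false).

```python
from pathlib import Path


def make_document(path, charset="UTF-8", section_name="contents",
                  base_name_id=False, document_features=None,
                  section_features=None):
    """Build a dictionary describing the document created for one file."""
    path = Path(path).resolve()
    text = path.read_text(encoding=charset)
    return {
        # Absolute path by default, or only the file name if requested.
        "id": path.name if base_name_id else str(path),
        # Constant features copied onto every document.
        "features": dict(document_features or {}),
        # A single section holding the whole contents of the file.
        "sections": [{
            "name": section_name,
            "contents": text,
            "features": dict(section_features or {}),
        }],
    }
```

Calling `make_document("notes.txt", document_features={"source": "demo"})` on a hypothetical file `notes.txt` would give a document whose identifier is the file's absolute path and whose only section is named `contents`.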

Parameters

sourcePath

Optional

Type: SourceStream

Path to the source directory or source file.

constantDocumentFeatures

Optional

Type: Mapping

Constant features to add to each document created by this module.

constantSectionFeatures

Optional

Type: Mapping

Constant features to add to each section created by this module.

linesLimit

Optional

Type: Integer

Maximum number of lines per document. No limit if not set.

Optional

Type: Integer

Maximum number of characters per document. No limit if not set.

Default value: false

Type: Boolean

Use the base name of the file instead of its full path as the document identifier.

charset

Default value: UTF-8

Type: String

Character set of the input files.

Default value: contents

Type: String

Name of the single section containing the whole contents of a file.