Folder structure and file naming conventions

Isaac Schifferer edited this page Nov 17, 2023 · 2 revisions

Folder Structure

silnlp uses the SIL_NLP_DATA_PATH environment variable to specify the path to a root folder (e.g., SIL_NLP_DATA_PATH="C:/silnlp"). All of the reference data files and experiment files (configuration, models, predictions, etc.) expected by the NMT scripts are located under this root folder.
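
As a minimal sketch of how a script might resolve this root folder, the snippet below reads SIL_NLP_DATA_PATH from the environment. The helper name get_data_root is hypothetical and is not part of silnlp, and raising an error when the variable is unset is an assumption made for illustration.

```python
import os
from pathlib import Path


def get_data_root() -> Path:
    """Return the root folder named by SIL_NLP_DATA_PATH.

    Hypothetical helper for illustration; not silnlp's own API.
    """
    value = os.environ.get("SIL_NLP_DATA_PATH")
    if not value:
        # Assumption: fail loudly rather than guess a default location.
        raise RuntimeError("SIL_NLP_DATA_PATH is not set")
    return Path(value)
```

With the variable set to "C:/silnlp", get_data_root() / "MT" / "scripture" would point at the Scripture training data subfolder described below.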

The subfolder structure that silnlp requires under this root folder is described in the table below.

Folder Description
  • Alignment
Data and experiments subfolder supporting Alignment experiments.
    • experiments
Experiments subfolder with multiple subfolders, one per experiment.
      • <experiment>
Subfolder for a single experiment.
  • MT
Data and experiments subfolder supporting Machine Translation experiments.
    • corpora
Non-Scripture training data files (WMT '20, NewsTest, MultiCCAligned, etc.).

Refer to the next section for information on the naming conventions for the files in this subfolder.
    • experiments
Experiments subfolder with multiple subfolders, one per experiment.
      • <experiment>
Subfolder for a single experiment.
    • scripture
Scripture training data files.
When the extract_corpora script is run on a Paratext project, the extracted Scripture content is written to a file in this subfolder.

Refer to the next section for information on the naming conventions for the files in this subfolder.
      • vref.txt
Canonical list of verse references (e.g., "GEN 1:1"), in order, for all Scripture training data files extracted from the Paratext project. The order in which the verse references appear in this file is the same order in which the verse text appears in all Scripture training data files.
This file can be generated by running the extract_corpora script on the Ref project (see below).
    • terms
Key Biblical Terms (KBT) data files.
When the extract_corpora script is run on a Paratext project with populated KBTs, the extracted KBTs are written to file(s) in this subfolder.

Refer to the next section for information on the naming conventions for the files in this subfolder.
  • Paratext
Subfolder with Paratext projects and related Paratext supporting data.
    • projects
Subfolder with one or more Paratext projects.
      • <project>
Subfolder with the files from an unzipped Paratext project.
      • Ref
Subfolder containing a Reference Paratext project with versification that all other Paratext projects are aligned to when they are extracted.
    • terms
Reference files for processing Paratext KBTs.
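
The layout above can be sketched as a list of relative paths and checked before running experiments. This is an illustration only: the names EXPECTED_SUBFOLDERS, missing_subfolders, and pair_verses are hypothetical and not part of silnlp. The pair_verses helper additionally assumes that vref.txt and an extracted Scripture file each hold one entry per line, which is how the shared ordering described for vref.txt is interpreted here.

```python
from pathlib import Path
from typing import Iterator, List, Tuple

# Relative paths taken from the folder list above. The <experiment> and
# <project> subfolders are omitted because their names vary.
EXPECTED_SUBFOLDERS = [
    "Alignment/experiments",
    "MT/corpora",
    "MT/experiments",
    "MT/scripture",
    "MT/terms",
    "Paratext/projects",
    "Paratext/projects/Ref",
    "Paratext/terms",
]


def missing_subfolders(root: Path) -> List[str]:
    """List the expected subfolders that do not exist under root."""
    return [rel for rel in EXPECTED_SUBFOLDERS if not (root / rel).is_dir()]


def pair_verses(vref_file: Path, text_file: Path) -> Iterator[Tuple[str, str]]:
    """Yield (verse reference, verse text) pairs, matched by line position.

    Assumption: both files hold one entry per line, in the same order.
    """
    with open(vref_file, encoding="utf-8") as refs, \
            open(text_file, encoding="utf-8") as verses:
        for ref, verse in zip(refs, verses):
            yield ref.strip(), verse.strip()
```

Calling missing_subfolders on the root folder returns an empty list when every subfolder in the list is present; any returned entries are the relative paths that still need to be created.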

File naming conventions

To be provided ...