Data Migration

Taylor Snead edited this page Sep 2, 2022 · 4 revisions

The code for data migration lives in the migration/ directory.

At the moment, no data originates in our MongoDB database. All data is imported from many Google Sheets using the Sheets REST API. The migration process for annotated texts consists of the following steps.

  1. Retrieve the contents of the Annotated Texts Index to eventually process each row into an AnnotatedDoc.
  2. For each row of the index, retrieve the contents of the "Annotation Sheet" column cell. The annotation sheet must contain one or more pages named exactly Page 1, Page 2, etc., plus two other pages named Metadata and References.
  3. Ingest the "Metadata" page into a DocumentMetadata structure. Correlate each "Source Document Image" cell with the Page X sheet of the same index.
  4. Fill in the segments field of the AnnotatedDoc with the Page X sheet contents concatenated together, by converting each set of annotation rows into an AnnotatedForm.
  5. Push the assembled AnnotatedDoc to the database by running the updateDocument mutation on the GraphQL server.
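
The sheet-naming rule in step 2 and the concatenation in step 4 can be sketched as follows. This is a minimal illustration, not the code in migration/: the function names are hypothetical, it works on sheet titles and rows already held in memory rather than calling the Sheets API, and it assumes page sheets must be numbered contiguously from 1, which the steps above imply but do not state outright.

```python
import re

# Exact names only: "Page 1", "Page 2", ... (no leading zeros, no extra spaces).
PAGE_NAME = re.compile(r"Page ([1-9][0-9]*)")
REQUIRED_SHEETS = {"Metadata", "References"}


def ordered_page_sheets(sheet_titles):
    """Return the 'Page N' sheet titles in page order, or raise ValueError."""
    titles = set(sheet_titles)
    missing = REQUIRED_SHEETS - titles
    if missing:
        raise ValueError(f"missing required sheets: {sorted(missing)}")
    numbers = sorted(
        int(m.group(1)) for t in titles if (m := PAGE_NAME.fullmatch(t))
    )
    if not numbers:
        raise ValueError("expected at least one sheet named 'Page 1'")
    # Assumption: pages are numbered 1..N with no gaps.
    if numbers != list(range(1, len(numbers) + 1)):
        raise ValueError(f"page sheets are not numbered 1..N: {numbers}")
    return [f"Page {n}" for n in numbers]


def concatenate_pages(rows_by_title, ordered_titles):
    """Join the rows of every page sheet, in page order, into one list."""
    rows = []
    for title in ordered_titles:
        rows.extend(rows_by_title[title])
    return rows
```

Validating the sheet names before ingesting anything means a misnamed sheet (for example "page 1" or "Page 01") fails with a clear error instead of silently dropping a page from the document.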

Metadata

The document metadata must be listed in a sub-sheet titled exactly Metadata. The order of fields should always match our spreadsheet template. So far, that ordering is as follows:

  • Document ID: DAILP-defined unique identifier for this document.
  • Genre
  • Source Text
  • Title
  • Page # in Source Text: the number of the page this document starts with, within a larger source (like the Willie Jumper stories). If uncertain, use 1.
  • Page Count: total number of pages contained within this document.
  • Translation Document ID: Identifier of a Google Doc containing the translation, pulled from the share URL of that document.
  • Image Source: a shorthand identifier for a source repository. Current possible values: beinecke, dailp
  • Image OIDs: the IIIF OID for each page within the source repository, where each column corresponds to a page.
  • Date: date-time in ISO format like YYYY-MM-DD hh:mm:ss that indicates when this document was created.
  • Contributor Names: Name of each contributor to this document, whether as creator/author, translator, annotator, collector, etc. Each name goes in its own column.
  • Contributor Role: Short description of the role that the person named above played in the document, e.g. Annotator. Each role should match the person name in the same column, one row above.
  • Source Attribution: Where this manuscript was sourced from, e.g. Beinecke Rare Book & Manuscript Library
  • Source Attribution Link: A link to the organization or specific record this document was sourced from. Each entry should correspond with the source just above it.
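
The column-wise fields above (Image OIDs, Contributor Names, Contributor Role) can be read with a sketch like the one below. It is illustrative only and does not reproduce the actual DocumentMetadata code: the helper names and the Contributor class are hypothetical, and it assumes each Metadata row arrives as a list of strings whose first cell is the field label and whose remaining cells are the values.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Contributor:
    name: str
    role: str


def values(rows, label):
    """Return the cells after the label in the row that starts with `label`."""
    for row in rows:
        if row and row[0].strip() == label:
            return [cell.strip() for cell in row[1:]]
    return []


def parse_contributors(rows):
    """Pair each name with the role in the same column, one row below."""
    names = values(rows, "Contributor Names")
    roles = values(rows, "Contributor Role")
    # Pad a short role row so a missing trailing role becomes an empty string.
    roles += [""] * (len(names) - len(roles))
    return [Contributor(n, r) for n, r in zip(names, roles) if n]


def page_image_oids(rows):
    """Map 'Page N' to the image OID found in the N-th value column."""
    oids = values(rows, "Image OIDs")
    return {f"Page {i}": oid for i, oid in enumerate(oids, start=1) if oid}


def parse_date(rows):
    """Parse the Date field, assuming the YYYY-MM-DD hh:mm:ss layout."""
    cells = values(rows, "Date")
    if not cells or not cells[0]:
        return None
    return datetime.strptime(cells[0], "%Y-%m-%d %H:%M:%S")
```

For example, with made-up rows that are not taken from any real document:

```python
rows = [
    ["Image OIDs", "111", "222"],
    ["Date", "1900-01-01 00:00:00"],
    ["Contributor Names", "A. Example", "B. Example"],
    ["Contributor Role", "Annotator"],
]
contributors = parse_contributors(rows)  # second contributor gets an empty role
images = page_image_oids(rows)           # keys are "Page 1" and "Page 2"
```

Looking rows up by label rather than by position keeps the sketch tolerant of a reordered sheet, although the template order described above should still be followed.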

Edited Collections (WIP)

  • What spreadsheets does data come from?
  • What is the high level structure of collections?
  • Link to DB docs