Data Migration
Taylor Snead edited this page Sep 2, 2022 · 4 revisions
The code for data migration lives in the `migration/` directory.
At the moment, no data originates in our MongoDB database. All data is imported from a myriad of Google Sheets using the Sheets REST API. The migration process for annotated texts consists of the following steps.
- Retrieve the contents of the Annotated Texts Index to eventually process each row into an `AnnotatedDoc`.
- For each row of the index, retrieve the contents of the "Annotation Sheet" column cell. The annotation sheet must consist of one or more pages named exactly `Page 1`, `Page 2`, etc., and two other pages called `Metadata` and `References`.
- Ingest the "Metadata" page into a `DocumentMetadata` structure. Correlate each "Source Document Image" cell with the `Page X` sheet of the same index.
- Fill in the `segments` field of the `AnnotatedDoc` with the `Page X` sheet contents concatenated together, by converting each set of annotation rows into an `AnnotatedForm`.
- Push the assembled `AnnotatedDoc` to the database by running the `updateDocument` mutation on the GraphQL server.
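The page-naming rule and the concatenation step above can be sketched as follows. This is a minimal illustration, not the code in `migration/`: the function names `ordered_page_titles` and `concatenate_pages` are hypothetical, and the requirement that pages be numbered from 1 without gaps is an assumption that the text does not state outright.

```python
import re

# Sub-sheets that every annotation sheet must contain besides its pages.
REQUIRED_PAGES = {"Metadata", "References"}
# Page titles must match exactly: "Page 1", "Page 2", ...
PAGE_TITLE = re.compile(r"Page ([1-9][0-9]*)")


def ordered_page_titles(sheet_titles):
    """Return the "Page N" titles in numeric order, or raise ValueError.

    `sheet_titles` is a list of sub-sheet titles, however they were fetched.
    """
    missing = REQUIRED_PAGES - set(sheet_titles)
    if missing:
        raise ValueError(f"missing required pages: {sorted(missing)}")

    numbers = []
    for title in sheet_titles:
        match = PAGE_TITLE.fullmatch(title)
        if match:
            numbers.append(int(match.group(1)))
    numbers.sort()

    # Assumption: pages are numbered consecutively from 1 with no gaps.
    if not numbers or numbers != list(range(1, len(numbers) + 1)):
        raise ValueError(f"expected Page 1..N without gaps, found {numbers}")
    return [f"Page {n}" for n in numbers]


def concatenate_pages(rows_by_title, page_titles):
    """Join the rows of each page, in page order, into one list of rows."""
    combined = []
    for title in page_titles:
        combined.extend(rows_by_title[title])
    return combined
```

Sorting by the parsed page number rather than by title keeps `Page 10` after `Page 9`, which a plain string sort would not.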
The document metadata must be listed in a sub-sheet titled exactly `Metadata`.
The order of fields should always stay the same based on our spreadsheet template.
So far, that ordering is as follows:
- Document ID: DAILP-defined unique identifier for this document.
- Genre
- Source Text
- Title
- Page # in Source Text: the number of the page this document starts with, within a larger source (like the Willie Jumper stories). If uncertain, use `1`.
- Page Count: total number of pages contained within this document.
- Translation Document ID: Identifier of a Google Doc containing the translation, pulled from the share URL of that document.
- Image Source: a shorthand identifier for a source repository. Current possible values: `beinecke`, `dailp`.
- Image OIDs: the IIIF OID for each page within the source repository, where each column corresponds to a page.
- Date: date-time in ISO format like `YYYY-MM-DD hh:mm:ss` that indicates when this document was created.
- Contributor Names: Names of each contributor to this document, whether as creator/author, translator, annotator, collector, etc. Each name goes in its own column.
- Contributor Role: Short description of the role of the person named above in the document, e.g. Annotator. Each role should match the person's name in the same column, one row above.
- Source Attribution: Where this manuscript was sourced from, e.g. Beinecke Rare Book & Manuscript Library.
- Source Attribution Link: A link to the organization or specific record this document was sourced from. Each entry should correspond with the source just above it.
- What spreadsheets does data come from?
- What is the high-level structure of collections?
- Link to DB docs
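Because the fields of the `Metadata` sub-sheet are identified by their fixed order, a reader of that sheet can map rows to fields by position. The sketch below illustrates the idea under stated assumptions: that column 0 of each row holds a label and values start in column 1, and that the names `FIELD_ORDER`, `Contributor`, and `parse_metadata` are hypothetical and not taken from the project's `DocumentMetadata` code.

```python
from dataclasses import dataclass
from datetime import datetime

# Row labels in template order, as listed in the metadata field ordering.
FIELD_ORDER = [
    "Document ID", "Genre", "Source Text", "Title", "Page # in Source Text",
    "Page Count", "Translation Document ID", "Image Source", "Image OIDs",
    "Date", "Contributor Names", "Contributor Role", "Source Attribution",
    "Source Attribution Link",
]


@dataclass
class Contributor:
    name: str
    role: str


def parse_metadata(rows):
    """Map positional Metadata rows to a dict of parsed values.

    Assumption: column 0 of each row is a label; values start at column 1.
    """
    if len(rows) < len(FIELD_ORDER):
        raise ValueError(f"expected {len(FIELD_ORDER)} rows, got {len(rows)}")
    # Fields are identified by row position only, never by the label text.
    by_label = {
        label: [cell.strip() for cell in row[1:]]
        for label, row in zip(FIELD_ORDER, rows)
    }

    def first(label, default=""):
        values = by_label[label]
        return values[0] if values and values[0] else default

    # Pair each name with the role in the same column, one row below.
    roles = by_label["Contributor Role"]
    contributors = [
        Contributor(name, roles[i] if i < len(roles) else "")
        for i, name in enumerate(by_label["Contributor Names"])
        if name
    ]

    date_text = first("Date")
    return {
        "id": first("Document ID"),
        "title": first("Title"),
        # The template says to use 1 when the starting page is uncertain.
        "start_page": int(first("Page # in Source Text", "1")),
        "image_source": first("Image Source"),
        # One OID per column, each column corresponding to a page.
        "image_oids": [oid for oid in by_label["Image OIDs"] if oid],
        "date": (
            datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S")
            if date_text else None
        ),
        "contributors": contributors,
    }
```

Keeping the contributor cells aligned by column, rather than dropping blanks first, preserves the name-to-role pairing even when a role cell is empty.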