Data Representation

We define our data structures in Rust code and store all of our data as JSON documents in MongoDB.

Representation

The schema for each data structure is defined in just one place by Rust code. All core data types are defined in a crate under the types directory.

From there, we automatically generate a strict JSON representation of the data that allows us to store it in MongoDB and query it with GraphQL. Our data ingestion process uses the same code that defines the data structures, preventing us from ever creating invalid data. Finally, the front-end code validates all queries against the automatically generated GraphQL schema.

This multi-stage process ensures type safety and data validation across all layers of the technical stack, without ever repeating a data definition. If we need to change the shape of a data structure, the change can be made in one place. Any code that could build invalid data won't even run.

Take the following example:

/// A single word in an annotated document.
/// One word contains several layers of interpretation, including the original
/// source text, multiple layers of linguistic annotation, and annotator notes.
#[derive(Serialize, Deserialize, async_graphql::SimpleObject, Clone, Debug)]
#[serde(rename_all = "camelCase")]
#[graphql(complex)]
pub struct AnnotatedForm {
    #[serde(rename = "_id")]
    /// Unique identifier of this form
    pub id: String,
    /// Original source text
    pub source: String,
    /// A normalized version of the word
    pub normalized_source: Option<String>,
    // More fields below...
}

The above defines a new data structure called AnnotatedForm and annotates it with several attributes:

  • A #[derive(X, Y, Z)] block says that the following structure has traits X, Y, and Z, which can be derived automatically.
    • The Serialize and Deserialize traits (from serde) indicate that AnnotatedForm can be converted back and forth to formats like JSON or YAML. We convert the structure to JSON to store it in the database, and convert it from JSON upon retrieval.
    • The SimpleObject trait indicates that AnnotatedForm should be represented as a type in our automatically generated GraphQL schema. Then, AnnotatedForm may be used as the result of any GraphQL query.
    • Clone allows data of this shape to be explicitly copied.
    • Debug allows us to print a human-readable description of this structure for debugging purposes.
  • #[serde(rename_all = "camelCase")] converts all snake_case field names to camelCase when serializing the structure. In this case, normalized_source is serialized as normalizedSource. We apply this pattern across the board to match the camelCase naming convention of GraphQL, JavaScript, and JSON (see the serialization sketch after this list).
  • #[graphql(complex)] indicates to the GraphQL schema generator that we intend to add custom computed fields to the GraphQL type as Rust functions in addition to the intrinsic struct fields.
  • #[serde(rename = "_id")] on field id renames just that field during serialization to _id in order to make it the primary key in MongoDB.
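
To illustrate the serde attributes above, here is a minimal serialization sketch. It assumes the serde_json crate and, for brevity, that AnnotatedForm has only the three fields shown above; the values are placeholders.

// Sketch only: serializes an AnnotatedForm to show the effect of the serde
// attributes described in the list above.
fn serialize_example() -> serde_json::Result<String> {
    let form = AnnotatedForm {
        id: "example-form-1".to_string(),
        source: "example".to_string(),
        normalized_source: None,
    };
    serde_json::to_string_pretty(&form)
}

// The output renames `id` to "_id" and `normalized_source` to "normalizedSource":
// {
//   "_id": "example-form-1",
//   "source": "example",
//   "normalizedSource": null
// }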

Storage

We store all of our data in a MongoDB replica set. MongoDB stores all data in a binary-encoded JSON format called BSON. In Rust code, we can use serde for all conversion of data structures to and from BSON. We use the official MongoDB Rust driver.
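
As a rough sketch of what this looks like with the driver's typed collection API (the database and collection names here are illustrative, not necessarily our actual configuration):

use mongodb::{bson::doc, Client};

// Sketch only: inserts an AnnotatedForm and reads it back by primary key,
// assuming the 2.x driver API. serde handles the conversion to and from
// BSON behind the scenes.
async fn save_form(client: &Client, form: &AnnotatedForm) -> mongodb::error::Result<()> {
    let forms = client
        .database("dailp")
        .collection::<AnnotatedForm>("forms");
    forms.insert_one(form, None).await?;
    // Because `id` is serialized as `_id`, it acts as the MongoDB primary key.
    let found = forms.find_one(doc! { "_id": form.id.as_str() }, None).await?;
    assert!(found.is_some());
    Ok(())
}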

Access

To update and access data stored in MongoDB, we host a GraphQL server. The code for it is a Rust crate under the graphql directory.

We use async-graphql to automatically derive a GraphQL schema. It provides lots of great features, like complex objects, async resolvers, batch data loading, and paged queries. I prefer it over the other popular GraphQL server library, Juniper, because Juniper doesn't handle those features as well.
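
As a rough sketch of what a computed field could look like (the field name and logic here are illustrative, not part of our actual schema):

use async_graphql::ComplexObject;

// Sketch only: adds a computed GraphQL field alongside the intrinsic struct
// fields. This pairs with the #[graphql(complex)] attribute shown earlier.
#[ComplexObject]
impl AnnotatedForm {
    /// Illustrative computed field: the normalized form if present,
    /// otherwise the original source text.
    async fn display_text(&self) -> &str {
        self.normalized_source.as_deref().unwrap_or(&self.source)
    }
}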

One benefit of using GraphQL as a data layer is that we explicitly define the fields we want in each query. Thus, if a field is removed in the back-end, any front-end query that still requests it fails schema validation at build time, rather than only at query time.
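
For example, the schema that front-end tooling validates against can be exported from the same Rust definitions, roughly like this (the root Query type here is a stand-in, not our actual query object):

use async_graphql::{EmptyMutation, EmptySubscription, Object, Schema};

// Sketch only: a stand-in root query type; the real server defines its own.
struct Query;

#[Object]
impl Query {
    /// Placeholder field so the schema is non-empty.
    async fn ping(&self) -> &str {
        "pong"
    }
}

// Builds the GraphQL schema from our Rust types and writes out the SDL that
// front-end tooling can validate queries against at build time.
fn export_schema() -> std::io::Result<()> {
    let schema = Schema::build(Query, EmptyMutation, EmptySubscription).finish();
    std::fs::write("schema.graphql", schema.sdl())
}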

Alternatives

We have discussed several alternatives for data representation and access.

Text Encoding Initiative (TEI) XML

  • Storing our data in TEI would likely entail the use of BaseX, an XML database toolkit. From there, BaseX supports data access via JSON-LD or XQuery.
  • We don't get automated build-time type validation when using JSON-LD or connecting to BaseX. We get schema validation when inserting into the database, but our code could still construct invalid data structures.
  • We likely want to write some data transformations or migrations, which requires XSLT, a separate language for transforming XML documents.
  • To really understand JSON-LD, you need to understand RDF and how triplestores work. You will likely also still write queries in SPARQL.
  • Using BaseX + LD + XSLT requires understanding TEI/XML, XSLT, JSON-LD, BaseX, XQuery, RDF, and SPARQL. I could find few online resources accessible to me as a software engineer. By dropping LD as a requirement, we could reduce that list to TEI/XML, XSLT, BaseX, and XQuery. These XML technologies have been around for a while, but modern resources on them are still lacking and they are more difficult to integrate with a modern web stack. Using MongoDB + GraphQL requires understanding only MongoDB, GraphQL, and JSON, which are ubiquitous tools on the web today.

Linked Linguistic Open Data (LLOD)

We use GraphQL as our domain-local data layer, but may provide a linked data add-on in the future to connect this project with the semantic web. We avoid limiting ourselves to linked data internally because it isn't ideal for data storage or querying. The following are some notes from my research on LLOD.

  • Linked Data example from WordNet: WordNet's use of linked data raises several questions about who determines the semantic hierarchy and the ontology that describes a language. By leaning on bodies like WordNet or the W3C (which consist of corporate representatives) to craft linguistically sound LD ontologies to represent Cherokee, we take agency out of speakers' hands.
  • JSON-LD can be stored in MongoDB, but it requires lots of extra schema information to handle properly.
  • Examples of usage: Bilingual Dictionaries in LD, JSON-LD in MongoDB, Ontolex Specification. It's very difficult to find documentation of LLOD standards like ontolex and lexinfo. The W3C specifications are esoteric and unclear, written for internal understanding and largely relying on existing knowledge of the system.
  • Little tooling available for working with LD queries.
    • HyperGraphQL, which could make it easier to query an LD triplestore, is not consistently maintained
    • Tooling is mostly in Java, which we don't plan to use
  • JSON-LD Playground: Helpful for visualizing what JSON-LD representations and queries could look like.
  • We would need to choose several prescriptive ontologies up-front
    • ontolex/morph: The morphology module of ontolex. I could only find one maintainer (?), and the ontolex project seems to have only a few actual contributors on GitHub. This particular module is very much incomplete and has zero documentation at the moment.
    • mmoon: an ontology for morphology; I haven't found much more information about it.
  • However much it may seem like a great tool, LD remains esoteric in the technical community. So far it seems useful only for aggregating information, not for data access or manipulation.
  • Linked data sources frequently expose a public SPARQL endpoint, which is an unfortunate practice: SPARQL is vulnerable to injection in much the same way as SQL, which is why nobody publicly exposes a SQL server to arbitrary queries. With GraphQL, we instead allow only strictly defined queries, which avoids a host of related issues.
  • LD allows us to define our resources in terms of some standard schemas, but there is no reason that our website needs to access the data as LD or via SPARQL. Where we gain ground is in connecting to other LD resources, like other language archives or knowledge graphs.