Specification for v0 CBOR format #23

Open
hobofan opened this issue Jul 25, 2019 · 2 comments

Comments

hobofan commented Jul 25, 2019

Motivation

The current Protobuf-based format has a number of drawbacks. The current docs page gives an overview of its pros and cons (https://rlay-project.github.io/rlay-client/docs/rlay-ontology-serialization-formats#protobuf-based-format):

Pros:

  • Protobuf libraries were easily available for prototyping in Rust and Solidity at time of creation
  • Via ordered fields in protobuf schemas, it is pretty easy to have a deterministic content-addressable format
  • Low size overhead over contents

Cons:

  • Protobuf is comparatively complex for the simple features we need from it
  • As the protobuf encoding doesn't contain any information about the entity kind, the entity kind has to be known for the encoding to be correctly interpreted
  • Per-EntityKind CID multicodecs would require a lot of codecs to be registered/coordinated
  • Unwieldy to use in end-user applications

Proposal

As a replacement for the Protobuf format, I am proposing a CBOR-based format with an included prefix. The major upside is that CBOR is standardized and has robust implementations in many languages. The other major change would be the included prefix, which makes it possible to deserialize the entity without externally provided information about which entity kind the bytes represent.

  • Prefix as a varint (CBOR unsigned integer vs. multiformats unsigned varint?) that specifies the entity kind. It could also be a full multicodec per entity kind. Maybe the prefix could even do both: a varint that, combined with a multicodec prefix reserved for a whole block of numbers, acts as a full multicodec.
  • Body as CBOR where empty fields are omitted.
  • Numbers are used as field keys instead of full field names like "annotation", so the encoding is more compact.
  • It could be a good idea to use a shared mapping of "field name" -> "field key number", so that the only things necessary for properly deserializing are the entity kind and that mapping. Otherwise we would need a per-entity-kind mapping, which would be more complex to create, e.g. annotations would be 0 for Class while it is 1 for DataProperty.
  • Since only the field key numbers 0-23 encode in 1 byte, and 24-255 take 2 bytes, the mapping should prioritize the most common field names. Most common might mean either the field names that appear in most entity kinds (e.g. annotations), or field names that appear in entity kinds that are used very often (e.g. all field names in DataProperty).
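
The body encoding described in the list above could look roughly like the following Python sketch. It rests on assumptions that are not part of the proposal: the field names and numbers in `FIELD_KEYS` are made up, field values are assumed to be lists of byte strings, and the CBOR encoder is hand-rolled only to keep the example dependency-free (a real implementation would more likely use an existing CBOR library).

```python
# Hypothetical shared mapping of field name -> field key number.
# The names and numbers are placeholders, not part of the proposal.
FIELD_KEYS = {"annotations": 0, "domain": 1, "range": 2}


def encode_head(major: int, value: int) -> bytes:
    """Encode a CBOR head (major type plus argument) in its shortest form."""
    if value < 24:
        # Arguments 0-23 fit into the initial byte itself.
        return bytes([(major << 5) | value])
    for additional, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if value < 1 << (8 * size):
            return bytes([(major << 5) | additional]) + value.to_bytes(size, "big")
    raise ValueError("value too large for a CBOR head")


def encode_entity_body(fields: dict) -> bytes:
    """Encode fields as a CBOR map of field key number -> array of byte strings.

    Empty fields are omitted and keys are written in ascending numeric
    order so that equal inputs always produce equal bytes.
    """
    present = {FIELD_KEYS[name]: values for name, values in fields.items() if values}
    out = encode_head(5, len(present))  # major type 5: map
    for key in sorted(present):
        out += encode_head(0, key)  # major type 0: unsigned integer
        values = present[key]
        out += encode_head(4, len(values))  # major type 4: array
        for value in values:
            out += encode_head(2, len(value)) + value  # major type 2: byte string
    return out
```

With this sketch, `len(encode_head(0, 23))` is 1 and `len(encode_head(0, 24))` is 2, which is the size jump that motivates giving the most common field names the lowest numbers. An entity whose only non-empty field is `annotations` encodes as a one-entry map, and the empty `domain` field leaves no trace in the output.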

hobofan commented Jul 26, 2019

Regarding the prefix: Using CBOR tags could be a good idea.

Related: pyfisch/cbor#129
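
A rough Python sketch of what a tag-based prefix might look like: the already-encoded body is wrapped in a CBOR tag (major type 6) whose number identifies the entity kind. The tag numbers below are arbitrary placeholders chosen for illustration; they are not registered and have not been checked against the IANA CBOR tag registry. The helper names are hypothetical as well.

```python
# Hypothetical tag numbers, one per entity kind (placeholders only).
ENTITY_KIND_TAGS = {"Class": 60000, "DataProperty": 60001}
TAGS_TO_KIND = {tag: kind for kind, tag in ENTITY_KIND_TAGS.items()}

# 0xD9 = major type 6 (tag) with additional info 25 (16-bit argument follows).
# This is the shortest head for tag numbers in the range 256-65535.
TAG_HEAD_16BIT = 0xD9


def tag_entity(kind: str, body: bytes) -> bytes:
    """Prefix an already-encoded CBOR body with the tag for its entity kind."""
    tag = ENTITY_KIND_TAGS[kind]
    return bytes([TAG_HEAD_16BIT]) + tag.to_bytes(2, "big") + body


def split_tag(data: bytes) -> tuple:
    """Read the tag first, then return (entity kind, remaining CBOR body)."""
    if len(data) < 3 or data[0] != TAG_HEAD_16BIT:
        raise ValueError("expected a CBOR tag with a 16-bit argument")
    tag = int.from_bytes(data[1:3], "big")
    return TAGS_TO_KIND[tag], data[3:]
```

Because the result is still a single valid CBOR item, a generic CBOR decoder can read it, provided the library exposes tags to the caller.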

hobofan commented Jan 10, 2021

Might also be a good idea to always encode the entity kind as field 0 in the CBOR body, instead of having a prefix.

This is probably better supported than CBOR tags in most libraries, and all of the encoding/decoding can be done in a single step after decoding the CBOR, instead of having to parse the prefix first.
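
To illustrate the single-step idea, here is a minimal Python sketch that decodes the whole CBOR map first and then reads the entity kind from key 0. The kind numbering in `ENTITY_KINDS` is hypothetical, and the decoder is a deliberately tiny hand-written one that only covers unsigned integers, byte strings, arrays and maps; in practice an existing CBOR library would take its place. Note that this variant reserves key 0 for the kind, so regular field key numbers would have to start at 1.

```python
# Hypothetical numbering of entity kinds (placeholders only).
ENTITY_KINDS = {1: "Class", 2: "DataProperty"}
KIND_KEY = 0  # field 0 of every body holds the entity kind


def decode_head(data: bytes, pos: int):
    """Return (major type, argument, next position) for the head at pos."""
    initial = data[pos]
    major, additional = initial >> 5, initial & 0x1F
    pos += 1
    if additional < 24:
        return major, additional, pos
    if additional > 27:
        raise ValueError("indefinite lengths are not supported in this sketch")
    size = 1 << (additional - 24)  # 1, 2, 4 or 8 argument bytes
    return major, int.from_bytes(data[pos:pos + size], "big"), pos + size


def decode_item(data: bytes, pos: int = 0):
    """Decode one CBOR item and return (value, next position)."""
    major, value, pos = decode_head(data, pos)
    if major == 0:  # unsigned integer
        return value, pos
    if major == 2:  # byte string
        return data[pos:pos + value], pos + value
    if major == 4:  # array
        items = []
        for _ in range(value):
            item, pos = decode_item(data, pos)
            items.append(item)
        return items, pos
    if major == 5:  # map
        result = {}
        for _ in range(value):
            key, pos = decode_item(data, pos)
            result[key], pos = decode_item(data, pos)
        return result, pos
    raise ValueError("major type not supported in this sketch")


def decode_entity(data: bytes):
    """Decode the CBOR body, then split off the entity kind stored at key 0."""
    body, end = decode_item(data)
    if end != len(data):
        raise ValueError("trailing bytes after CBOR body")
    kind = ENTITY_KINDS[body.pop(KIND_KEY)]
    return kind, body
```

For example, the bytes `a2 00 01 01 81 42 01 02` are a two-entry map `{0: 1, 1: [h'0102']}`, and `decode_entity` splits them into the kind `"Class"` and the remaining fields `{1: [b"\x01\x02"]}`.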
