Normalizing/standardizing data storage (enforcing best practices) through PL/pgSQL functions #124

Open
guignonv opened this issue Feb 8, 2022 · 2 comments
Labels: Discussion (thread for community discussion)

Comments

guignonv commented Feb 8, 2022

I'd like to bring a discussion (held yesterday on gather.town) here for the community.

The main strength of a schema like Chado is that it normalizes how biological data are stored: we can create generic tools that work with everybody's Chado instance, as long as everybody stores things the (exact) same way. However, it is often possible to store the same thing in different ways, which makes writing generic tools for Chado a harder task.

To enforce standard ways of storing things, there is best-practices documentation, but people may not read it, may forget about it, or may misunderstand it.

I propose an additional approach: adding embedded functions to the Chado schema whose role is to store data appropriately. I see several advantages: first, they enforce the way things are stored (ie. they use the appropriate tables and create the appropriate links between things); second, they can check data integrity (ie. if a date is stored in a *prop.value column, the function can raise an error when the date is invalid); third, they can absorb schema changes (ie. the same function can keep working from one schema version to the next, even across major table changes). There are some drawbacks: there is a lot of code to write and maintain, many functions will be needed to cover the many use cases, and how do we decide what is needed? I will post some examples in a next comment.
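As a minimal sketch of the idea, here is what a validating store function could look like in PL/pgSQL. The function name, the error handling and the assumption that the property type can be resolved by cvterm name alone are all illustrative, not existing Chado code:

```sql
-- Hypothetical sketch: store a date property on a stock, refusing
-- invalid dates instead of silently writing junk into stockprop.value.
CREATE OR REPLACE FUNCTION store_stock_date_prop(
  p_stock_id  integer,
  p_type_name text,
  p_value     text
) RETURNS integer AS $$
DECLARE
  v_type_id      integer;
  v_stockprop_id integer;
BEGIN
  -- Raises invalid_datetime_format if p_value is not a valid date.
  PERFORM p_value::date;

  -- Resolve the property type by cvterm name (assumed unambiguous here);
  -- STRICT raises an error if zero or several terms match.
  SELECT cvterm_id INTO STRICT v_type_id
    FROM cvterm
   WHERE name = p_type_name;

  INSERT INTO stockprop (stock_id, type_id, value)
  VALUES (p_stock_id, v_type_id, p_value)
  RETURNING stockprop_id INTO v_stockprop_id;

  RETURN v_stockprop_id;
END;
$$ LANGUAGE plpgsql;
```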

So far, this is just a thought (prompted by a problem I faced), with no code written. It would require a lot of work, but I'd like to have the community's opinion on this approach.

Edit: well, there is already some code for basic stuff (store_db, store_dbxref, store_organism, store_feature, store_featureloc, store_feature_synonym, store_analysis), but it could be generalized. store_feature could be used by new functions like store_dna, store_snp, ...

guignonv commented Feb 8, 2022

Here is one use case I have: I manage stocks, and I would like to add a trait value to a given stock. Depending on the trait, I could use stockprop if the value is continuous, or the stock_cvterm table if the value is discrete. That would be fine but... I could also use the phenotype table and associate a trait phenotype with my stock. Then the process becomes a bit more complex. I have to identify the observable part I am looking at and the attribute I am measuring. Then, I have to create a phenotype entry for that "observable" (find the cvterm), that "attribute" (find another cvterm) and that "value" (which can be either a discrete value in the cvalue_id column or a continuous value in the value column). Then, I have to create an nd_experiment entry corresponding to the experiment that led to the measurement. That experiment requires a geolocation as well, but I may use an empty one if needed (ie. nd_geolocation_id 1). After that, I have to link that nd_experiment to the phenotype (nd_experiment_phenotype) and to the stock (nd_experiment_stock). This is far more complex than inserting a single row into the stockprop or stock_cvterm tables, so I believe most people would be too lazy to use the nd_experiment and phenotype tables. I even heard that some people added a (wild! ;) ) stock_phenotype table to simplify things while still taking advantage of the observable and attribute fields.
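To make the contrast concrete, here is roughly what the manual route looks like in plain SQL; all the IDs (101, 202, 303, 304, stock 42), the type cvterms and the uniquename are placeholders, and exact columns may vary slightly between Chado versions:

```sql
-- The manual route, step by step (placeholder IDs throughout).
-- 1. Create the phenotype for a given observable, attribute and value.
INSERT INTO phenotype (uniquename, observable_id, attr_id, value)
VALUES ('stock42-leaf-length-obs1', 101, 202, '13.5')
RETURNING phenotype_id;  -- say it returns 9001

-- 2. Create the experiment that produced the measurement; it must
--    reference an nd_geolocation (here the "empty" default, id 1).
INSERT INTO nd_experiment (nd_geolocation_id, type_id)
VALUES (1, 303)
RETURNING nd_experiment_id;  -- say it returns 7001

-- 3. Link the experiment to the phenotype...
INSERT INTO nd_experiment_phenotype (nd_experiment_id, phenotype_id)
VALUES (7001, 9001);

-- 4. ...and to the stock.
INSERT INTO nd_experiment_stock (nd_experiment_id, stock_id, type_id)
VALUES (7001, 42, 304);
```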

For this use case, I would imagine a function like store_stock_phenotype(stock_id, value, observable, attribute, nd_experiment_id=NULL, latitude=NULL, longitude=NULL, altitude=NULL, geodetic_datum=NULL, ...), in which the user provides the stock_id, then a "value", an "observable" and an "attribute" as text, and maybe other elements. The function would find the corresponding "observable" and "attribute" cvterms by name in the cvterm table. Maybe we could think of a way to (optionally) specify a CV for each, to avoid ambiguity when the same cvterm name belongs to several CVs. For instance, "leaf" could belong to both a "plant anatomy" CV and a "graph" CV; the value could then be something like "plant anatomy:leaf" (a single colon splits the CV name from the term name, while a doubled colon is treated as a literal colon). We could also think of a more complex system that would "know" that there is only one valid "observable"-"attribute" pair (using the cvterm_relationship table, maybe?) to avoid ambiguity. We could go further and add some logic to the database so that, for a given attribute, the value must either follow a given format (for continuous values using the "value" column) or belong to a given CV (for discrete values using the "cvalue_id" column), in which case the function would pick the cvterm matching the user-provided value. The function could raise an exception and stop the process if the value is not valid.
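A sketch of how that "CV:term" resolution could work, with the doubled-colon escape proposed above (the helper name resolve_cvterm is hypothetical):

```sql
-- Hypothetical helper: resolve 'cv name:term name' (or a bare term name)
-- to a cvterm_id. A doubled colon '::' stands for a literal ':'.
CREATE OR REPLACE FUNCTION resolve_cvterm(p_term text)
RETURNS integer AS $$
DECLARE
  -- Hide escaped colons behind a sentinel character before splitting.
  v_tmp       text := replace(p_term, '::', chr(1));
  v_cv_name   text;
  v_term_name text;
  v_cvterm_id integer;
BEGIN
  IF position(':' in v_tmp) > 0 THEN
    -- Qualified form: look the term up inside the named CV only.
    v_cv_name   := replace(split_part(v_tmp, ':', 1), chr(1), ':');
    v_term_name := replace(substr(v_tmp, position(':' in v_tmp) + 1), chr(1), ':');
    SELECT t.cvterm_id INTO STRICT v_cvterm_id
      FROM cvterm t
      JOIN cv USING (cv_id)
     WHERE cv.name = v_cv_name
       AND t.name  = v_term_name;
  ELSE
    -- Bare form: STRICT raises an error if the name matches zero terms
    -- or is ambiguous across several CVs.
    v_term_name := replace(v_tmp, chr(1), ':');
    SELECT cvterm_id INTO STRICT v_cvterm_id
      FROM cvterm
     WHERE name = v_term_name;
  END IF;
  RETURN v_cvterm_id;
END;
$$ LANGUAGE plpgsql;
```

For example, SELECT resolve_cvterm('plant anatomy:leaf'); would look "leaf" up only in the "plant anatomy" CV, while SELECT resolve_cvterm('leaf'); would fail if several CVs define a "leaf" term.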
Once the observable, attribute and value are identified, the function can try to get the corresponding phenotype_id from the phenotype table, or insert a new record automatically if it is missing. Then, the function will create an experiment (or use an existing one if an nd_experiment_id has been provided) and associate it with the stock and the phenotype. The experiment requires an nd_geolocation, which would also be created by the function from the user inputs (a default nd_geolocation would be used otherwise).
Finally, the function would return the nd_experiment_id used, in case more things need to be added to the experiment afterwards, or the experiment needs to be reused to insert another stock-phenotype record.
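Putting the pieces together, a skeleton of store_stock_phenotype could look like the following. This is only a sketch: it reuses the hypothetical resolve_cvterm helper above, collapses the latitude/longitude/altitude handling into a single nd_geolocation_id parameter for brevity, and takes the nd_experiment and nd_experiment_stock type cvterms as parameters rather than guessing which terms a site uses:

```sql
-- Hypothetical skeleton; the type_id parameters are required whenever
-- a new nd_experiment or stock link is created.
CREATE OR REPLACE FUNCTION store_stock_phenotype(
  p_stock_id           integer,
  p_value              text,
  p_observable         text,
  p_attribute          text,
  p_nd_experiment_id   integer DEFAULT NULL,
  p_experiment_type_id integer DEFAULT NULL,
  p_stock_rel_type_id  integer DEFAULT NULL,
  p_nd_geolocation_id  integer DEFAULT 1  -- the "empty" default location
) RETURNS integer AS $$
DECLARE
  v_observable_id    integer := resolve_cvterm(p_observable);
  v_attr_id          integer := resolve_cvterm(p_attribute);
  v_phenotype_id     integer;
  v_nd_experiment_id integer := p_nd_experiment_id;
BEGIN
  -- Reuse an existing phenotype row if one matches, else insert it.
  SELECT phenotype_id INTO v_phenotype_id
    FROM phenotype
   WHERE observable_id = v_observable_id
     AND attr_id = v_attr_id
     AND value = p_value;
  IF v_phenotype_id IS NULL THEN
    INSERT INTO phenotype (uniquename, observable_id, attr_id, value)
    VALUES (p_observable || '-' || p_attribute || '-' || p_value,
            v_observable_id, v_attr_id, p_value)
    RETURNING phenotype_id INTO v_phenotype_id;
  END IF;

  -- Create the experiment unless the caller supplied one.
  IF v_nd_experiment_id IS NULL THEN
    INSERT INTO nd_experiment (nd_geolocation_id, type_id)
    VALUES (p_nd_geolocation_id, p_experiment_type_id)
    RETURNING nd_experiment_id INTO v_nd_experiment_id;
  END IF;

  -- Link the experiment to the phenotype and to the stock.
  INSERT INTO nd_experiment_phenotype (nd_experiment_id, phenotype_id)
  VALUES (v_nd_experiment_id, v_phenotype_id);
  INSERT INTO nd_experiment_stock (nd_experiment_id, stock_id, type_id)
  VALUES (v_nd_experiment_id, p_stock_id, p_stock_rel_type_id);

  -- Return the experiment id so more data can be attached later.
  RETURN v_nd_experiment_id;
END;
$$ LANGUAGE plpgsql;
```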

Such a function would greatly simplify the use of the natural diversity tables, which are quite complex. Their complexity makes them hard to use and query, but materialized views are there to solve the querying issue. Their benefit is that they normalize data, when they are actually used. If such data are "normalized", it is then possible to write generic user interfaces that work on those data. Otherwise you can't, because you don't know whether people stored things in stockprop, stock_cvterm or somewhere else...

This "use case" may not be alone. I believe the same kind of approach could be used for each case in chado best practices. I yet have another use case when I need to add a stock_cvterm relationship: depending on the cvterm used, I may either want to allow several relationships between a stock and the cvterms of a given CV, only one relationship at a time, or only one for a given subset of terms of a CV (using the CV graph to select a subtree of a "root" term that would provide the allowed subset of cvterms). A parametrized function could help here as well.

Note: I initially stored trait values in the stockprop/stock_cvterm tables, and sometimes I think I should have created a custom CV for some discrete values and used stock_cvterm instead of stockprop, but no official CV existed for those traits at the time. Sometimes I have regrets (time pressure, too lazy?)! Now I'm moving things to phenotype and it's a pain... Am I just the "1%" of people doing such things? ;-)

Discussion is open! :-)

guignonv changed the title from "Normalizing/standardizing data storage (enforcing bestpractices) through pgsql functions" to "Normalizing/standardizing data storage (enforcing bestpractices) through PL/pgsql functions" Feb 8, 2022
laceysanderson (Contributor) commented

I agree that using the ND experiment tables is a pain... We're actually using a different approach, documented here, to bypass them for many of the reasons you mentioned. Such a function would have greatly improved the chances of us using the ND experiment tables.

I support such an approach. The benefit of using functions in this manner is that it still provides support for both our approaches:

  • people wanting to use the ND tables could use this function for much easier data insertion;
  • people who don't want to, or who need another solution, simply do not use the function.

Since warnings and such only show up when using the function, it doesn't reduce the flexibility of Chado; it just provides an easier means of following best practices. Great idea!

laceysanderson added the Discussion label Mar 7, 2022