Add RDF Reader and Extractor #19

Merged: BeritJanssen merged 16 commits into develop from feature/turtle on Jul 10, 2024

BeritJanssen (Member) commented on May 15, 2024

Close #14: This branch adds an RDF Reader and Extractor. The RDFReader class has a document_subjects function, which should return all subjects from which documents can be retrieved. The extractor can extract the subject itself (presumably useful for id / url fields), as well as the objects of given predicates.

@jgonggrijp, are these assumptions about the structure of input data too narrow? I tested with the EUParl linked data as well, but this data is rather sparse, and probably does not give a clear picture of what "real" linked data looks like.

What the current implementation doesn't cover is the use case where data is spread over multiple graphs stemming from different input files. This is exactly where linked data should shine, but at the moment there is no use case for this scenario. I assume we can modify the source2dicts function to combine the contents of multiple files into a graph. I'd rather leave this for later though, as long as the assumption here that "documents can be derived from queries with a subject and predicate" is valid.
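
A minimal sketch of the assumption that "documents can be derived from queries with a subject and predicate". It does not reproduce the code in this branch: the graph is modelled as a plain set of (subject, predicate, object) string tuples so that the example has no dependencies, and the sample data, the `extract` and `documents` helpers, and the signature given to `document_subjects` here are all hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical sample data, not taken from the EUParl corpus.
TRIPLES: Set[Triple] = {
    ("ex:doc1", "rdf:type", "ex:Speech"),
    ("ex:doc1", "ex:speaker", "Alice"),
    ("ex:doc1", "ex:text", "Hello"),
    ("ex:doc2", "rdf:type", "ex:Speech"),
    ("ex:doc2", "ex:speaker", "Bob"),
    ("ex:person9", "rdf:type", "ex:Person"),
}


def document_subjects(triples: Set[Triple], doc_type: str) -> List[str]:
    """Return every subject that should yield one document (assumed: selected by rdf:type)."""
    return sorted(s for s, p, o in triples if p == "rdf:type" and o == doc_type)


def extract(triples: Set[Triple], subject: str, predicate: str) -> Optional[str]:
    """Return one object for the subject/predicate pair, or None if there is none."""
    values = sorted(o for s, p, o in triples if s == subject and p == predicate)
    return values[0] if values else None


def documents(
    triples: Set[Triple], doc_type: str, fields: Dict[str, str]
) -> Iterator[Dict[str, Optional[str]]]:
    """Build one dict per document subject; the subject itself becomes the id."""
    for subject in document_subjects(triples, doc_type):
        doc: Dict[str, Optional[str]] = {"id": subject}
        for name, predicate in fields.items():
            doc[name] = extract(triples, subject, predicate)
        yield doc
```

Calling `list(documents(TRIPLES, "ex:Speech", {"speaker": "ex:speaker", "text": "ex:text"}))` yields one dict per speech, with `None` for any field whose predicate is missing on that subject.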

jgonggrijp (Member) left a comment

Nice to see this project come to fruition!

> This branch adds an RDF Reader and Extractor. The RDFReader class has a document_subjects function, which should return all subjects from which documents can be retrieved. The extractor can extract the subject itself (presumably useful for id / url fields), as well as the objects of given predicates.

This part looks very sensible to me.

> @jgonggrijp, are these assumptions about the structure of input data too narrow?

Maybe a little. As far as I can tell, you are not yet handling relationship traversal other than RDF lists. An RDF extractor is likely to encounter structured data with multiple levels of nesting. Here is a contrived example, where I might want to extract a single document {id: 'berit', shirt: 'octopi', name: 'Berit Janssen'}:

ex:berit a cdh:Developer;
         cdh:wearsShirt ex:shirt5;
         cdh:fullName ex:name3.
ex:shirt5 a cdh:CoolShirt;
          schema:color "purple";
          rdfs:label "octopi".
ex:name3 a cdh:Name;
         cdh:nameOrder cdh:familyLast;
         cdh:nameParts ("Berit" "Janssen").

A possible way to enable generic relationship traversal might be to allow tuples of URIRefs as the predicate argument to the RDF extractor. This would be analogous to SPARQL's sequence path notation, cdh:wearsShirt / rdfs:label. By the way, the example above uses Turtle's special notation for lists, based on round brackets.
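
A rough sketch of the tuple-of-predicates idea, under the assumption that each predicate in the tuple is followed in turn, starting from the document subject. To keep it dependency-free, the graph is again a plain set of string triples with prefixed names rather than URIRefs; `follow_path` is a hypothetical helper, not part of the extractor in this branch.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Part of the contrived example above, written as plain string triples.
GRAPH: Set[Triple] = {
    ("ex:berit", "rdf:type", "cdh:Developer"),
    ("ex:berit", "cdh:wearsShirt", "ex:shirt5"),
    ("ex:shirt5", "rdf:type", "cdh:CoolShirt"),
    ("ex:shirt5", "schema:color", "purple"),
    ("ex:shirt5", "rdfs:label", "octopi"),
}


def follow_path(triples: Set[Triple], subject: str, path: Tuple[str, ...]) -> List[str]:
    """Follow each predicate in `path` in order and return the nodes reached at the end."""
    frontier = {subject}
    for predicate in path:
        # Step from every node reached so far to the objects of `predicate`.
        frontier = {o for s, p, o in triples if s in frontier and p == predicate}
    return sorted(frontier)
```

With this data, `follow_path(GRAPH, "ex:berit", ("cdh:wearsShirt", "rdfs:label"))` reaches the shirt label, while an empty tuple returns the subject itself, which would cover the id / url case with the same mechanism.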

Speaking of lists, I am unsure about the way you handle those and the multiple parameter. You might rely on the assumption that the subject that supplies the id is always the head of the list, i.e., <subject> rdf:first "value"; rdf:rest <rest-of-list>. I think you are more likely to encounter a list only after you follow the object of a predicate, for example <subject> ns1:has_lines <object>. <object> rdf:first "value"; rdf:rest <rest-of-list> or <subject> ns1:has_lines [rdf:first "value"; rdf:rest <rest-of-list>].
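
To illustrate the second case, here is a small sketch that first follows the predicate to the head of the list and only then walks rdf:first / rdf:rest until rdf:nil. The graph is a plain set of string triples, and the data and the helpers `one_object`, `read_list` and `list_values` are hypothetical names made up for this example.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

# Hypothetical data: the list hangs off ns1:has_lines, not off the subject itself.
GRAPH: Set[Triple] = {
    ("ex:poem", "ns1:has_lines", "_:b0"),
    ("_:b0", RDF_FIRST, "first line"),
    ("_:b0", RDF_REST, "_:b1"),
    ("_:b1", RDF_FIRST, "second line"),
    ("_:b1", RDF_REST, RDF_NIL),
}


def one_object(triples: Set[Triple], subject: str, predicate: str) -> Optional[str]:
    """Return an object for the subject/predicate pair, or None if there is none."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def read_list(triples: Set[Triple], head: str) -> List[str]:
    """Collect the items of an RDF list in order, starting at its head node."""
    items: List[str] = []
    seen: Set[str] = set()
    node: Optional[str] = head
    # Stop at rdf:nil, at a dangling rdf:rest, or if the list loops back on itself.
    while node is not None and node != RDF_NIL and node not in seen:
        seen.add(node)
        first = one_object(triples, node, RDF_FIRST)
        if first is not None:
            items.append(first)
        node = one_object(triples, node, RDF_REST)
    return items


def list_values(triples: Set[Triple], subject: str, predicate: str) -> List[str]:
    """Follow `predicate` from `subject` to a list head, then read the list."""
    head = one_object(triples, subject, predicate)
    return read_list(triples, head) if head is not None else []
```

`list_values(GRAPH, "ex:poem", "ns1:has_lines")` returns the lines in list order; the subject that supplies the id never has to be the list head.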

Maybe you are also making the assumption that if a subject-predicate pair has multiple values, it is always in the form of a list. This is not the case. You can also simply have multiple triples with the same subject and predicate, but different objects:

<http://example.org/shakespeare/line/hamlet-actI-scene5-3>
    ns1:hasSpeaker <http://example.org/shakespeare/character/GHOST> ;
    ns1:hasLine "My hour is almost come," ;
    ns1:hasLine "When I to sulph'rous and tormenting flames" ;
    ns1:hasLine "Must render up myself." .

Which can be abbreviated with commas:

<http://example.org/shakespeare/line/hamlet-actI-scene5-3>
    ns1:hasSpeaker <http://example.org/shakespeare/character/GHOST> ;
    ns1:hasLine "My hour is almost come," ,
                "When I to sulph'rous and tormenting flames" ,
                "Must render up myself." .

In this case a list makes more sense because it preserves order, but often, this is not needed.
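
A sketch of how an extractor might treat repeated subject-predicate pairs, assuming a `multiple` flag that switches between one value and all values. The flag's behaviour here is a guess for illustration, not the semantics implemented in this branch, and `extract_values` is a hypothetical helper over a plain set of string triples.

```python
from typing import List, Optional, Set, Tuple, Union

Triple = Tuple[str, str, str]

LINE = "http://example.org/shakespeare/line/hamlet-actI-scene5-3"
GHOST = "http://example.org/shakespeare/character/GHOST"

# The Hamlet example above as plain string triples.
GRAPH: Set[Triple] = {
    (LINE, "ns1:hasSpeaker", GHOST),
    (LINE, "ns1:hasLine", "My hour is almost come,"),
    (LINE, "ns1:hasLine", "When I to sulph'rous and tormenting flames"),
    (LINE, "ns1:hasLine", "Must render up myself."),
}


def extract_values(
    triples: Set[Triple], subject: str, predicate: str, multiple: bool = False
) -> Union[List[str], Optional[str]]:
    """Return all objects if `multiple` is set, otherwise one object or None."""
    # Sorting only makes the result deterministic; the graph itself has no order.
    values = sorted(o for s, p, o in triples if s == subject and p == predicate)
    if multiple:
        return values
    return values[0] if values else None
```

All three lines come back with `multiple=True`, but in sorted order rather than spoken order, which is exactly the information a list would have preserved.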

> What the current implementation doesn't cover is the use case where data is spread over multiple graphs stemming from different input files. This is exactly where linked data should shine, but at the moment there is no use case for this scenario. I assume we can modify the source2dicts function to combine the contents of multiple files into a graph. I'd rather leave this for later though, as long as the assumption here that "documents can be derived from queries with a subject and predicate" is valid.

Honestly, I am not bothered by this at all. It seems reasonable to me to expect the user to combine all data in one file before supplying it to the extractor. Concatenating files is not hard to do.
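
For what it's worth, a minimal sketch of such a concatenation step. It assumes a serialization such as Turtle or N-Triples, where statements from separate files can simply follow one another; files that reuse the same blank node labels may need more care. `concatenate_files` is a hypothetical helper, not something this package provides.

```python
from pathlib import Path
from typing import Iterable


def concatenate_files(sources: Iterable[Path], target: Path) -> Path:
    """Write the contents of all source files, in order, into one target file."""
    with target.open("w", encoding="utf-8") as out:
        for source in sources:
            text = source.read_text(encoding="utf-8")
            out.write(text)
            # Make sure the next file starts on a fresh line.
            if not text.endswith("\n"):
                out.write("\n")
    return target
```

The combined file can then be supplied to the reader as a single source.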

ianalyzer_readers/extract.py (outdated, resolved)
ianalyzer_readers/extract.py (outdated, resolved)
ianalyzer_readers/readers/rdf.py (outdated, resolved)
ianalyzer_readers/readers/rdf.py (outdated, resolved)
tests/rdf_reader.py (outdated, resolved)
tests/test_rdf_reader.py (outdated, resolved)

lukavdplas (Contributor) left a comment

The implementation looks good to me! :)

I have added some comments regarding the documentation.

ianalyzer_readers/extract.py (outdated, resolved)
ianalyzer_readers/extract.py (outdated, resolved)
ianalyzer_readers/extract.py (outdated, resolved)
ianalyzer_readers/readers/rdf.py (outdated, resolved)
ianalyzer_readers/readers/rdf.py (outdated, resolved)
ianalyzer_readers/readers/rdf.py (outdated, resolved)
ianalyzer_readers/readers/rdf.py (outdated, resolved)

BeritJanssen merged commit bd706e9 into develop on Jul 10, 2024
8 checks passed
BeritJanssen deleted the feature/turtle branch on July 10, 2024, 12:33
Successfully merging this pull request may close these issues.

Add RDF reader
3 participants