CoNLL-U metadata validation/cleansing #251

Open
chiarcos opened this issue Jun 25, 2024 · 2 comments

@chiarcos

chiarcos commented Jun 25, 2024

  • version: Annatto 0.8.0 - 2024-06-17

  • issue: After converting (valid) CoNLL-U v1 data with non-standard metadata (see below), the output could be imported into ANNIS, but it was only partially visualized (no dependency view).

  • suggestion

    • detect non-standard metadata during conversion, produce a warning, and keep it as a comment (not as metadata)
    • add relevant test data to tests/data/import/conll
  • background: In the CoNLL-U format, comments before a sentence can be used to provide metadata, where a metadata attribute (e.g., text) is assigned a value (separated by =). In CoNLL-U v2, there are two obligatory metadata fields, text and sent_id. In CoNLL-U v1, metadata is optional. In CoNLL-X, metadata is treated as a comment. In the following data snippet, an invalid separator (: instead of =) is used, causing the ANNIS visualizer to break (personal communication from Thomas Krause). Apparently, this is because the converter tried to quietly recover the invalid metadata.

  • example

      # text: 97 — Doch, was de Red von Ehrlichkeit, Von trüen Sinn in Freud un Leid, Kamm noch von't Voaderland derto — Denn flog åm ok de Schwanz mån so.
      1	97	_	NUM	NUM	_	3	compound	_	_
      2	-	_	PUNCT	$(	_	1	punct	_	_
      3	Doch	_	ADV	ADV	_	12	advmod	_	_
      4	,	_	PUNCT	$,	_	3	punct	_	_
      5	was	_	AUX	AUX.3.Ind.Prs.Sg	Number=Sing|Person=3|Tense=Pres	12	aux	_	_
      6	de	_	DET	DET.Def.Fem.Nom.Sg	Case=Nom|Definite=Def|Gender=Fem|Number=Sing	7	det	_	_
      7	Red	_	NOUN	NOUN.Fem.Nom	Case=Nom|Gender=Fem	12	obj	_	_
      8	von	_	ADP	ADP	_	9	case	_	_
      9	Ehrlichkeit	_	NOUN	NOUN.Dat.Neut	Case=Dat|Gender=Neut	7	nmod	_	_
      10	,	_	PUNCT	$,	_	7	punct	_	_
      11	Von	_	ADP	ADP	_	12	mark	_	_
      12	trüen	_	ADJ	ADJ.Inf	VerbForm=Inf	0	root	_	_
      13	Sinn	_	NOUN	NOUN.Nom.Pl	Case=Nom|Number=Plur	12	nsubj	_	_
      14	in	_	ADP	ADP	_	15	case	_	_
      15	Freud	_	NOUN	NOUN	_	13	nmod	_	_
      16	un	_	CCONJ	CCONJ	_	17	cc	_	_
      17	Leid	_	VERB	VERB	_	15	conj	_	_
      18	,	_	PUNCT	$,	_	12	punct	_	_
      19	Kamm	_	VERB	VERB.2.Imp.Sg	Mood=Imp|Number=Sing|Person=2	12	conj	_	_
      20	noch	_	ADV	ADV	_	19	advmod	_	_
      21	von't	_	ADP	ADP	_	22	case	_	_
      22	Voaderland	_	NOUN	NOUN	_	19	obl	_	_
      23	derto	_	ADV	ADV	_	25	advmod	_	_
      24	-	_	PUNCT	$(	_	23	punct	_	_
      25	Denn	_	ADV	ADV	_	26	advmod	_	_
      26	flog	_	VERB	VERB.3.Ind.Prt.Sg.st	Number=Sing|Person=3|Tense=Past	19	conj	_	_
      27	åm	_	PRON	PRON	_	30	case	_	_
      28	ok	_	ADV	ADV	_	30	advmod	_	_
      29	de	_	DET	DET.Def.Fem.Nom.Sg	Case=Nom|Definite=Def|Gender=Fem|Number=Sing	30	det	_	_
      30	Schwanz	_	NOUN	NOUN	_	26	obl	_	_
      31	mån	_	ADV	ADV	_	26	advmod	_	_
      32	so	_	ADV	ADV	_	26	compound:prt	_	_
      33	.	_	PUNCT	$.	_	12	punct	_	_
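
The suggestion above (detect non-standard metadata, warn, keep it as a comment) could be sketched roughly as follows. This is an illustration only, not Annatto's implementation: the function name `classify_comment` and the regular expression are hypothetical, and the sketch assumes that a well-formed metadata line has the shape `# key = value` with a key that contains no whitespace.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumption: well-formed metadata looks like "# key = value",
# where the key contains neither whitespace nor "=".
METADATA_RE = re.compile(r"^#\s*([^\s=]+)\s*=\s*(.*)$")


def classify_comment(line):
    """Classify one CoNLL-U comment line.

    Returns ("metadata", key, value) for "# key = value" lines and
    ("comment", None, text) for everything else, logging a warning
    in the second case.
    """
    match = METADATA_RE.match(line)
    if match:
        return ("metadata", match.group(1), match.group(2).strip())
    text = line.lstrip("#").strip()
    logger.warning("non-standard metadata kept as comment: %r", text)
    return ("comment", None, text)
```

With this sketch, a line such as `# text: 97 ...` from the example is not silently repaired; it is reported and passed on as a plain comment.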
    
@chiarcos changed the title from "CoNLL-U validation/cleansing" to "CoNLL-U metadata validation/cleansing" on Jun 25, 2024
@MartinKl self-assigned this on Jun 25, 2024
@MartinKl
Collaborator

Thank you for the submission. Your request addresses several issues.

First, the dependency visualizer does not work because the graphml export sets the wrong node key. This is being fixed in #252.

About supporting the data you provided and/or other versions of CoNLL: We suggest sticking to the notation that uses = as the key-value delimiter for sentence annotations, since the delimiter seems easy to replace. We will nevertheless extend the conll module to import comments that do not start with key = as bare values, each of which will be added as a sentence annotation conll::comment holding that value. See #257 for more details. For your data, this would lead to annotations conll::comment="text: ..." for each sentence.
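
For data like the snippet in this issue, replacing the delimiter before conversion could look roughly like this. It is a minimal sketch, not part of Annatto: `normalize_metadata_line` is a hypothetical helper, and it assumes that the non-standard comments have the shape `# key: value` with a key made of letters, digits, and underscores.

```python
import re

# Assumption: non-standard metadata looks like "# key: value".
COLON_META_RE = re.compile(r"^#\s*(\w+)\s*:\s*(.*)$")


def normalize_metadata_line(line):
    """Rewrite '# key: value' as '# key = value'.

    Expects a line without its trailing newline. Token rows, standard
    '# key = value' comments, and all other lines are returned unchanged.
    """
    match = COLON_META_RE.match(line)
    if match is None:
        return line
    return f"# {match.group(1)} = {match.group(2)}"
```

Applied line by line to a file (after stripping the newline from each line), this leaves token rows and already valid metadata untouched.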

Are there any other features of CoNLL-X that you consider necessary?

@chiarcos
Author

chiarcos commented Jun 25, 2024

Thank you, #257 is the best way to deal with that IMHO.

As for other features of CoNLL-X, the last two columns have different functions (cf. https://aclanthology.org/W06-2920.pdf). I guess it's not worth supporting them, because they were not widely used in the first place and they pertain only to legacy data, which does not seem to be publicly available anymore (at least not from https://ilk.uvt.nl/conll/post_task_data.html). The format is still used by some older parsers, though, and is sometimes required as input for downstream tasks. So, while I would not advise going for full CoNLL-X support, I would suggest being robust against CoNLL-X input, i.e., checking whether CoNLL-X data with PHEAD (9th column) set to an integer would break the CoNLL-U conversion, because CoNLL-U expects only pairs of IDs and dependency labels there.
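
A check along these lines might look like the following sketch. It is only an illustration under simplifying assumptions: `ninth_column_status` is a hypothetical name, rows are expected to have exactly ten tab-separated columns, and the pattern for the CoNLL-U 9th column (DEPS) only accepts `_` or `|`-separated `head:deprel` pairs with plain integer heads (empty-node IDs such as `8.1` are ignored).

```python
import re

# Simplified CoNLL-U DEPS pattern: "_" or pairs like "4:nsubj|6:obj".
DEPS_RE = re.compile(r"^(_|\d+:[^|\s]+(\|\d+:[^|\s]+)*)$")


def ninth_column_status(row):
    """Report what the 9th column of a token row looks like.

    Returns "conllu" for "_" or head:deprel pairs, "conllx-phead" for a
    bare integer (as in CoNLL-X PHEAD), "malformed" if the row does not
    have ten columns, and "unknown" otherwise.
    """
    cols = row.rstrip("\n").split("\t")
    if len(cols) != 10:
        return "malformed"
    ninth = cols[8]
    if DEPS_RE.match(ninth):
        return "conllu"
    if re.fullmatch(r"[0-9]+", ninth):
        return "conllx-phead"
    return "unknown"
```

A converter could use such a status to warn about, or skip, a bare integer in the 9th column instead of trying to interpret it as a dependency pair.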

You can synthesize such data from CoNLL-U data by just copying the values from the HEAD column into the 9th column, and the values from the DEPREL column into the 10th column.
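
The synthesis described above can be sketched as a small row transformation. The function name `to_conllx_like` is hypothetical, and the sketch assumes tab-separated rows with exactly ten columns; comment lines, blank lines, and rows with a different column count are passed through unchanged.

```python
def to_conllx_like(row):
    """Copy HEAD (7th column) into the 9th column and the dependency
    label (8th column) into the 10th column of a CoNLL-U token row.

    Expects a row without its trailing newline.
    """
    if not row.strip() or row.startswith("#"):
        return row
    cols = row.split("\t")
    if len(cols) != 10:
        return row
    cols[8] = cols[6]  # HEAD -> 9th column (PHEAD in CoNLL-X)
    cols[9] = cols[7]  # dependency label -> 10th column (PDEPREL in CoNLL-X)
    return "\t".join(cols)
```

Running this over each line of a CoNLL-U file yields test input in which the 9th column holds a bare integer, which is the case the robustness check is about.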
