Bugs and improvements #7

Open
tboenig opened this issue Oct 11, 2023 · 0 comments

tboenig commented Oct 11, 2023

Thank you for sharing this corpus.
Creating ground truth (GT) is not an easy job. I looked at a random sample of the PAGE files in the pageXmlTranskribusCorrected folders.

I noticed the following problems:

  1. the entire text of a line is encoded at the Word level, as a single Word.
    Solution: convert the Word into a line
  2. drop capitals are often annotated as Graphic
  3. many separators are so-called fake separators and should be corrected
  4. a wish: Transkribus does not create valid PAGE instances. Annotations such as
    <TranskribusMetadata docId="188203" .../> can of course be commented out,
    but:
  • empty type="" attributes
  • empty id="" attributes should be corrected too.
    • the ALTO files contain very deeply structured data; unfortunately this information was not carried over when converting to PAGE-XML.
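
To make points 1 and 4 concrete, here is a minimal sketch of how such a cleanup could look with Python's standard library. It is not a tested fix for this corpus. The sample document, the function names and the rule "a lone Word whose text contains whitespace is really a line" are all my assumptions. Empty id="" attributes are not handled here, since id is required in PAGE and would need a fresh unique value rather than deletion.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened PAGE file showing both problems.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="p1.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1" type="">
      <Coords points="0,0 10,0 10,10 0,10"/>
      <TextLine id="l1">
        <Coords points="0,0 10,0 10,5 0,5"/>
        <Word id="w1">
          <Coords points="0,0 10,0 10,5 0,5"/>
          <TextEquiv><Unicode>the whole line as one word</Unicode></TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def namespace_of(root):
    """Return the '{uri}' prefix of the root tag, or '' if there is none."""
    return root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""


def promote_single_word_lines(root):
    """Move the text of a lone multi-token Word up to its TextLine."""
    ns = namespace_of(root)
    changed = 0
    # Collect the lines first so the tree is not modified while iterating.
    for line in list(root.iter(f"{ns}TextLine")):
        words = line.findall(f"{ns}Word")
        if len(words) != 1:
            continue
        word = words[0]
        text = word.findtext(f"{ns}TextEquiv/{ns}Unicode") or ""
        if len(text.split()) < 2:
            continue  # assumed to be a genuine one-word line; leave it alone
        position = list(line).index(word)
        line.remove(word)
        # Reuse the Word's TextEquiv only if the line has none of its own.
        if line.find(f"{ns}TextEquiv") is None:
            line.insert(position, word.find(f"{ns}TextEquiv"))
        changed += 1
    return changed


def drop_empty_type_attributes(root):
    """Delete type="" attributes and return how many were removed."""
    removed = 0
    for element in root.iter():
        if element.get("type") == "":
            del element.attrib["type"]
            removed += 1
    return removed


root = ET.fromstring(SAMPLE)
lines_fixed = promote_single_word_lines(root)
types_dropped = drop_empty_type_attributes(root)
print(lines_fixed, types_dropped)
```

On real files one would parse each page with `ET.parse`, apply both functions and write the result back. Lines that really consist of one word keep their Word element because of the whitespace check.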

I would be happy to help you improve the data as far as I am able.
Thanks again for everything.
