Add parsers #53

Open
wants to merge 32 commits into
base: develop

@joeyaurel (Owner) commented Aug 7, 2020

This PR replaces the original PR #50 by @cdhorn, which had conflicts with the latest version of the develop branch.

Todo list:

Original post:

Hi Nick,
I have made a number of changes that I'm hoping you'll consider merging. It might have been better to implement each in a separate branch; I'm sort of new at this, so I apologize. To summarize them at a high level:

  • I removed FileElement, as it is really a duplicate of ObjectElement, and added SourceElement, RepositoryElement, NoteElement, HeaderElement, SubmitterElement and SubmissionElement.
  • I added a set of subparsers for all of the various substructures in the standard within the given record types.
  • Added a get_record() method to all record elements that parses the full record and returns it as structured data in a dict. A lot of this is logic I needed for something else I'm starting to toy with, and it seemed to make sense to have it in the base parser.
  • Added a Reader class that provides a couple of simple methods to fetch all the records of a given type, or all of them in one shot.
  • Added records.py with types for the Reader.
  • Broke exceptions out into errors.py.
  • Made further updates to tags.py to add a few more tags and fix some bugs and typos.
  • Added standards.py with links to the 5.5, 5.5.1, 5.5.1 GEDCOM-L, and 5.5.5 standards, and used those links when raising exceptions where applicable.
  • Added detect.py to detect the file encoding and the GEDCOM version. This adds a dependency on the chardet and ansel packages. It now opens and parses Ansel files, although I am not 100% sure I handled it right. Because the codec is set when the file is opened, the file is no longer opened in binary mode, and I removed the utf-8-sig encoding handling elsewhere. Please review those changes carefully; I've never really worked with different codecs and character sets before.
  • GEDCOM 5.5.5 has strict requirements around validating format and logical structure, so if a 5.5.5 file is detected, an exception is raised as the standard requires, although the parser can probably handle the format fine. You can remove this if you think it should not be done.
  • Added type hints to just about everything, so types should no longer be needed in the docstrings.
  • Cleaned up many docstrings and expanded them in a few areas.

Thanks,
Chris
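
The version and encoding checks described in the list above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code in this PR: `detect_header`, `check_supported`, and `UnsupportedVersionError` are hypothetical names, and the sketch only reads the values declared in the HEAD record (it does not sniff bytes the way chardet would).

```python
import re
from typing import Iterable, Optional, Tuple

# One GEDCOM line: level, optional @XREF@, tag, optional value.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s(.*))?$")


class UnsupportedVersionError(Exception):
    """Hypothetical error type; the PR's errors.py may use another name."""


def detect_header(lines: Iterable[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (version, declared_encoding) as stated in the HEAD record."""
    version = None
    encoding = None
    path = []  # tags of the enclosing levels, e.g. ["HEAD", "GEDC"]
    for raw in lines:
        # Drop a leading byte order mark and the line terminator.
        match = LINE_RE.match(raw.lstrip("\ufeff").rstrip("\r\n"))
        if match is None:
            continue
        level, tag, value = int(match.group(1)), match.group(3), match.group(4)
        if level == 0 and tag != "HEAD":
            break  # the header record has ended
        del path[level:]
        path.append(tag)
        # HEAD.SOUR.VERS is the product version, so match the full path.
        if path == ["HEAD", "GEDC", "VERS"]:
            version = (value or "").strip()
        elif path == ["HEAD", "CHAR"]:
            encoding = (value or "").strip()
    return version, encoding


def check_supported(version: Optional[str]) -> Optional[str]:
    """Reject 5.5.5 files, since this sketch does no strict validation."""
    if version == "5.5.5":
        raise UnsupportedVersionError(
            "GEDCOM 5.5.5 requires strict validation, which is not implemented"
        )
    return version


sample = [
    "0 HEAD",
    "1 SOUR Demo",
    "2 VERS 1.0",
    "1 GEDC",
    "2 VERS 5.5.1",
    "2 FORM LINEAGE-LINKED",
    "1 CHAR UTF-8",
    "0 @I1@ INDI",
    "1 NAME John /Doe/",
    "0 TRLR",
]
version, encoding = detect_header(sample)
check_supported(version)
```

Tracking the full tag path is what keeps the product version under `1 SOUR` from being mistaken for the GEDCOM version under `1 GEDC`.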

@joeyaurel joeyaurel added this to the 2.0.0 milestone Aug 7, 2020
@joeyaurel joeyaurel added this to To do in Python GEDCOM Parser via automation Aug 7, 2020
@joeyaurel joeyaurel self-assigned this Aug 7, 2020
@joeyaurel joeyaurel mentioned this pull request Aug 7, 2020
Repository owner deleted a comment Aug 8, 2020
@joeyaurel joeyaurel force-pushed the cdhorn-add-parsers branch 2 times, most recently from 0c2af1f to a3ebc2b Compare August 8, 2020 13:39
Repository owner deleted a comment Aug 10, 2020
cdhorn and others added 11 commits June 3, 2021 20:12
Removed FileElement, which duplicated the role of ObjectElement. Added Source, Repository, Note,
Header, Submission, and Submitter elements to handle all the record types defined in the standard.
Added subparsers for all of the substructures defined in the standard. Added a get_record method to
all record elements to parse the full record structure and return it as a dictionary. Added an
encoding check before processing the file to identify its type. If the Python Ansel module is
installed, it is used so that Ansel GEDCOM files can now be decoded. Verify that the encoding found
matches the encoding claimed. Get the version number. Check whether the file is 5.5.5 and, if so,
reject it as that standard requires, since no strict 5.5.5 reader exists yet. Added standards.py
with references to the different standards, used where needed. Added a Reader object to wrap the
Parser and provide get_records_by_type and get_all_records methods.
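
To illustrate the idea of a reader that groups records by type, here is a small self-contained sketch. It is not the Reader added in this PR: the class name `SketchReader`, the dict layout, and the line regex are assumptions made for the example, and only the method names `get_records_by_type` and `get_all_records` come from the commit message.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# One GEDCOM line: level, optional @XREF@, tag, optional value.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s(.*))?$")


class SketchReader:
    """Toy stand-in for a record reader; not the PR's implementation."""

    def __init__(self, lines: Iterable[str]) -> None:
        self._records: List[dict] = []
        current = None
        for raw in lines:
            match = LINE_RE.match(raw.strip())
            if match is None:
                continue
            level = int(match.group(1))
            xref, tag = match.group(2), match.group(3)
            value = match.group(4) or ""
            if level == 0:
                # Every level-0 line starts a new top-level record.
                current = {"type": tag, "xref": xref, "value": value, "lines": []}
                self._records.append(current)
            elif current is not None:
                # Subordinate lines are kept flat, in file order.
                current["lines"].append((level, tag, value))

    def get_records_by_type(self, record_type: str) -> List[dict]:
        """Return every top-level record whose tag equals record_type."""
        return [r for r in self._records if r["type"] == record_type]

    def get_all_records(self) -> Dict[str, List[dict]]:
        """Return all top-level records grouped by their tag."""
        grouped = defaultdict(list)
        for record in self._records:
            grouped[record["type"]].append(record)
        return dict(grouped)


reader = SketchReader([
    "0 HEAD",
    "1 CHAR UTF-8",
    "0 @I1@ INDI",
    "1 NAME John /Doe/",
    "0 @I2@ INDI",
    "1 NAME Jane /Doe/",
    "0 TRLR",
])
individuals = reader.get_records_by_type("INDI")
everything = reader.get_all_records()
```

A real reader would presumably return the nested structure produced by each element's get_record method rather than flat line tuples; the flat form is used here only to keep the sketch short.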
2 participants