Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce lazy_tree (super dictionaries) #1733

Open
wants to merge 44 commits into
base: main
Choose a base branch
from

Conversation

braingram
Copy link
Contributor

@braingram braingram commented Jan 12, 2024

Description

This PR adds:

  • lazy_tree option to AsdfConfig
  • lazy_tree argument to asdf.open (defaults to AsdfConfig.lazy_tree)
  • Converter.lazy attribute used to indicate if a converter supports "lazy" objects
  • asdf.lazy_nodes "lazy" container classes for list, dict, ordered dict

By default the "lazy" option is False.

When lazy_tree is True and an ASDF file is opened the tagged nodes in the tree are not immediately converted to custom objects. Instead, the containers in the tree (dicts, lists, OrderedDicts) are replaced with AsdfNode subclasses that act like these containers and convert tagged values to custom objects when they are accessed (See #1705 for discussion of this feature). During conversion, if asdf encounters a Converter that either defines lazy=False or does not define lazy the remainder of the branch will be converted to non-"lazy" objects and passed to the Converter. If instead the Converter defines lazy=True the "lazy" object (ie a AsdfDictNode for a dict) will be passed to the Converter.

Checklist:

  • pre-commit checks ran successfully
  • tests ran successfully
  • for a public change, a changelog entry was added
  • for a public change, documentation was updated
  • for any new features, unit tests were added

@braingram braingram marked this pull request as ready for review January 17, 2024 16:31
@braingram braingram requested a review from a team as a code owner January 17, 2024 16:31
@braingram braingram added this to the 3.1.0 milestone Jan 17, 2024
asdf/_tests/conftest.py Show resolved Hide resolved
asdf/_tests/core/_converters/test_complex.py Show resolved Hide resolved
asdf/_tests/test_lazy_nodes.py Show resolved Hide resolved
asdf/exceptions.py Show resolved Hide resolved
asdf/lazy_nodes.py Outdated Show resolved Hide resolved
asdf/lazy_nodes.py Outdated Show resolved Hide resolved
asdf/lazy_nodes.py Outdated Show resolved Hide resolved
asdf/lazy_nodes.py Outdated Show resolved Hide resolved
asdf/lazy_nodes.py Show resolved Hide resolved
asdf/lazy_nodes.py Outdated Show resolved Hide resolved
@braingram
Copy link
Contributor Author

converted to draft until #1733 (comment) is addressed

@braingram braingram force-pushed the feature/lazy_tree branch 2 times, most recently from b14e530 to e359dde Compare May 13, 2024 15:58
@braingram
Copy link
Contributor Author

The following branch of roman_datamodels adds lazy to the node converters (and makes some minor node changes to account for AsdfDictNode not passing isinstance(..., dict) etc) to allow lazy loading of roman trees:
https://github.com/spacetelescope/roman_datamodels/compare/main...braingram:roman_datamodels:lazy?expand=1

@braingram braingram marked this pull request as ready for review May 14, 2024 16:39
@braingram braingram requested a review from eslavich May 14, 2024 16:39
@braingram
Copy link
Contributor Author

braingram commented Jul 1, 2024

JWST regtests: https://plwishmaster.stsci.edu:8081/job/RT/job/JWST-Developers-Pull-Requests/1571/
passed with 2 unrelated (and common random) failures

romancal regtests: https://github.com/spacetelescope/RegressionTests/actions/runs/9746576095
(I ran the romancal tests with photutils==1.12.0 since 1.13.0 is currently breaking main: spacetelescope/romancal#1291)
ran with no failures

@braingram
Copy link
Contributor Author

@nden @perrygreenfield the regtests all pass with this PR (except for the 2 jwst tests that randomly and frequently fail).

@nden
Copy link
Contributor

nden commented Jul 3, 2024

If I understand how this works from the description, once I request a quantity array, all quantity arrays are loaded into memory. Is this correct?

@nden
Copy link
Contributor

nden commented Jul 3, 2024

No, there's something else, not sure what. The above comment is true for quantity arrays. For numpy arrays, it works as expected. Loading one array does not load any other arrays.

@braingram
Copy link
Contributor Author

Thanks for giving it a try. What file did you use for testing? If it's a roman file things will behave differently if you're using roman_datamodels main vs the "lazy" branch linked above. I think this points to this feature (and PR) needing more documentation.

Here's a non-roman example (please let me know if you give it a try and find anything different from the example) it doesn't require any special versions of anything (except for using asdf from the source branch for this PR).

import asdf
import numpy as np
import astropy.units as u

# make 5 quantiy arrays
qs = [u.Quantity(np.zeros(3+i) + i, u.m) for i in range(5)]

# save them to an ASDF file
af = asdf.AsdfFile()
af["qs"] = qs
af.write_to("test.asdf")

# open the file with a "lazy_tree"
with asdf.open("test.asdf", lazy_tree=True) as af:
    # When opened asdf always reads the first and last block
    # (this is true for lazy and non-lazy trees). Since we
    # are using a 'lazy_tree' only these blocks will be loaded
    # and since these are lazy blocks just the headers will be read.

    print("before accessing quantities")
    print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")

    # Since we're using a 'lazy_tree' the 'qs' 'list' will be
    # a special AsdfListNode object
    print(f"'qs' type = {type(af['qs'])}")

    # Accessing the first quantity will convert the tagged
    # representation to a quantity
    print(f"qs[0] = {af['qs'][0]=}")
    # but no other blocks will be loaded
    print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")

    # Accessing the second quantity will cause a block to load
    print(f"qs[1] = {af['qs'][1]=}")
    print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")

When I run the example I get the following output:

before accessing quantities
Loaded blocks: [True, False, False, False, True]
'qs' type = <class 'asdf.lazy_nodes.AsdfListNode'>
qs[0] = af['qs'][0]=<Quantity [0., 0., 0.] m>
Loaded blocks: [True, False, False, False, True]
qs[1] = af['qs'][1]=<Quantity [1., 1., 1., 1.] m>
Loaded blocks: [True, True, False, False, True]

For the above example the "containers" (list-like AsdfListNode and dict-like AsdfDictNode) objects in the tree are made "lazy" (since lazy_tree=True) and the contained objects only deserialized when they are accessed. For the above example the index 1 item in the "qs" "list" isn't converted to a quantity until it's accessed with qs[1]. At that time asdf turns the tagged representation for index 1 into a quantity (which triggers loading the index 1 block). For the non-accessed items in the "list" (like qs[2]) they're never converted to a quantity in the above example (so for qs[2] the index 2 block is never loaded).

Roman files are a bit different because they use STNode subclasses for containers. If we create a fake "roman" file:

im = roman_datamodels.maker_utils.mk_level2_image()
af = asdf.AsdfFile()
af['roman'] = im
af.write_to("roman.asdf")

If we load it with asdf.open we'll see the following loaded blocks

>> af = asdf.open("roman.asdf", lazy_tree=True)
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000000001'

This is only because we haven't accessed the "roman" key from the lazy AsdfDictNode. Accessing "roman" (with the main branch of roman_datamodels) results in many blocks being loaded (every block that maps to a quantity):

>> af["roman"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'111100001101111'

This is because the converter that deserializes roman_datamodels.stnode.WfiImage doesn't set lazy=True (by default asdf assumes nothing in extensions is lazy). Since the converter for this object isn't lazy asdf will convert everything within the sub-tree that it hands to the converter before calling the converter (so the converter never sees a lazy node, matching the current asdf behavior). Accessing af["roman"] triggers asdf to convert everything within the WfiImage sub-tree.

If instead, we use the modified version of roman_datamodels (which sets lazy=True for the converter handling WfiImage) things are much "lazier".

>> af["roman"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000000001'

Here accessing just the top level "roman" doesn't trigger asdf to load everything within that sub-tree (since the converter has lazy=True) but if we access the data 1 block will be loaded (the exact one may differ depending on the order of the blocks).

>> af["roman"]["data"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000100001'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants