
Next design, directories oriented?/localized, is needed #69

Open
yarikoptic opened this issue Feb 16, 2022 · 2 comments

Comments

@yarikoptic
Member

yarikoptic commented Feb 16, 2022

This issue could be considered a duplicate of the earlier dandi/dandi-cli#848 (comment), but since that one started out as a suggestion for an alternative implementation within dandi-cli, I decided to file a separate one within fscacher: it gives better coverage of the situation, and I still hope that fscacher can be a reusable package providing the needed solution.

Here are our outstanding issues, all of which IMHO bottleneck on our initial simplistic fscacher implementation, which:

  • I1: uses the same cache for a function (or collection of functions) regardless of its parametrization, etc.
  • I2: places all caches into a centralized user-wide cache directory (e.g., ~/.cache/fscacher/{cachename}), including caches related to a dataset which might get removed
  • I3: uses joblib's memoize, so each invocation gets its own directory + 2 files on disk (thus 3 inodes); a usage sketch follows this list:
(dandi-devel) jovyan@jupyter-yarikoptic:~/.cache/fscacher/dandi-checksums/joblib/dandi/support/digests$
$ ls get_zarr_checksum/b74a957ae8f7e2e4fc2b07d1aaa73775/
metadata.json  output.pkl
  - in effect, this avoids the need for any locking, since the fingerprint defines the directory name to use
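For context, a minimal sketch of how the current decorator form is used, assuming fscacher's documented PersistentCache / memoize_path API; the function body is a placeholder (the real digest logic lives in dandi.support.digests):

from fscacher import PersistentCache

# One user-wide cache, shared regardless of parametrization (I1) and stored
# under ~/.cache/fscacher/dandi-checksums (I2); each miss creates a new
# directory with metadata.json + output.pkl via joblib (I3).
cache = PersistentCache(name="dandi-checksums")

@cache.memoize_path
def get_zarr_checksum(path):
    ...  # placeholder -- the actual checksum computation goes here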

Such a simplistic initial design has shown its limitations in practice.

The question is how we could redesign fscacher (or just extend it, since the current implementation is good in its simplicity) to possibly:

  • RF I3: provide an alternative (to a directory with 2 files) storage backend that would be
    • more efficient when working with lots of small files
    • less reliant on the filesystem and thus gentler on the storage backend's inodes
    • might need locking mechanisms added to guarantee consistency
  • RF I2 and/or maybe I1: provide some "locality":
    • "dataset level": keep the cache within any given dandiset, so that if the dandiset is removed, the cache goes along with it
    • "directory/path" (in some cases only): keep all fingerprints for files within a .zarr/ folder in a single cache "file" -- this would save inodes, and mv etc. operations would become a matter of copying/changing one such file
    • in principle, both localities could already be achieved with the current fscacher to some degree by establishing a separate cache per specific dataset/zarr file. We would just need to move away from the simple generic @decorator form and instantiate a cache per dandiset/zarr file in a given location (see the sketch after this list). For zarr support, though, some alternative storage backend would be needed
  • gain more versatile cleaning
    • maybe tie it in a little with the code, i.e., removal of a file by the code could trigger cleaning the cache for that path, although I am not sure it is really worth adding such explicit ties
  • support an "early decision" for directories (i.e., decide that a directory has changed and run the target compute while, in parallel, finishing the full directory fingerprint; this might require storing a "tree" fingerprint at least to some depth)

edit: adding an explicit section

Features

  • be able to query the status of / whether a path (directory) has changed. This would be of use for DataLad as well, so we might need to abstract those interfaces (one possible shape is sketched below)
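One possible shape for such an abstracted query interface; all names here are made up for illustration, not existing fscacher API:

from abc import ABC, abstractmethod
from pathlib import Path
from typing import Union

class FingerprintStore(ABC):
    """Hypothetical abstraction over fingerprint storage, so a consumer like
    DataLad can ask "did this path change?" without invoking any memoized
    computation."""

    @abstractmethod
    def has_changed(self, path: Union[str, Path]) -> bool:
        """Return True if the current fingerprint of `path` (file or
        directory) differs from the stored one."""

    @abstractmethod
    def refresh(self, path: Union[str, Path]) -> None:
        """Record the current fingerprint of `path` as the new baseline."""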

WDYT @jwodder -- do you have any ideas/vision for what the next design of fscacher could be to make our life in dandi-cli better?

@jwodder
Member

jwodder commented Feb 16, 2022

@yarikoptic I was actually considering an alternative implementation for fscacher recently, one that uses SQLite3 for storing data instead of a distributed file store. I have no idea what effect that would have on performance. Here are my notes so far:


Proposal for a SQLite-Based Reimplementation of fscacher

  • class: PathDatabase — a low-level database of data per fingerprinted file, used to implement caching as well as things that need to bypass direct caching, like upload: progress indication for "digesting"? dandi/dandi-cli#400

    • Constructor arguments:
      • dirpath: Union[str, Path] (required)
      • pickle_version
        • pickle_protocol?
        • See "Questions" regarding the default value
      • metadata: Optional[Dict[str, Any]]
        • database-wide parameters
        • Corresponds to PersistentCache's "tokens"
      • max_items: Optional[int]
      • max_item_age: Optional[timedelta]
      • autoclean: bool — whether to call clean() after every set()
    • On construction, if the supplied metadata (or pickle_version?) does not match what the database was saved with, the database is deleted and recreated.
    • On construction, if the database's schema version is not 1, an error is raised.
    • Method: for_path(path: Union[str, Path], parameters: Any = None) -> ContextManager[PathEntry]
      • Fingerprints the given path and returns an object for manipulating the corresponding entry in the database
      • The context manager holds an open cursor to the database and a lock (fasteners or flufl.lock; TBD) on the database file
    • Method: clean(max_items: Optional[int], max_item_age: Optional[timedelta]) -> None
    • Method: rename(oldpath: AnyPath, newpath: AnyPath) -> None
      • Renames for all stored parameter sets
      • Renaming directories containing symlinks could be a problem ...
    • Method: copy(oldpath: AnyPath, newpath: AnyPath) -> None
      • Makes copies for all stored parameter sets
      • Fingerprints the newpath in order to get timestamps
    • Method: delete(path: AnyPath) -> None?
  • class: PathEntry

    • method: get() -> Any
      • raises some custom error if there's no entry for the path in the database
    • method: set(x: Any) -> None
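To make the proposal concrete, here is how the API above might read in use; none of these names exist yet (PathDatabase, for_path, and the NoEntryError/compute_value names are from the proposal or invented here), so this is purely illustrative:

from pathlib import Path

db = PathDatabase(dirpath=Path.home() / ".cache" / "fscacher" / "dandi-checksums")
# for_path() fingerprints the path, takes the lock, and yields a PathEntry
with db.for_path("foo.zarr", parameters={"digest": "zarr-checksum"}) as entry:
    try:
        value = entry.get()  # hit: fingerprint and parameters both match
    except NoEntryError:     # the proposal's "custom error" for a missing entry
        value = compute_value("foo.zarr")  # placeholder for the real work
        entry.set(value)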

Database Schema

CREATE TABLE versions (
    -- Contains one row, holding the version of the database schema
    -- This document describes schema version 1
    schema_version INT NOT NULL,
    pickle_version INT NOT NULL
);

CREATE TABLE metadata (
    key TEXT NOT NULL,
    value BLOB  -- pickled value
);

CREATE TABLE nodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  -- AUTOINCREMENT requires INTEGER (not INT) in SQLite
    path BLOB NOT NULL,  -- fsencoded value
    filetype INT NOT NULL,  -- `stat.S_IFMT(file.stat().st_mode)`
        -- Should this just be an "isdir" bool instead?
    inode INT NOT NULL,  -- Ignored/zero for directories
    size INT NOT NULL,  -- Ignored/zero for directories
    mtime_ns INT NOT NULL,  -- Ignored/zero for directories
    ctime_ns INT NOT NULL,  -- Ignored/zero for directories
    parameters BLOB NOT NULL,  -- pickled data
    value BLOB NOT NULL,  -- pickled data
    last_accessed REAL NOT NULL,
    UNIQUE (path, parameters)
);

CREATE TABLE diritems (
    parent INT NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,
    child INT NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,
    UNIQUE (parent, child)
);
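A minimal sketch of initializing that schema with the standard-library sqlite3 module; note that foreign-key enforcement (needed for the ON DELETE CASCADE in diritems) is off by default in SQLite and must be enabled per connection:

import pickle
import sqlite3

SCHEMA_SQL = "..."  # the CREATE TABLE statements above, verbatim

def init_db(dbpath: str) -> sqlite3.Connection:
    # Sketch only: the metadata/pickle_version checks and the
    # delete-and-recreate logic described above are omitted.
    conn = sqlite3.connect(dbpath)
    conn.execute("PRAGMA foreign_keys = ON")  # make ON DELETE CASCADE actually fire
    conn.executescript(SCHEMA_SQL)
    conn.execute(
        "INSERT INTO versions (schema_version, pickle_version) VALUES (?, ?)",
        (1, pickle.HIGHEST_PROTOCOL),
    )
    conn.commit()
    return conn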

Questions

  • Should the default value of pickle_version be the same as that used by the running version of Python (thereby causing the database to be reset if accessed by two different versions of Python with different default pickle protocol versions), or should it be hardcoded at a certain value?
    • Alternatively, could we get away with only comparing unpickled values, thereby making pickle_version only relevant for writing (and thus not needing to be saved in the DB)? (see the illustration after this list)
  • [handling of symlinks]
    • [Ensuring that there are no circular references in diritems]
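On the pickle_version question above, a quick illustration of why comparing raw pickled bytes is fragile across protocols while comparing unpickled values is not:

import pickle

value = {"digest": "abc123"}
# The same value serializes to different bytes under different protocols ...
assert pickle.dumps(value, protocol=2) != pickle.dumps(value, protocol=5)
# ... but round-trips to an equal object either way, so comparing unpickled
# values would make pickle_version relevant only when writing.
assert pickle.loads(pickle.dumps(value, protocol=2)) == value
assert pickle.loads(pickle.dumps(value, protocol=5)) == value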

@yarikoptic
Member Author

yarikoptic commented Feb 22, 2022

Cool, thank you @jwodder! I haven't analyzed it in full; here are some notes on the DB aspect, which I think is there to address I3:

  • we would need to provide per-DB / per-process (and/or per-thread? not sure if SQLite is thread-safe) locking so that there are no race conditions, etc.
  • pickle_version -- IMHO it is OK to reset (or just include it as a token) if the pickle version differs
  • it would be nice to get some idea of how fast/slow such an SQLite-based implementation would be in comparison to the current one
  • looking at https://www.sqlite.org/speed.html, it might be worth considering (a) disabling sync if each operation stands on its own; (b) batching operations (e.g., while operating on directories) for a faster insert/update of the entire tree at once (see the sketch after this list)
  • edit: symlinks which are relative, plus renames, could indeed be tricky. It would require checking whether the resolved path for a symlink either remains the same absolute path or is under the directory being renamed, in both source and target
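A small self-contained sketch of points (a) and (b) from the sqlite.org/speed note above, using a toy table rather than the proposed schema:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache_entries (path BLOB PRIMARY KEY, value BLOB)")

# (a) Relax durability: a lost cache can simply be recomputed, so paying an
# fsync per statement is unnecessary overhead.
conn.execute("PRAGMA synchronous = OFF")

# (b) Batch all rows for a directory tree into one transaction rather than
# one implicit transaction (and one sync) per row.
fingerprints = [(b"/ds/a.dat", b"pickled-a"), (b"/ds/b.dat", b"pickled-b")]  # toy data
with conn:  # single commit for the whole batch
    conn.executemany(
        "INSERT OR REPLACE INTO cache_entries (path, value) VALUES (?, ?)",
        fingerprints,
    )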
