
Next design, directories oriented?/localized, is needed #69

Open
yarikoptic opened this issue Feb 16, 2022 · 2 comments

Comments

@yarikoptic
Member

yarikoptic commented Feb 16, 2022

This issue could be considered a duplicate of the earlier dandi/dandi-cli#848 (comment), but since that one started out as a suggestion for an alternative implementation within dandi-cli, I decided to file a separate one within fscacher: it gives better coverage of the situation, and I still hope that fscacher can be a reusable package providing the needed solution.

Here are our outstanding issues, all of which IMHO bottleneck on our initial simplistic fscacher implementation, which:

  • I1: uses the same cache for a function (or collection of functions) regardless of its parametrization, etc.
  • I2: places all caches into a centralized user-wide cache directory (e.g., ~/.cache/fscacher/{cachename}), including caches related to a dataset which might get removed
  • I3: uses joblib's memoize, so each invocation gets its own directory + 2 files on disk (thus 3 inodes); a usage sketch follows this list:
(dandi-devel) jovyan@jupyter-yarikoptic:~/.cache/fscacher/dandi-checksums/joblib/dandi/support/digests$
$ ls get_zarr_checksum/b74a957ae8f7e2e4fc2b07d1aaa73775/
metadata.json  output.pkl
  - in effect, this avoids the need for any locking, since the fingerprint defines the directory name to use
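For context, a minimal sketch of how the current decorator form is used, assuming fscacher's documented PersistentCache / memoize_path API; the function body is a placeholder (the real digest logic lives in dandi.support.digests):

from fscacher import PersistentCache

# One user-wide cache, shared regardless of parametrization (I1) and stored
# under ~/.cache/fscacher/dandi-checksums (I2); each miss creates a new
# directory with metadata.json + output.pkl via joblib (I3).
cache = PersistentCache(name="dandi-checksums")

@cache.memoize_path
def get_zarr_checksum(path):
    ...  # placeholder -- the actual checksum computation goes here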

Such a simplistic initial design has shown its limitations in practice.

The question is how we could redesign fscacher (or just extend it, since the current implementation is good in its simplicity) to possibly:

  • RF I3: provide an alternative (to a directory with 2 files) storage backend that would be
    • more efficient when working with lots of small files
    • less reliant on the filesystem and thus gentler on the storage backend's inodes
    • might need locking mechanisms added to guarantee consistency
  • RF I2 and/or maybe I1: provide some "locality":
    • "dataset level": keep the cache within any given dandiset, so that if the dandiset is removed, the cache goes along with it
    • "directory/path" (in some cases only): keep all fingerprints for files within a .zarr/ folder in a single cache "file" -- this would save inodes, and mv etc. operations would become a matter of copying/changing one such file
    • in principle, both localities could already be achieved with the current fscacher to some degree by establishing a separate cache per specific dataset/zarr file. We would just need to move away from the simple generic @decorator form and instantiate a cache per dandiset/zarr file in a given location (see the sketch after this list). For zarr support, though, some alternative storage backend would be needed
  • gain more versatile cleaning
    • maybe tie it in a little with the code, i.e., removal of a file by the code could trigger cleaning the cache for that path, although I am not sure it is really worth adding such explicit ties
  • support an "early decision" for directories (i.e., decide that a directory has changed and run the target compute while, in parallel, finishing the full directory fingerprint; this might require storing a "tree" fingerprint at least to some depth)

edit: adding an explicit section

Features

  • be able to query the status of / whether a path (directory) has changed. This would be of use for DataLad as well, so we might need to abstract those interfaces (one possible shape is sketched below)
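One possible shape for such an abstracted query interface; all names here are made up for illustration, not existing fscacher API:

from abc import ABC, abstractmethod
from pathlib import Path
from typing import Union

class FingerprintStore(ABC):
    """Hypothetical abstraction over fingerprint storage, so a consumer like
    DataLad can ask "did this path change?" without invoking any memoized
    computation."""

    @abstractmethod
    def has_changed(self, path: Union[str, Path]) -> bool:
        """Return True if the current fingerprint of `path` (file or
        directory) differs from the stored one."""

    @abstractmethod
    def refresh(self, path: Union[str, Path]) -> None:
        """Record the current fingerprint of `path` as the new baseline."""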

WDYT @jwodder -- do you have any ideas/vision for what the next design of fscacher could be to make our life in dandi-cli better?

@jwodder
Member

jwodder commented Feb 16, 2022

@yarikoptic I was actually considering an alternative implementation for fscacher recently, one that uses SQLite3 for storing data instead of a distributed file store. I have no idea what effect that would have on performance. Here are my notes so far:


Proposal for a SQLite-Based Reimplementation of fscacher

  • class: PathDatabase — a low-level database of data per fingerprinted file, used to implement caching as well as things that need to bypass direct caching, like upload: progress indication for "digesting"? dandi/dandi-cli#400

    • Constructor arguments:
      • dirpath: Union[str, Path] (required)
      • pickle_version
        • pickle_protocol?
        • See "Questions" regarding the default value
      • metadata: Optional[Dict[str, Any]]
        • database-wide parameters
        • Corresponds to PersistentCache's "tokens"
      • max_items: Optional[int]
      • max_item_age: Optional[timedelta]
      • autoclean: bool — whether to call clean() after every set()
    • On construction, if the supplied metadata (or pickle_version?) does not match what the database was saved with, the database is deleted and recreated.
    • On construction, if the database's schema version is not 1, an error is raised.
    • Method: for_path(path: Union[str, Path], parameters: Any = None) -> ContextManager[PathEntry]
      • Fingerprints the given path and returns an object for manipulating the corresponding entry in the database
      • The context manager holds an open cursor to the database and a lock (fasteners or flufl.lock; TBD) on the database file
    • Method: clean(max_items: Optional[int], max_item_age: Optional[timedelta]) -> None
    • Method: rename(oldpath: AnyPath, newpath: AnyPath) -> None
      • Renames for all stored parameter sets
      • Renaming directories containing symlinks could be a problem ...
    • Method: copy(oldpath: AnyPath, newpath: AnyPath) -> None
      • Makes copies for all stored parameter sets
      • Fingerprints the newpath in order to get timestamps
    • Method: delete(path: AnyPath) -> None?
  • class: PathEntry

    • method: get() -> Any
      • raises some custom error if there's no entry for the path in the database
    • method: set(x: Any) -> None
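To make the proposal concrete, here is how the API above might read in use; none of these names exist yet (PathDatabase, for_path, and the NoEntryError/compute_value names are from the proposal or invented here), so this is purely illustrative:

from pathlib import Path

db = PathDatabase(dirpath=Path.home() / ".cache" / "fscacher" / "dandi-checksums")
# for_path() fingerprints the path, takes the lock, and yields a PathEntry
with db.for_path("foo.zarr", parameters={"digest": "zarr-checksum"}) as entry:
    try:
        value = entry.get()  # hit: fingerprint and parameters both match
    except NoEntryError:     # the proposal's "custom error" for a missing entry
        value = compute_value("foo.zarr")  # placeholder for the real work
        entry.set(value)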

Database Schema

CREATE TABLE versions (
    -- Contains one row, holding the version of the database schema
    -- This document describes schema version 1
    schema_version INT NOT NULL,
    pickle_version INT NOT NULL
);

CREATE TABLE metadata (
    key TEXT NOT NULL,
    value BLOB  -- pickled value
);

CREATE TABLE nodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  -- AUTOINCREMENT requires INTEGER (not INT) in SQLite
    path BLOB NOT NULL,  -- fsencoded value
    filetype INT NOT NULL,  -- `stat.S_IFMT(file.stat().st_mode)`
        -- Should this just be an "isdir" bool instead?
    inode INT NOT NULL,  -- Ignored/zero for directories
    size INT NOT NULL,  -- Ignored/zero for directories
    mtime_ns INT NOT NULL,  -- Ignored/zero for directories
    ctime_ns INT NOT NULL,  -- Ignored/zero for directories
    parameters BLOB NOT NULL,  -- pickled data
    value BLOB NOT NULL,  -- pickled data
    last_accessed REAL NOT NULL,
    UNIQUE (path, parameters)
);

CREATE TABLE diritems (
    parent INT NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,
    child INT NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,
    UNIQUE (parent, child)
);
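A minimal sketch of initializing that schema with the standard-library sqlite3 module; note that foreign-key enforcement (needed for the ON DELETE CASCADE in diritems) is off by default in SQLite and must be enabled per connection:

import pickle
import sqlite3

SCHEMA_SQL = "..."  # the CREATE TABLE statements above, verbatim

def init_db(dbpath: str) -> sqlite3.Connection:
    # Sketch only: the metadata/pickle_version checks and the
    # delete-and-recreate logic described above are omitted.
    conn = sqlite3.connect(dbpath)
    conn.execute("PRAGMA foreign_keys = ON")  # make ON DELETE CASCADE actually fire
    conn.executescript(SCHEMA_SQL)
    conn.execute(
        "INSERT INTO versions (schema_version, pickle_version) VALUES (?, ?)",
        (1, pickle.HIGHEST_PROTOCOL),
    )
    conn.commit()
    return conn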

Questions

  • Should the default value of pickle_version be the same as that used by the running version of Python (thereby causing the database to be reset if accessed by two different versions of Python with different default pickle protocol versions), or should it be hardcoded at a certain value?
    • Alternatively, could we get away with only comparing unpickled values, thereby making pickle_version only relevant for writing (and thus not needing to be saved in the DB)? (see the illustration after this list)
  • [handling of symlinks]
    • [Ensuring that there are no circular references in diritems]
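On the pickle_version question above, a quick illustration of why comparing raw pickled bytes is fragile across protocols while comparing unpickled values is not:

import pickle

value = {"digest": "abc123"}
# The same value serializes to different bytes under different protocols ...
assert pickle.dumps(value, protocol=2) != pickle.dumps(value, protocol=5)
# ... but round-trips to an equal object either way, so comparing unpickled
# values would make pickle_version relevant only when writing.
assert pickle.loads(pickle.dumps(value, protocol=2)) == value
assert pickle.loads(pickle.dumps(value, protocol=5)) == value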

@yarikoptic
Member Author

yarikoptic commented Feb 22, 2022

Cool, thank you @jwodder! I haven't analyzed it in full; here are some notes on the DB aspect, which I think is there to address I3:

  • we would need to provide per-DB / per-process (and/or per-thread? not sure if SQLite is thread-safe) locking so that there are no race conditions, etc.
  • pickle_version -- IMHO it is OK to reset (or just include it as a token) if the pickle version differs
  • it would be nice to get some idea of how fast/slow such an SQLite-based implementation would be in comparison to the current one
  • looking at https://www.sqlite.org/speed.html, it might be worth considering (a) disabling sync if each operation stands on its own; (b) batching operations (e.g., while operating on directories) for a faster insert/update of the entire tree at once (see the sketch after this list)
  • edit: symlinks which are relative, plus renames, could indeed be tricky. It would require checking whether the resolved path for a symlink either remains the same absolute path or is under the directory being renamed, in both source and target
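A small self-contained sketch of points (a) and (b) from the sqlite.org/speed note above, using a toy table rather than the proposed schema:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache_entries (path BLOB PRIMARY KEY, value BLOB)")

# (a) Relax durability: a lost cache can simply be recomputed, so paying an
# fsync per statement is unnecessary overhead.
conn.execute("PRAGMA synchronous = OFF")

# (b) Batch all rows for a directory tree into one transaction rather than
# one implicit transaction (and one sync) per row.
fingerprints = [(b"/ds/a.dat", b"pickled-a"), (b"/ds/b.dat", b"pickled-b")]  # toy data
with conn:  # single commit for the whole batch
    conn.executemany(
        "INSERT OR REPLACE INTO cache_entries (path, value) VALUES (?, ?)",
        fingerprints,
    )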
