Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH(TST): add a test to verify correct operation on closures/local functions #18

Open
yarikoptic opened this issue Nov 18, 2020 · 12 comments
Assignees

Comments

@yarikoptic
Copy link
Member

as given in example of #17

if that could not work (e.g. we then need to add closure's space into fingerprinted signature, and somehow cannot/should not do that) -- we should raise some exception (might be NotImplementedError("caching on functions with closures is not supported"))

the same probably would go for functions accessing global variables space, but I guess it is nothing we could deduce/control for, right?

@jwodder
Copy link
Member

jwodder commented Nov 18, 2020

The tests already apply memoization to local functions, e.g., here:

def test_memoize_path(cache, tmp_path):
calls = []
@cache.memoize_path
def memoread(path, arg, kwarg=None):
calls.append([path, arg, kwarg])
with open(path) as f:
return f.read()

@yarikoptic
Copy link
Member Author

this is within a single test function run. What if test_memoize_path is actually a function invoked within a test multiple times?

could we detect whenever function uses global or some other closure-ed namespace?

@jwodder
Copy link
Member

jwodder commented Nov 18, 2020

@yarikoptic The following test succeeds (Is this what you had in mind?), though I don't immediately see anything in joblib that would address this:

def test_memoize_path_closure(cache, monkeypatch, tmp_path):

    def mkfunc(x):
        @cache.memoize_path
        def func(path):
            return (x, os.stat(path).st_size)
        return func

    monkeypatch.chdir(tmp_path)
    fname = "foo.txt"
    with open(fname, "wb") as fp:
        fp.write(b"This is test text.\n")
    func1 = mkfunc(23)
    func2 = mkfunc(42)
    assert func1(fname) == (23, 19)
    assert func2(fname) == (42, 19)

@yarikoptic
Copy link
Member Author

can we detect a closure (i.e. whenever func does uses some outside variables)?

@jwodder
Copy link
Member

jwodder commented Nov 19, 2020

@yarikoptic The closest thing I can find is the __closure__ attribute of function objects, which records the function's "free variables." I'm not entirely sure what counts for that and what doesn't, but see the following:

>>> x = 42
>>> def foo():
...     return x
... 
>>> foo.__closure__
>>> 
>>> def bar(x):
...     def baz():
...         return x
...     return baz
... 
>>> bar(42).__closure__
(<cell at 0x104a1b7f0: int object at 0x10494ce50>,)
>>> 

@yarikoptic
Copy link
Member Author

cool. Let's just add NotImplementedError("Detected function with a closure, might be unsafe to cache, do not use fscacher") if any non empty __closure__ is detected until we run into use case where we need to somehow support that.

@jwodder
Copy link
Member

jwodder commented Nov 19, 2020

@yarikoptic The memoread() function that I pointed to above has a non-empty __closure__, so that makes the tests fail.

@yarikoptic
Copy link
Member Author

Tracking for closures is largely a "safety" measure, sure thing people (and us) can find ways around. I guess "workaround" would be to use some global or create a class to collect information on those calls, or use some mocking. Up to you how to "workaround"

@jwodder
Copy link
Member

jwodder commented Nov 20, 2020

@yarikoptic You're suggesting that I rewrite the tests so that they don't use local functions so that the library can fail when it detects a closure? I feel that that's a bad idea, especially since the test I posted above suggests that fscacher already correctly handles closures via some behavior of joblib that I can't find with a quick search. I think the library should continue being closure-agnostic, we add the above test (and maybe a couple others) to the test suite, and if a problem with closures ever occurs that the library doesn't handle properly, then we can revisit this issue.

@yarikoptic
Copy link
Member Author

I feel that that's a bad idea, especially since the test I posted above suggests that fscacher already correctly handles closures via some behavior of joblib that I can't find with a quick search.

my bad, indeed it does somehow! magic... ok, let's continue to be "closure-agnostic" but can that test also be extended to verify that subsequent mkfunc(23); mkfunc(42) would pick up cached values? due to the magic done by joblib on handling closures, I guess surprises might come

@jwodder
Copy link
Member

jwodder commented Nov 20, 2020

@yarikoptic I'm confused. What I was going to post originally was:

So the test I posted above simply shows that, if joblib is caching the closures, then different instances of the "same" closure function get separate caches, which is good, and means that using fscache to memoize closures won't give wrong results (for pure functions, at least). However, the following test:

def test_memoize_path_closure(cache, monkeypatch, tmp_path):
    calls = []

    def mkfunc(x):
        @cache.memoize_path
        def func(path):
            calls.append(x)
            return (x, os.stat(path).st_size)
        return func

    monkeypatch.chdir(tmp_path)
    fname = "foo.txt"
    with open(fname, "wb") as fp:
        fp.write(b"This is test text.\n")
    func1 = mkfunc(23)
    func2 = mkfunc(42)
    assert func1(fname) == (23, 19)
    assert calls == [23]
    assert func2(fname) == (42, 19)
    assert calls == [23, 42]
    assert func1(fname) == (23, 19)
    assert calls == [23, 42]

fails on the last line, where calls is actually [23, 42, 23]. The same thing happens if func appends to a file instead of an outer variable. Even if the func2 lines are commented out, the second call to func1 still appends to calls. Taken together with the fact that caching local functions inside test functions succeeds, my current understanding of joblib's function caching is that it works for local functions as long as you're using them inside the function where they're declared, but the caching apparently doesn't work for local functions/closures returned from a function.

But then I thought "Should this be regarded as a bug in joblib? I should write an MVCE." So I wrote this:

from pathlib import Path
from tempfile import TemporaryDirectory
from joblib import Memory

with TemporaryDirectory() as tmpdir_:
    tmpdir = Path(tmpdir_)
    cache = Memory(str(tmpdir / "foo"), verbose=0)

    def show_record(recordfile):
        try:
            return recordfile.read_text().splitlines()
        except FileNotFoundError:
            return []

    def local_closure(recordfile):
        @cache.cache
        def func(x):
            with recordfile.open("a") as fp:
                print(x, file=fp)
            return recordfile.stat().st_size

        print("record =", show_record(recordfile))
        print("func(23) =", func(23))
        print("record =", show_record(recordfile))
        print("func(42) =", func(42))
        print("record =", show_record(recordfile))
        print("func(23) =", func(23))
        print("record =", show_record(recordfile))

    local_closure(tmpdir / "record01.txt")

    print()

    def return_closure(recordfile):
        @cache.cache
        def func2(x):
            with recordfile.open("a") as fp:
                print(x, file=fp)
            return recordfile.stat().st_size
        return func2

    rec = tmpdir / "record02.txt"
    func = return_closure(rec)
    print("record =", show_record(rec))
    print("func(23) =", func(23))
    print("record =", show_record(rec))
    print("func(42) =", func(42))
    print("record =", show_record(rec))
    print("func(23) =", func(23))
    print("record =", show_record(rec))

    print()

    rec = tmpdir / "record03.txt"
    func = return_closure(rec)
    print("record =", show_record(rec))
    print("func(23) =", func(23))
    print("record =", show_record(rec))
    print("func(42) =", func(42))
    print("record =", show_record(rec))
    print("func(23) =", func(23))
    print("record =", show_record(rec))

and when I ran it, I found that the first two record+func blocks returned properly cached results, while, for the third block, show_record() was always [] and func returned the same values as the previous func, implying that jobclosure treated the two instances of func2 in return_closure as the same function and thus assigned them the same cache, which is at odds with the behavior in the test in my previous comment. (Is fscacher doing something to distinguish the two funcs in mkfunc? I don't see any such thing.) So now I don't know what to make of joblib's local function support.

@yarikoptic
Copy link
Member Author

thanks for digging! I have no immediate ideas, and memory fails me (again) -- I remember vaguely having to deal with something related to doing something to distinguish the two funcs in mkfunc?, will dig a bit later too.

So now I don't know what to make of joblib's local function support.

told you -- it is magic! ;) but we will figure it out and turn it either into boring "technology" or a bug ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants