FSRS-AnnA Integration #22

Open
brubsby opened this issue Sep 19, 2023 · 19 comments

@brubsby

brubsby commented Sep 19, 2023

Just inquiring whether there's a reason pandas is locked at version 1.2.3; my machine is having issues installing this specific version. I figure it might be due to ankipandas requiring it, but didn't see any info about this after a cursory search.

FWIW, I found this addon via a discussion in the FSRS github centered around improving the accuracy of the FSRS algorithm by identifying "conceptual" or (you might call them) "semantic" siblings. So I wanted to try this out, as it seemed a very promising direction of research.

@brubsby
Author

brubsby commented Sep 19, 2023

Found 681 cards...

Removed overlap between rated cards and due cards: 6 cards removed. Keeping 675 cards.

Asking Anki for information about 687 cards...
(Large number of cards to retrieve: creating 10 threads of size 140)

Done threads: 100%|██████████| 5/5 [00:04<00:00,  1.10thread/s]
Traceback (most recent call last):
  File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 2550, in <module>
    anna = AnnA(**args)
  File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 529, in __init__
    self._init_dataFrame()
  File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 772, in _init_dataFrame
    self.df = pd.DataFrame().append(list_cardInfo,
  File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\venv\lib\site-packages\pandas\core\generic.py", line 6202, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'. Did you mean: '_append'?
Got all infos in 4 seconds.


Process finished with exit code 1

I suppose I have found a reason :^)

@brubsby
Author

brubsby commented Sep 19, 2023

Seems fixed with this change at AnnA.py:772

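        # DataFrame.append no longer exists in pandas 2, so build the frame directly
        self.df = pd.DataFrame(list_cardInfo)
        # cast explicitly to 64-bit so that large card ids are not truncated
        self.df["cardId"] = self.df["cardId"].astype('int64')
        self.df = self.df.set_index("cardId").sort_index()
        self.df["interval"] = self.df["interval"].astype(float)
        return True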

My card ids were very large for some reason and were overflowing when truncated to pandas 2 default int types.
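
For readers hitting the same traceback, here is a minimal, self-contained sketch of the fix above. `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the frame is built directly from the list of dicts. The sample records and the name `list_card_info` are made up for illustration. As far as I know, Anki card ids are creation timestamps in epoch milliseconds, which would explain why they do not fit in a 32-bit integer.

```python
import pandas as pd

# Hypothetical stand-in for the per-card dicts returned by Anki.
list_card_info = [
    {"cardId": 1695123456789, "interval": 3},
    {"cardId": 1600000000000, "interval": 21},
]

# Millisecond timestamps are far larger than the biggest signed 32-bit integer.
INT32_MAX = 2**31 - 1
assert all(card["cardId"] > INT32_MAX for card in list_card_info)

# Build the frame in one go instead of calling the removed DataFrame.append.
df = pd.DataFrame(list_card_info)
# Request 64-bit ids explicitly rather than relying on a platform default.
df["cardId"] = df["cardId"].astype("int64")
df = df.set_index("cardId").sort_index()
df["interval"] = df["interval"].astype(float)

print(df)
```

When rows really do arrive in several batches, `pd.concat` on a list of frames is the usual replacement for repeated `append` calls.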

But now I get:

Computing distance matrix on all available cores...
Scaling each vertical row of the distance matrix...
Scaling: 100%|██████████| 687/687 [00:00<00:00, 6279.42card/s]
Computing mean and std of distance...
(excluding diagonal)
Mean distance: 0.78, std: 0.13

Traceback (most recent call last):
  File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 2550, in <module>
    anna = AnnA(**args)
  File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 539, in __init__
    self._compute_distance_matrix()
  File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 1344, in _compute_distance_matrix
    self._print_similar()
  File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 1355, in _print_similar
    signal.signal(signal.SIGALRM, time_watcher)
AttributeError: module 'signal' has no attribute 'SIGALRM'. Did you mean: 'SIGABRT'?

Process finished with exit code 1

Which is likely due to signal.SIGALRM not being cross-platform (I'm on Windows). I think there's a good Python timeout library whose name I may remember soon.
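
For context, `signal.SIGALRM` only exists on Unix, which is why the attribute lookup fails on Windows. Below is a rough sketch of a timeout that relies only on the standard library's `threading` module and therefore also runs on Windows. It is an illustration, not AnnA's code and not the third-party library named later in the thread; `run_with_timeout` and `TimedOut` are hypothetical names.

```python
import threading
import time


class TimedOut(Exception):
    """Raised when the wrapped call does not finish in time (hypothetical name)."""


def run_with_timeout(seconds, func, *args, **kwargs):
    """Run func in a daemon thread and stop waiting after `seconds`.

    The worker thread is not killed on timeout, it is only abandoned,
    so this suits optional work such as printing similar cards.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args, **kwargs)
        except BaseException as exc:  # forward any error to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(seconds)
    if thread.is_alive():
        raise TimedOut(f"call did not finish within {seconds}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


try:
    run_with_timeout(0.1, time.sleep, 2)
except TimedOut:
    print("Taking too long, skipping")
```

The trade-off is that the abandoned thread keeps running in the background until it finishes or the process exits.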

@brubsby
Author

brubsby commented Sep 19, 2023

Found it, I think it's better at cross platform than signal, but I could be wrong, maybe I'll try to give you a pr with it to see if it works on your machine

from func_timeout import func_timeout, FunctionTimedOut

then at line 1344:

        if self.skip_print_similar is False:
            try:
                func_timeout(60, self._print_similar)
            except FunctionTimedOut:
                red("Taking too long to find similar nonequal cards, skipping")
        return True

@brubsby
Author

brubsby commented Sep 19, 2023

Seems to work now after fixing some int overflow errors with either the new pandas version, or my specific deck, but I'm not as well versed in the usual functioning of the script, so I'm not 100% confident one of the "tasks" I wasn't testing isn't also failing for a reason. But should be enough for a PR perhaps.

I do get this error a lot, and an eardrum bursting beedoop every time, but it seems to work in spite of that:

Copying anki database to local cache file
Ankipandas will use anki collection found at C:\Users\brubsby\AppData\Roaming\Anki2\brubsby\collection.anki2
NOTIF: Exception : [Errno 22] Invalid argument: './cache/None_All::Linguistics::Acquisition::Sign'
Using fallback method...
Vectorizing using TFIDF:   0%|          | 0/687 [00:00<?, ?it/s]C:\GitProjects\AnnA_Anki_neuronal_Appendix\venv\lib\site-packages\sklearn\feature_extraction\text.py:525: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn(
Vectorizing using TFIDF: 100%|██████████| 687/687 [00:00<00:00, 21739.52it/s]
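
One plausible cause of the `[Errno 22] Invalid argument` above, not confirmed anywhere in this thread, is that the cache path contains the deck separator `::`, and `:` is not allowed in Windows file names. A minimal sketch of sanitizing a deck name before using it as a file name follows; `safe_cache_name` is a hypothetical helper, not part of AnnA.

```python
import re

# Characters that Windows rejects in file names, plus ASCII control characters.
_WINDOWS_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_cache_name(deck_name: str, replacement: str = "_") -> str:
    """Return a version of deck_name that is usable as a Windows file name."""
    cleaned = _WINDOWS_FORBIDDEN.sub(replacement, deck_name)
    # Windows also rejects names that end in a dot or a space.
    return cleaned.rstrip(". ") or replacement


print(safe_cache_name("None_All::Linguistics::Acquisition::Sign"))
# prints: None_All__Linguistics__Acquisition__Sign
```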

@brubsby
Author

brubsby commented Sep 20, 2023

#23
realized I had done the changes on main, and most of the work is on dev, went ahead and made the changes on dev too (as well as a couple others) that got it to work on my machine (and hopefully still continue to work on yours)

I'll continue to mess around with it and see if I can't better understand what exactly this program is doing and how to integrate it with FSRS (if at all possible)! Also probably worth mentioning that the dev version of the addon is incompatible with the main version of the addon, and figuring out how to "deploy" a dev version wasn't completely obvious to me, so maybe it warrants a section in the readme.

@brubsby
Author

brubsby commented Sep 20, 2023

The relative overdueness calculation seems incorrect: I have 100 cards scheduled with FSRS that are due today, and AnnA thinks 40% of them are dangerously overdue, even though only 12 are overdue at all.

@thiswillbeyourgithub
Owner

Seems to work now after fixing some int overflow errors with either the new pandas version, or my specific deck, but I'm not as well versed in the usual functioning of the script, so I'm not 100% confident one of the "tasks" I wasn't testing isn't also failing for a reason. But should be enough for a PR perhaps.

I do get this error a lot, and an eardrum bursting beedoop every time, but it seems to work in spite of that:

Copying anki database to local cache file
Ankipandas will use anki collection found at C:\Users\brubsby\AppData\Roaming\Anki2\brubsby\collection.anki2
NOTIF: Exception : [Errno 22] Invalid argument: './cache/None_All::Linguistics::Acquisition::Sign'
Using fallback method...
Vectorizing using TFIDF:   0%|          | 0/687 [00:00<?, ?it/s]C:\GitProjects\AnnA_Anki_neuronal_Appendix\venv\lib\site-packages\sklearn\feature_extraction\text.py:525: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn(
Vectorizing using TFIDF: 100%|██████████| 687/687 [00:00<00:00, 21739.52it/s]

So I clarified the error message a bit. Basically, using the whole deck means applying tfidf using vocabulary from the whole deck regardless of note types etc. For example, I have a source field for my medical cards that, thanks to OCR, is filled with words. This can be a useful weighting for tfidf or not, depending on the user, hence this whole_deck argument.

Another example is sorting your vocabulary cards without example sentences while vectorizing WITH example sentences.

It's fine for it to fail.

@thiswillbeyourgithub
Owner

Found it, I think it's better at cross platform than signal, but I could be wrong, maybe I'll try to give you a pr with it to see if it works on your machine

from func_timeout import func_timeout, FunctionTimedOut

then at line 1344:

        if self.skip_print_similar is False:
            try:
                func_timeout(60, self._print_similar)
            except FunctionTimedOut:
                red("Taking too long to find similar nonequal cards, skipping")
        return True

Would you mind saving me some time and sourcing your claim about func_timeout being (at least on paper) cross platform please?

@thiswillbeyourgithub
Owner

I'll continue to mess around with it and see if I can't better understand what exactly this program is doing and how to integrate it with FSRS (if at all possible)! Also probably worth mentioning that the dev version of the addon is incompatible with the main version of the addon, and figuring out how to "deploy" a dev version wasn't completely obvious to me, so maybe it warrants a section in the readme.

Well I'm very short on time so the idea is that the dev version is most of the time not 'production ready' and I usually do a merge every couple of months and check the addon then.

I have heard of FSRS but never actually took time to understand what it's all about, would you have a few links handy or an ELI5?

@thiswillbeyourgithub
Owner

The relative overdueness calculation seems incorrect: I have 100 cards scheduled with FSRS that are due today, and AnnA thinks 40% of them are dangerously overdue, even though only 12 are overdue at all.

Please tell me all the kwargs you're using to get at this result. If you can also print the logs here that might help. IIRC I had a slightly different implementation of the RO but thought it should have the same real life results.

@brubsby
Author

brubsby commented Sep 29, 2023

I'll continue to mess around with it and see if I can't better understand what exactly this program is doing and how to integrate it with FSRS (if at all possible)! Also probably worth mentioning that the dev version of the addon is incompatible with the main version of the addon, and figuring out how to "deploy" a dev version wasn't completely obvious to me, so maybe it warrants a section in the readme.

Well I'm very short on time so the idea is that the dev version is most of the time not 'production ready' and I usually do a merge every couple of months and check the addon then.

I have heard of FSRS but never actually took time to understand what it's all about, would you have a few links handy or an ELI5?

Not sure if there's a great ELI5 floating around, but I'll try to give it a shot:

ELI5: FSRS is a scheduling algorithm that learns from your review history to better schedule your reviews. (And is going to be implemented into Anki soon)

In more detail, it's essentially a machine-learning-based algorithm that optimizes an explainable model of your retention patterns per deck, using all of your historical reviews as time series data, in order to predict your recall of a card more accurately than basically any earlier algorithm has been able to. In doing so, it minimizes the amount of time you have to spend reviewing for the same retention level, often reducing the number of reviews one has to do by around 14%.

There's lots of information about it on the github's wiki:
https://github.com/open-spaced-repetition/fsrs4anki#introduction
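
To make "predict your recall of a card" a little more concrete: as far as I understand the wiki linked above, FSRS v4 models the probability of recall with a power forgetting curve in which a card's stability is the number of days after which recall drops to 90%. The sketch below assumes that v4 formula. Later versions use a different curve, and the optimizer that learns stability from review history is not shown.

```python
def retrievability(elapsed_days: float, stability: float) -> float:
    """Predicted recall probability, assuming the FSRS v4 forgetting curve.

    Stability is defined so that recall is 90% when elapsed_days == stability.
    """
    return 1.0 / (1.0 + elapsed_days / (9.0 * stability))


# A card with a stability of 10 days, checked at several delays.
for days in (0, 5, 10, 30):
    print(f"day {days:>2}: predicted recall {retrievability(days, 10.0):.3f}")
```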

However the algorithm doesn't take into account "conceptual siblings", like AnnA attempts to. I strongly believe that good integration of these two ideas could lead to large gains in study efficiency. Though it is certainly a very difficult problem, and depends to a large extent on the particulars of decks and how well AnnA approximates the true, in-brain, conceptual links between cards.

@brubsby
Author

brubsby commented Sep 29, 2023

The relative overdueness calculation seems incorrect: I have 100 cards scheduled with FSRS that are due today, and AnnA thinks 40% of them are dangerously overdue, even though only 12 are overdue at all.

Please tell me all the kwargs you're using to get at this result. If you can also print the logs here that might help. IIRC I had a slightly different implementation of the RO but thought it should have the same real life results.

I don't recall the kwargs and don't have a copy of the logs, but I suppose the relative overdueness could've been due to the fact that, were they buried, they would then be too relatively overdue, due to being freshly learned cards with short intervals like 1d. If that's the way the calculation works, this was probably what was happening.
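
A small worked example of that hypothesis, assuming relative overdueness is simply lateness divided by the scheduled interval. AnnA's actual formula is not shown in this thread, so `relative_overdueness` is a hypothetical helper.

```python
def relative_overdueness(days_overdue: float, interval_days: float) -> float:
    """Lateness as a fraction of the scheduled interval (assumed definition)."""
    return days_overdue / interval_days


# A card due today is 0 days late now, but 1 day late if postponed to tomorrow.
for interval in (1, 10, 100):
    ro_tomorrow = relative_overdueness(1, interval)
    print(f"interval {interval:>3}d -> RO if postponed one day: {ro_tomorrow:.2f}")
```

Under this definition a freshly learned card with a 1d interval is already 100% overdue after a single day of delay, while a card with a 100d interval is only 1% overdue, which would fit many short-interval cards being flagged at once.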

@brubsby
Author

brubsby commented Sep 29, 2023

Found it, I think it's better at cross platform than signal, but I could be wrong, maybe I'll try to give you a pr with it to see if it works on your machine

from func_timeout import func_timeout, FunctionTimedOut

then at line 1344:

        if self.skip_print_similar is False:
            try:
                func_timeout(60, self._print_similar)
            except FunctionTimedOut:
                red("Taking too long to find similar nonequal cards, skipping")
        return True

Would you mind saving me some time and sourcing your claim about func_timeout being (at least on paper) cross platform please?

https://pypi.org/project/func-timeout/ says:

Support
I’ve tested func_timeout with python 2.7, 3.4, 3.5, 3.6, 3.7. It should work on other versions as well.
Works on windows, linux/unix, cygwin, mac

and I can personally vouch that it works on Windows, while the signal implementation currently employed in AnnA has no possibility of doing so.

as well, it seems to allow a much cleaner implementation in general, though my PR didn't do much more cleanup than was necessary

@brubsby
Author

brubsby commented Sep 29, 2023

Well I'm very short on time so the idea is that the dev version is most of the time not 'production ready' and I usually do a merge every couple of months and check the addon then.

This is understandable and probably a good way to go about it, I mostly just was mad at myself for accidentally pulling main when I meant to pull dev when I was making my changes.

@thiswillbeyourgithub
Owner

However the algorithm doesn't take into account "conceptual siblings", like AnnA attempts to. I strongly believe that good integration of these two ideas could lead to large gains in study efficiency. Though it is certainly a very difficult problem, and depends to a large extent on the particulars of decks and how well AnnA approximates the true, in-brain, conceptual links between cards.

That information was very helpful. Thank you very much.

I am now absolutely pumped at the idea of using AnnA to enhance FSRS :). I think it might be interesting to experiment with adding a new input feature to the NN that indicates whether a conceptually similar card was reviewed recently. It turns out that finding the k nearest neighbors of each card is not that computationally intensive and scales okay. The feature could simply be a softmaxed time distance to the most recent sibling. It might also be necessary to add another feature that indicates the grade of this latest sibling review.

Just thinking out loud of course. But the code of Anna might be the quickest way to extract a distance matrix to experiment with FSRS.
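
The nearest-neighbor feature floated above could be prototyped roughly as follows. This is only a sketch under assumptions: the distance matrix and review ages are toy values, the function names are hypothetical, and the exponential squashing with a `scale` parameter is a placeholder of mine rather than the "softmaxed" transform mentioned in the comment.

```python
import math


def nearest_neighbors(dist_row, self_index, k):
    """Indices of the k cards closest to self_index, excluding the card itself."""
    others = [j for j in range(len(dist_row)) if j != self_index]
    return sorted(others, key=lambda j: dist_row[j])[:k]


def sibling_recency_feature(dist, days_since_review, k=2, scale=7.0):
    """Per card, map the days since its most recently reviewed neighbor into (0, 1]."""
    features = []
    for i, row in enumerate(dist):
        neighbors = nearest_neighbors(row, i, k)
        most_recent = min(days_since_review[j] for j in neighbors)
        # Values near 1 mean a similar card was reviewed very recently.
        features.append(math.exp(-most_recent / scale))
    return features


# Toy symmetric distance matrix for four cards (made-up values).
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
]
days_since_review = [1.0, 30.0, 2.0, 14.0]

print(sibling_recency_feature(dist, days_since_review, k=1))
```

Sorting every row is fine for a few hundred cards. For large collections a partial sort or an approximate nearest-neighbor index would probably be needed.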

I don't recall the kwargs and don't have a copy of the logs, but I suppose the relative overdueness could've been due to the fact that, were they buried, they would then be too relatively overdue, due to being freshly learned cards with short intervals like 1d. If that's the way the calculation works, this was probably what was happening.

If that happens again we can take a look. I might have made mistakes!

and I can personally vouch that it works on Windows, while the signal implementation currently employed in AnnA has no possibility of doing so.
as well, it seems to allow a much cleaner implementation in general, though my PR didn't do much more cleanup than was necessary

Great, happy to merge the PR once you make it cleaner and on the dev branch. Thanks a lot!

This is understandable and probably a good way to go about it, I mostly just was mad at myself for accidentally pulling main when I meant to pull dev when I was making my changes.

Happens to me all the time :)

@brubsby
Author

brubsby commented Oct 2, 2023

I am now absolutely pumped at the idea of using AnnA to enhance FSRS :). I think it might be interesting to experiment with adding a new input feature to the NN that indicates whether a conceptually similar card was reviewed recently. It turns out that finding the k nearest neighbors of each card is not that computationally intensive and scales okay. The feature could simply be a softmaxed time distance to the most recent sibling. It might also be necessary to add another feature that indicates the grade of this latest sibling review.

Good to hear :) I figured the idea would tickle your brain once you saw the merits of FSRS and the possibility of gains from integrating it with AnnA. I had written up a few thoughts about my brainstorming of the exact ways it could be integrated: (open-spaced-repetition/fsrs4anki#352 (comment) and open-spaced-repetition/fsrs4anki#352 (comment)) (if you can overlook the rambling and length).

tl;dr: I think using AnnA distance matrix as a starting point, and then trying to optimize the distance matrix further using time-series review data to further tease out the strength of the conceptual overlap (and then using that to inform retrievability calculations) seems like the ultimate path. This is of course a very difficult problem, and would probably be best tried after doing something simpler as a proof of concept.

Building off of your idea about k nearest neighbors, perhaps a simple idea would be for one of the learned weights being the k number of neighbors to use (or maybe a similarity cutoff), to "learn" the optimal similarity threshold of a deck (below which, conceptual siblings are not thought to help one another). And then maybe some parameters for the magnitude of the effect, and tying the stability/difficulty of conceptual neighbors' reviews together in some learned and parameterized way, while keeping the distance matrix constant. With this way of doing things, one could directly quantify (offline) how well the changes to the algorithm predict the review data, which is a benefit, as it could be more convincing of a result than anecdotes, and result in more attention on the method.
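
The offline comparison described above could be sketched like this, assuming each review is recorded as a binary outcome (1 = recalled, 0 = forgotten) and that log loss is the yardstick. The outcomes and both sets of predicted probabilities are made-up toy values that only show the mechanics; they are not results.

```python
import math


def log_loss(outcomes, predictions, eps=1e-12):
    """Mean negative log-likelihood of binary outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(outcomes, predictions):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(outcomes)


# Toy held-out reviews and two hypothetical sets of recall predictions.
outcomes = [1, 1, 0, 1, 0]
baseline = [0.90, 0.85, 0.80, 0.90, 0.70]
sibling_aware = [0.92, 0.88, 0.55, 0.90, 0.50]

print("baseline      :", round(log_loss(outcomes, baseline), 4))
print("sibling-aware :", round(log_loss(outcomes, sibling_aware), 4))
```

A lower value means the predicted probabilities match the observed recalls more closely, so the same held-out reviews can be scored with and without the neighbor parameters.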

Another user has suggested simply using AnnA to inform the "disperse siblings" functionality of FSRS, which could be a quicker and easier win for practical use, though this change would not directly affect the prediction capability of the algorithm. However, the variety of ways one can alter AnnA for specific decks makes this seem like a difficult UX problem.

@thiswillbeyourgithub
Owner

Thanks again.

I posted a comment to the thread. If you need me to push something to expose some parts of the code I'd be happy to. Like if you just want the dist matrix or whatever.

@brubsby changed the title from "Reason for pandas == 1.2.3 version lock?" to "FSRS-AnnA Integration" on Oct 3, 2023
@brubsby
Author

brubsby commented Oct 3, 2023

I don't have any plans to work on this at least until the dust settles on FSRS's integration into Anki (which is currently ongoing, with a beta out), but I'll let you know here if I need any help!

@thiswillbeyourgithub
Owner

Perfect. Thanks!
