Optimizations for Cointaxman #108

Open · scientes opened this issue Feb 26, 2022 · 1 comment

scientes (Contributor) commented Feb 26, 2022

In the last few days, a few optimizations came to my mind that I'd like to share so I don't forget them and so they can be discussed. I'd make each of these a separate issue, but I'd like to discuss first whether they are worth it:

1. Add support for cached pairs to #14

Currently #14 doesn't check the db for parts of the price pairs on the route that are already downloaded. To make this possible, some rework of the pathing algorithm and the db structure is required.
Bogus CSV:

date,Token,operation,amount
1.1.2021 0:00,BTC,BUY,10            # Price written to db: BTC/USDT:10000; assumed path: BTC->USDT->EUR or BTC->EUR
1.1.2021 0:00,USDT,SELL,100000      # Path: USDT->EUR
1.1.2021 0:00,BNB,FEE,0.5           # Assumed path: BNB->USDT->EUR

This scenario would require at worst 5 requests to the exchange (at best 3, with direct BTC->EUR and BNB->EUR pairs), at 1-2 s per request due to the rate limit.
These paths are for demonstration only and may differ in reality.
Currently, with #14, the price written to the db from the csv is BTC/USDT, but #14 ignores this price and will probably pick BTC->EUR, which is less accurate than BTC/USDT (exact price known via csv) * USDT/EUR (low volatility, since USDT is a stablecoin).

To fix this we need to:

  • prioritize paths with weights (highest to lowest priority; see the sketch after this section):
    • where the price comes from the csv (we would need a new column which marks those prices)
    • where a price pair is already in the db but not from the csv
    • where a price pair has low volatility (fiat2fiat)
  • cache all requested pairs in the db (currently we save only BTC/EUR for the path BTC->USDT->EUR, and not BTC/EUR, BTC/USDT and USDT/EUR)

All in all, this would bring us down to 2 requests (USDT->EUR and BNB->USDT) and would also increase accuracy. In some edge cases we would need to split our batches into smaller chunks, which could lead to more requests, but the higher accuracy would make that acceptable.

Since this is also a bit complicated, I would like to separate it from #14 PR-wise, so as not to delay #14 longer than necessary.
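A minimal sketch of the weighting idea, assuming a simple additive cost per edge; `PriceSource` and `find_cheapest_path` are hypothetical names for illustration, not CoinTaxman's actual API:

```python
import heapq
from enum import IntEnum

class PriceSource(IntEnum):
    """Hypothetical edge costs: lower value = higher priority."""
    CSV = 0        # exact price known from the account statement
    DB = 1         # pair already cached in the database
    STABLE = 2     # low-volatility pair (fiat2fiat / stablecoin)
    REQUEST = 10   # pair must be fetched from the exchange (rate-limited)

def find_cheapest_path(graph, start, target):
    """Dijkstra over a dict graph {base: {quote: PriceSource}}.

    Returns the conversion path whose edges are cheapest overall,
    preferring csv > db > low-volatility > fresh request.
    """
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, source in graph.get(node, {}).items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + source, neighbor, path + [neighbor]))
    return None

# The bogus CSV from above: BTC/USDT is known exactly, USDT/EUR is stable.
graph = {
    "BTC": {"USDT": PriceSource.CSV, "EUR": PriceSource.REQUEST},
    "USDT": {"EUR": PriceSource.STABLE},
    "BNB": {"USDT": PriceSource.REQUEST},
}
print(find_cheapest_path(graph, "BTC", "EUR"))  # (2, ['BTC', 'USDT', 'EUR'])
```

With these weights, BTC->USDT->EUR (cost 0 + 2) beats the direct BTC->EUR request (cost 10), matching the accuracy argument above.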

2. Option for using the order history instead of OHLCV

OHLCV is slightly inaccurate because it is an average over a 60 s candle. An idea would be to use the order history instead, which would be more accurate because we can pick almost the exact time our transaction occurred. Since some exchanges don't support it, the OHLCV method should always remain as fallback. The order history is also considerably slower than the OHLCV method, so it should be optional.
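A minimal sketch of the fallback logic using ccxt; the `get_price` helper is hypothetical, and picking the nearest trade / averaging open and close are assumptions for illustration:

```python
import ccxt

def get_price(exchange: ccxt.Exchange, symbol: str, timestamp_ms: int) -> float:
    """Hypothetical helper: prefer the trade closest to the timestamp,
    fall back to a 1m OHLCV candle if the exchange lacks fetchTrades."""
    if exchange.has.get("fetchTrades"):
        trades = exchange.fetch_trades(symbol, since=timestamp_ms, limit=100)
        if trades:
            # Pick the trade whose timestamp is closest to ours.
            best = min(trades, key=lambda t: abs(t["timestamp"] - timestamp_ms))
            return best["price"]
    # Fallback: average of open and close of the surrounding 1m candle.
    candle = exchange.fetch_ohlcv(symbol, "1m", since=timestamp_ms, limit=1)[0]
    return (candle[1] + candle[4]) / 2

# Usage: price of BTC/EUR at 2021-01-01 00:00 UTC on Binance.
binance = ccxt.binance()
print(get_price(binance, "BTC/EUR", binance.parse8601("2021-01-01T00:00:00Z")))
```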

3. Add support for resuming tax calculations from file

We usually only want to calculate taxes for the current year, but if we already have tax reports generated for previous years, an optional setting could let us read in the previously generated reports instead of starting from scratch. This would also be nice within the current year, so that we can resume the calculation from where we left off.

4. Read/write cached path data from/to db

The two current bottlenecks for price calculation are the rate limit and the pathfinding through the graph. We can't speed up the rate limit, but there are two ways to speed up the graph part. Currently the graph is generated from scratch on every execution, with two bottlenecks:

  • For each pair in the graph, the time span during which it was active on the exchange is queried via weekly (1w) OHLCV candles (1 request per pair, so about 0.5-2 s per pair due to the rate limit). This can't really be sped up much, because we need to refresh this information on every startup (caching it would probably mean missing when a pair goes inactive on the exchange). It is already cached in memory and fetched on demand, so it only causes a warm-up period of approx. 30 s-3 min at the beginning. I have no real idea how to speed this up other than caching.
  • Each path calculation takes about 1 s. Under normal circumstances this is not a problem, because the rate limit delay is significantly longer, but with my first optimization proposal this could become the new bottleneck. Fix: cache all viable paths per pair in the db (see the sketch below). The paths should only change when a new exchange is added to the pool, so caching them is an option. The active period should not be saved to the db, because it can and will change.
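A minimal sketch of such a path cache, assuming a sqlite table; the `path_cache` schema and helper names are made up for illustration. Keying the cache on the sorted set of exchanges makes it invalidate itself when a new exchange joins the pool, and the active period is deliberately not stored:

```python
import json
import sqlite3

conn = sqlite3.connect("paths.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS path_cache (
           base TEXT, quote TEXT, exchanges TEXT, path TEXT,
           PRIMARY KEY (base, quote, exchanges))"""
)

def cache_key(exchanges: list[str]) -> str:
    # Sorted so the key is independent of exchange order.
    return ",".join(sorted(exchanges))

def save_path(base, quote, exchanges, path):
    conn.execute(
        "INSERT OR REPLACE INTO path_cache VALUES (?, ?, ?, ?)",
        (base, quote, cache_key(exchanges), json.dumps(path)),
    )
    conn.commit()

def load_path(base, quote, exchanges):
    row = conn.execute(
        "SELECT path FROM path_cache WHERE base=? AND quote=? AND exchanges=?",
        (base, quote, cache_key(exchanges)),
    ).fetchone()
    return json.loads(row[0]) if row else None

save_path("BTC", "EUR", ["binance"], ["BTC", "USDT", "EUR"])
print(load_path("BTC", "EUR", ["binance"]))  # ['BTC', 'USDT', 'EUR']
```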

5. Multicore support for the tax calculation

Currently, for very big tax reports (2k+ entries), the tax calculation is CPU-bound because it runs on a single core. A relatively easy fix would be to use a ProcessPool and calculate each coin's tax returns in a separate process. Due to the overhead (1-3 s) of starting the processes, this is only viable for larger tax returns.
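A minimal sketch of the per-coin ProcessPool idea, assuming the FIFO queues of different coins are independent (operations within one coin must stay in order); `evaluate_coin` is a hypothetical stand-in for the real evaluation:

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def evaluate_coin(coin: str, operations: list) -> dict:
    """Hypothetical per-coin evaluation; FIFO only needs the operations
    of a single coin in chronological order, so coins are independent."""
    return {"coin": coin, "taxable_events": len(operations)}

def evaluate_all(operations: list[tuple[str, str]]) -> list[dict]:
    # Group operations by coin, preserving their chronological order.
    by_coin = defaultdict(list)
    for coin, op in operations:
        by_coin[coin].append(op)
    # One task per coin; only worth the ~1-3 s startup overhead for
    # large reports.
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(evaluate_coin, c, ops) for c, ops in by_coin.items()]
        return [f.result() for f in futures]

if __name__ == "__main__":  # required for process spawning on Windows/macOS
    ops = [("BTC", "BUY"), ("ETH", "BUY"), ("BTC", "SELL")]
    print(evaluate_all(ops))
```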

Feedback is appreciated

provinzio (Owner) commented

ad 1 Great idea! I am sorry that I am currently unable to give #14 the attention it deserves. It is an awesome improvement and reduces my price fetching by more than 50%, good job! The code is good but complex. I am currently working on my tax declaration and might not be able to fully review and check it in time for this declaration.

ad 2 More accuracy is always good. Disabling this feature might never be a good option, as we should strive for the highest accuracy in the real tax declaration. Keep in mind that "bad" prices (calculated over a long price span) are saved "forever" in the database. Activating this feature in a future run won't fix/update the "bad" prices.

ad 3 I like the idea and this will definitely come in handy in the coming years. Our savefile should remember the read-in account statements (with file path and MD5 sum) to check which files are already done and which are left. This method should throw an error when an uncached account statement contains a transaction prior to the last transaction in the savefile, because in that case we might have to recalculate everything to fulfill FIFO.
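A minimal sketch of that savefile bookkeeping, assuming a JSON savefile; `check_statement` and the savefile layout are hypothetical:

```python
import hashlib
import json
from pathlib import Path

def md5sum(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def load_savefile(savefile: Path) -> dict:
    if savefile.exists():
        return json.loads(savefile.read_text())
    return {"statements": {}, "last_transaction": None}

def check_statement(savefile_data: dict, statement: Path,
                    first_transaction_ts: float) -> bool:
    """Return True if the statement was already read in (same path + MD5).

    Raise if an unseen statement starts before the last processed
    transaction: FIFO would then force a full recalculation.
    """
    cached = savefile_data["statements"].get(str(statement))
    if cached == md5sum(statement):
        return True
    last = savefile_data["last_transaction"]
    if last is not None and first_transaction_ts < last:
        raise RuntimeError(
            f"{statement} contains transactions prior to the savefile; "
            "a full recalculation is required to fulfill FIFO."
        )
    return False
```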

The calculation time is currently no problem for me; most of the time is spent fetching prices. Therefore I would give this issue a low priority, but feel free to tackle it.

ad 4 Good point.

ad 5 Parallelization might not be viable for all kinds of operation types. E.g. two sell operations can't easily be processed in parallel, as we have to determine the sold coins of the first sell operation before we can evaluate the second one. And we cannot evaluate a sell operation until the buy/airdrop/commission operations before it have been processed.

Actually, I do not see a good way to parallelize the evaluation.
