Long Scan Times from Additional HTTP Requests #538

Open
ja53n opened this issue May 27, 2024 · 2 comments

ja53n commented May 27, 2024

Describe the bug
Scan times in TumblThree are much longer than expected for blogs that contain duplicates, so I looked into the cause.

TumblThree appears to send an HTTP request to ".media.tumblr.com/" for each duplicate it finds, which adds a large number of extra HTTP requests. The initial JSON response ("/api/read/json?debug=1&num=...") seems to contain a unique file reference ID that could be pulled from "regular-body". Using it would greatly reduce the number of requests needed to complete the scan and lower the server load. You can reproduce this by enabling "force rescan" and watching the traffic with any HTTP logger of your choice. The issue affects rescans, reblogs, duplicates, etc., and I think a fix would be useful for a lot of users. Sadly, I don't have the coding background to fix this myself, which is why I am raising this issue.
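
To illustrate the idea, here is a minimal Python sketch (not TumblThree's actual code) of pulling media URLs out of a post's "regular-body" HTML and skipping the ones already known. The function names, the `known_keys` set, and the choice of the URL's last path segment as the duplicate key are all assumptions for illustration; whether that segment is a reliable identifier for embedded images is not confirmed here.

```python
import re
from urllib.parse import urlparse

# Matches URLs hosted on any *.media.tumblr.com subdomain inside HTML text.
MEDIA_URL_RE = re.compile(r'https?://[\w.-]*\.media\.tumblr\.com/[^\s"\'<>]+')


def media_urls_from_body(regular_body: str) -> list[str]:
    """Return media URLs in the order they appear in the post body."""
    return MEDIA_URL_RE.findall(regular_body)


def file_key(url: str) -> str:
    """Hypothetical duplicate key: the last path segment of the URL."""
    return urlparse(url).path.rsplit("/", 1)[-1]


def urls_needing_download(regular_body: str, known_keys: set[str]) -> list[str]:
    """Keep only URLs whose key is not already in the (assumed) index."""
    return [
        url
        for url in media_urls_from_body(regular_body)
        if file_key(url) not in known_keys
    ]
```

With a check like this, only URLs missing from the index would trigger a request to the media host; everything else would be resolved from the JSON response alone.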

To Reproduce
Steps to reproduce the behavior:

  1. Set up HTTP monitoring or a debug trace for TumblThree.
  2. Start TumblThree with the deduplication setting enabled and rescan an existing site that was already processed.
  3. Observe the additional ".media.tumblr.com/" requests for files that are already in the index cache.

Expected behavior
Fast scan times, with only the JSON file requested when the content consists of duplicates.

Desktop (please complete the following information):

  • TumblThree version: v2.13
  • OS: Windows 10 Home
  • Browser: Chrome, version 125
@thomas694
Contributor

Are you downloading normal or 'hidden' blogs? What are your settings? Any other relevant information?

@thomas694
Contributor

Well, the missing information was that the already downloaded files had been downloaded for another blog, not for the one being scanned.
And the affected posts are those with embedded images, so the JSON structure isn't that helpful.

We'll change it to check not only the current blog but also all other blogs for duplicates in this case.
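
A rough Python sketch of that lookup order, purely for illustration: the current blog's index is checked first, then every other blog's index if cross-blog checking is enabled. The `indexes` mapping, the `check_all_blogs` flag, and the function name are hypothetical and do not reflect TumblThree's actual data structures or settings.

```python
def is_duplicate(
    key: str,
    current_blog: str,
    indexes: dict[str, set[str]],
    check_all_blogs: bool = True,
) -> bool:
    """Return True if `key` is already recorded for some blog.

    `indexes` maps a blog name to the set of file keys already downloaded
    for that blog (a stand-in for the per-blog index files).
    """
    # Cheap check first: the index of the blog being scanned.
    if key in indexes.get(current_blog, set()):
        return True
    if not check_all_blogs:
        return False
    # Otherwise look through every other blog's index as well.
    return any(
        key in keys for blog, keys in indexes.items() if blog != current_blog
    )
```

With `check_all_blogs=False` the function only looks at the current blog's index, so a file downloaded for a different blog is treated as new and gets requested again.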

thomas694 added a commit that referenced this issue Jul 3, 2024
- For embedded photos it only checked the current index file for duplicates.
- Now all index files and archives are checked, if enabled.