
Make zimcheck report its findings without accumulating them all in memory #311

Open · veloman-yunkan opened this issue Jul 26, 2022 · 6 comments · Fixed by kiwix/libkiwix#887

@veloman-yunkan (Collaborator) commented on Jul 26, 2022:

zimcheck's traditional (non-JSON) report format enforces an implementation where all findings (errors/warnings) are accumulated in memory so that they can be output grouped by check type. When support for the JSON output format was added, the in-memory representation of each finding became more space-consuming. As a result, on huge ZIM files with lots of errors zimcheck's memory consumption can become prohibitively large. For example, on a machine with 8GB of RAM, the following checks run on stackoverflow.com_en_all_2022-05.zim (which has more than 64M entries) ended as follows:

  • zimcheck --redirect_loop completed in 8.5 minutes using 1097 MiB of RAM (there were no errors)
  • zimcheck --url_external completed in ~2 hours using 1219 MiB of RAM and produced a 79 MiB output
  • zimcheck --url_internal --url_external reached 6 GiB of RAM usage in ~8 hours and was still far from completion. Forcing premature output (with the help of a debugger) produced a 745 MiB report; the run was then interrupted.

I don't see why we shouldn't change the output format (and the implementation) so that errors/warnings can be output as soon as they are discovered.
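For illustration only, here is a minimal sketch of what such a change could look like; none of these class names come from the zimcheck code base. The idea is to hide the reporting strategy behind an interface so that the checks emit findings without caring whether they are buffered for grouping or printed immediately:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch; names are illustrative, not taken from zimcheck sources.
class Reporter {
public:
    virtual ~Reporter() = default;
    virtual void report(const std::string& checkType, const std::string& finding) = 0;
    virtual void finish() {}
};

// Current approach: buffer everything so the report can be grouped by check type.
// Memory usage grows with the number of findings.
class GroupingReporter : public Reporter {
    std::map<std::string, std::vector<std::string>> byCheck;
public:
    void report(const std::string& checkType, const std::string& finding) override {
        byCheck[checkType].push_back(finding);
    }
    void finish() override {
        for (const auto& [check, findings] : byCheck) {
            std::cout << "[" << check << "]\n";
            for (const auto& f : findings)
                std::cout << "  " << f << "\n";
        }
    }
};

// Proposed approach: print each finding as soon as it is discovered.
// Memory usage stays constant regardless of the number of findings.
class StreamingReporter : public Reporter {
public:
    void report(const std::string& checkType, const std::string& finding) override {
        std::cout << checkType << ": " << finding << "\n";
    }
};
```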

@veloman-yunkan (Collaborator, Author):

@kelson42 Alternatively, we may limit the number of errors/warnings: if the purpose of zimcheck is to identify bugs in the ZIM creation software, then we don't have to report all issues, as most of them will have the same root cause. That workaround would be easier to implement.
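A rough sketch of that workaround, assuming a hypothetical cap (for example driven by a --max-findings option, which does not exist today): keep at most N findings in memory and only count the ones that get dropped.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch; the class and the cap are illustrative, not existing zimcheck code.
class CappedFindingList {
    std::vector<std::string> findings;
    const std::size_t cap;
    std::size_t dropped = 0;
public:
    explicit CappedFindingList(std::size_t cap) : cap(cap) {}

    void add(std::string finding) {
        if (findings.size() < cap)
            findings.push_back(std::move(finding));
        else
            ++dropped;  // memory stays bounded; only a counter keeps growing
    }

    const std::vector<std::string>& items() const { return findings; }
    std::size_t droppedCount() const { return dropped; }  // report "... and N more findings omitted"
};
```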

@kelson42 (Contributor):

@veloman-yunkan Question: might it be that the problem is worse because we don't have error codes and therefore store a lot of free text? See #239.

@veloman-yunkan (Collaborator, Author):

@kelson42 Not exactly. The in-memory representation of errors is definitely not optimal. However, there is a lower bound to how much we can reduce it. A real solution is to either limit the error count or not accumulate the errors in RAM at all.
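To make that lower bound concrete, here is a hypothetical back-of-the-envelope sketch of a maximally compact finding (an error code plus an entry index, with the message rendered only at output time); none of these names exist in zimcheck:

```cpp
#include <cstdint>

// Hypothetical compact representation; names are illustrative only.
enum class ErrorCode : std::uint8_t {
    REDIRECT_LOOP,
    INTERNAL_URL_NOT_FOUND,
    EXTERNAL_URL_INVALID,
    // ...
};

struct CompactFinding {
    ErrorCode     code;        // 1 byte instead of a preformatted free-text message
    std::uint32_t entryIndex;  // index of the offending entry in the ZIM file
};

// Even at ~8 bytes per finding (with padding), tens of millions of findings
// can still occupy hundreds of MiB, which is the lower bound mentioned above.
```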

@kelson42 (Contributor):

@veloman-yunkan I would prefer to just fix the software elegantly so that memory consumption does not go wild. I understand that, following the current (in-RAM) approach, there will always be a "lower bound". 6GB is too high... but from the Kiwix point of view, we could probably live with worst-case scenarios of up to 2GB. Therefore, I wonder whether just handling the errors smartly in memory would be good enough? Would that not be worth an attempt? At least doing that cannot be a bad move!

@veloman-yunkan veloman-yunkan changed the title Make zimcheck report its finding without accumulating them all in memory Make zimcheck report its findings without accumulating them all in memory Jul 28, 2022
@veloman-yunkan (Collaborator, Author):

> Would that not be worth an attempt? At least doing that cannot be a bad move!

@kelson42 That could turn out to be a waste of time. What if the best we can get after implementing that optimization is still above 2GB for stackoverflow.com_en_all_2022-05.zim?

@mgautierfr (Collaborator):

I agree that we could change the output behavior to print the errors as soon as we find them, instead of waiting until the end so that they can be grouped by category. We can reasonably assume that:

  • when ZIM files are checked by tools using the JSON output, there is no need to group errors; we can output each error directly as a JSON stream (see the sketch after this list)
  • when ZIM files are checked by humans, grouped output is nice, but ungrouped output is also fine. We get the errors sooner and can decide to stop the check (Ctrl-C) without having to wait for its completion (which can indeed be pretty long)
  • if there are a lot of errors, a grouped log is a bit easier to read, but not by much. If you have a 745 MiB error log, you will never read the full log, grouped or not
  • if there are few errors, both grouped and plain error logs are easily readable.
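As a rough sketch of the "JSON stream" idea from the first point, assuming a newline-delimited JSON format (the function and field names are made up, not part of zimcheck):

```cpp
#include <iostream>
#include <string>

// Hypothetical sketch: one self-contained JSON object per line ("JSON Lines"),
// written as soon as a check discovers a finding, so nothing needs to be buffered.
// A real implementation would also escape special characters in the strings.
void emitFinding(std::ostream& out,
                 const std::string& check,
                 const std::string& path,
                 const std::string& message)
{
    out << "{\"check\":\"" << check
        << "\",\"path\":\"" << path
        << "\",\"message\":\"" << message << "\"}\n";
}
```

A consuming tool can then process such a report line by line (for example with jq or any JSON parser) without loading the whole report, mirroring how zimcheck itself would no longer hold all findings in RAM.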
