
Make zimcheck report its findings without accumulating them all in memory #311

Open · veloman-yunkan opened this issue Jul 26, 2022 · 6 comments · Fixed by kiwix/libkiwix#887

@veloman-yunkan (Collaborator) commented on Jul 26, 2022:

zimcheck's traditional (non-JSON) report format enforces an implementation where all findings (errors/warnings) are accumulated in memory so that they can be output grouped by check type. When support for the JSON output format was added, the in-memory representation of each finding became more space-consuming. As a result, on huge ZIM files with lots of errors zimcheck's memory consumption can become prohibitively large. For example, on a machine with 8GB of RAM, the following checks run on stackoverflow.com_en_all_2022-05.zim (which has more than 64M entries) ended as follows:

  • zimcheck --redirect_loop completed in 8.5 minutes using 1097 MiB of RAM (there were no errors)
  • zimcheck --url_external completed in ~2 hours using 1219 MiB of RAM and produced a 79 MiB output
  • zimcheck --url_internal --url_external reached 6 GiB of RAM usage in ~8 hours and was still far from completion. Forcing premature output (with the help of a debugger) produced a 745 MiB report; the run was then interrupted.

I don't see why we shouldn't change the output format (and the implementation) so that errors/warnings can be output as soon as they are discovered.
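For illustration only, here is a minimal sketch of what such a change could look like; none of these class names come from the zimcheck code base. The idea is to hide the reporting strategy behind an interface so that the checks emit findings without caring whether they are buffered for grouping or printed immediately:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch; names are illustrative, not taken from zimcheck sources.
class Reporter {
public:
    virtual ~Reporter() = default;
    virtual void report(const std::string& checkType, const std::string& finding) = 0;
    virtual void finish() {}
};

// Current approach: buffer everything so the report can be grouped by check type.
// Memory usage grows with the number of findings.
class GroupingReporter : public Reporter {
    std::map<std::string, std::vector<std::string>> byCheck;
public:
    void report(const std::string& checkType, const std::string& finding) override {
        byCheck[checkType].push_back(finding);
    }
    void finish() override {
        for (const auto& [check, findings] : byCheck) {
            std::cout << "[" << check << "]\n";
            for (const auto& f : findings)
                std::cout << "  " << f << "\n";
        }
    }
};

// Proposed approach: print each finding as soon as it is discovered.
// Memory usage stays constant regardless of the number of findings.
class StreamingReporter : public Reporter {
public:
    void report(const std::string& checkType, const std::string& finding) override {
        std::cout << checkType << ": " << finding << "\n";
    }
};
```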

@veloman-yunkan (Collaborator, Author):

@kelson42 Alternatively, we may limit the number of errors/warnings: if the purpose of zimcheck is to identify bugs in the ZIM creation software, then we don't have to report all issues, as most of them will have the same root cause. That workaround would be easier to implement.
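A rough sketch of that workaround, assuming a hypothetical cap (for example driven by a --max-findings option, which does not exist today): keep at most N findings in memory and only count the ones that get dropped.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch; the class and the cap are illustrative, not existing zimcheck code.
class CappedFindingList {
    std::vector<std::string> findings;
    const std::size_t cap;
    std::size_t dropped = 0;
public:
    explicit CappedFindingList(std::size_t cap) : cap(cap) {}

    void add(std::string finding) {
        if (findings.size() < cap)
            findings.push_back(std::move(finding));
        else
            ++dropped;  // memory stays bounded; only a counter keeps growing
    }

    const std::vector<std::string>& items() const { return findings; }
    std::size_t droppedCount() const { return dropped; }  // report "... and N more findings omitted"
};
```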

@kelson42 (Contributor):

@veloman-yunkan Question: might it be that the problem is worse because we don't have error codes and therefore store a lot of free text? See #239.

@veloman-yunkan (Collaborator, Author):

@kelson42 Not exactly. The in-memory representation of errors is definitely not optimal. However, there is a lower bound to how much we can reduce it. A real solution is to either limit the error count or not accumulate the errors in RAM at all.
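To make that lower bound concrete, here is a hypothetical back-of-the-envelope sketch of a maximally compact finding (an error code plus an entry index, with the message rendered only at output time); none of these names exist in zimcheck:

```cpp
#include <cstdint>

// Hypothetical compact representation; names are illustrative only.
enum class ErrorCode : std::uint8_t {
    REDIRECT_LOOP,
    INTERNAL_URL_NOT_FOUND,
    EXTERNAL_URL_INVALID,
    // ...
};

struct CompactFinding {
    ErrorCode     code;        // 1 byte instead of a preformatted free-text message
    std::uint32_t entryIndex;  // index of the offending entry in the ZIM file
};

// Even at ~8 bytes per finding (with padding), tens of millions of findings
// can still occupy hundreds of MiB, which is the lower bound mentioned above.
```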

@kelson42 (Contributor):

@veloman-yunkan I would prefer to just fix the software elegantly so that memory consumption does not go wild. I understand that, following the current (in-RAM) approach, there will always be a "lower bound". 6GB is too high... but from the Kiwix point of view, we could probably live with worst-case scenarios of up to 2GB. Therefore, I wonder whether just handling the errors smartly in memory would be good enough? Would that not be worth an attempt? At least doing that cannot be a bad move!

@veloman-yunkan veloman-yunkan changed the title Make zimcheck report its finding without accumulating them all in memory Make zimcheck report its findings without accumulating them all in memory Jul 28, 2022
@veloman-yunkan (Collaborator, Author):

> Would that not be worth an attempt? At least doing that cannot be a bad move!

@kelson42 That could turn out to be a waste of time. What if the best we can get after implementing that optimization is still above 2GB for stackoverflow.com_en_all_2022-05.zim?

@mgautierfr (Collaborator):

I agree that we could change the output behavior to print the errors as soon as we find them, instead of waiting until the end so that they can be grouped by category. We can reasonably assume that:

  • when ZIM files are checked by tools using the JSON output, there is no need to group errors; we can output each error directly as a JSON stream (see the sketch after this list)
  • when ZIM files are checked by humans, grouped output is nice, but ungrouped output is also fine. We get the errors sooner and can decide to stop the check (Ctrl-C) without having to wait for its completion (which can indeed be pretty long)
  • if there are a lot of errors, a grouped log is a bit easier to read, but not by much. If you have a 745 MiB error log, you will never read the full log, grouped or not
  • if there are few errors, both grouped and plain error logs are easily readable.
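As a rough sketch of the "JSON stream" idea from the first point, assuming a newline-delimited JSON format (the function and field names are made up, not part of zimcheck):

```cpp
#include <iostream>
#include <string>

// Hypothetical sketch: one self-contained JSON object per line ("JSON Lines"),
// written as soon as a check discovers a finding, so nothing needs to be buffered.
// A real implementation would also escape special characters in the strings.
void emitFinding(std::ostream& out,
                 const std::string& check,
                 const std::string& path,
                 const std::string& message)
{
    out << "{\"check\":\"" << check
        << "\",\"path\":\"" << path
        << "\",\"message\":\"" << message << "\"}\n";
}
```

A consuming tool can then process such a report line by line (for example with jq or any JSON parser) without loading the whole report, mirroring how zimcheck itself would no longer hold all findings in RAM.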
