Keep and upload database of finished jobs #465

Open
JustAnotherArchivist opened this issue Sep 20, 2020 · 8 comments

@JustAnotherArchivist
Contributor

When a job crashes or is aborted, its DB and therefore its queue is simply deleted while the other things (WARC so far, JSON, and since #396 the log file) are kept. I think we should also retain the database. It may even be worth considering keeping it for all jobs.

Besides preserving the remaining queue for crashed and aborted jobs, it also allows for easier access to the crawl information. For example, it's much easier to extract all URLs that failed three times or that resulted in a particular status code from the DB than to painfully process the log file. It could also allow for running 'update crawls' (outside of ArchiveBot) at a later time by reusing a job's DB to skip (some) URLs that were already retrieved, without having to construct such a DB from the log file.
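
A minimal sketch of the kind of query meant here, in Python with the standard `sqlite3` module. The `urls` table and its columns are a made-up, simplified schema for illustration only, not wpull's actual layout; check the real schema (for example with `.schema` in the `sqlite3` shell) before adapting it.

```python
import sqlite3

# ASSUMPTION: hypothetical, simplified schema. The real wpull database
# uses its own table and column names, which may differ from these.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE urls (
    url TEXT PRIMARY KEY,
    status TEXT,           -- hypothetical values: 'todo', 'done', 'error'
    try_count INTEGER,
    status_code INTEGER
);
INSERT INTO urls VALUES
    ('https://example.org/',        'done',  1, 200),
    ('https://example.org/missing', 'done',  1, 404),
    ('https://example.org/flaky',   'error', 3, NULL),
    ('https://example.org/queued',  'todo',  0, NULL);
""")


def urls_failed_at_least(conn, tries):
    """URLs that ended in an error after at least `tries` attempts."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE status = 'error' AND try_count >= ? ORDER BY url",
        (tries,),
    )
    return [row[0] for row in rows]


def urls_with_status_code(conn, code):
    """URLs whose last response had the given HTTP status code."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE status_code = ? ORDER BY url", (code,)
    )
    return [row[0] for row in rows]


def remaining_queue(conn):
    """URLs that were still queued when the job stopped."""
    rows = conn.execute("SELECT url FROM urls WHERE status = 'todo' ORDER BY url")
    return [row[0] for row in rows]


print(urls_failed_at_least(conn, 3))
print(urls_with_status_code(conn, 404))
print(remaining_queue(conn))
```

The same three questions asked of a log file would need a parser for every relevant log line format, which is the pain the DB avoids.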

The obvious downside is the data/storage size. However, in the grand scheme of things, this doesn't make a big difference. As a point of reference, job 6recrrotn072khaaje73k60kh – one of the largest jobs currently running at 65 million URLs – has a DB file of 15.8 GiB. This is pretty much insignificant compared to the job's data size of 4.8 TiB, especially as compression decreases the size further by a factor of 4-5 (zstd without tuning: 3.57 GiB or 22.6 %). So this is an increase in data per job on the order of 1 ‰ (except in the rare extreme cases where the vast majority of URLs are ignored).
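
The size estimate can be checked with a few lines of arithmetic. This sketch uses only the figures quoted above; the variable names are my own.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

db_size = 15.8 * GIB        # uncompressed DB of job 6recrrotn072khaaje73k60kh
db_compressed = 3.57 * GIB  # same DB after zstd without tuning
job_data = 4.8 * TIB        # total data size of the job

# Compressed size as a fraction of the original DB.
compression_ratio = db_compressed / db_size

# Extra data per job, in permille of the job's data size.
raw_overhead_permille = db_size / job_data * 1000
compressed_overhead_permille = db_compressed / job_data * 1000

print(f"compressed/original: {compression_ratio:.1%}")
print(f"overhead, raw DB: {raw_overhead_permille:.2f} permille")
print(f"overhead, compressed DB: {compressed_overhead_permille:.2f} permille")
```

The raw DB comes out at a few permille of the job's data and the compressed one below one permille, which matches the "order of 1 ‰" estimate.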

@Arkiver2
Member

This is a great idea, I think size is not a problem here.

I'm not sure what exactly is stored in the DB. Is there any sensitive information? If not, we should definitely preserve the DB (also on finished jobs).

@JustAnotherArchivist
Contributor Author

JustAnotherArchivist commented Sep 21, 2020

Nothing sensitive at all. It contains only information on the crawl itself that could in theory be regenerated using the source code and the WARCs (but you might turn suicidal trying to do that): URLs, their relations (parent, root), recursion info (level, inline level), crawl info (status, try count, priority [currently unused]), and some info on the content ('link type', status code). POST data and local filenames would also end up there but are not used by AB. Sometime in the future, cookies will also be there, but again, nothing that couldn't be reconstructed from the WARCs anyway (and the IRC commands if we add manual cookie control).

@Arkiver2
Member

Reconstructing this would be pretty difficult (and would require processing a ton of data). Size is not a problem (relative to total WARC size). Since there's nothing sensitive in the DB, let's do it.

We could gzip it and upload together with the JSON and WARCs.

@JustAnotherArchivist
Contributor Author

Yep. It's possible in theory but completely unfeasible in practice.

I'll play around with gzip vs zstd a bit. It'll be a .db.gz or .db.zst file with the same filename structure as everything else.

@JustAnotherArchivist
Contributor Author

I ran a few tests on large-ish databases on a busy pipeline in a terminal:

| Job | Original size | gzip -6 size | gzip -6 time | gzip -9 size | gzip -9 time | zstd size | zstd time |
|---|---|---|---|---|---|---|---|
| 1m71j820n4qka3ob7w6dlja3y | 23.7 GiB | 3.93 GiB | 17 min | | | 3.58 GiB | 2.5 min |
| 9hdfwijhzx86os1k3tm1wgq3i | 1034 MiB | 241 MiB | 33 s | 239 MiB | 50 s | 221 MiB | 4.3 s |
| 2g2xqrj2na5od7mk5mql0q3bn | 326 MiB | 67.3 MiB | 10 s | 66.7 MiB | 23 s | 64.7 MiB | 2.3 s |
| 73pjjo1i8uyububkhbpaf6ndr | 5.00 GiB | 0.915 GiB | 2 min 20 s | | | 0.890 GiB | 30 s |

The implications are pretty obvious.

I'll probably switch the log compression (on crashed/aborted jobs) to zstd as well. Although zstd with its default settings actually produced a larger file than even gzip -6 in a test, it only takes a slight increase of the compression level to fix that. On my partial test log from job 9hdfwijhzx86os1k3tm1wgq3i (1010 MiB), zstd -10 takes about the same time as gzip -9 (33 s) but produces a file of 85 MiB compared to gzip's 99 MiB. I'll do some more testing to find the sweet spot there.

JustAnotherArchivist changed the title from "Keep and upload database on crashes and aborts" to "Keep and upload database of finished jobs" on Sep 21, 2020
@JustAnotherArchivist
Contributor Author

I looked into this a bit again. I took the DB from 5nbpflkse0rs1tlgch8n4efud (2.94 GB, 13 million URLs, runtime before crashing about a week) and the partial log file from 3pwf0useacbmua9uwp4idpale (3.64 GB, 12 million URLs, runtime about a month so far) and compressed them at most levels of zstd and gzip. I ran this on a fairly busy AB pipeline (jap-kakapo), so it should be representative of what the runtime might look like in reality. The jobs are obviously among the larger ones running through AB. My analysis consisted of staring at shitty graphs of user time vs compression ratio in LibreOffice Calc.

Test results

Database

zstd

Compression level Original size Compressed size Compression ratio Real time User time Sys time
1 2944745472 721835831 24.51% 13.616 11.565 1.751
2 2944745472 700663107 23.79% 13.192 13.300 1.097
3 2944745472 682199494 23.17% 16.743 16.700 1.231
4 2944745472 677610833 23.01% 21.281 21.426 1.187
5 2944745472 661601839 22.47% 50.899 50.691 1.414
6 2944745472 657653273 22.33% 56.692 56.470 1.247
7 2944745472 630368182 21.41% 68.079 67.727 1.542
8 2944745472 625318048 21.24% 79.158 79.157 1.252
9 2944745472 622723235 21.15% 93.913 93.947 1.144
10 2944745472 613131472 20.82% 117.855 117.652 1.381
11 2944745472 610937389 20.75% 131.157 130.767 1.344
12 2944745472 609634199 20.70% 176.475 176.017 1.516
13 2944745472 609777705 20.71% 201.050 196.477 2.412
14 2944745472 607311093 20.62% 218.251 215.267 2.175
15 2944745472 605756166 20.57% 265.313 262.540 2.719
16 2944745472 588934765 20.00% 572.187 561.204 3.635
17 2944745472 562606051 19.11% 697.623 690.251 4.968
18 2944745472 538896215 18.30% 1085.334 1077.788 5.231
19 2944745472 530637003 18.02% 1519.945 1512.603 4.898

gzip

Compression level Original size Compressed size Compression ratio Real time User time Sys time
1 2944745472 806176600 27.38% 47.730 42.625 1.891
2 2944745472 800534883 27.19% 47.969 44.378 1.551
3 2944745472 770088717 26.15% 56.418 54.249 1.833
4 2944745472 736143418 25.00% 65.347 62.334 1.711
5 2944745472 723571018 24.57% 71.107 68.941 1.759
6 2944745472 717027291 24.35% 89.407 87.594 1.560
7 2944745472 713746787 24.24% 103.271 100.502 1.680
8 2944745472 711333243 24.16% 126.486 124.023 1.536
9 2944745472 711214985 24.15% 138.508 134.626 1.927

Log

zstd

(Only ran it up to level 15 because it was getting ridiculous...)

Compression level Original size Compressed size Compression ratio Real time User time Sys time
1 3641670876 440404842 12.09% 11.606 11.189 1.098
2 3641670876 435763309 11.97% 12.000 11.859 1.232
3 3641670876 432647510 11.88% 15.586 15.240 1.290
4 3641670876 433149771 11.89% 18.272 17.941 1.072
5 3641670876 402242867 11.05% 39.903 39.730 1.240
6 3641670876 395880291 10.87% 43.403 43.543 1.198
7 3641670876 379345921 10.42% 58.751 58.411 1.505
8 3641670876 369124857 10.14% 72.449 71.646 1.644
9 3641670876 367090066 10.08% 87.926 87.384 1.644
10 3641670876 365891660 10.05% 103.167 103.085 1.317
11 3641670876 365068174 10.02% 124.915 124.942 1.265
12 3641670876 363906198 9.99% 164.777 163.296 1.347
13 3641670876 359998040 9.89% 228.925 228.116 2.024
14 3641670876 358985335 9.86% 267.016 265.797 2.248
15 3641670876 358212227 9.84% 334.854 333.494 2.032

gzip

Compression level Original size Compressed size Compression ratio Real time User time Sys time
1 3641670876 506536391 13.91% 36.188 33.271 1.340
2 3641670876 493880878 13.56% 32.974 31.712 1.168
3 3641670876 483203714 13.27% 36.171 33.592 1.383
4 3641670876 452770760 12.43% 48.991 45.913 1.296
5 3641670876 436844810 12.00% 47.157 45.902 1.175
6 3641670876 418611901 11.50% 63.652 60.297 1.332
7 3641670876 416090448 11.43% 70.472 68.945 1.268
8 3641670876 400631037 11.00% 88.818 87.520 1.128
9 3641670876 400425421 11.00% 114.291 112.666 1.244
(Raw terminal output in case I screwed up the tabulation somewhere)
> for lvl in {1..22}; do echo $lvl; time zstd -$lvl patriots.win-inf-20210123-012541-5nbpf-wpull.db -o patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst$lvl; echo; echo; done
1
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 24.51%   (2944745472 => 721835831 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst1) 

real	0m13.616s
user	0m11.565s
sys	0m1.751s


2
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 23.79%   (2944745472 => 700663107 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst2) 

real	0m13.192s
user	0m13.300s
sys	0m1.097s


3
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 23.17%   (2944745472 => 682199494 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst3) 

real	0m16.743s
user	0m16.700s
sys	0m1.231s


4
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 23.01%   (2944745472 => 677610833 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst4) 

real	0m21.281s
user	0m21.426s
sys	0m1.187s


5
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 22.47%   (2944745472 => 661601839 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst5) 

real	0m50.899s
user	0m50.691s
sys	0m1.414s


6
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 22.33%   (2944745472 => 657653273 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst6) 

real	0m56.692s
user	0m56.470s
sys	0m1.247s


7
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 21.41%   (2944745472 => 630368182 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst7) 

real	1m8.079s
user	1m7.727s
sys	0m1.542s


8
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 21.24%   (2944745472 => 625318048 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst8) 

real	1m19.158s
user	1m19.157s
sys	0m1.252s


9
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 21.15%   (2944745472 => 622723235 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst9) 

real	1m33.913s
user	1m33.947s
sys	0m1.144s


10
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.82%   (2944745472 => 613131472 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst10) 

real	1m57.855s
user	1m57.652s
sys	0m1.381s


11
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.75%   (2944745472 => 610937389 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst11) 

real	2m11.157s
user	2m10.767s
sys	0m1.344s


12
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.70%   (2944745472 => 609634199 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst12) 

real	2m56.475s
user	2m56.017s
sys	0m1.516s


13
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.71%   (2944745472 => 609777705 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst13) 

real	3m21.050s
user	3m16.477s
sys	0m2.412s


14
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.62%   (2944745472 => 607311093 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst14) 

real	3m38.251s
user	3m35.267s
sys	0m2.175s


15
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.57%   (2944745472 => 605756166 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst15) 

real	4m25.313s
user	4m22.540s
sys	0m2.719s


16
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.00%   (2944745472 => 588934765 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst16) 

real	9m32.187s
user	9m21.204s
sys	0m3.635s


17
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 19.11%   (2944745472 => 562606051 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst17) 

real	11m37.623s
user	11m30.251s
sys	0m4.968s


18
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 18.30%   (2944745472 => 538896215 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst18) 

real	18m5.334s
user	17m57.788s
sys	0m5.231s


19
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 18.02%   (2944745472 => 530637003 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst19) 

real	25m19.945s
user	25m12.603s
sys	0m4.898s

> for lvl in {1..9}; do echo $lvl; time gzip -$lvl <patriots.win-inf-20210123-012541-5nbpf-wpull.db >patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz$lvl; echo; echo; done
1

real	0m47.730s
user	0m42.625s
sys	0m1.891s


2

real	0m47.969s
user	0m44.378s
sys	0m1.551s


3

real	0m56.418s
user	0m54.249s
sys	0m1.833s


4

real	1m5.347s
user	1m2.334s
sys	0m1.711s


5

real	1m11.107s
user	1m8.941s
sys	0m1.759s


6

real	1m29.407s
user	1m27.594s
sys	0m1.560s


7

real	1m43.271s
user	1m40.502s
sys	0m1.680s


8

real	2m6.486s
user	2m4.023s
sys	0m1.536s


9

real	2m18.508s
user	2m14.626s
sys	0m1.927s

> for lvl in {1..19}; do echo $lvl; time zstd -$lvl 3pwf0useacbmua9uwp4idpale.log -o 3pwf0useacbmua9uwp4idpale.log.zst$lvl; echo; echo; done
1
3pwf0useacbmua9uwp4idpale.log : 12.09%   (3641670876 => 440404842 bytes, 3pwf0useacbmua9uwp4idpale.log.zst1) 

real	0m11.606s
user	0m11.189s
sys	0m1.098s


2
3pwf0useacbmua9uwp4idpale.log : 11.97%   (3641670876 => 435763309 bytes, 3pwf0useacbmua9uwp4idpale.log.zst2) 

real	0m12.000s
user	0m11.859s
sys	0m1.232s


3
3pwf0useacbmua9uwp4idpale.log : 11.88%   (3641670876 => 432647510 bytes, 3pwf0useacbmua9uwp4idpale.log.zst3) 

real	0m15.586s
user	0m15.240s
sys	0m1.290s


4
3pwf0useacbmua9uwp4idpale.log : 11.89%   (3641670876 => 433149771 bytes, 3pwf0useacbmua9uwp4idpale.log.zst4) 

real	0m18.272s
user	0m17.941s
sys	0m1.072s


5
3pwf0useacbmua9uwp4idpale.log : 11.05%   (3641670876 => 402242867 bytes, 3pwf0useacbmua9uwp4idpale.log.zst5) 

real	0m39.903s
user	0m39.730s
sys	0m1.240s


6
3pwf0useacbmua9uwp4idpale.log : 10.87%   (3641670876 => 395880291 bytes, 3pwf0useacbmua9uwp4idpale.log.zst6) 

real	0m43.403s
user	0m43.543s
sys	0m1.198s


7
3pwf0useacbmua9uwp4idpale.log : 10.42%   (3641670876 => 379345921 bytes, 3pwf0useacbmua9uwp4idpale.log.zst7) 

real	0m58.751s
user	0m58.411s
sys	0m1.505s


8
3pwf0useacbmua9uwp4idpale.log : 10.14%   (3641670876 => 369124857 bytes, 3pwf0useacbmua9uwp4idpale.log.zst8) 

real	1m12.449s
user	1m11.646s
sys	0m1.644s


9
3pwf0useacbmua9uwp4idpale.log : 10.08%   (3641670876 => 367090066 bytes, 3pwf0useacbmua9uwp4idpale.log.zst9) 

real	1m27.926s
user	1m27.384s
sys	0m1.644s


10
3pwf0useacbmua9uwp4idpale.log : 10.05%   (3641670876 => 365891660 bytes, 3pwf0useacbmua9uwp4idpale.log.zst10) 

real	1m43.167s
user	1m43.085s
sys	0m1.317s


11
3pwf0useacbmua9uwp4idpale.log : 10.02%   (3641670876 => 365068174 bytes, 3pwf0useacbmua9uwp4idpale.log.zst11) 

real	2m4.915s
user	2m4.942s
sys	0m1.265s


12
3pwf0useacbmua9uwp4idpale.log :  9.99%   (3641670876 => 363906198 bytes, 3pwf0useacbmua9uwp4idpale.log.zst12) 

real	2m44.777s
user	2m43.296s
sys	0m1.347s


13
3pwf0useacbmua9uwp4idpale.log :  9.89%   (3641670876 => 359998040 bytes, 3pwf0useacbmua9uwp4idpale.log.zst13) 

real	3m48.925s
user	3m48.116s
sys	0m2.024s


14
3pwf0useacbmua9uwp4idpale.log :  9.86%   (3641670876 => 358985335 bytes, 3pwf0useacbmua9uwp4idpale.log.zst14) 

real	4m27.016s
user	4m25.797s
sys	0m2.248s


15
3pwf0useacbmua9uwp4idpale.log :  9.84%   (3641670876 => 358212227 bytes, 3pwf0useacbmua9uwp4idpale.log.zst15) 

real	5m34.854s
user	5m33.494s
sys	0m2.032s

> for lvl in {1..9}; do echo $lvl; time gzip -$lvl <3pwf0useacbmua9uwp4idpale.log >3pwf0useacbmua9uwp4idpale.log.gz$lvl; echo; echo; done
1

real	0m36.188s
user	0m33.271s
sys	0m1.340s


2

real	0m32.974s
user	0m31.712s
sys	0m1.168s


3

real	0m36.171s
user	0m33.592s
sys	0m1.383s


4

real	0m48.991s
user	0m45.913s
sys	0m1.296s


5

real	0m47.157s
user	0m45.902s
sys	0m1.175s


6

real	1m3.652s
user	1m0.297s
sys	0m1.332s


7

real	1m10.472s
user	1m8.945s
sys	0m1.268s


8

real	1m28.818s
user	1m27.520s
sys	0m1.128s


9

real	1m54.291s
user	1m52.666s
sys	0m1.244s

> ll
total 34151264
drwxr-xr-x  2 archivebot archivebot       4096 Feb 21 03:42 .
drwxr-xr-x 20 archivebot archivebot       4096 Feb 21 02:55 ..
-rw-r--r--  1 archivebot archivebot 3641670876 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log
-rw-r--r--  1 archivebot archivebot  506536391 Feb 21 03:33 3pwf0useacbmua9uwp4idpale.log.gz1
-rw-r--r--  1 archivebot archivebot  493880878 Feb 21 03:33 3pwf0useacbmua9uwp4idpale.log.gz2
-rw-r--r--  1 archivebot archivebot  483203714 Feb 21 03:34 3pwf0useacbmua9uwp4idpale.log.gz3
-rw-r--r--  1 archivebot archivebot  452770760 Feb 21 03:35 3pwf0useacbmua9uwp4idpale.log.gz4
-rw-r--r--  1 archivebot archivebot  436844810 Feb 21 03:35 3pwf0useacbmua9uwp4idpale.log.gz5
-rw-r--r--  1 archivebot archivebot  418611901 Feb 21 03:36 3pwf0useacbmua9uwp4idpale.log.gz6
-rw-r--r--  1 archivebot archivebot  416090448 Feb 21 03:38 3pwf0useacbmua9uwp4idpale.log.gz7
-rw-r--r--  1 archivebot archivebot  400631037 Feb 21 03:39 3pwf0useacbmua9uwp4idpale.log.gz8
-rw-r--r--  1 archivebot archivebot  400425421 Feb 21 03:41 3pwf0useacbmua9uwp4idpale.log.gz9
-rw-r--r--  1 archivebot archivebot  440404842 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst1
-rw-r--r--  1 archivebot archivebot  365891660 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst10
-rw-r--r--  1 archivebot archivebot  365068174 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst11
-rw-r--r--  1 archivebot archivebot  363906198 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst12
-rw-r--r--  1 archivebot archivebot  359998040 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst13
-rw-r--r--  1 archivebot archivebot  358985335 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst14
-rw-r--r--  1 archivebot archivebot  358212227 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst15
-rw-r--r--  1 archivebot archivebot  435763309 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst2
-rw-r--r--  1 archivebot archivebot  432647510 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst3
-rw-r--r--  1 archivebot archivebot  433149771 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst4
-rw-r--r--  1 archivebot archivebot  402242867 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst5
-rw-r--r--  1 archivebot archivebot  395880291 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst6
-rw-r--r--  1 archivebot archivebot  379345921 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst7
-rw-r--r--  1 archivebot archivebot  369124857 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst8
-rw-r--r--  1 archivebot archivebot  367090066 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst9
-rw-r--r--  2 archivebot archivebot 2944745472 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db
-rw-r--r--  1 archivebot archivebot  806176600 Feb 21 02:17 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz1
-rw-r--r--  1 archivebot archivebot  800534883 Feb 21 02:18 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz2
-rw-r--r--  1 archivebot archivebot  770088717 Feb 21 02:19 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz3
-rw-r--r--  1 archivebot archivebot  736143418 Feb 21 02:20 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz4
-rw-r--r--  1 archivebot archivebot  723571018 Feb 21 02:21 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz5
-rw-r--r--  1 archivebot archivebot  717027291 Feb 21 02:22 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz6
-rw-r--r--  1 archivebot archivebot  713746787 Feb 21 02:24 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz7
-rw-r--r--  1 archivebot archivebot  711333243 Feb 21 02:26 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz8
-rw-r--r--  1 archivebot archivebot  711214985 Feb 21 02:28 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz9
-rw-r--r--  1 archivebot archivebot  721835831 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst1
-rw-r--r--  1 archivebot archivebot  613131472 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst10
-rw-r--r--  1 archivebot archivebot  610937389 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst11
-rw-r--r--  1 archivebot archivebot  609634199 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst12
-rw-r--r--  1 archivebot archivebot  609777705 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst13
-rw-r--r--  1 archivebot archivebot  607311093 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst14
-rw-r--r--  1 archivebot archivebot  605756166 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst15
-rw-r--r--  1 archivebot archivebot  588934765 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst16
-rw-r--r--  1 archivebot archivebot  562606051 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst17
-rw-r--r--  1 archivebot archivebot  538896215 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst18
-rw-r--r--  1 archivebot archivebot  530637003 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst19
-rw-r--r--  1 archivebot archivebot  700663107 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst2
-rw-r--r--  1 archivebot archivebot  682199494 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst3
-rw-r--r--  1 archivebot archivebot  677610833 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst4
-rw-r--r--  1 archivebot archivebot  661601839 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst5
-rw-r--r--  1 archivebot archivebot  657653273 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst6
-rw-r--r--  1 archivebot archivebot  630368182 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst7
-rw-r--r--  1 archivebot archivebot  625318048 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst8
-rw-r--r--  1 archivebot archivebot  622723235 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst9
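
The tables list times in plain seconds while the raw `time` output uses the `XmY.YYYs` form. A small sketch of that conversion, in case anyone wants to re-check the tabulation; the function name is my own.

```python
import re

# Matches lines such as "real\t1m8.079s" as printed by bash's `time`.
_TIME_LINE = re.compile(r"^(real|user|sys)\s+(\d+)m([\d.]+)s$")


def parse_time_line(line):
    """Return (kind, seconds) for a `time` output line, or None for other lines."""
    match = _TIME_LINE.match(line.strip())
    if match is None:
        return None
    kind, minutes, seconds = match.groups()
    return kind, round(int(minutes) * 60 + float(seconds), 3)


print(parse_time_line("real\t1m8.079s"))
print(parse_time_line("user\t25m12.603s"))
```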

My conclusion: the sweet spot with zstd seems to be 10 for databases and 8 for logs. Up to that, there is an acceptable increase in runtime with significant space savings. Beyond that, the large increase in compression time outweighs the relatively small size reduction. Unless someone yells at me, that's what I'll implement soon™.
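
One way to put a number on "the large increase in compression time outweighs the relatively small size reduction" is the CPU time spent per MiB saved between two levels. This is only a sketch of that idea, using a handful of rows from the database zstd table above; the metric and the choice of levels are my own, not the analysis that was actually done.

```python
MIB = 1024 ** 2

# zstd level -> (compressed size in bytes, user time in seconds),
# copied from the database table in this comment.
DB_ZSTD = {
    3: (682199494, 16.700),
    10: (613131472, 117.652),
    15: (605756166, 262.540),
    19: (530637003, 1512.603),
}


def cpu_seconds_per_mib_saved(low, high, table=DB_ZSTD):
    """Extra user time per MiB of additional savings when going from level low to high."""
    size_low, time_low = table[low]
    size_high, time_high = table[high]
    saved_mib = (size_low - size_high) / MIB
    return (time_high - time_low) / saved_mib


for low, high in [(3, 10), (10, 15), (15, 19)]:
    cost = cpu_seconds_per_mib_saved(low, high)
    print(f"zstd -{low} -> -{high}: {cost:.1f} CPU s per MiB saved")
```

On these rows, going from the default level to 10 costs roughly 1.5 CPU seconds per MiB saved, while every step beyond 10 costs more than ten times that.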

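Fun side note: on the database, even zstd -2 compresses better than gzip -9, and at a 10 times shorter runtime! (On the log, zstd needs level 6 to beat gzip -9, which is still well over twice as fast.)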

@JustAnotherArchivist
Contributor Author

A complication is SQLite's Write-Ahead Log (which records changes to the DB that aren't merged into the main database file yet). When the DB gets closed, it gets merged, and only wpull.db remains (but is this guaranteed behaviour?). This is what happens on aborting, for example. But when wpull crashes, wpull.db-wal and wpull.db-shm remain. Merging explicitly is possible using sqlite3 wpull.db 'PRAGMA wal_checkpoint' (see the SQLite docs; passing an argument would possibly be better), but I'm not sure whether that always works. Perhaps there'd need to be a fallback to preserve all three files in case the wal_checkpoint fails to merge them together.
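
A rough sketch of the checkpoint-with-fallback idea using Python's standard `sqlite3` module. The function name and the decision logic are assumptions for illustration, not ArchiveBot code. It relies on `PRAGMA wal_checkpoint(TRUNCATE)` returning a row whose first column is 0 when the checkpoint was not blocked, and on the WAL file being empty or gone afterwards.

```python
import os
import sqlite3


def files_to_preserve(db_path):
    """Try to merge the WAL into the main DB file.

    Returns [db_path] if the merge looks complete, otherwise every one of
    db_path, db_path-wal and db_path-shm that exists (the fallback).
    Hypothetical helper, not part of wpull or ArchiveBot.
    """
    merged = False
    conn = sqlite3.connect(db_path)
    try:
        # Result row: (busy, WAL frames, checkpointed frames).
        row = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        merged = row is not None and row[0] == 0
    except sqlite3.Error:
        merged = False
    finally:
        conn.close()

    wal_path = db_path + "-wal"
    wal_is_empty = not os.path.exists(wal_path) or os.path.getsize(wal_path) == 0
    if merged and wal_is_empty:
        return [db_path]

    candidates = [db_path, wal_path, db_path + "-shm"]
    return [path for path in candidates if os.path.exists(path)]
```

Whatever the function returns would then be compressed and uploaded, so a failed merge costs some tidiness but no data.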

@pabs3

pabs3 commented May 10, 2024

Might be worth dumping the SQLite databases to SQL and compressing that instead of compressing the raw binary database files.
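
A minimal sketch of what such a dump could look like with Python's standard library. `Connection.iterdump()` yields the database as SQL statements; gzip is used here only because it ships with Python, standing in for whichever compressor ends up being chosen. The function and file names are made up for the example.

```python
import gzip
import sqlite3


def dump_sql_gz(db_path, out_path):
    """Write a compressed SQL text dump of the SQLite database at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        with gzip.open(out_path, "wt", encoding="utf-8") as out:
            # iterdump() yields one SQL statement per item, starting with
            # BEGIN TRANSACTION and ending with COMMIT.
            for statement in conn.iterdump():
                out.write(statement + "\n")
    finally:
        conn.close()


def restore_sql_gz(dump_path, db_path):
    """Rebuild a database from a dump written by dump_sql_gz."""
    conn = sqlite3.connect(db_path)
    try:
        with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
            conn.executescript(dump.read())
    finally:
        conn.close()
```

Whether the SQL text actually compresses better than the binary file would need the same kind of measurement as above; this sketch makes no claim either way.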
