
feat(datasource/julia-pkg-server): add Julia PkgServer #29623

Open · wants to merge 21 commits into base: main
Conversation

@bauglir commented Jun 12, 2024

Changes

PkgServers are the primary package distribution system for Julia (the alternative being operating directly on the underlying Git repositories, which I might contribute later, but that seemed like a more daunting task given the number of possible hosting platforms). This datasource enables retrieval of packages from registries through these PkgServers. It supports multiple registries across PkgServers (a single server can host multiple registries, and the same registry can be hosted across different PkgServers).

Julia-specific versioning is not yet implemented; I will implement that as a follow-up.
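
For reviewers unfamiliar with the protocol, here is a minimal sketch of the lookup flow this datasource performs (the endpoint paths follow the PkgServer convention; the server URL and the helper function are illustrative assumptions, not code from this PR):

// Minimal sketch of the PkgServer lookup flow; error handling omitted.
const pkgServer = 'https://pkg.julialang.org';

async function fetchRegistrySnapshot(): Promise<Buffer> {
  // 1. Ask the server which registries it serves and at which state.
  //    The response is a plain-text list of paths of the form
  //    /registry/<registry-uuid>/<tree-sha>
  const registriesResponse = await fetch(`${pkgServer}/registries`);
  const [registryPath] = (await registriesResponse.text()).trim().split('\n');

  // 2. Download the snapshot of the registry at that state as a tarball.
  //    It contains, among other files, Versions.toml for every package.
  const tarballResponse = await fetch(`${pkgServer}${registryPath}`);
  return Buffer.from(await tarballResponse.arrayBuffer());
}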

Context

Documentation

  • I have updated the documentation, or
  • No documentation update is required

How I've tested my work

I have verified these changes via:

  • Code inspection only, or
  • Newly added/modified unit tests, or
  • No unit tests but ran on a real repository, or
  • Both unit tests + ran on a real repository

`PkgServer`s are the primary package distribution system for Julia (the
alternative being operating on the underlying Git repositories
directly). This datasource enables retrieval of packages from registries
through these `PkgServer`s. It supports multiple registries across
`PkgServer`s.

Julia-specific versioning is not yet implemented; this will be
implemented as a follow-up.

References renovatebot#6508.
@bauglir bauglir changed the title Add Julia PkgServer datasource feat(datasource/julia-pkg-server): add Julia PkgServer Jun 12, 2024
This behavior is already tested at the `retrieveRegistryState` level. It
was left over from when this was implemented directly at the datasource
level.
Comment on lines +73 to +77
// TODO: The tests for this functionality only require a single "chunk" to
// be read from the tarball to succeed. Hence, it would be sufficient to
// just process a single chunk directly as a string and return it. This
// does not hold in general and a (minimal) test requiring multiple chunks
// should be implemented
Author @bauglir commented Jun 12, 2024

I have a tarball where this fails (obtained from the official PkgServer), but that tarball is over 7 MB and I did not want to include it as a fixture. I have tried to create a more minimal example where the simpler approach fails, but have so far been unable to do so.

Collaborator

How often do we need to fetch tarballs, if they can be large? And for how long can we cache them?

Author

That is a very good question and one I'm not entirely sure about myself.

The tarballs depend on the state of the registry, so they change whenever a new package is published to a registry. Publishing happens by issuing a PR to the repository hosting the registry. So a suitable cache duration depends heavily on the expected rate of package releases, when the corresponding PRs are merged, and how soon new releases should be available to Renovate. Given that these variables will very much depend on the registry from which dependencies are pulled, I'm not sure there is a uniformly suitable answer.

That being said, the General registry (which is where all public, open-source packages tend to be published) sees a couple of releases per hour and "automatically merges" suitable PRs every 8 minutes. So at a minimum a cache duration of 8 minutes can be used, although in practice it is probably fine to cache these up to an hour, maybe even a bit longer.

Collaborator

If it's a single tarball per registry and not per package, then I think something like a 15-30 minute cache time is fine. If it were common for the tarball to be unchanged for longer periods of time, we'd want to retain and use etag or last-modified headers for subsequent requests, but if registries typically get updated frequently then there's little gain from that.

Author

Yeah, it's a single tarball per registry. I believe the default ttlMinutes is 30 minutes, so I'll leave this at that.

PkgServers support multiple "flavors", with "conservative" being the alternative currently available. That flavor tends to lag by a couple of hours and used to be the default mode, but it caused quite a few complaints from the community, specifically around CI (see JuliaPackaging/PkgServer.jl#144 for more context), so I don't think we'd want to support it. If we ever do, I think we could update the caching strategy accordingly.
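
For reference, a minimal sketch of what that could look like with Renovate's `@cache` decorator (the namespace and key shown are illustrative, not necessarily what this PR ends up using):

import { cache } from '../../../util/cache/package/decorator';
import { Datasource } from '../datasource';
import type { GetReleasesConfig, ReleaseResult } from '../types';

export class JuliaPkgServerDatasource extends Datasource {
  // Cache parsed releases for 30 minutes, matching the default ttlMinutes
  // and the registry update cadence discussed above.
  @cache({
    namespace: 'datasource-julia-pkg-server',
    key: ({ registryUrl, packageName }: GetReleasesConfig) =>
      `${registryUrl}:${packageName}`,
    ttlMinutes: 30,
  })
  override async getReleases(
    config: GetReleasesConfig,
  ): Promise<ReleaseResult | null> {
    // ... download the registry tarball and extract the package's releases ...
    return null;
  }
}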

package.json (resolved)
Comment on lines +82 to +109
const tarballParser = new tarParse({
filter: (path) => pathsToExtract.includes(path),
});

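// Once the tarball has been fully parsed, check that every requested file
// produced at least one chunk; if any is missing, resolve with null rather
// than returning partial contents.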
tarballParser.on('end', () => {
if (Object.values(fileContentChunks).every(({ length }) => length > 0)) {
const fileContents: FileContents<string> = {};
for (const path of pathsToExtract) {
fileContents[path] = Buffer.concat(fileContentChunks[path]).toString(
'utf-8',
);
}

resolve(fileContents);
} else {
resolve(null);
}
});

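// Collect each matching entry's data as it streams in; a single file's
// contents may arrive split across multiple chunks.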
tarballParser.on('entry', (entry) => {
entry.on('data', (data) => {
fileContentChunks[entry.path].push(data);
});

entry.on('error', reject);
});

tarballParser.end(tarball);
Author

This extracts the necessary files in memory. It might be more straightforward to unpack to disk instead and work with some form of caching. I'd be happy to change it if that's better suited to Renovate's overall design.

Member

It depends on how big these tarballs can be. ~10 MB should be OK, but bigger files can cause memory issues (OOM).

Author

Extracting the files works in an asynchronous, streaming fashion with this implementation, so memory should not be an issue there. I have seen this same pattern suggested for files that ran into gigabytes of data as well, for instance here, which uses the slightly higher-level list functionality. However, when downloading, the files are kept in memory, so the OOM would likely occur there. So the concern is still valid.

The tarballs for the largest registry currently come in at 7.7 MB, which is years' worth of data but close to the ~10 MB mark. I don't have statistics at hand for how fast this grows, but it's probably good for another couple of years at least. That being said, it probably doesn't hurt to prepare.
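
To remove the download-side buffering too, the HTTP response could be piped straight into the parser. A sketch, assuming Renovate's `Http.stream()` is usable here (the handler wiring is elided; it would match the `entry`/`end` handlers above):

import { Parse as TarParse } from 'tar';
import { Http } from '../../../util/http';

// Sketch: stream the tarball from the network into the parser so the full
// archive never sits in memory; only the extracted file chunks do.
function streamRegistryFiles(
  http: Http,
  url: string,
  pathsToExtract: string[],
): Promise<void> {
  const tarballParser = new TarParse({
    filter: (path) => pathsToExtract.includes(path),
  });
  // Attach the same 'entry' and 'end' handlers as in the buffered version.
  return new Promise((resolve, reject) => {
    tarballParser.on('end', () => resolve());
    tarballParser.on('error', reject);
    // Pipe the HTTP response into the parser instead of buffering it first.
    http.stream(url).on('error', reject).pipe(tarballParser);
  });
}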

Member

@rarkins we need to think about those big blobs. We cache every request in memory, so we have this blob twice.

(options.method === 'get' || options.method === 'head')

We probably should pass memCache: false for those files and only cache the smaller transformed data to the package cache.

@bauglir I think we should extract all deps from that file and add them to the package cache, so we can speed up additional lookups. We do something similar in the rubygems datasource.
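
A sketch of that combination, assuming the `memCache` HTTP option and the package cache API work as described (`parseRegistryTarball` is a hypothetical helper):

import * as packageCache from '../../../util/cache/package';
import { Http } from '../../../util/http';

// Sketch: fetch the large tarball without the in-memory HTTP cache, then
// store only the much smaller parsed result in the package cache.
async function cacheParsedRegistry(
  http: Http,
  registryTarballUrl: string,
  registryState: string,
): Promise<void> {
  const { body: tarball } = await http.getBuffer(registryTarballUrl, {
    memCache: false,
  });
  const releasesByPackage = parseRegistryTarball(tarball); // hypothetical
  await packageCache.set(
    'datasource-julia-pkg-server',
    registryState, // content-addressable, so safe to use as a cache key
    releasesByPackage,
    30, // ttlMinutes
  );
}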

Author

I think that makes sense. I did a quick test and it looks like parsing everything and putting it in an object results in ~800 kB of memory being used (checked by capturing a heap snapshot after parsing all files in the tarball). So that should save some memory.

I'll take a stab at this. My plan would be something along the following lines. It looks like this can be a bit more straightforward than the rubygems use case, as it doesn't have to hit multiple endpoints, etc. Am I correct in assuming that using the @cache decorator on the getReleases method, and then populating the cache under the same key with all the packages whenever that method gets executed, would be sufficient (along with disabling caching for the downloaded tarballs)?

One thing I noticed in a quick experiment is that with this approach, when Renovate needs to update multiple packages, it checks all of the @cache decorators before doing any getReleases calls. This results in none of the caches being populated yet (on a fresh cache or after the TTL has expired) and the full retrieval still occurring multiple times. Subsequent runs then correctly hit the cache. Can I not use the @cache decorator this way, and should I do all the cache management directly within getReleases?

@bauglir bauglir marked this pull request as ready for review June 12, 2024 12:32
Collaborator @HonkingGoose left a comment

Most of my changes are for the docs or code comments.

In general the changes are:

  • One sentence per line for docs files.
  • Simpler words and phrases.
  • Shorter sentences.
  • Fix some typos.

This matches the Renovate Style Guide.

I'll let the maintainers do a technical review of your code and tests. 😉

lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/readme.md (outdated, resolved)
lib/modules/datasource/julia-pkg-server/readme.md (outdated, resolved)
lib/modules/datasource/julia-pkg-server/registry.spec.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
Comment on lines 43 to 46
* The final part of the path of a (fully specified) registry URL references
* the tree SHA of the state of the registry repository the package server
* is serving. Given the content-addressable nature of this SHA, it can be used
* as a caching mechanism.
Collaborator

Suggested change
* The final part of the path of a (fully specified) registry URL references
* the tree SHA of the state of the registry repository the package server
* is serving. Given the content-addressable nature of this SHA, it can be used
* as a caching mechanism.
* The final part of the path of a (fully specified) registry URL references
* the tree SHA-1 of the state of the registry repository the package server
* is serving. Given the content-addressable nature of this SHA-1, it can be used
* as a caching mechanism.

I think Git still defaults to SHA-1 hashes? I think there's experimental support for SHA-256 now. Not sure if we need to be specific about the SHA-1/SHA-256/SHA-512 hashes here.

What happens if the datasource queries a repository that uses SHA-256 hashes? If we do not support those newer hashes, we should probably test for that, and show an error to the user.

Author

The PkgServer implementation does not currently take this difference into account; it only supports SHA-1 states. Given that Renovate queries registries through the package server, I would expect this to initially get solved at the package server level.

If a SHA-256 "state" were added to a URL, it would fail to parse and be reported here.

Once package servers start supporting registry repositories using SHA-256, they will start reporting those URLs through their /registries endpoint, and currently this is indeed not tested for.

I think this raises a good question as to whether we should be enforcing strict formatting on the state. Either we should check it everywhere, or we can forgo the check altogether, which would then just result in 404 errors when retrieving the registry, with the upside that it would not require changes to Renovate if the package server starts supporting different state formats.
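
For context, a sketch of the path parsing in question, accepting only the SHA-1 states that PkgServer serves today (the regex is illustrative; the path layout follows the /registry/&lt;uuid&gt;/&lt;state&gt; convention):

// Sketch: parse '/registry/<uuid>/<state>' paths as reported by a
// PkgServer's /registries endpoint; only 40-hex-char SHA-1 states match.
const registryPathRegex =
  /^\/registry\/(?<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\/(?<state>[0-9a-f]{40})$/;

function parseRegistryPath(
  path: string,
): { uuid: string; state: string } | null {
  const match = registryPathRegex.exec(path);
  return match?.groups
    ? { uuid: match.groups.uuid, state: match.groups.state }
    : null;
}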

Collaborator

Check input matches expectations

I think this raises a good question as to whether we should be enforcing strict formatting on the state.

It's probably best to check what Renovate "pulls in". So in general terms:

  • Check input is valid
  • Check input matches expected format
  • Sanitize dangerous stuff
  • Only then proceed with processing the input

SHA state check as general utility/function to be used by other Renovate code?

Checking if the state is SHA-1 vs SHA-256 vs SHA-512 sounds like something that other Renovate code could re-use? Maybe that should be a new file/function in a suitable place?

To check or not to check?

I vote check

Either we should probably check it everywhere, or we can forego the check altogether

I would vote check everywhere, but I'm not a maintainer. 😄

If we decide to not check, show good error message

... or we can forego the check altogether which will then just result in 404 errors when retrieving the registry ...

I guess that would look like this then?

  1. Renovate tries to create PR for SHA-256 registry package.
  2. Renovate shows generic 404 error.
  3. User thinks Renovate is broken.

I would prefer something like this:

  1. Renovate tries to create PR for SHA-256 registry package.
  2. Renovate shows 404 error, and explains that the SHA-256 bit is not supported yet.
  3. This probably prompts the user to ask the Renovate maintainers to add code to support the new hash.

Your upside is my downside

... with the upside that it would not require changes to Renovate if the package server starts supporting different state formats.

I don't agree with this. If the package server supports a new format, we probably need to account for that in the Renovate code. So it's best if we state explicitly what's supported and what's not, and have a way to get nudged when things break.

Maintainers make the decision here

I'll let the maintainers decide what to do with all this info above, and with your idea! 😉

Member

Can we simply allow SHA-256, so we're future-proof for when the package server adds support for it? I think in the near future more and more new repos will use SHA-256 by default.

Author

Thanks for the extensive input.

The state verification currently only applies to user-provided input. Given that a package server directly reports which URLs to use when asked for the state of a registry, the URLs/states it provides are assumed to be correct (note that the package server hosts these URLs itself). If the package server turns out to provide wrong URLs, that will indeed result in 404s.

Currently, no validation is done on those returned URLs (though I do have a local note about whether this is something we'd want to do, which I think is what you are suggesting). So I'm open to changing that.

Can we simply allow SHA-256?

I have added this in 1ffa7ae.
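
For illustration, the widening amounts to accepting both hash lengths in the state pattern (a sketch of the idea; the actual change is in the commit referenced above):

// Sketch: accept both SHA-1 (40 hex chars) and SHA-256 (64 hex chars)
// tree SHAs as registry states.
const registryStateRegex = /^(?:[0-9a-f]{40}|[0-9a-f]{64})$/;

function isValidRegistryState(state: string): boolean {
  return registryStateRegex.test(state);
}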

lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/common.ts (outdated, resolved)
@zharinov zharinov self-requested a review June 13, 2024 12:47
lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)

lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)
bauglir and others added 4 commits June 14, 2024 10:05
Member @viceice left a comment

is there no way to only download the required files instead of the full tarballs? 🤔

lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)

lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
Author @bauglir commented Jun 14, 2024

is there no way to only download the required files instead of the full tarballs? 🤔

Unfortunately not 😞 At least not when relying on package servers; that would require direct interaction with the underlying Git repositories. That is an alternative datasource I was planning to create as well, but it supports different use cases (e.g. package servers enable access to registries even when one does not have access to the registry's underlying repository), and this one seemed more straightforward to implement.

I'm open to starting a discussion around whether package servers could start to provide direct access to these files, but that will likely be a long discussion and might not be possible given the design of the package server infrastructure.

This prevents binary data from having to be stored in the repository, at
the cost of a slightly higher test runtime. It also makes it easier to
inspect the files that end up in the tarballs when testing.
@bauglir bauglir requested a review from viceice June 14, 2024 15:23
Member @viceice commented Jun 15, 2024

is there no way to only download the required files instead of the full tarballs? 🤔

Unfortunately not 😞 At least not when relying on package servers; that would require direct interaction with the underlying Git repositories. That is an alternative datasource I was planning to create as well, but it supports different use cases (e.g. package servers enable access to registries even when one does not have access to the registry's underlying repository), and this one seemed more straightforward to implement.

I'm open to starting a discussion around whether package servers could start to provide direct access to these files, but that will likely be a long discussion and might not be possible given the design of the package server infrastructure.

We should start that discussion on the package server repo, so they'll hopefully consider it. It would be a huge win for Renovate.

lib/modules/datasource/julia-pkg-server/test/index.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)

…ion directly

The datasource ID is defined in a central location to prevent circular
dependencies when referring to it through the `JuliaPkgServerDatasource`
class. For consistency it should primarily be referred to in this way.
bauglir and others added 7 commits June 19, 2024 09:41
…egistry URLs

This consolidates the knowledge about the format of registry URLs a bit
better in the codebase, making it easier to change the format if
necessary. It primarily affects tests.
Primarily removes some duplication in tests while ensuring consistency.
…stry URLs

This is not currently supported by `PkgServer` (which assumes SHA-1).
This support is added for future-proofing, given that the underlying
Git repositories may start to use SHA-256 instead to encode tree SHAs.
…failed HTTP requests

Failed HTTP requests are also logged at the generic datasource level and
from the base class. These additional warnings do not add much value.
Author @bauglir commented Jun 26, 2024

We should start that discussion on the package server repo, so they'll hopefully consider it. It would be a huge win for Renovate.

I have started the discussion in JuliaPackaging/PkgServer.jl#210 and will be happy to follow up on simplifying things once/if that gets resolved.
