
feat(datasource/julia-pkg-server): add Julia PkgServer #29623

Open · wants to merge 21 commits into base: main
Conversation

@bauglir commented Jun 12, 2024

Changes

PkgServers are the primary package distribution system for Julia (the alternative being operating directly on the underlying Git repositories, which I might contribute later, but that seemed like a more daunting task given the number of possible hosting platforms). This datasource enables retrieval of packages from registries through these PkgServers. It supports multiple registries across PkgServers (a single server can host multiple registries, and the same registry can be hosted across different PkgServers).

Julia-specific versioning is not yet implemented; I will implement that as a follow-up.
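
For reviewers unfamiliar with the protocol, here is a minimal sketch of the lookup flow this datasource performs (the endpoint paths follow the PkgServer convention; the server URL and the helper function are illustrative assumptions, not code from this PR):

// Minimal sketch of the PkgServer lookup flow; error handling omitted.
const pkgServer = 'https://pkg.julialang.org';

async function fetchRegistrySnapshot(): Promise<Buffer> {
  // 1. Ask the server which registries it serves and at which state.
  //    The response is a plain-text list of paths of the form
  //    /registry/<registry-uuid>/<tree-sha>
  const registriesResponse = await fetch(`${pkgServer}/registries`);
  const [registryPath] = (await registriesResponse.text()).trim().split('\n');

  // 2. Download the snapshot of the registry at that state as a tarball.
  //    It contains, among other files, Versions.toml for every package.
  const tarballResponse = await fetch(`${pkgServer}${registryPath}`);
  return Buffer.from(await tarballResponse.arrayBuffer());
}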

Context

Documentation

  • I have updated the documentation, or
  • No documentation update is required

How I've tested my work

I have verified these changes via:

  • Code inspection only, or
  • Newly added/modified unit tests, or
  • No unit tests but ran on a real repository, or
  • Both unit tests + ran on a real repository

`PkgServer`s are the primary package distribution system for Julia (the
alternative being operating on the underlying Git repositories
directly). This datasource enables retrieval of packages from registries
through these `PkgServer`s. It supports multiple registries across
`PkgServer`s.

Julia-specific versioning is not yet implemented; this will be
implemented as a follow-up.

References renovatebot#6508.
@bauglir bauglir changed the title Add Julia PkgServer datasource feat(datasource/julia-pkg-server): add Julia PkgServer Jun 12, 2024
This behavior is already tested at the `retrieveRegistryState` level. It
was left over from when this was implemented directly at the datasource
level.
Comment on lines +73 to +77
// TODO: The tests for this functionality only require a single "chunk" to
// be read from the tarball to succeed. Hence, it would be sufficient to
// just process a single chunk directly as a string and return it. This
// does not hold in general and a (minimal) test requiring multiple chunks
// should be implemented
Author @bauglir commented Jun 12, 2024

I have a tarball where this fails (obtained from the official PkgServer), but that tarball is over 7 MB and I did not want to include it as a fixture. I have tried to create a more minimal example where the simpler approach fails, but have so far been unable to do so.

Collaborator

How often do we need to fetch tarballs, if they can be large? And for how long can we cache them?

Author

That is a very good question and one I'm not entirely sure about myself.

The tarballs depend on the state of the registry, so they change whenever a new package is published to a registry. Publishing happens by issuing a PR to the repository hosting the registry. So a suitable cache duration depends heavily on the expected rate of package releases, when the corresponding PRs are merged, and how soon new releases should be available to Renovate. Given that these variables will very much depend on the registry from which dependencies are pulled, I'm not sure there is a uniformly suitable answer.

That being said, the General registry (which is where all public, open-source packages tend to be published) sees a couple of releases per hour and "automatically merges" suitable PRs every 8 minutes. So at a minimum a cache duration of 8 minutes can be used, although in practice it is probably fine to cache these up to an hour, maybe even a bit longer.

Collaborator

If it's a single tarball per registry and not per package, then I think something like a 15-30 minute cache time is fine. If it were common for the tarball to be unchanged for longer periods of time, we'd want to retain and use etag or last-modified headers for subsequent requests, but if registries typically get updated frequently then there's little gain from that.

Author

Yeah, it's a single tarball per registry. I believe the default ttlMinutes is 30 minutes, so I'll leave this at that.

PkgServers support multiple "flavors", with "conservative" being the alternative currently available. That flavor tends to lag by a couple of hours and used to be the default mode, but it caused quite a few complaints from the community, specifically around CI (see JuliaPackaging/PkgServer.jl#144 for more context), so I don't think we'd want to support it. If we ever do, I think we could update the caching strategy accordingly.
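
For reference, a minimal sketch of what that could look like with Renovate's `@cache` decorator (the namespace and key shown are illustrative, not necessarily what this PR ends up using):

import { cache } from '../../../util/cache/package/decorator';
import { Datasource } from '../datasource';
import type { GetReleasesConfig, ReleaseResult } from '../types';

export class JuliaPkgServerDatasource extends Datasource {
  // Cache parsed releases for 30 minutes, matching the default ttlMinutes
  // and the registry update cadence discussed above.
  @cache({
    namespace: 'datasource-julia-pkg-server',
    key: ({ registryUrl, packageName }: GetReleasesConfig) =>
      `${registryUrl}:${packageName}`,
    ttlMinutes: 30,
  })
  override async getReleases(
    config: GetReleasesConfig,
  ): Promise<ReleaseResult | null> {
    // ... download the registry tarball and extract the package's releases ...
    return null;
  }
}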

package.json (resolved)
Comment on lines +82 to +109
const tarballParser = new tarParse({
filter: (path) => pathsToExtract.includes(path),
});

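// Once the tarball has been fully parsed, check that every requested file
// produced at least one chunk; if any is missing, resolve with null rather
// than returning partial contents.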
tarballParser.on('end', () => {
if (Object.values(fileContentChunks).every(({ length }) => length > 0)) {
const fileContents: FileContents<string> = {};
for (const path of pathsToExtract) {
fileContents[path] = Buffer.concat(fileContentChunks[path]).toString(
'utf-8',
);
}

resolve(fileContents);
} else {
resolve(null);
}
});

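// Collect each matching entry's data as it streams in; a single file's
// contents may arrive split across multiple chunks.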
tarballParser.on('entry', (entry) => {
entry.on('data', (data) => {
fileContentChunks[entry.path].push(data);
});

entry.on('error', reject);
});

tarballParser.end(tarball);
Author

This extracts the necessary files in memory. It might be more straightforward to unpack to disk instead and work with some form of caching. I'd be happy to change it if that's better suited to Renovate's overall design.

Member

It depends on how big these tarballs can be. ~10 MB should be OK, but bigger files can cause memory issues (OOM).

Author

Extracting the files works in an asynchronous, streaming fashion with this implementation, so memory should not be an issue there. I have seen this same pattern suggested for files that ran into gigabytes of data as well, for instance here, which uses the slightly higher-level list functionality. However, when downloading, the files are kept in memory, so the OOM would likely occur there. So the concern is still valid.

The tarballs for the largest registry currently come in at 7.7 MB, which is years' worth of data but close to the ~10 MB mark. I don't have statistics at hand for how fast this grows, but it's probably good for another couple of years at least. That being said, it probably doesn't hurt to prepare.
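
To remove the download-side buffering too, the HTTP response could be piped straight into the parser. A sketch, assuming Renovate's `Http.stream()` is usable here (the handler wiring is elided; it would match the `entry`/`end` handlers above):

import { Parse as TarParse } from 'tar';
import { Http } from '../../../util/http';

// Sketch: stream the tarball from the network into the parser so the full
// archive never sits in memory; only the extracted file chunks do.
function streamRegistryFiles(
  http: Http,
  url: string,
  pathsToExtract: string[],
): Promise<void> {
  const tarballParser = new TarParse({
    filter: (path) => pathsToExtract.includes(path),
  });
  // Attach the same 'entry' and 'end' handlers as in the buffered version.
  return new Promise((resolve, reject) => {
    tarballParser.on('end', () => resolve());
    tarballParser.on('error', reject);
    // Pipe the HTTP response into the parser instead of buffering it first.
    http.stream(url).on('error', reject).pipe(tarballParser);
  });
}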

Member

@rarkins we need to think about those big blobs. We cache every request in memory, so we have this blob twice.

(options.method === 'get' || options.method === 'head')

We probably should pass memCache: false for those files and only cache the smaller transformed data to the package cache.

@bauglir I think we should extract all deps from that file and add them to the package cache, so we can speed up additional lookups. We do something similar in the rubygems datasource.
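
A sketch of that combination, assuming the `memCache` HTTP option and the package cache API work as described (`parseRegistryTarball` is a hypothetical helper):

import * as packageCache from '../../../util/cache/package';
import { Http } from '../../../util/http';

// Sketch: fetch the large tarball without the in-memory HTTP cache, then
// store only the much smaller parsed result in the package cache.
async function cacheParsedRegistry(
  http: Http,
  registryTarballUrl: string,
  registryState: string,
): Promise<void> {
  const { body: tarball } = await http.getBuffer(registryTarballUrl, {
    memCache: false,
  });
  const releasesByPackage = parseRegistryTarball(tarball); // hypothetical
  await packageCache.set(
    'datasource-julia-pkg-server',
    registryState, // content-addressable, so safe to use as a cache key
    releasesByPackage,
    30, // ttlMinutes
  );
}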

Author

I think that makes sense. I did a quick test and it looks like parsing everything and putting it in an object results in ~800 kB of memory being used (checked by capturing a heap snapshot after parsing all files in the tarball). So that should save some memory.

I'll take a stab at this. My plan would be something along the following lines. It looks like this can be a bit more straightforward than the rubygems use case, as it doesn't have to hit multiple endpoints, etc. Am I correct in assuming that using the @cache decorator on the getReleases method, and then populating the cache under the same key with all the packages whenever that method gets executed, would be sufficient (along with disabling caching for the downloaded tarballs)?

One thing I noticed in a quick experiment is that with this approach, when Renovate needs to update multiple packages, it checks all of the @cache decorators before doing any getReleases calls. This results in none of the caches being populated yet (on a fresh cache or after the TTL has expired) and the full retrieval still occurring multiple times. Subsequent runs then correctly hit the cache. Can I not use the @cache decorator this way, and should I do all the cache management directly within getReleases?

@bauglir bauglir marked this pull request as ready for review June 12, 2024 12:32
Collaborator @HonkingGoose left a comment

Most of my changes are for the docs or code comments.

In general the changes are:

  • One sentence per line for docs files.
  • Simpler words and phrases.
  • Shorter sentences.
  • Fix some typos.

This matches the Renovate Style Guide.

I'll let the maintainers do a technical review of your code and tests. 😉

lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/readme.md (outdated, resolved)
lib/modules/datasource/julia-pkg-server/readme.md (outdated, resolved)
lib/modules/datasource/julia-pkg-server/registry.spec.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
Comment on lines 43 to 46
* The final part of the path of a (fully specified) registry URL references
* the tree SHA of the state of the registry repository the package server
* is serving. Given the content-addressable nature of this SHA, it can be used
* as a caching mechanism.
Collaborator

Suggested change
* The final part of the path of a (fully specified) registry URL references
* the tree SHA of the state of the registry repository the package server
* is serving. Given the content-addressable nature of this SHA, it can be used
* as a caching mechanism.
* The final part of the path of a (fully specified) registry URL references
* the tree SHA-1 of the state of the registry repository the package server
* is serving. Given the content-addressable nature of this SHA-1, it can be used
* as a caching mechanism.

I think Git still defaults to SHA-1 hashes? I think there's experimental support for SHA-256 now. Not sure if we need to be specific about the SHA-1/SHA-256/SHA-512 hashes here.

What happens if the datasource queries a repository that uses SHA-256 hashes? If we do not support those newer hashes, we should probably test for that, and show an error to the user.

Author

The PkgServer implementation does not currently take this difference into account; it only supports SHA-1 states. Given that Renovate queries registries through the package server, I would expect this to initially get solved at the package server level.

If a SHA-256 "state" were added to a URL, it would fail to parse and be reported here.

Once package servers start supporting registry repositories using SHA-256, they will start reporting those URLs through their /registries endpoint, and currently this is indeed not tested for.

I think this raises a good question as to whether we should be enforcing strict formatting on the state. Either we should check it everywhere, or we can forgo the check altogether, which would then just result in 404 errors when retrieving the registry, with the upside that it would not require changes to Renovate if the package server starts supporting different state formats.
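
For context, a sketch of the path parsing in question, accepting only the SHA-1 states that PkgServer serves today (the regex is illustrative; the path layout follows the /registry/&lt;uuid&gt;/&lt;state&gt; convention):

// Sketch: parse '/registry/<uuid>/<state>' paths as reported by a
// PkgServer's /registries endpoint; only 40-hex-char SHA-1 states match.
const registryPathRegex =
  /^\/registry\/(?<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\/(?<state>[0-9a-f]{40})$/;

function parseRegistryPath(
  path: string,
): { uuid: string; state: string } | null {
  const match = registryPathRegex.exec(path);
  return match?.groups
    ? { uuid: match.groups.uuid, state: match.groups.state }
    : null;
}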

Collaborator

Check input matches expectations

I think this raises a good question as to whether we should be enforcing strict formatting on the state.

It's probably best to check what Renovate "pulls in". So in general terms:

  • Check input is valid
  • Check input matches expected format
  • Sanitize dangerous stuff
  • Only then proceed with processing the input

SHA state check as general utility/function to be used by other Renovate code?

Checking if the state is SHA-1 vs SHA-256 vs SHA-512 sounds like something that other Renovate code could re-use? Maybe that should be a new file/function in a suitable place?

To check or not to check?

I vote check

Either we should probably check it everywhere, or we can forego the check altogether

I would vote check everywhere, but I'm not a maintainer. 😄

If we decide to not check, show good error message

... or we can forego the check altogether which will then just result in 404 errors when retrieving the registry ...

I guess that would look like this then?

  1. Renovate tries to create PR for SHA-256 registry package.
  2. Renovate shows generic 404 error.
  3. User thinks Renovate is broken.

I would prefer something like this:

  1. Renovate tries to create PR for SHA-256 registry package.
  2. Renovate shows 404 error, and explains that the SHA-256 bit is not supported yet.
  3. This probably prompts the user to ask the Renovate maintainers to add code to support the new hash.

Your upside is my downside

... with the upside that it would not require changes to Renovate if the package server starts supporting different state formats.

I don't agree with this. If the package server supports a new format, we probably need to account for that in the Renovate code. So it's best if we state explicitly what's supported and what's not, and have a way to get nudged when things break.

Maintainers make the decision here

I'll let the maintainers decide what to do with all this info above, and with your idea! 😉

Member

Can we simply allow SHA-256, so we're future-proof for when the package server adds support for it? I think in the near future more and more new repos will use SHA-256 by default.

Author

Thanks for the extensive input.

The state verification currently only applies to user-provided input. Given that a package server directly reports which URLs to use when asked for the state of a registry, the URLs/states it provides are assumed to be correct (note that the package server hosts these URLs itself). If the package server turns out to provide wrong URLs, that will indeed result in 404s.

Currently, no validation is done on those returned URLs (though I do have a local note about whether this is something we'd want to do, which I think is what you are suggesting). So I'm open to changing that.

Can we simply allow SHA-256?

I have added this in 1ffa7ae.
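
For illustration, the widening amounts to accepting both hash lengths in the state pattern (a sketch of the idea; the actual change is in the commit referenced above):

// Sketch: accept both SHA-1 (40 hex chars) and SHA-256 (64 hex chars)
// tree SHAs as registry states.
const registryStateRegex = /^(?:[0-9a-f]{40}|[0-9a-f]{64})$/;

function isValidRegistryState(state: string): boolean {
  return registryStateRegex.test(state);
}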

lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/common.ts (outdated, resolved)
@zharinov zharinov self-requested a review June 13, 2024 12:47
lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)

lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)
bauglir and others added 4 commits June 14, 2024 10:05
Member @viceice left a comment

is there no way to only download the required files instead of the full tarballs? 🤔

lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)

lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
Author @bauglir commented Jun 14, 2024

is there no way to only download the required files instead of the full tarballs? 🤔

Unfortunately not 😞 At least not when relying on package servers; that would require direct interaction with the underlying Git repositories. That is an alternative datasource I was planning to create as well, but it supports different use cases (e.g. package servers enable access to registries even when one does not have access to the registry's underlying repository), and this one seemed more straightforward to implement.

I'm open to starting a discussion around whether package servers could start to provide direct access to these files, but that will likely be a long discussion and might not be possible given the design of the package server infrastructure.

This prevents binary data from having to be stored in the repository, at
the cost of a slightly higher test runtime. It also makes it easier to
inspect the files that end up in the tarballs when testing.
@bauglir bauglir requested a review from viceice June 14, 2024 15:23
Member @viceice commented Jun 15, 2024

is there no way to only download the required files instead of the full tarballs? 🤔

Unfortunately not 😞 At least not when relying on package servers; that would require direct interaction with the underlying Git repositories. That is an alternative datasource I was planning to create as well, but it supports different use cases (e.g. package servers enable access to registries even when one does not have access to the registry's underlying repository), and this one seemed more straightforward to implement.

I'm open to starting a discussion around whether package servers could start to provide direct access to these files, but that will likely be a long discussion and might not be possible given the design of the package server infrastructure.

We should start that discussion on the package server repo, so they'll hopefully consider it. It would be a huge win for Renovate.

lib/modules/datasource/julia-pkg-server/test/index.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/registry.ts (outdated, resolved)
lib/modules/datasource/julia-pkg-server/index.ts (outdated, resolved)

…ion directly

The datasource ID is defined in a central location to prevent circular
dependencies when referring to it through the `JuliaPkgServerDatasource`
class. For consistency it should primarily be referred to in this way.
bauglir and others added 7 commits June 19, 2024 09:41
…egistry URLs

This consolidates the knowledge about the format of registry URLs a bit
better in the codebase, making it easier to change the format if
necessary. It primarily affects tests.
Primarily removes some duplication in tests while ensuring consistency.
…stry URLs

This is not currently supported by `PkgServer` (which assumes SHA-1).
This support is added for future-proofing, given that the underlying
Git repositories may start to use SHA-256 instead to encode tree SHAs.
…failed HTTP requests

Failed HTTP requests are also logged at the generic datasource level and
from the base class. These additional warnings do not add much value.
Author @bauglir commented Jun 26, 2024

We should start that discussion on the package server repo, so they'll hopefully consider it. It would be a huge win for Renovate.

I have started the discussion in JuliaPackaging/PkgServer.jl#210 and will be happy to follow up on simplifying things once/if that gets resolved.
