removing file extensions from package URLs · ideas

Stream: ideas

Topic: removing file extensions from package URLs

Richard Feldman (Dec 13 2023 at 16:58):

I'd like to remove the file extensions from package URLs, and instead use only HTTP headers and the contents of the file to determine how they're compressed.

One motivation for this is that with the current design, it's impossible to upgrade to a better compression format in the future without breaking package URLs. For example, someone might want to first upload with gzip to get a bugfix or security release out as fast as possible, because brotli compression can take a while, and then re-upload later with brotli at maximum compression, without breaking anyone.

Also, some hosting providers or CDNs might offer built-in brotli compression, so having to pre-compress with brotli when publishing takes unnecessarily long and is redundant with the compression the hosting provider will do automatically.

Richard Feldman (Dec 13 2023 at 16:59):

I think both the HTTP headers and the contents of the file should be used because some hosting providers don't let you customize the HTTP headers - e.g. GitHub Releases doesn't natively support brotli and doesn't give you a way to customize headers, and that's how almost all Roc packages are likely to be hosted, so if we didn't support doing the compression yourself (via roc bundle), then we just wouldn't get that compression at all in the most common case

Richard Feldman (Dec 13 2023 at 17:01):

a challenge is that although it's trivial to detect if the file is compressed with gzip (all gzip streams begin with the bytes 0x1F 0x8B specifically to identify them as being gzipped), the same is not true of brotli, which doesn't have signature bytes at the beginning as part of its specification
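A minimal sketch of the gzip check described above, using only the Python standard library. The helper name `looks_like_gzip` is hypothetical and not part of Roc or any library. Note that a match only means the first two bytes agree with the gzip signature, not that the rest of the stream is valid.

```python
import gzip

# The two signature bytes that every gzip stream starts with.
GZIP_MAGIC = bytes([0x1F, 0x8B])


def looks_like_gzip(data: bytes) -> bool:
    """Return True if `data` starts with the gzip signature bytes."""
    return data[:2] == GZIP_MAGIC


if __name__ == "__main__":
    compressed = gzip.compress(b"example package contents")
    print(looks_like_gzip(compressed))        # output of gzip.compress starts with the signature
    print(looks_like_gzip(b"not compressed"))
```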

Richard Feldman (Dec 13 2023 at 17:02):

however, there's a proposal to introduce this which includes 4 initial signature bytes (0x91, 0x19, 0x62, 0x66) that are not valid brotli, and therefore can't be mistaken for header-less brotli

Richard Feldman (Dec 13 2023 at 17:07):

that looks like a well-reasoned proposal, but the most popular Rust crate for brotli uses a different one (0xE1, 0x97)

Richard Feldman (Dec 13 2023 at 17:08):

also Mark Adler, coauthor of gzip, proposed (apparently at Google's request) a standard for this back in 2016, which also used a different magic number (0xCE, 0xB2, 0xCF, 0x81)

Richard Feldman (Dec 13 2023 at 17:20):

what's even more confusing to me is that it seems like all of these are using incorrect magic numbers? The brotli RFC says that the stream header (which seems to be the very beginning of the brotli stream) begins with 7 bits, and:

Note that bit pattern 0010001 is invalid and must not be used.

so that means the two bit patterns that are invalid for the first 8 bits (i.e. the first byte) are those 7 invalid bits plus either 0 or 1 as the eighth bit, which works out to be 0x22 and 0x23
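The arithmetic above can be sketched as follows. The result depends on the bit order assumed when placing the 7-bit pattern into a byte, so the sketch computes both. The function names are hypothetical. Reading the pattern as the high 7 bits gives 0x22 and 0x23, as stated above. RFC 7932 describes packing data elements starting from the least significant bit of each byte; under that reading the pattern occupies the low 7 bits, which gives 0x11 and 0x91 instead, and 0x91 is the first byte of the 4-byte signature proposal mentioned earlier in this thread.

```python
# The 7-bit pattern the brotli RFC marks as invalid for the stream header.
INVALID_PATTERN = 0b0010001


def first_bytes_msb_first(pattern7: int) -> list[int]:
    """Pattern fills the high 7 bits; the eighth bit is the lowest bit."""
    return [(pattern7 << 1) | extra for extra in (0, 1)]


def first_bytes_lsb_first(pattern7: int) -> list[int]:
    """Pattern fills the low 7 bits; the eighth bit is the highest bit."""
    return [pattern7 | (extra << 7) for extra in (0, 1)]


if __name__ == "__main__":
    print([hex(b) for b in first_bytes_msb_first(INVALID_PATTERN)])  # ['0x22', '0x23']
    print([hex(b) for b in first_bytes_lsb_first(INVALID_PATTERN)])  # ['0x11', '0x91']
```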

Anton (Dec 13 2023 at 17:21):

If the hash of the archive is then only checked after compression that does mean a file can be replaced with a zip bomb.

Richard Feldman (Dec 13 2023 at 17:24):

true, but I think we need to defend against zip bombs in general because someone might just publish a new release of a package that's a zip bomb
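One common way to defend against zip bombs in general is to cap how many bytes the decompressor is allowed to produce. The following is a sketch of that idea for gzip using Python's standard `zlib` module. The function name `gunzip_limited` and the size limits are illustrative assumptions, not how Roc actually handles this.

```python
import gzip
import zlib


def gunzip_limited(data: bytes, max_output: int) -> bytes:
    """Decompress gzip data, refusing to produce more than `max_output` bytes."""
    # wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    # Ask for at most one byte more than the limit, so that going over
    # the limit is detectable without inflating the whole stream.
    output = decompressor.decompress(data, max_output + 1)
    if len(output) > max_output:
        raise ValueError("decompressed size exceeds limit")
    return output


if __name__ == "__main__":
    bomb = gzip.compress(b"\0" * 10_000_000)  # small on the wire, large when inflated
    try:
        gunzip_limited(bomb, max_output=1_000_000)
    except ValueError as err:
        print("rejected:", err)
```

Because the limit is enforced while decompressing, memory use stays bounded regardless of what the download claims to contain.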

Richard Feldman (Dec 13 2023 at 17:39):

given all that, I don't like the idea of trying to use signature bytes to identify a brotli-encoded file. It seems like they don't have that sorted out yet, and anything we pick is likely to be incompatible with a future official signature.

Richard Feldman (Dec 13 2023 at 17:42):

a few possible designs come to mind here:

1. Only support gzip for now, and wait until brotli has an official signature to identify it. This means downloads will be some amount larger, because brotli generally compresses better than gzip.
2. Look for the magic gzip bytes, and if they aren't there, assume brotli. A problem with this is that I don't know whether the magic gzip bytes happen to be valid brotli bytes, which could lead to misidentifying some brotli-encoded files as gzip, leading to an error when decoding as gzip fails.
3. Have some Roc-specific way to encode inside the file the information about how it's compressed. One idea for how to do that is to have it be a tarball containing 1 file, and that file has a file extension we can check. That works, but it adds ~1.5K to every download because that's the minimum size of a tarball.
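The second design (check for the gzip signature, otherwise assume brotli) could be sketched roughly like this. Python's standard library has no brotli decoder, so the sketch takes one as a parameter; `brotli_decode` and `decompress_package` are hypothetical names. Retrying as brotli when gzip decoding fails is an added assumption, one possible way to handle the misidentification risk, and not something proposed in the thread.

```python
import gzip
import zlib
from typing import Callable

GZIP_MAGIC = b"\x1f\x8b"


def decompress_package(data: bytes, brotli_decode: Callable[[bytes], bytes]) -> bytes:
    """Decode as gzip if the signature matches, otherwise assume brotli."""
    if data[:2] == GZIP_MAGIC:
        try:
            return gzip.decompress(data)
        except (gzip.BadGzipFile, EOFError, zlib.error):
            # Assumption: the data started with the gzip signature bytes
            # but was not valid gzip, so fall through and try brotli.
            pass
    return brotli_decode(data)


if __name__ == "__main__":
    # Stand-in decoder for demonstration only; a real one would come
    # from a brotli library.
    fake_brotli = lambda payload: b"<brotli>" + payload
    print(decompress_package(gzip.compress(b"hello"), fake_brotli))
    print(decompress_package(b"\x0b\x02\x80hello\x03", fake_brotli))
```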

Richard Feldman (Dec 13 2023 at 17:51):

https://blog.cloudflare.com/results-experimenting-brotli/ says that based on Cloudflare's benchmarks:

On average, Brotli at the maximal quality setting produces 1.19X smaller results than [gzip] at the maximal quality.
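To put that figure in concrete terms, here is a small back-of-the-envelope calculation. It assumes "1.19X smaller" means the gzip size divided by the brotli size is 1.19. The figure is an average from Cloudflare's benchmarks, so this is only an estimate for any particular package, and the function name is hypothetical.

```python
def estimated_brotli_size(gzip_size: float, ratio: float = 1.19) -> float:
    """Estimate brotli output size from gzip output size, given an average ratio."""
    return gzip_size / ratio


if __name__ == "__main__":
    gzip_kb = 1000.0  # hypothetical package size after gzip, in KB
    brotli_kb = estimated_brotli_size(gzip_kb)
    saving = 1 - brotli_kb / gzip_kb
    print(f"{brotli_kb:.0f} KB, about {saving:.0%} smaller")
```

Under that assumption, a 1000 KB gzip download would be roughly 840 KB with brotli, a saving of about 16%.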

Richard Feldman (Dec 13 2023 at 18:12):

also maybe file extensions are just the way to go for now :stuck_out_tongue:

Kevin Gillette (Dec 15 2023 at 05:48):

Another potential security issue comes to mind (but might apply regardless): compression implementations often have security problems of their own, such as remote code execution. Does the decision of a pre-compression vs post-compression hash change our susceptibility to such attacks?

Kevin Gillette (Dec 15 2023 at 05:50):

Also, iiuc, zstd was designed with similar goals to brotli, but often has slightly better compression ratios, and perhaps wider adoption. It definitely has a magic byte sequence at the beginning.
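For reference, a sketch of what checking for the zstd signature could look like. A zstd frame starts with the 32-bit magic number 0xFD2FB528 stored in little-endian order, so the first four bytes on disk are 0x28 0xB5 0x2F 0xFD. The helper name is hypothetical, and the check only covers regular zstd frames, not skippable frames.

```python
import struct

# zstd frame magic number, written little-endian at the start of a frame.
ZSTD_MAGIC = struct.pack("<I", 0xFD2FB528)


def looks_like_zstd(data: bytes) -> bool:
    """Return True if `data` starts with the zstd frame magic number."""
    return data[:4] == ZSTD_MAGIC


if __name__ == "__main__":
    print(ZSTD_MAGIC.hex())  # 28b52ffd
    print(looks_like_zstd(bytes([0x28, 0xB5, 0x2F, 0xFD]) + b"rest of frame"))
```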

Brendan Hansknecht (Dec 15 2023 at 06:47):

It is kinda interesting that brotli is targeted for web, but zstd is not really. I wonder if there is any meaningful difference because of that.

Last updated: Jul 23 2026 at 13:15 UTC