What is the space overhead of Base64 encoding? (lemire.me)
85 points by ingve on Jan 30, 2019 | 42 comments



I think the angle he's going for here is: if you have inline base64 blobs in your HTML/CSS and that's then served by an HTTP(S) server using gzip compression, are you wasting a lot of space/bandwidth? The answer is roughly 2.5% to 5% overhead, which isn't much considering the larger advantage of a single round trip to the server, since all the assets are embedded in the HTML document.

Of course there's also the overhead of decoding all those base64 blobs in the browser, but I'm sure that's a topic for another blog post in the future :)
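
A quick way to see the effect on your own page (file names here are made up, GNU coreutils base64 assumed):

    # inline a small image as a data: URI, then compare gzipped sizes
    printf '<img src="data:image/png;base64,%s">' "$(base64 -w0 icon.png)" >> page.html
    gzip -c page.html | wc -c    # compare against the page without the blob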


For the first load of the page, sure. But if you're embedding assets in the HTML and CSS, then you can't cache them separately, which means fetching everything again on every request where anything has changed, in addition to the decoding cost on every load.


That can also be an advantage: if the connection has high throughput but high latency (e.g. cellular networks, still), you'd much rather fetch everything in a single request. There's a slight inefficiency in the inability to cache assets independently, but even that's not necessarily a big issue: you'd mostly inline small assets.


It still feels like a hack to work around a problem that should be solved at the protocol level. There's no technical reason that multiple assets couldn't be streamed over a single TCP connection. If it's a common use-case that a number of assets need to be loaded at once to display a web page, then this should be supported by HTTP.


Great idea. Let's call it HTTP/2.

https://en.wikipedia.org/wiki/HTTP/2


You can bump that scenario down a layer to http2 push and get the best of both worlds.


Yes. I think it's a bit ironic the OP didn't mention the CPU overhead of essentially going from

binary -> ASCII -> binary
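
If you want a rough feel for that decode cost (hypothetical file names, GNU coreutils assumed):

    head -c 10000000 /dev/urandom > blob.bin    # ~10 MB of random data
    base64 -w0 blob.bin > blob.b64
    time base64 -d blob.b64 > /dev/null         # measures the decode step only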


I wonder if gzip has improved? 15 years ago when I tested this, it was more efficient to gzip base16 encoded data than base64 encoded data. (At least, the English dictionary.) I assume that was because the 3:4 encoding broke up patterns in the source text and messed with the compressor.

But I just tested this again and it's not true anymore.

  $ wc -c american-english  
  971578 american-english
  
  $ gzip < american-english | wc -c  
  259977

  $ base64 < american-english | gzip | wc -c  
  429263

  $ hexdump -e '"%x"' < american-english | gzip | wc -c  
  478411


Actually, it is still true. hexdump has some confusing format options... the format you were using read multi-byte words (so little-endian on a typical machine) before printing the hex representation, which really messed with gzip. Try this:

    $ hexdump -e '"%x"' < american-english | gzip | wc -c
    463871
    $ hexdump -v -e '/1 "%02x"' < american-english | gzip | wc -c
    302515
    $ base64 < american-english | gzip | wc -c
    415737


> converting it to little-endian first

Ack, is there anything little-endian doesn't ruin?

/me misses 68000 assembler …


Thanks! That'll teach me to cut and paste from StackExchange. Also why I got different results last time I looked at this; was using a different base 16 encoder.


If you just compress the base64, the overhead is 5%. But if you embed it within an HTML file, which has a different set of character frequencies, the overhead increases.

    $ cat index.html | gzip -9 | wc -c
    14116
    $ cat index.html bing.base64 | gzip -9 | wc -c
    15994
    $ expr 15994 - 14116
    1878
    $ cat bing.base64 | gzip -9 | wc -c
    1432


If you’re going to do an experiment like that you will probably want the file to be a whole lot bigger for more accurate results.


I wrote some revised radix-based text encoding specifications to make them interoperate better with modern text processor standards (SGML, string literals, filenames, URIs, etc). I also included a representative test for how they fare on uncompressed vs pre-compressed data when compressed with gzip:

https://github.com/kstenerud/safe-encoding/blob/master/READM...

radix-64 of course fares the best, but radix-85 isn't much different, and is 10% smaller uncompressed.


The article compares raw vs base64+gzip, but I'd be interested to see gzip vs base64+gzip
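
Something like this answers that directly (sample file name made up):

    gzip -c somefile | wc -c           # gzip only
    base64 somefile | gzip | wc -c     # base64, then gzip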


Did some further investigation:

I included some text files, hex encoding, and other compression as well :)

raw file list: https://hastebin.com/vazonowuvo.txt

tsv (copy/paste into a spreadsheet): https://hastebin.com/ewohafucem.tsv

---

(Interesting) Using bzip2, compression is better when the following files are encoded first with base64 or hex (see the sketch below): bing.png googlelogo.png peppers_color.jpg

Useless takeaways:

- prefer base64 over hex when encoding already compressed images before further compression

- prefer hex over base64 when encoding plain text / low entropy data before further compression
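
The bzip2 observation should be reproducible along these lines (using the file names above, hexdump format borrowed from the earlier comment):

    bzip2 -c bing.png | wc -c                            # bzip2 on the raw PNG
    base64 bing.png | bzip2 | wc -c                      # base64 first, then bzip2
    hexdump -v -e '/1 "%02x"' bing.png | bzip2 | wc -c   # hex first, then bzip2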


Heh, so gzipped PNGs are generally larger than non-gzipped ones. The samples are probably heavily compressed, though.


Not just that, PNG and gzip use the same compression algorithm:

https://en.wikipedia.org/wiki/DEFLATE


The formats he uses as raw are already compressed by default, so gzipping them wouldn't help. Gzipping the re-encoded output does make sense.
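
Easy to check on any JPEG or PNG lying around (file name made up):

    wc -c photo.jpg
    gzip -c photo.jpg | wc -c    # usually about the same size, sometimes slightly larger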


Theoretically base64+gzip and gzip should be almost equivalent. For more aggressive compressors they should be equivalent.


Well they will never be equivalent since the compressor has to learn and encode the set of 64 characters used, and passing along that information has some cost. In practice, that cost is either sending the probabilities up front in a compressor that does that, or the ramp up cost of using the wrong probabilities for a compressor that just uses the existing frequencies as the implicit probabilities.

Otherwise we could pass along some information "for free" in any base-64 encoding scheme by choosing some set of 64 characters (there are lots of choices and whichever set we choose encodes a message), encoding the original message with it, and then compressing it back to the original size - leading to "infinite" compression.

Other reasons base64 can't be compressed exactly back to its original size include the presence of arbitrary newlines and trailing padding characters.

This doesn't matter in practice for compressing one large base64-encoded file, where the overheads go to approximately zero, but for compressing a larger non-base64 file (e.g., an HTML file) that contains embedded base64 chunks it is actually a real problem: the compressor has to delimit the base64-encoded regions and somehow communicate the new symbol probabilities for each region.


paging BeeOnRope. Can you email me? address in profile. you answered a question i had a while ago and wanted to follow up. thx


Back in the email days we used to argue about the cost of enforcing a line wrap under 80 characters, without really discussing how eliding the \r\n embedded in PEM or Base64 encodings would make it a non-issue.

=Endmarker


That reminds me of when someone asked if strings that are base64-encoded twice are regular. They are.

https://stackoverflow.com/questions/49650847/determine-if-st...


What makes base64 encoded random data not regular?


Nothing; the top answer in the link covers that question trivially. The question of whether twice-encoded strings are regular is simply the more interesting one, as it is not as trivial.
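
For reference, one common regular expression for (single-encoded, padded) base64, which you can try with grep (a sketch, not a proof):

    echo 'aGVsbG8gd29ybGQ=' | grep -E '^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$'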



> Privacy-wise, base64 encoding can have benefits since it hides the content you access in larger encrypted bundles.

Uh, no.


The author is assuming a certain context here: that we are interested in the specific problem of transferring binary resources to a web browser. I appreciate in isolation this sentence seems incorrect, but if we assume good faith on the part of the author it's clear what they're trying to get at.

They are using 'base64 encoded' as shorthand for 'as an embedded base64 blob inside an HTML page', to distinguish it from 'as an individually requested resource'.

So what they are referring to is that if you embed resources as base64 blobs inside HTML pages and transfer those over SSL, an observer sees one large encrypted bundle being requested. If you transfer the resources as separate response entities, then an observer sees the large HTML bundle request, followed by the series of specific resources, and they can infer from the sizes and patterns of those requests some things about the page requested or the resources used.

(for example, if I know the size of the HTML and every image on wikipedia, perhaps by observing the set of sizes of resources being downloaded by a client over HTTPS I can determine which wikipedia pages a client is browsing?)


Based on your selected quote and short comment I think you're reading that differently than the author intended. Note the start of that paragraph (which you left out):

> In some instances, base64 encoding might even improve performance, because it avoids the need for distinct server requests.

I.e. they are arguing that inlining all resources and grabbing them in a single request has a smaller fingerprint. This is probably less true with HTTP/2 or QUIC.


Yes, but it's not encrypted.

Furthermore, the article begins with the premise of using text-only data transfer protocols such as MIME, then goes on to talk about base64-encoding and then gzipping the data. If he were continuing to talk about text-only data transfer, he should've talked about gzipping first and then base64-encoding the data; if he were talking about reducing the size of the data, he should've been talking about gzipping instead of base64-encoding it. Instead he seems to be talking about something that isn't compatible with MIME, so the article doesn't really have a clear direction. If the point was to save a single round trip to the server, then congratulations -- you've increased your page size by 2.5% and added the overhead of compression to the request, a much higher toll than the cost of a second request for the kind of non-trivial assets gzip would be effective on, since compression has its own overhead.


> Yes, but it's not encrypted.

Would your opinion change if you reread the content with the understanding that HTTPS is assumed when talking about web privacy in 2019? The author made no claim whatsoever that base64 was encrypting the data, just that it bundles it into one generic request.

> Furthermore the article begins with the premise of using text-only data transfer protocols such as MIME, then goes on to talk about base64-encoding then gzipping data...

Would your opinion change if you reread the content with the assumption the author means to set "Content-Encoding: gzip" in the server config rather than literally gzipping the content?

I think both of these are fair assumptions for the article's target audience to make, but your disagreements with the article only hold without them.


> Would your opinion change if you reread the content with the understanding HTTPS is assumed when talking about web privacy in 2019?

Well he began by talking about email, so no. If we want to talk about HTTPS and 2019, then let's serve the whole shebang with HTTP/2 and not worry about reducing the number of requests, which seems to be the only advantage offered.

> Would your opinion change if you reread the content with the assumption

That's the assumption I was forced to make, and the crux of my argument. Content-Encoding: gzip works in most servers by compressing the content on the fly -- not by precompressing it. Hence my comment about adding the overhead of compression to the request.
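
You can at least confirm what the server is sending from the response headers (example.com is a placeholder):

    curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' https://example.com/ | grep -i '^content-encoding'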


Another gem is the practice of seeking the minimal encoded instance which is correctly identified as "legal" by magic. People do this for a.out, JPG, PNG, &c.


The site is down, but the answer is 33% (four output bytes for every three input bytes).

It could be slightly less if you remove the padding and recalculate it on the decoding side from base64string.length % 4.

Edit: If no entropy is added, it's still ~33%. C'mon :P
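
A minimal sketch of restoring stripped padding before decoding (bash and GNU coreutils base64 assumed; "aGVsbG8" is "hello" with its trailing '=' removed):

    s="aGVsbG8"
    pad=$(( (4 - ${#s} % 4) % 4 ))     # number of '=' characters to restore (0, 1, or 2)
    eq="=="
    printf '%s\n' "${s}${eq:0:$pad}" | base64 -d    # prints: hello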


When it's up, you'll see the article goes a bit beyond a literal reading of the title: the topic is the overhead of base64 under the assumption that the output ends up compressed with something like gzip.


Figured as much. Interested to see the results. My assumption is that it doesn't change much. However, I suppose that depends on the compression algorithm.


In principle a good byte-wise entropy coder should recover nearly 100% of the base-64 "inflation" since no entropy is added. In practice gzip doesn't get all of it since it is an imperfect entropy coder for several reasons.
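
Back-of-the-envelope for the ideal case: each base64 output byte takes one of 64 values, so it carries at most 6 of its 8 bits, and a perfect entropy coder cancels the 4/3 expansion exactly:

    \frac{4}{3} \times \frac{\log_2 64}{8} = \frac{4}{3} \times \frac{6}{8} = 1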


You are correct, but gzip is generally a happy medium between encode/decode cost and encoded size when transmitting over a fast network. For a 10 Mbps or greater link, if you use something like bzip2 or 7z, you'll likely spend more time encoding than transmitting the payload.

This is from my personal experience utilizing gzip and bzip2 via Boost Iostreams to compress network payloads over a 1Gbps link. End to end latency was far superior with gzip than bzip2, despite bzip2 having a smaller transmission size.


gzip is only a happy medium if your other candidates are 7z (LZMA) or bzip2, which are both stronger but slower compressors. bzip2 is essentially obsolete (off the Pareto frontier), and LZMA is good but slow, so it will only be best if your transmission speed is low.

Near the space/time tradeoff point that gzip lives, however, it is thoroughly outclassed by more modern compressors such as zstd or brotli with the appropriate settings.
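
A quick way to compare them in gzip's speed class (sample file made up; assumes zstd and brotli are installed):

    gzip -6 -c data.bin | wc -c
    zstd -3 -c data.bin | wc -c
    brotli -q 5 -c data.bin | wc -c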


From personal experimentation, a gzipped base64 jpeg is close to the same size as the jpeg.

So if you want to efficiently store some emails, it doesn't really matter whether you decode and store attachments separately or just compress the raw text. (Although decoding and separating allows deduping.)


This article compares base64 with gzipped base64 and posits that there isn't too much difference in size. That's not terribly insightful. The only reason base64 exists is the lack of standardized binary distribution formats, especially over the internet. There is literally no substantive difference between data and base64 data. I'm actually quite surprised that gzipped data is 2% larger; I'd have expected a smaller margin.





