What is the space overhead of Base64 encoding?

Jaruzel · on Jan 31, 2019

I think the angle he's going for here is that if you have inline base64 blobs in your html/css code, and that's then served by a HTTP(S) server using gzip compression, are you wasting lots of space/bandwidth? and the answer is +5% to +2.5% overhead, which isn't much considering the larger advantage of a single round trip to the server as all the assets are embedded in the html document.

Of course there's also the overhead of decoding all those base64 blobs in the browser, but I'm sure that's a topic for another blog post in the future :)

_petronius · on Jan 31, 2019

For the first load of the page, sure. But if you're embedding HTML and CSS, then you can't cache it separately, which means having to fetch it all on every request where anything has changed, in addition to the decoding etc on every request.

masklinn · on Jan 31, 2019

That can also be an advantage, if the connection has a high throughput but a high latency (e.g. cellular networks, still), you'd much rather fetch everything in a single request. There's a slight inefficiency in the inability to independently cache assets, but even that's not necessarily a big issue: you'd mostly inline small assets.

skohan · on Jan 31, 2019

It still feels like a hack to work around a problem that should be solved at the protocol level. There's no technical reason that multiple assets couldn't be streamed over a single TCP connection. If it's a common use-case that a number of assets need to be loaded at once to display a web page, then this should be supported by HTTP.

jenscow · on Jan 31, 2019

Great idea. Let's call it HTTP/2.

https://en.wikipedia.org/wiki/HTTP/2

TheCycoONE · on Jan 31, 2019

You can bump that scenario down a layer to http2 push and get the best of both worlds.

danesparza · on Jan 31, 2019

Yes. I think it's a bit ironic the OP didn't mention the CPU overhead to essentially go from

binary -> ASCII -> binary

NelsonMinar · on Jan 31, 2019

I wonder if gzip has improved? 15 years ago when I tested this, it was more efficient to gzip base16 encoded data than base64 encoded data. (At least, the English dictionary.) I assume that was because the 3:4 encoding broke up patterns in the source text and messed with the compressor.

But I just tested this again and it's not true anymore.

  $ wc -c american-english  
  971578 american-english
  
  $ gzip < american-english | wc -c  
  259977

  $ base64 < american-english | gzip | wc -c  
  429263

  $ hexdump -e '"%x"' < american-english | gzip | wc -c  
  478411

kbaker · on Jan 31, 2019

Actually, it is still true. hexdump has some confusing format options... what you were using was converting it to little-endian first before printing the hex representation, which really messed with gzip. Try this:

    $ hexdump -e '"%x"' < american-english | gzip | wc -c
    463871
    $ hexdump -v -e '/1 "%02x"' < american-english | gzip | wc -c
    302515
    $ base64 < american-english | gzip | wc -c
    415737

eadmund · on Jan 31, 2019

> converting it to little-endian first

Ack, is there anything little-endian doesn't ruin?

/me misses 68000 assembler …

NelsonMinar · on Jan 31, 2019

Thanks! That'll teach me to cut and paste from StackExchange. Also why I got different results last time I looked at this; was using a different base 16 encoder.

tln · on Jan 31, 2019

If you just compress the base64, the overhead is 5%. But if you embed within an HTML file, with a different set of character frequencies, that increases.

    $ cat index.html | gzip -9 | wc -c
    14116
    $ cat index.html bing.base64 | gzip -9 | wc -c
    15994
    $ expr 15994 - 14116
    1878
    $ cat bing.base64 | gzip -9 | wc -c
    1432

tinus_hn · on Jan 31, 2019

If you’re going to do an experiment like that you will probably want the file to be a whole lot bigger for more accurate results.

kstenerud · on Jan 31, 2019

I wrote some revised radix-based text encoding specifications to make them interoperate better with modern text processor standards (SGML, string literals, filenames, URIs, etc). I also included a representative test for how they fare on uncompressed vs pre-compressed data when compressed with gzip:

https://github.com/kstenerud/safe-encoding/blob/master/READM...

radix-64 of course fares the best, but radix-85 isn't much different, and is 10% smaller uncompressed.

hrldcpr · on Jan 30, 2019

The article compares raw vs base64+gzip, but I'd be interested to see gzip vs base64+gzip

leetbulb · on Jan 30, 2019

Did some further investigation:

i included some text files, hex encoding, and other compression as well :)

raw file list: https://hastebin.com/vazonowuvo.txt

tsv (c/p to spreadsheet: https://hastebin.com/ewohafucem.tsv

---

(interesting) using bzip2, compression is better when the following file are encoded first with base64 or hex: bing.png googlelogo.png peppers_color.jpg

Useless takeaways:

- prefer base64 over hex when encoding already compressed images before further compression

- prefer hex over base64 when encoding plain text / low entropy data before further compression

spurgu · on Jan 30, 2019

Heh, so gzipped png's are generally larger than non-gzipped. Samples are prob. heavily compressed though.

zrm · on Jan 31, 2019

Not just that, PNG and gzip use the same compression algorithm:

https://en.wikipedia.org/wiki/DEFLATE

qqqqqqqqqqqq23 · on Jan 30, 2019

The formats he uses as raw are already compressed by default, gzipping them wouldn't help. Gzipping the new encoding makes sense.

emilfihlman · on Jan 30, 2019

Theoretically base64+gzip and gzip should be almost equivalent. For more aggressive compressors they should be equivalent.

BeeOnRope · on Jan 31, 2019

Well they will never be equivalent since the compressor has to learn and encode the set of 64 characters used, and passing along that information has some cost. In practice, that cost is either sending the probabilities up front in a compressor that does that, or the ramp up cost of using the wrong probabilities for a compressor that just uses the existing frequencies as the implicit probabilities.

Otherwise we could pass along some information "for free" in any base-64 encoding scheme by choosing some set of 64 characters (there are lots of choices and whichever set we choose encodes a message), encoding the original message with it, and then compressing it back to the original size - leading to "infinite" compression.

Other reasons base64 can't be compressed exactly back to its original form include the presence of arbitrary newlines, and trailing padding characters.

This doesn't matter in practice for compressing one large base64 encoded file where the overheads go to approximately zero, but for compressing a larger non-base64 file (e.g., HTML file) that contains embedded base64 chunks it is actually a real problem the compressor has to delimit the base64 encoded regions and communicate the new symbol probabilities somehow for each region.

jnordwick · on Feb 1, 2019

paging BeeOnRope. Can you email me? address in profile. you answered a question i had a while ago and wanted to follow up. thx

ggm · on Jan 30, 2019

We used to argue about the cost of enforcing a linewrap under 80 in email days, without really discussing how elision of the \r\n embeds in the PEM encoded or Base64 would make this a non-issue.

=Endmarker

sshine · on Jan 30, 2019

That reminds me of when someone asked if strings that are base64-encoded twice are regular. They are.

https://stackoverflow.com/questions/49650847/determine-if-st...

tedunangst · on Jan 30, 2019

What makes base64 encoded random data not regular?

zamadatix · on Jan 30, 2019

Nothing, the top answer in the link covers that question trivially. The question on if twice encoded strings are regular is simply the more interesting one as it is not as trivial.

aboutruby · on Jan 31, 2019

From archive.org: https://web.archive.org/web/20190131000036/https://lemire.me...

bdhess · on Jan 31, 2019

> Privacy-wise, base64 encoding can have benefits since it hides the content you access in larger encrypted bundles.

Uh, no.

jameshart · on Jan 31, 2019

The author is assuming a certain context here: that we are interested in the specific problem of transferring binary resources to a web browser. I appreciate in isolation this sentence seems incorrect, but if we assume good faith on the part of the author it's clear what they're trying to get at.

They are using 'base64 encoded' as a shorthand for 'as an embedded base64 blob inside an HTML page.' - to distinguish it from 'as an individually requested resource'.

So what they are referring to is that by embedding resources as base64 blobs inside HTML pages, and transferring those over SSL, an observer sees one large encrypted bundle being requested. If you transfer the resources as separate response entities, then an observer sees the large HTML bundle request, followed by the series of specific resources - and they can infer from the sizes and patterns of those requests some things about the page requested or the resources used.

(for example, if I know the size of the HTML and every image on wikipedia, perhaps by observing the set of sizes of resources being downloaded by a client over HTTPS I can determine which wikipedia pages a client is browsing?)

zamadatix · on Jan 31, 2019

Based on your selected quote and short comment I think you're reading that differently than the author intended. Note the start of that paragraph (which you left out):

> In some instances, base64 encoding might even improve performance, because it avoids the need for distinct server requests.

I.e. they are arguing inlining all resources and grabbing them in a single request has a smaller fingerprint. This is probably less true with HTTP/2 or QUIC

maxk42 · on Jan 31, 2019

Yes, but it's not encrypted.

Furthermore the article begins with the premise of using text-only data transfer protocols such as MIME, then goes on to talk about base64-encoding then gzipping data. If he were to continue talking about text-only data transfer, then he should've talked about gzipping, then base64-encoding the data or if he were talking about reducing the size of the data he should've been talking about gzipping instead of base64-encoding the data. Instead he seems to be talking about something which isn't compatible with MIME. So the article doesn't really seem to have a clear direction. If the point was to cut down on a single round-trip to the server then congratulations -- you've increased your page size by 2.5% and added the overhead of compression to the request -- a much higher toll than the cost of a 2nd request for non-trivial assets of the sort gzip would be effective for since it has its own overhead requirements.

zamadatix · on Jan 31, 2019

> Yes, but it's not encrypted.

Would your opinion change if you reread the content with the understanding HTTPS is assumed when talking about web privacy in 2019? The author made no claim whatsoever base64 was encrypting the data, just bundling it into one generic request.

> Furthermore the article begins with the premise of using text-only data transfer protocols such as MIME, then goes on to talk about base64-encoding then gzipping data...

Would your opinion change if you reread the content with the assumption the author means to set "Content-Encoding: gzip" in the server config rather than literally gzipping the content?

I think both of these are fair assumptions for the target audience of the article to assume but your disagreements with the article are only true without them.

maxk42 · on Jan 31, 2019

> Would your opinion change if you reread the content with the understanding HTTPS is assumed when talking about web privacy in 2019?

Well he began by talking about email, so no. If we want to talk about HTTPS and 2019, then let's serve the whole shebang with HTTP/2 and not worry about reducing the number of requests, which seems to be the only advantage offered.

> Would your opinion change if you reread the content with the assumption

That's the assumption I was forced to make, and the crux of my argument. Content-Encoding: gzip works in most servers by compressing the content on the fly -- not by precompiling. Hence my comment about adding the overhead of compression to the request.

ggm · on Jan 30, 2019

Another gem, is the practice of seeking the minimal encoded instance which is correctly identified as "legal" by magic. People do this for a.out, JPG, PNG &c..

leetbulb · on Jan 30, 2019

The site is down, but the answer is 33%. (four output bytes for every three input bytes).

It could be slightly less if you want to remove the padding and calculate what the padding should be on the decoding side: base64string.length % 4.

Edit: If no entropy is added, it's still ~33%. C'mon :P

BeeOnRope · on Jan 30, 2019

When it's up, you'll see the article goes a bit beyond a literal reading of the title: the topic is the overhead of base64 under the assumption that the output ends up compressed with something like gzip.

leetbulb · on Jan 30, 2019

Figured as much. Interested to see the results. My assumption is that it doesn't change much. However, I suppose that depends on the compression algorithm.

BeeOnRope · on Jan 30, 2019

In principle a good byte-wise entropy coder should recover nearly 100% of the base-64 "inflation" since no entropy is added. In practice gzip doesn't get all of it since it is an imperfect entropy coder for several reasons.

hermitdev · on Jan 30, 2019

You are correct, but gzip is a generally happy medium of cost to encode/decode and encoded size when transmitting over a fast network. For a 10 Mbps or greater link, if you use something like bzip2 or 7z, you'll likely spend more time encoding than transmitting a payload.

This is from my personal experience utilizing gzip and bzip2 via Boost Iostreams to compress network payloads over a 1Gbps link. End to end latency was far superior with gzip than bzip2, despite bzip2 having a smaller transmission size.

BeeOnRope · on Jan 31, 2019

gzip is only a happy medium if your other candidates are 7z (LZMA) or bzip2, which are both stronger but slower compressors. bzip2 is essentially obsolete (off the pareto frontier), and LZMA is good but slow so will only be best if your transmission speed is low.

Near the space/time tradeoff point that gzip lives, however, it is thoroughly outclassed by more modern compressors such as zstd or brotli with the appropriate settings.

tedunangst · on Jan 30, 2019

From personal experimentation, a gzipped base64 jpeg is close to the same size as the jpeg.

So if you want to efficiently store some emails, it doesn't really matter whether you decode and store attachments separately or just compress the raw text. (Although decoding and separating allows deduping.)

speedplane · on Jan 31, 2019

This article compares base64 with gzipped base64 and posits that there isn't too much difference in size. That's not terribly insightful. The only reason base64 exists is because of a lack of standardized binary distribution formats, especially over the internet. There is literally no substantive difference between data, and base64 data. I'm actually quite surprised that gzipped data is 2% larger, I'd expect a smaller margin.