DeepZip: Lossless Compression Using Recurrent Networks [pdf] (stanford.edu)
153 points by stelonix on Dec 29, 2017 | 17 comments



I have a suspicion that in the future computers will use a vaguely DNN-like approach for compression, but I haven't seen any super compelling examples of this yet. I can say "imagine a photo taken at 2000 feet and 2 miles away of the Eiffel Tower with a beautiful orange sunset in the background", and your mind can use its internal representation to render something along those lines. I feel like computers should some day be able to do the same thing, except that with another kilobyte of constraints I should be able to make the final result really close to a particular photo that matches that description.


Though it has little bearing on your point, many people, including myself, do not possess the ability to "imagine a photo". It is known as aphantasia, and Blake Ross has a great article[1] on the subject.

1. https://www.facebook.com/notes/blake-ross/aphantasia-how-it-...


Well, in an even more distant future, a more developed version of the kind of neural networks that the GP describes could be hooked up to your brain to provide this functionality.


There was another submission recently that kind of does what you described. Click on the visualizations to see the iterations: https://dmitryulyanov.github.io/deep_image_prior. The ELI5 is that prior knowledge of what images look like can be used to reconstruct corrupted images.

For non-images, this idea is pretty old and sounds a lot like compression algorithms that share a pre-defined dictionary. For example, Brotli (https://en.wikipedia.org/wiki/Brotli) "improved the compression ratio by using a pre-defined dictionary of frequently-used words and phrases." Of course, words/phrases are a lot easier to predefine.
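
To make the shared-dictionary idea concrete, here's a minimal sketch using zlib's preset-dictionary support (zlib rather than Brotli only because Python's zlib module exposes this directly via zdict; the dictionary and data below are made up):

    import zlib

    # A dictionary of likely substrings agreed on out-of-band; it is never
    # transmitted, both sides just have to use the same one.
    shared_dict = b"pre-defined dictionary of frequently-used words and phrases"

    data = b"a pre-defined dictionary helps when the data reuses frequently-used phrases"

    comp = zlib.compressobj(zdict=shared_dict)
    compressed = comp.compress(data) + comp.flush()

    # The decompressor must be built with the exact same dictionary.
    decomp = zlib.decompressobj(zdict=shared_dict)
    assert decomp.decompress(compressed) == data

    print(len(compressed), "bytes with the dictionary,",
          len(zlib.compress(data)), "without")

The NN analogue would be agreeing on (or shipping once, out-of-band) the predictive model instead of the dictionary.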

I wonder how large the predefined weights of the NN would have to be to effectively compress real-world images?


On the one hand, DeepZip does neural net based compression.

On the other hand, in "The Case for Learned Index Structures" [1] paper, they replace B-Trees with neural nets:

> by using neural nets we are able to outperform cache optimized B-Trees by up to 70% in speed while saving an order of magnitude in memory over several real world data sets

Basically, ML/DL is coming for the classic data structures, replacing some of them with learning-based versions.

[1] https://arxiv.org/abs/1712.01208
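
To illustrate the learned-index idea with a toy (nothing like the paper's actual recursive model hierarchy, just a sketch of the principle): fit a model from key to position in the sorted array, remember its worst-case error at build time, and then only binary-search inside that error window at lookup time.

    import bisect
    import numpy as np

    keys = np.sort(np.random.lognormal(size=100_000))
    pos = np.arange(len(keys))

    # The "model": a least-squares line approximating the CDF of the keys.
    # The paper uses small neural nets here instead.
    slope, intercept = np.polyfit(keys, pos, 1)

    # Worst-case prediction error, computed once when the index is built.
    max_err = int(np.max(np.abs(slope * keys + intercept - pos))) + 1

    def lookup(k):
        p = int(np.clip(slope * k + intercept, 0, len(keys) - 1))
        lo, hi = max(0, p - max_err), min(len(keys), p + max_err + 1)
        # Ordinary binary search, but only within the guaranteed error window.
        i = lo + bisect.bisect_left(keys[lo:hi], k)
        return i if i < len(keys) and keys[i] == k else None

    assert lookup(keys[1234]) == 1234

The claim in the paper is that the model plus the window search can be both smaller and faster than a cache-optimized B-Tree over the same keys.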


Can someone who knows more about neural networks tell me how this works? Apparently the model's weights don't need to be sent over the network; the only thing shared is a seed used to initialize the model with random weights. While the network encodes data, it simultaneously updates its weights. To decode, you can somehow run the same process and magically get back the original data?

The paper doesn't explain how this is possible, how would I go about implementing something like this?


The naive solution would be to split the input data into batches. Compress the first batch using a randomly initialized model, then train a new model on it. Compress the second batch using the model trained on the first batch, and so on. Depending on the structure of your data, you can use a model trained on all the previous batches (or just the N previous batches) when compressing the i-th batch.

This can likely be optimized (e.g. 7-Zip's arithmetic coder doesn't split data into batches, but rather uses exponentially filtered conditional probabilities that are updated immediately each time a new sample is processed). I'm no expert in RNNs, but my hunch is that you can adjust the model incrementally with backpropagation right after it processes each sample instead of completely retraining it.
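
A rough sketch of that symmetric setup, with a toy order-0 adaptive byte model standing in for the RNN and the coder itself replaced by the ideal code length -log2(p) that an arithmetic coder approaches:

    import math
    import random

    def adaptive_codelength(data, seed=42):
        # Both encoder and decoder build the initial model from the same
        # shared seed, so no model parameters are ever transmitted
        # (mirroring DeepZip's "only the random seed is stored" setup,
        # with a count table instead of RNN weights).
        rng = random.Random(seed)
        counts = [rng.randint(1, 4) for _ in range(256)]
        bits = 0.0
        for b in data:
            p = counts[b] / sum(counts)
            bits += -math.log2(p)   # ideal cost of coding b under the model
            # Update only *after* the symbol is coded: the decoder has
            # already recovered b at this point, so it applies the exact
            # same update and the two models never diverge.
            counts[b] += 1
        return bits

    data = b"abracadabra " * 1000
    print(adaptive_codelength(data) / 8, "bytes vs", len(data), "raw")

My reading of the paper is that the RNN version works the same way: the network's predicted distribution over the next symbol feeds the arithmetic coder, and the weights get a training step on the symbols seen so far before moving on.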


Interesting.

So, to decode you would start with the unoptimized model and decode the first batch. You then train using that first batch to get a better model and use your improved model to decode the next batch.

Okay, that makes sense; thanks for the response!


"We know that Recurrent Neural Networks (LSTM/GRU) based models are good at capturing long term dependencies"

This seems like a bold statement. We know they are good at capturing some structure from sequential data at least, but they seem lacking in some regards [0].

However, it seems to yield some interesting results. The architecture is like an autoencoder, but with RNNs connected at each layer? Have you tried using NTMs?

[0]: https://rylanschaeffer.github.io/content/research/neural_tur... (section on LSTM Copy Performance)


Does anyone have a link to a good replacement resource for the 2nd source? The link is broken.

[2] EE376C Lecture Notes: Universal Schemes in Information Theory: http://web.stanford.edu/class/ee376c/lecturenotes/intro_lect...


Why aren’t they comparing it to parq or even bzip?


Because the author compares to the best domain-specific compressors (by compression ratio) for each dataset instead. For text, he used CMIX[1]. For DNA sequences, he used MFCompress[2].

[1] http://mattmahoney.net/dc/text.html

[2] http://bioinformatics.ua.pt/software/mfcompress/


nit: an abstract should describe the obtained result


Correct me if I am wrong, but it seems they haven't counted the network storage.


Quoting: " If the model is trained for more than an epoch, the weights of the RNN Estimator Block also need to be stored, and would count in the compression size (MDL Principle [14])"


After another read: they are only storing the seed and not the weights, so the entire data is needed for decoding, and the compressed format cannot be transferred without all the data.

> Only random seed is Stored: The model is initialized using a stored random seed, which is also communicated to the decoder. Thus, the effectively the model weights do not contribute to the total compressed data size

> Single Pass: As we do not store the weights of the model, we perform a single pass (1 epoch) through the data


Could the Weissman score be higher than middle-out methods?



