CELT: Next-generation low-latency audio codec from xiph.org

Groxx · on Dec 24, 2010

Having listened to the 64kbps comparisons of the audio on the page (that's a particularly cruel sample set; is it a common one?):

Wow. The worst audio chunks on CELT are significantly better than the worst on the other codecs. Overall, I hear a bit of a loss on the low-end of wider sounds (the one that pops out to my ear is the entry starting at 0:37), but for such an incredible improvement in the high ranges that's a wonderful price to pay. Still a decent bit of that "anti-pre-echo", which I despise, but generally less than the others.

All in all: an epic improvement over the other samples, especially when taken across the board. All the other encoders had huge distortion on one or more of the sound bites (especially the third, 0:22); CELT had very little, even at its worst. Phenomenal work.

/listens to 32kbps. shall update!

edit: yep, it hurts most on that 0:37 entry. Unfortunately, I put it as noticeably worse than all but the AAC-LC (but that one's horrible in everything) at 32kbps. It does handle the voice at the end very well, though. That, with the latency, means it'd probably be great for voice use.

/tries 48; maybe it survives at the middle?

edit2: yeah, about middle. Almost catches up to the HE-AAC entries, maybe passes Vorbis.

Voice-only or required-low-latency aside (both important qualities), I think I'm not going to use or recommend this for sub-64kbps, though it could of course improve yet. Vorbis and HE-AACs beat it. At 64 though, maybe above, or for respectable quality telepresence, it sounds like a clear winner.

nullc · on Dec 25, 2010

FWIW, mono at 32kbit/sec is fairly close to stereo at 64. (Of course, there is some joint coding gain from stereo, but stereo's worst case is just as bad as 2x mono, and it's often the worst cases that limit the quality.)

For example, mono speech at 32kbit/sec:
https://people.xiph.org/~greg/celt/spec.orig.wav
https://people.xiph.org/~greg/celt/spec.MP3.wav
https://people.xiph.org/~greg/celt/spec.vorbis.wav
https://people.xiph.org/~greg/celt/spec.CELT.wav

The SILK+CELT hybrid does better still, but even CELT alone is still fairly useful at lower rates.

anigbrowl · on Dec 24, 2010

Great work and great documentation - as usual from Xiph. CELT looks like a big step forward for real time internet audio. Sub-10ms latency is very impressive even if the quality is somewhat compromised - for many purposes it will be sufficient, and where it's not, lossless recordings can be transferred later.

The only thing that I wish they would change is the name 'Ogg' for their container format. It sounds like a character in a bad children's movie and I always feel slightly embarrassed introducing it into conversation.

InclinedPlane · on Dec 24, 2010

http://www.amazon.com/Secret-World-OG/dp/B000MX7UEY

>_>

anigbrowl · on Dec 24, 2010

cheald · on Dec 27, 2010

Mumble (http://mumble.sourceforge.net/) uses CELT, and it's amazing - when I first started using it, I was a little unsettled, because I could hear inflections and intonations in others' voices that I didn't realize were being stripped out by Speex.

CELT is an amazing product, and Mumble has built an amazing product on top of it. I'm pushing very hard to try to help Mumble topple Ventrilo/Skype, at least for gamers. I'm really excited to see how CELT has grown and changed over the last couple of years, and it's just getting better.

(Full disclosure: I run a Mumble hosting service. But that's mostly because I love the product so much.)

Groxx · on Dec 24, 2010

>low-bitrate performance ('sweet spot' >= 32kbps for 48kHz stereo)

That seems to be contradictory, or is it just me? It seems to imply the "sweet spot" (which I'm interpreting as the minimum good-sounding point) exists only at or above 32kbps, which is lowish, but not all that low. And why the ">"? Surely an upper bound is more useful to people interested in low bitrates, not a lower bound.

>flexible streaming with the ability to change most codec parameters mid-stream

Fantastic news. As to the rest... I'd love to understand all that. I'll have to read through with Wikipedia some time.

anigbrowl · on Dec 24, 2010

Sweet spot is more 'best bang for the buck' - you can go lower than that, but you'll have to sacrifice either latency or frequency resolution. Have a look at the comparison chart for some context: http://www.celt-codec.org/comparison/

32kbps is not the lowest possible, obviously, but it's still very very low. Like that's near-realtime encoding at quite high quality at a bitrate low enough to go over an old modem. Normal uncompressed 16-bit 48kHz stereo audio (the most popular yardstick since the establishment of DVD) is 1536 kbps. Remember that's kiloBITS per second, and two channels of 16-bit audio are taking up 32 bits per sample. At 32 kbps without any compression you'd be limited to a 1 kHz sample rate, which is about as smooth as a cheesegrater. Compressing audio in almost real time by a factor of 48 and still having it sound this good is astonishing, trust me. If you didn't do so already, find the picture of the spectrogram and look at the uncompressed and mp3 plots for a while. Then look at how much more faithful the CELT codec is to the source material. Their insight about maintaining energy of each coding band at unity is extremely impressive, one of the cleverest things I've seen in DSP since perceptual coding (which is what mp3 does).
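The bitrate figures above are easy to check by hand. A small sketch of the arithmetic (the function names are made up for illustration, and "kbps" is taken to mean 1000 bits per second):

```python
# Back-of-the-envelope check of the uncompressed-PCM numbers quoted above.
# Function and constant names are illustrative only.

SAMPLE_RATE_HZ = 48_000   # 48 kHz
BITS_PER_SAMPLE = 16      # 16-bit samples
CHANNELS = 2              # stereo
TARGET_BPS = 32_000       # the 32 kbps "sweet spot" under discussion


def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of raw PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def max_sample_rate_hz(bitrate_bps, bits_per_sample, channels):
    """Highest sample rate raw PCM could afford at a given bitrate."""
    return bitrate_bps / (bits_per_sample * channels)


uncompressed_bps = pcm_bitrate_bps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)

print(uncompressed_bps / 1000, "kbps uncompressed")
print(max_sample_rate_hz(TARGET_BPS, BITS_PER_SAMPLE, CHANNELS), "Hz without compression")
print(uncompressed_bps / TARGET_BPS, "x compression ratio")
```

This gives 1536 kbps for the raw stream, a 1 kHz sample rate if 32 kbps were spent on raw samples, and a ratio of 48 between the two.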

If digital signal processing and the like stimulates your intellectual curiosity then I urge you to learn more about it - it's a really interesting and very useful field of study, with all sorts of interesting applications and lots of territory still unexplored. The Scientist's and Engineer's Guide to DSP is a fairly basic introductory text, but has two massive advantages over all its competitors: it is available for free at the author's website, and it is extremely well written. Other books can tell you what you need to know. The DSP guide tells you why you need to know, and why the fundamental algorithms are so elegant. http://www.dspguide.com/

Groxx · on Dec 24, 2010

Comparing it to uncompressed audio is a bit of a red herring. MP3s can easily do 32kbps - they sound like crap from a musical standpoint, but they do it just fine. Heck, mp3s at 8kbps still sound significantly better than my cell phone - you could run 7 audio streams at the same time on dial-up with that quality. Similarly, uncompressed video is huge, when a high-quality h.264 pass will look almost identical with massive size savings. From the several charts, it looks like CELT could be a very large improvement over the encoders they compared against (I'm assuming a relevant sampling), especially where speed is concerned, and I'll definitely poke at it and see what I think.

I very much liked the spectrograms, that looks to be a massive improvement. I've got to test it on my good pair of headphones to see just what it sounds like.

Nearly everything stimulates my intellectual curiosity; this is high-ish on my list, but that might mean years yet. Many thanks for the suggestions on info though, I'll most certainly keep that handy!

nullc · on Dec 24, 2010

Exactly.

CELT at 2.5kbit/sec: http://myrandomnode.dyndns.org:8080/~gmaxwell/celt/16k_60ms_...

So sure, CELT can be coerced to run at obnoxiously low rates. It's far more flexible than MP3 in this regard: every frame size which is an integer number of bytes greater than 8 or so should more or less work (and it should use every bit effectively).

This doesn't mean that it'll actually be useful at very low rates. 32kbps is about the limit for 20ms frames where things really start to come apart and everything starts sounding pretty poor.
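To make the relationship between frame size in bytes, frame duration and bitrate concrete, here is a rough sketch. The helper names are hypothetical, and any container or packet overhead is ignored:

```python
# Relating a fixed per-frame byte budget to a bitrate.
# Helper names are hypothetical; transport overhead is not counted.


def bytes_per_frame(bitrate_bps, frame_ms):
    """Byte budget of one frame at the given bitrate and frame duration."""
    return bitrate_bps * frame_ms / 1000 / 8


def bitrate_bps(frame_bytes, frame_ms):
    """Bitrate implied by sending frame_bytes every frame_ms milliseconds."""
    return frame_bytes * 8 * 1000 / frame_ms


print(bytes_per_frame(32_000, 20))  # byte budget per 20ms frame at 32 kbps
print(bitrate_bps(8, 20))           # bitrate of an 8-byte frame every 20ms
```

At 32kbps a 20ms frame has 80 bytes to work with, while a frame near the "8 or so" byte floor at the same 20ms duration works out to about 3.2kbps.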

The CELT _decoder_ is more computationally complex than you might guess. We started the design in 2007, and the decoder is pretty comparable to MP3 or Vorbis decoding (though requiring a _lot_ less memory). Using substantially less CPU than that, if there was quality to be gained, would have been a sin. So although the overall design is nice and simple, we use a number of 'more optimal' techniques which are decent CPU sinks. E.g. everything is range coded (so we're not constrained to coding things with probabilities of the 1/2^n form) and we use high dimensionality vector quantization.
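The point about probabilities of the 1/2^n form can be illustrated with a toy calculation. This is not CELT's range coder, only the information-theoretic cost that a range coder can approach, compared with a code that must spend a whole number of bits per symbol. The example probabilities are made up:

```python
import math

# Toy illustration, not CELT's coder: ideal cost of a symbol with
# probability p is -log2(p) bits, which is only a whole number of bits
# when p has the form 1/2^n.


def ideal_bits(p):
    """Ideal (fractional) cost in bits of a symbol with probability p."""
    return -math.log2(p)


def entropy_bits(probs):
    """Average ideal cost per symbol for a source with these probabilities."""
    return sum(p * ideal_bits(p) for p in probs if p > 0)


probs = [0.9, 0.1]                 # hypothetical skewed two-symbol source
avg_ideal = entropy_bits(probs)    # what a range coder can get close to
avg_whole_bits = 1.0               # a two-symbol prefix code spends 1 bit each

print(f"ideal: {avg_ideal:.3f} bits/symbol, whole-bit code: {avg_whole_bits} bits/symbol")
```

For a source this skewed, the whole-bit code spends roughly twice what the ideal cost allows, which is the kind of waste range coding avoids.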

The encoder, however, doesn't need a psy-model at all, nor does the current one have much of a psy-model (it has some simple hacks for a couple of psycho-acoustic tweaks, but nothing too complex). Reasonable perceptual performance is implicit in the format. This means that a good CELT encoder can be much faster than, say, a good Vorbis encoder.

On the subject of DSP stimulating curiosity, check out http://www.xiph.org/video/

anigbrowl · on Dec 24, 2010

I kinda like the artifacts on that :-) You weren't kidding about the impact on pitch tracking; and yet it's remarkable how subtle nuances of timbre and intonation remain, particularly on the woman's voice. I'm still wrapping my head around the n-dimensional vector quantization - it seems to me like some of the techniques you have refined for this codec would have more general application for things like time stretching and noise fingerprinting.
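One way to picture the "energy of each band at unity" idea alongside the vector quantization is a gain-shape split: the band's energy is kept as one number and the remaining vector is normalized to unit length before it is quantized. The following is a rough sketch of that split only, with made-up names and values. It is not CELT's actual code and does no quantization:

```python
import math

# Gain-shape split of one band: energy goes in `gain`, direction in `shape`.
# Illustrative only; names and the example band are hypothetical.


def split_gain_shape(band):
    """Return (gain, shape) where shape has unit Euclidean norm."""
    gain = math.sqrt(sum(x * x for x in band))
    if gain == 0.0:
        return 0.0, list(band)  # silent band: nothing to normalize
    return gain, [x / gain for x in band]


def rebuild(gain, shape):
    """Recombine a gain and a unit-norm shape into band coefficients."""
    return [gain * x for x in shape]


band = [3.0, 4.0, 0.0, 0.0]        # made-up coefficients for one band
gain, shape = split_gain_shape(band)

print(gain, shape)
print(rebuild(gain, shape))
```

Because the gain is carried separately, an error in the shape changes where the energy sits inside the band but not how much energy the band has.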

Thanks for sharing your insights with us, as well as all the fine work you guys have been doing for many years. You're very generous. PS: I can't recommend that video highly enough - there are things in there that took me years to pick up the hard way, and the presentation is great too. Looking forward to more!

Groxx · on Dec 24, 2010

Nabbing the video right now, thanks for the link! I doubt I would've run across that one any time soon.

charlesdm · on Dec 24, 2010

Where/when can I get a library for this?

anigbrowl · on Dec 24, 2010

http://www.celt-codec.org/downloads/

chipsy · on Dec 24, 2010

The upper bound is going to change with the material, because the quality is most fragile at high frequencies - the place where the most data is needed. Speech uses a different frequency range from distorted guitar, and a different range from a synthesized bass tone. When you hear the metallic-swirl artifacts in low bitrate MP3s, it's always coming from the highs of cymbals and electric guitars.

With the lower bound recommendation, they are guaranteeing that you will affect _any_ material by going below 32kbps, which is a useful piece of data; one can optimize material for size by starting there and gradually moving upwards until the sound is as transparent as necessary.

DarkShikari · on Dec 24, 2010

>because the quality is most fragile at high frequencies - the place where the most data is needed

Funnily enough, for CELT, it's largely the opposite: high frequencies cost practically nothing to compress because of CELT's algorithm, whereas low frequencies are the main bandwidth-eater.

slug · on Dec 25, 2010

http://www.ekiga.org/ (ekiga / ubuntu maverick) comes with support for CELT, among others (Speex, etc.), although I didn't compare them for quality.