It's much easier for a computer to recognize audio than video content. I'm sure in a couple of years, YouTube will automatically flag violent terrorist propaganda.
Not just any audio, but known audio. Finding known video is something they can already do too. That's why some people upload clips from their favorite shows flipped horizontally to avoid automatic detection. Think of it like a hash of the audio/video portion: it's pretty easy to check whether that hash is already in the database.
I'm surprised whatever feature detection they use isn't robust against trivial changes like that. Rotation, flipping, and scaling/cropping should all result in the same hash.
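To make that concrete, here is a minimal sketch of one way a fingerprint could be made indifferent to a horizontal flip. It is not how YouTube or any real system works. The toy "average hash", the helper names (`average_hash`, `flip_horizontal`, `flip_invariant_hash`), and the 4x4 grayscale frame are all made up for illustration. The idea is just to fingerprint both orientations and keep a canonical one.

```python
def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, 1 if brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def flip_horizontal(pixels):
    """Mirror each row left to right."""
    return [row[::-1] for row in pixels]


def flip_invariant_hash(pixels):
    """Hash both orientations and keep the smaller tuple as the canonical one.

    A frame and its mirror image produce the same pair of hashes,
    so they end up with the same canonical fingerprint.
    """
    return min(average_hash(pixels), average_hash(flip_horizontal(pixels)))


# Hypothetical 4x4 grayscale frame: dark on the left, bright on the right.
frame = [
    [10, 20, 200, 220],
    [15, 25, 210, 230],
    [12, 22, 205, 225],
    [11, 21, 201, 221],
]
mirrored = flip_horizontal(frame)

plain_hashes_match = average_hash(frame) == average_hash(mirrored)
invariant_hashes_match = flip_invariant_hash(frame) == flip_invariant_hash(mirrored)
```

With this frame the plain hashes differ while the flip-invariant ones match. The same trick extends to a small set of rotations, at the cost of computing one hash per variant. Scaling and cropping need a different approach.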
We only brushed over how hashes work in college, so I may be totally ignorant, but wouldn't rearranging the order of the data (rotate, flip, etc.) necessarily change the resulting hash?
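For an ordinary cryptographic hash the answer is yes, and a quick sketch shows it. This assumes SHA-256 from Python's standard `hashlib` and a made-up four-byte "row of pixels". A content fingerprint used for matching media is a different kind of function, so this only illustrates the college-style hash.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of a standard cryptographic hash."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical row of four grayscale pixel values, and its mirror image.
row = bytes([10, 20, 200, 220])
flipped = row[::-1]

# Same bytes, different order...
same_content = sorted(row) == sorted(flipped)
# ...but the digests do not match, because byte order is part of the input.
digests_match = sha256_hex(row) == sha256_hex(flipped)
```

So a flipped clip would slip past a plain cryptographic hash. Any robustness to flips has to come from how the fingerprint itself is designed.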