It's just a standard vision convnet like ResNet-18 or ResNet-50. It gets fed facecam frames with an audio spectrogram concatenated to them (pretty hacky, but it seems to help). All it does is binary prediction of {interesting, not interesting}, and I use some heuristics to pick regions of video based on how many frames the model labeled "interesting"; a rough sketch of both pieces follows.
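For concreteness, here's a minimal PyTorch sketch of that setup. The `InterestingnessNet` name, the 4-channel stem, and the idea of resizing the spectrogram to the frame's spatial size and stacking it as an extra channel are my assumptions for illustration; the actual concatenation in the repo may work differently.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class InterestingnessNet(nn.Module):
    """Binary {interesting, not interesting} classifier over facecam + audio.

    Assumption: the spectrogram is resized to the frame's spatial size and
    stacked as a fourth input channel. The real repo may concatenate the
    audio features differently.
    """
    def __init__(self):
        super().__init__()
        self.backbone = resnet18(num_classes=2)
        # Widen the stem from 3 channels (RGB) to 4 (RGB + spectrogram).
        self.backbone.conv1 = nn.Conv2d(
            4, 64, kernel_size=7, stride=2, padding=3, bias=False
        )

    def forward(self, frame, spectrogram):
        # frame: (N, 3, H, W); spectrogram: (N, 1, H, W), already resized
        x = torch.cat([frame, spectrogram], dim=1)  # (N, 4, H, W)
        return self.backbone(x)  # (N, 2) logits
```

The region-picking heuristic might look something like the sliding window below: count "interesting" frames over a window of per-frame predictions and keep spans where enough of them fire. The window size and threshold here are made-up placeholders, not the repo's actual values.

```python
def pick_regions(labels, window=60, min_hits=45):
    """Return (start, end) frame spans where at least `min_hits` of any
    `window` consecutive frames were classified "interesting".

    labels: list of 0/1 per-frame predictions. Thresholds are placeholders.
    """
    regions = []
    start = None
    hits = sum(labels[:window])
    for i in range(len(labels) - window + 1):
        if i > 0:
            # Slide the window right by one frame.
            hits += labels[i + window - 1] - labels[i - 1]
        if hits >= min_hits:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i + window - 1))
            start = None
    if start is not None:
        regions.append((start, len(labels)))
    return regions
```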
Feel free to take a look at the (research-quality at best) code: https://github.com/eqy/autotosis