I am also curious about that limitation.
I tried the lib with a 10-minute audio file and it returned unrelated words, but when I tried a short 20-second clip, it was able to transcribe it properly. I think I have to dig deeper to find out what causes this behaviour.
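
One way to narrow this down might be to cut the long recording into 20-second pieces and transcribe each piece separately, to see whether the length is really the problem. Below is a rough sketch using only Python's standard `wave` module. It assumes the audio is an uncompressed WAV file; `split_wav` and the chunk file naming are hypothetical helpers made up for illustration and are not part of the lib's API.

```python
import wave


def split_wav(path, chunk_seconds=20):
    """Split a WAV file into consecutive chunks of at most chunk_seconds each.

    Hypothetical helper for testing: writes files named
    '<path>.chunk000.wav', '<path>.chunk001.wav', ... and returns their paths.
    """
    chunk_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        # Number of audio frames that make up one chunk.
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the recording
            chunk_path = f"{path}.chunk{index:03d}.wav"
            with wave.open(chunk_path, "wb") as dst:
                # Copy channels, sample width and rate from the source;
                # the frame count in the header is corrected when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Calling `split_wav("recording.wav")` (the file name is a placeholder) returns the list of chunk files, which can then be passed to the lib one at a time. If every chunk transcribes fine on its own, that would point at the input length rather than the audio content.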