Thanks for the write up! In the beginning you mention that you ran into some bugs in the Python version that would have been caught by the Haskell type checker. Can you go into more detail about what those bugs were?
The most serious one is that we were submitting jobs (as JSON) that were missing a few metadata fields. In Python we passed around dictionaries, and even though we had JSON schema validation in place, this slipped through. In Haskell, we define a record type and the corresponding serializers. It is more code, but what you get is that a job with a missing field cannot exist at runtime: it simply cannot be represented.
Also, the compiler refuses to compile your code if you make a typo in a field name.
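To make that concrete, here is a minimal sketch of the idea. It is not our actual code: the `Job` type, its field names, and the helpers `encodeJob` and `parseJob` are made up for illustration, and the encoder is a toy that only uses `base` (a real codebase would more likely use a JSON library).

```haskell
import Data.List (intercalate)
import Text.Read (readMaybe)

-- Hypothetical job record; the field names are illustrative only.
-- A Job value cannot be constructed without all three fields.
data Job = Job
  { jobId    :: Int
  , jobOwner :: String
  , jobQueue :: String
  } deriving (Eq, Show)

-- Toy serializer (not a full JSON encoder: `show` escaping only matches
-- JSON for plain ASCII strings). Misspelling an accessor, e.g. `jobOwnr`,
-- is a compile error instead of a silently missing key.
encodeJob :: Job -> String
encodeJob job = "{" ++ intercalate "," fields ++ "}"
  where
    field key value = show key ++ ":" ++ value
    fields =
      [ field "id"    (show (jobId job))
      , field "owner" (show (jobOwner job))
      , field "queue" (show (jobQueue job))
      ]

-- Build a Job from untyped key/value input. If any field is missing or
-- malformed the result is Nothing, so incomplete data stops at the boundary.
parseJob :: [(String, String)] -> Maybe Job
parseJob kvs =
  Job
    <$> (lookup "id" kvs >>= readMaybe)
    <*> lookup "owner" kvs
    <*> lookup "queue" kvs
```

With these definitions, `parseJob [("id", "1"), ("owner", "alice")]` is `Nothing` because `queue` is missing, while `encodeJob (Job 1 "alice" "batch")` yields `{"id":1,"owner":"alice","queue":"batch"}`. The Python equivalent would happily pass the two-key dictionary along until something downstream noticed.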
The real issue with the Python scheduler was its algorithmic complexity. Using a faster implementation would have bought us a few extra months or maybe even a year, but it would only have postponed the need for a real solution.
"Our lead developer Robert usually comes in a bit later, so we had about an hour to build a working prototype" - why did you only have an hour? What would have happened if the working prototype was not done when he came in?
The GHC runtime has built-in support for memory profiling: it can produce a graph that shows a breakdown of the heap over time. After trying various combinations of flags I managed to produce a graph where one part was clearly growing over time. The corresponding function was a recursive function with two arguments, the first never changing in recursive calls. I rewrote that to a single-argument nested function, and that made the leak go away.
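For readers who have not seen this kind of rewrite, here is a sketch of its shape. The function below is a made-up stand-in, not the function from our scheduler, and it only shows the transformation itself; it does not reproduce the leak.

```haskell
-- Before: `limit` is passed along unchanged in every recursive call.
sumBelow :: Int -> [Int] -> Int
sumBelow _ [] = 0
sumBelow limit (x:xs)
  | x < limit = x + sumBelow limit xs
  | otherwise = sumBelow limit xs

-- After: the unchanging argument is bound once by the outer function,
-- and the nested helper `go` recurses on the single remaining argument.
sumBelow' :: Int -> [Int] -> Int
sumBelow' limit = go
  where
    go [] = 0
    go (x:xs)
      | x < limit = x + go xs
      | otherwise = go xs
```

Both versions compute the same result, so `sumBelow 3 [1, 2, 3, 4]` and `sumBelow' 3 [1, 2, 3, 4]` are both 3. Whether a rewrite like this changes memory behavior depends on the specific code and on what the optimizer does with it, so treat the profile as the source of truth rather than the pattern.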