They just write "it's like alpha zero". So presumably they used a version of MCTS where each terminal node is scored by LEAN as either correct or incorrect.
Then they can train a network to evaluate intermediate positions (score network) and one to suggest things to try next (policy network).
Then they can train a network to evaluate intermediate positions (score network) and one to suggest things to try next (policy network).