I have spotted a couple of ways to improve performance:
Increase the size of the bitmap in each node from 32 bits to 64 bits. You are wasting 4 bytes per node, and wider nodes mean fewer indirections for each lookup.
Change the recursion in the lookup to iteration. You used iteration in other traversals, and it should be much more efficient.
Thanks! Yes, basically log_64(n) vs. log_32(n) and it would also require to switch to a 64 bit hash function and adjust hash exhaustion and bit fiddling arithmetic accordingly.
Re the impact of the recursion, that's actually zero for the search code since clang does a tail call optimization; it's a fair point for the removal code.
Increase the size of the bitmap in each node from 32 bits to 64 bits. You are wasting 4 bytes per node, and wider nodes mean fewer indirections for each lookup.
Change the recursion in the lookup to iteration. You used iteration in other traversals, and it should be much more efficient.