Assuming the YC guys get about 3000 applications (based on published acceptance rates and class sizes), they can probably either run some A/B tests in the reviewing process, or just substitute extra manual labor to evaluate based on other metrics if something goes wrong. It's not an all-or-nothing single test.
Metrics only get you so far. They don't really tell you why you fail. They only tell you how far along you are in your plan to succeed, and whether that matches expectations.
But if things are failing, they can't tell you why, particularly in an experiment like this. Maybe it is genuinely a bad idea and nothing will make it succeed. Maybe it is a good idea but needs to be approached differently.
In other words, you can show that you are doing what you think you need to do to succeed, and that you are not succeeding, but you can't show what changes you need to make in order to succeed.
The solutions do require manual labor, but more importantly they require creative labor. This area is more art than science.
If I were doing this, I would assemble a list of reasons why this experiment might fail up front. I would revisit the list in a year or so, ask which reasons played into the problem, and then ask what could be done about it. But that's me. Others have to find their own ways.
I meant the application-reviewing process, which has much faster feedback, as opposed to the "are the applicants qualified" question, which requires waiting at least through the session, and possibly a few more years.
They should be able to tune the application-review and interview process pretty easily, at least to get the same amount of information as they get from teams that apply with ideas. Figuring out whether those teams, once accepted, turn out markedly different from the ones who came with existing ideas is a much harder problem.