You do it by comparing the state voting results to pre-election polling. If the pre-election polling said D+2 and your final result was R+1, then you have to look at your polls and individual polling firms and determine whether some bias is showing up in the results.
Is there selection bias or response bias? You might find that a set of polls is randomly wrong, or you might find that they’re consistently wrong, adding 2 or 3 points in the direction of one party but generally tracking with results across time or geography. In that case, you determine a “house effect,” in that either the people that firm is calling or the people who will talk to them lean 2 to 3 points more Democratic than the electorate.
All of this is explained on the website and it’s kind of a pain to type out on a cellphone while on the toilet.
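For anyone who wants the house-effect check above in concrete form, here is a minimal sketch. The numbers are made up for illustration, the function name `house_effect` is hypothetical, and this is not any particular site's actual method. Margins are in points, with positive meaning a Democratic lead.

```python
from statistics import mean, pstdev

def house_effect(poll_margins, result_margins):
    """Return (mean signed error, spread of errors) for one polling firm.

    A mean error far from zero with a small spread suggests a consistent
    lean (a "house effect"); a mean near zero with a large spread suggests
    the firm is just randomly wrong.
    """
    errors = [poll - result for poll, result in zip(poll_margins, result_margins)]
    return mean(errors), pstdev(errors)

# Hypothetical firm: final poll margin vs. actual margin in four races.
polls = [2.0, 5.0, -1.0, 3.0]      # e.g. D+2, D+5, R+1, D+3
results = [-1.0, 2.5, -3.5, 0.0]   # e.g. R+1, D+2.5, R+3.5, tie

lean, spread = house_effect(polls, results)
print(f"average lean: {lean:+.2f} points, spread: {spread:.2f}")
```

With these invented numbers the firm misses by 2.5 to 3 points in the same direction every time, which is the "consistently wrong" case rather than the "randomly wrong" one.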
You are describing how to evaluate polling methods. And I agree: you do this by comparing an actual election outcome (e.g., statewide vote totals) to the results of your polling method.
But I am not talking about polling methods; I am talking about Silver’s win probability. This is some proprietary method that takes other people’s polls as input (Silver is not a pollster) and outputs a number, like 28%. There are many possible ways to combine the poll results, giving different win probabilities. How do we evaluate Silver’s method, separately from the polls?
I think the answer is basically the same: we compare it to an actual election outcome. Silver said Trump had a 28% win probability in 2016, which means he should win 28% of the time. The actual election outcome is that Trump won 100% of his 2016 elections. So as best as we can tell, Silver’s win probability was quite inaccurate.
Now, if we could rerun the 2016 election, maybe his estimate would look better over multiple trials. But we can’t do that; all we can ever do is compare 28% to 100%.
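One common way to put a number on "compare 28% to 100%" is the Brier score: the squared gap between the stated probability and what actually happened (1 if the event occurred, 0 if not), where lower is better. A minimal sketch, using only the 28% figure from above; the function name is just for illustration, and the 50% line is a coin-flip baseline added for comparison.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# The single 2016 data point: forecast 0.28 for Trump, and Trump won (1).
silver_2016 = brier_score([0.28], [1])

# Baseline for comparison: a forecaster who just says 50% for everything.
coin_flip = brier_score([0.50], [1])

print(f"28% forecast: {silver_2016:.4f}")
print(f"50% forecast: {coin_flip:.4f}")
```

On this one election the 28% forecast scores about 0.52 against 0.25 for the coin flip. The same function accepts a whole list of forecasts and outcomes, but for a single unrepeatable event one data point is all there is to feed it.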