# Polls - Sep. 27, 2020

As the 2020 election draws nearer, national and state polls will attract increasing attention. These polls typically try to estimate each candidate's share of the presidential popular vote by asking U.S. citizens about their voting preference. This week, I will **examine whether polls are a good predictor of the actual election outcome.** I will do this by analyzing historical polling and voting data to see whether polling has been an effective predictor at the national level. I will then check whether the same holds for historical elections at the state level. Finally, **I will use the models I create and 2020 polling data to predict this year’s election.**

When creating these scatter plots, which visualize the relationship between polling averages and popular vote share, I chose to use historical polling data from 6 to 10 weeks before the election. I chose this window because, as of this writing, we are approximately six weeks away from the election. I used the entire month rather than a single week so that each polling average would be based on a larger sample of polls. I also divided incumbents and challengers into separate plots to see whether the relationship differs between the two groups. These plots make it clear that **there is some relationship between polling and voting,** but next we must ask how strong this relationship is.
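To make the polling window concrete, here is a minimal sketch of how a 6-to-10-week polling average could be computed. The function name, the `(weeks_before_election, support_pct)` layout, and the poll numbers are all assumptions made up for illustration, not the data or code behind the plots.

```python
from statistics import mean


def window_poll_average(polls, min_weeks=6, max_weeks=10):
    """Average support across polls taken min_weeks..max_weeks before the election.

    polls: iterable of (weeks_before_election, support_pct) pairs (hypothetical layout).
    """
    in_window = [pct for weeks, pct in polls if min_weeks <= weeks <= max_weeks]
    if not in_window:
        raise ValueError("no polls fall inside the requested window")
    return mean(in_window)


# Made-up polls for one candidate in one election year.
example_polls = [(12, 44.0), (10, 46.0), (8, 47.0), (6, 48.0), (3, 50.0)]
print(window_poll_average(example_polls))  # 47.0
```

Only the polls taken 10, 8, and 6 weeks out enter the average here; the polls from 12 and 3 weeks out fall outside the window and are ignored.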

To quantify the relationship between polling data and popular vote share, I fit a linear regression of vote share on polling average. I also fit a second linear regression that combines second-quarter GDP growth with poll data so that I could compare the two models. I chose second-quarter GDP growth because my previous modeling showed it to be a good predictor of election outcomes. When comparing the models in the table above, **it appears that the relationship between polling averages and vote share is stronger for the incumbent than for the challenger.** Additionally, the combined economic and polling model appears to fit the data slightly better, based on its slightly higher R squared values. However, before concluding that the combined model is better, we should validate the models using out-of-sample evaluation.
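For readers who want to see what the polls-only regression involves, here is a rough sketch of a one-predictor least-squares fit and its R squared. The `fit_line` helper and the numbers are hypothetical and are not the data or code behind the table; the combined GDP-and-polls model would need a multiple regression, which this sketch does not cover.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared). Assumes x and y both vary.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared: share of the variation in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Made-up polling averages and vote shares (percent), one pair per election.
poll_avg = [42.0, 45.0, 48.0, 51.0, 54.0]
vote_share = [44.5, 46.0, 49.5, 50.5, 53.0]

intercept, slope, r2 = fit_line(poll_avg, vote_share)
print(f"vote = {intercept:.2f} + {slope:.2f} * poll, R squared = {r2:.3f}")
```

Fitting this once on incumbent rows and once on challenger rows would give the two sets of coefficients and R squared values being compared.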

## Model Validation

To guard against overfitting, it is important to evaluate each model out of sample. This process involves removing one election year from the data set, refitting the model on the remaining years, and then testing how well the refit model predicts the year that was removed. By repeating this process many times and checking whether the model correctly predicted the popular vote winner each time, we can see how well each model holds up on data it has not seen.
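The procedure described above can be sketched as a leave-one-out loop. This version assumes that every election year is held out exactly once and that the incumbent and challenger shares are predicted by separate polls-only fits; the field names and the data are invented for illustration.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_accuracy(rows):
    """Fraction of held-out elections whose popular vote winner is predicted correctly.

    rows: list of dicts with keys inc_poll, chl_poll, inc_vote, chl_vote (hypothetical).
    """
    correct = 0
    for i, held_out in enumerate(rows):
        train = rows[:i] + rows[i + 1:]  # every election except the held-out one
        inc_a, inc_b = fit_line([r["inc_poll"] for r in train],
                                [r["inc_vote"] for r in train])
        chl_a, chl_b = fit_line([r["chl_poll"] for r in train],
                                [r["chl_vote"] for r in train])
        pred_inc = inc_a + inc_b * held_out["inc_poll"]
        pred_chl = chl_a + chl_b * held_out["chl_poll"]
        predicted_inc_wins = pred_inc > pred_chl
        actual_inc_wins = held_out["inc_vote"] > held_out["chl_vote"]
        correct += predicted_inc_wins == actual_inc_wins
    return correct / len(rows)


# Made-up elections: polling averages and final vote shares in percent.
elections = [
    {"inc_poll": 50.0, "chl_poll": 44.0, "inc_vote": 52.5, "chl_vote": 46.0},
    {"inc_poll": 45.0, "chl_poll": 49.0, "inc_vote": 46.5, "chl_vote": 51.5},
    {"inc_poll": 52.0, "chl_poll": 42.0, "inc_vote": 54.0, "chl_vote": 44.5},
    {"inc_poll": 43.0, "chl_poll": 50.0, "inc_vote": 45.5, "chl_vote": 52.0},
    {"inc_poll": 48.0, "chl_poll": 46.0, "inc_vote": 50.5, "chl_vote": 48.0},
]
print(leave_one_out_accuracy(elections))
```

Because each prediction comes from a model that never saw the held-out year, the resulting accuracy is a fairer measure of predictive power than in-sample R squared.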

The model using just polls predicted the popular vote share winner correctly **90%** of the time while the model using both economic and polling data was correct only **80%** of the time. This suggests that **using only polling data may actually be a better predictive model than using both economic and polling data** despite the difference in R squared values discussed above.

## State Level Models

While the national popular vote is certainly important, the actual election outcome is decided by the Electoral College. To predict an Electoral College outcome accurately, it is necessary to model the outcome in each state. Having found that poll data alone made an effective model at the national level, I decided to use the same approach at the state level. The charts below depict the **R squared values of predictive models based on polling data for 48 states.** I removed Nebraska and Maine from the analysis because their sample of elections was too small to create a reasonable model.

From these graphs, it is apparent that the quality of the polling data model varies drastically by state. In some states, polling data explains a large percentage of the variation in the popular vote, while in others it explains almost no variation. Additionally, it appears that on average, **polling data is better at predicting incumbent vote shares than challenger vote shares at the state level.**
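As a rough illustration of how per-state R squared values like these could be computed, the sketch below fits the polls-only relationship separately for each state and skips states with too few elections. The `min_elections` cutoff, the data layout, and the placeholder state names are assumptions, not the values used for the charts.

```python
from statistics import mean


def r_squared(x, y):
    """R squared of a one-predictor least-squares fit (the squared correlation).

    Assumes both x and y vary; otherwise the denominator is zero.
    """
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def state_r_squared(history, min_elections=3):
    """Map each state to the R squared of its polls-only model.

    history: dict of state -> list of (poll_avg, vote_share) pairs (hypothetical layout).
    States with fewer than min_elections observations are left out.
    """
    results = {}
    for state, pairs in history.items():
        if len(pairs) < min_elections:
            continue  # too few elections to fit a reasonable model
        polls, votes = zip(*pairs)
        results[state] = r_squared(polls, votes)
    return results


# Placeholder states with made-up numbers.
history = {
    "State A": [(44.0, 46.0), (48.0, 49.5), (51.0, 53.0), (46.0, 47.0)],
    "State B": [(47.0, 52.0), (50.0, 48.0), (45.0, 49.0), (52.0, 50.5)],
    "State C": [(49.0, 50.0), (51.0, 52.0)],  # skipped: only two elections
}
for state, value in sorted(state_r_squared(history).items()):
    print(state, round(value, 3))
```

Running the same function once on incumbent data and once on challenger data would give the two sets of values being compared.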

## Predicting 2020

Now, using the previously created predictive models, I will predict the vote shares in each of the major swing states identified in my introductory post. I will estimate the Democrat’s win margin by using the models to predict the incumbent vote share and the challenger vote share, and then taking the difference between the two.
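The margin calculation can be sketched as follows. In 2020 the Democrat is the challenger, so the Democratic margin is the predicted challenger share minus the predicted incumbent share. The coefficients and polling averages below are hypothetical placeholders, not fitted values from the state models.

```python
def predict_share(model, poll_avg):
    """Predicted vote share from a fitted (intercept, slope) pair."""
    intercept, slope = model
    return intercept + slope * poll_avg


def democratic_margin(incumbent_model, challenger_model, incumbent_poll, challenger_poll):
    """Predicted challenger share minus predicted incumbent share.

    A positive value means the Democrat (the challenger in 2020) is predicted to win.
    """
    challenger_share = predict_share(challenger_model, challenger_poll)
    incumbent_share = predict_share(incumbent_model, incumbent_poll)
    return challenger_share - incumbent_share


# Hypothetical coefficients and current polling averages for a single state.
incumbent_model = (5.0, 0.9)    # (intercept, slope), made up
challenger_model = (10.0, 0.8)  # (intercept, slope), made up
margin = democratic_margin(incumbent_model, challenger_model,
                           incumbent_poll=45.0, challenger_poll=50.0)
print(round(margin, 1))
```

A state is then called for the Democrat whenever the margin is positive and for the incumbent whenever it is negative.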

According to the state vote share models, **Biden is predicted to win Colorado, Florida, Nevada, and Virginia.** In this scenario, Biden would almost certainly win the general election due to the large boost in electoral votes provided by Florida. Although these models are still far from perfect, using polling data greatly increases the predictive power of election models for this presidential election.