Opinion polls for the 2021 German elections

We want to see the past trends of polling results for different parties in the 2021 German federal election.

The Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia.

Scrape the data

We first scrape the wikipedia page and import the table in a dataframe, named as german_election_polls.

url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"

# similar graphs and analyses can be found at 
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel


# get tables that exist on wikipedia page 
tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")


# parse HTML tables into a dataframe called polls 
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>% 
             html_table(fill=TRUE)%>% 
             janitor::clean_names())


# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
  slice(2:(n()-1)) %>%  # drop the first row, as it contains again the variable names and last row that contains 2017 results
  mutate(
         # polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
         # and we extract it by picking the last 11 characters from that field
         end_date = str_sub(fieldwork_date, -11),
         
         # end_date is still a string, so we convert it into a date object using lubridate::dmy()
         end_date = dmy(end_date),
         
         # we also get the month and week number from the date, if we want to do analysis by month- week, etc.
         month = month(end_date),
         week = isoweek(end_date)
         )

Glimpse and restructure the data

We then take a look at the generated German Election Polls data framework. The missing value is caused by data structure and we eliminated it.

glimpse(german_election_polls)

## Rows: 250
## Columns: 16
## $ polling_firm   <chr> "2021 federal election", "Wahlkreisprognose", "Ipsos", ~
## $ fieldwork_date <chr> "26 Sep 2021", "22–24 Sep 2021", "22–23 Sep 2021", "22–~
## $ samplesize     <chr> "–", "1,400", "2,000", "1,273", "2,002", "1,554", "10,0~
## $ abs            <chr> "23.4", "–", "–", "–", "26", "–", "–", "–", "–", "–", "~
## $ union          <dbl> 24.1, 22.5, 22.0, 23.0, 22.0, 25.0, 23.0, 21.0, 21.5, 2~
## $ spd            <dbl> 25.7, 25.5, 26.0, 25.0, 25.0, 26.0, 25.0, 25.0, 25.0, 2~
## $ af_d           <dbl> 10.3, 11.0, 11.0, 10.0, 10.0, 10.0, 10.0, 12.0, 11.0, 1~
## $ fdp            <dbl> 11.5, 12.0, 12.0, 11.0, 12.0, 10.5, 12.0, 11.0, 12.5, 1~
## $ linke          <dbl> 4.9, 7.0, 7.0, 6.0, 6.0, 5.0, 6.0, 7.0, 6.5, 7.0, 6.5, ~
## $ grune          <dbl> 14.8, 14.0, 16.0, 16.5, 17.0, 16.0, 16.0, 14.0, 15.0, 1~
## $ fw             <chr> "2.5", "–", "–", "3", "3", "–", "–", "2", "–", "–", "–"~
## $ others         <chr> "6.2", "8", "6", "5.5", "5", "7.5", "8", "7", "8.5", "9~
## $ lead           <chr> "1.6", "3", "4", "2", "3", "1", "2", "4", "3.5", "4", "~
## $ end_date       <date> 2021-09-26, 2021-09-24, 2021-09-23, 2021-09-23, 2021-0~
## $ month          <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
## $ week           <dbl> 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 37,~

skim(german_election_polls)

(#tab:glimpse into poll data)Data summary

Name	german_election_polls
Number of rows	250
Number of columns	16
_______________________
Column type frequency:
character	7
Date	1
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
polling_firm	1	3	32	17
fieldwork_date	1	11	24	236
samplesize	1	1	6	199
abs	1	1	4	10
fw	1	1	3	5
others	1	1	3	14
lead	1	1	4	33

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
end_date	0	1	2021-01-04	2021-09-26	2021-06-10	146

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
union	1	27.08	4.65	19.0	23.00	27.0	29.0	37	<U+2585><U+2587><U+2587><U+2582><U+2583>
spd	1	18.01	4.00	13.0	15.00	16.0	19.0	28	<U+2587><U+2583><U+2581><U+2582><U+2581>
af_d	1	10.60	0.94	8.0	10.00	11.0	11.0	13	<U+2582><U+2585><U+2587><U+2582><U+2581>
fdp	1	10.82	1.98	5.0	10.00	11.0	12.0	14	<U+2581><U+2582><U+2583><U+2587><U+2583>
linke	1	7.01	0.92	4.9	6.00	7.0	7.5	12	<U+2585><U+2587><U+2583><U+2581><U+2581>
grune	1	19.70	3.00	14.0	17.50	19.2	21.0	29	<U+2585><U+2587><U+2586><U+2582><U+2581>
month	1	5.59	2.53	1.0	3.25	6.0	8.0	9	<U+2585><U+2585><U+2583><U+2587><U+2587>
week	1	22.33	11.04	1.0	13.00	23.0	32.0	38	<U+2585><U+2583><U+2586><U+2586><U+2587>

#Using skim, We found 1 NA
german_election_polls = na.omit(german_election_polls)

To make the data more readable, we restructure the dataset and melt the 6 parties polls (Union, SPD, AfD, FDP, Linke, and Grune). We select the Party and Vote variables, and add the ‘end_date’ as x variable for plotting.

#Restructure the dataframe, select 6 parties polls and melt
library(reshape2)
german_election_polls_restr <- german_election_polls %>%
  mutate(Union=union, SPD=spd, AfD=af_d, FDP=fdp, Linke=linke, Grune=grune) %>%
  melt(german_election_polls, id.vars=c("Union","SPD","AfD", "FDP", "Linke", "Grune"),
                                        measure.vars = c("Union","SPD","AfD", "FDP", "Linke", "Grune"),
                                        variable.name = "Party",value.name = "Vote") 


#Select the Party and Vote variables, and add the end_date as x variable for plotting
german_election_polls_restr <- german_election_polls_restr %>%
  select(Party, Vote) %>%
  mutate(end_date=rep(german_election_polls$end_date,times=6))

Plot the data

Then we use geom_smooth to generate the 2021 German Federal Election Polling Results plot of the 6 parties.

ggplot(german_election_polls_restr, aes(x=end_date, y=Vote, color=Party)) +
  geom_point()+
  geom_smooth(se=F,span=0.2)+ #span=0.2 to make the line less smoothed
  scale_color_manual(values=c("Union"="black","SPD"="#AF1513","AfD"="#330D9D","FDP"="#FFC300","Linke"="#650D9D","Grune"="#0D9D36"))+ #manually select colors for different parties
  scale_x_date(date_breaks = "1 month" , date_labels = c("%b %y"),limits=as.Date(c("2021-01-01","2021-10-01")))+
  theme_minimal()  +
  theme(legend.position = "right",
        legend.title = element_text(size=14),
        legend.text = element_text(size=14),
        legend.key = element_rect(fill = "white", color = "white"),
        plot.title =element_text(size=20, face='bold',hjust = 0.5,margin = margin(10,0,10,0)),
        plot.subtitle =element_text(size=16, face='bold',hjust = 0.5), #put titles in the middle
        axis.text.x = element_text(size=10),
        axis.text.y = element_text(size=12),
        axis.ticks.x = element_line(),
        axis.ticks.y=element_line(),
        axis.title.x = element_text(size=16,face='bold'),
        axis.title.y = element_text(size=16,face='bold'),
        ) +
  labs(title = "2021 German Federal Election Polling Results",
         subtitle = "Combined results from different polling firms: INSA, Forsa, Kantar etc.",
         x = "End date of fieldwork",
         y = "",
         caption = "Source: https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election") +
  ylab("Vote %")

The pulling result trends are illustrated in the graph, with the data of each party spreading around the generated lines.

Another method: rolling mean

Finally, we try to use another method - rolling mean of 14 records instead of directly applying smoothing lines to make the plot.

german_election_polls_rolling <- german_election_polls_restr %>%   
  group_by(Party) %>%
  mutate( 
    #calculate 14-record rolling average, align = left
    Vote14 = zoo::rollmean(Vote, k=14, fill = NA,align="left"))%>%
  ungroup() %>%
  na.omit(german_election_polls_rolling) #remove the NA caused by rolling average

ggplot(german_election_polls_rolling, aes(x=end_date, y=Vote14, color=Party)) +
  geom_point()+
  geom_smooth(se=F,span=0.2)+ #span=0.2 to make the line less smoothed
  scale_color_manual(values=c("Union"="black","SPD"="#AF1513","AfD"="#330D9D","FDP"="#FFC300","Linke"="#650D9D","Grune"="#0D9D36"))+ #manually select colors for different parties
  scale_x_date(date_breaks = "1 month" , date_labels = c("%b %y"),limits=as.Date(c("2021-01-01","2021-10-01")))+
  theme_minimal()  +
  theme(legend.position = "right",
        legend.title = element_text(size=14),
        legend.text = element_text(size=14),
        legend.key = element_rect(fill = "white", color = "white"),
        plot.title =element_text(size=20, face='bold',hjust = 0.5,margin = margin(10,0,10,0)),
        plot.subtitle =element_text(size=16, face='bold',hjust = 0.5), #put titles in the middle
        axis.text.x = element_text(size=10),
        axis.text.y = element_text(size=12),
        axis.ticks.x = element_line(),
        axis.ticks.y=element_line(),
        axis.title.x = element_text(size=16,face='bold'),
        axis.title.y = element_text(size=16,face='bold'),
        ) +
  labs(title = "2021 German Federal Election Polling Results (14-record rolling mean)",
         subtitle = "Combined results from different polling firms: INSA, Forsa, Kantar etc.",
         x = "End date of fieldwork",
         y = "",
         caption = "Source: https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election") +
  ylab("Vote %")

Conclusion

We can infer from this plot that the rolling average reached similar trendlines with the former method, but looking at the points, it also eliminated the biases of different polling firms.
The Union almost keeps the highest vote percentage over the election period although shows decending trend. Then followed by Grune,and SPD. Noticeably, the voting percentage of SPD shows significant rise since Auguest 2021, and ranked the highest percentage record over the 6 parties.

Wrong content? Edit on Github.

Scrape the data

Glimpse and restructure the data

Plot the data

Another method: rolling mean

💬 Comment: