We want to see the past trends of polling results for different parties in the 2021 German federal election.
The Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia.
Scrape the data
We first scrape the wikipedia page and import the table in a dataframe, named as german_election_polls.
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# similar graphs and analyses can be found at
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# get tables that exist on wikipedia page
tables <- url %>%
read_html() %>%
html_nodes(css="table")
# parse HTML tables into a dataframe called polls
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>%
html_table(fill=TRUE)%>%
janitor::clean_names())
# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
slice(2:(n()-1)) %>% # drop the first row, as it contains again the variable names and last row that contains 2017 results
mutate(
# polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
# and we extract it by picking the last 11 characters from that field
end_date = str_sub(fieldwork_date, -11),
# end_date is still a string, so we convert it into a date object using lubridate::dmy()
end_date = dmy(end_date),
# we also get the month and week number from the date, if we want to do analysis by month- week, etc.
month = month(end_date),
week = isoweek(end_date)
)
Glimpse and restructure the data
We then take a look at the generated German Election Polls data framework. The missing value is caused by data structure and we eliminated it.
glimpse(german_election_polls)
## Rows: 250
## Columns: 16
## $ polling_firm <chr> "2021 federal election", "Wahlkreisprognose", "Ipsos", ~
## $ fieldwork_date <chr> "26 Sep 2021", "22β24 Sep 2021", "22β23 Sep 2021", "22β~
## $ samplesize <chr> "β", "1,400", "2,000", "1,273", "2,002", "1,554", "10,0~
## $ abs <chr> "23.4", "β", "β", "β", "26", "β", "β", "β", "β", "β", "~
## $ union <dbl> 24.1, 22.5, 22.0, 23.0, 22.0, 25.0, 23.0, 21.0, 21.5, 2~
## $ spd <dbl> 25.7, 25.5, 26.0, 25.0, 25.0, 26.0, 25.0, 25.0, 25.0, 2~
## $ af_d <dbl> 10.3, 11.0, 11.0, 10.0, 10.0, 10.0, 10.0, 12.0, 11.0, 1~
## $ fdp <dbl> 11.5, 12.0, 12.0, 11.0, 12.0, 10.5, 12.0, 11.0, 12.5, 1~
## $ linke <dbl> 4.9, 7.0, 7.0, 6.0, 6.0, 5.0, 6.0, 7.0, 6.5, 7.0, 6.5, ~
## $ grune <dbl> 14.8, 14.0, 16.0, 16.5, 17.0, 16.0, 16.0, 14.0, 15.0, 1~
## $ fw <chr> "2.5", "β", "β", "3", "3", "β", "β", "2", "β", "β", "β"~
## $ others <chr> "6.2", "8", "6", "5.5", "5", "7.5", "8", "7", "8.5", "9~
## $ lead <chr> "1.6", "3", "4", "2", "3", "1", "2", "4", "3.5", "4", "~
## $ end_date <date> 2021-09-26, 2021-09-24, 2021-09-23, 2021-09-23, 2021-0~
## $ month <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
## $ week <dbl> 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 37,~
skim(german_election_polls)
Name | german_election_polls |
Number of rows | 250 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 7 |
Date | 1 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
polling_firm | 0 | 1 | 3 | 32 | 0 | 17 | 0 |
fieldwork_date | 0 | 1 | 11 | 24 | 0 | 236 | 0 |
samplesize | 0 | 1 | 1 | 6 | 0 | 199 | 0 |
abs | 0 | 1 | 1 | 4 | 0 | 10 | 0 |
fw | 0 | 1 | 1 | 3 | 0 | 5 | 0 |
others | 0 | 1 | 1 | 3 | 0 | 14 | 0 |
lead | 0 | 1 | 1 | 4 | 0 | 33 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
end_date | 0 | 1 | 2021-01-04 | 2021-09-26 | 2021-06-10 | 146 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
union | 0 | 1 | 27.08 | 4.65 | 19.0 | 23.00 | 27.0 | 29.0 | 37 | <U+2585><U+2587><U+2587><U+2582><U+2583> |
spd | 0 | 1 | 18.01 | 4.00 | 13.0 | 15.00 | 16.0 | 19.0 | 28 | <U+2587><U+2583><U+2581><U+2582><U+2581> |
af_d | 0 | 1 | 10.60 | 0.94 | 8.0 | 10.00 | 11.0 | 11.0 | 13 | <U+2582><U+2585><U+2587><U+2582><U+2581> |
fdp | 0 | 1 | 10.82 | 1.98 | 5.0 | 10.00 | 11.0 | 12.0 | 14 | <U+2581><U+2582><U+2583><U+2587><U+2583> |
linke | 0 | 1 | 7.01 | 0.92 | 4.9 | 6.00 | 7.0 | 7.5 | 12 | <U+2585><U+2587><U+2583><U+2581><U+2581> |
grune | 0 | 1 | 19.70 | 3.00 | 14.0 | 17.50 | 19.2 | 21.0 | 29 | <U+2585><U+2587><U+2586><U+2582><U+2581> |
month | 0 | 1 | 5.59 | 2.53 | 1.0 | 3.25 | 6.0 | 8.0 | 9 | <U+2585><U+2585><U+2583><U+2587><U+2587> |
week | 0 | 1 | 22.33 | 11.04 | 1.0 | 13.00 | 23.0 | 32.0 | 38 | <U+2585><U+2583><U+2586><U+2586><U+2587> |
#Using skim, We found 1 NA
german_election_polls = na.omit(german_election_polls)
To make the data more readable, we restructure the dataset and melt the 6 parties polls (Union, SPD, AfD, FDP, Linke, and Grune). We select the Party and Vote variables, and add the βend_dateβ as x variable for plotting.
#Restructure the dataframe, select 6 parties polls and melt
library(reshape2)
german_election_polls_restr <- german_election_polls %>%
mutate(Union=union, SPD=spd, AfD=af_d, FDP=fdp, Linke=linke, Grune=grune) %>%
melt(german_election_polls, id.vars=c("Union","SPD","AfD", "FDP", "Linke", "Grune"),
measure.vars = c("Union","SPD","AfD", "FDP", "Linke", "Grune"),
variable.name = "Party",value.name = "Vote")
#Select the Party and Vote variables, and add the end_date as x variable for plotting
german_election_polls_restr <- german_election_polls_restr %>%
select(Party, Vote) %>%
mutate(end_date=rep(german_election_polls$end_date,times=6))
Plot the data
Then we use geom_smooth to generate the 2021 German Federal Election Polling Results plot of the 6 parties.
ggplot(german_election_polls_restr, aes(x=end_date, y=Vote, color=Party)) +
geom_point()+
geom_smooth(se=F,span=0.2)+ #span=0.2 to make the line less smoothed
scale_color_manual(values=c("Union"="black","SPD"="#AF1513","AfD"="#330D9D","FDP"="#FFC300","Linke"="#650D9D","Grune"="#0D9D36"))+ #manually select colors for different parties
scale_x_date(date_breaks = "1 month" , date_labels = c("%b %y"),limits=as.Date(c("2021-01-01","2021-10-01")))+
theme_minimal() +
theme(legend.position = "right",
legend.title = element_text(size=14),
legend.text = element_text(size=14),
legend.key = element_rect(fill = "white", color = "white"),
plot.title =element_text(size=20, face='bold',hjust = 0.5,margin = margin(10,0,10,0)),
plot.subtitle =element_text(size=16, face='bold',hjust = 0.5), #put titles in the middle
axis.text.x = element_text(size=10),
axis.text.y = element_text(size=12),
axis.ticks.x = element_line(),
axis.ticks.y=element_line(),
axis.title.x = element_text(size=16,face='bold'),
axis.title.y = element_text(size=16,face='bold'),
) +
labs(title = "2021 German Federal Election Polling Results",
subtitle = "Combined results from different polling firms: INSA, Forsa, Kantar etc.",
x = "End date of fieldwork",
y = "",
caption = "Source: https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election") +
ylab("Vote %")
The pulling result trends are illustrated in the graph, with the data of each party spreading around the generated lines.
Another method: rolling mean
Finally, we try to use another method - rolling mean of 14 records instead of directly applying smoothing lines to make the plot.
german_election_polls_rolling <- german_election_polls_restr %>%
group_by(Party) %>%
mutate(
#calculate 14-record rolling average, align = left
Vote14 = zoo::rollmean(Vote, k=14, fill = NA,align="left"))%>%
ungroup() %>%
na.omit(german_election_polls_rolling) #remove the NA caused by rolling average
ggplot(german_election_polls_rolling, aes(x=end_date, y=Vote14, color=Party)) +
geom_point()+
geom_smooth(se=F,span=0.2)+ #span=0.2 to make the line less smoothed
scale_color_manual(values=c("Union"="black","SPD"="#AF1513","AfD"="#330D9D","FDP"="#FFC300","Linke"="#650D9D","Grune"="#0D9D36"))+ #manually select colors for different parties
scale_x_date(date_breaks = "1 month" , date_labels = c("%b %y"),limits=as.Date(c("2021-01-01","2021-10-01")))+
theme_minimal() +
theme(legend.position = "right",
legend.title = element_text(size=14),
legend.text = element_text(size=14),
legend.key = element_rect(fill = "white", color = "white"),
plot.title =element_text(size=20, face='bold',hjust = 0.5,margin = margin(10,0,10,0)),
plot.subtitle =element_text(size=16, face='bold',hjust = 0.5), #put titles in the middle
axis.text.x = element_text(size=10),
axis.text.y = element_text(size=12),
axis.ticks.x = element_line(),
axis.ticks.y=element_line(),
axis.title.x = element_text(size=16,face='bold'),
axis.title.y = element_text(size=16,face='bold'),
) +
labs(title = "2021 German Federal Election Polling Results (14-record rolling mean)",
subtitle = "Combined results from different polling firms: INSA, Forsa, Kantar etc.",
x = "End date of fieldwork",
y = "",
caption = "Source: https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election") +
ylab("Vote %")
Conclusion
We can infer from this plot that the rolling average reached similar trendlines with the former method, but looking at the points, it also eliminated the biases of different polling firms.
The Union almost keeps the highest vote percentage over the election period although shows decending trend. Then followed by Grune,and SPD. Noticeably, the voting percentage of SPD shows significant rise since Auguest 2021, and ranked the highest percentage record over the 6 parties.
Wrong content? Edit on Github.
π¬ Comment: