Analyzing Provincial Employment With The Cansim Package In R
In this example, we will use the get_cansim_vector function from Mountain Math to retrieve the monthly LFS Employment data. Vectors for total and full- and part-time employment will be retrieved in a loop. A standard bar chart will be produced but a methodology for showing similar information in a bubble chart will be introduced. The advantage of the bubble chart is that the size of the absolute change in each province can be shown by resizing the bubbles.
Packages USED
The key package used in this vignette is the cansim package which accesses the NDM API. Key packages include
- dplyr – for processing the tibble prior to conversion to xts
- ggplot2 – part of tidyverse for plotting the results
- cansim – package to retrieve metadata and series from Statistics Canada’s NDM
- lubridate – for managing date variables
- gridExtra – for grouping plots and
- gdata - renaming plots within a loop for subsequent work.
Retrieving the DATA
The initial stage, as always, is to load the key packages. A list of three character vectors containing the Vnumbers for the types of employment with Canada first is defined. The main structure runs through a loop which processes the character vectors sequentially. The major steps are:
- process the meta data for documentation purposes,
- retrieve the series,
- create a data structure for processing the last three months,
- create two plots, a bar and a bubble plot,
- rename the plot outputs for later grouping.
The actual vectors could be found using the search functions of the cansim package but in this case, the Statcan NDM website was used.
#This vignette retrieves vectors from the LFS
setwd("D:\\OneDrive\\lfs_stuff")
library(dplyr)
library(cansim)
library(lubridate)
library(ggplot2)
library(gdata)
#some routines make use of a cache which can be usefully set in options
options(cache_path="d:\\mm_data_cache")
# https://www12.statcan.gc.ca/census-recensement/2011/ref/dict/table-tableau/table-tableau-8-eng.cfm
geo_abreviations<-c("CAN","NL","PE","NS","NB","QC","ON","MB","SK","AB","BC")
#these are standard geo_codes
pruid<-c(99,10, 11,12,13,24,35,46,47,48,59)
total_vectors<-c(
"v2062811",
"v2063000",
"v2063189",
"v2063378",
"v2063567",
"v2063756",
"v2063945",
"v2064134",
"v2064323",
"v2064512",
"v2064701"
)
full_time_vectors<-c(
"v2062812",
"v2063001",
"v2063190",
"v2063379",
"v2063568",
"v2063757",
"v2063946",
"v2064135",
"v2064324",
"v2064513",
"v2064702"
)
part_time_vectors<-c(
"v2062813",
"v2063002",
"v2063191",
"v2063380",
"v2063569",
"v2063758",
"v2063947",
"v2064136",
"v2064325",
"v2064514",
"v2064703"
)
#make up list of vectors
vector_list<-list(total_vectors,full_time_vectors,part_time_vectors)
title_modifier<-c("Total","Full-time","Part-time")
numb_sets<-length(vector_list)
for(this_set in 1:numb_sets){
canada_lfs_vbls<-vector_list[[this_set]]
employment_type<-title_modifier[this_set]
print(canada_lfs_vbls)
At this point, we are the top of the loop and being the analysis of the meta data.
#analyze meta data
meta1<-get_cansim_vector_info(canada_lfs_vbls)
cube_meta<-get_cansim_cube_metadata(unique(meta1$table))
common_meta<-left_join(meta1,cube_meta,by=c("table"="productId")) %>%distinct()%>%
select(VECTOR,table,title_en,cubeTitleEn,cubeEndDate)
print(as.data.frame(common_meta))
#The coordinate is part of the meta data. The first level gives a numeric geography
#we want to make sure that the vector is organized canada followed by provinces
coordinate_split<-strsplit(meta1$COORDINATE,"[.]")
#the result is a list with a character vector entry for row
#when retrieved from cansim with the meta data it is always 10 levels long
numb_levels<-length(coordinate_split[[1]])
id_data<-as.data.frame(matrix(as.integer(unlist(coordinate_split)),ncol=numb_levels,byrow=TRUE))
geo_order<-id_data$V1
print(data.frame(geo_order,meta1$title_en))
#the geo_order could be used to reorder the data columns below if needed
#make a table to map the abbreviations
abreviation_table<-data_frame(VECTOR=canada_lfs_vbls,geo=geo_abreviations,geo_code=geo_order)
One key feature of the code above is to make use of the vector coordinate variable. This is different for each vector. It can be quite useful for ordering things. In this case, the first level of the coordinate is extracted to provide order codes for Canada and the provinces.
The next stage started the retrieval process. The REF_DATE field is mutated into an extra Date field using the R date conventions. The abbreviation table is joined to the data tibble to provide identification for the series.
#now retrieve the data, creating a date and adding on the abreviation
canada_lfs <- get_cansim_vector(canada_lfs_vbls,start_time=as.Date("1995-01-01"))%>%
mutate(Date=as.Date(REF_DATE))%>%
left_join(abreviation_table,by="VECTOR")
The basic canada_lfs tibble is then extended to include year over year percentage change and the difference from one year ago. A new tibble is created. The dplyr slice operator is used to select the last three data points.
#now add on the percentage change
canada_lfs_pch<-canada_lfs %>%
group_by(geo_code) %>%
arrange(Date) %>%
mutate(pch=100*((VALUE/lag(VALUE,12))-1),annual_change=VALUE-lag(VALUE,12)) %>%
slice((n()-2):n())%>%ungroup()
print(as_data_frame(select(canada_lfs_pch,REF_DATE,VALUE,geo,pch,annual_change)))
The next stage is to create the plot tibble for the standard bar chart. The geo abbreviation is turned into a factor so that order can be preserved in the chart.
#now start a bar plot
plot_data<-select(canada_lfs_pch,REF_DATE,geo,pch)
geo_stuff<-unique(plot_data$geo)
#create a factor variable for the limits on the x axis
geo_factor<-factor(geo_stuff,levels=geo_stuff)
text_adjust_factor<-.07*max(abs(plot_data$pch))
plot1<-ggplot(plot_data,aes(x=geo,y=pch,group=substr(REF_DATE,1,7),
label=round(pch,1),fill=substr(REF_DATE,1,7)))+
geom_bar(position=position_dodge(0.9),stat="identity")+
labs(title=paste(employment_type,"Year over Year Percentage Change"),subtitle="Employment 15+ Seasonally Adjusted",
x=NULL,y="Year over Year Percentage Change",
caption="Statcan, JCI")+
scale_x_discrete(limits=geo_factor)+
geom_text(size=3,position=position_dodge(0.9),aes(y=(sign(pch)*
(text_adjust_factor+abs(pch)))))+
theme(legend.title=element_blank(),legend.position="bottom")
ggsave(plot1,file=paste(employment_type,"LFS15_annual_change.png"))
This chart plots the year-over-year percentage change. A text adjustment factor is calculated to position the numeric value labels for each bar far enough above on the y value axis. Each of the months is printed as a separate bar by using position_dodge. The first seven characters of the REF_DATE are used to group the data by a discrete value. The file name is defined appropriately. The total example is shown below.
The example shows the deficiency of bar charts because there is no distinction by the size of the province or the change in employment.
The next step is to prepare the bubble chart. Essentially this identical to the bar chart but using the geo_point rather than the geom_bar to place a point at the appropriate point on the chart. The size of the point is determined by the absolute year over year change. It is necessary to take the absolute value because size cannot be negative.
#now select the last date
last_date<-max(canada_lfs_pch$Date)
#now try bubble charts
plot_data_2<-select(canada_lfs_pch,REF_DATE,geo,pch,annual_change)
plot2<-ggplot(plot_data_2,aes(x=geo,y=pch,group=substr(REF_DATE,1,7),
label=round(pch,1),size=abs(annual_change),
fill=substr(REF_DATE,1,7),colour=substr(REF_DATE,1,7)))+
geom_point(stat="identity",position=position_dodge(0.9))+
labs(title=paste(employment_type,"Year over Year Percentage Change"),
subtitle="Employment 15+ Seasonally Adjusted\nBy Size of Change",
x=NULL,y="Year over Year Percentage Change",
caption="Statcan, JCI")+
scale_x_discrete(limits=geo_factor)+
geom_hline(yintercept=0)+
geom_text(size=3,position=position_dodge(0.9),aes(y=(sign(pch)*
(text_adjust_factor+abs(pch)))))+
theme(legend.title=element_blank(),legend.position="bottom")
ggsave(plot2,file=paste0(employment_type,"_LFS15_annual_change_point.png"))
For appearance’s sake, a horizontal line is inserted at the zero intercept. The Total example is shown below.
In this example, it can be seen that the important contributors to employment change were Ontario, Alberta and BC in this time period.
The next block of code joins the two plots vertically with the bubble plot first. The new library gdata is used to supply the mv function. This allows the renaming of the plot variable to something useful. This is the bottom of the loop over the types of employment.
library(gridExtra)
plot_list<-mget(c("plot2","plot1"))
two_plots<-marrangeGrob(plot_list,ncol=1,nrow=2,top=NULL)
two_plot_name<-paste0(employment_type,"_lfs_change_analysis.png")
ggsave(file=two_plot_name,width=8,height=10.5,two_plots)
#now rename the plot files to save theme
mv(from="plot1",to=paste0(employment_type,"_plot1"))
mv(from="plot2",to=paste0(employment_type,"_plot2"))
}
A example of the two plots joined together is shown below.
The final piece of code groups the two bubble plots for full- and part-time employment into one vertical plot. Note that the plot is sized to slightly less than a normal page to get good proportions.
plot_list<-mget(c("Full-time_plot2","Part-time_plot2"))
two_plots<-marrangeGrob(plot_list,ncol=1,nrow=2,top=NULL)
two_plot_name<-paste0("Full_Part","_lfs_bubble_analysis.png")
ggsave(file=two_plot_name,width=8,height=10.5,two_plots)
The example is shown below.