Gender-Based Wage Rate Charts With Cansim Using R
In this example, we will use the cansim package to access gender-specific data by vector from Statistics Canada’s NDM using the cansim package from Mountain Math.
Packages USED
The key package used in this vignette is the cansim package which accesses vectors, in this case, from the NDM database of Statistics Canada. Key packages include
- dplyr – for processing the tibble
- ggplot2 – part of tidyverse for plotting the results
- cansim – package to access data using the NDM api.[1]
The script uses a pre-defined list of vectors to produce charts by gender for the occupations included in the vector list. The focus in this example is on full-time wages by gender for two management occupations.
In corporate environments, it may be necessary to use a proxy setting. This is documented in the citation for the cansim package above.
Retrieve the data
The initial stage is to retrieve the vectors. The script below assumes that the user’s R profile defines the appropriate working folder or else this should be defined in an initial use of the function setwd. The initial code processes a list of vectors delimited only with carriage returns such as would be obtained from a paste operation from some file. The character vector is split into a regular character vector with one entry per V number using strsplit. The “n” symbol is “escaped’ by a backslash so as to be treated as a carriage return. This process just saves enclosing each vnumber in quotes.
-
library(dplyr)
-
library(cansim)
-
library(tidyr)
-
library(lubridate)
-
library(stringr)
-
library(ggrepel)
-
library(ggplot2)
-
#series list (management, full time)
-
#list separate by carriage returns (i.e. copied from Excel)
-
base_list <- "
-
v103555126
-
v103555130
-
v103555138
-
v103555142
-
v103558798
-
v103558802
-
v103558810
-
v103558814
-
"
-
raw_list <- strsplit(base_list, "\n")
-
vbl_list <- unlist(raw_list)
-
vbl_list <- vbl_list[vbl_list != ""]
-
#now do the retrieval
-
#analyze meta data
-
meta1 <- get_cansim_vector_info(vbl_list)
-
cube_meta <- get_cansim_cube_metadata(unique(meta1$table))
-
common_meta <-
-
left_join(meta1, cube_meta, by = c("table" = "productId")) %>% distinct() %>%
-
select(VECTOR, table, title_en, cubeTitleEn, cubeEndDate)
-
print(as.data.frame(common_meta))
-
#retrieve data
-
cansim_vbls <-
-
get_cansim_vector(vbl_list, start_time = as.Date("1995-01-01"))
-
print(colnames(cansim_vbls))
In the code above, the vector list (vbl_list) is used to first get the meta data for the vectors. Unique table identifiers are used to retrieve the meta data for the tables involved. In this example, the vnumbers are all in one table. The print statement would show the following information.
The data are retrieved using the get_cansim_vector function. A start date must be supplied.
The next code parses the title_en variable in common_meta which is the description for each vector. This is simply the text attributes of the vector concatenated with a semi-colon between each one.
#now parse the title_en for key fields specific to the variables chosen
#title en substrings are in the order specific to the NDM tables
title1 <- str_split_fixed(common_meta$title_en, fixed(";"), n = Inf)
#need to supply column names
title1_names <- paste0("col", 1:ncol(title1))
colnames(title1) <- title1_names
title_tbl <- tbl_df(title1) %>%
rename(
region = 1,
wage_type = 2,
employee_type = 3,
occupation = 4,
gender = 5,
age_range = 6
)
title_table <- bind_cols(VECTOR = common_meta$VECTOR, title_tbl)
print(title_table)
class_vbls <- colnames(title_table)
The vector class_vbls contains the column names provided in the rename statement above. This will be used below in a variable select. The next code uses dplyr to join this classification information with the VECTOR identifier to the cansim_vbls tibble retrieved above.
cansim_vbls %>%
left_join(title_table, by = "VECTOR") %>%
mutate(date = as.Date(REF_DATE)) %>%
select(VECTOR, REF_DATE, date,
VALUE, one_of(class_vbls[-1])) -> working_data
#the vector in class variables is dropped
head(working_data)
Doing the plots
In the next set of code, lists are prepared for the various classification variables which are used to select the data to be plotted. We use a string to select the wage type because we want to include both average and median wage rates.
#lets do separate plots for each occupation
occupation_list <- unique(title_table$occupation)
wage_list <- unique(title_table$wage_type)
print(wage_list)
type_of_wage <- "weekly"
employee_type_list <- unique(title_table$employee_type)
type_of_employee <- employee_type_list[1]
age_list <- unique(title_table$age_range)
age_group <- age_list[1]
plot_list <- character()
plot_counter <- 0
The occupation list is used to define the structure of the “for” loop to control the occupations to be plotted. More occupations could be included in the vector list. The first stage is to build the plot_data tibble by filtering on all of the classification variables that can vary. If multiple geographies are included in the vector list, then the region classification must be included in the filter.
for (occupation_counter in seq_along(occupation_list)) {
this_occupation <- occupation_list[occupation_counter]
#we filter the working data for the attributes that we are not using in the chart
#we are selecting all weekly wage types
working_data %>%
filter(
occupation == this_occupation,
str_detect(wage_type, "weekly"),
employee_type == type_of_employee,
age_range == age_group
) -> plot_data
The next piece of code prepares various text strings which are also used in the plot for documentation including the range of the data. Then, a last_values tibble is constructed containing the data for only the first and last data points. This will be used for plot labelling.
vector_list <- paste(unique(plot_data$VECTOR), collapse = ",")
#get first and last date
last_date <- substr(max(plot_data$date), 1, 4)
first_date <- substr(min(plot_data$date), 1, 4)
#calculate a tibble for the values of the first and last datapoints
# for labeling the plot
plot_data %>%
group_by(occupation, wage_type, employee_type, age_range, gender) %>%
arrange(date) %>%
slice(c(1, n())) -> last_values
Now the actual plot is started. This is relatively standard with color attributes being assigned to gender and the linetype attribute used to distinguish average and median wages in this example. The geom_label_repel function is used to apply the label layer because it positions the labels in more aesthetically pleasing places on the chart.
wage_plot1 <-
ggplot(plot_data,
aes(
x = date,
y = VALUE,
colour = gender,
linetype = wage_type
)) +
labs(
title = this_occupation,
subtitle = paste(
type_of_employee,
"\n",
"Annual Average of Weekly wages",
first_date,
"-",
last_date
),
y = "$",
x = NULL,
caption = paste("Statcan:", vector_list, "JCI ", Sys.Date())
) +
theme(legend.title = element_blank()) +
geom_line() +
geom_label_repel(data = last_values, aes(label = round(VALUE, 0), colour =
gender))
print(wage_plot1)
ggsave(
wage_plot1,
file = paste0("wage_", type_of_wage, "_", this_occupation, ".png"),
width = 12,
height = 7.5
)
plot_counter <- plot_counter + 1
unique_plot_name <- paste0("plot_", plot_counter)
assign(unique_plot_name, wage_plot1)
plot_list <- c(plot_list, unique_plot_name)
}
Note that the vector list is included in the plot caption. The last portion of the loop creates a unique name for each plot and builds up a text list of plot names for subsequent processing. The final piece of code simply groups the plots into a single two-plot png.
library(gridExtra)
these_plots <- mget(plot_list)
stacked_plots <- marrangeGrob(these_plots,
nrow = 2,
ncol = 1,
top = NULL)
ggsave(stacked_plots,
file = "wage_rate_plots.png",
width = 12,
height = 15)
The resulting two-plot png is shown below.