Processing NDM Tables

This note provides an example of retrieving and processing a complete NDM table in R. The script and log are included in a corresponding PDF in the downloads.

Packages USED

The key package used in this vignette is the dplyr package. Key packages include:

  • dplyr – for processing the csv.
  • ggplot2 – part of tidyverse for plotting the results

Retrieving the DATA

The first step is to load the packages, define the NDM product ID, retrieve the required zip file and unzip it to access the csv file. The code checks for the existence of the csv and whether it should be refreshed.

#This vignette retrieves matrices from the LFS
setwd("D:\\OneDrive\\lfs_stuff")
project_folder<-getwd()
data_folder<-paste0(project_folder,"/")
library(lubridate)
library(tidyverse)
library(dplyr)
target_product<-"14-10-0017"
file_refresh<-TRUE
file_refresh<-FALSE
#first step is to test if the csv file is already available in unzipped in the working folder
target_csv<-paste0(gsub("-","",target_product),".csv")
#see if it exists
if((!file.exists(target_csv))|file_refresh){
      #create url
      url_file<-gsub(".csv","-eng.zip",target_csv)
      calc_url<-paste0(url_prefix,url_file)
      download_target<-paste0(project_folder,"/",url_file)
      download.file(calc_url,download_target,mode="wb",quiet=FALSE)
unzip(download_target,overwrite=TRUE,exdir=substr(data_folder,1,nchar(data_folder)-1))
      }

The next stage reads the CSV with the read_csv routine that is part of the tidyverse readr component. The dplyr package is used to process the file, selecting only the Canada geography and selected variables. The column names of the csv file are best reviewed by opening the csv one time with a package such as notepad++ or notepad. Many CSVs cannot be fully loaded into Excel and there is no need to use that package. In the code below, the mutate verb is used to create a new variable, Date, that is a standard R date. To do this, it is necessary to convert the REF_DATE string to a regular date by added a day (“01”) to the end of each value. Overly long variables and those containing blanks are renamed for convenience. Such variables must be enclosed in back ticks (`) in order to be recognized by dplyr. The initial read, the select, mutate and rename verbs are all connected using pipes (%>%).

#now we are going to great a base tibble with just the estimate for canada
#column names with spaces must be enclosed in back tick marks
#the date has to be modified to include a day to get the as.Date function to recognize it
#read_csv will also recognize a zip file and unzip
canada_lfs_estimate<-read_csv(file=target_csv,progress=FALSE) %>%filter(GEO=="Canada") %>%
select(REF_DATE,GEO,`Labour force characteristics`,Sex,`Age group`,VALUE) %>%
mutate(Date=as.Date(paste0(REF_DATE,"-01"))) %>%
rename(LFS_characteristics=`Labour force characteristics`,Age_group=`Age group`)

The next stage is to calculate annual data from the averages of the seasonally unadjusted data. The advantage of this is that the current year is calculated as an average to date. The data are grouped by the key variables required and by the Year variable which is created by a prior mutate verb. The year function from the lubridate package is required to get the year. An ungroup verb is used as the last stage of the piped set of commands to remove the grouping. Not doing that will confuse later dplyr processes which will then unnecessarily require the grouping variables.

#calculating annual averages this way gives us the year-to-date value for the current year
annual_data<-canada_lfs_estimate%>%
mutate(Year=year(Date))%>%
group_by(GEO,LFS_characteristics,Age_group,Sex,Year) %>%
summarize(Annual_average=mean(VALUE))%>%
ungroup()

The next set of code gets the unique age group strings and selects the lowest level ones. The appropriate selection must be determined by prior inspection. Extracting the strings from the data file avoids problems from future edits by Statistics Canada of its classifications.

#now get the age_groups
available_age_groups<-unique(annual_data$Age_group)
print(available_age_groups)
#by inspection, the zero-level groups in this table are 1,5,6,10,11,12,13,16,17,19,21
zero_indices<-c(1,5,6,10,11,12,13,16,17,19,21)
zero_level_age_groups<-available_age_groups[zero_indices]
available_genders<-unique(annual_data$Sex)
#This version loops through gender
numb_genders<-length(available_genders)

The next stage is to start a loop over genders of a process which selects three tibbles for source population, full- and part-time employment. The filter verb selects the data by age group, LFS characteristic and gender.

for (this_gender in 1:numb_genders){
      Gender<-available_genders[this_gender]
      source_population<-filter(annual_data,Age_group %in% zero_level_age_groups,LFS_characteristics=="Population",Sex==Gender)
      full_time_employment<-filter(annual_data,Age_group %in% zero_level_age_groups,LFS_characteristics=="Full-time employment",Sex==Gender)
      part_time_employment<-filter(annual_data,Age_group %in% zero_level_age_groups,LFS_characteristics=="Part-time employment",Sex==Gender)

The next segment of code calculates the full- and part-time rates by using a left_join to link the population with the corresponding employment on the basis of year and age group.

#now calculate th full_time rate as part of a left join
      full_time_rate<-left_join(source_population,(select(full_time_employment,Year,Age_group,Annual_average)),
      by=c("Year","Age_group"),
      suffix=c(".pop",".fe")) %>%
      mutate(employment_rate=100*Annual_average.fe/Annual_average.pop)%>%
      select(Year,Age_group,employment_rate)
      #now part time
      part_time_rate<-left_join(source_population,(select(part_time_employment,Year,Age_group,Annual_average)),by=c("Year","Age_group"),
      suffix=c(".pop",".pe")) %>%
      mutate(employment_rate=100*Annual_average.pe/Annual_average.pop)%>%
      select(Year,Age_group,employment_rate)

The next set of code applies a negative sign to the full-time employment for the purposes of plotting. Essentially the plots below include two layers of bar charts with the full-time measure as negative to get the pyramid appearance. The rates, calculated above, have the same column names so that they can be row bound using the bind_rows command. A list of the tibbles is used to facilitate renaming them to something useful. The .id value of Type is used to hold the list element names to distinguish the data sets combined into the rate_set tibble.

The test years are defined in a vector in a filter command to select the data for plotting. Breaks for the value (y) axis are defined. However, since some are negative, labels have to be defined by taking the absolute values.

#for purposes of pyramid, full-time rate is set negative
      full_time_rate<-mutate(full_time_rate,employment_rate=-employment_rate)
      bind_list<-list(full_time_rate,part_time_rate)
      names(bind_list)<-c("Full time","Part time")
      rate_set<-bind_rows(bind_list,.id="Type") %>%
      mutate(Age_group=gsub(" years","",Age_group))
      #as a test plot select one year
      plot_data<-filter(rate_set,Year %in% c(2000,2007,2015,2018))
      break_list<-seq(-80,40,by=20)
      break_labels<-as.character(abs(break_list))

The final set of code defines the plot. The data are grouped by Year so that there is one plot of each selected year and the facet_wrap verb creates the required two column aggregate plot of the separate year plots. The coordinates are flipped so that the value (y) is at the bottom.

p1<-ggplot(plot_data,aes(x=Age_group,y=employment_rate,group=Year,fill=Type,
      label=abs(round(employment_rate,1))))+
      geom_bar(data=dplyr::filter(plot_data,Type=="Full time"),stat="identity")+
      geom_bar(data=dplyr::filter(plot_data,Type=="Part time"),stat="identity")+
      labs(title="LFS Employment Rates by Age",caption="Statistics Canada, JCI",
      subtitle=Gender,
      y="Percentage Share of Population",x="Age Group")+geom_text(size=3)+
      scale_y_continuous(breaks=break_list,labels=break_labels)+
      coord_flip()+
      facet_wrap(~Year,ncol=2)+theme(legend.position="bottom")
  ggsave(p1,file=paste0(Gender,"_employment_rate_pyramids_labels.png"))
}

The last curly bracket terminates the for loop. There will be three facet plots produced, one for each gender. The gender name is included as part of the output file name as well as the subtitle of the plot. The critical part is the data specification in each geom_bar layer. This separates the input dataset into the left and right halves required for the pyramid. The dplyr filter is used to define the data required. The breaks calculated above are included in the arguments to the scale_y_continuous function in the plot development. The legend for the employment types is set to be at the bottom rather than the default side position.

The plots are shown below.

Both sexes employment rate pyramids labels

Males employment rate pyramids labels

Females employment rate pyramids labels