This category of material includes notes introducing R, along with vignettes on its use that may be of interest to economists. Additional material is available in PDF form in the downloads. These scripts are provided with no support, warranty, or guarantee as to their general or specific usability. The most recent articles are shown first.
Analyzing Provincial Employment With The Cansim Package In R
In this example, we will use the get_cansim_vector function from Mountain Math to retrieve the monthly LFS employment data. Vectors for total, full-time and part-time employment will be retrieved in a loop. A standard bar chart will be produced, and a methodology for showing similar information in a bubble chart will also be introduced. The advantage of the bubble chart is that the size of the absolute change in each province can be shown by resizing the bubbles.
Packages USED
The key package used in this vignette is cansim, which accesses the NDM API. The packages used include
- dplyr – for processing the tibble prior to conversion to xts
- ggplot2 – part of tidyverse for plotting the results
- cansim – package to retrieve metadata and series from Statistics Canada’s NDM
- lubridate – for managing date variables
- gridExtra – for grouping plots and
- gdata – for renaming plots within a loop for subsequent work.
Retrieving the DATA
The initial stage, as always, is to load the key packages. A list of three character vectors containing the Vnumbers for the types of employment, with Canada first, is then defined. The main structure is a loop that processes the character vectors sequentially. The major steps are:
- process the meta data for documentation purposes,
- retrieve the series,
- create a data structure for processing the last three months,
- create two plots, a bar and a bubble plot,
- rename the plot outputs for later grouping.
The actual vectors could have been found using the search functions of the cansim package, but in this case the Statcan NDM website was used.
#This vignette retrieves vectors from the LFS
setwd("D:\\OneDrive\\lfs_stuff")
library(dplyr)
library(cansim)
library(lubridate)
library(ggplot2)
library(gdata)
#some routines make use of a cache which can be usefully set in options
options(cache_path="d:\\mm_data_cache")
# https://www12.statcan.gc.ca/census-recensement/2011/ref/dict/table-tableau/table-tableau-8-eng.cfm
geo_abreviations<-c("CAN","NL","PE","NS","NB","QC","ON","MB","SK","AB","BC")
#these are standard geo_codes
pruid<-c(99,10, 11,12,13,24,35,46,47,48,59)
total_vectors<-c(
"v2062811",
"v2063000",
"v2063189",
"v2063378",
"v2063567",
"v2063756",
"v2063945",
"v2064134",
"v2064323",
"v2064512",
"v2064701"
)
full_time_vectors<-c(
"v2062812",
"v2063001",
"v2063190",
"v2063379",
"v2063568",
"v2063757",
"v2063946",
"v2064135",
"v2064324",
"v2064513",
"v2064702"
)
part_time_vectors<-c(
"v2062813",
"v2063002",
"v2063191",
"v2063380",
"v2063569",
"v2063758",
"v2063947",
"v2064136",
"v2064325",
"v2064514",
"v2064703"
)
#make up list of vectors
vector_list<-list(total_vectors,full_time_vectors,part_time_vectors)
title_modifier<-c("Total","Full-time","Part-time")
numb_sets<-length(vector_list)
for(this_set in 1:numb_sets){
canada_lfs_vbls<-vector_list[[this_set]]
employment_type<-title_modifier[this_set]
print(canada_lfs_vbls)
At this point, we are at the top of the loop and begin the analysis of the meta data.
#analyze meta data
meta1<-get_cansim_vector_info(canada_lfs_vbls)
cube_meta<-get_cansim_cube_metadata(unique(meta1$table))
common_meta<-left_join(meta1,cube_meta,by=c("table"="productId")) %>%distinct()%>%
select(VECTOR,table,title_en,cubeTitleEn,cubeEndDate)
print(as.data.frame(common_meta))
#The coordinate is part of the meta data. The first level gives a numeric geography
#we want to make sure that the vector is organized canada followed by provinces
coordinate_split<-strsplit(meta1$COORDINATE,"[.]")
#the result is a list with a character vector entry for row
#when retrieved from cansim with the meta data it is always 10 levels long
numb_levels<-length(coordinate_split[[1]])
id_data<-as.data.frame(matrix(as.integer(unlist(coordinate_split)),ncol=numb_levels,byrow=TRUE))
geo_order<-id_data$V1
print(data.frame(geo_order,meta1$title_en))
#the geo_order could be used to reorder the data columns below if needed
#make a table to map the abbreviations
abreviation_table<-tibble(VECTOR=canada_lfs_vbls,geo=geo_abreviations,geo_code=geo_order)
One key feature of the code above is its use of the vector coordinate variable, which is different for each vector and can be quite useful for ordering results. In this case, the first level of the coordinate is extracted to provide order codes for Canada and the provinces.
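The coordinate mechanics can be checked in isolation; a minimal sketch with made-up coordinate strings (the values below are illustrative, not real NDM coordinates):

```r
# Hypothetical coordinate strings in the NDM style: ten dot-separated levels
coords <- c("1.1.1.0.0.0.0.0.0.0", "10.1.1.0.0.0.0.0.0.0")
coordinate_split <- strsplit(coords, "[.]")
numb_levels <- length(coordinate_split[[1]])        # 10 levels per coordinate
id_data <- as.data.frame(matrix(as.integer(unlist(coordinate_split)),
                                ncol = numb_levels, byrow = TRUE))
geo_order <- id_data$V1                             # first level is the geography
print(geo_order)                                    # 1 10
```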
The next stage starts the retrieval process. The REF_DATE field is mutated into an additional Date field using the R date conventions, and the abbreviation table is joined to the data tibble to identify the series.
#now retrieve the data, creating a date and adding on the abreviation
canada_lfs <- get_cansim_vector(canada_lfs_vbls,start_time=as.Date("1995-01-01"))%>%
mutate(Date=as.Date(REF_DATE))%>%
left_join(abreviation_table,by="VECTOR")
The basic canada_lfs tibble is then extended to include the year-over-year percentage change and the difference from one year ago. A new tibble is created, and the dplyr slice operator is used to select the last three data points.
#now add on the percentage change
canada_lfs_pch<-canada_lfs %>%
group_by(geo_code) %>%
arrange(Date) %>%
mutate(pch=100*((VALUE/lag(VALUE,12))-1),annual_change=VALUE-lag(VALUE,12)) %>%
slice((n()-2):n())%>%ungroup()
print(as.data.frame(select(canada_lfs_pch,REF_DATE,VALUE,geo,pch,annual_change)))
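The year-over-year arithmetic behind the mutate can be verified without dplyr; a base-R sketch for a single geography, using invented employment levels:

```r
# 24 months of invented employment levels for one geography
value <- c(rep(100, 12), rep(110, 12))
lag12 <- c(rep(NA, 12), value[1:12])          # value 12 months earlier
pch <- 100 * ((value / lag12) - 1)            # year-over-year % change
annual_change <- value - lag12                # level difference from a year ago
last_three <- tail(data.frame(value, pch, annual_change), 3)
print(last_three)                             # pch = 10, annual_change = 10
```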
The next stage is to create the plot tibble for the standard bar chart. The geo abbreviation is turned into a factor so that order can be preserved in the chart.
#now start a bar plot
plot_data<-select(canada_lfs_pch,REF_DATE,geo,pch)
geo_stuff<-unique(plot_data$geo)
#create a factor variable for the limits on the x axis
geo_factor<-factor(geo_stuff,levels=geo_stuff)
text_adjust_factor<-.07*max(abs(plot_data$pch))
plot1<-ggplot(plot_data,aes(x=geo,y=pch,group=substr(REF_DATE,1,7),
label=round(pch,1),fill=substr(REF_DATE,1,7)))+
geom_bar(position=position_dodge(0.9),stat="identity")+
labs(title=paste(employment_type,"Year over Year Percentage Change"),subtitle="Employment 15+ Seasonally Adjusted",
x=NULL,y="Year over Year Percentage Change",
caption="Statcan, JCI")+
scale_x_discrete(limits=geo_factor)+
geom_text(size=3,position=position_dodge(0.9),aes(y=(sign(pch)*
(text_adjust_factor+abs(pch)))))+
theme(legend.title=element_blank(),legend.position="bottom")
ggsave(plot1,file=paste0(employment_type,"_LFS15_annual_change.png"))
This chart plots the year-over-year percentage change. A text adjustment factor is calculated to position each numeric value label far enough beyond the end of its bar on the y axis. Each month is drawn as a separate bar by using position_dodge, with the first seven characters of REF_DATE used to group the data by a discrete value. The file name incorporates the employment type. The Total example is shown below.
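The label-positioning formula can be tested on its own; with invented pch values, it pushes each label away from zero by a fixed offset while keeping it on the same side as its bar:

```r
pch <- c(2.5, -1.2, 0.8)                       # invented bar heights
text_adjust_factor <- .07 * max(abs(pch))      # 7% of the largest absolute bar
label_y <- sign(pch) * (text_adjust_factor + abs(pch))
print(label_y)                                 # 2.675 -1.375 0.975
```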
The example shows a deficiency of bar charts: there is no distinction by the size of the province or of the change in employment.
The next step is to prepare the bubble chart. This is essentially identical to the bar chart, but geom_point rather than geom_bar is used to place a point at the appropriate position on the chart. The size of each point is determined by the absolute year-over-year change; the absolute value is needed because size cannot be negative.
#now select the last date
last_date<-max(canada_lfs_pch$Date)
#now try bubble charts
plot_data_2<-select(canada_lfs_pch,REF_DATE,geo,pch,annual_change)
plot2<-ggplot(plot_data_2,aes(x=geo,y=pch,group=substr(REF_DATE,1,7),
label=round(pch,1),size=abs(annual_change),
fill=substr(REF_DATE,1,7),colour=substr(REF_DATE,1,7)))+
geom_point(stat="identity",position=position_dodge(0.9))+
labs(title=paste(employment_type,"Year over Year Percentage Change"),
subtitle="Employment 15+ Seasonally Adjusted\nBy Size of Change",
x=NULL,y="Year over Year Percentage Change",
caption="Statcan, JCI")+
scale_x_discrete(limits=geo_factor)+
geom_hline(yintercept=0)+
geom_text(size=3,position=position_dodge(0.9),aes(y=(sign(pch)*
(text_adjust_factor+abs(pch)))))+
theme(legend.title=element_blank(),legend.position="bottom")
ggsave(plot2,file=paste0(employment_type,"_LFS15_annual_change_point.png"))
For appearance’s sake, a horizontal line is inserted at the zero intercept. The Total example is shown below.
In this example, it can be seen that the important contributors to employment change in this period were Ontario, Alberta and BC.
The next block of code joins the two plots vertically, with the bubble plot first. The gdata library supplies the mv function, which allows the plot variable to be renamed to something useful. This is the bottom of the loop over the types of employment.
library(gridExtra)
plot_list<-mget(c("plot2","plot1"))
two_plots<-marrangeGrob(plot_list,ncol=1,nrow=2,top=NULL)
two_plot_name<-paste0(employment_type,"_lfs_change_analysis.png")
ggsave(file=two_plot_name,width=8,height=10.5,two_plots)
#now rename the plot objects to save them
mv(from="plot1",to=paste0(employment_type,"_plot1"))
mv(from="plot2",to=paste0(employment_type,"_plot2"))
}
An example of the two plots joined together is shown below.
The final piece of code groups the two bubble plots for full- and part-time employment into one vertical plot. Note that the plot is sized to slightly less than a normal page to get good proportions.
plot_list<-mget(c("Full-time_plot2","Part-time_plot2"))
two_plots<-marrangeGrob(plot_list,ncol=1,nrow=2,top=NULL)
two_plot_name<-paste0("Full_Part","_lfs_bubble_analysis.png")
ggsave(file=two_plot_name,width=8,height=10.5,two_plots)
The example is shown below.
Plotting LFS Standard Errors
Introduction
This script shows how to retrieve several LFS series using the Mountain Math cansim retrieval package and plot the monthly total employment change with the associated standard errors. This plot is coupled with a plot of the underlying trend-cycle estimates of the level of total employment. The latter is probably the most appropriate estimate to use for planning purposes.
In this example the key packages are:
- dplyr – to process the retrieved data – part of tidyverse
- ggplot2 – to do the plots – part of tidyverse
- cansim – to retrieve the series from NDM
- xts – to manage time series
- tbl2xts – to convert the retrieved tibble to an xts data frame
- lubridate – to manipulate dates – part of tidyverse.
Data Retrieval
The first stage is to set a working directory, load the initial required packages, define the list of Labour Force Survey (LFS) vectors required and retrieve and print the meta data.
#This vignette retrieves vectors from the LFS
setwd("D:\\OneDrive\\lfs_stuff")
library(dplyr)
library(cansim)
library(tbl2xts)
library(xts)
library(lubridate)
#some routines make use of a cache which can be usefully set in options
options(cache_path="d:\\mm_data_cache")
#the series are defined, the date is adjusted to suit the conventions of the xts time series package
canada_lfs_vbls<-c("v2062811","v100304302","v101884904", "v101884905",
"v101884906"
)
#define the plot window in terms of periodicity units
plot_window<-"24 months"
#analyze meta data
meta1<-get_cansim_vector_info(canada_lfs_vbls)
cube_meta<-get_cansim_cube_metadata(unique(meta1$table))
common_meta<-left_join(meta1,cube_meta,by=c("table"="productId")) %>%distinct()%>%
select(VECTOR,table,title_en,cubeTitleEn,cubeEndDate)
print(as.data.frame(common_meta))
The get_cansim_vector_info function retrieves the title information and table number for each vector. The get_cansim_cube_metadata function retrieves the metadata for the table(s); in this case, there is only one table. The dplyr left_join function is used to link the series and table metadata, and the select operator reduces the resulting tibble to just the vector, the table number (NDM productId), and the series and table titles. The common_meta tibble is coerced to a data frame for printing so that all fields are shown in the log file. This meta data provides the documentation for the run.
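For readers without dplyr, the join step can be sketched with base R's merge, which with all.x = TRUE behaves like left_join (the identifiers and titles below are invented):

```r
# invented series metadata: two vectors belonging to one table
meta1 <- data.frame(VECTOR = c("v1", "v2"),
                    table = c("36-10-0434", "36-10-0434"),
                    title_en = c("Series one", "Series two"))
# invented cube metadata for that table
cube_meta <- data.frame(productId = "36-10-0434",
                        cubeTitleEn = "GDP at basic prices")
# all.x = TRUE keeps every row of meta1, like dplyr's left_join
common_meta <- merge(meta1, cube_meta, by.x = "table", by.y = "productId",
                     all.x = TRUE)
print(common_meta)   # both rows carry the cube title
```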
The next code retrieves the data series using get_cansim_vector, which returns a tall tibble with one row per data point. The NDM REF_DATE field is returned as a character date by get_cansim_vector but is converted to an R Date variable for use in time-related calculations.
canada_lfs_xts <- get_cansim_vector(canada_lfs_vbls,start_time=as.Date("1995-01-01")) %>%
mutate(Date=as.Date(REF_DATE)) %>%
tbl_xts(spread_by="VECTOR",cols_to_xts="VALUE")
tail(canada_lfs_xts,20)
#use index function or dates to select rows in xts data frame.
print(colnames(canada_lfs_xts))
print(data.frame(select(meta1,VECTOR,title_en)))
The dplyr mutate function is used to create the vector of R dates required by the tbl_xts function from the tbl2xts package. The latter function creates a data frame with columns for each unique vector and data derived from the VALUE field in the initial tibble. The tail function is used to show the last 20 observations of the resulting data frame and the column names and meta data are printed again.
The xts package was loaded after the dplyr package so that the xts version of merge is used in the next block, which calculates the month-to-month level difference. When using diff and lag functions in R, check the documentation for the package involved because different packages use different conventions for the sign of the lag. An xts data structure is created using the xts version of merge to maintain the xts attributes, and the column names are reset to be useful for the plot.
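Because these conventions differ between packages, it is worth checking diff on a small invented vector first; base R's diff computes x[t] - x[t-1] and shortens the series, whereas a padded manual lag (the convention dplyr::lag follows) keeps the original length, much as xts pads with a leading NA:

```r
x <- c(100, 103, 101, 106)           # invented monthly levels
print(diff(x, lag = 1))              # 3 -2 5 : x[t] - x[t-1], one element shorter
# a manual lag shifts values forward and pads with NA
manual_lag <- c(NA, x[-length(x)])   # NA 100 103 101
print(x - manual_lag)                # NA 3 -2 5 : same changes, original length
```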
#the first vector is lfs employment 15+
#the fourth vector is the standard error of the month to month change
lfs_change<-diff(canada_lfs_xts[,1],lag=1,differences=1)
#the data have to be merged together to maintain the xts attributes
data1<-merge(canada_lfs_xts[,1],lfs_change,canada_lfs_xts[,4])
colnames(data1)<-c("LFS15+","DeltaLFS15+","SEDeltalfs15+")
tail(data1,12)
plot_data<-merge(data1[,2],data1[,2]+.5*data1[,3],data1[,2]-.5*data1[,3])
colnames(plot_data)<-c("DeltaLFS15","Upper","Lower")
The data1 xts data frame contains three series: the level of employment, the monthly change and its standard error. For the initial plot, a plot data structure is created with the change series, the change series plus half the standard error as the upper limit, and the change series minus half the standard error as the lower limit. Appropriate column names are defined for use in the plot.
The next piece of code converts the xts data frame back to a tibble after using the last function, from the xts package, to select the last portion of the data. The plot window is defined above in terms of the periodicity of the data, in this case, months.
#making it into a wide tibble makes it easier to manage in ggplot because the date is accessible
#last 2 years - use units of series such as months or days as in example below
plot_data2<-xts_tbl(last(plot_data,plot_window))
#now calculate a date string for the caption
date_string<-paste0(plot_data2$date[1],"/",tail(plot_data2$date,1))
head(plot_data2)
A date string is calculated from the first and last values of the date in the plot_data2 frame. The first rows of the plot data tibble are printed for documentation purposes.
The next set of code plots the LFS change variable with the standard-error-augmented differences as the upper and lower bounds. The basic level change is marked with a point; the pointrange geom draws a line between the upper and lower values, and the geom_line portion draws a line through the change points.
#load the plot library
library(ggplot2)
plot1<-ggplot(plot_data2,aes(x=date,y=DeltaLFS15))+geom_point()+geom_pointrange(aes(ymin=Lower,ymax=Upper))+
geom_line()+
labs(title="Monthly Change in LFS 15+ Employment",y="Monthly Change (000)",
x=NULL,
subtitle="Seasonally Adjusted with Standard Error Bars",
caption=paste(date_string," ","Statcan:",paste(canada_lfs_vbls[c(1,4)],collapse=", "),"JCI"))
ggsave(plot1,file="LFS15_monthly_change.png")
The caption string includes the date string calculated above and the two vector numbers, which are merged into a single string separated by a comma. The plot is then saved to a png file. The example is shown below.
The next set of code plots the level of the employment series along with the trend-cycle component. The latter provides a better basis for analysis than the more volatile time series of LFS point estimates. We want to plot the last 6 months of the trend cycle with a different line type, so one variant is created with the last 6 months set to NA and another contains just the last 6 months. Note the combination of three series with the xts merge.
#now start a plot of the trend cycle and the total employment
# review this link about trend cycle https://www.statcan.gc.ca/eng/dai/btd/tce-faq
#we want to drop the last dotted_period observations from the trend cycle and add a third series which is just the tail of the trend cycle.
dotted_period<-6
data_chart2<-merge(canada_lfs_xts[,1],head(canada_lfs_xts[,2],length(canada_lfs_xts[,2])-dotted_period),
tail(canada_lfs_xts[,2],dotted_period),fill=NA)
tail(data_chart2,24)
colnames(data_chart2)<-c("LFS15plus","Trend-Cycle","Trend Recent")
plot_data_trend<-last(data_chart2,plot_window)
Again, the last plot_window observations are selected for the plot.
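The head/tail split that drives the two line types can be demonstrated on a plain vector with invented values; the merge with fill=NA in the script aligns the pieces in the same way:

```r
trend <- c(10, 11, 12, 13, 14, 15)   # invented trend-cycle values
dotted_period <- 2
solid_part  <- head(trend, length(trend) - dotted_period)  # 10 11 12 13
recent_part <- tail(trend, dotted_period)                  # 14 15
# padded to full length, as merge(..., fill = NA) would align them
solid_full  <- c(solid_part, rep(NA, dotted_period))
recent_full <- c(rep(NA, length(trend) - dotted_period), recent_part)
print(rbind(solid_full, recent_full))
```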
The next set of code calculates the date string for the caption using the index variable for the xts plot_data_trend frame. The index variable is essentially the time vector for the frame.
library(broom)
date_string_trend<-paste0(as.Date(index(plot_data_trend))[1],"/",as.Date(tail(index(plot_data_trend),1)))
trend_tibble<-tidy(plot_data_trend)
#linetype and colour are managed in the alphabetical order of the series group in the tibble.
trend_plot<-ggplot(trend_tibble,aes(x=index,y=value,group=series,
colour=series))+geom_line(aes(linetype=series))+
scale_linetype_manual(values=c("solid","solid","dashed"))+
theme(legend.position="bottom",legend.title=element_blank())+
labs(title="Monthly LFS 15+ Employment",x=NULL,y="Monthly Level (000)",
subtitle="Seasonally Adjusted",
caption=paste(date_string_trend," ","Statcan:",paste(canada_lfs_vbls[c(1,2)],collapse=", "),"JCI"))
ggsave(trend_plot,file="LFS15_monthly_trend.png")
library(gridExtra)
plot_list<-mget(c("plot1","trend_plot"))
two_plots<-marrangeGrob(plot_list,nrow=1,ncol=2,top=NULL)
two_plot_name<-"lfs_se_trend.png"
ggsave(file=two_plot_name,width=15,height=8,two_plots)
The tidy function from the broom package (part of tidyverse) is used to create the required tibble.
The series are processed in alphabetical order, so the line types must be specified in that order. The line type is varied by the series variable created by the tidy function; this variable contains the names of the three series being plotted.
The gridExtra package is used to merge the two plots into one larger graphic, with both plots side by side, for easier inclusion in a website or Word document. The dimensions in the ggsave call are defined in inches.
The merged plots are shown below.
GDP Growth Decomposition
In this example, we will use the get_cansim function from Mountain Math to retrieve the monthly GDP $K data. The Laspeyres (fixed-weight) time series will be used to analyze the relative role that major aggregates played in monthly total GDP growth. The meta data now available in the NDM will be used to flexibly select the aggregates to be analyzed.
Packages USED
The key package used in this vignette is cansim, which accesses the Statcan NDM API. The packages used include
- dplyr – for processing the tibble prior to conversion to xts
- ggplot2 – part of tidyverse for plotting the results
- cansim – package to retrieve metadata and series from Statistics Canada’s NDM
- grid, gridExtra, gtable – for developing a table of the growth rates to be plotted.
Retrieving the DATA
The first major step is to use the get_cansim function to retrieve the complete table for analysis. This table is saved using the saveRDS function in R to the local project folder. This facilitates documentation as well as easy processing. A flag is used to determine whether to retrieve a fresh copy of the table for processing.
library(cansim)
library(tidyverse)
library(lubridate)
#get gdp table and save it in the working directory
gdp_table_id<-"36-10-0434"
file_refresh<-TRUE
file_refresh<-FALSE
save_file<-paste0(gdp_table_id,".rds")
if(!file.exists(save_file)|file_refresh){
gdp_table<-get_cansim(gdp_table_id)
#now save it in the working directory
saveRDS(gdp_table,file=save_file)
} else {
gdp_table<-readRDS(save_file)
}
print(colnames(gdp_table))
#The current column names are
# 1 REF_DATE
# 2 GEO
# 3 DGUID
# 4 Seasonal adjustment
# 5 Prices
# 6 North American Industry Classification System (NAICS)
# 7 UOM
# 8 UOM_ID
# 9 SCALAR_FACTOR
# 10 SCALAR_ID
# 11 VECTOR
# 12 COORDINATE
# 13 VALUE
# 14 STATUS
# 15 SYMBOL
# 16 TERMINATED
# 17 DECIMALS
# 18 GeoUID
# 19 Classification Code for Seasonal adjustment
# 20 Hierarchy for Seasonal adjustment
# 21 Classification Code for Prices
# 22 Hierarchy for Prices
# 23 Classification Code for North American Industry Classification System (NAICS)
# 24 Hierarchy for North American Industry Classification System (NAICS)
#now we will rename columns as needed. We can use numbers for the original columns
The interactive analysis of the columns of the retrieved table should always be an initial step in script development. The column numbers are shown in the R comments. Many of the names are long enough to be unwieldy when writing easy-to-manage code. As noted before, column names with blanks or special characters must be enclosed in back ticks (`) when using dplyr and other tidyverse packages. The file_refresh flag should be set to TRUE to force re-retrieval of the table.
In the next stage the table is simplified: the columns to be used are renamed as required, with the column number used to simplify the renaming, and an R-standard date is created. Columns are also selected by number rather than name for simplicity.
#the working matrix will be set to reduced columns
#an R standard date is also created
gdp_monthly<-gdp_table %>%
rename(`sadj_type`=4,
`NAICS`=6,
`price_type`=21,
`naics_code`=23
) %>%
select(c(1,2,4:6,12,13,23)) %>%
mutate(data_date=as.Date(paste0(REF_DATE,"-1")))
#gdp table could be removed at this point
The next stage is to start the processing to develop the data for the initial and subsequent charts. We will use selected special aggregates as well as a few key service series. These series will be defined using the NAICS code in the meta data provided by the get_cansim retrieval. This variable, defined as naics_code in the rename above, is easier to work with than the NAICS text strings when selecting series. In the code below, the text strings are used to select the required seasonal adjustment type and price measure. For this analysis, we will be using the 2007 fixed-weighted Laspeyres measures, which are referred to as “2007 constant”. The select verb in the code above uses column numbers rather than variable names for coding convenience.
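The back-tick rule mentioned above applies in base R as well as in dplyr; a quick demonstration with an invented column name containing a blank:

```r
# a column name with a blank must be back-ticked to be referenced
df <- data.frame(`Seasonal adjustment` = c("SA", "NSA"), check.names = FALSE)
print(df$`Seasonal adjustment`)     # "SA" "NSA"
# renaming by position avoids typing the long name at all
names(df)[1] <- "sadj_type"
print(names(df))                    # "sadj_type"
```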
#chart 1 - composition of growth for special aggregates
target_aggregates<-c("[T001]",
"[T002]",
"[T003]",
#"[T004]",
"[T005]",
"[T006]",
"[T007]",
"[T011]",
"[T012]",
"[T016]",
"[T018]",
"[41]",
"[44-45]",
"[52]","[53]"
)
#build tibble of distinct naics_codes
naics_table<-select(gdp_monthly,NAICS,naics_code)%>%distinct()
#calculate last date
last_date<-max(gdp_monthly$data_date)
date_vector_1<-c(last_date-months(3),last_date-months(2),last_date-months(1),last_date)
#get unique_prices and the string for constant dollars
price_measures<-unique(gdp_monthly$Prices)
laspeyres_flag<-price_measures[grep("constant",price_measures)]
#get_flag for seasonally adjusted at annual rates
seasonal_measures<-unique(gdp_monthly$sadj_type)
seasonal_flag<-seasonal_measures[grep("annual",seasonal_measures)]
#get the data for the growth table with the lag required for percentage change
In the initial analysis, the naics_table calculated here was examined for the target aggregate codes defined in the character vector above. The next set of code develops the main table for processing. Data are selected only for the required months; in this case the last 4 months, since we are going to report 3 monthly changes in one of the tables.
The data must be grouped by naics_code and COORDINATE and then sorted (“arranged”) by date so that lags, percentage changes and contributions can be calculated correctly. The COORDINATE variable preserves the ordering.
growth_data<-filter(gdp_monthly,data_date %in% date_vector_1,
Prices==laspeyres_flag,sadj_type==seasonal_flag) %>%
group_by(COORDINATE,naics_code) %>%arrange(data_date)%>%
mutate(monthly_change=100*((VALUE/lag(VALUE,1))-1),
delta_change=VALUE-lag(VALUE,1))%>%ungroup()
head(as.data.frame(growth_data))
The head function is applied to a data.frame to force the display of all columns for the initial rows of the tibble. The monthly percentage change as well as the delta change (level difference) are calculated; the latter is required for the composition-of-growth calculations below.
The next set of code defines a tibble for the total all-industries sector. The ratio of the delta change in the target sectors to that of the total sector defines the composition of change. The latter calculation is part of the calculation of the full_table which involves a left_join of the growth_data table to the total table.
#now select the total only
total_only<-filter(growth_data,naics_code=="[T001]")%>%select(data_date,monthly_change,delta_change)%>%
rename(total_change=delta_change,total_pch=monthly_change)
#now the base table and total to calculate the composition of change
full_table<-left_join(growth_data,total_only,by="data_date")%>%
filter(!is.na(delta_change)) %>%
mutate(point_contribution=((delta_change/total_change)*total_pch))
The point contribution is defined as the sector's share of the total level change multiplied by the percentage change of the total series.
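The contribution arithmetic can be checked with invented numbers; because the sector level changes sum to the total level change, the point contributions sum to the total percentage change:

```r
total_pch    <- 0.5                      # invented total monthly % change
total_change <- 1000                     # invented total level change
delta_change <- c(600, 300, 100)         # invented sector level changes
point_contribution <- (delta_change / total_change) * total_pch
print(point_contribution)                # 0.30 0.15 0.05
print(sum(point_contribution))           # 0.5, the total % change
```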
The next code selects the data for just the target aggregates and for the key variables required in the subsequent charts.
#select the chart data for the last months
chart_data_set<-filter(full_table,naics_code %in% target_aggregates) %>%
select(data_date,naics_code,NAICS,point_contribution,monthly_change)
#now plot the last monthly_change
#get_industries
# naics_text<-as.data.frame(filter(naics_table,naics_code %in% target_aggregates)%>%
# select(NAICS))
reverse_index<-seq(length(target_aggregates),1,by=-1)
chart_last_month<-filter(chart_data_set,data_date==last_date)%>%
mutate(naics_text=paste(naics_code,NAICS))
The next code starts the initial chart which is only for the last date.
plot1<-ggplot((chart_last_month),aes(y=point_contribution,x=naics_code,group=data_date,
label=round(point_contribution,2)))+
labs(title="Percentage Point Contribution to Total GDP Change",
caption=paste("NDM:",gdp_table_id,"JCI"),
x=NULL,y="Percentage point contribution",
subtitle=paste("Laspeyres data for selected aggregates for",substr(last_date,1,7)))+
geom_bar(stat="identity",fill="lightgreen")+
geom_text(size=3)+scale_x_discrete(limits=chart_last_month$naics_code[reverse_index],
labels=(chart_last_month$NAICS[reverse_index]))+coord_flip()
ggsave(plot1,file="gdp_laspeyres_contribution.png")
The coordinates of the chart are flipped so that the x axis is plotted vertically. Data points are labelled with the rounded value of the point contribution. The x axis is labelled with the NAICS text even though the plot is constructed with the naics_code variable, and the labelling order is reversed so that all industries appears at the top of the chart. The date is converted to a text string for the subtitle. An example of this initial chart is shown below.
The next stage creates an alternative chart to show the data for two months. The data must be grouped to separate the months when preparing the bars. The default bar chart is stacked, so the “dodge” capability is used to place the bars side by side.
#now lets do it for 2 months
chart_last_two<-filter(chart_data_set,data_date %in% c(last_date,last_date-months(1)))%>%
mutate(naics_text=paste(naics_code,NAICS))
#calculate an adjustment factor to move the text outside the bar
#R considers a date object as a potentially continuous variable
#therefore, we convert it to a discrete text string for the group and fill values
text_adjust_factor<-.1*max(abs(chart_last_two$point_contribution))
plot2<-ggplot((chart_last_two),aes(y=point_contribution,x=naics_code,group=substr(data_date,1,7),
fill=substr(data_date,1,7),label=round(point_contribution,2)))+
labs(title="Percentage Point Contribution to Total GDP Change",
caption=paste("NDM:",gdp_table_id,"JCI"),
x=NULL,y="Percentage point contribution",
subtitle="Laspeyres data for selected aggregates")+
geom_bar(stat="identity",position="dodge")+
scale_fill_discrete(name="Month")+
geom_text(size=3,position=position_dodge(0.9),aes(y=(sign(point_contribution)*
(text_adjust_factor+abs(point_contribution)))))+
theme(legend.title=element_blank(),legend.position="bottom")+
scale_x_discrete(limits=chart_last_two$naics_code[reverse_index],
labels=(chart_last_two$NAICS[reverse_index]))+coord_flip()
ggsave(plot2,file="gdp_laspeyres_contribution_2mon.png")
The dodge position is specified in the geom_bar definition. A date construct in R is considered to be a continuous variable because it is represented as a number. Therefore, the group and fill variable definitions convert the date to a text string, which is by definition discrete. The group defines the separation of the bars; the fill defines the colour choices, from the default set, for the bars. A text adjustment factor is calculated relative to the maximum of the values plotted to facilitate positioning the value labels just beyond the ends of the bars. This is used in the geom_text definition, which supplies a text layer on top of the bar layer from geom_bar. The theme function is used to move the legend from the default side position to the bottom and remove the superfluous title. An example of the chart is inserted below.
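The continuous-date point is easy to verify: an R Date is stored internally as a day count, and converting it to a short text string is what makes it discrete for the group and fill aesthetics:

```r
d <- as.Date("2019-03-01")
print(as.numeric(d))        # the underlying day count since 1970-01-01
print(substr(d, 1, 7))      # "2019-03", a discrete grouping value
```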
The final chart example uses data for the last 3 months and is defined similarly.
#last three
chart_last_three<-filter(chart_data_set,data_date %in% c(last_date,last_date-months(1),last_date-months(2)))%>%
mutate(naics_text=paste(naics_code,NAICS))
#calculate an adjustment factor to move the text outside the bar
#R considers a date object as a potentially continuous variable
#therefore, we convert it to a discrete text string for the group and fill values
text_adjust_factor<-.1*max(abs(chart_last_three$point_contribution))
plot3<-ggplot((chart_last_three),aes(y=point_contribution,x=naics_code,group=substr(data_date,1,7),
fill=substr(data_date,1,7),label=round(point_contribution,2)))+
labs(title="Percentage Point Contribution to Total GDP Change",
caption=paste("NDM:",gdp_table_id,"JCI"),
x=NULL,y="Percentage point contribution",
subtitle="Laspeyres data for selected aggregates")+
geom_bar(stat="identity",position=position_dodge(0.9),width=.8)+
scale_fill_discrete(name="Month")+
geom_text(size=2.5,position=position_dodge(0.9),aes(y=(sign(point_contribution)*
(text_adjust_factor+abs(point_contribution)))))+
theme(legend.title=element_blank(),legend.position="bottom")+
scale_x_discrete(limits=chart_last_three$naics_code[reverse_index],
labels=(chart_last_three$NAICS[reverse_index]))+coord_flip()
ggsave(plot3,file="gdp_laspeyres_contribution_3month.png")
In the two multi-date charts, the geom_text function uses an aes (“aesthetics”) specification to supply a y-axis (value) position, calculated by adding the text adjustment to the absolute value of the point being plotted and then restoring the original sign. In this three-date version, the width option in geom_bar is used to narrow the bars slightly and provide a bit of space between them. A sample chart is shown below.
The final stage in this script is to develop a table that can be plotted with the plotting software. ggplot2 is based on the grid package. Two extensions of the grid package, gridExtra and gtable, are used to add a title to the standard table grob (graphic object). The initial step is to build a tibble that includes a shortened character version of the date, converts naics_text into a factor to preserve order in the table, and reduces the dataset to only the columns required: the sector text, the date and the contribution value. The tidyr spread function is then used to create a data frame with a column for each date.
#now build up a table for a chart table
# the date is shortened and the text is converted to factor to preserve order
table_data<-mutate(chart_last_three,date_text=substr(data_date,1,7),
row_text=factor(naics_text,levels=unique(naics_text)),contribution=round(point_contribution,2))%>%
select(row_text,date_text,contribution)%>%
rename(Sectors=row_text)
#now use spread to create a data frame with dates as columns
table_frame<-spread(table_data,key=date_text,value=contribution)
The final segment creates the actual table graphic and combines it with the third version of the charts above. The default tableGrob produces a clean table but with centred values; for this purpose, we want everything right-justified, which is done by modifying the default theme. Then gtable commands are used to put a title at the top of the table and a footnote at the bottom. The approach follows a Stack Overflow discussion of modifying a table grob in this way, noted in the comments, which explain what is going on.
library(gridExtra)
#refer to the tableGrob vignette documentation for the theme parameters
tt2 <- ttheme_default(core=list(fg_params=list(hjust=1, x=0.9)),
rowhead=list(fg_params=list(hjust=1, x=0.95)))
grob1<-tableGrob(table_frame,rows=NULL,theme=tt2)
#now add a title based on a stack overflow discussion
library(grid)
library(gtable)
grob_title<-textGrob("Percentage Contributions to Laspeyres Growth Rate",
gp=gpar(fontsize=15))
padding<-unit(.5,"line")
#create a new grob with room for the title at top (pos=0)
full_grob<-gtable_add_rows(grob1,heights=grobHeight(grob_title)+padding,pos=0)
full_grob<-gtable_add_grob(full_grob,list(grob_title),t=1,l=1,r=ncol(full_grob))
grob_footnote<-textGrob(paste("STATCAN NDM: ",gdp_table_id,"JCI"),
gp=gpar(fontsize=8))
full_grob<-gtable_add_rows(full_grob,heights=grobHeight(grob_footnote)+padding,
pos=nrow(full_grob))
full_grob<-gtable_add_grob(full_grob,list(grob_footnote),t=nrow(full_grob),l=2,r=ncol(full_grob))
plot(full_grob)
ggsave(full_grob,file="gdp_contributions_table_3months.png")
plot_list<-mget(c("plot3","full_grob"))
two_plots<-marrangeGrob(plot_list,ncol=2,nrow=1,top=NULL)
two_plot_name<-"gdp_contributions_analysis.png"
ggsave(file=two_plot_name,width=15,height=8,two_plots)
A grob is essentially a special data frame containing instructions for plotting, with one row for each row of the table. The code simply adds a row at the top and, later, one at the bottom; graphic versions of the title and footnote are then inserted at the appropriate positions.
The table example is shown below.
An example of the merged graphic with chart 3 and the table is shown below.
The important point of the analysis is to show whether there is significant variability in the sources of growth for aggregate GDP in recent months.
Term Structure Analysis
In this example, we will use the fredr package to retrieve one of the most popular series from the US Federal Reserve database (FRED): the spread between the returns on 10-year and 2-year constant-maturity bonds.[1] The fredr package includes access to the basic series documentation; it requires the user to open an account once to obtain an API key. The new cansim package from Mountain Math will be used to retrieve the 10-year and 2-year benchmark bonds for Canada.
Packages USED
The key package used in this vignette is the fredr package which accesses the FRED API. Key packages include
- dplyr – for processing the tibble prior to conversion to xts
- ggplot2 – part of tidyverse for plotting the results
- cansim – package to retrieve metadata and series from Statistics Canada’s NDM
- broom, tbl2xts – utility packages for working with tibbles
- fredr – the retrieval package to be used.[2]
Retrieving the DATA
The first step in the analysis is to go to the FRED site at https://fred.stlouisfed.org to open an account and obtain an API key. This is done only once; the same key can be used for multiple R scripts. In this case, the key is saved in a CSV file in the folder being used for this example and is retrieved from that CSV file at the start of each run to initialize the calls to the API.
#Retrieving from FRED and CANSIM
library(tidyverse)
library(fredr)
library(ggplot2)
library(xts)
library(tbl2xts)
setwd("D:\\OneDrive\\fred_test")
#set up the API key
user_api_key<-read.csv("fred_api_key.csv",stringsAsFactors=FALSE)
fredr_set_key(user_api_key$fredapi)
The next stage is to define the names of the series to be retrieved. The fredr package has search options that would facilitate looking up series ids; for simplicity, the St. Louis web site was used in this example. The next block of code first retrieves the information about the series. Each call to fredr_series returns a tibble of metadata for one series, while fredr_series_observations returns the actual data as a tibble. The tbl_xts function from the tbl2xts package translates the tibble into an xts time-series construct.
us_series<-"t10y2y"
us_series_info<-fredr_series(us_series)
print(as.data.frame(us_series_info))
starting_point<-"2004-01-01"
us_data1<-fredr_series_observations(us_series,observation_start=as.Date(starting_point))
#start building up an xts data frame
us_xts1<-tbl_xts(us_data1)
colnames(us_xts1)<-c("us10y2y")
The next stage uses the cansim package to retrieve the two interest-rate series and calculate their difference. The metadata for both series are retrieved to document the series in the log file. A Date column is constructed with a mutate command because the date, as retrieved from the NDM, is not recognized as a regular date by R.
#now get Canadian data-we start in xts to get dates straight
cdn_irate_vbls<-c("v39055","v39051")
library(cansim)
meta1<-get_cansim_vector_info(cdn_irate_vbls)
print(data.frame(meta1))
cdn_irate_xts <- get_cansim_vector(cdn_irate_vbls,start_time=as.Date(starting_point)) %>%
mutate(Date=as.Date(REF_DATE)) %>%
tbl_xts(spread_by="VECTOR",cols_to_xts="VALUE")
tail(cdn_irate_xts)
#need to clean out the zero observations
cdn_irate_xts[rowSums(cdn_irate_xts)==0,]<-NA
#otherwise, we get an area plot
#calculate differences
cdn_yield<-cdn_irate_xts$v39055 - cdn_irate_xts$v39051
colnames(cdn_yield)<-c("cdn10y2y")
In the tbl_xts invocation, the spread_by argument identifies the variable containing the series names. Several data edits are required. Any zeros should be replaced with NA to avoid problems in the plot, because zero would otherwise be treated as a legitimate data point. The resulting series is given the label cdn10y2y to match the renamed US series above.
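The zero-cleaning step can be demonstrated without cansim or xts. This base-R sketch on a plain matrix, with invented rate values and hypothetical column names, shows how rows in which every series is zero are flagged by rowSums and overwritten with NA, so they are omitted from a line plot rather than drawn as zeros:

```r
# Two toy interest-rate columns; the second row is a placeholder of zeros
rates <- matrix(c(2.1, 0, 1.9,
                  1.6, 0, 1.5),
                ncol = 2,
                dimnames = list(NULL, c("v10y", "v2y")))

# Rows summing to zero contain no real data, so blank them out with NA
rates[rowSums(rates) == 0, ] <- NA
print(rates)
```

Note that this test relies on the rates being non-negative; a row of genuine values that happened to sum to zero would also be blanked.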
The next stage removes the NAs, particularly the trailing ones, so that a legitimate last date can be calculated for use in the graph labelling. The US and Canadian series are then merged with merge and converted to a long plotting tibble using the tidy function from the broom package.
#calculate last date
#squeeze out nas
cdn_spread<-na.omit(cdn_yield$cdn10y2y)
last_date<-as.Date(time(tail(cdn_spread,1)))
print(last_date)
us_cdn_xts<-merge(us_xts1,cdn_yield)
#use broom and tidy to make a plot tibble
library(broom)
plot_data<-tidy(us_cdn_xts)
print(plot_data)
The final stage prepares the plot using the tibble as input. The economist_white theme from the ggthemes package is used as a convenience; that package offers other interesting themes, and numerous further examples are available around the internet.
library(ggthemes)
term_plot<-ggplot(plot_data,aes(x=index,y=value,colour=series))+
theme_economist_white()+
theme(legend.position="bottom",legend.title=element_blank())+
labs(y="Difference in Percent",x=NULL,
title="Difference between 10-Year and 2-Year Bonds",
subtitle="Canada and US",
caption=paste("FRED: t10y2y","CANSIM:",paste(cdn_irate_vbls,collapse=" - "),"End:",last_date))+
scale_x_date(date_breaks="2 years",date_labels="%Y")+
geom_line()
ggsave(term_plot,file="term_plot.png")
The caption is noteworthy because the collapse option of paste is used to connect the two Canadian series with " - ". The last date is also inserted dynamically. scale_x_date is used because the x data are in R date format; if date_labels is not supplied, full dates including the day are shown, because dates in R always include a day.
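The collapse behaviour can be verified in isolation. This base-R sketch joins the two vector numbers from the script into the single string that appears in the chart caption:

```r
cdn_irate_vbls <- c("v39055", "v39051")

# collapse joins the elements of one vector into a single string,
# separated by the supplied delimiter
spread_label <- paste(cdn_irate_vbls, collapse = " - ")
print(spread_label)   # "v39055 - v39051"
```

Without collapse, paste would instead return a two-element character vector.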
The resulting chart is shown below.
[1] https://fredblog.stlouisfed.org/2018/08/whats-up-or-down-with-the-yield-curve/
[2] https://www.rdocumentation.org/packages/fredr/versions/1.0.0
Accessing the Development Version of Mountainmath’s Cansim Package in R
This note provides a vignette showing how to install the development version of the new cansim package for accessing the NDM. The vignette uses R utilities to install the GitHub version of the package. More information about the developers can be found on the Mountainmath website at https://www.mountainmath.ca. I commend the blog posts to you for interesting information.
Packages USED
The key package installed in this vignette is the cansim package developed by Mountain Math. This is the key package for retrieving data using the web services released in the New Dissemination Model (NDM) version of CANSIM. The data are retrieved and converted to time series stored as xts objects. Key packages include
- cansim – package developed by Mountain Math to access the NDM CANSIM
- devtools – R utilities
Installing the Package
The devtools package makes it easy to install packages directly from GitHub development repositories.
devtools::install_github("mountainmath/cansim")
library(cansim)
The library statement confirms that the cansim package is correctly installed.