Processing PUMF Files
This note provides a vignette showing how the Statistics Canada public use microdata file (PUMF) for the Survey of Financial Security 2016 (SFS2016) can be converted into a format usable in R to produce weighted cross tabulations and charts.
Packages Used
The key package used in this vignette is the memisc package developed by Martin Elff of Zeppelin University in Friedrichshafen, Germany, where he teaches political science. The package provides an infrastructure for the management of survey data, including value labels, definable missing values, recoding of variables, production of code books, and import of (subsets of) SPSS and Stata files. It can also produce tables and data frames of arbitrary descriptive statistics and (almost) publication-ready tables of regression model estimates, which can be exported to LaTeX and HTML.[1]
Dr. Elff’s website contains a detailed discussion of how to use the package to analyze microdata, with an emphasis on categorical variables.[2] Managing items in a survey data set with value labels and recoding them is one of its many functions. Of particular importance to this vignette, however, is the infrastructure it supplies for processing raw text data files together with variable and value labels defined in the SPSS format.[3]
The example discussed below uses the SFS2016 PUMF released in the summer of 2018. A code listing and log are included in the accompanying PDF.
We will also use the ggplot2 package from the tidyverse system to produce plots based on weighted cross tabulations of the data.[4] A key characteristic of the Statistics Canada PUMF files is that they are weighted surveys. Although not done in this example, descriptive and other weighted analyses can be carried out with many other packages, including Hmisc[5] and descr[6].
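As an illustration of those alternatives, the sketch below shows what weighted descriptive statistics could look like with Hmisc and descr. It is not part of the scripts in this vignette; it assumes the converted data set (sfs2016_efam.ds) created in the next section and uses the pweight, pagemie and pftenur variables from the PUMF.
#a minimal sketch: weighted descriptive statistics with Hmisc and descr
#assumes the converted data set sfs2016_efam.ds from the conversion script
library(Hmisc)
library(descr)
sfs_df<-as.data.frame(sfs2016_efam.ds)
#weighted mean and quantiles of the age of the major income earner
wtd.mean(sfs_df$pagemie,weights=sfs_df$pweight)
wtd.quantile(sfs_df$pagemie,weights=sfs_df$pweight)
#weighted frequency table of housing tenure
freq(sfs_df$pftenur,w=sfs_df$pweight,plot=FALSE)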
Converting the Data
The first stage is always to process the flat text files provided by Statistics Canada in the PUMF distribution. The distribution always includes a data file and control files for several packages, including SPSS. In this example, we will use the SPSS version of the control files.
In this vignette, the conversion has been done in a script separate from the analysis. In this initial processing script, the first stage is to load the packages and define the locations of the input files.
options(echo=TRUE,descr.plot=FALSE)
require(memisc)
sessionInfo()
#minimum version should be memisc_0.99.14.12
#define locations of input spss files
input_folder<-"D:\\STATCAN\\SFS2016\\SpssCard\\"
columns_file<-"D:\\STATCAN\\SFS2016\\SpssCard\\sfs2016_efam_pumf_i.sps"
variable_labels<-"D:\\STATCAN\\SFS2016\\SpssCard\\sfs2016_efam_pumf_vare.sps"
variable_values<-"D:\\STATCAN\\SFS2016\\SpssCard\\sfs2016_efam_pumf_vale.sps"
missing_values<-"D:\\STATCAN\\SFS2016\\SpssCard\\sfs2016_efam_pumf_miss.sps"
data_file<-"D:\\STATCAN\\SFS2016\\Data\\SFS2016_EFAM_PUMF.txt"
project_folder<-"D:\\OPS\\sfs2016_example\\"
new_dir<-setwd(project_folder)
print(paste("Current directory is",getwd()))
current_directory<-getwd()
The next stage is to create a definition of the dataset using the spss.fixed.file import procedure, saving the result in the variable sfs2016_efam.
sfs2016_efam<-spss.fixed.file(data_file,
columns_file,
varlab.file=variable_labels,
codes.file=variable_values,
missval.file=missing_values,
count.cases=TRUE,
to.lower=TRUE
)
print(sfs2016_efam)
The next stage is to use the dataset definition (the importer object) to create a data set in R’s memory and write a codebook listing the variables. Value labels were not supplied for all variables in the data set, but they were supplied for the weight variable. Because of the presence of these value labels, memisc considers the weight variable (pweight) to be a nominal categorical variable. The measurement of the variable is therefore changed to “ratio” so that memisc treats it as numerical. The modified memory object (sfs2016_efam.ds) is then saved in an R save file for easy reloading in other R scripts. Other data edits could be made before this save as well; in this example, however, such edits are made in the analytical part of the example.
sfs2016_efam.ds<-as.data.set(sfs2016_efam)
write_html(codebook(sfs2016_efam.ds),file="sfs2016_efam_codebook.html")
#modify the measurement for pweight because it is initially set to be a nominal
# categorical variable because value labels are supplied
measurement(sfs2016_efam.ds$pweight)<-"ratio"
outlist<-c("sfs2016_efam.ds")
save(list=outlist,file="sfs2016_efam_save.Rsave")
Running a codebook for the pweight variable now shows that it is regarded as numerical.
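The listing below can be produced with a call along the following lines (the exact call is not shown in the script; this is a sketch using memisc’s codebook function on a single variable):
codebook(sfs2016_efam.ds$pweight)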
sfs2016_efam.ds$pweight 'Survey weights - PUMF'
--------------------------------------------------------------------------------
Storage mode: double
Measurement: ratio
Missing values: 9999999.9996-9999999.9999
Values and labels N Percent
9999999.9996 M 'Valid skip' 0 0.0
9999999.9997 M 'Don't know' 0 0.0
9999999.9998 M 'Refusal' 0 0.0
9999999.9999 M 'Not stated' 0 0.0
(unlab.vld.) 12428 100.0 100.0
Min: 10.000
Max: 12254.083
Mean: 1235.023
Std.Dev.: 1055.169
Skewness: 3.711
Kurtosis: 25.079
The results from this script are shown in the PDFs. Any problems are most likely the result of incorrectly formatted SPSS control cards. The data file as delivered was longer than the record definition, which required some modifications to memisc. The minimum version for this example to work is memisc_0.99.14.12.
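As a simple guard against running the scripts with an older installation, the version requirement could be checked near the top of the conversion script; the line below is a sketch using base R’s packageVersion and is not part of the original listing.
#stop if the installed memisc is older than the minimum required version
stopifnot(packageVersion("memisc")>="0.99.14.12")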
Some Basic Analysis
In the example script, we will define a new age variable grouping the survey respondents by birth year and then analyze some aspects of their mortgages. Value labels for several of the variables used have to be supplied because they were omitted from the Statistics Canada deliverables.
The first stage sets up the working directory, loads the saved data set from the previous script, and creates a second copy of the data set object for additional modifications.
library(memisc)
#loading tidyverse packages will mask some similarly named functions in memisc
library(ggplot2)
#options(error=recover)
project_folder<-"D:\\OPS\\sfs2016_example\\"
new_dir<-setwd(project_folder)
print(paste("Current directory is",getwd()))
load("sfs2016_efam_save.Rsave")
sfs2016_2.ds<-sfs2016_efam.ds
It is this copy (sfs2016_2.ds) that will be modified and saved in this script. Value labels will be added for several variables because they were omitted from Statistics Canada’s package. The definitions of the value labels are found in the PUMF documentation provided by Statistics Canada.
The first stage is to create a new age variable grouping the respondents into Boomers (born up to and including 1965), Generation Xers (born between 1966 and 1979) and Millennials (born after 1979). The cut function is used to break the birth year variable into groups (1-3) on the basis of the break years; the resulting intervals are closed on the right by default. Appropriate labels are then applied.
#now to look at age
summary(sfs2016_2.ds$pagemie)
#now for the classification
birth_year<-(2016-sfs2016_2.ds$pagemie)
summary(birth_year)
year_breaks<-c(0,1965,1979,2016)
#right=TRUE is default for cut - means groups are closed on the right
birth_year_group<-cut(birth_year,year_breaks,labels=FALSE)
summary(birth_year_group)
cohort_name<-c("Boomer","Genx","Millenial")
grouped_birth_year<-factor(cohort_name[birth_year_group])
The resulting grouped_birth_year variable is stored as a factor, that is, a categorical variable.
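A quick unweighted check of the grouping (not part of the original script) can be done with a simple tabulation:
#unweighted counts of respondents in each birth-year cohort
table(grouped_birth_year,useNA="ifany")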
The next block of code adds grouped_birth_year to the data set as the variable age_cohort and supplies a description. Value labels are added for the type of mortgage interest rate and the term of the mortgage because these were omitted from the definitions supplied by Statistics Canada.
#now define some sample variables
own_home_with_mortage<-sfs2016_2.ds$pftenur==2
#now start some variable edits in the sfs2016_2.ds object
sfs2016_2.ds<-within(sfs2016_2.ds,{
age_cohort<-grouped_birth_year
description(age_cohort)<-paste(description(pagemie),"By Birth Year")
own_mortgaged=own_home_with_mortage
description(own_mortgaged)<-"Own home but have mortgage"
#we have to supply labels for a number of variables because Statcan's label file was incomplete
labels(pasrint)<-c("Fixed"=1,
"Variable"=2,
"Combination (hybrid)"=3,
"Valid Skip"=6)
measurement(pasrint)<-"nominal"
labels(pasrcurg)<-c("3-6 months"=1,
"1 year"=2,
"2 years"=3,
"3 years"=4,
"4 years"=5,
"5 years"=6,
"7-10 years"=7,
"Other"=8,
"Valid Skip"=96)
measurement(pasrcurg)<-"nominal"
#we could subtract 2016 to show the number of years
years_to_renewal<-sfs2016_2.ds$pasrmryg
description(years_to_renewal)<-"Year of Renewal from 2016"
})
#save the calculations so we don't have to recreate the variables we modified
save(sfs2016_2.ds,file="sfs2016_2_save.Rsave")
Now we can start the analysis. The basic approach in this simple example is to use the xtabs function to produce weighted cross tabulations. By default, not-available (NA) observations are excluded from the tables.
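If missing observations should instead appear as their own category, base R’s xtabs has an addNA argument (available since R 3.4.0) that keeps them; the variant below is a sketch and is not used in this example.
tab_tenur_na<-xtabs(pweight~age_cohort+pftenur,data=sfs2016_2.ds,addNA=TRUE)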
The first cross tabulation looks at tenure of dwelling.
#now do weighted frequency counts
tab_age<-xtabs(pweight~age_cohort,data=sfs2016_2.ds)
print(tab_age)
tab_tenur<- xtabs(pweight~age_cohort+pftenur,data=sfs2016_2.ds)
print(tab_tenur)
#convert to a data frame for ggplot
tenur_data<-as.data.frame(tab_tenur/1000)
tenur_plot<-ggplot(tenur_data,aes(x=age_cohort,y=Freq,fill=pftenur))+
geom_bar(position="dodge",stat="identity")+
labs(title="Economic Family Housing Tenure - 2016",
fill="Tenure",y="Economic Families (000)",x="Birth Year Cohort",caption="Source: SFS2016, JCI")
ggsave(plot=tenur_plot,file="Tenure_plot.png")
The data are in economic family units, so the resulting xtabs object is scaled (divided by 1,000) before plotting so that the chart shows thousands of economic families.
The appendices show the scripts and logs. Note that comments may wrap because of formatting.
[1] https://www.rdocumentation.org/packages/memisc/versions/0.99.14.12
[2] http://www.elff.eu/software/#memisc
[3] http://www.elff.eu/memisc/import/
[4] https://ggplot2.tidyverse.org/
[5] https://www.rdocumentation.org/packages/Hmisc/versions/4.1-1
[6] https://www.rdocumentation.org/packages/descr/versions/1.1.4