Processing PUMF Files

This note provides a vignette showing how the Statistics Canada public use microdata file (PUMF) for the Survey of Financial Security 2016 can be converted into a usable format for processing in R to produce weighted cross tabulations and charts..

Packages USED

The key package used in this vignette is the MEMISC package developed by Martin Elff at the Zeppelin University in Friedrichshafen Germany where he teaches political science. His package provides an infrastructure for the management of survey data including value labels, definable missing values, recoding of variables, production of code books, and import of (subsets of) 'SPSS' and 'Stata' files is provided. Further, the package allows to produce tables and data frames of arbitrary descriptive statistics and (almost) publication-ready tables of regression model estimates, which can be exported to 'LaTeX' and HTML[1]

Dr. Elff’s website contains a detailed discussion about how to use the package for analyzing microdata with an emphasis on categorical variables.[2] Managing items in a survey data set with value labels and recoding them is one of its many functions. However, of particular importance to this vignette is the infrastructure that he supplies to process raw tax files and variable and value labels defined in the SPSS format.[3]

In the example discussed below, we will utilize the SFS2016 released in the summer of 2018.   A code listing and log are included in the accompanying PDF.

We will also use the ggplot2 package from the tidyverse system to produce plots based on weighted cross tabulations of the data.[4] One of the characteristics of the Statistics Canada pumf files is that they are weighted surveys. Although not done in this example, descriptive and other analysis using weights can be done with many other packages including HMISC[5] and descr[6].

Converting the Data

The first stage is always to process the flat text files provided by Statistics Canada in the PUMF distribution. There is always a data file and control files for several packages including SPSS. In this example, we will use the SPSS version of the control files.

In this vignette, the conversion process has been done in a separate script from the analysis function. In this initial processing script, the first stage is to load the packages and define the location of the text file.

options(echo=TRUE,descr.plot=FALSE)

require(memisc)

sessionInfo()

#minimum version should be memisc_0.99.14.12

#define locations of input spss files

input_folder<-"D:\\STATCAN\\SFS2016\\SpssCard\\"

columns_file<-"D:\\STATCAN\\SFS2016\\SpssCard\\sfs2016_efam_pumf_i.sps"

variable_labels<-"D:\\STATCAN\\SFS2016\\SpssCard\\sfs2016_efam_pumf_vare.sps"

variable_values<-"D:\\STATCAN\\SFS2016\\SpssCard\\sfs2016_efam_pumf_vale.sps"

missing_values<-"D:\\STATCAN\\SFS2016\\SpssCard\\sfs2016_efam_pumf_miss.sps"

data_file<-"D:\\STATCAN\\SFS2016\\Data\\SFS2016_EFAM_PUMF.txt"

project_folder<-"D:\\OPS\\sfs2016_example\\"

new_dir<-setwd(project_folder)

print(paste("Current directory is",getwd()))

current_directory<-getwd()

The next stage is to create a definition of the dataset using the spss.fixed.file import procedure saved in the variable sfs2016_efam.

sfs2016_efam<-spss.fixed.file(data_file,

columns_file,

varlab.file=variable_labels,

codes.file=variable_values,

missval.file=missing_values,

count.cases=TRUE,

to.lower=TRUE

)

print(sfs2016_efam)

The next stage is to use the dataset definition, the importer object, to create a dataset in R’s memory, write a codebook listing the variables. Value labels were not supplied for all variables in the dataset but were supplied for the weight variable. Because of the presence of these value labels, MEMISC considers the weight variable (pweight) to be a nominal categorical variable. The measurement of the variable is changed to “ratio” so that MEMISC will consider it as a numerical variable. The modified memory object (sfs2016_efam.ds) is then saved in an R save file for easy reloading in other R scripts. Other data edits cold be made before this save as well. However, in this example, such edits will be made in the analytical part of the example.

sfs2016_efam.ds<-as.data.set(sfs2016_efam)

write_html(codebook(sfs2016_efam.ds),file="sfs2016_efam_codebook.html")

#modify the type for pweight because it is initial set to be a nominal

# categorical variable because value labels are supplied

measurement(sfs2016_efam.ds$pweight)<-"ratio"

outlist<-c("sfs2016_efam.ds")

save(list=outlist,file="sfs2016_efam_save.Rsave")

Doing a codebook analysis for the pweight variable now shows that it is regarded as being numerical.

sfs2016_efam.ds$pweight 'Survey weights - PUMF'

--------------------------------------------------------------------------------

   Storage mode: double

   Measurement: ratio

   Missing values: 9999999.9996-9999999.9999

                 Values and labels     N     Percent

                                                    

       9999999.9996 M 'Valid skip'     0           0.0

       9999999.9997 M 'Don't know'     0           0.0

       9999999.9998 M 'Refusal'       0           0.0

       9999999.9999 M 'Not stated'     0           0.0

                     (unlab.vld.) 12428   100.0 100.0

                                                    

           Min:     10.000                          

           Max: 12254.083                          

           Mean:   1235.023                          

       Std.Dev.:   1055.169                          

       Skewness:     3.711                          

       Kurtosis:    25.079                          

The results from this script are shown in the PDFs. Any problems are likely result of incorrectly formatted SPSS control cards. The data file as delivered was longer than the record definition. This required some modifications to MEMISC. The minimum version for this example to work is memisc_0.99.14.12.

Some Basic Analysis

In the example script, we will define a new age variable grouping the survey respondents by birth year and then analyze some aspects of their mortgages. The value labels for several of the variables used will have to be supplied because these were omitted in the Statistics Canada deliverables.

The first stage sets up the directory and loads the saved data set from the previous script.

This creates a second copy of the DS object for additional modifications.

library(memisc)

#loading tidyverse with mask some similarly named functions in memisc

library(ggplot2)

#options(error=recover)

project_folder<-"D:\\OPS\\sfs2016_example\\"

new_dir<-setwd(project_folder)

print(paste("Current directory is",getwd()))

load("sfs2016_efam_save.Rsave")

sfs2016_2.ds<-sfs2016_efam.ds

It is this copy (sfs2016_2.ds) that will be modified and saved in this script. Value labels will be added for several variables because they were omitted from Statistics Canada’s package. The definition of the value labels is found in the PUMF documentation provided by Statistics Canada.

The first stage is to create new age variable grouping the respondents into Boomers (born up to and including 1965), Generation Xers (born between 1966 and 1979 and the Millennial group born after 1979. The cut command is used to break a birth year variable into groups (1-3) on the basis of the break years. The latter are treated as closed to the right by default. Appropriate labels are provided.

#now to look at age

summary(sfs2016_2.ds$pagemie)

#now for the classification

birth_year<-(2016-sfs2016_2.ds$pagemie)

summary(birth_year)

year_breaks<-c(0,1965,1979,2016)

#right=TRUE is default for cut - means groups are closed on the right

birth_year_group<-cut(birth_year,year_breaks,labels=FALSE)

summary(birth_year_group)

cohort_name<-c("Boomer","Genx","Millenial")

grouped_birth_year<-factor(cohort_name[birth_year_group])

The variable is defined as a factor or ordered categorical variable.

The next set of code adds the grouped_birth_year as the variable age_cohort to the data set and supplies a description. Value labels are added for the type of mortgage interest and the term of the mortgage because these were omitted from the supplied definitions by Statistics Canada.

#now define some sample variables

own_home_with_mortage<-sfs2016_2.ds$pftenur==2

#now start some variable edits in the sfs2016_2.ds object

sfs2016_2.ds<-within(sfs2016_2.ds,{

age_cohort<-grouped_birth_year

description(age_cohort)<-paste(description(pagemie),"By Birth Year")

own_mortgaged=own_home_with_mortage

description(own_mortgaged)<-"Own home but have mortgage"

#we have to supply labels for a number of variables because Statcan's label file was incomplete

labels(pasrint)<-c("Fixed"=1,

"Variable"=2,

"Combination (hybrid)"=3,

"Valid Skip"=6)

measurement(pasrint)<-"nominal"

labels(pasrcurg)<-c("3-6 months"=1,

"1 year"=2,

"2 years"=3,

"3 years"=4,

"4 years"=5,

"5 years"=6,

"7-10 years"=7,

"Other"=8,

"Valid Skip"=96)

measurement(pasrcurg)<-"nominal"

#we could subtract 2016 to show the number of years

years_to_renewal<-sfs2016_2.ds$pasrmryg

description(years_to_renewal)<-"Year of Renewal from 2016"

})

#save the calculations so we don't have to recreate the variables we modified

save(sfs2016_2.ds,file="sfs2016_2_save.Rsave")

Now we can start the analysis. The basic approach in this simple example is to use the xtabs function to produce weighted cross tabulations. By default, not-available (NA) observations are not processed.

The first cross tabulation looks at tenure of dwelling.

#now do a weighted frequency counts

tab_age<-xtabs(pweight~age_cohort,data=sfs2016_2.ds)

print(tab_age)

tab_tenur<- xtabs(pweight~age_cohort+pftenur,data=sfs2016_2.ds)

print(tab_tenur)

#convert to a data frame for ggplot

tenur_data<-as.data.frame(tab_tenur/1000)

tenur_plot<-ggplot(tenur_data,aes(x=age_cohort,y=Freq,fill=pftenur))+

geom_bar(position="dodge",stat="identity")+

labs(title="Economic Family Housing Tenure - 2016",

fill="Tenure",y="Economic Families (000)",x="Birth Year Cohort",caption="Source: SFS2016, JCI")

ggsave(plot=tenur_plot,file="Tenure_plot.png")

The data are in economic family units so the resulting xtab object is sc

sfs2016 plots

The appendices show the scripts and logs. Note that comments may wrap because of formatting.

 


[1] https://www.rdocumentation.org/packages/memisc/versions/0.99.14.12

[2] http://www.elff.eu/software/#memisc

[3] http://www.elff.eu/memisc/import/

[4] https://ggplot2.tidyverse.org/

[5] https://www.rdocumentation.org/packages/Hmisc/versions/4.1-1

[6] https://www.rdocumentation.org/packages/descr/versions/1.1.4