Accessing World Bank Data
This note focuses on directly accessing indicators in the World Bank database. Specifically, we are going to show a series of time series plots of Gini coefficients for several countries. This vignette follows from the earlier one on Gini coefficients. Appendices in the corresponding PDF contain a log file for a sample run.
The key package used in this vignette is wbstats, which accesses the World Bank RESTful API. The packages used include:
- dplyr – for processing the tibble prior to conversion to xts
- ggplot2 – part of tidyverse for plotting the results
- wbstats – the retrieval package to be used.
Detailed instructions for utilizing the wbstats package are available online.
Retrieving the Data
The process starts with loading the relevant packages. Unlike many other statistical databases, data in the World Bank (WB) system are organized by indicator and by country. The structure of the database is shown in the earlier Gini vignette for the G20. Data can be retrieved by combinations of country and indicator. Ways to search for indicators and country codes are described in the G20 vignette. In this vignette, we will use a predefined CSV file of country codes that was saved in the previous vignette. The CSV file has been modified by hand to include a flag for countries that are members of the G7.
In the initial set of code, we select the G7 countries and then go directly to the database to retrieve the data. The read_csv routine is used to import the CSV because it returns a tibble, which is the tidyverse extension of a regular data frame.
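The selection step described above can be sketched as follows. This is only a minimal illustration, not the vignette's original code: the column names (iso3c, country, region, g7) and the inline CSV text are assumptions standing in for the hand-edited file, whose actual name and layout are not given here.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the hand-edited CSV of country codes.
# Column names and the 1/0 coding of the g7 flag are assumptions.
csv_text <- "iso3c,country,region,g7
CAN,Canada,North America,1
FRA,France,Europe & Central Asia,1
DEU,Germany,Europe & Central Asia,1
ITA,Italy,Europe & Central Asia,1
JPN,Japan,East Asia & Pacific,1
GBR,United Kingdom,Europe & Central Asia,1
USA,United States,North America,1
AUS,Australia,East Asia & Pacific,0
BRA,Brazil,Latin America & Caribbean,0"

# read_csv() returns a tibble. I() marks the string as literal data;
# with the real file, pass its path instead.
country_table <- read_csv(I(csv_text), show_col_types = FALSE)

# Keep only the rows flagged as G7 members and extract their codes.
g7_countries <- filter(country_table, g7 == 1)
g7_codes <- g7_countries$iso3c
```

The resulting g7_codes vector is the kind of country list that can then be passed to the retrieval function.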
#read in a table of country codes and regions with a g7 flag
The gini_indicator variable uses the indicatorID found in the previous vignette. The mutate function is used to transform the integer year variable into a standard R Date-class variable. Every date needs a day component, so for this purpose it is easiest to use the first day of the year. Converting the year vector to a true Date vector is required so that date constructs can be used in the plotting software to position data along the X axis. The data retrieved are a “tall” tibble with one observation per row. This can be seen in the log print included in this PDF.
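A rough sketch of the retrieval and date conversion follows, under several assumptions. The indicator ID SI.POV.GINI is the World Bank code for the Gini index. The helper names fetch_gini and add_year_date and the column name date_dt are hypothetical. The retrieval helper assumes the current wb_data() interface of wbstats; the reference to indicatorID suggests the vignette used an older interface, so returned column names may differ. Because retrieval needs network access, that helper is only defined here, and the conversion is demonstrated on a small tibble of years.

```r
library(dplyr)

gini_indicator <- "SI.POV.GINI"  # World Bank indicator ID for the Gini index

# Hypothetical helper: requires the wbstats package and network access,
# so it is defined but not called in this sketch.
fetch_gini <- function(country_codes) {
  wbstats::wb_data(indicator = gini_indicator,
                   country = country_codes,
                   return_wide = FALSE)  # long ("tall") format
}

# Hypothetical helper: convert a year column named `date` into a Date
# set to 1 January of that year.
add_year_date <- function(df) {
  mutate(df, date_dt = as.Date(paste0(date, "-01-01")))
}

# Demonstration on years only (no indicator values are shown here).
years_only <- tibble(country = c("Canada", "Canada", "France"),
                     date = c(2010L, 2013L, 2012L))
tall <- add_year_date(years_only)
```

With a live connection, something like add_year_date(fetch_gini(g7_codes)) would be expected to give the tall tibble, provided the returned year column is named date.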
In this vignette, two charts are going to be produced. The base chart is a standard multi-line plot, created by grouping the data by country. Japan is excluded from the initial dataset by filtering for all countries not equal to “Japan” because it was found to have only one observation. Both the line type and the line colour vary by country. The title of the chart is obtained from the indicator title, which is returned with every observation in the data.
#date is a vector of numbers therefore discrete scale should be used
#time series plot -only one observation for Japan
caption="Source:World Bank, JCI")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The comments referring to the discrete scale show how a scale would be defined if the original integer date variable had been used in the chart. The date scale can only be used with a true Date variable. It is formatted to show only years through the date_labels argument of the scale_x_date function. The first base plot is saved as plot1 and is shown below.
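The construction of plot1 might look roughly like the sketch below. It is a reconstruction under assumptions rather than the original code: the column names (date_dt, value, indicator) are hypothetical, the values are synthetic placeholders and not World Bank data, and the indicator title is a placeholder. Only the caption and theme lines are taken from the code fragments shown above.

```r
library(dplyr)
library(ggplot2)

# Synthetic placeholder values, NOT World Bank data; Japan gets a single
# observation to mirror the situation described in the text.
gini_data <- tibble(
  country = rep(c("Canada", "France", "Japan"), times = c(3, 3, 1)),
  date = c(2010L, 2013L, 2016L, 2010L, 2013L, 2016L, 2013L),
  value = c(30, 31, 32, 33, 32, 31, 34),
  indicator = "Gini index"  # placeholder for the retrieved indicator title
) %>%
  mutate(date_dt = as.Date(paste0(date, "-01-01")))

# Drop Japan, which has only one observation and cannot form a line.
plot_data <- filter(gini_data, country != "Japan")

plot1 <- ggplot(plot_data,
                aes(x = date_dt, y = value, group = country,
                    colour = country, linetype = country)) +
  geom_line() +
  # scale_x_date() needs a true Date variable; "%Y" shows years only.
  scale_x_date(date_labels = "%Y") +
  labs(title = plot_data$indicator[1], x = NULL, y = NULL,
       caption = "Source:World Bank, JCI") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Printing plot1 draws the chart. Mapping both colour and linetype to country keeps the lines distinguishable even where colours are hard to tell apart.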
The line chart is somewhat hard to follow because many of the lines intersect. With the colours and varying line types, it is usable, but the country trends can be a little difficult to see. Therefore, we are going to transform plot1 into plot2 with the facet_wrap function, which breaks the plot into a separate panel for each country, arranged in two columns. The resulting plot is shown below.
The advantage of the facet form is that each facet has the same scales and size. This facilitates visual comparison while also simplifying the perspective. The legend is suppressed because it is not required in the facet format.
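The faceting step can be sketched as below, again as an illustration under assumptions. The data frame holds synthetic placeholder values, not World Bank data, and the column names are hypothetical. The sketch shows facet_wrap with two columns and one common way of suppressing the legend, which may not be how the original code did it.

```r
library(ggplot2)

# Synthetic placeholder values, NOT World Bank data.
demo <- data.frame(
  country = rep(c("Canada", "France", "Germany", "Italy"), each = 2),
  date_dt = rep(as.Date(c("2010-01-01", "2016-01-01")), times = 4),
  value = c(30, 31, 33, 32, 29, 30, 34, 35)
)

# A simplified base plot standing in for plot1.
plot1 <- ggplot(demo, aes(x = date_dt, y = value,
                          colour = country, linetype = country)) +
  geom_line() +
  scale_x_date(date_labels = "%Y")

# One panel per country in two columns; by default facet_wrap() keeps
# the same x and y scales in every panel. The legend is dropped because
# the panel titles already identify the countries.
plot2 <- plot1 +
  facet_wrap(~country, ncol = 2) +
  theme(legend.position = "none")
```

Because plot2 is built by adding layers to plot1, the base chart stays available unchanged for comparison.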