Case Study: Nor Cal Carpenters Union

NCCU: Efficiently Extracting Data from Thousands of Messy Excel Workbooks
Author: John D. Johnson
John is a quantitative researcher skilled in applying data science and academic techniques to policy questions.
As a consultant at Big Lake Data, he provides technical insight on the data solutions offered to clients.

Often, clients come to us with a question and ask us to find the data to answer it. Other times, the client already has the raw data they need, but they want help efficiently turning it into a useful dataset.

This was the case for the Nor Cal Carpenters Union. The union wanted to better understand the landscape of affordable housing projects in California. Hundreds of these projects are submitted to the California Tax Credit Allocation Committee each year. Fortunately, the committee shares the official applications, which contain enormous amounts of information about the location, style, financing, timing, and cost of each project.

Unfortunately, all of this information is locked into large, irregularly formatted Microsoft Excel workbooks, each of which contains more than a dozen sheets. The sheets themselves are mostly long text documents, interspersed with fillable fields and tables.

Our client’s goal was to extract about 300 data points from each application, and then receive that data as a simple spreadsheet with one row for each project.

The format of the applications was similar from year to year, but each varied enough to make standard ways of referencing cells in external workbooks infeasible.

Instead, we designed a workflow in the R programming language to complete the following steps:

1. Download the original Excel workbook application files

2. Load the data files into a quicker-to-access format

3. Extract the desired data points

4. Reassemble those data points into a standardized, clean format, ready for analysis

Step 2 was crucial. To test extraction techniques and quickly accommodate new client requests, we needed a fast way to load all of the data. The R package {readxl} lets the user read an Excel file with fine-grained control. However, loading just the 3 sheets of primary interest from each workbook took about 4.66 seconds, and across the thousands of application workbooks a full pass added up to roughly 3 hours.

We needed to load the full raw data frequently, and we obviously weren’t going to wait 3 hours each time. Our solution cut the load time down to about 1 minute.

Here’s how we did it. We loaded each sheet as a single table, called a “tibble” in R-speak. We gave each tibble its respective sheet name from the Excel workbook. We assembled all the tibbles from an individual workbook into a list, and gave that list the name of the original Excel file. 

We repeated this process for each application’s workbook. Then we placed all the workbook lists into a single giant list. We saved this in R’s native “RDS” file format. The RDS format is highly compressed, so the output file occupied just 239 MB of hard drive space, despite expanding to 6.6 GB once loaded into RAM.

See the bottom of this post for a demonstration of the code.

Once we had the source data wrangled into an efficient file structure, the rest of the project was straightforward. We wrote functions that detected the desired data elements by searching the text for regular expressions. As the client conducted their analysis, they sometimes realized additional data points were needed. With our workflow in place, we could quickly accommodate those requests.
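As a simplified illustration of this approach, an extraction function for a single data point might look like the sketch below. It operates on the nested list built by the code at the bottom of this post; the sheet name, search pattern, and helper name here are hypothetical placeholders rather than our production code.

library(tidyverse)

# hypothetical example: pull a total project cost figure from a budget sheet
extract_total_cost <- function(workbook){
    sheet <- workbook[["Sources and Uses Budget"]] # placeholder sheet name
    # collapse each row of the sheet into a single text string
    rows <- sheet %>%
        unite("text", everything(), sep = " ", na.rm = TRUE) %>%
        pull(text)
    # find the row containing the label, then pull the first number from it
    target <- rows[str_detect(rows, regex("total project cost", ignore_case = TRUE))][1]
    readr::parse_number(str_extract(target, "[0-9,]+(\\.[0-9]+)?"))
}

# apply the function to every workbook in the nested list
total.costs <- map_dbl(all.projects, extract_total_cost)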

After extracting the data elements, we reformatted them to match the client’s specifications and shared the final output in CSV format, ready for use in the statistical software of their choice.
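As a minimal sketch of that final step, assuming a handful of extraction functions like the hypothetical one above, the assembly and export might look like this:

library(tidyverse)

# hypothetical assembly step: one row per project, one column per data point
final.data <- tibble(
    project = names(all.projects),
    total_cost = map_dbl(all.projects, extract_total_cost)
    # ... additional columns, one per extracted data point ...
)

# write the final output as a CSV (placeholder file name)
write_csv(final.data, "affordable-housing-projects.csv")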

R code to convert a directory of multi-sheet Excel workbooks into a nested list.

library(tidyverse)

# the full path to each file (path_to_folder is the directory of downloaded workbooks)
all.paths <- list.files(path_to_folder, full.names = T)
# the name of each file
all.names <- list.files(path_to_folder, full.names = F) %>% word(1)

# this function reads all sheets from a given workbook
read_project <- function(pathindex){
    # the names of all sheets in the workbook
    sheets <- str_squish(readxl::excel_sheets(all.paths[pathindex]))

    # this function reads an individual sheet
    read_this_sheet <- function(sheetno){
        readxl::read_excel(all.paths[pathindex], sheet = sheetno,
        col_types = "text", col_names = F,
        range = readxl::cell_limits(c(1,1), c(NA, NA)),
        .name_repair = "unique_quiet")
    }

    # read every sheet, return each sheet as an element of a list
    application.sheets <- map(1:length(sheets), read_this_sheet)
    # name each element of the list the name of the respective sheet
    names(application.sheets) <- sheets
    application.sheets # output the list of tibbles, 1 for each sheet
}

# read every workbook, return a list where each element is an output from
# the read_project() function
all.projects <- map(1:length(all.paths), read_project, .progress = T)

# name each element of the list the name of the original file
names(all.projects) <- all.names

# save the output
saveRDS(all.projects, "nested-list-of-workbooks.RDS")
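
Reloading the saved object later takes about a minute and is a single call. The file name matches the one used above; the application and sheet names in the access example are placeholders.

# reload the full nested list of tibbles
all.projects <- readRDS("nested-list-of-workbooks.RDS")

# access one sheet of one application as a tibble, for example:
# all.projects[["some-application-file"]][["Application"]]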

Overwhelmed by tedious data management?

We can help you make sense of it. Tell us about your project and we’ll reach out shortly.

"The team at Big Lake Data transformed our overwhelming Excel data challenge into a streamlined, manageable process. Their innovative approach not only saved us countless hours but also provided us with precise, actionable insights to help us advance our affordable housing projects policy goals.”

- Executive Research Analyst, Nor Cal Carpenters Union