One of the problems with the NIH Funding Facts, which I’ve used in an earlier post, is that it has missing data in some important places. First, it only goes back to 1998 and you might want to look at trends even before that. Second, it has missing data in some unpredictable places. For instance, you can see below that National Cancer Institute (NCI) R01 funding is missing from 1998 to 2005. The NCI is one of the biggest NIH institutes by funding, so that’s pretty valuable information we might want to know.
One place you might try to get this data is the NIH Office of Budget. It doesn’t break down spending by activity codes (so you can’t, say, look only at spending on R01 grants), but it goes back all the way to 1983 so you can get a longer time series of NIH spending.
The problem? The data is in a PDF (oh god why?!), so we have to find a way to extract it. For the sake of reproducibility and accuracy, I want to avoid copying and pasting numbers directly from the document as much as possible.
The tabulizer package works pretty well for this since the data is in a table.[1] We still have to do some “manual” work because of the table’s structure. Using the tabulizer::extract_areas() function, you can interactively specify the area of the PDF that contains the table to be extracted.[2]
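As a rough sketch of that interactive step, something like the following should work. The wrapper name `extract_budget_table`, the `pdf_path` argument, and the example file name are all placeholders of mine, not names from the original code, and tabulizer needs a working Java setup (via rJava) to run.

```r
# A sketch only: `pdf_path` and `pages` are placeholders, not values from the post.
# Requires the tabulizer package, which in turn needs Java through rJava.
extract_budget_table <- function(pdf_path, pages = 1) {
  # Opens an interactive viewer for each requested page; drag a rectangle
  # around the table. Extra arguments are passed on to extract_tables(),
  # so output = "data.frame" returns each selected area as a data frame.
  tabulizer::extract_areas(pdf_path, pages = pages, output = "data.frame")
}

# Example call (interactive, so it is left commented out; the file name is hypothetical):
# raw_tables <- extract_budget_table("nih_mechanism_detail.pdf", pages = 1)
```

The result is a list with one element per selected area, so a single-page selection would be pulled out with `raw_tables[[1]]`.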
You can see below what you’re supposed to get from each PDF file.
After some data cleaning, we’re done! You can find the code for loading and cleaning the data at the end of the post. Keep in mind that the table selection tool might not always work as you expect (e.g. it might select an extra row), and the code is sensitive to that. The main things we need to do are:
Rename the columns and standardize the funding mechanism names (e.g. “Admin supp” vs. “administrative supplement”)
Parse the numbers so that R recognizes $1,000 as 1000; readr::parse_number makes this a cinch
Reshape the data so that it’s tidy
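Here is a minimal sketch of those three steps on a toy table. The column names, the fiscal years, the `mechanism_lookup` mapping, and every number are made up for illustration; they are not the real NIH figures or the names used in the full code. I’m also assuming tidyr::pivot_longer for the reshape, which is one of several ways to do it.

```r
library(readr)
library(tidyr)

# Toy stand-in for one extracted table; the numbers are invented, not NIH data.
raw <- data.frame(
  V1 = c("Research Project Grants", "Admin supp", "Research Centers"),
  V2 = c("$1,000", "$50", "$200"),
  V3 = c("$1,100", "$55", "$210"),
  stringsAsFactors = FALSE
)

# 1. Rename the columns (the fiscal years here are placeholders).
names(raw) <- c("mechanism", "1983", "1984")

# Standardize mechanism names with a lookup vector (hypothetical mapping).
mechanism_lookup <- c("Admin supp" = "Administrative supplements")
needs_fix <- raw$mechanism %in% names(mechanism_lookup)
raw$mechanism[needs_fix] <- unname(mechanism_lookup[raw$mechanism[needs_fix]])

# 2. Parse the dollar strings into plain numbers, e.g. "$1,000" becomes 1000.
year_cols <- setdiff(names(raw), "mechanism")
raw[year_cols] <- lapply(raw[year_cols], parse_number)

# 3. Reshape to one row per mechanism and year.
tidy <- pivot_longer(raw, cols = -mechanism, names_to = "year", values_to = "funding")
tidy$year <- as.integer(tidy$year)
```

If the selection tool grabbed an extra row, the positional renaming in step 1 is where things would go wrong, so it’s worth checking `dim(raw)` before renaming.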
Let’s do some plots to make sure things look alright.
From a quick inspection the time series looks alright. We can see the rapid rise in funding due to the doubling of the NIH budget from 1998 to 2003, followed by flat funding since then. But even before the doubling, funding had already been on an upward trajectory from 1983, except between 1993 and 1995. From a historical perspective, the stagnation in funding since 2003 is an unusual event.
To make the same point in a slightly different way, I plotted the percentage change in funding from the previous year. There is only one year with a decrease in nominal funding pre-2003, compared to six instances of funding decreases after 2003.
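For reference, the year-over-year percentage change can be computed along these lines. This is a base R sketch on an invented series (the `funding` data frame and its values are placeholders, not NIH data), and it assumes the rows are already sorted by year.

```r
# Toy nominal funding series; values are invented for illustration, not NIH data.
funding <- data.frame(
  year = 2001:2005,
  total = c(100, 110, 121, 121, 115)
)

# Percentage change from the previous year; the first year has no predecessor,
# so it gets NA.
funding$pct_change <- c(NA, diff(funding$total) / head(funding$total, -1) * 100)

# Count the years in which nominal funding fell.
n_decreases <- sum(funding$pct_change < 0, na.rm = TRUE)
```

Because these are nominal dollars, a flat year shows up as a 0% change even though it would be a decline in real terms.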
Code for extracting and cleaning PDF data
[1] tabulizer is based on tabula. Another option for PDF extraction in R is pdftools.
[2] In theory, you might be able to partially automate this over multiple pages using tabulizer’s locate_areas() function, but I haven’t got it to work yet.