Stata’s reshape in R

3 minute read

Update: There is an R function called reshape from the stats package that does the same thing, just not within the tidyverse framework

If you’ve used Stata you might be familiar with its reshape command. reshape makes a wide dataset long and vice versa. The equivalent in the tidyverse would be the gather (wide to long) and spread (long to wide) functions from the tidyr package. The difference is that gather and spread work on key-value pairs, emphasis on the singular “value”, while reshape is fine with having multiple values associated with a single key. For example, to reshape the following (fake) wide dataset from this:

id age2000 age2010 scores2000 scores2010
A 11 21 96 100
B 12 22 97 99
C 13 23 98 98
D 14 24 99 97
E 15 25 100 96

to this:

id year age scores
A 2000 11 96
A 2010 21 100
B 2000 12 97
B 2010 22 99
C 2000 13 98
C 2010 23 98
D 2000 14 99
D 2010 24 97
E 2000 15 100
E 2010 25 96

In Stata you would do this with something like1 reshape long age scores, i(id) j(year).

With the tidyr functions, you need to first gather:

scores_vlong = scores %>% tidyr::gather("key2", "value", c(age2000:scores2010))
scores_vlong
##    id       key2 value
## 1 A age2000 11
## 2 B age2000 12
## 3 C age2000 13
## 4 D age2000 14
## 5 E age2000 15
## 6 A age2010 21
## 7 B age2010 22
## 8 C age2010 23
## 9 D age2010 24
## 10 E age2010 25
## 11 A scores2000 96
## 12 B scores2000 97
## 13 C scores2000 98
## 14 D scores2000 99
## 15 E scores2000 100
## 16 A scores2010 100
## 17 B scores2010 99
## 18 C scores2010 98
## 19 D scores2010 97
## 20 E scores2010 96

Then extract2 and spread:

scores_vlong %>% 
tidyr::extract("key2", c("colname", "year"),
regex = "([a-z]+)(\\d+)") %>%
tidyr::spread("colname", "value")
##    id year age scores
## 1 A 2000 11 96
## 2 A 2010 21 100
## 3 B 2000 12 97
## 4 B 2010 22 99
## 5 C 2000 13 98
## 6 C 2010 23 98
## 7 D 2000 14 99
## 8 D 2010 24 97
## 9 E 2000 15 100
## 10 E 2010 25 96

I’ve been thinking about writing a function to automate this process for a while and finally got some impetus to do so when Paul Goldsmith-Pinkham asked about the issue on Twitter. Plus it was a good way to practice working with quasiquotation.

gather_multivalue and spread_multivalue are basically wrappers around this sequence of steps.3 So now you only need one line:

# equivalent

gather_multivalue(scores, "year", age2000:scores2010)
gather_multivalue(scores, "year", -id)

gather_multivalue also asks you to specify a regular expression (regex) for how to extract the key and values. The default regex assumes that the columns are of the form (word)(number). I like that regex gives you some flexibility if you get columns with slightly weird or varying patterns:

id age.2000 age.2010 scores_2000 scores_2010
A 11 21 96 100
B 12 22 97 99
C 13 23 98 98
D 14 24 99 97
E 15 25 100 96

The columns have different separators, . and _.

twydyverse::gather_multivalue(scores_dumb, "year", 
age.2000:scores_2010,
regex = "([a-z]+)[\\.|_](\\d+)")
##    id year age scores
## 1 A 2000 11 96
## 2 A 2010 21 100
## 3 B 2000 12 97
## 4 B 2010 22 99
## 5 C 2000 13 98
## 6 C 2010 23 98
## 7 D 2000 14 99
## 8 D 2010 24 97
## 9 E 2000 15 100
## 10 E 2010 25 96

Of course the tradeoff is that you need to specify a regex, but for the purposes of working with column names I don’t imagine that’s likely to get too complicated.

These functions are available in my personal package:

# install.packages("devtools")
devtools::install_github("weiyangtham/twydyverse")
  1. I didn’t check this code in Stata so could be wrong! 

  2. The first part of the regex, "([a-z]+)", extracts the word, then (\\d+) extracts the digits. 

  3. Which in turn are from Hadley Wickham’s Stackoverflow answer 

Tags: ,

Updated:

Leave a Comment