Stata’s `reshape` in R

3 minute read

Update: There is an R function called reshape from the stats package that does the same thing, just not within the tidyverse framework

If you’ve used Stata you might be familiar with its reshape command. reshape makes a wide dataset long and vice versa. The equivalent in the tidyverse would be the gather (wide to long) and spread (long to wide) functions from the tidyr package. The difference is that gather and spread work on key-value pairs, emphasis on the singular “value”, while reshape is fine with having multiple values associated with a single key. For example, to reshape the following (fake) wide dataset from this:

id	age2000	age2010	scores2000	scores2010
A	11	21	96	100
B	12	22	97	99
C	13	23	98	98
D	14	24	99	97
E	15	25	100	96

to this:

id	year	age	scores
A	2000	11	96
A	2010	21	100
B	2000	12	97
B	2010	22	99
C	2000	13	98
C	2010	23	98
D	2000	14	99
D	2010	24	97
E	2000	15	100
E	2010	25	96

In Stata you would do this with something like¹ reshape long age scores, i(id) j(year).

With the tidyr functions, you need to first gather:

scores_vlong = scores %>% tidyr::gather("key2", "value", c(age2000:scores2010))
scores_vlong

##    id       key2 value
## 1   A    age2000    11
## 2   B    age2000    12
## 3   C    age2000    13
## 4   D    age2000    14
## 5   E    age2000    15
## 6   A    age2010    21
## 7   B    age2010    22
## 8   C    age2010    23
## 9   D    age2010    24
## 10  E    age2010    25
## 11  A scores2000    96
## 12  B scores2000    97
## 13  C scores2000    98
## 14  D scores2000    99
## 15  E scores2000   100
## 16  A scores2010   100
## 17  B scores2010    99
## 18  C scores2010    98
## 19  D scores2010    97
## 20  E scores2010    96

Then extract² and spread:

scores_vlong %>% 
  tidyr::extract("key2", c("colname", "year"), 
                 regex = "([a-z]+)(\\d+)") %>%
  tidyr::spread("colname", "value")

##    id year age scores
## 1   A 2000  11     96
## 2   A 2010  21    100
## 3   B 2000  12     97
## 4   B 2010  22     99
## 5   C 2000  13     98
## 6   C 2010  23     98
## 7   D 2000  14     99
## 8   D 2010  24     97
## 9   E 2000  15    100
## 10  E 2010  25     96

I’ve been thinking about writing a function to automate this process for a while and finally got some impetus to do so when Paul Goldsmith-Pinkham asked about the issue on Twitter. Plus it was a good way to practice working with quasiquotation.

For #rstats users -- how do I reshape wide / long using gather and spread with multiple variables?
— Paul G-P (@paulgp) January 13, 2018

gather_multivalue and spread_multivalue are basically wrappers around this sequence of steps.³ So now you only need one line:

# equivalent

gather_multivalue(scores, "year", age2000:scores2010)
gather_multivalue(scores, "year", -id)

gather_multivalue also asks you to specify a regular expression (regex) for how to extract the key and values. The default regex assumes that the columns are of the form (word)(number). I like that regex gives you some flexibility if you get columns with slightly weird or varying patterns:

id	age.2000	age.2010	scores_2000	scores_2010
A	11	21	96	100
B	12	22	97	99
C	13	23	98	98
D	14	24	99	97
E	15	25	100	96

The columns have different separators, . and _.

twydyverse::gather_multivalue(scores_dumb, "year", 
                              age.2000:scores_2010, 
                              regex = "([a-z]+)[\\.|_](\\d+)")

##    id year age scores
## 1   A 2000  11     96
## 2   A 2010  21    100
## 3   B 2000  12     97
## 4   B 2010  22     99
## 5   C 2000  13     98
## 6   C 2010  23     98
## 7   D 2000  14     99
## 8   D 2010  24     97
## 9   E 2000  15    100
## 10  E 2010  25     96

Of course the tradeoff is that you need to specify a regex, but for the purposes of working with column names I don’t imagine that’s likely to get too complicated.

These functions are available in my personal package:

# install.packages("devtools")
devtools::install_github("weiyangtham/twydyverse")

I didn’t check this code in Stata so could be wrong! ↩
The first part of the regex, "([a-z]+)", extracts the word, then (\\d+) extracts the digits. ↩
Which in turn are from Hadley Wickham’s Stackoverflow answer ↩

Share on

Twitter Facebook Google+ LinkedIn

Wei Yang Tham

Stata’s `reshape` in R

Share on

Leave a Comment

You May Also Enjoy

Step-by-step of my workflow

geom_errorline for intervals

Machine Learning: An Applied Econometric Approach

Making a ggplot theme