Stata’s reshape
in R
Update: There is an R function called reshape
from the stats
package that does the same thing, just not within the tidyverse framework
If you’ve used Stata you might be familiar with its reshape
command. reshape
makes a wide dataset long and vice versa. The equivalent in the tidyverse
would be the gather
(wide to long) and spread
(long to wide) functions from the tidyr
package. The difference is that gather
and spread
work on key-value pairs, emphasis on the singular “value”, while reshape
is fine with having multiple values associated with a single key. For example, to reshape the following (fake) wide dataset from this:
id | age2000 | age2010 | scores2000 | scores2010 |
---|---|---|---|---|
A | 11 | 21 | 96 | 100 |
B | 12 | 22 | 97 | 99 |
C | 13 | 23 | 98 | 98 |
D | 14 | 24 | 99 | 97 |
E | 15 | 25 | 100 | 96 |
to this:
id | year | age | scores |
---|---|---|---|
A | 2000 | 11 | 96 |
A | 2010 | 21 | 100 |
B | 2000 | 12 | 97 |
B | 2010 | 22 | 99 |
C | 2000 | 13 | 98 |
C | 2010 | 23 | 98 |
D | 2000 | 14 | 99 |
D | 2010 | 24 | 97 |
E | 2000 | 15 | 100 |
E | 2010 | 25 | 96 |
In Stata you would do this with something like1 reshape long age scores, i(id) j(year)
.
With the tidyr
functions, you need to first gather
:
Then extract
2 and spread
:
I’ve been thinking about writing a function to automate this process for a while and finally got some impetus to do so when Paul Goldsmith-Pinkham asked about the issue on Twitter. Plus it was a good way to practice working with quasiquotation.
For #rstats users -- how do I reshape wide / long using gather and spread with multiple variables?
— Paul G-P (@paulgp) January 13, 2018
gather_multivalue
and spread_multivalue
are basically wrappers around this sequence of steps.3 So now you only need one line:
gather_multivalue
also asks you to specify a regular expression (regex) for how to extract the key and values. The default regex assumes that the columns are of the form (word)(number)
. I like that regex gives you some flexibility if you get columns with slightly weird or varying patterns:
id | age.2000 | age.2010 | scores_2000 | scores_2010 |
---|---|---|---|---|
A | 11 | 21 | 96 | 100 |
B | 12 | 22 | 97 | 99 |
C | 13 | 23 | 98 | 98 |
D | 14 | 24 | 99 | 97 |
E | 15 | 25 | 100 | 96 |
The columns have different separators, .
and _
.
Of course the tradeoff is that you need to specify a regex, but for the purposes of working with column names I don’t imagine that’s likely to get too complicated.
These functions are available in my personal package:
-
I didn’t check this code in Stata so could be wrong! ↩
-
The first part of the regex,
"([a-z]+)"
, extracts the word, then(\\d+)
extracts the digits. ↩ -
Which in turn are from Hadley Wickham’s Stackoverflow answer ↩
Leave a Comment