Getting Started with R
I occasionally get asked for tips about getting started with R, so I thought I’d compile the advice I’ve given to people into a blog post. It goes without saying that there’s lots of good advice out there; this is just what happened to work for me and it’s really all about experimenting till you find what you’re comfortable with. Also, I’m writing this from the perspective of doing economics research, so you may need to adjust according to your needs, but this is generally about the basics and probably pretty widely applicable.
Resources to get started
-
Swirl is one of my favorite tools for learning R. It’s an interactive tool for learning R within the R console, so you get instant feedback and everything seems a bit more “real”.
-
If I were starting all over again, I’d also pick up Hadley Wickham’s R for Data Science. There’s a free version online and a paperback version. Jenny Bryan’s Stat 545 class website is also a great resource.
-
Twitter. The R community on twitter is really active, it’s a great way to keep track of new tools being developed. Accounts I recommend (just to get started cos there are so many more): Hadley Wickham, Jenny Bryan, David Robinson, Julia Silge, and Mara Averick.
R v.s. Stata
The language war in the data science community is R v.s. Python, but Stata is pretty much the tool of choice for economists. Here’s my hot take on how they compare across different categories:
-
Learning curve: R has traditionally been thought to have a steeper learning curve than Stata. While this was definitely so before, I believe the gap is drastically smaller today because of the tidyverse. The average Stata user can do most of what they need to do with tidyverse tools, which aren’t dramatically harder to learn than Stata (if at all), plus you no longer have to figure out what packages you should use. Dave Robinson makes the case for teaching the tidyverse before base R here.
-
Data wrangling: This one’s not even close. Just the fact that you can essentially only work with one dataframe at a time in Stata already gives R an edge. Writing functions is also much more natural in R. The tidyverse is also an important part of this: 80-90% of data prep I need to do, I can do with tidyverse functions or some customized version of them. The gist of it is that I’ve often found that Stata errs on the side of being too inflexible. 1
-
Data Visualization: R has the advantage here because it has
ggplot2
. ggplot2 is great because there is a logic to it: you map the data to features of the graph. Once you understand the underlying principles (even just partially, as I do), it’s easy to iterate over multiple graphs or add complexity to a graph. David Robinson has written probably the definitive blog post on this if you want to get into it. I don’t know that Stata is necessarily bad for visualization, ggplot2 is just great. -
Reproducibility: I believe R (perhaps more accurately, RStudio) is also ahead of Stata on this one, although Stata users have been trying to incorporate things like dynamic documents into Stata. Also, I recently discovered that Stata has a project manager tool so it might not be as far behind as I once thought.
-
Econometric Tools: I suspect Stata has the edge here, but I don’t know by how much. There are econometrics packages in R which should suffice for most purposes. Paul Goldsmith-Pinkham has a useful post comparing regression results from Stata and R. The conclusion? Identical results, with some minor differences in standard errors for certain results. You should still check and know what the differences are, but for the most part it doesn’t seem like a problem. I still do worry R econometrics packages aren’t kept as up-to-date, either compared to other R packages or econometrics functions in Stata. This may change as R and machine learning methods become more popular in economics, leading more economists to write econometric packages for R.
-
Don’t get me started on Stata’s inability to join/merge datasets on variables with different names. ↩
Leave a Comment