Why R is Hard to Learn

by Bob Muenchen

R has a reputation of being hard to learn. Some of that is due to the fact that it is radically different from other analytics software. Some is an unavoidable byproduct of its extreme power and flexibility. And, as with any software, some is due to design decisions that, in hindsight, could have been better.

If you have experience with other analytics tools, you may at first find R very alien. Training and documentation that leverage your existing knowledge, and that point out where that knowledge is likely to mislead you, can save much frustration. This is the approach I use in my books, R for SAS and SPSS Users and R for Stata Users, as well as the workshops that are based on them.

Below is a list of complaints about R that I commonly hear from people taking my R workshops. By listing these, I hope R beginners will be forewarned, will become aware that many of these problems come with benefits, and may consider the solutions offered by the add-on packages that I suggest. As many have said, R makes easy things hard, and hard things easy. However, add-on packages help make the easy things easy as well.

Unhelpful Help

R’s help files are often thorough and usually contain many working examples. However, they’re definitely not written for beginners! My favorite example of this is the help file for one of the first commands that beginners learn: print. The SAS help file for its print procedure says that it “Prints observations in a SAS data set using some or all of the variables.” Clear enough. The R help file for its print function says, “print prints its argument and returns it invisibly (via invisible(x)). It is a generic function which means that new printing methods can be easily added for new classes.” The reader is left to wonder what “invisible” output looks like and what methods and classes are. The help files will tell you more about “methods” but not “classes”. You have to know to look for help on “class” to find that.
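
To see what "invisible" output means in practice, here is a minimal base-R sketch (the function f is a hypothetical example of my own, not from the help file):

```r
# print() returns its argument "invisibly": the value does come back,
# it just isn't auto-displayed. invisible() is how any function does this.
f <- function() invisible(42)
f()        # displays nothing at the console
x <- f()   # yet the value was returned: x is now 42

# The "methods" the help file mentions are class-specific versions of print:
class(mtcars)           # a data frame has class "data.frame"
length(methods(print))  # many print methods exist, roughly one per class
```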

Another confusing aspect of R’s help files stems from R’s ability to add new capabilities (called methods) to existing functions as you load add-on packages. This means you can’t simply read a help file once, understand it, and be done with that function forever. On the other hand, it means you have fewer commands to learn. For example, once you learn to use the predict function, loading a new package may give that function new abilities to deal with the model objects that the new package computes.
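
A sketch of this in action, using the MASS package (which ships with standard R installations) purely as an illustration:

```r
# Loading MASS adds new methods, e.g. predict.lda for discriminant models.
library(MASS)
fit <- lda(Species ~ ., data = iris)  # fit a linear discriminant model
head(predict(fit)$class)              # the familiar predict() now handles it
```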

So an R beginner has to learn much more than a SAS or SPSS beginner before he or she will find the help files very useful. However, there is a vast array of tutorials, workshops and books available, many of them free, to get beginners over this hump.

Misleading Function or Parameter Names

The most difficult time people have learning R is when functions don’t do the “obvious” thing. For example, when sorting data, SAS, SPSS and Stata users all reach for commands appropriately named “sort.” Turning to R, they look for such a command and, sure enough, there’s one named exactly that. However, it does not sort datasets! Instead it sorts individual variables, which is often dangerous: sorting one variable by itself breaks its alignment with the other variables. In R, the order function sorts data sets, and it does so in a somewhat convoluted way. However, the dplyr package has an arrange function that sorts data sets and is quite easy to use.
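
A small sketch of the three approaches; mydata and its variables are hypothetical, and the last step assumes the dplyr package is installed:

```r
mydata <- data.frame(id = c("b", "c", "a"), score = c(2, 3, 1))

sort(mydata$score)             # sorts ONE variable: the ids are left behind!
mydata[order(mydata$score), ]  # base R's convoluted way to sort a data frame

library(dplyr)                 # assumes dplyr is installed
arrange(mydata, score)         # the same result, far easier to read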

Perhaps the biggest shock comes when the new R user discovers that sorting is often not even needed. Other packages require sorting before they can do three common tasks: (1) summarizing or aggregating data, (2) repeating an analysis for each group (“by” or “split-file” processing) and (3) merging files by key variables. R does not require you to sort datasets before performing any of these tasks!
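
A base-R sketch of all three tasks on deliberately unsorted data (the left/right data frames are hypothetical):

```r
# None of these require a prior sort in R:
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)  # (1) summarize by group
by(mtcars$mpg, mtcars$cyl, mean)                 # (2) repeat analysis per group

left  <- data.frame(id = c(3, 1, 2), x = c("c", "a", "b"))
right <- data.frame(id = c(2, 3, 1), y = c("B", "C", "A"))
merge(left, right, by = "id")                    # (3) merge by key, unsorted
```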

Another construct that commonly confuses beginners is the simple “if.” While other software uses if to recode variables (among other tasks), in R if controls the flow of commands, while ifelse performs tasks such as recoding.
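
A quick sketch of the difference (the scores variable is hypothetical):

```r
scores <- c(40, 75, 90)
ifelse(scores >= 60, "pass", "fail")  # vectorized recode: "fail" "pass" "pass"

# if() instead tests a single condition and controls program flow:
if (mean(scores) >= 60) message("class average is passing")
```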

Too Many Commands

Other statistics packages have relatively few analysis commands, but each of them has many options to control its output. R’s approach is quite the opposite, which takes some getting used to. For example, when doing a linear regression in SAS or SPSS you usually specify everything in advance and then see all the output at once: equation coefficients, analysis of variance (ANOVA) table, and so on. However, when you create a model in R, one command (summary) provides the parameter estimates while another (anova) provides the ANOVA table. There are still other commands, such as coefficients, that display only that part of the model. So there are more commands to learn, but each needs fewer options.
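
A base-R sketch of this piecemeal style, using the built-in mtcars data:

```r
m <- lm(mpg ~ wt, data = mtcars)  # fit the model once...
summary(m)       # ...then request each piece: parameter estimates
anova(m)         # the ANOVA table
coefficients(m)  # just the coefficients
confint(m)       # confidence intervals for them
```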

Inconsistent Syntax

Since everyone is free to add new capabilities to R, the code for different R packages is often a bit of a mess. For example, the two blocks of code below do roughly the same thing using radically different syntaxes. This type of inconsistency is common in R, and there’s no way around it given that anyone can extend the language as they like.

  library("Deducer")
  descriptive.table(
    vars = d(mpg, hp),
    data = mtcars,
    func.names = c("Mean", "Median",
      "St. Deviation", "Valid N",
      "25th Percentile", "75th Percentile"))

  library("RcmdrMisc")
  numSummary(
    data.frame(mtcars$mpg, mtcars$hp),
    statistics = c("mean", "sd", "quantiles"),
    quantiles = c(.25, .50, .75))

Identity Crisis

All analytics software has names for its variables, but R is unique in that it also names its rows. This means you must learn to manage row names as well as variable names. For example, when reading a comma-separated-values file, variable names often appear in the first row, and all analytics software can read those names. An identifier variable such as ID or Personnel_Num often appears in the first column of such files. In R, if that variable is not named, R will assume it’s an ID variable and convert its values into row names. However, if it is named – as all other software would require – then you must add options to tell R to put its values, or the values of any other variable you choose, into the row-names position. Once you do that, the name of the original ID variable vanishes. The benefit of this odd behavior is that when analyses or plots need to identify observations, R automatically knows where to find those names. This saves R from needing an equivalent to SAS’ ID statement (used in procedures such as PROC CLUSTER).

While this looks like a worthwhile tradeoff, it is complicated by the fact that row names must be unique. That means you cannot maintain the original row names when you stack two files that have the same variables, as when you measured the same observations at two times, perhaps before and after some treatment. It also means that combining by-group output can be tricky, though the broom package takes that into account for you. The popular dplyr package simply replaces row names with character versions of the consecutive integers 1, 2, 3…. My advice is to handle your own ID variables as standard variables, and put them into the row-names position only when an R function offers you some benefit in return.
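
A sketch of the round trip (df and its id values are hypothetical):

```r
df <- data.frame(id = c("p01", "p02"), score = c(85, 92))
rownames(df)           # default row names: "1" "2"

rownames(df) <- df$id  # move the ID variable into the row-name position...
df$id <- NULL          # ...at which point the original variable vanishes
df["p02", ]            # but observations can now be selected by name
```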

Dangerous Control of Variables

In R, a single analysis can include variables from multiple data sets. That usually requires the observations to be in identical order in each data set. Over the years I have had countless clients come in to merge data sets that they were convinced had exactly the same observations in precisely the same order. A quick check usually showed that they did not match up! So by enabling such analyses, R seems to be asking for disaster. It’s always safer to merge files by key variables (such as ID) before doing an analysis, and I recommend doing so whenever possible.

Why does R allow such a dangerous operation? Because it provides useful flexibility. For example, from one dataset, you might plot regression lines of variable x against variable y for each of three groups on the same plot. From another dataset you might get group labels to add. This lets you avoid a legend that makes your readers look back and forth between the legend and the lines. The label dataset would contain only three variables: the group labels and their x-y locations. That’s a dataset of only 3 observations so merging that with the main data set makes little sense.

Inconsistent Ways to Analyze Multiple Variables

One of the first functions beginners typically learn is summary(x). As you might guess, it gets summary statistics for the variable x. That’s simple enough. You might then guess that to analyze two variables, you would just enter summary(x, y). However, that’s wrong, because many functions in R, including this one, accept only a single object. The solution is to put the two variables into a single object such as a data frame: summary(data.frame(x, y)). So the generalization you need to make is not from one variable to multiple variables, but from one object (a variable) to another object (a dataset).

If that were the whole story, it would not be that hard to learn. Unfortunately, R functions are quite inconsistent in both what objects they accept and how many. In contrast to the summary example above, R’s max function can accept any number of variables separated by commas. However, its cor function cannot handle more than two that way; additional variables must be placed in a matrix or data frame. R’s mean function accepts only a single variable and cannot directly handle multiple variables even when they are in a single data frame. These inconsistencies simply have to be memorized.
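
The inconsistencies side by side, as a sketch (x and y are hypothetical):

```r
x <- c(1, 2, 3); y <- c(4, 5, 6)

summary(data.frame(x, y))       # summary() wants ONE object
max(x, y)                       # max() happily takes several: 6
cor(data.frame(x, y))           # extra variables must share one data frame
sapply(data.frame(x, y), mean)  # mean() takes one variable; sapply() loops it
```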

Overly Complicated Variable Naming and Renaming Process

People are often surprised when they learn how R names and renames its variables. Since R stores the names in a character vector, renaming even one variable means that you must first locate the name in that vector before you can put the new name in that position. That’s much more complicated than the simple newName=oldName form used by many other languages.

While this approach is more complicated, it offers great benefits. For example, you can easily copy all the names from one dataset into another. You can also use the full range of string manipulations (such as regular expressions) allowing you to use many different approaches to changing names. Those are capabilities that are either impossible or much more difficult to perform in other languages.
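
A sketch of both sides of that tradeoff; the data frame and the varA/varB names are hypothetical:

```r
mydata <- data.frame(varA = 1:2, varB = 3:4)

# Renaming one variable means locating it in the names vector first:
names(mydata)[names(mydata) == "varB"] <- "varZ"

# ...but the same mechanism applies regular expressions to ALL names at once:
names(mydata) <- sub("^var", "item", names(mydata))
names(mydata)   # "itemA" "itemZ"
```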

If the data are coming from a text file, I recommend simply adding the names to a header line at the top of the file. If you need to rename them later, I recommend the dplyr package’s rename function. You can convert to R’s built-in approach when you need more flexibility.

Poor Ability to Select Variables

Most data analysis packages allow you to select variables that are next to one another in the data set (e.g. A–Z or A TO Z), that share a common prefix (e.g. varA, varB,…) or that contain numeric suffixes (e.g. x1-x25, not necessarily next to one another). Base R lacks easy shortcuts for these selections, although with a bit of additional programming it offers far more flexibility in variable selection than other software. The dplyr package’s select function provides those shortcuts and more.
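
A sketch of those shortcuts, assuming the dplyr package is installed (the data frame and variable names are hypothetical):

```r
library(dplyr)   # assumes dplyr is installed
d <- data.frame(id = 1:2, varA = 1:2, varB = 3:4, x1 = 5:6, x2 = 7:8)

select(d, varA:x1)              # adjacent variables, like A–Z ranges
select(d, starts_with("var"))   # a common prefix
select(d, num_range("x", 1:2))  # numeric suffixes x1, x2
```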

Too Many Ways to Select Variables

If variable x is stored in mydata and you want to get the mean of x, most software only offers you one way to do that, such as “var x;” in SAS. In R, you can do it in many different ways:

summary(mydata$x)
summary(mydata[ , "x"])
with(mydata, summary(x))
summary(subset(mydata, select = x))
summary(select(mydata, x))    # select() is from the dplyr package
mydata %>% with(summary(x))   # the %>% pipe comes via dplyr/magrittr
mydata %>% summary(.$x)
mydata %$% summary(x)         # %$% is from the magrittr package
To add to the complexity, if we simply asked for the mean instead of several summary statistics, several of those approaches would generate error messages because they select x in a way that the mean function will not accept.

To make matters even worse, the above examples work with data stored in a data frame (what most software calls a dataset) but not all work when the data are stored in a matrix.

Why are there so many ways to select variables? R has many data structures that give it great flexibility, and each can use slightly different approaches to variable selection. Fully integrating matrix algebra capabilities into the main language, rather than keeping them in a separate language such as SAS/IML, also means you see more of the machinery that other software hides until you need it. The last few examples come from the dplyr package, which makes variable selection much easier, though of course that also means more to learn.

Too Many Ways to Transform Variables

Analytics software typically offers one way to transform variables: SAS has its data step, SPSS its COMPUTE statement, and so on. R has several approaches. Here are some ways R can create a new variable named “mysum” by adding two variables, x and y:

mydata$mysum <- mydata$x + mydata$y
mydata$mysum <- with(mydata, x + y)
mydata["mysum"] <- mydata["x"] + mydata["y"]
attach(mydata)
  mydata$mysum <- x + y
detach(mydata)
mydata <- within(mydata,
  mysum <- x + y)
mydata <- transform(mydata, mysum = x + y)
mydata <- mutate(mydata, mysum = x + y)    # mutate() is from the dplyr package
mydata <- mydata %>% mutate(mysum = x + y)

Some are variations on the ways to select variables, such as the use of the attach function, which for transformation purposes still requires the “mydata$” prefix (or an equivalent form) to ensure that the new variable is stored in the original data frame. Leaving that prefix out is a major source of confusion, as beginners assume the new variable will go into the attached data frame (it won’t!). The within function parallels the with function used for variable selection, but it allows variable modification while with does not.

The cost of this situation is clear. The benefit comes from the integration of multiple types of commands (macro, matrix, etc.) and data structures.

Not All Functions Accept All Variables

In most analytics software, a variable is a variable, and all procedures accept them. In R however, a variable could be a vector, a factor, a member of a data frame or even a component of a complex structure in R called a list. For each function you have to learn what it will accept for processing. For example, most simple statistical functions for the mean, median, etc. will accept variables stored as vectors. They’ll also accept variables in datasets or lists, but only if you select them in such a way that they become vectors on the fly.

This complexity is again the unavoidable byproduct of R’s powerful set of data structures which includes vectors, factors, matrices, data frames, arrays and lists (more on this later). Despite adding complexity, it offers a wide array of advantages. For example, categorical variables that are stored as factors can be included in a regression model and R will automatically generate the dummy or indicator variables needed to make such a variable work well in the model.
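
A sketch of that payoff, using the built-in mtcars data (the gear_f name is my own):

```r
# A factor in a model formula becomes indicator variables automatically:
mtcars$gear_f <- factor(mtcars$gear)
coef(lm(mpg ~ wt + gear_f, data = mtcars))  # gear_f4 and gear_f5 dummies appear
```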

Confusing Name References

Names must sometimes be enclosed in quotes, but at other times must not be. For example, to install a package and load it into memory you use these two inconsistent steps:

install.packages("dplyr")
library(dplyr)

You can make those two commands consistent by adding quotes around “dplyr” in the second command, but not by removing them from the first. To remove the package from memory you use the following command, which refers to the package in yet another way (quotes optional):

detach("package:dplyr")

Poor Ability to Select Data

The first task in any data analysis is selecting a data set to work with. Most other software has ways to specify the data set that are (1) easy, (2) safe and (3) consistent. R offers several ways to select a data set, but none that meets all three criteria. Referring to variables as mydata$myvar works in many situations, but it’s not easy, since you end up typing “mydata” over and over. R has an attach function, but its use is quite tricky: it gives beginners the impression that new variables will be stored in the attached data set (by default, they won’t) and that R will pick variables from that data set before looking elsewhere (it won’t). Some functions offer a data argument, but not all do. Even when a function offers one, it works only when you also specify a model formula, and some analyses (paired tests, for example) don’t use formulas.
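
A sketch of that limitation with base R’s t.test and the built-in mtcars data:

```r
t.test(mpg ~ am, data = mtcars)  # data= works when there is a model formula...

# ...but the classic paired/two-vector form has no formula, so data= can't
# be used and you fall back on with():
with(mtcars, t.test(mpg[am == 0], mpg[am == 1]))
```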

If this is so easy in other software but so confusing in R, what’s the point? Part of it is the price to pay for great flexibility. R is the only software I know of that allows you to include variables from multiple datasets in a single analysis. So you need ways to change datasets in the middle of an analysis. However, part of that may simply be design choices that could have been better in hindsight. For example, it could have been designed with a data argument in all functions, with a system-wide option to look for a default data set, as SAS has.

Loop-a-phobia

R has loops to control program flow, but people (especially beginners) are told to avoid them. Since loops are how most languages apply the same function to multiple variables, this seems strange. R instead uses its “apply” family of functions: you tell R to apply a function to either the rows or the columns of your data. It’s a mental adjustment to make, but the result is the same, and apply calls are sometimes easier to write and understand than the equivalent loop.
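
A sketch using the built-in mtcars data:

```r
# Instead of looping over columns, hand the whole set to an apply function:
sapply(mtcars[c("mpg", "hp", "wt")], mean)  # one mean per column
apply(mtcars[c("mpg", "hp")], 1, max)       # or work across each row instead
```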

Functions That Act Like Procedures

Many other packages, including SAS, SPSS and Stata, have procedures or commands that do typical data analyses that go “down” through all the observations. They also have functions that usually do a single calculation across rows, such as taking the mean of some scores for each observation in the data set. But R has only functions, and many of those functions can do both. How does it get away with that? Functions may have a preference to go down rows or across columns, but for many functions you can use the “apply” family to force them to go in either direction. So in R, functions act like both the procedures and the functions of other packages. If you’re coming to R from other software, that’s a wild new idea.

Inconsistent Function Names

All languages have their inconsistencies. For example, it took SPSS developers decades before they finally offered a syntax-checking text editor. I was told by an SPSS insider that they would have done it sooner if the language hadn’t contained so many inconsistencies. SAS has its share of inconsistencies as well, with OUTPUT statements for some procedures and OUT options on others. However, I suspect that R probably has far more inconsistencies than most since it lacks a consistent naming convention. You see names in: alllowercase, period.separated, underscore_separated, lowerCamelCase and UpperCamelCase. Some of the built-in examples include:

names, colnames
row.names, rownames
rowSums, rowsum
rowMeans, (no parallel rowmean exists)
browseURL, contrib.url, fixup.package.URLs
package.contents, packageStatus
getMethod, getS3method
read.csv and write.csv, load and save, readRDS and saveRDS
Sys.time, system.time

When you include add-on packages, you can come across some real “whoppers!” For example, R has a built-in reshape function, the Hmisc package has a reShape function (case matters), and there are both reshape and reshape2 packages that reshape data, yet neither of them contains a function named “reshape”!

Odd Treatment of Missing Values

In all data analysis packages that I’m aware of, missing values are treated the same: they’re excluded automatically when (1) selecting observations and (2) performing analyses. When selecting observations, R actually inserts missing values! For example, say you have this data set:

 Gender English Math
 male        85   82
 male        72   87
             75   81
 female      77   78
 female      98   91

If you select the males using mydata[mydata$Gender == "male", ], R will return the top three lines, substituting its missing value symbol, NA, in place of the values 75 and 81 for the third observation. Why create missing values where there were none before? It’s as if R’s designers considered missing values unusual and thought you needed to be warned of their existence. In my experience, missing values are so common that when I get a data set that appears to have none, I’m quite suspicious that someone has failed to set some special code to missing. The solution is to ask which observations make the logic true, using mydata[which(mydata$Gender == "male"), ].
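
A runnable sketch of the data set above and both selections:

```r
mydata <- data.frame(Gender  = c("male", "male", NA, "female", "female"),
                     English = c(85, 72, 75, 77, 98),
                     Math    = c(82, 87, 81, 78, 91))

mydata[mydata$Gender == "male", ]         # the NA row appears, filled with NAs
mydata[which(mydata$Gender == "male"), ]  # which() keeps only the TRUE rows
```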

When performing more complex analyses, using what R calls “modeling functions,” missing values are excluded automatically. However, when it comes to simple functions such as mean or median, R does the reverse, returning a missing value as the result if the variable contains even a single missing value. You get around that by specifying na.rm=TRUE on every function call, but why should you have to? While there are system-wide options for many things, such as the width of output lines, there is no option to avoid this annoyance. It’s not hard to create your own function that removes missing values by default, but that seems like overkill for such a simple annoyance.
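
A two-line sketch of the annoyance (the English variable is hypothetical):

```r
English <- c(85, 72, NA, 77, 98)
mean(English)                # NA: one missing value poisons the result
mean(English, na.rm = TRUE)  # 83: the mean of the valid values
```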

Neither of these conditions seems to offer any particular benefit to R. They’re minor inconveniences that R users learn to live with.

Odd Way of Counting Valid or Missing Values

The one function that could really benefit from excluding missing values, the length function, cannot exclude them! While most packages include a function named something like n or nvalid, R’s approach to counting valid responses is to (1) check whether each value is missing with the is.na function, (2) use the “not” symbol ! to find the non-missing values, (3) know that this generates a vector of TRUE/FALSE values with numeric values of 1/0 respectively, and finally (4) add them up with the sum function. That’s an awful lot of complexity compared to n(x). However, it’s easy to define your own n function, or you can use add-on packages that already contain one, such as the prettyR package’s n.valid function.
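
The four steps collapse into one idiom, sketched here on a hypothetical variable:

```r
English <- c(85, 72, NA, 77, 98)
sum(!is.na(English))  # valid values: 4 (TRUE/FALSE summed as 1/0)
sum(is.na(English))   # missing values: 1
```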

Too Many Data Structures

As previously mentioned, R has vectors, factors, matrices, arrays, data frames (datasets) and lists. And that’s just for starters! Modeling functions create many variations on these structures and also create whole new ones. Users are free to create their own data structures, and some of these have become quite popular. Along with all these structures comes a set of conversion functions that switch an object from one type to another, when possible. Given that so many other analytics packages get by with just one structure, the dataset, why go to all this trouble? If you counted up the data structures that exist in other packages’ matrix languages, you would see a similar amount of complexity. Additional power requires additional complexity.
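
A sketch of two conversions, including a classic factor trap:

```r
m  <- matrix(1:4, nrow = 2)
df <- as.data.frame(m)       # matrix -> data frame

f <- factor(c("10", "20"))
as.numeric(f)                # 1 2 : the internal level codes, a classic trap
as.numeric(as.character(f))  # 10 20 : the values you probably wanted
```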

Warped Dimensions

Two-dimensional objects easily become one-dimensional objects. For example, this way of selecting males uses the variable “Gender” as part of a two-dimensional data frame, so you need to have a comma follow the logical selection:

with(Talent[Talent$Gender == "Male", ],
    data.frame(English, Reading))

But in this very similar way of getting the same thing, Gender is selected as a one-dimensional vector and so adding the comma (which implies a second dimension) would generate an error message:

with(Talent,
    data.frame(
      English[Gender == "Male"],
      Reading[Gender == "Male"]))

Complex By-Group Analysis

Most packages let you repeat any analysis simply by adding a phrase like “by group” to it. R’s built-in approach is far more complex: it requires you to write a macro-like function that performs the analysis steps you need, and then apply that function by group. Other languages let you postpone learning that type of programming until you face more complex tasks. However, because such macro-like facilities are built into R’s core, the functions you write fit seamlessly into the complete system.
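
A minimal sketch of the write-a-function-then-apply-it pattern, using mtcars (the slope function is my own):

```r
# Write a small function, then apply it to each group:
slope <- function(d) coef(lm(mpg ~ wt, data = d))["wt"]
sapply(split(mtcars, mtcars$cyl), slope)  # one regression slope per cyl group
```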

Sparse Output

R’s output is often quite sparse. For example, when doing cross-tabulation, other packages routinely provide counts, cell percents, row/column percents and even marginal counts and percents. R’s built-in table function (e.g. table(a,b)) provides only counts. The reason is that such sparse output can be readily used as input to further analysis. Getting a bar plot of a cross-tabulation is as simple as barplot(table(a,b)). This piecemeal approach is what allows R to dispense with separate output management systems such as SAS’ ODS or SPSS’ OMS. However, add-on packages, such as the gmodels package with its CrossTable function, provide more comprehensive output essentially identical to that of other packages.
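
A sketch of counts feeding directly into further steps, using mtcars:

```r
counts <- table(mtcars$cyl, mtcars$am)
counts                 # bare counts, nothing more...
prop.table(counts, 1)  # ...but they feed directly into row proportions
barplot(counts)        # ...or straight into a plot
```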

Unformatted Output

The default output from SAS and SPSS is nicely formatted, making it easy to read and to paste into your word processor as complete tables. All future output can be set to journal style (or many other styles) simply by setting an option. R not only lacks true tabular output by default, it does not even provide tabs between columns. So beginners commonly paste the output into their word processor and select a mono-spaced font to keep the columns aligned.

R does have the ability to make its output look good, but it does so through additional packages such as compareGroups, knitr, pander, sjPlot, Sweave, texreg, xtable, and others. In fact, its ability to display output in complex tables (e.g. showing multiple models side by side) is better than that of any other package I’ve seen.

Complex & Inconsistent Output Management

Strongly related to by-group processing (above) is output management. In its simplest case, you might do one analysis and use output management to select what output to display. R reverses the usual process by showing you very little up front and letting you ask for more after creating a model (similar to Stata’s approach).

However, printed output from an analysis isn’t always the desired end result. Output often requires further processing. For example, at my university we routinely run the same salary regression model for each of over 250 departments. We don’t care about most of that output; we only want to see results for the few departments whose salaries seem to depend on gender or ethnicity. (We’re hoping to find none, of course!) R has a special data structure called a “list” that’s ideal for storing such complex output. However, each type of modeling function creates a unique list, with pieces of output stored in many types of structures and labeled in inconsistent ways. For example, p-values from three different types of analyses might be labeled “p”, “pvalue”, or “p.value”. Luckily, an add-on package named broom has commands that convert the output you need into data frames, standardizing the names as it does so. From there you can do whatever additional analyses you need.
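
A sketch of per-group models tidied into a standard data frame, assuming the broom package is installed:

```r
library(broom)  # assumes the broom package is installed
fits <- lapply(split(mtcars, mtcars$cyl),
               function(d) lm(mpg ~ wt, data = d))
tidy(fits[["4"]])  # standardized columns: term, estimate, ..., p.value
```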

However, R’s complexity here is not without its advantages. For example, it would be relatively easy for me to do two regression models per group, save the output, and then do an additional set of per-group analyses that compared the two models statistically. That would be quite a challenge for most statistics packages (see the help file of the dplyr package’s “do” function for an example.)

Unclear Way to Clear Memory

Another unusually complex approach R takes is the way it clears its workspace in your computer’s memory. While a simple “clear” command would suffice, the R approach is to ask for a listing of all objects in memory and then delete that list: rm(list = ls()). To understand this command you have to know that (1) ls() lists the objects in memory, (2) ls() returns those names in a character vector, (3) rm removes objects, and (4) rm’s “list” argument is not really asking for an R list, but for a character vector. That’s a lot to know for such a basic command!

This approach does have its advantages though. The command that lists objects in memory has a powerful way to search for various patterns in the names of variables or data sets that you might like to delete.
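
A sketch of that pattern-matching power (the object names are hypothetical):

```r
temp1 <- 1; temp2 <- 2; keep_me <- 3
rm(list = ls(pattern = "^temp"))  # delete only objects whose names match
ls()                              # keep_me survives
```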

Luckily, the popular RStudio front-end to R offers a broom icon that clears memory with a single click.

Lack of Graphical User Interface (GUI)

Like most other packages, R’s full power is only accessible through programming. However, unlike many others, it does not offer a standard GUI to help non-programmers do analyses. The two GUIs most like SPSS and Stata are R Commander and Deducer. While they offer enough analytic methods to make it through an undergraduate degree in statistics, they offer far less control than a powerful commercial GUI. Worse, beginners first see a programming environment and must then figure out how to find, install, and activate either GUI. Given that GUIs are aimed at people with fewer computer skills, this is a problem.

Lack of Beneficial Constraints

To this point, I have listed the main aspects of R that confuse beginners. Many of them are the result of R’s power and flexibility. R tightly integrates commands for (1) data management, (2) data analysis and graphics, (3) macro facilities (4) output management and (5) matrix algebra capabilities. That combination, along with its rich set of data structures and the ease with which anyone can extend its capabilities, make R a potent tool for analytics.

However, the converse is provided by its competitors, such as SAS, SPSS and Stata. A notable advantage they offer is “beneficial constraints.” These packages use one kind of input, provide one kind of output, provide very limited ways to select variables, provide one way to rename variables, provide one way to do by-group analyses, and are limited in a variety of other ways. Their macro and matrix languages, and output management systems, are separate enough to be almost invisible to beginners and even intermediate users. Those constraints limit what you can do and how you can do it, making the software much easier to learn and use, especially for those who only occasionally need it. They provide power that’s perfectly adequate to solve any problem for which they have pre-written solutions, and the set of solutions they provide is rich. (Stata also provides an extensive set of user-written solutions.)

Conclusion

There are many aspects of R that make it hard for beginners to learn. Some of these are due to R’s unique nature, some are due to its power, flexibility and extensibility. Others are due to aspects of its design that don’t really offer benefits. However, R’s power, free price, and open source nature attract developers, and the wide array of add-on tools have resulted in software that is growing rapidly in popularity.

R has been my main analytics tool of choice for close to a decade now, and I have found its many benefits to be well worth its peccadilloes. If you currently program in SAS, SPSS or Stata and find its use comfortable, then I recommend giving R a try to see how you like it. I hope this article will help prepare you for some of the pitfalls that you’re likely to run into by explaining how some of them offer long-term advantages, and by offering a selection of add-on packages to help ease your transition. However, if you use one of those languages and view it as challenging, then learning R may not be for you. Don’t fret, you can enjoy the simplicity of other software, while calling R functions from that software as I describe here.

If you’re already an R user and I’ve missed any of your pet peeves, please list them in the comments section below.


Thanks to Patrick Burns, Tal Galili, Joshua M. Price, Drew Schmidt, and Kevin Wright for their suggestions that improved this article.


54 Responses to Why R is Hard to Learn

  3. freakalytics says:

    A very valuable article Bob, thanks so much!

    I could go into many problems with other traditional analytics languages and even new analytics tools, but those are all created by software companies with centralized control and paid developers. Frankly, it’s amazing how well many of the libraries of R work together while maintaining much of the original flexibility!

    I know that many of the current leaders in the R development world are attempting to address these issues with “modernized”, more broadly comprehensible libraries suitable for traditional R users and less technical types. I am grateful for the incredible effort invested by so many to make R what it is today – simply the best platform for addressing a wide range of real-world analytic issues in a deep, comprehensive manner.

    • Bob Muenchen says:

      Hi Stephen,

      I’m glad you liked it. It is an amazing achievement that something extended by so many people works as well as it does. To counterbalance this I should write a condensed version of Chapter 1 of my books on “Why R is Awesome!”


      • Stephen McDaniel says:

        Ironically, by addressing the frustrations of R in this article, you will help many more see that “R is Awesome!” You will also help improve the adoption rate and decrease the chances of incorrect analyses.

      • Bob Muenchen says:

        I hope so! More importantly, I hope it helps people choose the tool that’s right for them.

  4. Beliavsky says:

    Thanks for the article. I think there is a missing closing parenthesis in


  5. Kevin Wright says:

    I agree with your complaint about un-formatted output. There’s many aspects to that, one of which is the over-use of scientific format for numbers. I created a package called “lucid” that provides a function for cleaner printing of floating point numbers.

    A short vignette (lucid.pdf) provides examples.
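    The over-use of scientific format is easy to see in base R; as a base-R workaround (not the lucid package itself), `format()`’s `scientific` argument or the `scipen` option gives cleaner fixed notation:

    ```r
    x <- 0.000001
    print(x)                       # prints in scientific form, 1e-06
    format(x, scientific = FALSE)  # "0.000001" as a fixed-notation string
    options(scipen = 999)          # bias all subsequent printing toward fixed notation
    print(x)
    ```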

  6. beginneR says:

    Nice article! As a side note: for “mydata %$% summary(x)” to work, you’ll (currently) need to load the magrittr package (>= 1.5). With dplyr alone, it won’t work.
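    Without magrittr, base R’s with() gives the same effect as the exposition pipe; mydata and x here are made-up names for illustration:

    ```r
    # with() evaluates an expression inside a data frame, much as %$% does
    mydata <- data.frame(x = c(1, 2, 3, NA))
    with(mydata, summary(x))
    # With magrittr (>= 1.5) attached, the same idea reads:
    #   library(magrittr)
    #   mydata %$% summary(x)
    ```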

  7. K says:

    Do you have any recommendations for resources to learn R? In my field, it is valuable to learn R, but I am still stuck on Excel. I know I can pick it up, but need the right guide! That is, in addition to your work! I am looking for any resources possible. Thank you.

  8. Brilliant work. Your writing is an inspiration. Very nicely done.

  9. Ken says:

    Hi Bob,

    I really enjoyed the article and found it very informative – I was nodding my head and agreeing when I read it.



  10. Sriram says:

    Excellent article. In fact it provides a very good bird’s-eye view of R. I find that in order to be a good programmer in any language, you need to be critical of some features and consider how you would have written those constructs. Coming from other programming languages, I found R a weird language to start with. But once you start focusing on the analytics portion, you will find that its advantages far outweigh the disadvantages. With support packages like shiny, the brilliant plyr, ggplot etc., developing data products is becoming easy. With more packages to integrate with back-end languages (C, C++, etc.), including cloud systems, and with front-end web-based languages or other scripts, R can be a very good sandwich for analytics. I personally feel (not sure how possible it is) that it should focus on utilizing the features available in other languages and scripts, providing a pipe rather than redoing everything on its own in trying to become an all-in-one analytics tool.
    Thank you for sharing your insights.

    • Bob Muenchen says:

      Hi Sriram,

      Thanks for your comments. I also found R to be weird to start with. I would alternate between thinking it was horrible one day and then discovering a feature that was just brilliant the next. For people who are happy programming, it’s well worth the time it takes to get used to it. For people who prefer to point-and-click (many of you SPSS users know who I’m talking about) then R’s probably not a good choice.


  11. WillR says:


    Good article — I have printed it to review again later.

    The Coursera specialization stream seems like a good idea for those needing help with a beginning effort at programming.

  12. Oliver Keyes says:

    Excellent article!

    Random trivia around the “Inconsistent Function Names” discussion – I think this comes from two problems. The first is that we still have no style guide, à la PEP 8, and so naming is always going to be markedly inconsistent between libraries and even within core. The second is that the sheer amount of evolution R’s syntax has gone through has created a vast array of naming conventions within base R, each one adapted to avoid (now-historical) problems.

    For example: mean(x, na.rm = FALSE). Why na.rm? Why use the full stop rather than an underscore? Well, in S-PLUS (and early R), the underscore was used for assignment – and even though that’s no longer the case, a lot of R’s functionality dates from that period (or earlier, because it was ported from S). So that’s no longer a problem programmers have to solve, but it’s a problem a lot of code is written around – and people learn the styles of a language by looking at how the people before them implemented things, whether or not it makes things readable. And then some people use underscores to try to increase readability, and… http://xkcd.com/927/
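    For readers new to the argument itself, a minimal illustration of what na.rm controls:

    ```r
    x <- c(1, 2, NA)
    mean(x)                # NA: missing values propagate by default
    mean(x, na.rm = TRUE)  # 1.5: NAs are removed before averaging
    ```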

  16. ReesM says:

    Bob, thank you for all your efforts at instruction in R.

    What would it take to write a package, perhaps called erroR, so that when some call goes south you could enter guideme(), the core function of erroR, and R would take the cryptic error message and explain in human terms what it is telling the user and what some solutions might be?

    This would interest me quite a bit to help on.

    • Bob Muenchen says:

      Hi Rees,

      I think it would be a herculean task. Often the messages make no sense unless you dig inside the function returning the message to see that it’s talking about some other function call within that function. You’d be quite the hero if you could pull it off though!


  20. Stefanie says:

    I agree with Stephen: if you read the comments in detail, it shows that in some (most…) aspects R is in fact awesome. However, the way the article is written rather suggests that it is awful, which I regret. If users of other tools are confused because their approaches are simply no longer needed (like the “sort” example), it’s not the fault of R. And I am so thankful that I can address my variables by name or column as suits the setting best, that I do not have to use an “ifelse” for recoding, and that it warns me about missing values instead of just letting them drop out more or less silently, just to give a few examples.

    • Bob Muenchen says:

      Hi Stefanie,

      I agree that R is awesome. I use it for all my own analysis. I hope that by pointing out the aspects of it that cause beginners frustration that they’ll be able to get past those areas more easily. On the other hand, if they view a simpler language as being difficult, then they’re probably not going to be happy switching to R.


  23. Closures have been badly implemented. Here is a sample taken from https://class.coursera.org/rprog-014/quiz:
    Consider the following function

    f <- function(x) {
      g <- function(y) {
        y + z
      }
      z <- 4
      x + g(x)
    }
    z <- 10
    I would have expected z in g to be 10 or even better to indicate undefined variable z. But it is 4; because variables are mutable. Does not look like lexical scoping to me.
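    For what it’s worth, R’s behavior here is consistent with lexical scoping: g looks z up in the environment where g was defined (the body of f), not the global environment, and the lookup happens when g is called, after z <- 4 has executed. A runnable restatement of the example above:

    ```r
    f <- function(x) {
      g <- function(y) y + z  # z is free in g; R resolves it in f's environment
      z <- 4                  # assigned before g is called, so g sees z == 4
      x + g(x)
    }
    z <- 10                   # this global z is never consulted inside f
    f(3)                      # 10, i.e. 3 + (3 + 4)
    ```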

  24. yunarsoa says:

    Nice article! I am a beginner in R. I started learning R to do some statistical work in my research, but I did find some inconsistencies in it that differ from other programming languages. One of them is the way you give names to data using the names() function. In most programming languages, assignment targets a variable, but in R, to give names to data you use “names(var) <- c(some names)”, which is very unusual and irrational compared to common programming languages.

    Anyway, nice article!

    • Bob Muenchen says:

      Hi Yunarsoa,

      That surprised me too when I started with R. However, when I need to copy a set of names from one data set to another, it’s fantastic!
      names(NewDataSet) <- names(OldDataSet)
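      A minimal runnable sketch of that pattern, using made-up data frames OldDataSet and NewDataSet:

      ```r
      OldDataSet <- data.frame(a = 1, b = 2)
      NewDataSet <- data.frame(10, 20)          # columns get default names
      names(NewDataSet) <- names(OldDataSet)    # copy the column names across
      names(NewDataSet)                         # "a" "b"
      ```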


  25. Walter says:

    Great overview, I also nodded often.

    But, how to deal with it?

    I have been teaching R for about 6 years now in a class compulsory for psychology students. This year I changed the paradigm.

    a) Reproducible Research, e.g. Markdown documents within RStudio from lesson 1

    b) dplyr and tidyr from lesson 1

    c) Don’t be afraid to use functions (own functions)

    Right now (lesson 9) we did something like:

    presentAndPassed <- function(x) sum(x >= 1, na.rm = TRUE)
    Class %>%
      select( starts_with("Points"), starts_with("week") ) %>%
      transmute( Mn = apply( ., 1, mean, na.rm = TRUE ),
                 present = apply( ., 1, presentAndPassed ) ) %>%
      data.frame( Class, . ) -> Class

    We do a lot of drawing, since this approach follows set theory. I was surprised that you could show code like the above and people could read it off and explain it like a story. And –
    * you almost never have more than 2 closing parentheses
    * you almost never have temporary objects
    * you can develop/use the name of your main dataframe continuously since in a document to be knitted it will always develop top to bottom

    However, sometimes we have to leave this approach; the psych package, which we use heavily, is not really compatible, so we prepare the data outside and then jump in…

    One remark at the end. At the beginning I taught R to former SPSS users, so I had to mimic SPSS all the time to make it seem comparable. Now I use some operations right from the beginning that are very cumbersome to do with SPSS, like creating random variables.

    The development R takes is impressive.

    All the best to all of you,


  27. Random dude says:

    Hi Bob,

    Thanks for sharing your insights and for the wonderfully-written piece. You have a gift for simplifying an otherwise dry topic.

    I’ve been using SAS for a few years now for statistical analyses in public health research (regressions, survival analyses, etc.) and want to explore R as an alternative.

    Would you be kind enough to suggest a “step-by-step” plan for someone looking to learn R from scratch? What resources would you recommend? I learn best with ample examples and have found Ron Cody’s books really helpful when I was learning SAS. I don’t have time to enroll in any courses that require physically attending classes, unfortunately.

    Thanks again!

    • Bob Muenchen says:

      Hi Random Dude,

      Love your name! My book, R for SAS and SPSS Users, is aimed at people who know either SAS or SPSS (they’re similar in many ways) and who are starting to learn R. It starts at ground zero and presents things in easy-to-absorb steps. The code for almost everything is shown in both R and SAS so you can compare them. Much of the trouble I had learning R was due to the fact that I kept expecting it to work like SAS, so I start every topic by saying how SAS (and SPSS) does it, then move on to how R does it and why. I think you’d enjoy the perspective.

      Much of the material is also covered in this course at DataCamp.com: https://www.datacamp.com/courses/r-for-sas-spss-and-stata-users-r-tutorial. I also teach that topic on visits to organizations that are migrating to R from SAS, SPSS, or Stata.



  28. Fred says:

    Thank you for this – it validates many of my opinions. I enjoy using R and find it to be very useful and unique. However, as someone with an advanced degree in human computer interaction, using R is often maddening. There are endless violations of consistency and other basic principles of user interaction. Python is an interesting contrast in being a language that strives to be as simple and consistent as possible while still being extremely powerful.

    • Bob Muenchen says:

      Hi Fred,

      I think Python’s an excellent language. However, it’s way behind R in terms of contributed packages. The Julia language has much of Python’s simplicity and consistency, and is much faster. It will be interesting to see how they compete in the future.

