Only last August I wrote that, among scholars, the use of R had probably exceeded that of SPSS to become their most widely used software for analytics. That forecast was based on Google Scholar searches focused on one year at a time, from 1995 through 2014. Each year from 2010 through 2014, I re-collected the entire data set in case Google changed its search algorithm enough to affect the overall pattern. The data stayed roughly the same for those four years, but Google Scholar now finds almost twice as many articles for SPSS (at its peak year of 2008) as it found last year, and 12% more articles for SAS. Search results for articles that used R changed only slightly, with fewer in the early years and more in the later ones. So R did *not* become the most widely used analytics software among academics in 2014. It’s unlikely to become so for another two years, unless present trends change.

So what happened? We’re looking back across many years, so while it’s possible that SPSS suddenly became much more popular in 2014, that could not account for lifting the whole trend line. It’s possible Google Scholar improved its algorithm to find articles that existed previously. It’s also possible that new journal archives have opened themselves up to being indexed by Google. But why would that affect SPSS more than SAS or R? SPSS is menu-driven, so it’s easy to install with its menus and dialog boxes translated into many languages. Since SAS and R are much more frequently used via their English-based languages, they may not be as popular in non-English-speaking countries. Therefore, one might see a disproportionate impact on SPSS as new non-English archives become available. If you have an alternate hypothesis, please leave it in the comments below.

The remainder of this post is the complete updated section on this topic from The Popularity of Data Analysis Software:

**Scholarly Articles**

The more popular a software package is, the more likely it will appear in scholarly publications as a topic or as a tool of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and miss some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Analytics Articles. Since Google regularly improves its search algorithm, each year I re-collect the data for all years.

Figure 2a shows the number of articles found for each software package for the most recent complete year, 2014. SPSS is by far the most dominant package, likely due to its balance between power and ease-of-use. SAS has around half as many, followed by MATLAB and R. The software from Java through Statgraphics show a slow decline in usage from highest to lowest. Note that the general purpose software C, C++, C#, MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest.

From RapidMiner on down, the counts appear to be zero. That’s not the case; the counts are just very low compared to those of the more popular packages, which are used in tens of thousands of articles. Figure 2b shows only those packages that have fewer than 825 articles (i.e. the bottom part of Fig. 2a), so we can see how they compare. RapidMiner, KNIME, SPSS Modeler and SAS Enterprise Miner all use the powerful and easy-to-use workflow interface, but their use has not yet caught on among scholars. BMDP is one of the oldest packages in existence. Its use has been declining for many years, but it’s still hanging in there. The bottom half of this figure contains the newcomers, with the notable exception of Megaputer, whose PolyAnalyst software has been around for many years now.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2c I’ve plotted the same scholarly-use data for 1995 through 2014, the last complete year of data when this graph was made. As in Figure 2a, SPSS has a clear lead, but now you can see that its dominance peaked in 2008 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and it also peaked around 2008. Note that the decline in the number of articles that used SPSS or SAS is not balanced by the increase in the other software shown. This is likely due to the fact that those two leaders faced increasing competition from many more software packages than can be shown in this type of graph (such as those shown in Figure 2a).

Since SAS and SPSS dominate the vertical space in Figure 2c by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP, with the result shown in Figure 2d. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. If the current trends continue, the use of R may pass that of SPSS and SAS by the end of 2016. Note that current trends have shifted before as discussed here.

Stata has moved into fourth place, crossing above Statistica in 2014. The growth in the use of Stata is more rapid than all the classic statistics packages except for R. The use of Statistica, Minitab, Systat and JMP are next in popularity, respectively, with their growth roughly parallel to one another. [Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”]

I’ll announce future updates on Twitter, where you can follow me as @BobMuenchen.

Put your graphs (2a in particular) on a log scale so the bottom end doesn’t disappear, perhaps?

I have a dumb question related to graph 2c: I understand it plots the absolute number of hits, so after the 2007–2008 peak, why did the total number of citations go down? I would expect an increasing trend in the absolute number of citations, but the plot shows a dramatic decrease in the total number of scholarly papers… where are these papers, or which software is used in them?

Hi White Lemming,

As I mention in the text, “Note that the decline in the number of articles that used SPSS or SAS is not balanced by the increase in the other software shown. This is likely due to the fact that those two leaders faced increasing competition from many more software packages than can be shown in this type of graph (such as those shown in Figure 2a).”

Does that clarify it?

Cheers,

Bob

Hi Bob, having worked for various non-government organizations for many years, I know that SPSS is one of the few statistical packages that we could afford to use. And in my work with local researchers in developing countries, it was also popular with them due to the reasonable initial cost and the fact that one could continue to use it for several years before needing to upgrade to a newer version. SAS’s requirement to pay for a license every year is completely out of reach for those of us who work in a non-university environment and likely also for those working in low-resource university environments. As you suggest in your post, R is a bit intimidating for those who prefer to work through a menu-driven interface. I wonder if the resurgence in SPSS citations in publications may reflect an increase in publications by researchers in low- and middle-income countries.

Hi Kendra,

That’s a good point. As you mentioned, poorer countries are less likely to be able to afford SAS, and they might also be behind in getting their journals indexed by Google. That combination could certainly result in the pattern we see.

Cheers,

Bob

Hi Bob,

I am from one of the poorer countries. In my country, most researchers get their work published in journals from the richer countries, due to the impact factors and reputation.

regards

Leo

Hi Leo,

Thanks for writing. That’s a good point about impact factor. If that’s how most papers get published from the less developed countries, then that would poke a hole in that hypothesis. Unless we hear from someone who works at Google, we may never know what accounts for the sharp increase in SPSS counts.

Cheers,

Bob

I don’t really buy the explanation for the decline in the total number of citations. Are there really hundreds of obscure packages getting hundreds or thousands of citations each to make up the difference? I see no evidence for that.

Hi Andrew,

Thanks for providing the incentive to investigate this further. I don’t have all the data I would like, but in 2008 the classic stat packages listed in Figures 2c and 2d totaled 326,870 hits on Google Scholar. For 2014, I have data for the 31 packages listed in Figure 2a, which total 273,162 (I don’t have such comprehensive data for 2008). So we’re looking at an overall decline of 53,708 hits. I have not collected data on analytics tools from Teradata, Oracle, Microsoft, several SAP tools, etc., since they require data storage in their own databases. I’ve also been skipping data on the use of Excel for analytics, as it has pretty serious problems with some of its algorithms. However, I just searched for the use of Microsoft Excel for analytics and found 19,400 hits in 2014. A similar search for Oracle found 10,600 more. So my replacement hypothesis has potential, but we’ll probably never know for sure. People have written me saying that citing software is no longer required by some journals, so that could account for the apparent drop in use for SPSS and SAS. There’s also the recession, which may have decreased the amount of research done.
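The arithmetic above can be sketched as a quick check (using only the figures quoted in this comment, not new data):

```python
# Rough reconciliation of the gap described above, using the hit counts
# quoted in the text (Google Scholar hits).
classic_2008 = 326_870   # classic stat packages in Figs. 2c/2d, 2008
all31_2014 = 273_162     # the 31 packages in Fig. 2a, 2014
decline = classic_2008 - all31_2014
print(decline)           # overall decline in hits: 53708

# Hits for tools not tracked in the figures, from the quick searches above:
excel_2014 = 19_400
oracle_2014 = 10_600
accounted_for = excel_2014 + oracle_2014
print(accounted_for)             # 30000 of the gap covered by Excel + Oracle
print(decline - accounted_for)   # 23708 still unexplained
```

So about 30,000 of the 53,708-hit decline could plausibly be absorbed by just two untracked tools, which is why the "more competitors than fit on the graph" hypothesis has potential without fully closing the gap.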

Cheers,

Bob

It’s interesting to note that the plunge in SPSS citations starting in 2009 correlates with the IBM acquisition.

Hi Thomas,

That’s an intriguing point. At that time, IBM put significant pressure on us to change our contract in a way that would have resulted in a doubling of the cost. SAS Institute did something similar around 2005. That resulted in our removing SAS from over 1,000 machines on our campus. More recently, both companies have become much friendlier to academia, offering much of their software for free. I wonder what changed their minds. 😉

Bob

Bob,

SAS’ standard free academic version is a virtualized instance with Base, STAT, GRAPH and Guide, with some serious restrictions on input and output. Are they (privately) offering universities something better than that?

People may not be saying that they did the analysis with SPSS (or with Excel) but just describing which statistic they used, or it may be buried deep in the methods. Google getting access to more full texts (methods, figure legends, supplements) might be why SPSS now ranks higher. Conversely, the decline of SPSS might mean that authors are increasingly not stating that they did the analysis with SPSS (or with Excel), or that Excel is being used more frequently (and also not stated).

Here is an example where SPSS is not mentioned, although almost certainly used:

http://www.ncbi.nlm.nih.gov/pubmed/?term=25078064

“Statistical significance was determined using 1-way ANOVA, with Dunnett’s post-test” Any time a “named” test is used in a biochemistry/molecular biology paper, it was probably done with SPSS, but they just give the name of the test.

I think that comparing R to SPSS is very much like comparing apples to oranges (or bananas even). You can do so much more with R than with SPSS, and they are being used for very different tasks. R is more comparable to SAS or Stata. The rise of R is most likely related to the rise of bioinformatics; bioinformaticians are more careful about these things. SPSS is probably competing more with Excel (which can also do very simple statistics, and SPSS has a somewhat similar user interface). That being said, there is a generational change going on, with even biologists switching to R (and especially using RStudio more frequently). But since using R still mostly requires writing code (or at least scripts), it is going to take a while to replace SPSS (for real).

Hi Pekka,

Thanks for your interesting comments. I think if R had a user interface that was as easy to use as SPSS, its growth would have been much faster. As much as I like the programming approach to data analysis, the majority of people who need to analyze data will probably never be comfortable programming.

Cheers,

Bob

Giving an R user an interface that is completely Excel-like but retains the flexibility of R may not be possible, although you can increasingly use R packages and functions from other software (and even from Excel, although that is a bit clunky at the moment). In biology, these external tools are used to do standardized, pre-configured bioinformatics and biostatistics analyses, providing a GUI while using R in the background. Examples (from a few years ago already) include TM4: MeV, Galaxy, Chipster and Taverna. The last two are workflow tools that use mostly R-based modules, from which users can compose almost any analysis workflow, or they can use workflows shared by others (especially in Taverna). The Galaxy Project also gives access to more tailor-made R-based tools for bioinformatics and is very good for genome analysis. So it is actually possible to do sophisticated R-based analysis without writing any scripts. Biologists are being taught these tools, but they are also being taught R directly; RStudio has made learning R coding easy enough that the additional effort of learning to do the analyses in R itself, after learning bioinformatics (and the requisite statistics), is not that much more. The biggest hurdle is getting biologists to learn any statistics in the first place. But biology and biochemistry are becoming quantitative sciences, like physics or chemistry before them, and if physicists can learn MATLAB, then biologists can probably learn R (one hopes). But this is still going to take some time, maybe another 5–10 years of exponential technology development, until it becomes impossible to do any science in these fields without generating so much data that analyzing it requires heavy-duty statistics tools like R.

Hi Pekka,

I’m very interested in workflow management tools such as those you mentioned. I’ve been trying out KNIME recently, which includes its own set of tools and can also call R, Python, SAS, Java, Weka, etc. Tools like that provide a very interesting blend of capabilities in a form that is easy to use, powerful and reusable with a minimum of programming.

Cheers,

Bob

Yes, I think that the workflow tools are really promising in many ways. As you said, a big advantage is the ability to call a wide variety of functions, including from web services. Taverna does a lot of this, and there is a big European project for providing biosciences web services, Elixir (https://www.elixir-europe.org), which currently has about 1,469 running services. These, along with local resources, can also be used to build workflows. I am a bit familiar with KNIME, since one of its authors works in a toxicology project I am also part of (SEURAT-1). KNIME has a wider focus than just bioinformatics. OpenTox is a provider of web services for chemoinformatics too. But since I mostly do bioinformatics, I have not felt the need to go beyond R; R/Bioconductor has over 900 packages for bioinformatics which share the same data model and are interoperable.
