How to Search for Analytics Articles

This article describes the technical details of how to search for scholarly articles in the field of analytics. The goal is not to optimize the search for a particular article, but rather to find all articles that either use or write about a particular software package. Counts of these articles are then used to estimate the market share of each package. The results are displayed and discussed in The Popularity of Data Analysis Software.
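The article does not spell out how counts become market-share estimates, so the following is only a minimal sketch that assumes each package's share is simply its fraction of the total article count. The function name and the example counts are hypothetical placeholders, not real search results.

```python
def market_share(counts):
    """Return each package's share of the total article count, in percent.

    Assumes share is proportional to article count (an assumption made
    for this sketch, not a method stated in the article).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no articles counted")
    return {name: 100.0 * n / total for name, n in counts.items()}


# Placeholder counts for illustration only; these are NOT real results.
example_counts = {"Package A": 600, "Package B": 300, "Package C": 100}

for name, share in market_share(example_counts).items():
    print(f"{name}: {share:.1f}%")
```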


Here, in brief, are the steps I use to search Google Scholar:

1. For software with unambiguous names, simply search on the name in quotes. Often the quotes are not needed, but it can be difficult to determine when they are. For example, Tibco’s Spotfire program has a unique name, but Google Scholar treats articles about firefighting that include the separate terms “spot fire” as equivalent to “spotfire” unless you enclose the name in quotes.

2. Scholarly papers are supposed to cite the software they use in a standard way. For example, the use of SAS should reference the vendor, “SAS Institute”. To verify how well such a citation works, it’s good to search for its opposite. For example, the search string “SPSS” -“SPSS, Inc.” excludes the vendor (the minus sign “-” excludes) but still finds hits relevant to the SPSS Statistics product. In contrast, the results of a similar search for SAS, “SAS” -“SAS Institute”, consist mostly of irrelevant hits, including many authors whose first or last name happens to be “Sas”. The initials “SAS” are also the equivalent of “Inc.” in Spanish.

Searching by vendor name is very helpful, but authors don’t always cite their software. Surprisingly, authors occasionally cite the vendor but not the package. That is the case for Statsoft’s Statistica. Statistica means “statistics” in Italian, which may lead authors to write statements like, “we used the statistics package from Statsoft.”

3. Some software has add-on packages that are so well known that the main package may not be mentioned at all. For example, R’s ggplot2 package may be cited without reporting that R itself was used.

4. Some software names, especially Fico, Java, Python, and SAS, are also common names of people and/or geographic locations. Authors can be excluded from Google Scholar searches with “-author:java”. Unfortunately, this exclusion applies only to the author field, not to the references at the end of a paper. That means that the counts for these packages are inflated unless the search string specifically excludes that possibility.

5. General purpose programming languages are cited most often for tasks that have nothing to do with analytics. Adding inclusion terms helps focus the search. Examples include categories such as “machine learning” and specific methods, such as “regression analysis.” These search terms were added only in English, so the results are underestimates of the true analytics usage of this type of software.

To make the comparisons among packages most equitable, it would be ideal to include the same set of inclusion terms for all the software studied here. However, that would mean under-counting the use of the special-purpose software, which I prefer to avoid.

6. Regarding logic, Google uses a blank between terms to represent logical “AND” (the plus sign is no longer accepted for this purpose). To perform a logical “OR”, you must type “OR” in capital letters, or Google will search for the word “or”! Parentheses prioritize the order of the logic as usual.
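The rules in the steps above (exact-phrase quotes, the minus sign, “-author:”, capitalized OR, and parentheses) can be sketched as a small query builder. This is only an illustration: the helper names are hypothetical, and the URL pattern in the last line is the one a browser shows for a Google Scholar search, included so the result can be pasted into an address bar.

```python
from urllib.parse import urlencode


def quote(term):
    """Wrap a term in double quotes so it is matched as an exact phrase."""
    return f'"{term}"'


def any_of(terms):
    """Join quoted terms with a capitalized OR, grouped by parentheses."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def build_query(required=(), excluded=(), excluded_authors=(), include_any=()):
    """Assemble a query string; a blank between parts acts as logical AND."""
    parts = [quote(t) for t in required]
    parts += [f"-{quote(t)}" for t in excluded]          # minus sign excludes
    parts += [f"-author:{a}" for a in excluded_authors]  # author field only
    if include_any:
        parts.append(any_of(include_any))
    return " ".join(parts)


# The check from step 2: SPSS hits that do not cite the vendor.
spss_check = build_query(required=["SPSS"], excluded=["SPSS, Inc."])
print(spss_check)

# URL-encode the query so it can be pasted into a browser.
print("https://scholar.google.com/scholar?" + urlencode({"q": spss_check}))
```

Building the string from parts makes it harder to forget a quote or to type a lowercase “or” by accident.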

Software Names and Their Search Terms

Here is a list of the actual search terms I used for each piece of software.

Actuate: "Actuate BIRT"

Alpine:  "Alpine Data Labs"

Alteryx: "Alteryx"

Angoss:  "Angoss"

BMDP:    "BMDP" -marrow 
[removing Bone Marrow Donation Program]

C++ / C#: 
("C++" OR "C#") ("statistical analysis" OR 
"t test" OR "regression analysis" OR 
"quantitative analysis" OR 
"data analytics" OR "machine learning" OR 
"artificial intelligence" OR 
"analysis of variance" OR "anova" OR 
"chi square" OR "data mining")

Enterprise Miner: "Enterprise Miner"

FICO: [Not tracking this as I've found it impossible to
separate the analytics from the credit checking even
using inclusion factors.]

Hadoop: "Hadoop"

Infocentricity: "Infocentricity"
[Note: do not add "OR Xeno", which is useful when 
 searching for jobs. It adds a great deal of ambiguity 
 to this type of search.]

Java:
java -author:java -weka -"Practical Machine Learning" 
-indonesia ("statistical analysis" OR "t test" OR 
 "regression analysis" OR "quantitative analysis" OR 
 "data analytics" OR "machine learning" OR 
 "artificial intelligence" OR "analysis of variance" OR 
 "anova" OR "chi square" OR "data mining")

JMP: "JMP" "SAS Institute"

Julia:
"Julia: A Fast Dynamic Language for Technical Computing"



Lavastorm: "Lavastorm"

MATLAB:
"MATLAB" ("statistical analysis" OR "t test" OR 
"regression analysis" OR "quantitative analysis" OR 
"data analytics" OR "machine learning" OR 
"artificial intelligence" OR "analysis of variance" OR 
"anova" OR "chi square" OR "data mining")

Megaputer: "Megaputer" OR "Polyanalyst"

Minitab: "Minitab" 

NCSS: "Number Cruncher Statistical System"
[Cannot use "NCSS" for this as it stands for over 
 15 organizations]

Pentaho: "Pentaho"

PolyAnalyst: "PolyAnalyst"

Python:
python -author:python -snake 
("statistical analysis" OR "t test" OR 
 "regression analysis" OR 
 "quantitative analysis" OR "data analytics" OR 
 "machine learning" OR "artificial intelligence" OR 
 "analysis of variance" OR "anova" OR "chi square" OR 
 "data mining") 

R:
"" OR "R development core team" OR "lme4" OR 
"bioconductor" OR "RColorBrewer" OR "the R software" OR 
"the R project" OR "ggplot2" OR "Hmisc" OR "rcpp" OR "plyr" OR 
"knitr" OR "RODBC" OR "stringr" OR "mass package"

RapidMiner: "RapidMiner" 

Revolution Analytics: "Revolution Analytics"
[Note: Merged with Microsoft so keywords are uncertain.]

Salford Systems: "Salford Systems" 


SAS:
"SAS Institute" -JMP -"Enterprise Miner"
[Note: This under-counts SAS slightly, but I haven't found
 a way around the problem, given that "Sas" is a popular 
 first and last name for authors. Also, in Spanish, 
 "S.A.S." is the equivalent of "Inc." in English 
 (Sociedad por acciones simplificadas).]

SAS Enterprise Miner: "Enterprise Miner"

Spotfire: "Spotfire" -fire -burn
[I've stopped collecting this data.]

SPSS: SPSS -"SPSS Modeler" -"Amos"
[The letters "SPSS" also stand for a few rare topics, which
 I estimate cause over-counting of only 0.28%.]

SPSS Modeler: "SPSS Modeler" 

Stata:
("stata" "college station") OR "StataCorp" OR "Stata Corp" OR 
"Stata Journal" OR "Stata Press" OR "stata command" OR 
"stata module" 

Statgraphics: "Statgraphics" 

Statistica: "Statsoft" 

Systat: "Systat" 

Tableau (not currently tracking this one): 
"Tableau Software" OR "Tableau Desktop" OR 
"Tableau Online" OR "Tableau Server" 
[Don't include "Tableau Public", it's a common French term.] 

WEKA:
WEKA ("machine learning" OR "data mining")
[Note: The following search string, used in previous years
(before March 2015), under-counted:
"WEKA Data Mining" OR 
"Waikato Environment for Knowledge Analysis"]

Inclusion Terms

While many of the packages are clearly focused on advanced analytics, the more general-purpose ones (C++, C#, Java and Python) are not. So to determine the best way to focus the searches, I compiled a list of relevant terms commonly used in scholarly papers, then searched for documents that included them, one at a time. I counted the number of documents for each term and tracked how likely it was to result in an accurate hit. The latter was done using the time-honored “I know it when I see it” approach. (I’m quite familiar with advanced text analytics, but I don’t have time to extract all the data and do it.) The items marked with a “*” below show the terms used. These counts were collected on 5/11/2014, but given that I was searching across all years, the prevalence of the various terms is likely to shift only slowly as time passes.

   Search Terms               Number of Articles
Survey  (not well focused)       5,300,000 
Statistical (not well focused)   4,860,000 
Statistics (not well focused)    4,770,000 
Statistical analysis *           3,670,000 
t test *                         3,480,000
regression analysis *            2,920,000 
linear regression                2,650,000 
Quantitative analysis *          2,570,000 
Data analytics *                 2,380,000 
Machine learning *               1,740,000 
Artificial intelligence *        1,720,000 
analysis of variance             1,570,000
chi square                       1,490,000
anova                            1,340,000 
Survey research                  1,230,000 
Data mining *                    1,210,000 
Statistical software *           1,120,000 
logistic regression              1,080,000 
nonparametric                      800,000 
Analytics (not well focused)       519,000 
Statistical package                347,000 
Decision Trees                     169,000 
Business intelligence              146,000 
Statistical modeling *             145,000 
Analyze data (not well focused)    125,000 
Big Data *                          51,700 
Predictive modeling *               39,400 
Predictive analytics *               9,540 
Business analytics *                 7,660 
Advanced Analytics                   3,700
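The C++ / C#, Java, MATLAB, and Python searches listed earlier all end with the same inclusion clause. One way to keep the four strings in sync is to store that clause once and append it to each prefix. The sketch below is only an illustration: the variable names are made up, while the terms and prefixes are copied from the search strings in this article.

```python
# Inclusion terms, in the order they appear in the search strings.
INCLUSION_TERMS = [
    "statistical analysis", "t test", "regression analysis",
    "quantitative analysis", "data analytics", "machine learning",
    "artificial intelligence", "analysis of variance", "anova",
    "chi square", "data mining",
]

# Quote each term and join with a capitalized OR inside parentheses.
INCLUSION_CLAUSE = "(" + " OR ".join(f'"{t}"' for t in INCLUSION_TERMS) + ")"

# The package-specific part of each general-purpose search.
PREFIXES = {
    "C++ / C#": '("C++" OR "C#")',
    "Java": 'java -author:java -weka -"Practical Machine Learning" -indonesia',
    "MATLAB": '"MATLAB"',
    "Python": "python -author:python -snake",
}

# A blank between the prefix and the clause acts as logical AND.
QUERIES = {name: f"{prefix} {INCLUSION_CLAUSE}" for name, prefix in PREFIXES.items()}

for name, query in QUERIES.items():
    print(f"{name}: {query}")
```

Changing the list in one place then updates every query that depends on it.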

I’m very interested in improving this methodology, so if you have ideas, please comment below or send me email at
