Adding the SPSS MEAN.n Function to R

SPSS contains a very useful set of functions that R lacks. If you’re lucky enough to have access to SPSS, you can use SPSS and R very well together. If not, it’s easy to add these functions to R. The functions perform calculations across values within each observation. Rather than limiting you to either removing missing values or not, they let you specify how many valid values you want before setting the result to missing. For example, in SPSS,
MEAN.5(Q1 TO Q10) asks for the mean only if at least five of the ten variables have valid values. Otherwise the result will be a missing value. This “.n” extension is also available for SPSS’ SUM, SD, VARIANCE, MIN and MAX functions.

Let’s now take a look at how to do this in R. First we’ll create some data with a different number of missing values for each observation.

> q1 <- c(1, 1, 1)
> q2 <- c(2, 2, NA)
> q3 <- c(3, NA, NA)
> df <- data.frame(q1, q2, q3)
> df
  q1 q2 q3
1  1  2  3
2  1  2 NA
3  1 NA NA

R already has a mean() function, but it lacks a function to count the number of valid values. A common way to do this in R is to use the is.na() function to generate a vector of TRUE/FALSE values (TRUE for missing, FALSE for not) and then sum them. As with many software packages, R treats TRUE as having the value 1 and FALSE as having the value 0, so this approach gets us the number of missing values. The “!” symbol means “not” in R, so summing !is.na() gives the number of non-missing values instead. Here’s a function that does this:

> nvalid <- function(x) sum(!is.na(x))
> nvalid(q2)
[1] 2

So it has found that there are two valid values for q2. This nvalid() function obviously works on vectors, but we need to apply it to the rows of our data frame. We can select the first three variables using df[1:3] and then pass the result into as.matrix() to make the rows easily accessible by R’s apply() function. The apply() function’s second argument is 1, indicating that we would like to apply the function across rows (the value 2 would indicate columns). The final arguments are the function to apply and any arguments it needs.

> means  <- apply(as.matrix(df[1:3]), 1, mean, na.rm = TRUE)
> counts <- apply(as.matrix(df[1:3]), 1, nvalid)
> means
[1] 2.0 1.5 1.0
> counts
[1] 3 2 1
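
As an aside, base R’s rowMeans() and rowSums() offer a vectorized route to the same two results. This is only a sketch of an alternative, not the approach the rest of this post builds on. It recreates the example data so that it runs on its own, and unname() is used just to drop any row names from the results.

```r
# Recreate the example data so this block is self-contained
df <- data.frame(q1 = c(1, 1, 1),
                 q2 = c(2, 2, NA),
                 q3 = c(3, NA, NA))

# Row means, ignoring missing values
means <- rowMeans(df[1:3], na.rm = TRUE)

# !is.na() on a data frame returns a logical matrix, so rowSums()
# counts the valid (non-missing) values in each row
counts <- rowSums(!is.na(df[1:3]))

unname(means)   # 2.0 1.5 1.0
unname(counts)  # 3 2 1
```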

We have our means and the counts of valid values, so all that remains is to choose our desired value of counts and accept the mean if the data have that value or greater, but return a missing value (NA) if not. This can be done using the ifelse() function, whose first argument is the logical condition, followed by the value desired when TRUE, then the value when FALSE.

> means <- ifelse(counts >= 2, means, NA)
> means
[1] 2.0 1.5 NA
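
Because ifelse() works element by element, it may help to look at the logical condition on its own. The sketch below retypes the means and counts from above so it can be run by itself; the logical-indexing version at the end is simply another common R idiom for the same step, and the name means2 is made up for this illustration.

```r
# Values copied from the output above so this block runs on its own
means  <- c(2.0, 1.5, 1.0)
counts <- c(3, 2, 1)

# The condition is evaluated once per row
enough <- counts >= 2            # TRUE TRUE FALSE

# ifelse() picks from means where TRUE and uses NA where FALSE
result <- ifelse(enough, means, NA)
result                           # 2.0 1.5 NA

# Same outcome with logical indexing (means2 is a hypothetical name)
means2 <- means
means2[counts < 2] <- NA
```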

We’ve seen all the parts work, so all that remains is to put them together into a single function that has two arguments, one for the data frame and one for the n required.

mean.n   <- function(df, n) {
  means <- apply(as.matrix(df), 1, mean, na.rm = TRUE)
  nvalid <- apply(as.matrix(df), 1, function(df) sum(!is.na(df)))
  ifelse(nvalid >= n, means, NA)
}

Let’s test our function requiring 1, 2 and 3 valid values.

> df$mean1 <- mean.n(df[1:3], 1)
> df$mean2 <- mean.n(df[1:3], 2)
> df$mean3 <- mean.n(df[1:3], 3)
> df
  q1 q2 q3 mean1 mean2 mean3
1  1  2  3   2.0   2.0     2
2  1  2 NA   1.5   1.5    NA
3  1 NA NA   1.0    NA    NA

That looks good. You could apply this same idea to various other R functions such as sd() or var(). You could also apply it to sum() as SPSS does, but I rarely do that. If you were creating a scale score from a set of survey Likert items measuring agreement, and a person replied “strongly agree” (a value of 5) to only half the items but skipped the others, would you want the resulting score to be a neutral value as the sum would imply, or “strongly agree” as the mean would indicate? The mean makes much more sense in most situations. Be careful though, as there are standardized tests that require use of the sum.
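
Here is one way that generalization might look. It is a minimal sketch: the name row.fun.n() is made up for this example and is not an SPSS or R function, and it assumes the function you pass in accepts an na.rm argument, as mean(), sum(), sd() and var() all do. The single-row data frame mimics the Likert situation just described: ten items, five answered “strongly agree” and five skipped.

```r
# Hypothetical helper: apply any summary function across rows, returning
# NA unless at least n values in the row are valid.
# Assumes fun accepts na.rm (true of mean, sum, sd, var, min, max).
row.fun.n <- function(df, n, fun, ...) {
  results <- apply(df, 1, fun, na.rm = TRUE, ...)
  nvalid  <- apply(df, 1, function(row) sum(!is.na(row)))
  ifelse(nvalid >= n, results, NA)
}

# One respondent: 5 on five of ten items, the other five skipped
items <- as.data.frame(t(c(rep(5, 5), rep(NA, 5))))

item.sum  <- row.fun.n(items, 5, sum)   # 25, half of the maximum of 50
item.mean <- row.fun.n(items, 5, mean)  # 5, i.e. "strongly agree"
too.few   <- row.fun.n(items, 6, mean)  # NA, only five valid values
item.sd   <- row.fun.n(items, 2, sd)    # 0, all valid answers are equal
```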

If you’re an SPSS user looking to learn just enough R to use the two together, you might want to read this, or to learn more you could take one of my workshops. If you really want to dive into the details, you might consider reading my book, R for SAS and SPSS Users.


About Bob Muenchen

I help researchers analyze their data, teach workshops on data analysis using R, and write books about research computing.

11 Responses to Adding the SPSS MEAN.n Function to R

  1. Tim Salabim says:

    Hey Bob,
    nice example of how to use R to suit your own needs. I thought this would make a great example of how to use functional programming in R to make the whole approach even more flexible. Consider this function:

    
    applyN <- function(n, fun, ...) {
      function(x) {
        nvalid <- sum(!is.na(x))
        out <- fun(x, ...)
        if (nvalid >= n) out else NA
      }
    }
    

    This way you can now use pretty much any function (not restricted to base functions – functions from any package should work). As a first step you specify how the function should be set up, e.g.:

    
    myFun <- applyN(2, mean)
    

    and then you apply this to any vector (e.g. q3 from your example above):

    
    myFun(q3)
    

    It may seem a little cumbersome in this example, but imagine a setting where you want to loop over numerous summary functions, numerous validn settings and numerous vectors. Then this approach will let you easily set up a new function for each iteration and save you some lines of code.

    Thanks for the inspiration for this!
    Tim

    • Bob Muenchen says:

      Hi Tim,

      Nice idea! It makes sense to remove the apply from it so that it’s more general purpose. Naming it something like “funN” might make it clear that the apply() is now up to the user to do whichever direction they like.

      Cheers,
      Bob

      • Tim Salabim says:

        Sorry but the function I posted seems to have lost some content. Here’s the complete function renamed ‘funN’

        funN <- function(n, fun, ...) {
          function(x) {
            nvalid <- sum(!is.na(x))
            out <- fun(x, ...)
            if (nvalid >= n) out else NA
          }
        }

        Cheers
        Tim

  2. Daniel says:

    Hi Bob (and Andrea),
    nice function, I always wanted to write a similar small function because I every now and then have the need to calculate row means if at least x values are valid…
    According to your example: you don’t need to coerce to matrix in the apply function, and I would not use “df” inside the function parameter of the apply-function, since it’s misleading (the parameter is the looped row, not the initial df). But this is just “cosmetics”…

    • Bob Muenchen says:

      Daniel,

      I thought as.matrix() was redundant and I even read the help file, which states in the Details section, “If X is not an array but an object of a class with a non-null dim value (such as a data frame), apply attempts to coerce it to an array via as.matrix if it is two-dimensional (e.g., a data frame)…” However, when I tested it, it appeared not to work without it. In retracing my steps I found the bug in the code that was the real culprit, so you’re quite right, it’s redundant. It may save a nanosecond of compute time by skipping the test to see if it’s an array or not, but that’s hardly worth wasting code on.

      Cheers,
      Bob

  3. This is a really neat post (and helpful for more complicated settings, or to learn more about how to learn R). For simpler applications (e.g. teaching intro students), the “mosaic” package facilitates calculation of a variety of summary statistics using a modeling language (akin to lattice and lm()). The “favstats()” function facilitates calculating the mean as well as the available sample size:

    > require(mosaic)
    > favstats(~ q1, data=df)
     min Q1 median Q3 max mean sd n missing
       1  1      1  1   1    1  0 3       0
    > favstats(~ q2, data=df)
     min Q1 median Q3 max mean sd n missing
       2  2      2  2   2    2  0 2       1
    > favstats(~ q3, data=df)
     min Q1 median Q3 max mean sd n missing
       3  3      3  3   3    3 NA 1       2

    It also supports calculations such as:

    favstats(y ~ x, data=df)

    • Bob Muenchen says:

      Hi Nicholas,

      I really like that approach. One of the things that surprised me the most about R is that the data argument is not supported by all functions. I see mosaic also adds formulas and the data argument to mean, sd, etc. Nice! Too bad they didn’t make na.rm = TRUE the default. Having it set to FALSE by default makes R deal with missing values the reverse of all other stat packages I know.

      Cheers,
      Bob

  4. Harold says:

    Here is a variation that adds some formatting and a parameter to set the number of digits.

    mean.n <- function(x, n = 2, digits = 2) {
      mns <- apply(x, 2, mean, na.rm = TRUE)
      nv <- apply(x, 2, function(x) sum(!is.na(x)))
      mnz <- ifelse(nv >= n, mns, NA)
      cat("Mean: ", sprintf(paste0("%9.", digits, "f"), mnz),
          "\n N: ", sprintf("%9s", nv), "\n")
    }
