Quantitative literature review with R: Exploring Psychonomic Society Journals, Part I

Literature reviews, both casual and formal (or qualitative / quantitative), are an important part of research. In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a research literature, using Psychonomic Society’s publications between 2005 and 2016.

Commonly, literature reviews are rather informal; you read a review paper or 10, maybe suggested to you by experts in the field, and then delve deeper into the topics presented in those papers. A more formal version starts with a database search, where you try out various search terms, and collect a more exhaustive list of publications related to your research questions. Here, we are going to have a bit of fun and explore a large heterogeneous literature (kind of) quantitatively, focusing on data manipulation (not the bad kind!) and visualization.

The dataset

I’ll be using data from the psychLit R package:

library(psychLit)
d <- psychLit

Let’s first load all the relevant R packages that we’ll use to do our work for us:

library(tidyverse)
library(stringr)

Here, we limit the investigation to Psychonomic Society journals:

(journals <- unique(d$`Source title`))
##  [1] "Acta Psychologica"                                                   
##  [2] "Applied Cognitive Psychology"                                        
##  [3] "Attention, Perception, and Psychophysics"                            
##  [4] "Behavior Research Methods"                                           
##  [5] "Behavioral and Brain Sciences"                                       
##  [6] "Cognition"                                                           
##  [7] "Cognitive, Affective and Behavioral Neuroscience"                    
##  [8] "Cognitive Psychology"                                                
##  [9] "Cognitive Science"                                                   
## [10] "Consciousness and Cognition"                                         
## [11] "Current Directions in Psychological Science"                         
## [12] "Journal of Experimental Psychology: General"                         
## [13] "Journal of Experimental Psychology: Human Perception and Performance"
## [14] "Journal of Experimental Psychology: Learning Memory and Cognition"   
## [15] "Journal of Mathematical Psychology"                                  
## [16] "Journal of Memory and Language"                                      
## [17] "Learning and Behavior"                                               
## [18] "Memory & Cognition"                                                  
## [19] "Perception & Psychophysics"                                          
## [20] "Psychological Methods"                                               
## [21] "Psychological Review"                                                
## [22] "Psychological Science"                                               
## [23] "Psychonomic Bulletin and Review"                                     
## [24] "Quarterly Journal of Experimental Psychology"                        
## [25] "Trends in Cognitive Sciences"
d <- filter(d, `Source title` %in% journals[c(3, 4, 7, 17, 18, 23)])

Univariate figures

This data is very rich, but for this initial exploration, we’ll be focusing on simple uni- and bi-variate figures. In later parts of this tutorial (future blog posts), we’ll start visualizing more complex data, such as keywords, authors, and networks.

Let’s begin with some simple summaries of what’s been going on, over time, for each journal. We’ll make extensive use of the dplyr data manipulation verbs in the following plots. Take a look at the linked website if they are unfamiliar to you; although I will explain more complicated cases, I won’t bother with every detail.

Publication years

First, we’re interested in a simple question: How many articles has each journal published in each year? Are there temporal patterns, and do they differ between journals? The steps for creating this plot are commented in the code below. Roughly, in order of appearance, we first add grouping information to the data frame, then summarise the data on those groups, create a new column for a publication-level summary, and then order the publications on it. We then pass the results to ggplot2 and draw a line and point graph.

d %>% 
    group_by(`Source title`, Year) %>% 
    summarise(n = n()) %>%  # For each group (Publication, Year), count nobs
    mutate(nsum = sum(n)) %>%  # For each Publication, sum of nobs {1}
    ungroup() %>%  # Ungroup data frame
    mutate(`Source title` = reorder(`Source title`, nsum)) %>%  # {2}
    ggplot(aes(x=Year, y=n)) +
    geom_point() +
    geom_line() +
    labs(y = "Number of articles published") +
    facet_wrap("`Source title`", ncol=2)

The code in {1} works, because the summarise() command in the above line ungroups the last grouping factor assigned by group_by(). And since I called mutate() instead of summarise, the sum of number of observations was added to each row, instead of collapsing the data by group. {2} made the Source title variable into a factor that’s ordered by the sum of articles across years for that journal; this ordering is useful because when I used facet_wrap() below, the panels are nicely ordered (journals with fewer total papers in the upper left, increasing toward bottom right.)

If this code doesn’t make any sense, I strongly recommend loading the data into your own R workspace, and executing the code line by line.

I’m a little surprised at the relative paucity of articles in Learning and Behavior, and it looks like there might be some upward trajectory going on in PBR. Should I run some regressions? Probably not.

Citations

Let’s look at numbers of citations next. For these, simple histograms will do, and we expect to see some heavy tails. I will pre-emptively clip the x-axis at 100 citations, because there’s probably a few rogue publications with a truly impressive citation count, which might make the other, more common, values less visible.

d %>% 
    ggplot(aes(`Cited by`)) +
    geom_histogram(binwidth=1, col="black") +
    coord_cartesian(xlim=c(0,100)) +
    facet_wrap("`Source title`", ncol=2)

Nice. But how many papers are there with more than 100 citations? What’s the proportion of those highly cited papers?

d %>%
    group_by(`Source title`) %>%
    mutate(total = n()) %>% 
    filter(`Cited by` > 100) %>% 
    summarise(total = unique(total),
              over_100 = n(),
              ratio = signif(over_100/total, 1)) %>% 
    arrange(total)
## # A tibble: 6 x 4
##   `Source title`                                   total over_100 ratio
##   <chr>                                            <int>    <int> <dbl>
## 1 Learning and Behavior                              595        5 0.008
## 2 Cognitive, Affective and Behavioral Neuroscience   884       65 0.07 
## 3 Behavior Research Methods                          898       11 0.01 
## 4 Attention, Perception, and Psychophysics          1710        8 0.005
## 5 Memory & Cognition                                2481      194 0.08 
## 6 Psychonomic Bulletin and Review                   3085      187 0.06

Excellent. We have now reviewed some of the powerful dplyr verbs and are ready for slightly more complex summaries of the data.

Bivariate figures

Publication year and citations

We can visualize publication counts over years of publication, but this might not be the most informative graph (because recent articles naturally can’t be as often cited as older ones.) Here, I’ll show how to plot neat box plots (without outliers) with ggplot2:

d %>% 
    filter(`Cited by` < 100) %>% 
    ggplot(aes(Year, `Cited by`)) +
    geom_boxplot(aes(group = Year), outlier.color = NA) +
    facet_wrap("`Source title`", ncol=2, scales = "free")

Pages and citations

The last figure for today is a scatterplot of pages (how many pages the article spans) versus number of citations. For scatterplots I usually prefer empty points (shape=1).

d %>% 
    filter(`Page count` < 100, `Cited by` < 100) %>% 
    ggplot(aes(`Page count`, `Cited by`)) +
    geom_point(shape=1) +
    facet_wrap("`Source title`", ncol=2)

Summary

We have now scratched the surface of these data, and there is clearly a lot to explore:

glimpse(d)
## Observations: 9,653
## Variables: 24
## $ Authors                     <chr> "Todorović D.", "Cardoso-Leite P.,...
## $ Title                       <chr> "The effect of the observer vantag...
## $ Year                        <int> 2009, 2009, 2009, 2009, 2009, 2009...
## $ `Source title`              <chr> "Attention, Perception, and Psycho...
## $ Volume                      <int> 71, 71, 71, 71, 71, 71, 71, 71, 71...
## $ Issue                       <chr> "1", "1", "1", "1", "1", "1", "1",...
## $ `Art. No.`                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Page count`                <int> 10, 12, 2, 11, 14, 16, 11, 8, 5, 5...
## $ `Cited by`                  <int> 16, 15, 8, NA, 19, 22, 1, 17, 13, ...
## $ DOI                         <chr> "10.3758/APP.71.1.183", "10.3758/A...
## $ Affiliations                <chr> "University of Belgrade, Belgrade,...
## $ `Authors with affiliations` <chr> "Todorović, D., University of Belg...
## $ Abstract                    <chr> "Some features of linear perspecti...
## $ `Author Keywords`           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Index Keywords`            <chr> "art; article; attention; depth pe...
## $ `Funding Details`           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Funding Text`              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ References                  <chr> "Ames A., Jr., The illusion of dep...
## $ `Correspondence Address`    <chr> "Todorović, D.; Laboratory of Expe...
## $ Publisher                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ ISSN                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `PubMed ID`                 <int> 19304608, 19304599, 19304592, NA, ...
## $ `Abbreviated Source Title`  <chr> "Atten. Percept. Psychophys.", "At...
## $ `Document Type`             <chr> "Article", "Article", "Article", "...

In next posts on this data, we’ll look at networks and other more advanced topics. Stay tuned.

Related