Quantitative literature review with R: Exploring Psychonomic Society Journals, Part II

(This post directly continues from part I of Quantitative literature review with R. Please read that first for context.) In this tutorial, I’ll show how to use R to quantitatively explore, analyze, and visualize a broad research literature, using all Psychonomic Society publications between 2004 and 2016. Part I focused on data cleaning and simple figures, but here we will look at relational data by visualizing some network structures in the data.

We’ll be working with R, so if you want to follow along on your own computer, fire up R and load up the required R packages:

library(tidyverse)
library(stringr)

The data is contained in the psychLit R package:

library(psychLit)
d <- psychLit

As before, we limit the investigation to Psychonomic Society journals:

(journals <- unique(d$`Source title`))
##  [1] "Acta Psychologica"                                                   
##  [2] "Applied Cognitive Psychology"                                        
##  [3] "Attention, Perception, and Psychophysics"                            
##  [4] "Behavior Research Methods"                                           
##  [5] "Behavioral and Brain Sciences"                                       
##  [6] "Cognition"                                                           
##  [7] "Cognitive, Affective and Behavioral Neuroscience"                    
##  [8] "Cognitive Psychology"                                                
##  [9] "Cognitive Science"                                                   
## [10] "Consciousness and Cognition"                                         
## [11] "Current Directions in Psychological Science"                         
## [12] "Journal of Experimental Psychology: General"                         
## [13] "Journal of Experimental Psychology: Human Perception and Performance"
## [14] "Journal of Experimental Psychology: Learning Memory and Cognition"   
## [15] "Journal of Mathematical Psychology"                                  
## [16] "Journal of Memory and Language"                                      
## [17] "Learning and Behavior"                                               
## [18] "Memory & Cognition"                                                  
## [19] "Perception & Psychophysics"                                          
## [20] "Psychological Methods"                                               
## [21] "Psychological Review"                                                
## [22] "Psychological Science"                                               
## [23] "Psychonomic Bulletin and Review"                                     
## [24] "Quarterly Journal of Experimental Psychology"                        
## [25] "Trends in Cognitive Sciences"
d <- filter(d, `Source title` %in% journals[c(3, 4, 7, 17, 18, 23)])
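Indexing the journals by position works, but it silently breaks if the list above ever changes order. A more explicit alternative (a sketch using the titles printed above) filters by name:

ps_journals <- c("Attention, Perception, and Psychophysics",
                 "Behavior Research Methods",
                 "Cognitive, Affective and Behavioral Neuroscience",
                 "Learning and Behavior",
                 "Memory & Cognition",
                 "Psychonomic Bulletin and Review")
d <- filter(d, `Source title` %in% ps_journals)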

We’ve got data on quite a few variables which we could imagine are interconnected, such as Tags (Keywords), Authors, and the articles themselves (through the References variable).

glimpse(d)
## Observations: 9,653
## Variables: 24
## $ Authors                     <chr> "Todorović D.", "Cardoso-Leite P.,...
## $ Title                       <chr> "The effect of the observer vantag...
## $ Year                        <int> 2009, 2009, 2009, 2009, 2009, 2009...
## $ `Source title`              <chr> "Attention, Perception, and Psycho...
## $ Volume                      <int> 71, 71, 71, 71, 71, 71, 71, 71, 71...
## $ Issue                       <chr> "1", "1", "1", "1", "1", "1", "1",...
## $ `Art. No.`                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Page count`                <int> 10, 12, 2, 11, 14, 16, 11, 8, 5, 5...
## $ `Cited by`                  <int> 16, 15, 8, NA, 19, 22, 1, 17, 13, ...
## $ DOI                         <chr> "10.3758/APP.71.1.183", "10.3758/A...
## $ Affiliations                <chr> "University of Belgrade, Belgrade,...
## $ `Authors with affiliations` <chr> "Todorović, D., University of Belg...
## $ Abstract                    <chr> "Some features of linear perspecti...
## $ `Author Keywords`           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Index Keywords`            <chr> "art; article; attention; depth pe...
## $ `Funding Details`           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Funding Text`              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ References                  <chr> "Ames A., Jr., The illusion of dep...
## $ `Correspondence Address`    <chr> "Todorović, D.; Laboratory of Expe...
## $ Publisher                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ ISSN                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `PubMed ID`                 <int> 19304608, 19304599, 19304592, NA, ...
## $ `Abbreviated Source Title`  <chr> "Atten. Percept. Psychophys.", "At...
## $ `Document Type`             <chr> "Article", "Article", "Article", "...

Let’s begin by looking at the articles’ Keywords.

Tags

These are the things that you have to choose to describe your article when you submit it for publication. They are called Keywords in the data, but I’m going to refer to them as Tags for brevity. Also, I will only work with the Author Keywords, because it seems like Scopus’s automatic index keywords are mostly nonsense or missing.

So our first operation is to rename the Author Keywords variable to Tags, and get rid of the automatic keywords altogether.

d <- rename(d, Tags = `Author Keywords`) %>% 
    select(-`Index Keywords`)

Before moving on to networks, let’s take a look at what these tags are like: How many are there? Do people use them consistently?

Exploring univariate tags

Our first goal is to simply visualize the most common tags. However, recall that the Tags variable in the data is quite messy.
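A minimal sketch to peek at the first few values:

d %>% 
    select(Tags) %>% 
    slice(1:3)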

## # A tibble: 3 x 1
##   Tags 
##   <chr>
## 1 <NA> 
## 2 <NA> 
## 3 <NA>

Each value of Tags is a kind of list whose individual items are separated by semicolons. What we’d like to have is a variable (a column in a data frame) where every item is a single tag, so we need some kind of unwrapping operation.

The unwrapping can be done by simply splitting the Tags variable on each semicolon (and space) found in each string, but this creates a strange variable:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) 
## # A tibble: 9,653 x 1
##    Tags     
##    <list>   
##  1 <chr [1]>
##  2 <chr [1]>
##  3 <chr [1]>
##  4 <chr [1]>
##  5 <chr [1]>
##  6 <chr [1]>
##  7 <chr [1]>
##  8 <chr [1]>
##  9 <chr [1]>
## 10 <chr [1]>
## # ... with 9,643 more rows

This is where things get really interesting. A few lines above, we conceptualized each value (cell) of Tags as a sort of list. We have now operationalized that idea by creating a Tags variable that is an R list-column, meaning that each cell can contain a complex R object. Here, each value is a list of character values. To make this notion concrete, look at the first row (slice(1)) of the Tags variable (.$Tags; . is a placeholder for the data being passed along by the %>% pipes) with the familiar str() function:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    slice(1) %>% 
    .$Tags %>%  str()
## List of 1
##  $ : chr NA

All the tags associated with the first article in our data frame are in a neat list of neat character strings. Brilliant. This data format is quite foreign to those of us used to working with strictly flat rectangular data, but it has a certain recursive beauty to it. Now that our list is concrete, we can unnest it to create a Tags variable where each row is one individual tag.

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags)
## # A tibble: 26,358 x 1
##    Tags 
##    <chr>
##  1 <NA> 
##  2 <NA> 
##  3 <NA> 
##  4 <NA> 
##  5 <NA> 
##  6 <NA> 
##  7 <NA> 
##  8 <NA> 
##  9 <NA> 
## 10 <NA> 
## # ... with 26,348 more rows

Excellent! This data frame has one variable, Tags, where each observation is an individual tag. Now we can go ahead and count and summarise the tags. From here onwards, it’s all regular dplyr verbs:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>%  # Remove missing observations
    group_by(Tags) %>%  # Group by unique Tags
    count() %>%  # And count them
    ungroup() %>% 
    top_n(40) %>%  # Choose n most numerous Tags
    mutate(Tags = reorder(Tags, n)) %>%  # {1}
    ggplot(aes(n, Tags)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=Tags)) +
    scale_x_continuous(limits = c(0, 400), expand = c(0, 0)) +
    labs(x="Number of articles", y="",
         subtitle = "Top 30 tags in Psychonomics Journals",
         caption = "Data for years 2004-2016 only")

{1} reorders the Tags on the n variable so that they appear in order in the figure. You could also make a word cloud of these data, but I think I’ll refrain this time. Instead, I’ll explore popular tags by journal. The pipeline to get tags by journal is almost identical, except that we’ll keep the journal identifier (Abbreviated Source Title) in the data.

The following code is a little more complicated than is required for a simple plot, but I wanted to order the Tags inside each publication, which required a little bit of extra work as described by @drsimonj. Below, I’ll work with journal abbreviations.

pd <- d %>% 
    select(`Abbreviated Source Title`, Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>%
    group_by(`Abbreviated Source Title`, Tags) %>%  # Group by unique Tags AND journals
    count() %>%  # And count them
    group_by(`Abbreviated Source Title`) %>% 
    top_n(10, n) %>%  # Select top 10 tags within each journal
    ungroup() %>% 
    arrange(`Abbreviated Source Title`, n) %>%  # Arrange by n within journals
    mutate(order = row_number())  # Will order Tags on this
pd
## # A tibble: 61 x 4
##    `Abbreviated Source Title`  Tags                      n order
##    <chr>                       <chr>                 <int> <int>
##  1 Atten. Percept. Psychophys. Speech perception        37     1
##  2 Atten. Percept. Psychophys. Working memory           39     2
##  3 Atten. Percept. Psychophys. Visual working memory    46     3
##  4 Atten. Percept. Psychophys. Visual attention         48     4
##  5 Atten. Percept. Psychophys. Attentional capture      56     5
##  6 Atten. Percept. Psychophys. Selective attention      58     6
##  7 Atten. Percept. Psychophys. Visual perception        63     7
##  8 Atten. Percept. Psychophys. Eye movements            68     8
##  9 Atten. Percept. Psychophys. Visual search           136     9
## 10 Atten. Percept. Psychophys. Attention               195    10
## # ... with 51 more rows

That was more complicated than usual, so I wanted to print out the intermediate result as well. We can now use it to plot:

pd %>%     
    ggplot(aes(n, order)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=order)) +
    scale_y_continuous(breaks = pd$order, labels = pd$Tags) +
    scale_x_continuous(expand = c(0.02, 0)) +
    labs(x="Number of articles", y="",
         subtitle = "Top 10 tags within 6 Psychonomics Journals",
         caption = "Data for years 2004-2016 only") +
    facet_wrap("`Abbreviated Source Title`", scales="free", ncol=2)

Finally, let’s look at the overall distribution of the tags as a histogram:
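A minimal sketch for such a histogram, reusing the unnesting pipeline from above:

d %>% 
    select(Tags) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags)) %>% 
    count(Tags) %>%  # One row per unique tag, with its frequency n
    ggplot(aes(n)) +
    geom_histogram(binwidth = 1) +
    labs(x = "Number of occurrences", y = "Number of unique tags")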

There are about 8,000 unique Tags, and about 7,000 of them occur fewer than three times in the corpus. Now that we have some idea of what kinds of tags there are in the data, roughly how they are distributed, and how they vary across the journals, we are ready to look at networks of tags.

Network analysis of Tags

Network analysis (or graph analysis) is a vast and complicated area, and my graph analysis skills are a little rusty. So here we’ll simply illustrate how to format the data in an appropriate way, and draw a few example graphs. I referred to this website more than once while working on this (also thanks to @jalapic for tips).

The first graph will be about Tags. We’ll try to illustrate how and which tags co-occur within articles in the Psychonomic Society journals between 2004 and 2016.

90% of data analysis is data cleaning and manipulation, and network analysis is no exception. The first data manipulation step is to create an adjacency matrix where each row and column is a tag, and the cell values are the number of times those tags co-occurred (i.e., were present in the same article). I found this Stack Overflow answer and this blog post helpful when preparing the data.

Preparing the adjacency matrix

In order to create the adjacency matrix we need to solve a sequence of smaller problems. First, we take two columns from the data: one for Tags, and one that uniquely identifies each article (it doesn’t matter what it is; here we create a variable called item from the row names). Then we unnest the Tags into single tags per row as detailed above, and remove empty tags.

m <- d %>% 
    rownames_to_column("item") %>% 
    select(Tags, item) %>% 
    mutate(Tags = str_split(Tags, "; ")) %>% 
    unnest(Tags) %>% 
    filter(!is.na(Tags))
m
## # A tibble: 21,446 x 2
##    item  Tags                   
##    <chr> <chr>                  
##  1 417   Exogenous cuing        
##  2 417   Perceived duration     
##  3 417   Spatial attention      
##  4 418   Intersensory perception
##  5 418   Synesthetic congruency 
##  6 418   Temporal ventriloquism 
##  7 418   Visual TOJ             
##  8 419   Coordinate systems     
##  9 419   Eye-centered           
## 10 419   Reference frame        
## # ... with 21,436 more rows

Next, I remove all tags from the data that occurred eight times or fewer, because otherwise the resulting matrix would be too large and messy.

m <- group_by(m, Tags) %>% 
    mutate(n = n()) %>% 
    filter(n > 8) %>% 
    ungroup() %>% 
    select(-n)

Next, we need a variable that identifies whether the tag occurred in the article or not:

m$occur <- 1
slice(m, 1:2)
## # A tibble: 2 x 3
##   item  Tags              occur
##   <chr> <chr>             <dbl>
## 1 417   Spatial attention     1
## 2 419   Saccades              1

At this point occur is 1 on every row, because each row records a tag that actually appeared in an article. To create the matrix we need all the item-tag pairs, including the ones that didn’t occur; for those pairs, occur will be zero. The tidyr package in tidyverse provides the perfect helper function for this purpose:

m <- m %>% 
    complete(item, Tags, fill = list(occur=0))
m
## # A tibble: 1,274,021 x 3
##    item  Tags                      occur
##    <chr> <chr>                     <dbl>
##  1 1000  3-D perception                0
##  2 1000  Action                        0
##  3 1000  Action control                0
##  4 1000  Adaptation                    0
##  5 1000  Affect                        0
##  6 1000  Age of acquisition            0
##  7 1000  Aging                         0
##  8 1000  Amygdala                      0
##  9 1000  Animal cognition              0
## 10 1000  Anterior cingulate cortex     0
## # ... with 1,274,011 more rows

Now we have a data frame with 1,274,021 rows, each row signifying a possible article-tag co-occurrence. Most occurs will, of course, be zero.
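As a quick sanity check, we could tabulate the occur values; the vast majority should be zero:

count(m, occur)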

Now we can create the co-occurrence matrix. We first cast the data into a wide matrix where each row is a unique tag, each column is an article, and the cell values are occurrences.

library(reshape2)
y <- m %>% 
    acast(Tags~item, value.var="occur", fun.aggregate = sum)
y[1:2, 1:2]
##                1000 1001
## 3-D perception    0    0
## Action            0    0
y[y>1] <- 1  # Binarize: count each tag at most once per article

All we need to do then is to make a symmetric matrix where both rows and columns are tags, and the cell values are numbers of co-occurrences. So we matrix multiply y with its transpose; cell [i, j] of the result counts the articles in which tags i and j both appeared:

co <- y %*% t(y)
co[1:3, 1:3]
##                3-D perception Action Action control
## 3-D perception             11      0              0
## Action                      0     22              0
## Action control              0      0              9

Visualizing the adjacency matrix

It’s important to understand what co is: a symmetric matrix whose rows and columns are tags, with their co-occurrence counts at the intersections (the diagonal holds each tag’s total number of occurrences). For example, Action and 3-D perception never co-occurred in the data, but Adaptation and 3-D perception co-occurred once. We are now in an excellent position to investigate which tags tend to co-occur.
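In fact, we can already read the most frequent pairs straight off the matrix; a quick sketch (melting co into long format and keeping each unordered pair once):

as.data.frame(as.table(co)) %>%  # Long format: Var1, Var2, Freq
    filter(as.character(Var1) < as.character(Var2),  # Keep each pair once; drop the diagonal
           Freq > 0) %>% 
    arrange(desc(Freq)) %>%  # Most frequent pairs first
    head(10)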

First, let’s use igraph to create the graph object g, and then simplify it by removing loops (the diagonal of co, each tag’s co-occurrence with itself).

library(igraph)
g <- graph.adjacency(co, weighted=TRUE, mode ='undirected')
g <- simplify(g)

g is now an igraph R object that contains edges and vertices, and information about them. For example, we can use V(g)$name to look at the vertex names. Some of these are quite long, so I’ll create another attribute for each vertex, called label, which is the name with each space replaced by a newline character. This will make some of the graphs prettier.

V(g)$label <- gsub(" ", "\n", V(g)$name)

Networks are very complicated visualizations, and if you simply call plot(g), you’ll get an ugly mess of a figure:

plot(g)

There are more parameters to the plot() command than can be listed here, but I tried out a few different options, and this is what I ended up with:

plot(g,
     layout = layout_on_sphere(g),
     edge.width = E(g)$weight/7,
     edge.color = adjustcolor("dodgerblue1", .7),
     vertex.color = adjustcolor("white", .8),
     vertex.frame.color=adjustcolor("dodgerblue4", .4),
     vertex.size=1.9,
     vertex.label.family = "Helvetica",
     vertex.label.font=1,
     vertex.label.cex=.75,
     vertex.label.color="black",
     vertex.label = ifelse(degree(g) > 35, V(g)$label, NA))

The Bayes Network

The next network will be fairly involved. Our goal is to visualize the prominence and connectedness of Psychonomic Society’s “Bayesians”. Who are these Bayesians, you might ask? Our definition here is simply: “An author who has published one or more articles in Psychonomic Journals (2004 - 2016) where the keywords contained the word ‘Bayes’.” This is obviously suboptimal because someone might have published an article called “Why I’m not a Bayesian”, yet they would be included in this graph, but we’ll go with this for now.

I’ll let the code speak for itself this time. If it is impossible to parse, I recommend downloading the data (see top of post) to your own computer and running the code yourself.

m <- d %>% 
    rownames_to_column("item") %>% 
    select(Authors, Tags, item) %>% 
    mutate(Authors = str_split(Authors, ", ")) %>% 
    unnest(Authors) %>% 
    filter(!is.na(Tags), !is.na(Authors)) %>% 
    filter(grepl("bayes", Tags, ignore.case = T))
m$occur <- 1
m <- m %>% 
    complete(item, Authors, fill = list(occur=0)) %>% 
    select(-Tags)
m
## # A tibble: 20,678 x 3
##    item  Authors     occur
##    <chr> <chr>       <dbl>
##  1 1234  Adger D.        0
##  2 1234  Annis J.        0
##  3 1234  Arnold D.       0
##  4 1234  Arnold N.R.     0
##  5 1234  Bååth R.        0
##  6 1234  Bakker M.       0
##  7 1234  Bar M.          0
##  8 1234  Barker J.D.     0
##  9 1234  Barr S.         0
## 10 1234  Bartlema A.     0
## # ... with 20,668 more rows

In these data, we have simply recorded whether an author “occurred” in an article, and only included articles whose Tags contained the word “bayes” in any shape or form. This network will therefore visualize the connectedness of the “Bayesians”, as in who has published papers with whom.

We can then re-run pretty much the same steps as in the previous network.

y <- m %>% 
    acast(Authors~item, value.var="occur", fun.aggregate = sum)
co <- y %*% t(y)
co[1:3, 1:3]
##           Adger D. Annis J. Arnold D.
## Adger D.         1        0         0
## Annis J.         0        1         0
## Arnold D.        0        0         1
g <- graph.adjacency(co, weighted=TRUE, mode ='undirected')
g <- simplify(g)
V(g)$label <- gsub(" ", "\n", V(g)$name)
deg <- degree(g)
plot(g,
     layout = layout_with_kk(g),
     edge.width = E(g)$weight,
     edge.color = adjustcolor("dodgerblue1", .7),
     vertex.color = adjustcolor("white", .8),
     vertex.frame.color=adjustcolor("dodgerblue4", .4),
     vertex.size=1.9,
     vertex.label.family = "Helvetica",
     vertex.label.font=1,
     vertex.label.cex = sqrt(1+deg)/5,
     vertex.label.color = "black",
     main = "The Bayesians of Psychonomic Society")

That’s it!
