2018 Workshop at the Academy of Management Annual Meeting

First time setup in R

In order to follow the code developed for the workshop (shown in full below under Workshop example), two pieces of software need to be installed beforehand. The first is R, a free software environment for statistical computing and graphics. The second is RStudio, which is a graphical user interface that goes ‘over’ R, making it more user friendly. It is adamant that R is installed first, and RStudio second.

Before running the code shown below, install R on your system by going to the following page: https://cran.r-project.org/ Here, OS-specific versions of R can be found. For example, by clicking here, you can download the executable for Windows. For Mac OS X, the install file can be found here. Installation using the default settings should do the trick.

Then, after the installation of R is complete, navigate to the following page: https://www.rstudio.com/ You can download the free version of RStudio on this page. Again, the default settings should do the trick.

Then, after these steps are completed, it is advisable to run the following two lines of code in RStudio before coming to the workshop.

# The "tm" package enables the text mining infrastructure that we will use for LDA.
if (!require("tm")) install.packages("tm")

# The "topicmodels" package enables LDA analysis.
if (!require("topicmodels")) install.packages("topicmodels")

# The "LDAVis" package enables visualization from the LDA analysis.
if (!require("LDAVis")) install.packages("LDAVis")

# The "servr" is used by LDAVis.
if (!require("servr")) install.packages("servr")

# The "igraph" package enables creation of networks.
if (!require("igraph")) install.packages("igraph")

These will install the core packages that we will use in the workshop, and their installation may take some time on the standard conference internet connection. After this is done, you’re all set to participate in the workshop! It is also possible to run the code shown below at home beforehand, but note that the actual topic model takes a LONG time to finish on most PCs.

You can install packages by entering them to your script (you can start a new script on Windows via “Shift+Ctrl+N” or by navigating to “File” –> “New File” –> “R Script”. You can run code by selecting the code and pressing “Ctrl + Enter” or the “Run” button at the top of the script window. Packages can also be installed by navigating to “Packages” (which should be on the right half of your screen), or by selecting “Tools” at the top of your screen and selecting “Install packages” from there.

2018 Workshop example

Texts on electric vehicles: 2005-2014. ——————- Code tested and written for R version 3.5.1, tm package version 0.7-4, topicmodels package version 0.2-7, igraph version 1.2.1., and LDAVis version 0.3.2.

Code prepared on July 30, 2018 by Richard Haans (haans@rsm.nl). Data obtained from Hovig Tchalian, used with permission.

Loading packages

# The following command loads the required packages.
library(topicmodels)
library(tm)
library(LDAvis)
library(igraph)

Get the data, turn into a corpus, and clean it up

For more information with regards to cleaning, please see the PDW from last year. It can be found here

#########################################
### Data loading
#########################################
corpus_pr  <-Corpus(DirSource("Texts\\PR"), readerControl = list(reader=readPlain))
corpus_general  <-Corpus(DirSource("Texts\\General"), readerControl = list(reader=readPlain))
# Note that R loads up the texts in alphabetic order based on the filename, even though the folder itself
# was not sorted in this way. It is always good to check how the data are organized after loading to prevent
# errors in data combination further down the road. 

# These files were read locally, but for ease of use, use the command below to load the starting file:
load(url("https://github.com/RFJHaans/topicmodeling/blob/master/Data/2018/Data_pr.RData?raw=true"))
load(url("https://github.com/RFJHaans/topicmodeling/blob/master/Data/2018/Data_general.RData?raw=true"))

#########################################
### Cleaning up the corpus
#########################################
# Below, we clean up the documents in various steps. 
# We write everything to a new corpus called "corpusclean" so that we do not lose the original data.
# 1) Remove numbers
corpusclean_pr <- tm_map(corpus_pr, removeNumbers)
# 2) Remove punctuation
corpusclean_pr <- tm_map(corpusclean_pr, removePunctuation)
# 3) Transform all upper-case letters to lower-case.
corpusclean_pr <- tm_map(corpusclean_pr,  content_transformer(tolower))
# 4) Remove stopwords which do not convey any meaning.
corpusclean_pr <- tm_map(corpusclean_pr, removeWords, stopwords("english")) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords 
# 5) And strip whitespace. 
corpusclean_pr <- tm_map(corpusclean_pr , stripWhitespace)

# 1) Remove numbers
corpusclean_general <- tm_map(corpus_general, removeNumbers)
# 2) Remove punctuation
corpusclean_general <- tm_map(corpusclean_general, removePunctuation)
# 3) Transform all upper-case letters to lower-case.
corpusclean_general <- tm_map(corpusclean_general,  content_transformer(tolower))
# 4) Remove stopwords which do not convey any meaning.
corpusclean_general <- tm_map(corpusclean_general, removeWords, stopwords("english")) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords 
# 5) And strip whitespace. 
corpusclean_general <- tm_map(corpusclean_general , stripWhitespace)

# See the help of getTransformations for more possibilities, such as stemming. 

#########################################
### More cleaning: infrequent words and frequent words
#########################################
# We convert the corpus to a "Document-term-matrix" (dtm)
dtm_pr <-DocumentTermMatrix(corpusclean_pr)  
dtm_general <-DocumentTermMatrix(corpusclean_general)  
# dtms are organized with rows being documents and columns being the unique words.

# To speed up the computation process for this tutorial, I have selected some of the most frequent words 
# that don't seem to be very meaningful. 
# We update the corpusclean corpus by removing these words. 
corpusclean_pr <- tm_map(corpusclean_pr, removeWords, c("words","also","can","get","going","just","well","said","will","thats","now","right","like","last","one","see"))
corpusclean_general <- tm_map(corpusclean_general, removeWords, c("words","also","can","get","going","just","well","said","will","thats","now","right","like","last","one","see"))

# We then create a dictionary that contains words occurring more than 50 times. 
highfreq50_pr <- findFreqTerms(dtm_pr,50,Inf)
highfreq50_general <- findFreqTerms(dtm_general,50,Inf)

# and create a smaller dtm
# Note that this is completed on the corpus, not the DTM. 
smalldtm_50w_pr <- DocumentTermMatrix(corpusclean_pr, control=list(dictionary = highfreq50_pr))
smalldtm_50w_general <- DocumentTermMatrix(corpusclean_general, control=list(dictionary = highfreq50_general))

# The following loads the data after processing via the above steps:
load(url("https://github.com/RFJHaans/topicmodeling/blob/master/Data/2018/Data_preTM.RData?raw=true"))

LDA: Running the model

Please see the PDW from last year for some more information on the basic models. It can be found here

# We first fix the random seed for future replication.
SEED <- 123456789

#########################################
# 10 Topics
t1_10_pr <- Sys.time()
LDA10_pr <- LDA(smalldtm_50w_pr, k = 10, control = list(seed = SEED, verbose = 1))
t2_10_pr <- Sys.time()

t1_10_general <- Sys.time()
LDA10_general <- LDA(smalldtm_50w_general, k = 10, control = list(seed = SEED, verbose = 1))
t2_10_general <- Sys.time()

# Assess time it took to run:
t2_10_pr- t1_10_pr
t2_10_general- t1_10_general

#########################################
# 25 Topics
t1_25_pr <- Sys.time()
LDA25_pr <- LDA(smalldtm_50w_pr, k = 25, control = list(seed = SEED, verbose = 1))
t2_25_pr <- Sys.time()

t1_25_general <- Sys.time()
LDA25_general <- LDA(smalldtm_50w_general, k = 25, control = list(seed = SEED, verbose = 1))
t2_25_general <- Sys.time()

# Assess time it took to run:
t2_25_pr- t1_25_pr
t2_25_general- t1_25_general

#########################################
# 50 Topics
t1_50_pr <- Sys.time()
LDA50_pr <- LDA(smalldtm_50w_pr, k = 50, control = list(seed = SEED, verbose = 1))
t2_50_pr <- Sys.time()

t1_50_general <- Sys.time()
LDA50_general <- LDA(smalldtm_50w_general, k = 50, control = list(seed = SEED, verbose = 1))
t2_50_general <- Sys.time()

# Assess time it took to run:
t2_50_pr- t1_50_pr
t2_50_general- t1_50_general

LDA: Rendering

#########################################
# Terms per topic

#########################################
# 10 topics
# We then create a variable that captures the top ten terms assigned to the 10-topic model:
topics_LDA10_pr <- terms(LDA10_pr, 10)
# And show the results:

      Topic 1       Topic 2     Topic 3      Topic 4      Topic 5    Topic 6   
 [1,] "million"     "market"    "technology" "market"     "market"   "new"     
 [2,] "new"         "think"     "vehicles"   "million"    "year"     "electric"
 [3,] "company"     "growth"    "energy"     "sales"      "company"  "business"
 [4,] "information" "million"   "power"      "energy"     "think"    "company" 
 [5,] "markets"     "new"       "year"       "year"       "sales"    "vehicle" 
 [6,] "electric"    "company"   "business"   "industry"   "business" "million" 
 [7,] "statements"  "energy"    "new"        "quarter"    "electric" "year"    
 [8,] "quarter"     "products"  "market"     "automotive" "first"    "power"   
 [9,] "power"       "year"      "systems"    "growth"     "table"    "vehicles"
[10,] "energy"      "customers" "statements" "time"       "power"    "inc"  

      Topic 7     Topic 8     Topic 9    Topic 10   
 [1,] "business"  "new"       "electric" "company"  
 [2,] "year"      "growth"    "market"   "year"     
 [3,] "new"       "energy"    "energy"   "quarter"  
 [4,] "think"     "inc"       "quarter"  "vehicle"  
 [5,] "energy"    "electric"  "think"    "product"  
 [6,] "sales"     "business"  "million"  "think"    
 [7,] "first"     "million"   "global"   "million"  
 [8,] "time"      "customers" "battery"  "business" 
 [9,] "companies" "tesla"     "question" "new"      
[10,] "today"     "financial" "year"     "operating"

topics_LDA10_general <- terms(LDA10_general, 10)
# And show the results:
topics_LDA10_general

      Topic 1    Topic 2     Topic 3     Topic 4    Topic 5   Topic 6   
 [1,] "car"      "electric"  "know"      "new"      "people"  "tesla"   
 [2,] "vehicles" "time"      "cars"      "first"    "energy"  "new"     
 [3,] "electric" "year"      "president" "obama"    "say"     "company" 
 [4,] "cars"     "president" "people"    "company"  "make"    "percent" 
 [5,] "new"      "price"     "year"      "think"    "think"   "year"    
 [6,] "company"  "cars"      "john"      "electric" "new"     "million" 
 [7,] "year"     "company"   "dont"      "car"      "auto"    "pct"     
 [8,] "gas"      "global"    "big"       "vehicles" "billion" "inc"     
 [9,] "energy"   "million"   "time"      "million"  "know"    "electric"
[10,] "first"    "high"      "years"     "hes"      "years"   "sent" 

      Topic 7      Topic 8  Topic 9    Topic 10 
 [1,] "contact"    "price"  "price"    "price"  
 [2,] "new"        "total"  "electric" "average"
 [3,] "holds"      "new"    "total"    "month"  
 [4,] "location"   "year"   "stock"    "day"    
 [5,] "national"   "day"    "market"   "usd"    
 [6,] "house"      "energy" "high"     "stock"  
 [7,] "street"     "assets" "day"      "company"
 [8,] "energy"     "times"  "year"     "market" 
 [9,] "conference" "month"  "month"    "year"   
[10,] "president"  "market" "insider"  "energy"

#########################################
# 25 topics
topics_LDA25_pr <- terms(LDA25_pr, 25)
# And show the results:
topics_LDA25_pr

      Topic 1       Topic 2      Topic 3          Topic 4       Topic 5     
 [1,] "million"     "market"     "business"       "energy"      "market"    
 [2,] "quarter"     "energy"     "new"            "market"      "business"  
 [3,] "statements"  "company"    "year"           "sales"       "year"      
 [4,] "years"       "think"      "think"          "million"     "sales"     
 [5,] "think"       "years"      "electric"       "markets"     "think"     
 [6,] "company"     "year"       "market"         "year"        "company"   
 [7,] "new"         "new"        "energy"         "automotive"  "table"     
 [8,] "fuel"        "weve"       "million"        "industry"    "electric"  
 [9,] "two"         "growth"     "vehicles"       "growth"      "first"     
[10,] "technology"  "vehicles"   "statements"     "years"       "energy"    

      Topic 6      Topic 7          Topic 8      Topic 9       Topic 10       
 [1,] "company"    "year"           "business"   "market"      "year"         
 [2,] "year"       "business"       "energy"     "electric"    "quarter"      
 [3,] "new"        "think"          "electric"   "energy"      "think"        
 [4,] "power"      "new"            "new"        "quarter"     "company"      
 [5,] "two"        "markets"        "growth"     "information" "business"     
 [6,] "million"    "energy"         "market"     "think"       "million"      
 [7,] "electric"   "first"          "customers"  "question"    "years"        
 [8,] "think"      "two"            "results"    "year"        "new"          
 [9,] "industrial" "technology"     "dont"       "technology"  "vehicle"      
[10,] "vehicles"   "vehicles"       "global"     "global"      "first"            

      Topic 11     Topic 12         Topic 13      Topic 14         Topic 15     
 [1,] "new"        "year"           "market"      "quarter"        "million"    
 [2,] "million"    "hybrid"         "new"         "business"       "year"       
 [3,] "quarter"    "technology"     "business"    "electric"       "sales"      
 [4,] "business"   "vehicles"       "million"     "new"            "time"       
 [5,] "vehicle"    "power"          "power"       "think"          "vehicles"   
 [6,] "market"     "electric"       "company"     "time"           "product"    
 [7,] "statements" "systems"        "electric"    "year"           "vehicle"    
 [8,] "company"    "company"        "sales"       "growth"         "business"   
 [9,] "fuel"       "quarter"        "growth"      "energy"         "quarter"    
[10,] "technology" "first"          "vehicles"    "million"        "power"        

      Topic 16         Topic 17     Topic 18     Topic 19      Topic 20        
 [1,] "vehicle"        "market"     "million"    "company"     "new"           
 [2,] "market"         "electric"   "electric"   "energy"      "year"          
 [3,] "million"        "quarter"    "year"       "technology"  "energy"        
 [4,] "new"            "growth"     "quarter"    "sales"       "technology"    
 [5,] "company"        "new"        "market"     "electric"    "power"         
 [6,] "think"          "sales"      "first"      "car"         "global"        
 [7,] "energy"         "customers"  "sales"      "vehicles"    "vehicles"      
 [8,] "product"        "vehicle"    "industry"   "million"     "think"         
 [9,] "technology"     "industry"   "energy"     "markets"     "information"   
[10,] "growth"         "think"      "companies"  "think"       "sales"          

      Topic 21      Topic 22     Topic 23    Topic 24      Topic 25       
 [1,] "market"      "market"     "electric"  "new"         "market"       
 [2,] "year"        "business"   "market"    "year"        "think"        
 [3,] "energy"      "million"    "company"   "million"     "company"      
 [4,] "business"    "technology" "million"   "growth"      "million"      
 [5,] "million"     "electric"   "business"  "company"     "sales"        
 [6,] "technology"  "think"      "new"       "market"      "technology"   
 [7,] "vehicle"     "products"   "think"     "years"       "years"        
 [8,] "car"         "battery"    "energy"    "inc"         "products"     
 [9,] "electric"    "vehicles"   "vehicles"  "markets"     "year"         
[10,] "products"    "share"      "year"      "information" "statements"

topics_LDA25_general <- terms(LDA25_general, 25)
# And show the results:
topics_LDA25_general

      Topic 1     Topic 2    Topic 3      Topic 4     Topic 5     Topic 6    
 [1,] "company"   "year"     "know"       "obama"     "people"    "tesla"    
 [2,] "pct"       "car"      "people"     "car"       "cars"      "electric" 
 [3,] "year"      "years"    "year"       "year"      "years"     "year"     
 [4,] "vehicles"  "ago"      "time"       "think"     "car"       "new"      
 [5,] "car"       "price"    "years"      "hes"       "billion"   "company"  
 [6,] "cars"      "people"   "big"        "today"     "think"     "cars"     
 [7,] "electric"  "month"    "cars"       "mccain"    "market"    "sent"     
 [8,] "energy"    "cars"     "street"     "first"     "year"      "state"    
 [9,] "million"   "global"   "john"       "electric"  "time"      "inc"      
[10,] "tesla"     "electric" "says"       "companies" "states"    "pct"      

      Topic 7         Topic 8    Topic 9     Topic 10   Topic 11     Topic 12  
 [1,] "contact"       "total"    "deadline"  "energy"   "electric"   "year"    
 [2,] "location"      "new"      "dec"       "price"    "year"       "total"   
 [3,] "holds"         "price"    "new"       "day"      "assets"     "electric"
 [4,] "new"           "day"      "car"       "company"  "new"        "price"   
 [5,] "national"      "year"     "cars"      "month"    "energy"     "index"   
 [6,] "president"     "month"    "vehicles"  "new"      "know"       "million" 
 [7,] "house"         "energy"   "nov"       "shares"   "state"      "market"  
 [8,] "conference"    "company"  "gas"       "stock"    "really"     "week"    
 [9,] "director"      "assets"   "notified"  "assets"   "day"        "time"    
[10,] "street"        "market"   "hybrid"    "capital"  "million"    "times"    

      Topic 13     Topic 14    Topic 15    Topic 16     Topic 17    Topic 18    
 [1,] "new"        "new"       "new"       "new"        "price"     "new"       
 [2,] "energy"     "car"       "day"       "car"        "stock"     "million"   
 [3,] "electric"   "sent"      "market"    "percent"    "company"   "electric"  
 [4,] "company"    "moved"     "price"     "tesla"      "new"       "percent"   
 [5,] "contact"    "vehicles"  "president" "inc"        "total"     "cars"      
 [6,] "government" "says"      "times"     "people"     "shares"    "first"     
 [7,] "vehicles"   "years"     "total"     "company"    "average"   "energy"    
 [8,] "assets"     "photos"    "electric"  "electric"   "times"     "general"   
 [9,] "month"      "first"     "average"   "make"       "day"       "time"      
[10,] "national"   "percent"   "company"   "model"      "volume"    "company"   

      Topic 19    Topic 20     Topic 21    Topic 22      Topic 23   Topic 24    
 [1,] "new"       "obama"      "electric"  "month"       "years"    "day"       
 [2,] "energy"    "dont"       "new"       "price"       "people"   "price"     
 [3,] "people"    "auto"       "vehicles"  "market"      "say"      "year"      
 [4,] "oil"       "know"       "company"   "stock"       "new"      "total"     
 [5,] "house"     "new"        "president" "year"        "think"    "market"    
 [6,] "john"      "president"  "total"     "assets"      "oil"      "electric"  
 [7,] "cars"      "energy"     "year"      "months"      "know"     "stocks"    
 [8,] "think"     "first"      "price"     "energy"      "mccain"   "index"     
 [9,] "way"       "make"       "power"     "week"        "states"   "average"   
[10,] "time"      "car"        "today"     "average"     "back"     "ratio"     

      Topic 25   
 [1,] "today"    
 [2,] "price"    
 [3,] "company"  
 [4,] "president"
 [5,] "cars"     
 [6,] "know"     
 [7,] "car"      
 [8,] "stocks"   
 [9,] "percent"  
[10,] "director"

#########################################
# 50 topics
topics_LDA50_pr <- terms(LDA50_pr, 50)
# And show the results:
topics_LDA50_pr

      Topic 1       Topic 2        Topic 3          Topic 4        Topic 5      
 [1,] "million"     "energy"       "market"         "market"       "market"     
 [2,] "company"     "new"          "year"           "energy"       "year"       
 [3,] "new"         "think"        "new"            "sales"        "business"   
 [4,] "electric"    "market"       "business"       "year"         "sales"      
 [5,] "power"       "power"        "million"        "million"      "company"        

      Topic 6        Topic 7          Topic 8          Topic 9       Topic 10    
 [1,] "new"          "think"          "new"            "market"      "year"      
 [2,] "year"         "year"           "business"       "electric"    "think"     
 [3,] "company"      "business"       "energy"         "energy"      "quarter"   
 [4,] "power"        "energy"         "electric"       "think"       "business"  
 [5,] "think"        "new"            "growth"         "quarter"     "company"   

      Topic 11       Topic 12       Topic 13         Topic 14        
 [1,] "market"       "power"        "market"         "electric"      
 [2,] "quarter"      "vehicles"     "new"            "new"           
 [3,] "sales"        "year"         "business"       "energy"        
 [4,] "new"          "company"      "million"        "think"         
 [5,] "company"      "technology"   "sales"          "million"              

      Topic 15       Topic 16         Topic 17      Topic 18      Topic 19     
 [1,] "million"      "new"            "market"      "million"     "electric"   
 [2,] "year"         "market"         "electric"    "year"        "energy"     
 [3,] "sales"        "energy"         "growth"      "electric"    "company"    
 [4,] "power"        "think"          "sales"       "energy"      "million"    
 [5,] "vehicle"      "company"        "quarter"     "quarter"     "global"      

      Topic 20         Topic 21       Topic 22        Topic 23      Topic 24     
 [1,] "new"            "market"       "business"      "electric"    "new"        
 [2,] "year"           "million"      "think"         "market"      "year"       
 [3,] "energy"         "business"     "market"        "company"     "million"    
 [4,] "sales"          "year"         "million"       "business"    "years"      
 [5,] "vehicles"       "energy"       "vehicles"      "think"       "company"    

      Topic 25        Topic 26         Topic 27         Topic 28        
 [1,] "market"        "new"            "million"        "million"       
 [2,] "company"       "electric"       "electric"       "technology"    
 [3,] "sales"         "year"           "think"          "business"      
 [4,] "technology"    "business"       "year"           "new"           
 [5,] "think"         "power"          "markets"        "company"              

      Topic 29         Topic 30       Topic 31     Topic 32       Topic 33    
 [1,] "million"        "think"        "energy"     "vehicle"      "think"     
 [2,] "quarter"        "electric"     "million"    "electric"     "quarter"   
 [3,] "market"         "business"     "electric"   "quarter"      "new"       
 [4,] "year"           "energy"       "business"   "year"         "million"   
 [5,] "power"          "vehicle"      "quarter"    "new"          "vehicles"    

      Topic 34         Topic 35         Topic 36      Topic 37        
 [1,] "electric"       "company"        "energy"      "year"          
 [2,] "year"           "quarter"        "million"     "new"           
 [3,] "first"          "market"         "think"       "million"       
 [4,] "market"         "statements"     "vehicles"    "energy"        
 [5,] "information"    "year"           "business"    "business"        

      Topic 38       Topic 39         Topic 40         Topic 41      
 [1,] "market"       "business"       "time"           "vehicles"    
 [2,] "new"          "electric"       "sales"          "quarter"     
 [3,] "year"         "year"           "first"          "power"       
 [4,] "company"      "market"         "year"           "energy"      
 [5,] "sales"        "million"        "vehicles"       "million"     

      Topic 42         Topic 43         Topic 44       Topic 45      Topic 46      
 [1,] "business"       "market"         "market"       "market"      "business"    
 [2,] "markets"        "million"        "company"      "company"     "market"      
 [3,] "company"        "energy"         "energy"       "year"        "sales"       
 [4,] "growth"         "power"          "year"         "think"       "electric"    
 [5,] "million"        "quarter"        "vehicles"     "energy"      "energy"            

      Topic 47      Topic 48         Topic 49        Topic 50        
 [1,] "market"      "electric"       "market"        "business"      
 [2,] "business"    "energy"         "year"          "quarter"       
 [3,] "first"       "million"        "years"         "company"       
 [4,] "quarter"     "market"         "company"       "new"           
 [5,] "company"     "think"          "think"         "year"

topics_LDA50_general <- terms(LDA50_general, 50)
# And show the results:
topics_LDA50_general

       Topic 1      Topic 2      Topic 3         Topic 4      Topic 5        
 [1,] "cars"       "year"       "year"          "obama"      "people"       
 [2,] "year"       "car"        "obama"         "think"      "president"    
 [3,] "vehicles"   "cars"       "know"          "today"      "think"        
 [4,] "company"    "electric"   "cars"          "electric"   "energy"       
 [5,] "electric"   "month"      "people"        "car"        "know"                

      Topic 6      Topic 7         Topic 8      Topic 9       Topic 10     
 [1,] "tesla"      "president"     "new"        "deadline"    "new"        
 [2,] "company"    "contact"       "total"      "new"         "price"      
 [3,] "electric"   "location"      "price"      "car"         "energy"     
 [4,] "state"      "holds"         "day"        "cars"        "company"    
 [5,] "cars"       "electric"      "year"       "notified"    "day"              

      Topic 11     Topic 12     Topic 13     Topic 14     Topic 15     
 [1,] "electric"   "electric"   "electric"   "company"    "new"        
 [2,] "year"       "total"      "new"        "assets"     "day"        
 [3,] "new"        "year"       "company"    "new"        "sales"      
 [4,] "energy"     "price"      "energy"     "billion"    "market"     
 [5,] "know"       "times"      "vehicles"   "vehicles"   "electric"    

      Topic 16     Topic 17        Topic 18     Topic 19    Topic 20    
 [1,] "new"        "times"         "new"        "new"       "obama"     
 [2,] "pct"        "price"         "electric"   "oil"       "new"       
 [3,] "inc"        "average"       "cars"       "energy"    "know"      
 [4,] "company"    "shares"        "million"    "think"     "auto"      
 [5,] "percent"    "stock"         "company"    "people"    "make"           

      Topic 21    Topic 22      Topic 23     Topic 24     Topic 25     Topic 26    
 [1,] "electric"  "month"       "think"      "price"      "cars"       "vehicles"  
 [2,] "new"       "market"      "say"        "day"        "today"      "cars"      
 [3,] "president" "price"       "new"        "electric"   "president"  "say"       
 [4,] "price"     "week"        "people"     "market"     "company"    "energy"    
 [5,] "year"      "stock"       "know"       "average"    "price"      "first"         

      Topic 27      Topic 28     Topic 29     Topic 30     Topic 31       
 [1,] "price"       "mccain"     "year"       "market"     "price"        
 [2,] "day"         "know"       "electric"   "new"        "electric"     
 [3,] "year"        "two"        "new"        "stock"      "obama"        
 [4,] "month"       "years"      "cars"       "energy"     "president"    
 [5,] "stock"       "year"       "house"      "month"      "first"            

      Topic 32     Topic 33     Topic 34        Topic 35     Topic 36    
 [1,] "electric"   "vehicles"   "street"        "year"       "new"       
 [2,] "year"       "company"    "contact"       "new"        "car"       
 [3,] "market"     "energy"     "new"           "president"  "sent"      
 [4,] "assets"     "people"     "people"        "company"    "million"   
 [5,] "shares"     "united"     "president"     "million"    "tesla"       

      Topic 37     Topic 38     Topic 39        Topic 40      Topic 41       
 [1,] "year"       "price"      "house"         "day"         "new"          
 [2,] "total"      "total"      "contact"       "ratio"       "location"     
 [3,] "price"      "company"    "car"           "usd"         "american"     
 [4,] "usd"        "stock"      "year"          "week"        "contact"      
 [5,] "company"    "year"       "cars"          "market"      "first"            

      Topic 42        Topic 43      Topic 44        Topic 45     Topic 46    
 [1,] "years"         "new"         "contact"       "year"       "million"   
 [2,] "president"     "sent"        "car"           "price"      "percent"   
 [3,] "state"         "year"        "electric"      "president"  "electric"  
 [4,] "new"           "vehicles"    "pct"           "global"     "car"       
 [5,] "tesla"         "first"       "cars"          "day"        "company"     

      Topic 47     Topic 48        Topic 49        Topic 50    
 [1,] "year"       "year"          "new"           "year"      
 [2,] "times"      "new"           "year"          "electric"  
 [3,] "new"        "car"           "energy"        "energy"    
 [4,] "day"        "federal"       "company"       "total"     
 [5,] "years"      "time"          "state"         "know"

Getting more precise (document-topic / term-topic) matrices to be used in rendering

# Document-topic matrices
documents_LDA10_pr <- as.data.frame(LDA10_pr@gamma) 
documents_LDA10_general <- as.data.frame(LDA10_general@gamma) 
documents_LDA25_pr <- as.data.frame(LDA25_pr@gamma) 
documents_LDA25_general <- as.data.frame(LDA25_general@gamma) 
documents_LDA50_pr <- as.data.frame(LDA50_pr@gamma) 
documents_LDA50_general <- as.data.frame(LDA50_general@gamma) 

#########################################
# Term-topic matrices
terms_LDA10_pr <- posterior(LDA10_pr)[["terms"]]
terms_LDA10_general <- posterior(LDA10_general)[["terms"]]
terms_LDA25_pr <- posterior(LDA25_pr)[["terms"]]
terms_LDA25_general <- posterior(LDA25_general)[["terms"]]
terms_LDA50_pr <- posterior(LDA50_pr)[["terms"]]
terms_LDA50_general <- posterior(LDA50_general)[["terms"]]

#########################################
# Write tables
write.table(documents_LDA10_pr, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\documents_10_pr.csv", sep=',',row.names = FALSE)
write.table(documents_LDA10_general, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\documents_10_general.csv", sep=',',row.names = FALSE)
write.table(documents_LDA25_pr, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\documents_25_pr.csv", sep=',',row.names = FALSE)
write.table(documents_LDA25_general, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\documents_25_general.csv", sep=',',row.names = FALSE)
write.table(documents_LDA50_pr, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\documents_50_pr.csv", sep=',',row.names = FALSE)
write.table(documents_LDA50_general, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\documents_50_general.csv", sep=',',row.names = FALSE)

write.table(terms_LDA10_pr, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\terms_10_pr.csv", sep=',',row.names = FALSE)
write.table(terms_LDA10_general, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\terms_10_general.csv", sep=',',row.names = FALSE)
write.table(terms_LDA25_pr, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\terms_25_pr.csv", sep=',',row.names = FALSE)
write.table(terms_LDA25_general, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\terms_25_general.csv", sep=',',row.names = FALSE)
write.table(terms_LDA50_pr, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\terms_50_pr.csv", sep=',',row.names = FALSE)
write.table(terms_LDA50_general, file = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\terms_50_general.csv", sep=',',row.names = FALSE)

Creating topic networks (based on term correlations)

# Get the term-topic matrices again
post_10_pr <- topicmodels::posterior(LDA10_pr)
post_10_general <- topicmodels::posterior(LDA10_general)
post_25_pr <- topicmodels::posterior(LDA25_pr)
post_25_general <- topicmodels::posterior(LDA25_general)
post_50_pr <- topicmodels::posterior(LDA50_pr)
post_50_general <- topicmodels::posterior(LDA50_general)

# Create correlation matrices between the topics based on their term-loadings
cor_mat_10_pr <- cor(t(post_10_pr[["terms"]]))
cor_mat_10_general <- cor(t(post_10_general[["terms"]]))
cor_mat_25_pr <- cor(t(post_25_pr[["terms"]]))
cor_mat_25_general <- cor(t(post_25_general[["terms"]]))
cor_mat_50_pr <- cor(t(post_50_pr[["terms"]]))
cor_mat_50_general <- cor(t(post_50_general[["terms"]]))


# Change row values to zero if less than row minimum plus two times the row standard deviation
# The two times was chosen for illustrative purposes. 
# This is similar to how Jockers subsets the distance matrix to keep only 
# closely related documents and avoid a dense spagetti diagram 
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
cor_mat_10_pr[ sweep(cor_mat_10_pr, 1, (apply(cor_mat_10_pr,1,min) + 2*apply(cor_mat_10_pr,1,sd) )) < 0 ] <- 0
diag(cor_mat_10_pr) <- 0
cor_mat_10_general[ sweep(cor_mat_10_general, 1, (apply(cor_mat_10_general,1,min) + 2*apply(cor_mat_10_general,1,sd) )) < 0 ] <- 0
diag(cor_mat_10_general) <- 0
cor_mat_25_pr[ sweep(cor_mat_25_pr, 1, (apply(cor_mat_25_pr,1,min) + 2*apply(cor_mat_25_pr,1,sd) )) < 0 ] <- 0
diag(cor_mat_25_pr) <- 0
cor_mat_25_general[ sweep(cor_mat_25_general, 1, (apply(cor_mat_25_general,1,min) + 2*apply(cor_mat_25_general,1,sd) )) < 0 ] <- 0
diag(cor_mat_25_general) <- 0
cor_mat_50_pr[ sweep(cor_mat_50_pr, 1, (apply(cor_mat_50_pr,1,min) + 2*apply(cor_mat_50_pr,1,sd) )) < 0 ] <- 0
diag(cor_mat_50_pr) <- 0
cor_mat_50_general[ sweep(cor_mat_50_general, 1, (apply(cor_mat_50_general,1,min) + 2*apply(cor_mat_50_general,1,sd) )) < 0 ] <- 0
diag(cor_mat_50_general) <- 0


# And create graphs:
g_10_pr <- igraph::graph.adjacency(cor_mat_10_pr, weighted = TRUE , mode= 'undirected')
plot(g_10_pr, layout = layout_with_kk, 
     edge.width=E(g_10_pr)$weight, 
     vertex.label.family = "sans",vertex.label.font=2, vertex.color = "white", vertex.label.color = "black")

g_10_general <- igraph::graph.adjacency(cor_mat_10_general, weighted = TRUE , mode= 'undirected')
plot(g_10_general, layout = layout_with_kk, 
     edge.width=E(g_10_general)$weight, 
     vertex.label.family = "sans",vertex.label.font=2, vertex.color = "white", vertex.label.color = "black")

g_25_pr <- igraph::graph.adjacency(cor_mat_25_pr, weighted = TRUE , mode= 'undirected')
plot(g_25_pr, layout = layout_with_kk, 
     edge.width=E(g_25_pr)$weight, 
     vertex.label.family = "sans",vertex.label.font=2, vertex.color = "white", vertex.label.color = "black")

g_25_general <- igraph::graph.adjacency(cor_mat_25_general, weighted = TRUE , mode= 'undirected')
plot(g_25_general, layout = layout_with_kk, 
     edge.width=E(g_25_general)$weight, 
     vertex.label.family = "sans",vertex.label.font=2, vertex.color = "white", vertex.label.color = "black")

g_50_pr <- igraph::graph.adjacency(cor_mat_50_pr, weighted = TRUE , mode= 'undirected')
plot(g_50_pr, layout = layout_with_kk, 
     edge.width=E(g_50_pr)$weight, 
     vertex.label.family = "sans",vertex.label.font=2, vertex.color = "white", vertex.label.color = "black")

g_50_general <- igraph::graph.adjacency(cor_mat_50_general, weighted = TRUE , mode= 'undirected')
plot(g_50_general, layout = layout_with_kk, 
     edge.width=E(g_50_general)$weight, 
     vertex.label.family = "sans",vertex.label.font=2, vertex.color = "white", vertex.label.color = "black")

Using LDAVis to visualize topics

# We need to create a function that allows converting LDA output from the 'topicmodels' package to LDAVis
# https://gist.github.com/trinker/477d7ae65ff6ca73cace
#' Transform Model Output for Use with the LDAvis Package
topicmodels2LDAvis <- function(x, ...){
  post <- topicmodels::posterior(x)
  if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
  mat <- x@wordassignments
  LDAvis::createJSON(
    phi = post[["terms"]], 
    theta = post[["topics"]],
    vocab = colnames(post[["terms"]]),
    doc.length = slam::row_sums(mat, na.rm = TRUE),
    term.frequency = slam::col_sums(mat, na.rm = TRUE)
  )
}

LDAvis::serVis(topicmodels2LDAvis(LDA10_pr), out.dir = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\LDAVis\\vis_10_pr", open.browser = FALSE)

LDAvis::serVis(topicmodels2LDAvis(LDA10_general), out.dir = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\LDAVis\\vis_10_general", open.browser = FALSE)

LDAvis::serVis(topicmodels2LDAvis(LDA25_pr), out.dir = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\LDAVis\\vis_25_pr", open.browser = FALSE)

LDAvis::serVis(topicmodels2LDAvis(LDA25_general), out.dir = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\LDAVis\\vis_25_general", open.browser = FALSE)

LDAvis::serVis(topicmodels2LDAvis(LDA50_pr), out.dir = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\LDAVis\\vis_50_pr", open.browser = FALSE)

LDAvis::serVis(topicmodels2LDAvis(LDA50_general), out.dir = "C:\\Users\\rfjha\\Documents\\GitHub\\topicmodeling\\Output\\2018\\LDAVis\\vis_50_general", open.browser = FALSE)

# The following URL points to the data with all the output
load(url("https://github.com/RFJHaans/topicmodeling/blob/master/Data/2018/Data_LDA.RData?raw=true"))

Please see this paper for more information on this visualization (Sievert and Shirley, 2014).

You can access interactive versions of the images below by clicking the figure title (e.g. “10-topic PR”).

The following URL points to the data with all the output

load(url("https://github.com/RFJHaans/topicmodeling/blob/master/Data/2018/Data_LDA.RData?raw=true"))