Clustering and Topic Analysis
CS 5604 Final Presentation

December 12, 2017
Virginia Tech, Blacksburg VA 24061

Global Event and Trend Archive Research (GETAR) project, supported by NSF IIS-1619028
Integrated Digital Event Archiving and Library (IDEAL) project supported by NSF IIS-1319578
CS5604 GTA Liuqing Li, Instructor Edward A. Fox

Acknowledgments

Team Members
Ashish Baghudana
Shruti Shetty
Aman Ahuja
Ashish Malpani
Rammohan Chintha
Mo Yang
Pavan Bellam
Prathyush Sambaturu
Topic Analysis
Clustering

Overview
Use topic analysis and clustering algorithms to find sub-themes and similar patterns in collections of webpages and tweets about real world events
Pull and clean documents
Topic Analysis
Clustering
Store results in HBase


Overview
Classify Documents into Real World Events
Topic Analysis: LDA
Clustering: K-means
Topic Names
Topic Probability
Cluster Name
Cluster Probability
HBase
Preprocessing

Topic Analysis

A (really) brief overview of topic models
Topic models discover “abstract” topics in a collection of documents where a topic is a group of “semantically similar” words.
Mainly latent Dirichlet allocation (LDA) or its variants
Probabilistic algorithm
Optimized via Gibbs Sampling or Expectation Maximization
LDA assumes documents a mixture over topics and topics a mixture over words

Tech Stack - Python++


Database Access
Topic Modeling
 Data Extraction & Visualization
NLTK
pyLDAvis
happybase
Data retrieval and upload pipelines written using the Python library happybase.
Preprocess tokenized data using NLTK for stopword and punctuation removal. Allow custom pipelines. Use multi-core support from gensim for LDA.
Extract topics for each document and visualize topic clusters for each collection 
An alternate stack includes PySpark for LDA implementation
Alternate Tech Stacks
Scala + Spark + MLLib is an alternate stack
Might scale better for larger datasets
However, lacks the visualization and evaluation mechanisms that are built into gensim and pyLDAvis
Gotchas
PySpark + Spark + MLLib 
does not implement the class DistributedLDAModel 
LDAModel does not implement several important methods such as: topTopicsPerDocument(), logPerplexity() and topicCoherence()
Python + Gensim is best suited (multi-core support really helps!)

Preprocessing Pipeline (where the magic happens)
Built a fast and customizable preprocessing pipeline. 

Collection Filter - Customizing for each collection
Solar Eclipse 2017 
Hurricane Irma
Certain words occur repeatedly in each document and can be regarded as stop words for that collection

Developer Manual
Topic Modeling is end-to-end integrated as a single script with several built-in options

usage: lda.py [-h] -c COLLECTION_NAME -t TOPICS [TOPICS ...] [-p] [--table_name TABLE_NAME] (-hb | -f FILE) [-l LOGS] [-a ALPHA] [-b BETA] [-i ITER] [--save_dir SAVE_DIR] [--tokenizer TOKENIZER] [--mappers MAPPERS [MAPPERS ...]] [--filters FILTERS [FILTERS ...]] [--filter_words FILTER_WORDS]

Run LDA on a given collection

required arguments:
  -c COLLECTION_NAME, --collection_name COLLECTION_NAME
                    	Collection name
  -t TOPICS [TOPICS ...], --topics TOPICS [TOPICS ...]
                    	Number of topics to run the model on
  -p, --preprocess  	Preprocess data
  --table_name TABLE_NAME
                    	Table name for HBase
  -hb, --hbase      	Get collection from HBase
  -f FILE, --file FILE  File name for tokens
preprocessing arguments to be added when using -p flag:
  --tokenizer TOKENIZER
                    	Tokenizer to use
  --mappers MAPPERS [MAPPERS ...]
                    	Mappers to use
  --filters FILTERS [FILTERS ...]
                    	Filters to use
  --filter_words FILTER_WORDS
                    	Filename with words to filter out

optional arguments:
  -h, --help        	show this help message and exit
  -l LOGS, --logs LOGS  Log directory
  -a ALPHA, --alpha ALPHA
                    	Alpha hyperparameter
  -b BETA, --beta BETA  Beta hyperparameter
  -i ITER, --iter ITER  Number of iterations
  --save_dir SAVE_DIR   Save directory for topic models

Choosing the Right Number of Topics
Solar Eclipse Webpages
Hurricane Irma Webpages

Visualizing Topics

Naming Topics
Once we finalize on the number of topics, each “topic” has to be named. We follow two strategies:
Automatic Naming
Choose the first non-repeating word in the topic-word distribution (sorted by probability)
Suitable for large number of topics
Manual Naming
Look at top words manually to decide topic name
Suitable for small number of topics

Solar Eclipse: Tweets
These results aren’t perfect however. Despite strict filtering, we see words like trumpresign and president in the list.

Computational Complexity
* Models trained with 500 iterations and 10 topics each
+ Preprocessing is faster on local machines because of SSDs

HBase Fields
We populate:
topic:topic-list
topic:probability-list
topic:display-topicnames
How these are used:
The front-end team uses display-topicnames as a facet in their interface 
It is also possible to use probability scores for recommending similar documents

Shortcomings and Improvements
Automatic elimination of collection-specific stop-words
Currently requires manual curation
Crowd experiments to decide topic names and coherence
Topic names decided either naively or based on the experimenter’s judgement
Better to use the class strength to crowdsource annotations
Topic modeling visualizations can be part of front end
pyLDAvis provides visualizations of the documents in a cluster via a MDS algorithm
Could be part of the faceted search
Joint models for Tweets and Webpages

Clustering

A brief overview of Clustering
Clustering categorize data into clusters such that objects grouped in same cluster are similar to each other according to specific metrics
K-means Algorithm
Elbow method to find number of K
Clustering based on cosine similarity


Tech Stack	
DataBase Access:
Data retrieval and upload pipelines written using the Python library happybase.

Clustering:
Preprocess tokenized data using NLTK for stopword and punctuation removal. 
Scala and Spark for used for Clustering

Result Evaluation & Visualization:

Python and matplotlib are used for calculating inter-/intra-cluster similarity and plotting the results


happybase

Developer Manual 
Create a Scala Package using 
sbt-package
And submit the generated jar to Spark
Spark-submit <jar file> <input file>
For large datasets,performance metrics like driver memory can be enhanced

Clustering Results
“Solar Eclipse” Webpages
“Solar Eclipse” Tweets


Similarity Analysis
“Solar Eclipse” Webpages
Average intra-cluster similarity: 0.33
Average inter-cluster similarity: 0.08

Naming Clusters by Frequent words 
K-means just returns clusters of documents, but does not name the clusters

We can name each cluster by looking at a handful of documents in the cluster manually, which can be tedious in Big Data scenario

Therefore,- we determine the most frequent words in all documents within a cluster - name the cluster based on a few very frequent words in the cluster (Assumption: The frequent words in the cluster describe the information in the documents within  the cluster) 

Clusters in “Solar Eclipse” Tweets and Webpages

HBase Fields
We populate:
cluster:cluster-list
cluster:cluster-probability
cluster:display-clusternames
Script and Command:
Python hbase_write_cluster.py [Cluster_Result].csv
Cluster_Result: Document_ID, Cluster_Name, Cluster_Probability 

Shortcomings and Improvements
Cluster Probability
We set it as “1” because we only did hard clustering
Comparison with other clustering algorithm
Hierarchical Clustering
Density-based Clustering 


Questions?

Code Repository: https://github.com/ashishbaghudana/cs5604
     https://github.com/yuan9/CS5604_CTA