Creating a word cloud using R

In this post we will show you how to create a word cloud in R with the help of the text mining (tm) and wordcloud packages.

What is a word cloud?

A word cloud is a data visualization technique. It is basically an image made up of words. When you face a huge chunk of text and have no idea where to begin, a word cloud is your savior. It’s a great way to communicate valuable information at a glance when the raw data is text based.

When you have pages and pages of text data, a word cloud helps by pointing out which areas to pay attention to first.

In a word cloud, the more often a word occurs, the larger it is drawn. Many layouts also place the most frequent words near the center of the cloud.

Most research questionnaires contain at least one open-ended question. When you gather hundreds or thousands of responses to such questions, a word cloud gives you a quick, frequency-based view of the answers. Since stop words, punctuation and similar noise are removed before plotting, only the meaningful words are visualized.

Real-world use cases of word clouds

When companies run customer satisfaction surveys, word clouds help identify areas of customer dissatisfaction and provide insights into what should be improved. Combined with other analytic tools, a word cloud becomes even more useful. Employee satisfaction surveys also use word clouds to visualize employee sentiment at a glance.

Search Engine Optimization is a major domain where word clouds are useful. SEO marketing agencies can use them to review target keywords, as they instantly show the most prominent words in your content.

Marketing and political campaign agencies use word clouds to understand how people think. When you generate a word cloud for a public speech delivered by a politician or a businessman, it helps reveal the hidden implications and the underlying thinking pattern of the speaker, and so gives a considerable understanding of their intentions. The agencies can then target specific areas to trigger the sentiments of their audience.

Generating a word cloud using R

We use the 2017 Harvard Commencement address by Mark Zuckerberg to generate the word cloud. You can download that txt file here.

We need 3 packages for this, namely the text mining (tm), RColorBrewer and wordcloud packages.

First let’s load these 3 libraries into R.

library(tm)
library(RColorBrewer)
library(wordcloud)

Import the text file into R

mz<-readLines("mark_harward.txt")
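It helps to know what readLines gives back before building the corpus: a character vector with one element per line of the file. Here is a small sketch using a temporary file with made-up sample text (not the actual speech); the names sample_path and sample_text are ours, for illustration only.

```r
# Write two made-up sample lines to a temporary file (hypothetical content).
sample_path <- tempfile(fileext = ".txt")
writeLines(c("First line of the speech.", "Second line of the speech."), sample_path)

# readLines() returns a character vector with one element per line.
sample_text <- readLines(sample_path)
length(sample_text)
sample_text[1]
```

Each element of that vector becomes one document in the corpus when it is passed through VectorSource.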

Now let’s create a corpus of words

mzcorpus<-Corpus(VectorSource(mz))

You can view the corpus using the following code

inspect(mzcorpus)

Our next step is to pre-process the data. We need to replace characters like /, @ and | with white spaces.

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
mzcorpus <- tm_map(mzcorpus, toSpace, "/")
mzcorpus <- tm_map(mzcorpus, toSpace, "@")
mzcorpus <- tm_map(mzcorpus, toSpace, "\\|")
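To see what this replacement does without a corpus, here is a minimal sketch that applies the same gsub call to a plain string. The helper to_space and the sample line are made up for illustration. Note that | is a special character in regular expressions, which is why it is escaped as "\\|".

```r
# Same replacement that toSpace performs, applied to a plain string.
to_space <- function(x, pattern) gsub(pattern, " ", x)

# Made-up sample line containing the three characters we want to drop.
sample_line <- "email me@example.com or visit site/page|other"

cleaned <- to_space(sample_line, "/")
cleaned <- to_space(cleaned, "@")
cleaned <- to_space(cleaned, "\\|")  # "|" must be escaped in a regex
cleaned
```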

Now we need to clean our data. That means removing unnecessary white spaces, stop words, punctuation, numbers etc. Whether to remove numbers depends on the nature of your text, and sometimes we have to keep them in the data set. In this case, the numbers carry no meaning for us, so we remove them.

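mzclean <- tm_map(mzcorpus, content_transformer(tolower))  # lower case
mzclean <- tm_map(mzclean, removeNumbers)
mzclean <- tm_map(mzclean, removePunctuation)
mzclean <- tm_map(mzclean, removeWords, stopwords("english"))
mzclean <- tm_map(mzclean, stripWhitespace)  # collapse extra white space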
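If you want to see roughly what these cleaning steps do to a single line, the sketch below mimics them with base R only. It is an approximation, not how tm implements them: the sample sentence is made up, and tiny_stopwords is a hypothetical four-word list, far shorter than stopwords("english").

```r
# Made-up sample sentence and a tiny, hypothetical stop-word list.
sample_line <- "The 25 Graduates heard a speech about Purpose!"
tiny_stopwords <- c("the", "a", "about")

x <- tolower(sample_line)                  # lower case
x <- gsub("[[:digit:]]+", "", x)           # roughly like removeNumbers
x <- gsub("[[:punct:]]+", "", x)           # roughly like removePunctuation
words <- strsplit(x, "[[:space:]]+")[[1]]  # split into words on white space
# Drop empty strings and stop words (roughly like removeWords).
words <- words[nzchar(words) & !(words %in% tiny_stopwords)]
words
```

Only the words that carry meaning are left, and those are the ones that end up in the cloud.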

Now that we have cleaned our data, let’s generate the word cloud.

wordcloud(mzclean, min.freq = 5, colors = brewer.pal(8, "Set2"), random.order = FALSE)

In the above code, the “min.freq” argument sets the minimum number of times a word must occur to appear in the word cloud. In this case, a word that does not appear at least 5 times in the text will not appear in our word cloud. For larger texts, it is better to set this to 5 or above. The “colors” argument simply adds beauty using a palette from the color brewer package. We set the “random.order” argument to FALSE because we want our word cloud to be ordered according to the frequency of words. That is, words that appear most frequently in the text will be located at the center of the cloud.
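The frequency threshold can be illustrated with a few lines of base R. This is only a sketch of the idea, using a made-up vector of already-cleaned words and a hypothetical threshold variable min_freq. It is not how the wordcloud package computes things internally.

```r
# Made-up, already-cleaned words (hypothetical data, not from the speech).
words <- c("purpose", "world", "purpose", "people", "world", "purpose", "build")

# Count occurrences and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)

# Keep only the words that meet the threshold, as min.freq does.
min_freq <- 2
kept <- names(freq[freq >= min_freq])
kept
```

Here "purpose" occurs three times and "world" twice, so only those two pass a threshold of 2, while "people" and "build" are dropped.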

Here is our word cloud