
Semantic Keyword Research with KNIME and Social Media Data Mining – #BrightonSEO 2015

I had the opportunity to travel to the UK in April and speak at BrightonSEO, an SEO conference I’ve always admired from afar in the United States.

Needless to say, it was an incredible experience. I returned from England having connected with many of my European SEO brethren, having developed a liking for beans at breakfast, and with the word “garbage” stricken from my vocabulary and replaced with “rubbish”.

For those of you who did not make it out to see my presentation, or for those who attended and yearned for greater detail, I present to you this recap of my presentation…


Why Do Semantic Keyword Research?


Search Engines such as Google and Bing are coming to rely more heavily on semantic search technology to better understand the websites in their indices and what people mean when they search.

The prevalence of these technologies means that it is time for SEOs to adapt once again and better understand language usage and how keywords relate to each other conceptually.

The strength of the keywords’ conceptual connection may be scored for relevancy on-page, within a search query, or in a combination of the two.

Let’s do semantic keyword research!

What is Semantic Search?
The transition from strings to things.

Note: Some of these concepts are simplified to ease reader understanding.

At a very high level, the idea of semantic search is about looking past a piece of web text as a mere series of keywords and instead understanding what those keywords mean and how they relate to each other.

This helps with relevancy scoring and with better understanding and interpreting the meaning of text. It might affect how the search engines come to understand the intent of a search query or determine whether a website is a good match for that search.

There are two ways to look at data for semantic search:

Through the lens of structured data or unstructured data.

Structured data means reading user-provided markup to understand how concepts and entities might relate to each other. Schema.org markup on a page is a source of structured data on a web page that the search engines can use to understand it semantically.

Search engines may also examine web text semantically without the presence of structured data, using technologies such as Natural Language Processing and Machine Learning algorithms. They may also use data provided by pages marked up with structured data to better understand unstructured data.

In our case, we are not concerned as much with structured data and are focusing on semantic search in the context of unstructured data.

In my presentation, I gave the following example search:

“What is a mammal that has a vertebrate and lives in water?”

The search engines may break out the search this way:

semantic whale search example

And then interpret it as:

semantic whale example with connections highlights - node graph

You can try this example in Google Search. In most cases, Google produces information about whales and related animals.

whale semantic google search result

About Google Hummingbird

It is difficult to discuss the prevalence of Semantic Search without at least mentioning Google Hummingbird.

Hummingbird is, at its core, more of an infrastructure update, but it does have some new technology baked into the new engine.

Amit Singhal, the head of Google’s core ranking team, discussed some of Hummingbird’s conversational search capabilities with Danny Sullivan:

“Hummingbird is paying more attention to each word in a query, ensuring that the whole query — the whole sentence or conversation or meaning — is taken into account, rather than particular words. The goal is that pages matching the meaning do better, rather than pages matching just a few words.”

amit singhal talking about google hummingbird

It is clear that Google has incorporated more semantic search technology with the introduction of Hummingbird.

If you’d like to learn more about Google Hummingbird and how it pertains to Semantic Search, I recommend you read Gianluca Fiorelli’s Moz Post on the subject. He covers it better than anyone else in my opinion.

He boils down Google’s potential new capabilities post-Hummingbird, saying it should be able:

  1. To better understand the intent of a query;
  2. To broaden the pool of web documents that may answer that query;
  3. To simplify how it delivers information, because if query A, query B, and query C substantively mean the same thing, Google doesn’t need to propose three different SERPs, but just one;
  4. To offer a better search experience, because expanding the query and better understanding the relationships between search entities (also based on direct/indirect personalization elements), Google can now offer results that have a higher probability of satisfying the needs of the user.
  5. As a consequence, Google may present better SERPs also in terms of better ads, because in 99% of the cases, verbose queries were not presenting ads in their SERPs before Hummingbird.

How Can SEOs Optimize for Semantic Search?

At a high level, you want to make sure that you are creating high-quality content that delights your users, paying close attention to searcher intent. Mapping content to personas and categorizing keywords as navigational, transactional, or informational may also help with this endeavor.

matt cutts - great content

“Now this is great content”

After that, keywords on your website should be semantically related, though not necessarily at the page level. To help with this, start thinking about your website in terms of related topical buckets. One of your goals should be to have the search engines perceive your site as an authority for each one of those topical buckets.

Your website should be able to be broken down into one or more broad interrelated topics, likely representable by short tail keywords. Each one of these topics can be thought of like a bucket.

seo topical bucket visualization

Within each of those buckets reside sub-concepts or keywords, often long-tail keywords, but not necessarily (represented as the red balls above). They should relate to the other topic buckets on your website, but even more so to the bucket they are contained inside.

Creating quality content that represents those sub-concepts and earns links helps build topical authority for its bucket and for the website overall.

When creating your buckets, it is helpful to have an exceptional understanding of consumer language and the myriad ways that users may search in relation to your website’s topic.

At a bare minimum, you need to understand the following language search perspectives:

  1. What are consumers searching for when they are familiar with your topic?

    • Language used should represent your core keywords.

  2. What are consumers searching for when they are not familiar with your topic?

    • Language tends to be more conversational. You may uncover additional related terms when exploring your topic from this perspective.

  3. What else do these two groups search for typically?

    • These searches may be directly and/or indirectly related to your topic.

Looking at your topic like this will help form a foundation of keywords that we will use in our topic buckets and expand upon in our semantic keyword research. Later on, we’ll examine the semantic relationship between these keywords using data visualization to simplify the selection process.

Why Social Media Data is an Awesome Data Source for Semantic Keyword Research

When conducting keyword research, it is intuitive to factor in SERP data, but an incredible secondary data source is social media.

Reasons to use Social Media for Keyword Research

  1. Social data helps you expand your collection of keyword ideas, especially when it comes to newer, fresher keywords.
  2. Social Networking language is inherently conversational and can help you understand the phrasing of conversational queries.
  3. We can use social language to mimic the language of the user, which has a secondary CRO benefit.

Note: I typically focus on Twitter for this data since it has an existing infrastructure for data mining, and it is the easiest of all the social networks to work with.

Secondary Benefit, CRO: The Echo Effect

This is a bit of a tangent, but it is worth mentioning. While you are already doing social data mining, you might as well use this information to better your copywriting. Several academic studies indicate that mimicking the language of the consumer (which we will derive from Twitter text) helps to build trust and improve conversions1:

  • A study published in the International Journal of Hospitality Management demonstrated that waitresses who copied the language of a person’s order word-for-word were given higher tips on average.
  • Another study, published in the Journal of Language and Social Psychology discusses how mimicking peoples’ language can help with building likability, safety, and rapport–all aspects of effective copywriting.

Moving on…

Let’s say we’ve collected massive amounts of data. Some of that data will come from websites ranking in the SERPs for relevant keywords and some from social networks like Twitter.

What kind of simple analyses can we do to help with our semantic keyword research?

You can very easily examine that data through the lens of:

  • Co-Occurrence: How often two or more words appear alongside each other in a corpus of documents (in our case, websites and Tweets)

  • LDA (Latent Dirichlet Allocation): Helps find semantically related keywords and groups them into topical buckets.

  • TF-IDF (Term Frequency-Inverse Document Frequency): Reflects how important a keyword is to a document in a whole collection of documents.

Paul, this does not sound easy!

Well it is…with the right tool.

Introducing KNIME!

Introducing KNIME, a tool that just might change how you handle marketing automation, do data analysis, and do SEO.

What is KNIME?

KNIME is a free and open-source, visual data pipelining tool.

the data pipelining model - visualization

Click here to read an awesome explanation of data pipelining.

KNIME allows you to do things using a drag-and-drop interface that you would normally need a developer or programming background to accomplish.

It ties data-oriented tasks together and helps you easily automate:

  • Data collection through many sources
  • Data manipulation
  • Analysis
  • Visualization
  • Reporting

You can get started by downloading KNIME. I recommend downloading the version with all the free extensions (it’s a large file of ~1 GB).

Quality Visualization Will Help Make Use of the Data

Semantic relations can be difficult to incorporate into your typical Excel-based keyword research document, so KNIME will produce some data visualizations that we can use to easily process this information.

useful semantic keyword research visualizations

The most useful have been a simple color-coded word cloud (depicted left) and a node graph visualization (depicted right). I’ll come back to these later.

The Basics of KNIME

Let’s start with “nodes”, the building blocks of a KNIME project.

What is a Node?

  • Nodes are pre-built drag-and-drop boxes designed to do a single task. There are a HUGE number of pre-built nodes in KNIME that are useful for marketing and beyond.
  • KNIME nodes are combined together into “workflows” to accomplish larger, more complex tasks.
  • Nodes can be grouped together into meta nodes that can be configured in unison.
KNIME Google Analytics Node

That’s right, there are even pre-built Google Analytics nodes. KNIME can be used for bioinformatics AND marketing.

How do you add Nodes?

The KNIME interface is somewhat customizable, but typically you can find your list of nodes on the left-hand panel within the “Node Repository”.

If you installed the correct version, you should have access to hundreds already.

To use a node, it’s as simple as finding the one you want and click-and-dragging it into a workflow tab.

knime click and drag demonstration

Demonstration: How to click-and-drag a node from the “node repository” into your workflow tab.

How do you connect nodes to one another?

To connect nodes to one another, it is also a click-and-drag action.

knime input and output ports

Nodes have input and output “ports” that look like a little white triangle on the left and right sides. You click-and-drag from an output port to an input port.

how to connect nodes in knime - demonstration

Demonstration: how to connect KNIME nodes to each other.

Note: Honestly, KNIME seems intimidating at first, but it’s SUPER easy. The trickiest part is becoming familiar with which nodes are available, what they are called, and which ones can connect together. You can learn about that by reading the documentation for each node in the “Node Description” area.

Configuring the Nodes in your Workflow

Once you’ve added and connected nodes in your workflow, depending on the node, it may be necessary to change their settings.

To change a node’s setting, you simply right-click and choose “Configure”.

demonstration: how to configure a knime node

A settings dialog will pop up. Each node will have a different settings interface, but most of them are self-explanatory.

The example above is the “Table Creator” node (very useful for some quick text entry). Its settings dialog looks like a basic Microsoft Excel spreadsheet, and it functions about the same.

knime configuration dialog for the create a table node

How to Run Your Workflow

There are a few ways you can run your KNIME workflow. You can right-click and choose “Execute” to run an individual node (if you select the last node in a linear workflow, all of the previous nodes should run as well).

running an individual node in knime - a demonstration

…Or you can click the green circle with the double white arrows in the toolbar at the top to run all of the nodes in the workflow.

knime demonstration: running all nodes in a workflow

How to Extract Twitter Data with KNIME

I’ve already mentioned the merits of using Twitter as a data source for your semantic keyword research, and thankfully, this is very easy to do within KNIME. Here’s how…

Get a Twitter API Key

Head over to https://apps.twitter.com/ to register for a free Twitter API key.

Sign in with your Twitter account (you’ll have to create one if you don’t already have one).

Click “Create New App”.

get a twitter api key - create new app

Fill out the form.

twitter api form - how to fill it out

The name and description fields can say anything.

The website is necessary, but you can put any website you want. I usually put my blog URL or http://www.google.com.

Don’t worry about putting a callback URL.

After you’ve created an app, navigate to the “Keys and Access Tokens” tab.

keys and access tokens twitter api tab

Once there, grab the “Consumer Key (API Key)” and the “Consumer Secret (API Secret)”.

You’ll also need to scroll down to “Your Access Token” and click the “create my access token” button.

twitter access token button

Then jot down your “Access Token” and “Access Token Secret”.

You’re on your way…

From the “node repository”, you’ll need two different nodes:

  1. the “Twitter API Connector” node
  2. the “Twitter Search” node

You can find both of these by making use of the search box at the top of the node repository, or by navigating to:
KNIME Labs -> Twitter API

knime node search box

The two nodes connect to each other by click-and-dragging the green boxes on their ends to one another.

The KNIME Twitter API nodes

You will need to configure both of the nodes.

Within the configuration of the API Connector node, you will input the API keys and access tokens from earlier.

inputting twitter api keys into KNIME

Configuring the Twitter Search node is easy. In the Query section, type your Twitter search. Choose “recent” under “search for” and enter the number of Tweet results you would like under “number of rows”.

twitter search node configuration -  knime

Note: you can only get about 2,000 tweets at a time and are rate-limited. Check out information about Twitter’s rate limiting here. Information about the Search API can be found here.
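
If you ever want to pull the same data outside of KNIME (or sanity-check what the node returns), here is a minimal Python sketch using the tweepy library. The credentials are placeholders for the four values you collected above, and depending on your tweepy version the search method may be named api.search or api.search_tweets, so treat this as a rough sketch rather than gospel:

  import tweepy

  # Placeholders for the consumer key/secret and access token/secret from apps.twitter.com
  auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
  auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
  api = tweepy.API(auth, wait_on_rate_limit=True)  # pause automatically when rate-limited

  # Roughly what the Twitter Search node does: grab recent tweets for a query
  tweets = [status.text for status in
            tweepy.Cursor(api.search, q="night of the living dead",
                          result_type="recent").items(500)]

  print(len(tweets), "tweets collected")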

Back in KNIME, the output looks like a spreadsheet (seen if you right-click on the Twitter Search node and choose “search results” at the bottom).

twitter output menu item for preview in knime

That was stupid easy!

knime - twitter search output example

At this point, we can add on more nodes and analyze the Tweet text…

node configuration to extract and crawl links from tweets

Or we can take it a step further and extract all the links that were Tweeted, crawl those pages, and then extract the text from those pages to be analyzed!

 
Might as well do both 😉

#WINNING

SERP Data in Your KNIME Workflow

The next source of text data you should be using is, very obviously, search result data.

If I do a search for a keyword and look at the pages ranking in the top 10 results, we know that Google, using its full range of ranking factors, has determined that those are the best pages to match that query.

Using KNIME we can extract the text from those pages and use them as a seed for our keyword analysis, just like we’ve done with Twitter.

There are a number of ways to go about this…

Inputting SERP Data Manually

We can use rank-checking software like AWR that outputs either a CSV or Excel file and read it with KNIME.

KNIME has both an “XLS Reader” node and a “CSV Reader” node.

serp ranking data in knime using the csv reader node or the xls reader node

Alternatively, you can do a search for your keyword and, using something like the SERPS Redux bookmarklet, grab a list of ranking URLs and input them manually using the “Table Creator” node.

using table creator knime node to manually input SERP data

Inputting SERP Data with an API

A better way of inserting search result data into your KNIME workflow would be to use a rank checker with API access, like Authority Labs or getSTAT.

serp data from rank tracker api example slide

If you’re familiar with using APIs, then the two main nodes you will need are the “GET Resource” node and the “Read REST Representation” node (see the above slide).

Working with an API is one of the more difficult things I discussed in my presentation, so you may want to grab a developer on Elance or something to help you out with this step if you’re having difficulty.

If you’re interested, you can download an example I made using the getSTAT API for a Trial account here, but this will vary slightly with a full account or another API.

I am personally using a weird set-up where a Python script updates a Google Spreadsheet with ranking data, and KNIME extracts that information using the spreadsheet’s weird, built-in SQL-like query language. I don’t recommend doing this 😉
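
That said, if you just want to prototype the API step outside of KNIME before wiring up the GET Resource node, the pattern is a plain authenticated GET request. Here is a rough sketch with Python’s requests library; the endpoint, parameters, and response fields are entirely hypothetical, so swap in whatever your rank tracker’s API documentation actually specifies:

  import requests

  # Hypothetical endpoint and credentials; substitute your rank tracker's real API details
  API_KEY = "YOUR_API_KEY"
  url = "https://api.example-rank-tracker.com/v1/rankings"

  resp = requests.get(url,
                      params={"site": "example.com", "keyword": "mockingbird", "format": "json"},
                      headers={"Authorization": "Bearer " + API_KEY})
  resp.raise_for_status()

  # Collect the ranking URLs so they can be crawled and reduced to plain text later
  data = resp.json()
  ranking_urls = [row["url"] for row in data.get("results", [])]
  print(ranking_urls[:10])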

Extract Plain Text from Websites in KNIME

Now you have a list of URLs from Twitter and from the SERPs. The next step is to crawl those pages and get them into a plain text format.

boilerpipe api node workflow

There are a number of ways to get a webpage into a plain text format.

KNIME even has a built-in “ContentExtractor” node that makes use of the Readability API under the Palladian Community Nodes, but it doesn’t work that well in my experience.

I found that BoilerPipe, a Java library with a web API interface, works best:

http://boilerpipe-web.appspot.com/ (go ahead and give the web interface a try)

The agency that I work for is lucky enough to have a developer and we created a native KNIME node for BoilerPipe, but unfortunately I am unable to share it.

On the bright side, the free API works quite well and can be incorporated into KNIME as well. The only limitation is you might get some timeouts if you are hitting their server too hard and too frequently.

I’ve provided a meta node that makes use of the BoilerPipe web interface, which you can incorporate into your workflow:

 
The output you will get from feeding a URL into the BoilerPipe meta node will look something like this:

boilerpipe output example

It effectively extracts the main content of a page and distills it into plain text.
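
If you want to see what the meta node is doing under the hood, it is essentially just an HTTP call to the BoilerPipe web service. A rough Python equivalent is below; the url/extractor/output parameters mirror what the demo form at boilerpipe-web.appspot.com exposed at the time, so double-check them against the site before relying on this:

  import requests

  def boilerpipe_text(page_url):
      # Ask the BoilerPipe web service for the main article text of a page
      resp = requests.get("http://boilerpipe-web.appspot.com/extract",
                          params={"url": page_url,
                                  "extractor": "ArticleExtractor",
                                  "output": "text"},
                          timeout=30)
      resp.raise_for_status()
      return resp.text

  print(boilerpipe_text("https://en.wikipedia.org/wiki/Night_of_the_Living_Dead")[:500])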

Before we can do any text analysis in KNIME, we have to do a quick intermediate step and put everything into the correct format. Any text must be converted into the “Document” format in order to be collected within our document corpus for analysis.

intermediate step: the strings to documents node

To do this, we will use the “Strings to Document” node. It can be attached to any plain text data in KNIME, such as webpages we have converted to plain text using BoilerPipe or Tweet text.

From here, we can work some KNIME magic!

Useful Nodes for Text Analysis Worth Mentioning

There’s a lot of text mining and Natural Language Processing built into KNIME out of the box. I won’t have time to cover all of its capabilities, nor am I an expert.

I do, however, think it’s worth mentioning a few that come into use very frequently within the context of semantic keyword research and marketing in general.

useful text mining and NLP nodes in KNIME

Nodes with definitions:

  • The Bag of Words Creator node: a Bag of Words is “a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity”.2 It’s necessary to get text into a Bag of Words model in order to do a lot of analyses.

  • The Ngram Creator node: An N-gram is a “contiguous sequence of n items from a given sequence of text or speech”.3 If we want to examine the occurrence of various text segments or phrases, we need to look at the text in multi-word segments. To do that, we examine the text by N-grams (see the Python sketch after this list). If you’ve ever played with Google’s Ngram Viewer, you know how powerful this can be.

  • The POS tagger node: POS stands for Parts of Speech–“A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token) such as noun, verb, adjective, etc”.4 It’s really helpful to understand how language is being used to talk about your topic.

  • The OpenNLP NE tagger node: This node can be used to isolate “Named Entities” in text. I don’t go into how to use this, but the usefulness for doing semantic keyword research is apparent, since you can easily extract entities such as persons, organizations, locations, expressions of times, quantities, and monetary values.

Note: As an alternative to the OpenNLP NE tagger node, I would also consider exploring the incorporation of the AlchemyAPI AlchemyLanguage API into your workflow, for even better entity detection. It is freely available for non-commercial use with credit given. You can incorporate it pretty easily using the “REST Nodes” which can be found under the “Community Nodes” node section within KNIME.
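
To make the Bag of Words and Ngram concepts a bit more concrete, here is a rough Python sketch of what those two nodes produce, using scikit-learn’s CountVectorizer (1.x API). The KNIME nodes have their own tokenization and filtering settings, so treat this as an illustration rather than an exact equivalent:

  from sklearn.feature_extraction.text import CountVectorizer

  docs = ["Night of the Living Dead created the modern zombie genre",
          "George Romero directed Night of the Living Dead in 1968"]

  # Bag of words: per-document word counts, grammar and word order disregarded
  bow = CountVectorizer(ngram_range=(1, 1))
  counts = bow.fit_transform(docs)
  print(dict(zip(bow.get_feature_names_out(), counts.sum(axis=0).A1)))

  # N-grams: contiguous 2- and 3-word sequences, useful for multi-word keywords
  ngrams = CountVectorizer(ngram_range=(2, 3))
  ngrams.fit(docs)
  print(list(ngrams.get_feature_names_out())[:10])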

THINGS ABOUT TO GET MUCH TECHNICAL. GLORIOUS.

keyword research doge

Parts of Speech Tagging

As mentioned above, it is helpful to use parts of speech tagging to understand the language of your topic.

using knime for pos (parts of speech) tagging

Want to understand how people talk about a product that your client sells?

Drop in the information from Twitter and ranking pages from the SERP and examine the Adjectives!

This can be extended out to examine semantic triples.

semantic triples. Created by Cyrus Shepard of Moz.

Image Source: This awesome Moz post by Cyrus Shepard (also worth reading).
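
Outside of KNIME, the same adjective-hunting idea looks roughly like this in Python with NLTK (a sketch; the first two lines download the tokenizer and tagger models it needs):

  import nltk

  nltk.download("punkt")                         # tokenizer model (one-time download)
  nltk.download("averaged_perceptron_tagger")    # POS tagger model (one-time download)

  tweets = ["Night of the Living Dead is a terrifying, low-budget masterpiece",
            "The remake felt cheap and boring compared to the original"]

  adjectives = []
  for tweet in tweets:
      for word, tag in nltk.pos_tag(nltk.word_tokenize(tweet)):
          if tag.startswith("JJ"):               # JJ, JJR, JJS = adjectives
              adjectives.append(word.lower())

  print(adjectives)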

Calculate TF-IDF

I mentioned TF-IDF earlier:

TF-IDF (Term Frequency-Inverse Document Frequency) – Reflects how important a keyword is to a document in a whole collection of documents.

It’s an important number that you will need to generate in KNIME if you want to filter out junk keywords and examine only the most significant ones in your analysis.

calculate TF-IDF in KNIME

To generate TF-IDF from a Bag of Words node, you only need three nodes (see the above graphic for the configuration):

  • the TF node. Make sure it is set to “relative frequency” in the configuration.
  • the IDF node
  • the Math Formula node.

Set your Math Formula node to:

$TF rel$*$IDF$

This multiplies the relative term frequency and the inverse document frequency together to produce TF-IDF.

It’s a little more complicated if you want to calculate TF-IDF with Ngrams.

To calculate for Ngrams, you will have to set the Math Formula node to:

abs($Document frequency$*(log($${IRowCount}$$/$Corpus frequency$)))
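
If you’d like to sanity-check what those nodes are producing, here is the underlying math in a few lines of Python. This sketch assumes the plain log(N / document frequency) variant of IDF; KNIME’s IDF node and other libraries each apply slightly different smoothing, so the absolute numbers may differ while the ranking stays similar:

  import math

  docs = ["night of the living dead zombie movie".split(),
          "zombie movie remake in 3d".split(),
          "library of congress national film registry".split()]

  def tf_idf(term, doc, docs):
      tf_rel = doc.count(term) / len(doc)        # relative term frequency ("TF rel")
      df = sum(1 for d in docs if term in d)     # number of documents containing the term
      idf = math.log(len(docs) / df)             # inverse document frequency
      return tf_rel * idf

  print(tf_idf("zombie", docs[0], docs))     # appears in 2 of 3 docs, so a lower score
  print(tf_idf("registry", docs[2], docs))   # appears in 1 of 3 docs, so a higher score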

Calculating Co-Occurrence

You’ve likely heard the term co-occurrence thrown around in the SEO world.

I mentioned it earlier:

Co-Occurrence – How often two or more words appear alongside each other in a corpus of documents (in our case, websites and Tweets)

Bill Slawski has some good information on his blog. I definitely recommend you give it a read.

calculate co-occurrence with knime for seo

KNIME makes calculating co-occurrence for our SERP and Twitter data very easy. There’s a pre-built node that will handle the entire analysis.

From a Bag of Words node, you only need to add on the Term Co-Occurrence Counter node.

It will produce two columns, each with the co-occurring words, and then produce statistics about the frequency with which those terms appear alongside each other.
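
As a sketch of what the node is counting, here is a sentence-level co-occurrence count in plain Python (the KNIME node can also count at other levels, such as the neighborhood or document level):

  from collections import Counter
  from itertools import combinations

  sentences = ["george romero directed night of the living dead",
               "romero invented the modern zombie movie",
               "the zombie genre started with night of the living dead"]

  # Count how often each pair of words appears together in the same sentence
  cooccurrence = Counter()
  for sentence in sentences:
      words = sorted(set(sentence.split()))   # unique words; sorted so pairs are order-independent
      for a, b in combinations(words, 2):
          cooccurrence[(a, b)] += 1

  for pair, count in cooccurrence.most_common(5):
      print(pair, count)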

Topic Modeling in KNIME Using LDA (Latent Dirichlet Allocation)

I mentioned LDA earlier:

LDA (Latent Dirichlet Allocation) – Helps find semantically related keywords and groups them into topical buckets.

For a good explanation of how LDA works, I recommend giving this post a read, although it isn’t necessary if you understand the goal of performing an LDA analysis:

LDA is an excellent way to start looking at a large number of keywords from various sources and understand which ones relate to each other and which ones don’t.

It fits very well into the topic bucket model I explained at the beginning of my post, with each topic modeled by LDA represented as a keyword bucket.

LDA keyword analysis for SEO in KNIME

It is really easy to conduct an LDA analysis in KNIME. There is a built-in node called the Topic Extractor (Parallel LDA) node.

The only limitation of LDA is that you have to estimate how many topic buckets the model should identify.

I usually set it to the default of 10 topics, see how related the output looks, and then adjust accordingly.
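
For the curious, here is roughly what the Topic Extractor node is doing, sketched with scikit-learn’s LatentDirichletAllocation (scikit-learn 1.x). The documents and topic count are toy values; as with the KNIME node, you have to guess the number of topics up front:

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.decomposition import LatentDirichletAllocation

  docs = ["night of the living dead sequels dawn of the dead day of the dead",
          "library of congress national film registry preservation",
          "george romero zombie genre horror classic"]

  vectorizer = CountVectorizer(stop_words="english")
  dtm = vectorizer.fit_transform(docs)

  lda = LatentDirichletAllocation(n_components=3, random_state=0)   # guess 3 topic buckets
  lda.fit(dtm)

  terms = vectorizer.get_feature_names_out()
  for i, topic in enumerate(lda.components_):
      top_terms = [terms[j] for j in topic.argsort()[-5:]]           # 5 strongest terms per topic
      print("topic_%d:" % i, top_terms)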

Types of Visualizations That Are Useful for Examining TF-IDF, Co-Occurrence, and LDA

There are two main visualizations that I have found to be very useful for enhancing your boring, Excel-based keyword research template and conveying the semantic attributes of your keyword suggestions.

Word Clouds

color coded word clouds for semantic keyword visualization

If you need to examine data containing keywords and any sort of weight or frequency metric (such as TF-IDF or even perhaps co-occurrence), then ye olde word cloud is a very sensible visualization.

  1. It shows the actual keywords.
  2. Keyword weights can be displayed through the size of each word.

Furthermore, color can be applied to segment the keywords beyond frequency or weight. Using the “Color Manager” node and feeding it into the “Tag Cloud” node (the default word cloud visualization node built into KNIME), we can apply different colors to different keyword types.

So, if you are segmenting your keywords by parts of speech or entity type, then a color-coded word cloud makes a whole lot of sense.
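
If you ever want to reproduce the same kind of color-coded cloud outside of KNIME, the wordcloud Python package supports both frequency weighting and a custom color function. Here is a minimal sketch, with made-up weights and part-of-speech labels standing in for your TF-IDF and POS tagger output:

  from wordcloud import WordCloud

  # Keyword -> weight (e.g. TF-IDF) and keyword -> part-of-speech tag (made-up example values)
  weights = {"zombie": 0.9, "romero": 0.7, "terrifying": 0.5, "remake": 0.4, "registry": 0.3}
  pos_tags = {"zombie": "NN", "romero": "NN", "terrifying": "JJ", "remake": "NN", "registry": "NN"}

  def color_by_pos(word, **kwargs):
      # Adjectives red, nouns blue, everything else grey
      tag = pos_tags.get(word, "")
      if tag.startswith("JJ"):
          return "red"
      if tag.startswith("NN"):
          return "steelblue"
      return "grey"

  cloud = WordCloud(width=800, height=400, background_color="white", color_func=color_by_pos)
  cloud.generate_from_frequencies(weights)
  cloud.to_file("keyword_cloud.png")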

Network/Node Graphs

network / node graph visualization which is useful for examining semantic keyword relations

If the goal of your visualization is to demonstrate a connection between two or more elements, then a node graph is a very logical means of doing so.

For our purposes, these connections represent semantic relationships between topics via keyword trees and clusters.

Displaying keywords according to LDA or co-occurrence represents a simple example of this visualization.

When we display LDA with a node graph, we can easily see how several keywords cluster around a certain theme.

This is also true when visualizing co-occurrence in a node graph, except we pay special attention to thick, forest-like clusters.

More densely connected keywords have a greater number of co-occurrences with other keywords and may represent a stronger connection with your theme.

How certain clusters inter-connect with other clusters is also something to pay attention to when visualizing co-occurrence this way. The greater the number of inter-connected clusters, the greater the relevance, much more so than at the individual keyword level.

To create a node graph, you need three built-in KNIME nodes:

  1. The Network Creator node – You will use this to initiate node graph creation. It doesn’t do much for our purposes but create an empty node graph.
  2. The Object Inserter node – You will feed your keyword data (LDA or co-occurrence works) and your Network Creator node into this node. Configure it to define which data represent the nodes and edges of your graph (see the Python sketch after this list).
  3. The Network Viewer node – Feed the Object Inserter into this node and generate the actual visualization. You can right-click and configure to choose different clustering algorithms for an optimal visualization.
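
Those three nodes map pretty directly onto what a graph library does in code. Here is a minimal Python sketch with networkx, treating co-occurring keyword pairs as weighted edges and using a force-directed layout so densely connected keywords pull together (toy data; your pairs would come from the co-occurrence or LDA output):

  import networkx as nx
  import matplotlib.pyplot as plt

  # Co-occurring keyword pairs and their counts (edge weights); toy example values
  pairs = [("zombie", "romero", 5), ("zombie", "genre", 4),
           ("romero", "director", 3), ("registry", "congress", 2)]

  graph = nx.Graph()
  for a, b, weight in pairs:
      graph.add_edge(a, b, weight=weight)

  layout = nx.spring_layout(graph, seed=42)   # force-directed layout clusters related keywords
  nx.draw_networkx(graph, layout, node_color="lightblue", font_size=9)
  plt.axis("off")
  plt.savefig("keyword_graph.png")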

Bringing it Together

So at this point, you should have a pretty good understanding of some of the things we can do in KNIME.

We can…

  1. Search Twitter for a keyword and then collect all of the Tweet text.
  2. Search Twitter for a keyword, extract only the shared links from those Tweets, crawl those URLs and then scrape the text from them.
  3. Extract the top 10 ranking pages for a keyword and then crawl and scrape text from those pages.
  4. Isolate single word keywords and/or multi-word N-grams.
  5. Calculate TF-IDF.

From there we can…

  • Tag Parts of Speech (Nouns, Adjectives, Verbs, etc.) and display them in a Word Cloud.
  • Tag Named Entities and display them in a Word Cloud.
  • Do a Co-Occurrence Analysis and display it in a Node Graph.
  • Identify semantic topic groupings with LDA and display them in a Node Graph.

Might as well do all of the above! 😉

Doing a Similar Analysis to What Google Does…

One of the best things you can do with these capabilities is to try to implement, on a smaller and simpler scale, some of the technology that the search engines are utilizing.

For example, we can replicate some of the methodology depicted in this Google patent about “word relationships and document rankings” (there is also a good blog-post-format summary of it).

Let’s walk through it a bit.

“Perform a query at Google for a term such as “mockingbird” and take the top 1,000 or so documents that appear in the search results responding to that search.”

Google uses the top 1,000 results for its analysis. We will do a simplified version and use the methodology depicted earlier to find the top 10 results for the query “mockingbird”. We’ll reduce them to plain text using the BoilerPipe set-up I’ve discussed.

“Extract most of the terms from those documents after marking where they appear on the page, and calculate scores for each of the words based upon things such as how many times they occur in a document,…”

Use the term frequency node.

“…and how close to the beginning of the document they might be.”

I didn’t touch upon this previously, but you can either choose to ignore this or you can create a system using the “String Manipulation” node and the indexOf() function built into it.

“Perform a capitalization analysis and a part of speech analysis to determine if the terms might be nouns, proper nouns, named entities, or even nuggets of information such as sentences. These might be scored higher than verbs or other types of terms within the document. Other types of analysis might also be used to determine if a term is a named entity.”

Use the POS Tagger and the OpenNLP NE tagger (maybe also AlchemyAPI). Use some Text Processing filters to isolate the nouns and entities.

“Filter out the terms that tend to appear pretty commonly on the Web using something like a term frequency–inverse document frequency (TFIDF) score for those documents to see which terms are common. The top 20 or so terms that are above a certain threshold based upon the TFIDF analysis might be kept for a document, and the rest eliminated. These remaining terms are the most significant terms in the document.”

Calculate TF-IDF on the terms you’ve filtered and then filter them some more based on a value threshold. You can use the Rule-based Row Filter or Row Filter node to accomplish this.

“Then calculate relationships scores for the terms left over in each document. Words that interact in a document by being in close proximity to each other are said to have a relationship. A close proximity might be seen if the words appear in the same sentence, or the same paragraph, or within a certain number of sentences from each other. These are local term relationships. If one of the remaining terms has no local term relationships with any of the other terms, it is disregarded.”

Use the Term Co-Occurrence Counter node to calculate co-occurrences. Use multiple instances of the node, performing co-occurrence counts at different levels, such as the sentence, neighborhood, or document level. These can be configured within the node.

Use the row filters to filter to a certain co-occurrence threshold.

“A score for each of those documents can be generated by looking at which documents have terms in common, and among those documents with common terms, and something like a combination of the original ranking score and a document score based upon all of the term relationship scores within each document.”

Do some calculations to get at this; the Math Formula node will help you work with this data.

Throw the results into a node graph. The visual output of the node graph will help you actually use this data for keyword research purposes.
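
To tie the last few stages together, here is a compressed Python sketch: keep each document’s top terms by TF-IDF, then count sentence-level relationships among only those terms. It is a simplification of the patent’s method, with toy IDF values standing in for scores you would compute over the whole corpus:

  from collections import Counter
  from itertools import combinations

  def top_terms(sentences, idf, n=20):
      # Keep the n most significant terms in a document by TF-IDF
      words = [w for s in sentences for w in s.split()]
      counts = Counter(words)
      scores = {w: (c / len(words)) * idf.get(w, 0.0) for w, c in counts.items()}
      return set(sorted(scores, key=scores.get, reverse=True)[:n])

  def relationship_scores(sentences, keep):
      # Local term relationships: significant terms co-occurring in the same sentence
      edges = Counter()
      for s in sentences:
          words = sorted(set(s.split()) & keep)
          for a, b in combinations(words, 2):
              edges[(a, b)] += 1
      return edges

  doc = ["night of the living dead created the zombie genre",
         "george romero directed the movie"]
  idf = {"zombie": 1.1, "genre": 0.9, "romero": 1.4, "george": 1.4, "directed": 1.0,
         "movie": 0.4, "created": 1.0, "night": 0.7, "living": 0.7, "dead": 0.7}
  keep = top_terms(doc, idf, n=8)
  print(relationship_scores(doc, keep).most_common(5))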

Awesome, right!?

Using our Output Visualizations…

Let’s move on to some of the ways that the various built-in KNIME data visualizations can help us interpret our data…

Using the various visualizations, let’s examine the subject of the classic horror movie, Night of the Living Dead (1968).

doing a semantic seo analysis of keywords for night of the living dead

Parts of Speech Output

So we’ve filtered by TF-IDF and generated a word cloud.

We’ve applied colors that represent either Parts of Speech or different entity types.

Final-BrightonSEO-Paul-Shapiro-46

As we’re writing content, we can easily look at this graph and sprinkle in an extra adjective or commonly associated entity.

Or, if we’re fleshing out our keyword research document and want to expand our long-tail keywords, we can look at these words and try to combine them for more ideas.

TF-IDF + Co-Occurrence Output

As previously mentioned, we are looking for two things: 1) individual keyword clusters, and 2) keyword clusters connected to other keyword clusters.

Final-BrightonSEO-Paul-Shapiro-47

For the example above exploring “Night of the Living Dead”, we’d pay special attention to these two highlighted clusters.

highlighted co-occurrence clusters

For cluster one…

Doing a little bit of Googling, we find that the subject matter pertains to a horror movie convention that some of the cast of the movie attended, which generated some buzz. If we were writing a website about Night of the Living Dead, we might want to write a blog post about that convention.

For cluster two…

This cluster amalgamates multiple smaller clusters and is probably the most representative of our overall subject matter.

The bottom of the cluster has something to do with George Romero creating the zombie movie genre with Night of the Living Dead (there is some associated junk about a blog post that we can ignore).

In the mid-region of the cluster, we see discussion of a specific scene from a recent remake of the film called Night of the Living Dead 3D (it wasn’t very good).

Toward the top of the cluster we have a region that is much less densely connected. Doing some research, we find that this pertains to a comic book series being created about the movie. The same company is making a Pacific Rim comic as indicated by a few tiny branches of the cluster.

These are all topics we should consider exploring for our Night of the Living Dead fan site!

TF-IDF + LDA Output

We’ve filtered to “important” keywords and performed a topical analysis of them using latent Dirichlet allocation, isolating 10 different topics.

Final-BrightonSEO-Paul-Shapiro-48

Each identified topic is visualized in the node graph as a keyword spiral, each labeled topic_#.

annotated LDA topics for SEO

For example, spiral #1 above (labeled as topic_6) pertains to the various sequels to Night of the Living Dead, both Dawn of the Dead and Day of the Dead are mentioned. A page devoted to Night of the Living Dead sequels would be an excellent page for our fan site.

Spiral #2 above (labeled as topic_8) pertains to how Night of the Living Dead was selected by the Library of Congress for preservation in the National Film Registry. Another excellent topic to discuss on our fan website!

Now go forth upon the world and start doing better semantic keyword research!

You should now have a basic understanding of an awesome tool and how you can use it to start doing better, more tangible semantic keyword research.

There’s a lot that I didn’t have an opportunity to cover.

So if you have any questions about what I’ve covered or didn’t, ask away in the comments below!


14 Comments

    • There is more than one version you can download. You want to download the version of KNIME that includes all of the extensions. Alternatively, there is a way to download individual nodes, but I recommend just grabbing the more comprehensive version out of the box.

    • Big fan of MarketMuse. I highly recommend it to anyone who finds this process too technically advanced, or even anyone who would like to go even deeper (the technology is better).

  1. Great stuff man. Keep it coming. A semantic enthusiast such as yourself and this article hits the nail on the head. It’s almost like I don’t want to share it with anyone because then I would slowly become irrelevant 😉

  2. Paul, what an excellent article! I was searching for a practical approach to something like semantic keyword clouds and as far as I can tell, this is the sweet spot for it. However, after downloading & installing the complete KNIME and giving the ‘extract and crawl twitter links’ workflow a try myself, I ran into trouble with three nodes that no longer seem to be included in the node repository:
    – HttpRetriever
    – UrlExtractor
    – UrlResolver
    Does this also happen to you, or am I missing an additional node package I need to install, anything?
    Thanks in advance.

    • I’m running the latest stable version (2.1.2.0) and they’re included. My guess is that you downloaded the Early Access 3.0 version? Or downloaded the version without all of the extensions? Another thing that’s possible, is that they’re included in the 32bit version and not the 64bit versions?

      Alternatively, these are all under the Community Nodes->Palladian section and you should be able to download them under Help->Install New Software… menu.

      • Hello Paul, thanks for the reply and help provided. Indeed I had Version 3.0. I downgraded to yours to have the exact same setting. All nodes are there, but… the ‘Extract and Expand Tweet URLs’ workflow still has issues. I went through each node / process step, executed, checked the result table.
        – String Manipulation: Part 1:t.co Extraction RegEx
        – String Manipulation: Part 2:t.co Extraction RegEx
        In these nodes I adapted the config to work on https which was http before and the result table looked fine afterwards.
        – URLResolver
        This simply does not seem to work anymore, the ‘Resolved URL’ column is exactly like the input column.
        So unfortunately, at least so far, this is the furthest I could get. If you have any advise, highly appreciated.
        Cheers, Kai

  3. Fantastic post, deep, and insightful. Very creative, low-cost solution to getting some deep data/research done!

    I am curious, what’s the time commitment usually involved in this process once you’re familiar with KNIME from your experience?

    Again, really enjoyed it.

    • Thanks! Once you’re familiar, and you have stuff built out, the time investment is VERY minimal. Even if you don’t have everything built-out, and you’re familiar, it’s amazing what you can accomplish in a short period of time.

  4. Great post, Paul. Thank you very much for your insight. Unfortunately, I get lots of 403s that prevent me from extracting text. I am assuming BoilerPipe does not want to process a batch the way it is setup in your example. Please let me know if you have alternatives that were not mentioned in the original post.

    • Yeah, an alternative would be to integrate the actual BoilerPipe library, which unfortunately isn’t easy if you’re not familiar with Java. I’ve written the code if you want to figure it out; it’s on my GitHub.

  5. Terrific writing and really impressive way to make things clear. Topical and intent based keyword research is very important these days and LSI keywords are playing major role after the Hummingbird update. But the newest change on Google keyword planner tool is making things difficult to select best relevant keywords depending on their average monthly search volumes. Really enjoyed reading the post and couple of valuable points are also noted.