SKIP, The Search Keyword Intent Predictor
Magnetic specializes in search retargeting, so understanding our users’ searches is our bread and butter. We need to recognize what a user’s search means in a way that both humans and computers can understand. This is why we map each search to a category (e.g. “Automotive”), brand (e.g. “BMW”), or other intent data. Our keyword categorization service and its Search Keyword Intent Predictor (SKIP) are the core technology that addresses this need.
Keyword Intent Basics
OK, so let’s get some basic understanding first. When we talk about search keywords and categories or brands, what do we really mean? If a user searches for “galaxy s5”, we want to understand it by mapping the search keyword to categories and brands:
- Category: Electronics & Computing\Cell Phone & PDA
- Brand: Samsung
Another example, for the search term “fanta”:
- Category: Food & Drink
- Brand: Coca-Cola
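To make the mapping above concrete, here is a minimal sketch of what such intent data could look like as a data structure. The field names and shape are purely illustrative, not the actual schema of our categorization service.

```python
# Hypothetical shape of the intent data returned for a search keyword.
# Field names ("category", "brand") are illustrative only.
intents = {
    "galaxy s5": {
        "category": "Electronics & Computing\\Cell Phone & PDA",
        "brand": "Samsung",
    },
    "fanta": {
        "category": "Food & Drink",
        "brand": "Coca-Cola",
    },
}

# Looking up the intent of a search keyword:
print(intents["galaxy s5"]["brand"])   # Samsung
print(intents["fanta"]["category"])    # Food & Drink
```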
We have previously described how we use Wikipedia to detect the category of search queries, and published a paper on our approach. Brand detection is a new feature of our keyword categorization service which we will write about in a forthcoming article here on our blog, but suffice to say that it uses Wikipedia as a data source in much the same way as keyword categorization does.
Before the recent hackathon, keyword categorization was only available as a web service for use by other parts of our platform. During the hackathon we created a user interface for our keyword categorization service to enable humans to explore and interact with our keyword, category, and brand data sets. Now we can see how each keyword is modeled by Wikipedia articles/entities, which are further turned into categories, brands, or other intent data.
Two Wikipedia articles are returned as SKIP results; each has its own categories and brands, and the query’s categories are a combination of the categories of the individual results.
We also took advantage of the hackathon to build a new feature into the categorization service and SKIP. Because of the hyperlinked nature of Wikipedia, we can identify related wiki pages for a keyword, category, or brand, and derive related keywords from the resulting pages. We generate related keywords for each Wikipedia concept, with a score computed from the Wikipedia link graph.
Related keywords for seed search “Grand Cherokee”.
So if you get related keywords for “Grand Cherokee”, you will see all the alternative names, competitors, and related SUVs, followed by more general car terms with lower scores. Moreover, because we use the brand and category detection previously described, no keywords related to the Cherokee Native American tribe are returned, which is a risk when using a more naive algorithm to generate related terms (for instance, one based on word co-occurrence in some corpus).
We are eager to see how our operations and data teams will use SKIP in production. If you would like to get more technical details, continue to read below.
Algorithm for extending related keywords
The Wikipedia corpus is a lot like the general web document corpus in two important ways: first, the documents contain a lot of text, which aids natural language processing tasks, and second, the documents are hyperlinked to one another, which allows us to apply graph processing ideas to the corpus.
Our initial approach was to use existing keyword extraction tools to mine representative keywords and phrases directly from the text. However, we were not able to achieve results competitive with the method discussed here.
Instead, we leverage the link graph structure of Wikipedia. A link usually denotes a relationship between the two linked concepts; in fact, it encodes human knowledge about how a given concept relates to the rest of the content in Wikipedia. We generated the set of concepts related to the seed concepts of a category as follows:
- For a seed article A, we construct a set of concepts s(A) that A links to.
- Based on DBpedia categories (DBpedia is a structured version of Wikipedia which contains semantic types like Person, City, Location, Organization, or Music Artist), we remove from s(A) all concepts of the types Location, Person, Organization, and Work. This results in a smaller graph and has only a minor impact on result quality.
- For all remaining members of s(A), we compute the cosine similarity with the concept A, where we use link adjacency as vectors for the computation. The computed value is used as a concept similarity score to the target category.
- We construct a list of pairs <name, score>, where name (an n-gram keyword) is the name of a Wikipedia concept and score is the value of its cosine similarity with the seed category concept.
- We extend the list by the alternative names (defined by titles of Wikipedia redirect pages pointing to the concept) of each concept.
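The steps above can be sketched in a few lines of Python. This is a toy illustration, not our production code: the link graph, DBpedia types, and redirect tables below are tiny hand-made stand-ins for the real Wikipedia/DBpedia data, and the cosine similarity treats each concept’s outgoing link set as a binary adjacency vector.

```python
import math

# Toy stand-in for the Wikipedia link graph: article -> set of linked articles.
# Names and links are illustrative only.
LINKS = {
    "Jeep Grand Cherokee": {"Jeep", "Sport utility vehicle", "Chrysler", "Jeep Cherokee"},
    "Jeep Cherokee": {"Jeep", "Sport utility vehicle", "Chrysler"},
    "Jeep": {"Chrysler", "Sport utility vehicle"},
    "Sport utility vehicle": {"Jeep"},
    "Chrysler": {"Jeep"},
}

# Hypothetical DBpedia semantic types; concepts of the excluded types are dropped.
TYPES = {"Chrysler": "Organization"}
EXCLUDED_TYPES = {"Location", "Person", "Organization", "Work"}

# Titles of redirect pages give alternative surface forms for a concept.
REDIRECTS = {"Sport utility vehicle": ["SUV"]}

def cosine(a, b):
    """Cosine similarity of two link sets viewed as binary adjacency vectors."""
    inter = len(a & b)
    if inter == 0 or not a or not b:
        return 0.0
    return inter / math.sqrt(len(a) * len(b))

def related_keywords(seed):
    # Step 1-2: linked concepts s(A), filtered by DBpedia type.
    candidates = {c for c in LINKS.get(seed, set())
                  if TYPES.get(c) not in EXCLUDED_TYPES}
    # Step 3-5: score each concept by cosine similarity with the seed,
    # and extend the list with its redirect (alternative) names.
    scored = []
    for c in candidates:
        score = cosine(LINKS[seed], LINKS.get(c, set()))
        for name in [c] + REDIRECTS.get(c, []):
            scored.append((name, score))
    return sorted(scored, key=lambda pair: -pair[1])

for name, score in related_keywords("Jeep Grand Cherokee"):
    print(f"{name}\t{score:.2f}")
```

On this toy graph, “Jeep Cherokee” scores highest (it shares the most outgoing links with the seed), and the redirect title “SUV” appears as a suggestion alongside “Sport utility vehicle”, mirroring how redirect pages become alternative keyword suggestions.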
Wikipedia alternative names become keyword suggestions.
In this way, we have generated a list of keywords and phrases related to different categories. Such an approach allows us to unlock knowledge encoded in Wikipedia to represent categories by meaningful and short text phrases with probability scores. Based on the produced data set, we have developed a simple classification algorithm, described in the next section.
If you’d like to hear more about this, please look for us in person at the World Wide Web Conference, specifically at its TargetAd workshop.