Mon 04 May 2015

SKIP, The Search Keyword Intent Predictor

Magnetic specializes in search retargeting, thus we really need to understand our users’ searches — it is our bread and butter. We need to recognize what a user’s search means in an understandable way for both humans and computers. This is why we map each search to a category (e.g. “Automotive”), brand (e.g. “BMW”), or other intent data. Our keyword categorization service and Search Keyword Intent Predictor (SKIP) is our core technology which addresses this need.

Keyword Intent Basics

OK, so let’s get some basic undersatnding first. If we talk about search keywords and categories or brands what do we really mean? If a user searches for “galaxy s5” we want to understand it by mapping search keyword to categories and brands:

Another example, for the search term “fanta”:

We have previously described how we use Wikipedia to detect the category of search queries, and published a paper on our approach. Brand detection is a new feature of our keyword categorization service which we will write about in a forthcoming article here on our blog, but suffice to say that it uses Wikipedia as a data source in much the same way as keyword categorization does.

SKIP

Before the recent hackathon, keyword categorization was only available as a web service for use by other parts of our platform. During the hackathon we created a user interface for our keyword categorization service to enable humans to explore and interact with our keyword, category, and brand data sets. Now we can see that each keyword is modeled by Wikipedia articles/entities which are further turned into categories, brands or other intent data.

Two wikipedia articles are returned as SKIP
results Two wikipedia articles are returned as SKIP results, each have categories and brands and query categories are combination of individual search results categories.

Keyword Expansion

We also took advantage of the hackathon to build a new feature into the categorization service and SKIP. Because of the hyperlinked nature of Wikipedia, we can identify related wiki pages for a keyword, category, or brand, and use that to derive related keywords from the resulting pages. We generate related keywords for each Wikipedia concept with a score computed based on Wikipedia graph.

Related keywords for seed search "Grand
Cherokee" Related keywords for seed search “Grand Cherokee”.

So if you try to get related keywords for “Grand Cherokee” you will get all alternative names, competitors, related SUVs and continuing list with related car terms with lower scores. However, because we use the brand and category detection previously described, but no keywords related to the Cherokee Native American tribe are returned, which is a risk when using a more naive algorithm to generate related terms (for instance, those based on word co-ocurrence in some corpus).

We are eager to see how our operations and data teams will use SKIP in production. If you would like to get more technical details, continue to read below.

Algorithm for extending related keywords

The Wikipedia corpus is a lot like the general web document corpus in two important ways: first, the documents contain a lot of text, which aids natural language processing tasks, and second, the documents are hyperlinked to one another, which allows us to apply graph processing ideas to the corpus.

Our initial approach was to use existing keyword extraction tools to mine the representative keywords and phrases directly from the text. However, we were not able to achieve results competitive with methods discussed here.

Instead, we leverage the link graph structure of Wikipedia. A link usually denotes a relationship between the two linked concepts. In fact, it encodes human knowledge on the relation of the given concept with the rest of the content in the Wikipedia. We have generated the set of concepts related to the seed concepts of a category as follows:

Wikipedia alternative names become keyword
suggestions Wikipedia alternative names become keyword suggestions.

In this way, we have generated a list of keywords and phrases related to different categories. Such an approach allows us to unlock knowledge encoded in Wikipedia to represent categories by meaningful and short text phrases with probability scores. Based on the produced data set, we have developed a simple classification algorithm, described in the next section.

If you’d like to hear more about this, please look for us in person at the World Wide Web Conference, precisely at its TargetAd workshop.

Tags: hackathon, keyword categorization, data science

We are Hiring!

We have in Engineering, including:

Apply online and see all positions at our job board.