Wed 24 Sep 2014

Search Query Categorization at Scale

Classification of short text into a predefined hierarchy of categories is a challenge. The need to categorize short texts arises in multiple domains: page keywords and search queries in online advertising, improvement of search engine results, analysis of tweets or messages in social networks, etc.

The meetup garnered a large audience.

Using individual keywords from users’ search history is a great way to target an online advertising campaign: we have a high degree of confidence that the user is (or was) interested in the topic they searched for. But to reach an audience large enough to deliver at scale, advertisers must select thousands or tens of thousands of keywords. The ability to categorize searches and keywords helps create larger audiences with less effort from the advertiser, without sacrificing the quality of the targeting rules used.

Last night, I presented at the NYC Search, Discovery, and Analytics meetup about our approach to keyword categorization. We leverage community-moderated, freely available data sets (Wikipedia, DBPedia, Freebase) and open-source tools (Hadoop and Solr) to build a flexible and extensible classification model.
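This post does not spell out the pipeline itself, so what follows is only a rough sketch of one plausible shape for such a classifier, not the method from the talk. It assumes a small set of category-labelled articles (standing in for something like Wikipedia pages), retrieves the articles that best match a query using a simple TF-IDF score, and lets those articles vote for their categories. The corpus, the category names, and the functions `tokenize`, `build_index`, and `categorize` are all hypothetical, and a plain in-memory index stands in for a real search engine.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus: (category, text) pairs standing in for
# category-labelled articles. The real data and taxonomy are not shown here.
ARTICLES = [
    ("Sports/Basketball", "basketball nba playoffs team court player"),
    ("Sports/Soccer", "soccer football world cup goal team player"),
    ("Travel/Flights", "cheap flights airline tickets airport booking"),
    ("Travel/Hotels", "hotel booking rooms resort cheap stay"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(articles):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, (_, text) in enumerate(articles):
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def categorize(query, articles, index, top_k=3):
    """Score articles against the query, then sum scores per category.

    Returns a list of (category, score) pairs, best first. An empty list
    means no query term appeared in the corpus.
    """
    n_docs = len(articles)
    doc_scores = Counter()
    for term in set(tokenize(query)):
        postings = index.get(term, {})
        if not postings:
            continue
        # Rarer terms get a higher weight (a smoothed inverse document frequency).
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            doc_scores[doc_id] += tf * idf

    # Only the top_k best-matching articles vote for their category.
    category_scores = Counter()
    for doc_id, score in doc_scores.most_common(top_k):
        category_scores[articles[doc_id][0]] += score
    return category_scores.most_common()


INDEX = build_index(ARTICLES)

for q in ["cheap flights to nyc", "nba playoffs"]:
    print(q, "->", categorize(q, ARTICLES, INDEX))
```

In this toy setup, "cheap flights to nyc" ranks Travel/Flights first because "flights" is rarer in the corpus than "cheap", which also matches the hotels article. Voting over retrieved documents is convenient for short texts because the query itself carries very little signal; the labelled corpus supplies the context.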

Watch the video to learn how we scale to production data volumes of more than 20 million classifications per hour.

Tags: data science, keyword categorization

