Existing search technology starts by filtering the entire world through a particular grid or quantum (a.k.a. unit of search). For web search, that unit of search is a web page. While not often discussed explicitly, the unit of search for various search types is pretty obvious to most people:
- On Kayak, the unit of search is a hotel or flight (depending on the type of search).
- Google’s primary unit of search for web search is a web page.
- Amazon searches are focused on finding particular products.
- Bing image search orients around each individual image.
In many places, a single unit of search is exactly what the doctor ordered. At the end of the day, the goal is to reflect how end users think about a particular problem domain. What we realized is that there are places where a single unit of search is not meaningful enough by itself. Some of these places include discussion forums, blogs with comments, mailing lists and Q&A sites.
What is different about these places is that they have multiple natural competing units of search. For example, the individual posts in an online threaded discussion are given context by the other posts within the discussion. The post by itself might not have much meaning or not be very keyword dense, but in the context of the discussion it might be very relevant.
Different solutions have been typically provided for overcoming this problem of having competing units of search:
- Search at each level discretely: For example, discussion forums allow you to search at two different levels of unit of search: threads and posts.
- Ignore a large part of the content: Traditional blog search engines often completely ignore valuable comments to the original blog entry when using search. Sites like Quora only let a user search at the title level.
- Search at some other common unit of search: Bing and Google index discussions at the page level. This usually corresponds to 5-10 different users’ comments.
Each of these strategies has major weaknesses. The first solution makes the user work too hard. The second undervalues the importance of a large portion of a publisher’s content. And the third, while often the best of the three, suffers in multiple ways: it throws away important meaning, reducing the ability to provide useful results; it doesn’t give users context, so they are often disoriented on click-through; and, lastly, it often confuses the user, since the snippet you see could extend across multiple completely discrete pieces of content.
This is the core abstract challenge that YapMap tries to solve: providing search at multiple levels of quanta quickly and in a useful manner. The backend portion of this problem breaks down into a few key pieces: integration and division, indexing and retrieval, and scoring. Our solutions, covered by a number of patents, are outlined below.
A Couple Side Notes
Most search engines and applications use many different factors to determine the relevance of a particular piece of content. Things like recency, number of inbound links and reading level are combined with keyword density and term proximity to build a list of results. These are all attributes of whatever singular content unit that system is pivoted around. At YapMap, we also use many of the same factors to determine relevance. Where we differ is that we also understand the interrelationship of the major and minor content units at score time, allowing us to do things like determine the relationship between keyword density, author authority and reply distance (the distance from a parent post to a particular reply on that discussion branch).
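To make the reply-distance idea concrete, here is a toy sketch of combining per-post signals. The function name, weights and exponential decay are invented for illustration; they are not YapMap's actual model.

```python
import math

def post_score(keyword_density, author_authority, reply_distance, decay=0.5):
    """Hypothetical combination of per-post signals.

    reply_distance is the number of hops from a parent post to this
    reply along the discussion branch; context contributed by a
    distant ancestor is discounted exponentially.
    """
    context_weight = math.exp(-decay * reply_distance)
    return keyword_density * author_authority * context_weight

# A direct reply (distance 1) outranks an otherwise identical post
# sitting three hops down the same branch.
near = post_score(keyword_density=0.8, author_authority=1.2, reply_distance=1)
far = post_score(keyword_density=0.8, author_authority=1.2, reply_distance=3)
```

The point is simply that distance within the discussion tree becomes a first-class scoring input, something a single-quantum engine has no way to express.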
Many engines & applications provide you with multiple different types of content on a single page. For example, if you do a search for "cruise ship" on Google, you'll see web pages, images and news articles. Other times, sites will intersperse videos, movie times and shopping listings. While a useful feature, it is different from what we're talking about here. In those situations, you're still seeing individual result items analyzed at a singular quantum. An analogous example to what we're doing would be a search for "Vanagon transmission" that returned the pages with the best combination of quality text content and quality images, with the result page showing both pieces of information in an abstract context.
Integration and Division
Determining how to merge and then divide the content in a scalable, parallel way becomes a key concern when you’re trying to develop multi-quanta search. Put another way: what are our primary keys, and how do we parallelize the integration and processing of the units within them? Our parser is responsible for a portion of this in the delineation of sub-content pieces. However, we also need to stitch together a single thread from multiple distinct pages. The problem with most solutions to this kind of problem is that they are not scalable. It was only through the judicious use of HBase and Zookeeper that we could do this effectively and provide a stable place for indexing to leverage. At the end of the day, we’ve successfully merged and subdivided single discussion threads in excess of 59,000 individual messages!
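The stitching step can be sketched in a few lines, using the thread ID as the primary key to parallelize on. The field names and page shape here are illustrative, not our actual schema:

```python
from collections import defaultdict

def stitch_threads(pages):
    """Merge crawled pages into complete threads.

    Each page carries a thread_id and a list of (post_id, text)
    tuples; a long thread arrives split across many pages, possibly
    out of order, so we key everything by thread and re-sort.
    """
    threads = defaultdict(dict)
    for page in pages:
        for post_id, text in page["posts"]:
            threads[page["thread_id"]][post_id] = text
    # Re-divide: emit each thread's posts in canonical (post-id) order.
    return {tid: [posts[pid] for pid in sorted(posts)]
            for tid, posts in threads.items()}

pages = [
    {"thread_id": "t1", "posts": [(1, "original post"), (2, "first reply")]},
    {"thread_id": "t1", "posts": [(3, "later reply")]},   # page 2 of t1
    {"thread_id": "t2", "posts": [(1, "another thread")]},
]
merged = stitch_threads(pages)
```

In production the defaultdict becomes rows in HBase keyed by thread, which is what makes the same grouping scalable across machines.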
Indexing and Retrieval
This is where the heart of our solution varies substantially from existing solutions. We saw the excellent work of projects like Lucene, Solr, Elastic Search & Katta and planned on building our application on top. It seemed like basic search technologies had become commoditized enough that we could do what we needed. The problem we ran across is that these solutions were never built to handle a multi-quanta search problem. At the core of the indexing and retrieval problem is a question of compact efficient data structures and micro-locality. If you look at it from the end state: we need to score threads based on their constituent parts as well as the whole.
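One way to picture the micro-locality requirement: if a term's postings are laid out sorted by (thread, post), then every posting for a given thread is adjacent, and a single contiguous scan yields both the per-post scores and the thread aggregate. This toy sketch uses raw term frequency as the "score" purely for illustration:

```python
from itertools import groupby

# Hypothetical postings for one term, already sorted by (thread, post).
# Keeping a thread's postings adjacent is the micro-locality that lets
# us score the whole and its constituent parts in one pass.
postings = [
    ("t1", 1, 3),  # (thread_id, post_id, term frequency)
    ("t1", 4, 1),
    ("t2", 2, 5),
]

def score_by_thread(postings):
    """One contiguous scan: per-post scores plus the thread total."""
    results = {}
    for thread, group in groupby(postings, key=lambda p: p[0]):
        post_scores = {post: tf for _, post, tf in group}
        results[thread] = (sum(post_scores.values()), post_scores)
    return results

results = score_by_thread(postings)
```

A two-index design has to do this same join across separate structures at query time, which is exactly the cost the layout above avoids.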
There are multiple ways to build a result page like ours. Through our first three major iterations, we leveraged Solr and Lucene. We started by indexing each conversation or thread as a single document. This had two major problems: the ordering of results was very poor, and the calculation of individual post-level relevance was resource intensive. Our second solution was to maintain two parallel indices, one at each level of quanta. This suffered from a host of different problems: parallel index maintenance was a nightmare, we effectively doubled the size of our index, the scoring code had to be split and was necessarily hairy to work within the existing platforms’ paradigms and, lastly, the lack of flexibility within the scoring model meant the results were still not ordered effectively or intuitively.
Upon reflection, we struggled with the inevitable need to build a solution for our problem. We hate building new things unless absolutely necessary. Nonetheless, at the end of the day we needed an index structure and retrieval paradigm that was pre-calculated and allowed us to understand the relationship of the different quanta at scoring time with micro data locality, so that memory requirements wouldn’t balloon out of control. And that’s what we built. Where we can, we leverage existing technologies (examples include utilizing the Lucene processing pipeline and the great tokenization work that the Lucene and Solr teams have done). But despite the interesting work on things like Flexible Indexing, we had to build our own distributed indexing and retrieval technology. While painful, this allowed us to build a truly appropriate system, minimize resource requirements and draw on more modern techniques for index construction such as hybrid patched frame-of-reference encoding and judicious use of bitsets. It also allowed us to load massive indices directly into memory while avoiding many of the traditional garbage collection problems of large Java heaps.
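The frame-of-reference idea can be illustrated with a much-simplified sketch: delta-encode a sorted posting list and record, per block, the bit width needed to hold that block's deltas. (Real patched FOR additionally stores "exception" patches for outlier deltas; this toy version omits that and stores deltas as plain integers.)

```python
def for_encode(doc_ids, block_size=4):
    """Toy frame-of-reference encoding of a sorted posting list.

    Stores per-block deltas plus the bit width a packed encoding
    would need; the width is what yields the space savings, since
    gaps between doc ids are far smaller than the ids themselves.
    """
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    blocks = []
    for i in range(0, len(deltas), block_size):
        block = deltas[i:i + block_size]
        width = max(d.bit_length() for d in block)  # bits per value
        blocks.append((width, block))
    return blocks

def for_decode(blocks):
    """Invert the encoding by running-summing the deltas."""
    out, running = [], 0
    for _, block in blocks:
        for d in block:
            running += d
            out.append(running)
    return out

blocks = for_encode([3, 7, 8, 20, 21])
```

The per-block width is why one large gap only "costs" its own block rather than the whole list, which is the problem the patching in real hybrid PFOR schemes addresses even more aggressively.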
Scoring
Which is better: two good posts on a topic, or one great post? In the realm of threaded discussions, content is numerous, opinions are rampant, and the best signal of traditional web search (the link) is extremely rare. In addition, the academic world hasn’t spent a lot of time thinking about multi-quanta scoring. Things like TF/IDF and BM25 don’t have multi-quanta equivalents. As such, we’ve had to establish a new set of scoring models upon which to provide our results. The good news is we have a number of new signals and factors that we can consider. Things like types of responses and reply distance between keywords are available for the first time. We leverage these to provide the best results to the user.
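One hedged sketch of what a multi-quanta aggregation might look like: roll post scores up into a thread score with diminishing returns, so extra relevant posts always help but can't simply be summed. The saturation constant and the function itself are invented for illustration, not our production model.

```python
def thread_score(post_scores, saturation=0.7):
    """Aggregate per-post scores into a thread score.

    Posts are weighted in descending order of score, each additional
    post contributing geometrically less, so piling on mediocre
    posts cannot swamp genuinely strong content.
    """
    total, weight = 0.0, 1.0
    for s in sorted(post_scores, reverse=True):
        total += weight * s
        weight *= saturation
    return total

two_good = thread_score([0.6, 0.6])  # 0.6 + 0.7 * 0.6 = 1.02
one_great = thread_score([0.9])      # 0.9
```

Under these toy numbers, two good posts edge out one great one; tuning the saturation constant moves that trade-off, which is exactly the kind of knob TF/IDF and BM25 never had to expose.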
Combining new systems for integration/division, indexing/retrieval and scoring establishes our foundation for building world class applications. While we are very happy with our current status, there is a list a mile long of things we want to improve on. As the nascent field of multi-quanta search is established, we look forward to continually improving our results and end user experience.