In the first part of our three-part tech talk series, we'll be discussing how data gets into the YapMap Visual Search Platform. For this purpose, YapMap built a Targeted Structured Crawler composed of two key components that work closely together to collect our search corpus.
YapMap uses a targeted crawler. Traditional projects and products like Heritrix, 80Legs, Common Crawl and Nutch are amazing tools and make things possible that were unthinkable only a few years ago. They are really good at breadth crawling.
We started by exploring the use of these technologies. However, we're not trying to crawl the whole web. For our first vertical, we're trying to have the most complete collection of Automotive discussions on the web. The sites that we do crawl might have hundreds of thousands or millions of pages. They also often suffer from many different URLs for the same content. Lastly, we need to update our content efficiently using only the resources of a small company. While there is a specification to help with this at sitemaps.org, many sites don't implement it.
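To make the duplicate-URL problem concrete, here is a minimal sketch of a URL canonicalizer that collapses the session IDs, tracking parameters, and fragments forums often tack on, so each page is fetched only once. This is illustrative Python, not our production code, and the parameter names are examples:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Example parameters that vary without changing the page's content.
TRACKING_PARAMS = {"sid", "s", "utm_source", "utm_medium", "utm_campaign"}

def canonicalize(url: str) -> str:
    """Produce a stable key so duplicate URLs map to one crawl entry."""
    parts = urlsplit(url)
    # Drop tracking parameters and sort what remains for a stable query string.
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    return urlunsplit((
        parts.scheme.lower(),   # normalize scheme case
        parts.netloc.lower(),   # normalize host case
        parts.path or "/",      # empty path becomes "/"
        urlencode(query),
        "",                     # fragment removed
    ))
```

With this, `HTTP://Forum.example.com/thread.php?t=42&sid=abc#post3` and `http://forum.example.com/thread.php?sid=xyz&t=42` both reduce to `http://forum.example.com/thread.php?t=42`.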
Because of these challenges and our focus on a specific subset of crawling problems, we realized that we needed to build a crawler that was smarter for the job at hand. That meant it needed to understand page types and the places where we could rely on date ordering. It also needed to be extra smart about site utilization, ensuring that we can crawl large sites as quickly as possible without overwhelming them or violating robots.txt. So we built a targeted crawler that solves these problems. This has allowed us to crawl hundreds of millions of web pages over the past year and maintain their recency, all on reasonable infrastructure while minimizing publisher impact.
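To illustrate the politeness side of that, here is a minimal per-host throttle sketch using Python's standard `urllib.robotparser`. The class name and defaults are invented for this example, but it shows the two obligations together: honoring Disallow rules and spacing fetches out per any Crawl-delay directive:

```python
import time
from urllib.robotparser import RobotFileParser

class HostThrottle:
    """Illustrative per-host politeness gate, not YapMap's actual scheduler."""

    def __init__(self, robots_txt: str, agent: str = "example-bot",
                 default_delay: float = 1.0):
        self.agent = agent
        self.robots = RobotFileParser()
        self.robots.parse(robots_txt.splitlines())
        # Use the site's Crawl-delay if it declares one, else a safe default.
        self.delay = self.robots.crawl_delay(agent) or default_delay
        self.next_fetch = 0.0

    def allowed(self, url: str) -> bool:
        """Check the URL against the site's robots.txt rules."""
        return self.robots.can_fetch(self.agent, url)

    def wait(self) -> None:
        """Block until this host's next permitted fetch time."""
        now = time.monotonic()
        if now < self.next_fetch:
            time.sleep(self.next_fetch - now)
        self.next_fetch = time.monotonic() + self.delay
```

A real crawler would keep one such throttle per host, letting many hosts be crawled in parallel while each individual site sees only a gentle request rate.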
Building one of our maps requires more understanding of the interrelationship of pages and the structure of the content on the page than traditional technologies provide. Whereas traditional parsers such as Boilerplate and Tika understand more generic concepts like bigger or smaller typefaces, paragraphs versus navigation, and ads versus content, we need a much more nuanced understanding of a site's content. After all, our goal is to be able to figure out that the fifth message in a particular thread is replying to the third message and that the second and fourth messages are by the same author. That meant we needed to build a parser that could work off platform and web standards but could also be taught to understand individual sites.
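To make that concrete, here is a hypothetical sketch (in Python, with invented names) of the kind of structured output such a parser has to produce: not just extracted text, but reply links and author identity per message:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Message:
    """One post in a thread; reply_to points at an earlier message's index."""
    index: int                      # 1-based position within the thread
    author: str
    text: str
    reply_to: Optional[int] = None  # index of the quoted/parent message

@dataclass
class Thread:
    url: str
    messages: List[Message] = field(default_factory=list)

    def replies_to(self, n: int) -> List[int]:
        """Indexes of messages that reply to message n."""
        return [m.index for m in self.messages if m.reply_to == n]

    def by_author(self, name: str) -> List[int]:
        """Indexes of messages written by the given author."""
        return [m.index for m in self.messages if m.author == name]
```

Given a thread where message 5 quotes message 3 and messages 2 and 4 share an author, queries like `replies_to(3)` and `by_author(...)` recover exactly the relationships described above.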
This parser also needed to be integral to the recursive patterns of crawling so that we could do a targeted crawl of the content we needed. As such, we built a streaming structured parser that leverages key lower-level technologies like TagSoup and Neko while providing an abstract interface for understanding new types of content. This will enable us to expand to cover more types of Q&A sites like Quora, OSQA and StackOverflow and to start supporting blog entries and their associated comments on platforms like WordPress and Blogger.
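As a rough sketch of what an abstract interface for new content types can look like, here is a hypothetical outline in Python (the actual parser builds on the Java tools named above; every class and method name here is invented for illustration). A platform-level parser handles any site running that software, and a site with customized templates would subclass it and override only what differs:

```python
import re
from abc import ABC, abstractmethod
from typing import List, Optional

class SiteParser(ABC):
    """Contract each platform/site parser implements (names illustrative)."""

    @abstractmethod
    def handles(self, url: str) -> bool:
        """Whether this parser understands pages at the given URL."""

    @abstractmethod
    def extract_links(self, html: str) -> List[str]:
        """Return further URLs to crawl (thread pages, next-page links)."""

class VBulletinParser(SiteParser):
    # Platform-level rules; custom-template sites override selectively.
    def handles(self, url: str) -> bool:
        return "showthread.php" in url

    def extract_links(self, html: str) -> List[str]:
        # Toy link extraction; a real parser walks the parsed DOM instead.
        return re.findall(r'href="(showthread\.php[^"]*)"', html)

def pick_parser(parsers: List[SiteParser], url: str) -> Optional[SiteParser]:
    """Route a fetched URL to the first parser that claims it."""
    return next((p for p in parsers if p.handles(url)), None)
```

The crawler and parser meet in `extract_links`: each parsed page feeds new targeted URLs back into the crawl queue, which is the recursive pattern described above.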