Google’s Panda Document Classifier


Since Google’s Panda update, I’ve been looking for a clear definition of what Google means by a “document classifier”. Here’s an excerpt from the official Google blog that mentions it:

“... we recently launched a redesigned document-level classifier that makes it harder for spammy on-page content to rank highly. The new classifier is better at detecting spam on individual web pages, e.g., repeated spammy words—the sort of phrases you tend to see in junky, automated, self-promoting blog comments.”

Until now, I haven’t been able to find a definitive explanation of how Google defines a “document classifier”. But I believe I have found what I was looking for in Peter Norvig’s textbook “Artificial Intelligence: A Modern Approach”, co-authored with Stuart Russell. Peter Norvig is currently the Director of Research at Google and was formerly the Director of Search Quality at Google.

Here is an excerpt from chapter 13 “Uncertainty”:

Text categorization is the task of assigning a given document to one of a fixed set of categories, on the basis of text it contains. Naive Bayes models are often used for this task. In these models, the query variable is the document category, and the “effect” variables are the presence or absence of each word in the language; the assumption is that the words occur independently in the documents, with frequencies determined by document category.
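To make the excerpt concrete, here is a minimal sketch of a naive Bayes text classifier of the kind it describes, where each "effect" variable is the presence or absence of a word. This is only an illustration, not Google's classifier: the category names ("spam", "ok"), the toy training sentences, the add-one smoothing, and the function names `train` and `classify` are all assumptions made up for this example.

```python
import math
from collections import defaultdict

# Hypothetical toy training set: (text, category) pairs invented for illustration.
TRAINING_DOCS = [
    ("cheap pills buy now cheap pills", "spam"),
    ("buy cheap watches click here now", "spam"),
    ("click here for cheap loans", "spam"),
    ("how to train a naive bayes classifier", "ok"),
    ("notes on search quality and ranking", "ok"),
    ("a review of the new garden tools", "ok"),
]


def train(docs):
    """Estimate P(category) and P(word present | category) from labeled docs."""
    vocab = set()
    doc_count = defaultdict(int)                            # documents per category
    word_doc_count = defaultdict(lambda: defaultdict(int))  # docs containing word, per category
    for text, category in docs:
        words = set(text.lower().split())  # presence/absence only, so use a set
        vocab |= words
        doc_count[category] += 1
        for word in words:
            word_doc_count[category][word] += 1

    total_docs = len(docs)
    priors = {c: n / total_docs for c, n in doc_count.items()}
    # Add-one (Laplace) smoothing so no probability is exactly 0 or 1.
    likelihood = {
        c: {w: (word_doc_count[c][w] + 1) / (doc_count[c] + 2) for w in vocab}
        for c in doc_count
    }
    return priors, likelihood, vocab


def classify(text, priors, likelihood, vocab):
    """Return the most probable category, treating words as independent."""
    present = set(text.lower().split())
    scores = {}
    for category, prior in priors.items():
        log_p = math.log(prior)
        for word in vocab:
            p = likelihood[category][word]
            # Each vocabulary word contributes whether it is present or absent.
            log_p += math.log(p if word in present else 1.0 - p)
        scores[category] = log_p
    return max(scores, key=scores.get)


priors, likelihood, vocab = train(TRAINING_DOCS)
print(classify("buy cheap pills here", priors, likelihood, vocab))
print(classify("notes on ranking and search quality", priors, likelihood, vocab))
```

Working in log probabilities avoids numeric underflow when the vocabulary is large. A real spam classifier would be trained on far more data and would almost certainly use more signals than word presence.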

In other words, the classifier compares the words in a document against the word frequencies it expects for each category and assigns the document to the category it most probably belongs to. If the frequency distribution of keywords on your website matches what the classifier has learned for spam, you may find yourself on the wrong side of the tracks.

Apply the same kind of classifier to anchor text, and you may indeed have a ravenous panda pulling down link branches and swaths of pages across the web forest.

To get an idea of how a document classifier finds such patterns, you can read a passage from the book Big Data, which looks at how Googlers matched search queries against flu data to help predict flu trends:

“All their system did was look for correlations between the frequency of certain search queries and the spread of flu over time and space. In total, they processed a staggering 450 million different mathematical models in order to test search terms, comparing their predictions against actual flu cases from the CDC in 2007 and 2008. And they struck gold: their software found a combination of 45 search terms, that when used together in a mathematical model, had a strong correlation between their prediction and the official figures nationwide.”
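The core idea in that passage, scoring search terms by how closely their frequency tracks reported flu cases, can be sketched in a few lines. This is a deliberately tiny illustration and not the system described in the book: the weekly numbers, the three search terms, and the function names `pearson` and `rank_terms` are all invented for this example, and it scores one term at a time rather than testing combinations of terms.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up weekly figures, for illustration only (not real CDC or search data).
flu_cases = [10, 20, 40, 80, 60, 30]
query_counts = {
    "flu symptoms": [100, 210, 390, 820, 590, 310],
    "cough medicine": [50, 70, 120, 200, 170, 90],
    "football scores": [300, 280, 310, 290, 305, 295],
}


def rank_terms(counts_by_term, target):
    """Sort search terms by how strongly their frequency correlates with the target."""
    scored = [(term, pearson(counts, target)) for term, counts in counts_by_term.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_terms(query_counts, flu_cases)
for term, r in ranked:
    print(f"{term}: r = {r:.3f}")
```

In this toy data the two health-related terms rise and fall with the flu figures and score close to 1, while the unrelated term scores near 0. Correlation alone says nothing about why a term tracks the target, which is part of why the approach needs checking against official figures.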

The takeaway, as I read it, is that content on the web is now run through algorithms, classified against other documents, examined for spam, and put into categories so it can be matched to the terms that correlate strongly with search queries.