There are many misconceptions about the Penguin algorithm and what drives its detection of over-optimization. In particular, most case studies of Penguin trivialize the nuance with which it detects over-optimization. Unlike other algorithms that may focus on a handful of factors, Penguin looks at a large number of factors and the patterns between them. Consequently, many of the features of over-optimization that one would expect to find in a penalized site are not so common after all…

Anchor Text

Only 28% of URLs impacted by Penguin have exact match anchor text for the keyword for which they were penalized. Fewer than 8% have that keyword as their most common anchor text.
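
To make those two anchor text measurements concrete, here is a minimal Python sketch. The function name `anchor_profile`, the input format (a plain list of anchor strings for one URL) and the sample anchors are all assumptions for illustration; this is not the data or the method behind the figures above.

```python
from collections import Counter


def anchor_profile(anchors, keyword):
    """Summarise how one keyword figures in a URL's inbound anchor text.

    anchors: list of anchor-text strings, one per inbound link (hypothetical input).
    keyword: the keyword being checked.
    """
    # Normalise case and surrounding whitespace before comparing.
    normalised = [a.strip().lower() for a in anchors]
    target = keyword.strip().lower()
    counts = Counter(normalised)

    exact_matches = counts.get(target, 0)
    most_common = counts.most_common(1)[0][0] if counts else None
    return {
        "has_exact_match": exact_matches > 0,
        "exact_match_share": exact_matches / len(normalised) if normalised else 0.0,
        "is_most_common": most_common == target,
    }


# Made-up sample anchors, for illustration only.
sample = ["Acme Widgets", "blue widgets", "acme.com", "Acme Widgets", "click here"]
profile = anchor_profile(sample, "blue widgets")
```

Under these definitions, the sample URL has an exact match anchor for "blue widgets", but that keyword is not its most common anchor text.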

Sitewide Links

Fewer than 40% of URLs impacted by Penguin have any sitewide links. Sites hit by Penguin and sites that were not differed by only 8% in whether they had sitewide links.

Link Sources

Sites that were and were not penalized by Penguin differ by less than 10% in whether they have links from article submission sites, guestbooks, directories, forums or blog comments. The real difference lay in the sites that depended on those links.
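
The distinction between having such links and depending on them can be sketched in a few lines of Python. The source-type labels, the function name `source_dependence` and the two example link profiles are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical labels for the link source types named above.
TRACKED_SOURCE_TYPES = {
    "article_submission", "guestbook", "directory", "forum", "blog_comment",
}


def source_dependence(link_types):
    """Return (has_any, share) for links coming from the tracked source types."""
    if not link_types:
        return False, 0.0
    flagged = sum(1 for t in link_types if t in TRACKED_SOURCE_TYPES)
    return flagged > 0, flagged / len(link_types)


# Two invented link profiles of 20 links each.
site_a = ["editorial"] * 18 + ["directory", "forum"]
site_b = ["editorial"] * 2 + ["directory"] * 10 + ["blog_comment"] * 8

a_has_any, a_share = source_dependence(site_a)
b_has_any, b_share = source_dependence(site_b)
```

Both invented sites have links from these sources, so a yes/no check cannot tell them apart. The share can: a small fraction of site A's links come from them, while site B relies on them for most of its profile.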

The Simple Truth

Google’s Penguin analysis uses complex mathematics to analyze huge data sets with likely hundreds, if not thousands, of variables to identify patterns of over-optimization. Those patterns cannot be easily discerned by a human (if they could be, they would have been written into the algorithm years ago!). To get to the heart of a machine-learned algorithm, you need to apply machine learning techniques, just as we have with the Penguin Vulnerability Score.
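
As a rough illustration of what "applying machine learning techniques" can mean, here is a tiny logistic regression written in plain Python. It is a sketch under stated assumptions, not Google's algorithm and not the model behind the Penguin Vulnerability Score: the two features, the six training rows and the hyperparameters are all invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this row drives the gradient.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x):
    """Estimated probability that a feature vector belongs to the penalized class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Invented toy data. Features: [share of exact match anchors,
# share of links from directories and similar sources]; label 1 = penalized.
rows = [[0.05, 0.10], [0.10, 0.20], [0.02, 0.15],
        [0.40, 0.80], [0.35, 0.90], [0.50, 0.70]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic(rows, labels)
risky = predict(weights, bias, [0.45, 0.85])
safe = predict(weights, bias, [0.05, 0.10])
```

The model learns its weights from labeled examples, so no one has to hand-pick a threshold for each factor. A real system would use far more variables and far more data than this toy example.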