How does Google find relevant content when it's parsing the web?
Let's say, for instance, Google uses the PHP native DOM library to parse content. What methods would t
There are lots of highly sophisticated algorithms for extracting the relevant content from tag soup. If you're looking to build something usable yourself, you could take a look at the source code for Readability and port it over to PHP. I did something similar recently (I can't share the code, unfortunately).
The basic logic of Readability is to find all block-level tags and count the length of the text directly inside each one, not counting text that belongs to child blocks. Each parent node is then awarded a fraction (half) of the weight of each of its children. The resulting scores are used to find the block-level tag with the largest amount of plain text. From there, the content is cleaned up further.
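As a rough illustration of the scoring step described above, here is a minimal sketch in Python (chosen only because it runs with the standard library; the same walk could be done over a PHP DOM tree). This is not Readability's actual code. The `BLOCK_TAGS` set, the `Node` class, the `best_block` function, and the decision to apply the half-weight rule recursively up the tree are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumption: a small, hand-picked set of tags treated as "block level".
BLOCK_TAGS = {"html", "body", "div", "p", "article", "section",
              "td", "li", "blockquote", "pre"}
# Text inside these tags is never counted as content.
SKIP_TAGS = {"script", "style"}


class Node:
    """One block-level element in a simplified tree (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_text_len = 0  # text directly in this block, excluding child blocks
        self.score = 0.0


class BlockTreeBuilder(HTMLParser):
    """Builds a tree of block-level nodes; inline tags are ignored, so
    their text is attributed to the nearest enclosing block."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and not self.skip_depth:
            node = Node(tag, attrs, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            # Tolerate tag soup: close the nearest open block with this tag,
            # and ignore the end tag if no such block is open.
            node = self.current
            while node is not self.root and node.tag != tag:
                node = node.parent
            if node is not self.root:
                self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.own_text_len += len(data.strip())


def best_block(html):
    """Return the highest-scoring block node, or None if there are no blocks."""
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    best = None

    def visit(node):
        nonlocal best
        # Own text counts fully; each child block contributes half its score.
        node.score = node.own_text_len + 0.5 * sum(visit(c) for c in node.children)
        if node is not builder.root and (best is None or node.score > best.score):
            best = node
        return node.score

    visit(builder.root)
    return best
```

Calling `best_block(page_html)` returns a node whose `tag`, `attrs`, and `score` identify the candidate content container. Note that with a flat half-weight rule, a wrapper only beats its longest child when the children together hold more than twice that child's text, so a real port would likely need extra heuristics on top of this.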
It's not bulletproof by any means, but it works well in the majority of cases.