Question
I'm working on a web application where I display HTML from other websites. Before displaying the final version I'd like to get rid of the ads.
Any ideas or suggestions on how to accomplish this? It doesn't need to be a super efficient filtering tool. I was thinking of porting some of the filters defined by Adblock Plus to Ruby and returning the parsed doc with some help from Nokogiri.
Let's say I use the super wildcard filter ad. That's not an official Adblock Plus filter, but for simplicity I'll use it here. The idea then would be to remove every element that has any attribute matching the filter, e.g.:
src="http://ad.foo.com?my-ad.gif"
href="http://ad.foo.com"
class="annoying-ad"
etc.
The Nokogiri command for this filter would be:
doc.xpath("//*[@*[contains(., 'ad')]]").each { |element| element.remove }
I applied the filter to this page:

And the result was:

Not that bad, though note that the global wildcard filter also got rid of valid elements like headers, because they have attributes such as id="masthead".
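One way to reduce false positives like id="masthead" might be to match ad only as a whole token, not as an arbitrary substring. The following is a minimal sketch under that assumption; token_match? is a hypothetical helper of my own (it is not part of Nokogiri or Adblock Plus), and treating letters and digits as the only "word" characters is a simplification.

```ruby
# Hypothetical helper (an assumption, not a Nokogiri or Adblock Plus API):
# returns true when `token` occurs in `value` as a standalone token, i.e.
# it is not directly preceded or followed by a letter or digit.
def token_match?(value, token = "ad")
  pattern = /(?<![a-z0-9])#{Regexp.escape(token)}(?![a-z0-9])/i
  pattern.match?(value.to_s)
end

# Attribute values taken from the examples above.
samples = [
  "http://ad.foo.com?my-ad.gif",
  "http://ad.foo.com",
  "annoying-ad",
  "masthead"
]

samples.each do |value|
  puts "#{value.inspect} => #{token_match?(value)}"
end
```

With Nokogiri, the same check could presumably be applied per element with something like element.attribute_nodes.any? { |attr| token_match?(attr.value) } before calling element.remove.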
So I think this approach is OK for my case. Now the question is which filters to use. Adblock Plus has a huge list of filters and I don't feel like iterating over all of them. I'm thinking of grabbing the top 10-20 and parsing the docs based on those. Is there a list out there of the most popular ones? If so, I haven't been able to find it.
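To frame what "porting filters" could mean here: Adblock Plus lists mix several rule types, and the generic element-hiding rules (lines beginning with ## followed by a CSS selector) seem the easiest to reuse server side. Below is a rough sketch of extracting just those, assuming the list is already loaded as a string. generic_hiding_selectors and the sample list are hypothetical and made up for illustration; they are not taken from a real filter list.

```ruby
# Hypothetical helper: collect the CSS selectors from generic
# element-hiding rules (lines that start with "##") in an
# Adblock Plus style filter list. Comments ("!"), blocking rules,
# domain-specific rules and exceptions ("#@#") are skipped.
def generic_hiding_selectors(filter_text)
  filter_text.each_line
             .map(&:strip)
             .select { |line| line.start_with?("##") }
             .map { |line| line.delete_prefix("##") }
             .reject(&:empty?)
end

# Made-up sample list, for illustration only.
SAMPLE_LIST = <<~LIST
  [Adblock Plus 2.0]
  ! a comment line
  ||ad.foo.com^
  ##.annoying-ad
  example.com##.sidebar-ad
  ###banner
  #@#.allowed
LIST

puts generic_hiding_selectors(SAMPLE_LIST)
```

Each extracted selector could then be tried against the parsed document with doc.css(selector).remove, with error handling for selectors Nokogiri cannot parse. This still leaves open which rules are the most popular ones.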
Source: https://stackoverflow.com/questions/18564924/ads-filtering-server-side