Screen scraping: regular expressions or XQuery expressions?

Submitted by 天大地大妈咪最大 on 2019-12-04 06:33:57

I'd use a regular expression, for the reasons the manager gave, plus a few more (portability, easier for outside programmers to follow, etc.).

Your counter-argument misses the point that his solution was fragile with regard to local changes while yours is fragile with regard to global changes. Anything that breaks his will probably break yours, but not vice versa.

Finally, it's a lot easier to build slop / flex into his solution (if, for example, you have to deal with multiple minor variations in the input).
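As a rough illustration of that "slop": a regex can tolerate whitespace and minor markup variation around the value it targets. This is a minimal sketch, assuming a hypothetical product-page fragment like the "Product Dimensions:" example discussed later in this thread:

```python
import re

# Hypothetical fragment of the page being scraped.
html = '<b>Product Dimensions:</b> 10 x 5 x 2 inches'

# \s+ and \s* absorb whitespace variation, and IGNORECASE absorbs
# case changes -- the "flex" described above.
pattern = re.compile(r'Product\s+Dimensions:\s*</b>\s*([^<]+)', re.IGNORECASE)

match = pattern.search(html)
if match:
    print(match.group(1).strip())  # 10 x 5 x 2 inches
```

The trade-off is exactly the one debated here: the pattern survives small local reformatting, but silently depends on the `</b>` tag staying where it is.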

I'd use a regular expression, but only because most HTML pages are not valid XML, so you'd never get the XQuery to work.

I don't know XQuery, but that looks like an XPath expression to me. If so, it looks a bit expensive with so many "//" operators in it.

Try JTidy or BeautifulSoup; both work fine for me. Certainly an XPath expression full of "//" steps is quite costly for scraping.

I'm using BeautifulSoup for scraping.

I actually find CSS search expressions easier to read than either. There probably exists at least one library in the language of your choice that will parse a page and allow you to write CSS directives for locating particular elements. If there's an appropriate class or ID hook nearby then the expression is pretty trivial. Otherwise, grab the elements that seem appropriate and iterate through them to find the ones that you need.
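BeautifulSoup (mentioned above) is one such library: its `select()` method takes CSS selectors. A minimal sketch, assuming a hypothetical page fragment where the target element has an ID and class hook:

```python
# Third-party library: pip install beautifulsoup4
from bs4 import BeautifulSoup

# Hypothetical fragment with convenient ID/class hooks.
html = """
<ul id="details">
  <li class="dim"><b>Product Dimensions:</b> 10 x 5 x 2 inches</li>
  <li class="weight"><b>Shipping Weight:</b> 1.2 pounds</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# With an ID and class nearby, the CSS directive is trivial.
for li in soup.select("#details li.dim"):
    print(li.get_text(" ", strip=True))  # Product Dimensions: 10 x 5 x 2 inches
```

If no hook exists, `soup.select("#details li")` would grab all the candidate elements for the iterate-and-filter approach described above.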

As for fragile, well, they're all fragile. Screen-scraping is by definition dependent on the author of that page not changing its layout drastically. Go with a solution that's readable and can be easily changed later.

A non-brittle solution for screen-scraping? Good luck to the interviewer on that: just because regular expressions toss away a lot of context doesn't mean they are any less brittle; they are simply brittle in other ways. Brittleness may not even be a drawback: if something changes in the source web page, you are frequently better off if your solution raises an alarm rather than trying to compensate in a clever (and unpredictable) way, as you noted. These things always depend on your assumptions: in this case, on what constitutes a likely change.

I'm rather fond of the HTML agility pack: you get tolerance of non-XHTML-compliant web pages combined with the expressive power of XPath.

Regular expressions are really fast and work with non-XML documents. Those are really good points against XQuery. However, I think that using a converter to XHTML, such as Tidy, together with a somewhat simpler XQuery (for example, only the last part of yours):

//b[contains(child::text(), "Product Dimensions:")]/following-sibling::text()

is a very good alternative.
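For readers who want to try that expression outside an XQuery engine: lxml's HTML parser tolerates non-XHTML markup (playing the role of Tidy here) and evaluates the same XPath. A minimal sketch against a hypothetical product fragment:

```python
# Third-party library: pip install lxml
from lxml import html

# Hypothetical (possibly invalid) HTML fragment; lxml's parser
# tidies it into a tree we can query.
doc = html.fromstring(
    "<ul><li><b>Product Dimensions:</b> 10 x 5 x 2 inches</li></ul>"
)

# The simplified XPath from the answer above: find the <b> label,
# then take the text node that follows it.
result = doc.xpath(
    '//b[contains(child::text(), "Product Dimensions:")]'
    "/following-sibling::text()"
)
print(result[0].strip())  # 10 x 5 x 2 inches
```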

Regards,

Rafal Rusin

To work on HTML pages, it is best to use HTMLAgilityPack (with some LINQ code). It's a great way to parse through all the elements and/or do a direct search with XPath. In my opinion, it is more accurate than RegEx and easier to program. I was a bit reluctant to use it previously, but it's very easy to add to your project, and I think it is the de facto standard for working with HTML. http://htmlagilitypack.codeplex.com/

Good luck!
