python method to extract content (excluding navigation) from an HTML page
问题 Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc. I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of. 回答1: Try the Beautiful Soup library for