Is there a way to extract text along with text-links in Scrapy using CSS?

Posted by 天大地大妈咪最大 on 2021-01-20 13:26:31

Question


I'm brand new to Scrapy. I have learned how to use response.css() to read specific parts of a web page, and I'm avoiding learning XPath for now. It seems to do the exact same thing, just in a different format (correct me if I'm wrong).

The site I'm scraping has long paragraphs of text, with an occasional linked text right in the middle. This sentence with a link to a picture of a dog is an example. I'm not sure if there is a way to have a spider read the text, with links in place (I've only been using response.css("p::text").extract())

Is there a way, using CSS (preferably) or xpath that I can grab all text in the paragraphs including the link-embedded text, without moving the links or link-text out of the sentence? The wording is difficult on this so apologies if I need to re-explain or give an example.

Edit: some clarification is needed, as this was poorly explained initially. A statement on this webpage can look like: <p>My sentence has a <a href="https://www.google.com">link to google</a> in it.</p> But when you use response.css("p::text").extract(), that sentence shows up as the list ["My sentence has a ", " in it."], dropping the link text entirely. My goal is to get: ["My sentence has a link to google in it."]


Answer 1:


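You can extract the text with the selector 'p ::text'. The space before ::text matters: it selects every text node inside the <p>, including text in nested tags such as <a>, rather than only the paragraph's direct text children: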

>>> txt = """<p>My sentence has a <a href="https://www.google.com">link to google</a> in it.</p>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> sel.css('p ::text').extract()
[u'My sentence has a ', u'link to google', u' in it.']
>>> ''.join(sel.css('p ::text').extract())
u'My sentence has a link to google in it.'

Alternatively, use the w3lib.html module to strip the HTML tags from the extracted paragraph:

from w3lib.html import remove_tags
with_tags = response.css("p").get()
clean_text = remove_tags(with_tags)

The first variant is shorter and more readable, though.
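Each text node returned by 'p ::text' already carries its own leading or trailing space, so the pieces can be concatenated directly and any leftover whitespace runs collapsed afterwards. A minimal sketch of that step, assuming the fragment list shown in the session above; join_text_fragments is a hypothetical helper name, not part of Scrapy:

```python
import re


def join_text_fragments(fragments):
    """Concatenate text nodes and collapse runs of whitespace.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    joined = "".join(fragments)
    return re.sub(r"\s+", " ", joined).strip()


# Same fragments that sel.css('p ::text').extract() returned above.
fragments = ['My sentence has a ', 'link to google', ' in it.']
sentence = join_text_fragments(fragments)
print(sentence)
```

In a spider, one way to keep paragraphs separate would be to loop over response.css('p') and pass p.css('*::text').getall() for each paragraph to the helper.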




Answer 2:


Use html-text after extracting the whole paragraph:

from html_text import extract_text

for paragraph in response.css('p'):
    html = paragraph.get()
    text = extract_text(html)
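To make the idea of "take the whole paragraph, then drop the tags" concrete, here is a rough sketch that uses only the standard library's html.parser. It is a stand-in for illustration, not how html-text or w3lib work internally, and the ParagraphText class name is made up for this example:

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p>, keeping text from nested tags such as <a>.

    Illustrative stand-in only; it does not handle nested <p> or unclosed tags.
    """

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._parts = None    # text nodes of the <p> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears while a <p> is open.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            # Join the nodes as they are, then collapse whitespace runs.
            self.paragraphs.append(" ".join("".join(self._parts).split()))
            self._parts = None


parser = ParagraphText()
parser.feed('<p>My sentence has a <a href="https://www.google.com">link to google</a> in it.</p>')
parser.close()
paragraphs = parser.paragraphs
print(paragraphs)
```

For real pages a dedicated library is likely the safer choice, since it deals with malformed markup and block-level spacing that this sketch ignores.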


Source: https://stackoverflow.com/questions/55779773/is-there-a-way-to-extract-text-along-with-text-links-in-scrapy-using-css
