How can I extract only the text from a Scrapy selector in Python?

[愿得一人] 2020-12-13 14:42

I have this code

   site = hxs.select("//h1[@class='state']")
   log.msg(str(site[0].extract()), level=log.ERROR)

The output is



        
5 Answers
  • 2020-12-13 15:15
    //h1[@class='state']
    

    This XPath selects the h1 element whose class attribute is state, so extract() returns the whole element: the tags as well as everything inside them.

    To select only the text directly inside the h1 tag, append /text():

    //h1[@class='state']/text()
    

    To select the text of the h1 tag together with the text of all the tags nested inside it, use //text() instead:

    //h1[@class='state']//text()
    

    In short, /text() returns only the text nodes that are direct children of the matched tag, while //text() returns the text nodes of the tag and of all its descendants.

    The following code should work for you:

    site = ''.join(hxs.select("//h1[@class='state']/text()").extract()).strip()
    
  • 2020-12-13 15:22

    I don't have a Scrapy instance running, so I couldn't test this, but you could try using text() in your XPath expression.

    For example:

    site = hxs.select("//h1[@class='state']/text()")
    

    (I got this from the Scrapy tutorial.)

  • 2020-12-13 15:27

    You can use html2text:

    import html2text
    converter = html2text.HTML2Text()
    # handle() converts an HTML string to plain (Markdown-style) text
    print(converter.handle("<div>Please!!!<span>remove me</span></div>"))
    
  • 2020-12-13 15:34

    You can use BeautifulSoup to strip html tags, here is an example:

    from bs4 import BeautifulSoup
    # collect every text node of the extracted element and join them
    ''.join(BeautifulSoup(site[0].extract(), "html.parser").find_all(string=True))
    

    You can then strip the extra whitespace, newlines and so on.

    If you don't want to use additional modules, you can try a simple regex:

    import re

    # replace html tags with ' '
    text = re.sub(r'<[^>]*?>', ' ', str(site[0].extract()))
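
The regex approach and the whitespace cleanup mentioned above can be combined using only the standard library. This is a rough sketch: the raw string is a hypothetical stand-in for str(site[0].extract()), and a regex like this is only suitable for simple, well-formed markup.

```python
import re

# Hypothetical stand-in for str(site[0].extract()) from the question.
raw = "<h1 class='state'>\n  New <b>York</b>\n  State\n</h1>"

# Replace every tag with a space, as in the regex above.
no_tags = re.sub(r"<[^>]*?>", " ", raw)

# Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
# and trim both ends.
text = re.sub(r"\s+", " ", no_tags).strip()

print(text)  # New York State
```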
    
  • 2020-12-13 15:39

    You can use BeautifulSoup's get_text() method.

    from bs4 import BeautifulSoup
    
    text = '''
    <td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
    <br/><a href="http://www.fakewebsite.com">I am waiting....</a>
    </td>
    '''
    # name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(text, "html.parser")
    
    print(soup.get_text())
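
Called without arguments, get_text() keeps the newlines that sit between the tags. If cleaner output is wanted, its separator and strip parameters can help. A small sketch, assuming bs4 is installed and reusing the same sample markup:

```python
from bs4 import BeautifulSoup

text = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(text, "html.parser")

# strip=True trims each text fragment and drops the empty ones;
# the separator is placed between the fragments that remain.
clean = soup.get_text(separator=" ", strip=True)

print(clean)  # Please can you strip me? I am waiting....
```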
    