Scrapy: extract text with special characters

Submitted by 感情迁移 on 2021-01-28 19:33:48

Question


I'm using Scrapy to extract text from some Spanish websites. The text is written in Spanish, so some words contain special characters such as 'ñ' or 'í'. My problem is that when I run the following on the command line to get a file with the scraped data, some characters are not displayed properly:

scrapy crawl econoticia -o prueba.json

For example, this is the original text:

"La exministra, procesada como partícipe a titulo lucrativo, intenta burlar a los fotógrafos"

And this is the scraped text:

"La exministra, procesada como part\u00edcipe a titulo lucrativo, intenta burlar a los fot\u00f3grafos"

I would like the JSON to contain the special characters themselves. I presume my spider code needs something extra to produce the JSON correctly. This is my spider code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from pais.items import PaisItem


class NoticiaSpider(scrapy.Spider):
    name = "noticia"
    allowed_domains = ["elpais.com"]
    start_urls = (...

    )

    def parse(self, response):

        hxs = HtmlXPathSelector(response)
        item = PaisItem()
        item['subtitulo'] = hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[1]/span/text()').extract()
        item['titular'] = hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[3]/div[2]/div[1]/h1/a/text()').extract()
        return item

Answer 1:


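Maybe you should add .encode('utf8') to the strings returned by extract(). Because extract() returns a list, the call has to be applied to each element rather than to the list itself.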




Answer 2:


When you write the characters to the file, you need to encode them as UTF-8. Try changing the last lines of your example to the following:

# extract() returns a list of strings, so each one is encoded individually
item['subtitulo'] = [s.encode('utf-8') for s in hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[1]/span/text()').extract()]
item['titular'] = [s.encode('utf-8') for s in hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[3]/div[2]/div[1]/h1/a/text()').extract()]
return item
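
For context, the \uXXXX sequences in the scraped output are valid JSON escapes, so any JSON parser reads them back as the original characters. The sketch below is only an illustration with Python's standard json module, not with Scrapy's own exporter. It takes the escaped sentence from the question, parses it, and compares the default serialization with one that writes the characters literally. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
import json

# The escaped text as it appears in prueba.json (copied from the question).
escaped = ('"La exministra, procesada como part\\u00edcipe a titulo lucrativo, '
           'intenta burlar a los fot\\u00f3grafos"')

# Parsing the JSON turns the \uXXXX escapes back into the real characters.
decoded = json.loads(escaped)

# By default json.dumps escapes every non-ASCII character ...
default_output = json.dumps(decoded)

# ... while ensure_ascii=False keeps the characters as they are.
literal_output = json.dumps(decoded, ensure_ascii=False)

# The default form is pure ASCII and matches the text seen in the file.
print(default_output)
```

Within Scrapy itself, newer releases provide a FEED_EXPORT_ENCODING setting (for example FEED_EXPORT_ENCODING = 'utf-8' in settings.py) that plays a similar role for the exported feed. Whether it is available depends on the installed version, so it is worth checking the documentation for that version.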


Source: https://stackoverflow.com/questions/29032653/scrapy-extract-text-with-special-characters
