Question
I'm trying to build a small app for a university project with Scrapy. The spider is scraping the items, but my pipeline is not inserting data into the MySQL database. To test whether the pipeline or the pymysql implementation is at fault, I wrote a test script:
Code Start
#!/usr/bin/python3
import pymysql
str1 = "hey"
str2 = "there"
str3 = "little"
str4 = "script"
db = pymysql.connect("localhost", "root", "**********", "stromtarife")
cursor = db.cursor()
cursor.execute("SELECT * FROM vattenfall")
cursor.execute("INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)", (str1, str2, str3, str4))
cursor.execute("SELECT * FROM vattenfall")
data = cursor.fetchone()
print(data)
db.commit()
cursor.close()
db.close()
Code End
After I run this script my database has a new record, so it's not my pymysql.connect() call that is broken.
Here is my Scrapy code:
vattenfall_form.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess
from stromtarife.items import StromtarifeItem
from scrapy.http import FormRequest
class VattenfallEasy24KemptenV1500Spider(scrapy.Spider):
    name = 'vattenfall-easy24-v1500-p87435'

    def start_requests(self):
        return [
            FormRequest(
                "https://www.vattenfall.de/de/stromtarife.htm",
                formdata={"place": "87435", "zipCode": "87435", "cityName": "Kempten",
                          "electricity_consumptionprivate": "1500", "street": "", "hno": ""},
                callback=self.parse
            ),
        ]

    def parse(self, response):
        item = StromtarifeItem()
        item['jahrespreis'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[3]/td[2]/text()').extract_first()
        item['treuebonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[2]/td/strong/text()').extract_first()
        item['sofortbonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[1]/td/strong/text()').extract_first()
        item['tarif'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[1]/h2/span/text()').extract_first()
        yield item
class VattenfallEasy24KemptenV2500Spider(scrapy.Spider):
    name = 'vattenfall-easy24-v2500-p87435'

    def start_requests(self):
        return [
            FormRequest(
                "https://www.vattenfall.de/de/stromtarife.htm",
                formdata={"place": "87435", "zipCode": "87435", "cityName": "Kempten",
                          "electricity_consumptionprivate": "2500", "street": "", "hno": ""},
                callback=self.parse
            ),
        ]

    def parse(self, response):
        item = StromtarifeItem()
        item['jahrespreis'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[3]/td[2]/text()').extract_first()
        item['treuebonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[2]/td/strong/text()').extract_first()
        item['sofortbonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[1]/td/strong/text()').extract_first()
        item['tarif'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[1]/h2/span/text()').extract_first()
        yield item
process = CrawlerProcess()
process.crawl(VattenfallEasy24KemptenV1500Spider)
process.crawl(VattenfallEasy24KemptenV2500Spider)
process.start()
pipelines.py
import pymysql
from stromtarife.items import StromtarifeItem
class StromtarifePipeline(object):
    def __init__(self):
        self.connection = pymysql.connect("localhost", "root", "**********", "stromtarife")
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        self.cursor.execute("INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)", (item['tarif'], item['sofortbonus'], item['treuebonus'], item['jahrespreis']))
        self.connection.commit()
        self.cursor.close()
        self.connection.close()
settings.py (I changed only this line):
ITEM_PIPELINES = {
    'stromtarife.pipelines.StromtarifePipeline': 300,
}
So what is wrong with my code? I couldn't figure it out and would be really happy if someone sees something I'm missing. Thanks in advance!
Answer 1:
You should not close your pymysql connection every time you process an item. You should write a close_spider method in your pipeline like this, so the connection is closed just once, at the end of the execution:
def close_spider(self, spider):
    self.cursor.close()
    self.connection.close()
Moreover, you need to return your item at the end of process_item. Your pipelines.py file should look like this:
import pymysql
from stromtarife.items import StromtarifeItem
class StromtarifePipeline(object):
    def __init__(self):
        self.connection = pymysql.connect("localhost", "root", "**********", "stromtarife")
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        self.cursor.execute("INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)", (item['tarif'], item['sofortbonus'], item['treuebonus'], item['jahrespreis']))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()
UPDATE:
I tried your code; the problem is in the pipeline. There are two issues:
- You try to insert the euro symbol € and I think MySQL does not like it.
- Your query string is not well built.
I managed to get things done by writing the pipeline like this:
def process_item(self, item, spider):
    query = """INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)""" % ("1", "2", "3", "4")
    self.cursor.execute(query)
    self.connection.commit()
    return item
I think you should remove the € from the prices you try to insert.
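For example, here is a minimal sketch of a helper that strips the symbol before a parameterized insert. The '89,00 €' price format is an assumption; adjust the cleaning to whatever the page actually returns:

def clean_price(value):
    # Assumption: scraped values look like '89,00 €'.
    # Strip the euro sign and surrounding whitespace before inserting.
    if value is None:
        return None
    return value.replace('€', '').strip()

def process_item(self, item, spider):
    self.cursor.execute(
        "INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)",
        (item['tarif'],
         clean_price(item['sofortbonus']),
         clean_price(item['treuebonus']),
         clean_price(item['jahrespreis'])))
    self.connection.commit()
    return item

Passing the values as a tuple lets pymysql do the escaping, which is safer than building the query string with % formatting.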
Hope this helps, let me know.
Answer 2:
There is another problem with your scraper besides the fact that your SQL pipeline closes the SQL connection after writing the first item (as Adrien pointed out).
The other problem is that your scraper only scrapes one single item per results page (and also visits only one results page). I checked Vattenfall and there are usually multiple results displayed, and I guess you want to scrape them all.
That means you'll also have to iterate over the results on the page and create multiple items while doing so, as sketched below. The Scrapy tutorial gives a good explanation of how to do this: https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-quotes-and-authors
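A rough sketch of what such a loop could look like. The '.result-row' selector and the relative XPaths are placeholders; you would have to inspect the actual Vattenfall results page to find the real ones:

def parse(self, response):
    # Placeholder selector: inspect the results page for the real one.
    for row in response.css('.result-row'):
        item = StromtarifeItem()
        # Relative XPaths scoped to the current result row (assumed structure).
        item['tarif'] = row.xpath('.//h2/span/text()').extract_first()
        item['sofortbonus'] = row.xpath('.//tr[1]/td/strong/text()').extract_first()
        item['treuebonus'] = row.xpath('.//tr[2]/td/strong/text()').extract_first()
        item['jahrespreis'] = row.xpath('.//tr[3]/td[2]/text()').extract_first()
        yield item

Yielding inside the loop produces one item per result, and each one is passed through the pipeline separately.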
Answer 3:
First of all, in your test script print(data) must come after db.commit(), otherwise the data which was just inserted into your database will not show up in the print.
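That is, the tail of the test script would be reordered like this:

# Commit first, then read back and print.
db.commit()
cursor.execute("SELECT * FROM vattenfall")
data = cursor.fetchone()
print(data)
cursor.close()
db.close()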
Lastly, judging by the names of your columns, if the idea above doesn't work it's probably an encoding issue.
Source: https://stackoverflow.com/questions/43656127/scrapy-pipeline-doesnt-insert-into-mysql