Access Django models with scrapy: defining path to Django project

遥遥无期 · 2020-12-07 12:44

I'm very new to Python and Django. I'm currently exploring using Scrapy to scrape sites and save data to the Django database. My goal is to run a spider based on domain gi

2 Answers
  •  悲&欢浪女
    2020-12-07 13:06

    Even though Rho's answer seems very good, I thought I'd share how I got scrapy working with Django models (aka the Django ORM) without a full-blown Django project, since the question only mentions a "Django database". Also, I do not use DjangoItem.

    The following works with Scrapy 0.18.2 and Django 1.5.2. My scrapy project is called scrapping in the following.

    1. Add the following to your scrapy settings.py file

      from django.conf import settings as d_settings
      d_settings.configure(
          DATABASES={
              'default': {
                  'ENGINE': 'django.db.backends.postgresql_psycopg2',
                  'NAME': 'db_name',
                  'USER': 'db_user',
                  'PASSWORD': 'my_password',
                  'HOST': 'localhost',  
                  'PORT': '',
              }},
          INSTALLED_APPS=(
              'scrapping',
          )
      )
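      A caveat for anyone trying this on a newer stack: from Django 1.7 onward, a standalone settings.configure() call must be followed by an explicit django.setup() before any models are imported. This is not needed on the Django 1.5.2 used here; a hedged sketch of the extra lines for newer versions:

```python
# Django 1.7+ only: after settings.configure(...) above, populate the
# app registry before importing any models.
import django
django.setup()
```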
      
    2. Create a manage.py file in the same folder as your scrapy.cfg: This file is not needed when you run the spider itself but is super convenient for setting up the database. So here we go:

      #!/usr/bin/env python
      import os
      import sys
      
      if __name__ == "__main__":
          os.environ.setdefault("DJANGO_SETTINGS_MODULE", "scrapping.settings")
      
          from django.core.management import execute_from_command_line
      
          execute_from_command_line(sys.argv)
      

      That's the entire content of manage.py, and it is pretty much exactly the stock manage.py you get after running django-admin startproject myweb, except that the DJANGO_SETTINGS_MODULE line points to your scrapy settings file. Admittedly, combining DJANGO_SETTINGS_MODULE with settings.configure seems a bit odd, but it works for the one manage.py command I need: $ python ./manage.py syncdb.

    3. Your models.py: Your models.py should be placed in your scrapy project folder (i.e. importable as scrapping.models). After creating that file you should be able to run $ python ./manage.py syncdb. It may look like this:

      from django.db import models
      
      class MyModel(models.Model):
          title = models.CharField(max_length=255)
          description = models.TextField()
          url = models.URLField(max_length=255, unique=True)
      
    4. Your items.py and pipelines.py: I used to use DjangoItem as described in Rho's answer, but I ran into trouble with it when running many crawls in parallel with scrapyd against PostgreSQL. At some point Postgres raised a max_locks_per_transaction error, breaking all the running crawls. Furthermore, I did not figure out how to properly roll back a failed item.save() in the pipeline. Long story short, I ended up not using DjangoItem at all, which solved all my problems. Here is how: items.py:

      from scrapy.item import Item, Field
      
      class MyItem(Item):
          title = Field()
          description = Field()
          url = Field()
      

      Note that the fields need to have the same name as in the model if you want to unpack them conveniently as in the next step! pipelines.py:

      from django.db import transaction
      from models import MyModel  # implicit relative import; works on Python 2

      class Django_pipeline(object):
          def process_item(self, item, spider):
              # commit_on_success is the transaction API up to Django 1.5;
              # Django 1.6+ replaced it with transaction.atomic().
              with transaction.commit_on_success():
                  scraps = MyModel(**item)
                  scraps.save()
              return item
      

      As mentioned above, if you named all your item fields like you did in your models.py file you can use **item to unpack all the fields when creating your MyModel object.
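      To make that concrete: a Scrapy Item exposes its fields like a mapping, so MyModel(**item) is ordinary keyword unpacking. A minimal sketch using a plain dict and a hypothetical stand-in class (no Scrapy or Django required):

```python
# Hypothetical stand-in for the Django model: its constructor takes
# the same keyword arguments as the item's field names.
class FakeModel(object):
    def __init__(self, title, description, url):
        self.title = title
        self.description = description
        self.url = url

# A Scrapy Item behaves like a dict of its fields, so a plain dict
# demonstrates the unpacking that MyModel(**item) performs.
item = {
    'title': 'Example page',
    'description': 'Scraped description text',
    'url': 'http://example.com/page',
}

scraps = FakeModel(**item)  # every key must match a constructor argument
print(scraps.url)
```

      If a field name in items.py diverged from the model, this call would fail with an unexpected-keyword-argument TypeError, which is exactly why the names must match.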

    That's it!
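    One wiring step left implicit above: Scrapy only invokes the pipeline if it is registered in the scrapy settings.py. On Scrapy 0.18 this is a list of class paths; assuming the class above lives in scrapping/pipelines.py, something like:

```python
# In the scrapy settings.py; the path assumes the project layout used above.
ITEM_PIPELINES = ['scrapping.pipelines.Django_pipeline']
```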
