Access Django models with scrapy: defining path to Django project

遥遥无期 2020-12-07 12:44

I'm very new to Python and Django. I'm currently exploring using Scrapy to scrape sites and save data to the Django database. My goal is to run a spider based on domain gi…

2 Answers
  •  夕颜 (OP)
     2020-12-07 13:04

    I think the main misconception is the package path vs. the settings module path. In order to use Django's models from an external script, you need to set DJANGO_SETTINGS_MODULE. Then, this module has to be importable (i.e. if the settings path is myproject.settings, then the statement from myproject import settings should work in a Python shell).

    As most Django projects are created in a path outside the default PYTHONPATH, you must add the project's path to the PYTHONPATH environment variable.
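
    Those two requirements (an importable project package plus DJANGO_SETTINGS_MODULE) can be sketched in a few lines of Python. The path below is the one this guide assumes; adjust it to your own layout:

```python
import os
import sys

# Hypothetical project path used throughout this guide -- adjust to yours.
PROJECT_PATH = '/home/rolando/projects/myweb'

# 1. Make the "myweb" package importable from an external script.
sys.path.insert(0, PROJECT_PATH)

# 2. Tell Django which settings module to load.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myweb.settings')

# With both in place, Django can be initialized from any script:
#   import django; django.setup()
# and `from myapp.models import Person` becomes importable.
```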

    Here is a step-by-step guide to create a fully working (and minimal) Django models integration into a Scrapy project:

    Note: These instructions worked as of the date of the last edit. If they don't work for you, please add a comment describing your issue and your Scrapy/Django versions.

    1. The projects will be created within the /home/rolando/projects directory.

    2. Start the django project.

      $ cd ~/projects
      $ django-admin startproject myweb
      $ cd myweb
      $ ./manage.py startapp myapp
      
    3. Create a model in myapp/models.py.

      from django.db import models
      
      
      class Person(models.Model):
          name = models.CharField(max_length=32)
      
    4. Add myapp to INSTALLED_APPS in myweb/settings.py.

      # at the end of settings.py
      INSTALLED_APPS += ('myapp',)
      
    5. Set the database settings in myweb/settings.py.

      # at the end of settings.py
      DATABASES['default']['ENGINE'] = 'django.db.backends.sqlite3'
      DATABASES['default']['NAME'] = '/tmp/myweb.db'
      
    6. Create the database. Note: syncdb was removed in Django 1.9; on newer versions, use ./manage.py migrate instead.

      $ ./manage.py syncdb --noinput
      Creating tables ...
      Installing custom SQL ...
      Installing indexes ...
      Installed 0 object(s) from 0 fixture(s)
      
    7. Create the scrapy project.

      $ cd ~/projects
      $ scrapy startproject mybot
      $ cd mybot
      
    8. Create an item in mybot/items.py.

    Note: In newer versions of Scrapy, you need to install scrapy_djangoitem and use from scrapy_djangoitem import DjangoItem.

        from scrapy.contrib.djangoitem import DjangoItem
        from scrapy.item import Field
    
        from myapp.models import Person
    
    
        class PersonItem(DjangoItem):
            # fields for this item are automatically created from the django model
            django_model = Person
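
    To make the automatic field creation concrete, here is an illustrative, pure-Python sketch of the idea (not Scrapy's actual implementation, and FakeField/FakeMeta are stand-ins for Django's model metadata): DjangoItem reflects over the model's fields and creates one item field per model field.

```python
# Illustrative sketch only -- these classes mimic Django model metadata.
class FakeField:
    def __init__(self, name):
        self.name = name

class FakeMeta:
    # Roughly what Person._meta.fields looks like for the model above.
    fields = [FakeField('id'), FakeField('name')]

def derive_item_fields(meta):
    # One scrapy-style Field slot (here, a plain dict) per model field.
    return {f.name: {} for f in meta.fields}

print(derive_item_fields(FakeMeta))  # → {'id': {}, 'name': {}}
```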
    

    The final directory structure is this:

    /home/rolando/projects
    ├── mybot
    │   ├── mybot
    │   │   ├── __init__.py
    │   │   ├── items.py
    │   │   ├── pipelines.py
    │   │   ├── settings.py
    │   │   └── spiders
    │   │       └── __init__.py
    │   └── scrapy.cfg
    └── myweb
        ├── manage.py
        ├── myapp
        │   ├── __init__.py
        │   ├── models.py
        │   ├── tests.py
        │   └── views.py
        └── myweb
            ├── __init__.py
            ├── settings.py
            ├── urls.py
            └── wsgi.py
    

    From here, we are essentially done with the code required to use Django models in a Scrapy project. We can test it right away using the scrapy shell command, but be aware of the required environment variables:

    $ cd ~/projects/mybot
    $ PYTHONPATH=~/projects/myweb DJANGO_SETTINGS_MODULE=myweb.settings scrapy shell
    
    # ... scrapy banner, debug messages, python banner, etc.
    
    In [1]: from mybot.items import PersonItem
    
    In [2]: i = PersonItem(name='rolando')
    
    In [3]: i.save()
    Out[3]: <Person: Person object>
    
    In [4]: PersonItem.django_model.objects.get(name='rolando')
    Out[4]: <Person: Person object>
    

    So, it is working as intended.

    Finally, you might not want to have to set the environment variables each time you run your bot. There are many ways to address this; the cleanest is to actually install the projects' packages into a path already on PYTHONPATH.

    This is one of the simplest solutions: add these lines to your mybot/settings.py file to set up the environment variables.

    # Setting up django's project full path.
    import sys
    sys.path.insert(0, '/home/rolando/projects/myweb')
    
    # Setting up django's settings module name.
    # This module is located at /home/rolando/projects/myweb/myweb/settings.py.
    import os
    os.environ['DJANGO_SETTINGS_MODULE'] = 'myweb.settings'
    
    # Since Django 1.7, setup() call is required to populate the apps registry.
    import django; django.setup()
    

    Note: A better approach than this path hacking is to have setuptools-based setup.py files in both projects and run python setup.py develop, which links your project path into Python's path (I'm assuming you use virtualenv).
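
    A minimal setup.py for the Django project could look like this (a sketch; the name and version are assumptions, so adjust them to your project):

```python
# file: ~/projects/myweb/setup.py -- minimal sketch for use with
# `python setup.py develop`; name/version here are placeholders.
from setuptools import setup, find_packages

setup(
    name='myweb',
    version='0.1',
    packages=find_packages(),
)
```

    Running python setup.py develop inside the virtualenv links the package into site-packages, so the PYTHONPATH hack above is no longer needed.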

    That is enough. For completeness, here is a basic spider and pipeline for a fully working project:

    1. Create the spider.

      $ cd ~/projects/mybot
      $ scrapy genspider -t basic example example.com
      

      The spider code (note: in newer Scrapy versions, import scrapy and subclass scrapy.Spider instead of BaseSpider):

      # file: mybot/spiders/example.py
      from scrapy.spider import BaseSpider
      from mybot.items import PersonItem
      
      
      class ExampleSpider(BaseSpider):
          name = "example"
          allowed_domains = ["example.com"]
          start_urls = ['http://www.example.com/']
      
          def parse(self, response):
              # do stuff
              return PersonItem(name='rolando')
      
    2. Create a pipeline in mybot/pipelines.py to save the item.

      class MybotPipeline(object):
          def process_item(self, item, spider):
              item.save()
              return item
      

      Here you can either use item.save() if you are using the DjangoItem class, or import the Django model directly and create the object manually. Either way, the main issue is defining the environment variables so the Django models are usable.
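
      For the second option, here is a hedged sketch of a pipeline that goes through the model directly (ModelSavePipeline is a hypothetical name; once Django is set up as above, the model passed in would be myapp.models.Person):

```python
# Hypothetical alternative to DjangoItem: save through the model manager.
# With Django configured, you would pass myapp.models.Person as `model`.
class ModelSavePipeline:
    def __init__(self, model):
        self.model = model

    def process_item(self, item, spider):
        # objects.create() builds and persists the instance in one call.
        self.model.objects.create(name=item['name'])
        return item
```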

    3. Add the pipeline setting to your mybot/settings.py file.

      ITEM_PIPELINES = {
          'mybot.pipelines.MybotPipeline': 1000,
      }
      
    4. Run the spider.

      $ scrapy crawl example
      
