Access Django models with scrapy: defining path to Django project

遥遥无期 · 2020-12-07 12:44

I'm very new to Python and Django. I'm currently exploring using Scrapy to scrape sites and save data to the Django database. My goal is to run a spider based on domain gi

2 Answers
  •  悲&欢浪女
    2020-12-07 13:06

    Even though Rho's answer seems very good, I thought I'd share how I got scrapy working with Django models (aka the Django ORM) without a full-blown Django project, since the question only mentions a "Django database". Also, I do not use DjangoItem.

    The following works with Scrapy 0.18.2 and Django 1.5.2. My scrapy project is called scrapping in the following.

    1. Add the following to your scrapy settings.py file

      from django.conf import settings as d_settings
      d_settings.configure(
          DATABASES={
              'default': {
                  'ENGINE': 'django.db.backends.postgresql_psycopg2',
                  'NAME': 'db_name',
                  'USER': 'db_user',
                  'PASSWORD': 'my_password',
                  'HOST': 'localhost',  
                  'PORT': '',
              }},
          INSTALLED_APPS=(
              'scrapping',
          )
      )
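      A caveat for anyone trying this on a newer stack: from Django 1.7 onward, a standalone settings.configure() call must be followed by an explicit django.setup() before any models are imported. This is not needed on the Django 1.5.2 used here; a hedged sketch of the extra lines for newer versions:

```python
# Django 1.7+ only: after settings.configure(...) above, populate the
# app registry before importing any models.
import django
django.setup()
```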
      
    2. Create a manage.py file in the same folder as your scrapy.cfg: This file is not needed when you run the spider itself but is super convenient for setting up the database. So here we go:

      #!/usr/bin/env python
      import os
      import sys
      
      if __name__ == "__main__":
          os.environ.setdefault("DJANGO_SETTINGS_MODULE", "scrapping.settings")
      
          from django.core.management import execute_from_command_line
      
          execute_from_command_line(sys.argv)
      

      That's the entire content of manage.py, and it is pretty much exactly the stock manage.py you get after running django-admin startproject myweb, except that the DJANGO_SETTINGS_MODULE line points to your scrapy settings file. Admittedly, combining DJANGO_SETTINGS_MODULE with settings.configure seems a bit odd, but it works for the one manage.py command I need: $ python ./manage.py syncdb.

    3. Your models.py: Your models.py should be placed in your scrapy project folder (i.e. importable as scrapping.models). After creating that file you should be able to run $ python ./manage.py syncdb. It may look like this:

      from django.db import models
      
      class MyModel(models.Model):
          title = models.CharField(max_length=255)
          description = models.TextField()
          url = models.URLField(max_length=255, unique=True)
      
    4. Your items.py and pipelines.py: I used to use DjangoItem as described in Rho's answer, but I ran into trouble with it when running many crawls in parallel with scrapyd against PostgreSQL. At some point Postgres raised a max_locks_per_transaction error, breaking all the running crawls. Furthermore, I did not figure out how to properly roll back a failed item.save() in the pipeline. Long story short, I ended up not using DjangoItem at all, which solved all my problems. Here is how: items.py:

      from scrapy.item import Item, Field
      
      class MyItem(Item):
          title = Field()
          description = Field()
          url = Field()
      

      Note that the fields need to have the same name as in the model if you want to unpack them conveniently as in the next step! pipelines.py:

      from django.db import transaction
      from models import MyModel  # implicit relative import; works on Python 2

      class Django_pipeline(object):
          def process_item(self, item, spider):
              # commit_on_success is the transaction API up to Django 1.5;
              # Django 1.6+ replaced it with transaction.atomic().
              with transaction.commit_on_success():
                  scraps = MyModel(**item)
                  scraps.save()
              return item
      

      As mentioned above, if you named all your item fields like you did in your models.py file you can use **item to unpack all the fields when creating your MyModel object.
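      To make that concrete: a Scrapy Item exposes its fields like a mapping, so MyModel(**item) is ordinary keyword unpacking. A minimal sketch using a plain dict and a hypothetical stand-in class (no Scrapy or Django required):

```python
# Hypothetical stand-in for the Django model: its constructor takes
# the same keyword arguments as the item's field names.
class FakeModel(object):
    def __init__(self, title, description, url):
        self.title = title
        self.description = description
        self.url = url

# A Scrapy Item behaves like a dict of its fields, so a plain dict
# demonstrates the unpacking that MyModel(**item) performs.
item = {
    'title': 'Example page',
    'description': 'Scraped description text',
    'url': 'http://example.com/page',
}

scraps = FakeModel(**item)  # every key must match a constructor argument
print(scraps.url)
```

      If a field name in items.py diverged from the model, this call would fail with an unexpected-keyword-argument TypeError, which is exactly why the names must match.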

    That's it!
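    One wiring step left implicit above: Scrapy only invokes the pipeline if it is registered in the scrapy settings.py. On Scrapy 0.18 this is a list of class paths; assuming the class above lives in scrapping/pipelines.py, something like:

```python
# In the scrapy settings.py; the path assumes the project layout used above.
ITEM_PIPELINES = ['scrapping.pipelines.Django_pipeline']
```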
