Scrapy crawler in Cron job

Backend · unresolved · 7 answers · 1631 views

遥遥无期 2020-12-13 15:11

I want to execute my Scrapy crawler from a cron job.

I created a bash file getdata.sh in the folder where the Scrapy project and its spiders are located:

#!/bin/bash
cd /         


        
7 Answers
  • 2020-12-13 15:28

    Adding the following lines in crontab -e runs my scrapy crawl at 5AM every day. This is a slightly modified version of crocs' answer

    PATH=/usr/bin
    0 5 * * * cd project_folder/project_name/ && scrapy crawl spider_name
    

    Without setting $PATH, cron gave me the error "command not found: scrapy". I guess this is because /usr/bin is where Ubuntu stores the executables for installed programs.

    Note that the complete path for my scrapy project is /home/user/project_folder/project_name. I ran the env command in cron and noticed that the working directory is /home/user. Hence I skipped /home/user in my crontab above.

    The cron log can be helpful while debugging

    grep CRON /var/log/syslog
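
    Since cron runs with a minimal environment, a quick way to reproduce the env check mentioned above is a temporary crontab entry that dumps cron's environment to a file (the output path is just an example):

        * * * * * env > /tmp/cron-env.txt

    Remove the entry again once you have inspected /tmp/cron-env.txt.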
    
  • 2020-12-13 15:31

    Does your shell script have execute permission?

    For example, can you run

      /myfolder/crawlers/getdata.sh

    without the sh?

    If so, you can drop the sh from the line in cron.
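
    If it is not executable yet, a single chmod fixes that. A minimal self-contained sketch using a temporary stand-in script (in practice you would run chmod on /myfolder/crawlers/getdata.sh directly):

    ```shell
    # Stand-in for the real script; in practice: chmod +x /myfolder/crawlers/getdata.sh
    script=$(mktemp /tmp/getdata.XXXXXX)
    printf '#!/bin/bash\necho "crawler would run here"\n' > "$script"

    chmod +x "$script"   # grant execute permission
    "$script"            # now runs directly, no leading "sh" needed
    rm -f "$script"
    ```
    
    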

  • 2020-12-13 15:32

    I solved this problem by including PATH in the bash file:

    #!/bin/bash
    
    cd /myfolder/crawlers/
    PATH=$PATH:/usr/local/bin
    export PATH
    scrapy crawl my_spider_name
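
    With PATH exported inside the script itself, the crontab entry can stay minimal; the schedule and log path below are hypothetical examples:

        0 6 * * * /myfolder/crawlers/getdata.sh >> /tmp/getdata.log 2>&1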
    
  • 2020-12-13 15:32

    In my case scrapy is installed in .local/bin/scrapy. Give the proper path to the scrapy executable and the spider name, and it works perfectly:

    0 0 * * * cd /home/user/scraper/Folder_of_scriper/ && /home/user/.local/bin/scrapy crawl "name" >> /home/user/scrapy.log 2>&1

    The >> /home/user/scrapy.log 2>&1 part saves the output and errors to scrapy.log, so you can check whether the program worked.

    Thank you.
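
    The >> file 2>&1 redirection is what routes both normal output and errors into the same log. A self-contained sketch of that mechanism:

    ```shell
    # stdout and stderr are both appended to the same log file, as in the crontab line
    log=$(mktemp)
    { echo "crawl output"; echo "crawl error" >&2; } >> "$log" 2>&1
    cat "$log"   # both lines appear in the log
    rm -f "$log"
    ```
    
    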

  • 2020-12-13 15:41

    Check where scrapy is installed using the which scrapy command. In my case, scrapy is installed in /usr/local/bin.

    Open crontab for editing using crontab -e, and add:

    PATH=$PATH:/usr/local/bin
    export PATH
    */5 * * * * cd /myfolder/path && scrapy crawl spider_name

    It should work. Scrapy runs every 5 minutes.
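
    To find the directory cron's PATH needs, combine which with dirname; since scrapy may not be on PATH in every environment, a fallback to the path from this answer is shown:

    ```shell
    # Locate the scrapy executable; fall back to this answer's path if it is absent
    scrapy_bin=$(command -v scrapy) || scrapy_bin=/usr/local/bin/scrapy
    echo "Add $(dirname "$scrapy_bin") to PATH in your crontab"
    ```
    
    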

  • 2020-12-13 15:50

    For anyone who used pip3 (or similar) to install scrapy, here is a simple inline solution:

    */10 * * * * cd ~/project/path && ~/.local/bin/scrapy crawl something >> ~/crawl.log 2>&1
    

    Replace:

    */10 * * * * with your cron pattern

    ~/project/path with the path to your scrapy project (where your scrapy.cfg is)

    something with the spider name (use scrapy list in your project to find out)

    ~/crawl.log with your log file position (in case you want to have logging)
