Downloading all pdf files from google scholar search results using wget

风流意气都作罢 提交于 2019-12-04 10:54:51

问题


I'd like to write a simple web spider or just use wget to download pdf results from google scholar. That would actually be quite a spiffy way to get papers for research.

I have read the following pages on stackoverflow:

Crawl website using wget and limit total number of crawled links

How do web spiders differ from Wget's spider?

Downloading all PDF files from a website

How to download all files (but not HTML) from a website using wget?

The last page is probably the most inspirational of all. I did try using wget as suggested on this.

My google scholar search result page is thus but nothing was downloaded.

Given that my level of understanding of webspiders is minimal, what should I do to make this possible? I do realize that writing a spider is perhaps very involved and is a project I may not want to undertake. If it is possible using wget, that would be absolutely awesome.


回答1:


wget -e robots=off -H --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" -r -l 1 -nd -A pdf http://scholar.google.com/scholar?q=filetype%3Apdf+liquid+films&btnG=&hl=en&as_sdt=0%2C23

A few things to note:

  1. Use of filetyle:pdf in the search query
  2. One level of recursion
  3. -A pdf for only accepting pdfs
  4. -H to span hosts
  5. -e robots=off and use of --user-agent will ensure best results. Google Scholar rejects a blank user agent, and pdf repositories are likely to disallow robots.

The limitation of course is that this will only hit the first page of results. You could expand the depth of recursion, but this will run wild and take forever. I would recommend using a combination of something like Beautiful Soup and wget subprocesses, so that you can parse and traverse the search results strategically.



来源:https://stackoverflow.com/questions/12272488/downloading-all-pdf-files-from-google-scholar-search-results-using-wget

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!