Data crawler or something else

旧街凉风 submitted on 2019-12-08 04:27:45

Question


I'm looking for something, but I don't know exactly how it can be done. I don't have deep knowledge of crawling, scraping, etc., but I believe that is the kind of technology I'm looking for.

  1. I have a list of around 100 websites that I'd like to monitor constantly, at least once every 3 or 4 days. On these websites I'd look for some logical matches, like:

(Text contains 'ABC' AND doesn't contain 'BCZ') OR (text contains 'XYZ' AND doesn't contain 'ATM'), and so on.

  2. The tool would have to look into these websites in:

    • Web pages
    • DOC files
    • DOCX files
    • XLS files
    • XLSX files
    • TXT files
    • RTF files
    • PDF files
    • RAR and ZIP files
  3. The matches would have to be incremental (I just want the most recent ones, from the previous X days)

  4. Most importantly, out of these 100 websites, around 40 require user authentication (which I have already).

  5. Whenever there's a match, I'd like to download:

    • File
    • Link
    • Date/time
    • Report of matches
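
To make the matching and incremental requirements above concrete, here is a minimal Python sketch, not a finished tool. It assumes the text of each page or file has already been extracted, that matching is a case-insensitive substring test, and that each document is a dict with hypothetical keys `link`, `modified` and `text`. The names `Clause`, `matching_clauses` and `build_report` are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Clause:
    """One AND-group: every required term present, no forbidden term present."""
    required: tuple
    forbidden: tuple = ()

    def matches(self, text: str) -> bool:
        # Assumption: case-insensitive substring matching is good enough.
        lowered = text.lower()
        has_all = all(term.lower() in lowered for term in self.required)
        has_none = not any(term.lower() in lowered for term in self.forbidden)
        return has_all and has_none


def matching_clauses(text, clauses):
    """Return the clauses the text satisfies; the clauses are OR-ed together."""
    return [clause for clause in clauses if clause.matches(text)]


def build_report(documents, clauses, now, max_age_days):
    """Keep documents modified within the last max_age_days that match a clause."""
    cutoff = now - timedelta(days=max_age_days)
    report = []
    for doc in documents:
        if doc["modified"] < cutoff:
            continue  # incremental: skip anything older than the cutoff
        hits = matching_clauses(doc["text"], clauses)
        if hits:
            report.append({
                "link": doc["link"],
                "modified": doc["modified"],
                "matched": [clause.required for clause in hits],
            })
    return report
```

The example rule from the list would be written as `[Clause(("ABC",), ("BCZ",)), Clause(("XYZ",), ("ATM",))]`, and each entry returned by `build_report` carries the link, date/time and matched terms asked for in the last item. Downloading the file itself and logging in to the authenticated sites are left out of this sketch.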

I've been playing around with tools like import.io, but I haven't figured out how to do it properly!
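
For the archive formats in the file-type list above, one possible approach is to unpack each downloaded archive in memory and hand the text of its members to the matcher. The sketch below uses only Python's standard `zipfile` module and handles ZIP archives with TXT members, including nested ZIPs. RAR, PDF and the Office formats would need third-party libraries and are not shown. The function name `iter_text_members` and the `!/` path separator are illustrative choices, and UTF-8 decoding is an assumption.

```python
import io
import zipfile

TEXT_SUFFIXES = (".txt",)


def iter_text_members(zip_bytes, prefix=""):
    """Yield (path, text) for each .txt member, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = prefix + info.filename
            data = archive.read(info)
            lower = info.filename.lower()
            if lower.endswith(".zip"):
                # Recurse, marking the nesting in the reported path.
                yield from iter_text_members(data, prefix=name + "!/")
            elif lower.endswith(TEXT_SUFFIXES):
                # Assumption: members are UTF-8; undecodable bytes are replaced.
                yield name, data.decode("utf-8", errors="replace")
```

Each yielded pair can then be checked against the keyword rules, with the path reported alongside the link of the archive it came from.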

Does anyone know exactly which kind of technology I am looking for? Who (what kind of specialist or programmer) could build this for me? Is it too hard for a programmer who understands data crawling to build?

Sorry for the long post


Answer 1:


For the 60 websites that don't require authentication:

You can use a tool like backstitch to mark websites you want to monitor, and get an interactive thumbnail feed of pages whose content contains the keywords you want. Backstitch supports boolean operators (the AND / OR functionality you described), and has an API that may allow you to export the results in the format you need.

Their support team (and CEO) have been very helpful in the past with describing how their API can be used for custom search cases. Good luck!



Source: https://stackoverflow.com/questions/32141516/data-crawler-or-something-else
