Python threading or multiprocessing for web-crawler?

一曲冷凌霜 提交于 2020-01-06 07:16:47

问题


I've made simple web-crawler with Python. So far everything it does it creates set of urls that should be visited, set of urls that was already visited. While parsing page it adds all the links on that page to the should be visited set and page url to the already visited set and so on while length of should_be_visited is > 0. So far it does everything in one thread.

Now I want to add parallelism to this application, so I need to have same kind of set of links and few threads / processes, where each will pop one url from should_be_visited and update already_visited. I'm really lost at threading and multiprocessing, which I should use, do I need some Pools, Queues?


回答1:


The rule of thumb when deciding whether to use threads in Python or not is to ask the question, whether the task that the threads will be doing, is that CPU intensive or I/O intensive. If the answer is I/O intensive, then you can go with threads.

Because of the GIL, the Python interpreter will run only one thread at a time. If a thread is doing some I/O, it will block waiting for the data to become available (from the network connection or the disk, for example), and in the meanwhile the interpreter will context switch to another thread. On the other hand, if the thread is doing a CPU intensive task, the other threads will have to wait till the interpreter decides to run them.

Web crawling is mostly an I/O oriented task, you need to make an HTTP connection, send a request, wait for response. Yes, after you get the response you need to spend some CPU to parse it, but besides that it is mostly I/O work. So, I believe, threads are a suitable choice in this case.

(And of course, respect the robots.txt, and don't storm the servers with too many requests :-)




回答2:


Another alternative is asynchronous I/O, which is much better for this kind of I/O-bound tasks (unless processing a page is really expensive). You can try both with asyncio or Tornado, using its httpclient.



来源:https://stackoverflow.com/questions/40894487/python-threading-or-multiprocessing-for-web-crawler

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!