How to design a scraper for companies such as owler? [closed]

Submitted by 我是研究僧i on 2019-12-24 09:37:32

Question


I am trying to develop a scraper for various sites like angel.co, but I'm stuck designing a crawler for www.owler.com because it requires logging in by email before you can access information about a company. Each time you log in, a new login token is sent to your email, and it expires after some time.

Is there a proper way to deal with this situation? I'm only looking for guidelines on handling this type of login. I have already tried automating the task with Selenium, but it wasn't very helpful.


Answer 1:


I got you, man! Yes, this can be done via Selenium, but it takes some advanced knowledge of Selenium and a basic understanding of cookies and of how users are authenticated on websites.

Off the top of my head you have the following options:

  • 1. Storing the email-received authentication link & injecting the token inside it into your browser session in the form of a cookie;
  • 2. Storing your session in the form of a Selenium Profile specific to the browser you're running your tests on and loading it afterwards on the instance spawned by your script.

1. (Note: This worked like a charm from the first go so follow closely.)

  • Open www.owler.com in an incognito window (I am using Chrome) and open the cookies section;
  • Spot the cookies you are working with (see this print-screen);
  • Sign in so that you receive the email, then inspect the sign-in link (see this print-screen);
  • Copy & load the link into another browser (not your incognito session);
  • Once you are logged in, open the browser console (F12, or CTRL+Shift+J on Chrome) > go to the Application tab > click on the Cookies section (for the Owler domain) and copy the value of the OWLER_PC cookie (see this print-screen for more details);
  • In your anonymous session (not logged in), go to the browser console and add the auth token as a cookie by assigning to the document.cookie property, like this: document.cookie = "OWLER_PC=<yourTokenHere>";
  • Refresh the page 2 times, and VOILA, you are logged in.

Note: I knew the cookie had to be added as OWLER_PC because I inspected the logged-in session and it was the only cookie that was new. The cookie's value is (usually) the same as the authentication token you receive via email.
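The comparison described in this note can also be done programmatically. Below is a minimal Python sketch, assuming cookies are represented as the list of dicts that Selenium's `driver.get_cookies()` returns. The helper name `new_cookie_names` is hypothetical, and the sample cookies (including `JSESSIONID`) are made up for illustration; they are not taken from the real site.

```python
def new_cookie_names(before, after):
    """Return the names of cookies present in `after` but not in `before`.

    Both arguments are lists of cookie dicts, e.g. two snapshots of
    driver.get_cookies() taken before and after logging in.
    """
    seen = {cookie["name"] for cookie in before}
    return sorted({cookie["name"] for cookie in after} - seen)


# Made-up sample data standing in for two get_cookies() snapshots.
anonymous = [{"name": "JSESSIONID", "value": "abc"}]
logged_in = [
    {"name": "JSESSIONID", "value": "abc"},
    {"name": "OWLER_PC", "value": "<yourTokenHere>"},
]
print(new_cookie_names(anonymous, logged_in))  # ['OWLER_PC']
```

Any cookie name this reports is only a candidate for the session cookie; you would still need to confirm it by injecting it into an anonymous session.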

Now all that is left to do is simulate this via code. You have to store one of these email authentication tokens in your script (note that they expire after 1 year, so you should be good).

Then, once you've opened your session, use the Selenium bindings for the framework/language you are using to add said cookie, then refresh the page. For WebdriverIO/JavaScript (my weapons of choice) it goes something like this:

browser.setCookie({name: 'OWLER_PC', value: 'SPF-yNNJSXeXJ...'});
browser.refresh();
browser.refresh();
// Assert you are logged in 
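For the Python Selenium bindings, the same steps might look roughly like the sketch below. It is not the answer's own code: the helper names `login_with_cookie` and `main`, the `https://` scheme in `OWLER_URL`, and the `<yourTokenHere>` placeholder are assumptions for illustration, and the token is assumed to be still valid. Only `get`, `add_cookie`, `refresh`, and `get_cookie` are used from the WebDriver API.

```python
OWLER_URL = "https://www.owler.com"  # scheme is an assumption


def login_with_cookie(driver, token, cookie_name="OWLER_PC", refreshes=2):
    """Inject a stored auth token as a cookie, then refresh the page.

    `driver` is expected to be a Selenium WebDriver; `token` is the cookie
    value copied from a logged-in session.
    """
    # A cookie can only be set for the domain currently loaded in the browser.
    driver.get(OWLER_URL)
    driver.add_cookie({"name": cookie_name, "value": token})
    # The answer reports needing two refreshes before the login shows up.
    for _ in range(refreshes):
        driver.refresh()
    # Only checks that the cookie is set, not that the site accepted it.
    return driver.get_cookie(cookie_name) is not None


def main():
    # Imported here so the helper above can be exercised without a browser.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        print(login_with_cookie(driver, "<yourTokenHere>"))
    finally:
        driver.quit()
```

Calling `main()` would run this against a real Chrome instance. The return value only tells you the cookie was stored, so you should still assert on page content to confirm the login.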

2. Sometimes you don't want to add cookies or write boilerplate code just to be logged into a website, or you want a specific set of browser extensions loaded on your Selenium driver instance. For that, you use browser profiles.

You will have to read up on it yourself, as it is a lengthy topic. This question might also help you, as you are using the Python Selenium bindings.
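As a starting point, here is a minimal Python sketch of loading an existing Chrome profile through Selenium. It assumes Chrome and a profile directory in which you have already logged in manually. The helper names `profile_arguments` and `chrome_with_profile` and the path `/path/to/chrome-profile` are hypothetical; `--user-data-dir` and `--profile-directory` are Chrome's own command-line switches.

```python
def profile_arguments(user_data_dir, profile_directory="Default"):
    """Build the Chrome command-line switches that select an existing profile."""
    return [
        f"--user-data-dir={user_data_dir}",
        f"--profile-directory={profile_directory}",
    ]


def chrome_with_profile(user_data_dir, profile_directory="Default"):
    """Start Chrome through Selenium using an existing profile."""
    # Imported lazily so profile_arguments() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in profile_arguments(user_data_dir, profile_directory):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


# Placeholder path; point this at your own profile directory.
print(profile_arguments("/path/to/chrome-profile"))
```

Because the profile keeps its cookies between runs, a session that was logged in manually should still be logged in when the script starts, for as long as the site's cookie remains valid.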

Hope this helps!



Source: https://stackoverflow.com/questions/44213119/how-to-design-a-scraper-for-companies-such-as-owler
