How can I scrape LinkedIn company pages with cURL and PHP? No CSRF token found in headers error

Submitted by 送分小仙女 on 2019-12-10 20:56:39

Question


I want to scrape some LinkedIn company pages with cURL and PHP. The LinkedIn API is not built for that, so I have to do this with PHP. If there are any other options, please let me know.

Before scraping the company page I have to sign in to LinkedIn with a personal account via cURL, but it doesn't seem to work.

I get a 'No CSRF token found in headers' error.

Could someone help me out?

Thanks!

<?php

require_once 'dom/simple_html_dom.php';

$linkedin_login_page = "https://www.linkedin.com/uas/login";

$username = 'linkedin_username';
$password = 'linkedin_password';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $linkedin_login_page);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_VERBOSE, 1);

$login_content = str_get_html(curl_exec($ch));

if(curl_error($ch)) {
  echo 'error:' . curl_error($ch);
}

if ($login_content) {

  if (($login_content->find('input[name=isJsEnabled]', 0))) {
    foreach($login_content->find('input[name=isJsEnabled]') as $element) {

      $isJsEnabled = trim($element->value);

      if ($isJsEnabled === "false") {
        $isJsEnabled = "true";
      }

    }
  }

  if (($login_content->find('input[name=source_app]', 0))) {
    foreach($login_content->find('input[name=source_app]') as $element) {
      $source_app = trim($element->value);
    }
  }

  if (($login_content->find('input[name=tryCount]', 0))) {
    foreach($login_content->find('input[name=tryCount]') as $element) {
      $tryCount = trim($element->value);
    }
  }

  if (($login_content->find('input[name=clickedSuggestion]', 0))) {
    foreach($login_content->find('input[name=clickedSuggestion]') as $element) {
      $clickedSuggestion = trim($element->value);
    }
  }

  if (($login_content->find('input[name=session_redirect]', 0))) {
    foreach($login_content->find('input[name=session_redirect]') as $element) {
      $session_redirect = trim($element->value);
    }
  }

  if (($login_content->find('input[name=trk]', 0))) {
    foreach($login_content->find('input[name=trk]') as $element) {
      $trk = trim($element->value);
    }
  }

  if (($login_content->find('input[name=loginCsrfParam]', 0))) {
    foreach($login_content->find('input[name=loginCsrfParam]') as $element) {
      $loginCsrfParam = trim($element->value);
    }
  }

  if (($login_content->find('input[name=fromEmail]', 0))) {
    foreach($login_content->find('input[name=fromEmail]') as $element) {
      $fromEmail = trim($element->value);
    }
  }

  if (($login_content->find('input[name=csrfToken]', 0))) {
    foreach($login_content->find('input[name=csrfToken]') as $element) {
      $csrfToken = trim($element->value);
    }
  }

  if (($login_content->find('input[name=sourceAlias]', 0))) {
    foreach($login_content->find('input[name=sourceAlias]') as $element) {
      $sourceAlias = trim($element->value);
    }
  }

}

curl_setopt($ch, CURLOPT_URL, "https://www.linkedin.com/uas/login-submit");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'isJsEnabled='.$isJsEnabled.'&source_app='.$source_app.'&tryCount='.$tryCount.'&clickedSuggestion='.$clickedSuggestion.'&session_key='.$username.'&session_password='.$password.'&session_redirect='.$session_redirect.'&trk='.$trk.'&loginCsrfParam='.$loginCsrfParam.'&fromEmail='.$fromEmail.'&csrfToken='.$csrfToken.'&sourceAlias='.$sourceAlias);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$store = curl_exec($ch);

curl_setopt($ch, CURLOPT_URL, 'https://www.linkedin.com/company/facebook');
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_POSTFIELDS, "");
$content = curl_exec($ch);
curl_close($ch);

echo $content;

?>

Answer 1:


Here is a solution for the login. To confirm that it works, save the response to a file and you will see that the login succeeded.

Instead of simple_html_dom, the code below uses a small fetch_value helper to pull the hidden form fields out of the page; you can still use simple_html_dom if you prefer.

<?php
function fetch_value($str, $find_start = '', $find_end = '')
{
    if ($find_start == '')
    {
        return '';
    }
    $start = strpos($str, $find_start);
    if ($start === false)
    {
        return '';
    }
    $length = strlen($find_start);
    $substr = substr($str, $start + $length);
    if ($find_end == '')
    {
        return $substr;
    }
    $end = strpos($substr, $find_end);
    if ($end === false)
    {
        return $substr;
    }
    return substr($substr, 0, $end);
}

$linkedin_login_page = "https://www.linkedin.com/uas/login";
$linkedin_ref = "https://www.linkedin.com";

$username = 'username';
$password = 'password';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $linkedin_login_page);
curl_setopt($ch, CURLOPT_REFERER, $linkedin_ref);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7)');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');


$login_content = curl_exec($ch);


if(curl_error($ch)) {
  echo 'error:' . curl_error($ch);
}

// Static form fields that the login form submits.
$var = array(
    'isJsEnabled' => 'false',
    'source_app' => '',
    'clickedSuggestion' => 'false',
    'session_key' => trim($username),
    'session_password' => trim($password),
    'signin' => 'Sign In',
    'session_redirect' => '',
    'trk' => '',
    'fromEmail' => '',
);

// Hidden fields, including the CSRF tokens, are read from the login page fetched above.
$var['loginCsrfParam'] = fetch_value($login_content, 'type="hidden" name="loginCsrfParam" value="', '"');
$var['csrfToken'] = fetch_value($login_content, 'type="hidden" name="csrfToken" value="', '"');
$var['sourceAlias'] = fetch_value($login_content, 'input type="hidden" name="sourceAlias" value="', '"');

// URL-encode every key and value before joining them into the POST body.
$post_array = array();
foreach ($var as $key => $value)
{
    $post_array[] = urlencode($key) . '=' . urlencode($value);
}
$post_string = implode('&', $post_array);

curl_setopt($ch, CURLOPT_URL, "https://www.linkedin.com/uas/login-submit");
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);

$store = curl_exec($ch);


if (stripos($store, "session_password-login-error") !== false) {
    $err = trim(strip_tags(fetch_value($store, '<span class="error" id="session_password-login-error">', '</span>')));
    echo "Login error : " . $err;
} elseif (stripos($store, 'profile-nav-item') !== false) {
    curl_setopt($ch, CURLOPT_URL, 'https://www.linkedin.com/company-beta/10667/?pathWildcard=10667');
    curl_setopt($ch, CURLOPT_POST, false);
    curl_setopt($ch, CURLOPT_POSTFIELDS, "");
    $content = curl_exec($ch);
    curl_close($ch);

    echo $content;
} else {
    echo "unknown error";
}


?>

You will notice that the company page itself doesn't load: LinkedIn has just changed its design and its company page links, so that it can keep track of which company pages are opened.
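
For anyone who would rather not rely on exact string matching for the hidden fields, here is a minimal sketch that uses PHP's bundled DOM extension to collect every hidden input and then builds the POST body with http_build_query(). The helper name extract_hidden_fields() and the sample form fragment are assumptions made for illustration; they are not part of the code above, and LinkedIn's real form may look different.

```php
<?php
// Sketch only: collect every hidden <input> of a page into a name => value array.
// extract_hidden_fields() is a hypothetical helper, not part of the answer's code.
function extract_hidden_fields(string $html): array
{
    $fields = [];
    if (trim($html) === '') {
        return $fields; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = trim($input->getAttribute('value'));
    }
    return $fields;
}

// Made-up form fragment; the token values are placeholders.
$sample = '<form><input type="hidden" name="loginCsrfParam" value="abc-123">'
        . '<input type="hidden" name="csrfToken" value="ajax:42"></form>';

// Add the credentials to whatever hidden fields the page supplied.
$post = extract_hidden_fields($sample) + [
    'session_key'      => 'user@example.com',
    'session_password' => 'secret',
];

// http_build_query() URL-encodes keys and values, like the urlencode() loop above.
$post_string = http_build_query($post);
echo $post_string, "\n";
```

The resulting $post_string can be passed to CURLOPT_POSTFIELDS in the same way as in the code above. Collecting all hidden inputs means the script keeps working if the form gains or renames a hidden field, whereas fetch_value depends on the exact attribute order in the markup.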




Answer 2:


Instead of trying to script the login, just log in with your browser and copy the session cookie into your cURL script. This makes LinkedIn think the requests come from you in your web browser. Some web servers are smart enough to check the other request headers, such as the browser type (User-Agent), and reject the request if they don't match, so make sure your cURL script sends the same headers as the browser you logged in with. Let me know if you need me to explain how to do this.
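
A rough sketch of that approach, assuming you have copied the Cookie and User-Agent request headers from your browser's developer tools. The function name build_session_options() and all values passed to it are placeholders for illustration, and the Accept-Language line only shows where further browser headers would go.

```php
<?php
// Sketch only: build cURL options that replay an existing browser session.
// build_session_options() and its parameter names are hypothetical.
function build_session_options(string $url, string $cookieHeader, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Value of the Cookie request header, pasted exactly as the browser sends it.
        CURLOPT_COOKIE         => $cookieHeader,
        // Use the same User-Agent as the browser that created the session.
        CURLOPT_USERAGENT      => $userAgent,
        // Example of another header to copy from the browser (placeholder value).
        CURLOPT_HTTPHEADER     => ['Accept-Language: en-US,en;q=0.8'],
    ];
}

$options = build_session_options(
    'https://www.linkedin.com/company/facebook',
    'name1=value1; name2=value2',           // placeholder, copy from the browser
    'Mozilla/5.0 (placeholder user agent)'  // placeholder, copy from the browser
);

// To send the request:
// $ch = curl_init();
// curl_setopt_array($ch, $options);
// $content = curl_exec($ch);
// curl_close($ch);
```

The session only lasts as long as the browser cookie stays valid, so the cookie string has to be copied again after you log out or the session expires.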



Source: https://stackoverflow.com/questions/42329819/how-can-i-scrape-linkedin-company-pages-with-curl-and-php-no-csrf-token-found-i
