web-crawler

automatic crawling using selenium

Submitted by 不羁的心 on 2019-12-11 01:19:23
Question: from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC OUTPUT_FILE_NAME = 'output0.txt' driver = webdriver.Chrome() wait = WebDriverWait(driver, 10) def get_text(): driver.get("http://law.go.kr/precSc.do?tabMenuId=tab67") elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#viewHeightDiv > table > tbody > " "tr:nth-child(1) > td.s_tit >
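The question's full CSS selector is cut off in the excerpt above, so the sketch below stops at the td.s_tit cell; apart from that it is a minimal, runnable version of the same wait-then-read pattern, not the asker's complete script.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

def get_text():
    driver.get("http://law.go.kr/precSc.do?tabMenuId=tab67")
    # Wait until the first result row's title cell is visible, then read it.
    elem = wait.until(EC.visibility_of_element_located((
        By.CSS_SELECTOR,
        "#viewHeightDiv > table > tbody > tr:nth-child(1) > td.s_tit")))
    return elem.text

if __name__ == "__main__":
    print(get_text())
    driver.quit()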

WebClient download string is different than WebBrowser View source

Submitted by 主宰稳场 on 2019-12-10 21:05:40
Question: I am creating a C# 4.0 application to download webpage content using WebClient. WebClient function: public static string GetDocText(string url) { string html = string.Empty; try { using (ConfigurableWebClient client = new ConfigurableWebClient()) { /* Set timeout for webclient */ client.Timeout = 600000; /* Build url */ Uri innUri = null; if (!url.StartsWith("http://")) url = "http://" + url; Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out innUri); try { client.Headers.Add("User-Agent",
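The usual causes of WebClient seeing different HTML than the browser's View Source are request headers (User-Agent, Accept-Language, cookies) and JavaScript-rendered content. A quick way to check the header theory, sketched in Python with the requests library (an illustration of the idea, not the asker's C# stack; the URL is a placeholder):

import requests

URL = "http://example.com/"  # placeholder, not the asker's actual URL

plain = requests.get(URL, timeout=30).text
browser_like = requests.get(
    URL,
    timeout=30,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US,en;q=0.9",
    },
).text

print(len(plain), len(browser_like))
# If the content still differs from the browser's View Source even with
# browser-like headers, the page is probably assembled by JavaScript,
# which neither WebClient nor requests executes.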

How can I scrape LinkedIn company pages with cURL and PHP? No CSRF token found in headers error

Submitted by 送分小仙女□ on 2019-12-10 20:56:39
Question: I want to scrape some LinkedIn company pages with cURL and PHP. The LinkedIn API is not built for that, so I have to do this with PHP. If there are any other options, please let me know... Before scraping the company page I have to sign in to LinkedIn with a personal account via cURL, but it doesn't seem to work. I get a 'No CSRF token found in headers' error. Could someone help me out? Thanks! <?php require_once 'dom/simple_html_dom.php'; $linkedin_login_page = "https://www.linkedin
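The generic fix for this class of error is to fetch the login form first, read its hidden CSRF field, and send that value (within the same cookie session) in the login POST. A sketch of the pattern in Python/requests; the URLs and field names are hypothetical, not LinkedIn's actual ones, and the asker's stack is PHP/cURL:

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps cookies between the GET and the POST

login_page = session.get("https://example.com/login")  # placeholder URL
soup = BeautifulSoup(login_page.text, "html.parser")

# Read the hidden CSRF field exactly as the login form names it.
token_input = soup.find("input", {"name": "csrfToken"})  # hypothetical field name
payload = {
    "username": "me@example.com",
    "password": "secret",
    "csrfToken": token_input["value"] if token_input else "",
}

resp = session.post("https://example.com/login-submit", data=payload)
print(resp.status_code)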

scrapy, how to separate text within a HTML tag element

Submitted by 久未见 on 2019-12-10 20:14:08
Question: Code containing my data: <div id="content"><!-- InstanceBeginEditable name="EditRegion3" --> <div id="content_div"> <div class="title" id="content_title_div"><img src="img/banner_outlets.jpg" width="920" height="157" alt="Outlets" /></div> <div id="menu_list"> <table border="0" cellpadding="5" cellspacing="5" width="100%"> <tbody> <tr> <td valign="top"> <p> <span class="foodTitle">Century Square</span><br /> 2 Tampines Central 5<br /> #01-44-47 Century Square<br /> Singapore 529509</p> <p>
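Because each address line in that markup is a separate text node between <br /> tags, selecting text() nodes splits the pieces without any string surgery. A minimal sketch with a scrapy Selector over a fragment of the HTML above:

from scrapy.selector import Selector

html = """
<p>
  <span class="foodTitle">Century Square</span><br />
  2 Tampines Central 5<br />
  #01-44-47 Century Square<br />
  Singapore 529509
</p>
"""

sel = Selector(text=html)
title = sel.xpath('//p/span[@class="foodTitle"]/text()').get()
# text() returns each text node separately, so the <br />-separated lines
# come back as individual strings.
lines = [t.strip() for t in sel.xpath('//p/text()').getall() if t.strip()]

print(title)  # Century Square
print(lines)  # ['2 Tampines Central 5', '#01-44-47 Century Square', 'Singapore 529509']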

How to get meta description content using Goutte

Submitted by 狂风中的少年 on 2019-12-10 19:28:54
Question: Can you please help me find a way to get the content of the meta description, meta keywords, and robots tags using Goutte. Also, how can I target <link rel="stylesheet" href=""> and <script>? Below is the PHP I used to get the <title> content: require_once 'goutte.phar'; use Goutte\Client; $client = new Client(); $crawler = $client->request('GET', 'http://stackoverflow.com/'); $crawler->filter('title')->each(function ($node) { $content .= "Title: ".$node->text().""; echo $content; }); Here is
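For meta tags the value lives in the content attribute rather than in the node text, and stylesheets and scripts are reached through link[rel="stylesheet"] and script[src]. The targets are shown below with Python/BeautifulSoup purely to make the selectors concrete (the asker's stack is PHP/Goutte, where the crawler's filter() call and attribute accessors play the same role):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://stackoverflow.com/").text, "html.parser")

for name in ("description", "keywords", "robots"):
    tag = soup.find("meta", attrs={"name": name})
    # The value is in the content attribute, not in the element's text.
    print(name, "=>", tag["content"] if tag and tag.has_attr("content") else None)

stylesheets = [l.get("href") for l in soup.find_all("link")
               if "stylesheet" in (l.get("rel") or [])]
scripts = [s.get("src") for s in soup.find_all("script") if s.get("src")]
print(stylesheets[:3], scripts[:3])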

How to detect web crawlers for SEO, using Express?

Submitted by …衆ロ難τιáo~ on 2019-12-10 19:10:24
Question: I've been searching for npm packages, but they all seem unmaintained and rely on outdated user-agent databases. Is there a reliable and up-to-date package out there that helps me detect crawlers? (mostly from Google, Facebook, ... for SEO) Or, if there are no packages, can I write it myself? (probably based on an up-to-date user-agent database) To be clearer, I'm trying to make an isomorphic/universal React website and I want it to be indexed by search engines and its title/meta data can be
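When no maintained package fits, the common fallback is a user-agent check against a short allowlist of the crawlers that matter for SEO. Sketched here in Python for illustration (the asker's stack is Express, where the same test would sit in a middleware); the bot tokens listed are the commonly documented ones and any real deployment should keep its own list current:

import re

BOT_PATTERN = re.compile(
    r"googlebot|bingbot|yandex|duckduckbot|baiduspider|"
    r"facebookexternalhit|twitterbot|linkedinbot|slackbot",
    re.IGNORECASE,
)

def is_crawler(user_agent: str) -> bool:
    # True if the User-Agent header matches any known crawler token.
    return bool(user_agent) and bool(BOT_PATTERN.search(user_agent))

print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"))                    # False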

How to get JavaScript object in JavaScript code?

Submitted by 泪湿孤枕 on 2019-12-10 17:10:31
Question: TL;DR I want a parseParameter function that parses JSON as in the following code, where someCrawledJSCode is crawled JavaScript code. const data = parseParameter(someCrawledJSCode); console.log(data); // data1: {...} Problem: I'm crawling some JavaScript code with puppeteer and I want to extract a JSON object from it, but I don't know how to parse the given JavaScript code. Crawled JavaScript code example: const somecode = 'somevalue'; arr.push({ data1: { prices: [{ prop1: 'hi', prop2: 'hello', }, { prop1: 'foo'
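One language-agnostic approach: locate the arr.push( call, brace-match to the end of the object literal, and hand that literal to a JSON5 parser, since JS object literals (single quotes, unquoted keys, trailing commas) are usually valid JSON5 even when they are not valid JSON. A sketch in Python, assuming the third-party json5 package; it ignores the complication of braces inside string values:

import json5  # pip install json5 (assumption: a third-party package, not stdlib)

crawled = """
const somecode = 'somevalue';
arr.push({
  data1: {
    prices: [{ prop1: 'hi', prop2: 'hello', }, { prop1: 'foo' }],
  },
});
"""

def extract_pushed_object(js_code):
    start = js_code.index("arr.push(") + len("arr.push(")
    depth = 0
    for i in range(start, len(js_code)):
        if js_code[i] == "{":
            depth += 1
        elif js_code[i] == "}":
            depth -= 1
            if depth == 0:
                # Slice out the object literal and parse it leniently.
                return json5.loads(js_code[start:i + 1])
    raise ValueError("unbalanced braces in crawled code")

data = extract_pushed_object(crawled)
print(data["data1"]["prices"][0]["prop1"])  # hi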

Python, Selenium : 'Element is no longer attached to the DOM'

Submitted by 北城以北 on 2019-12-10 16:49:06
Question: I am scraping a website, www.lipperleaders.com. I want to extract the fund details for Singapore. I have successfully implemented the drop-down selection and extracted the content of the first page that appears after the options are submitted. But when I try to go to the next pages (by making the code click the next button) I get the error 'Element is no longer attached to the DOM'. My code is about 100 lines, but I can give a general idea of its flow of execution: ... # creating driver
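'Element is no longer attached to the DOM' (a stale element reference) usually means WebElement handles found before the click are being reused after the page or table has re-rendered; the fix is to re-locate elements on every iteration and wait for the old content to go stale before reading the new page. A sketch of that loop; the row and next-button locators are placeholders, not lipperleaders.com's real ones:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("http://www.lipperleaders.com")
# ... drop-down selection and submission would happen here ...

while True:
    # Re-locate the rows freshly on every pass; never reuse the old list.
    rows = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tr")))
    for row in rows:
        print(row.text)

    next_buttons = driver.find_elements(By.LINK_TEXT, "Next")  # placeholder locator
    if not next_buttons:
        break
    next_buttons[0].click()
    # Wait for the old table to be torn down before touching the new one.
    wait.until(EC.staleness_of(rows[0]))

driver.quit()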

Is there a way or a tool to automatically visit all pages of my site

Submitted by 守給你的承諾、 on 2019-12-10 16:18:17
Question: I want to automatically visit / crawl all the pages on my site in order to generate a cache file. Is there any way or tool to do this? Answer 1: Just use any robot that downloads your entire site: https://superuser.com/questions/14403/how-can-i-download-an-entire-website For example wget: wget -r --no-parent http://site.com/songs/ Answer 2: You can use wget's recursive option to do this. Change example.com to your domain: wget --recursive --no-parent --domains=example.com --level=inf --delete-after
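If wget is not an option, the same cache-warming walk is a short same-domain crawl in Python (a sketch assuming the requests and beautifulsoup4 packages; http://site.com/ is the placeholder from the answer above):

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START = "http://site.com/"          # placeholder; replace with your own site
HOST = urlparse(START).netloc

seen, queue = {START}, deque([START])
while queue:
    url = queue.popleft()
    resp = requests.get(url, timeout=30)
    print(resp.status_code, url)
    if "text/html" not in resp.headers.get("Content-Type", ""):
        continue
    # Queue every same-domain link that has not been visited yet.
    for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == HOST and link not in seen:
            seen.add(link)
            queue.append(link)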

BOT/Spider Trap Ideas

Submitted by 血红的双手。 on 2019-12-10 14:59:42
Question: I have a client whose domain seems to be getting hit pretty hard by what appears to be a DDoS. In the logs it's normal-looking user agents with random IPs, but they're flipping through pages too fast to be human. They also don't appear to be requesting any images. I can't seem to find any pattern, and my suspicion is that it's a fleet of Windows zombies. The client had issues in the past with spam attacks--they even had to point MX at Postini to get the 6.7 GB/day of junk to stop server-side. I want to
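Two of the standard trap building blocks are a honeypot URL that no human should ever request (linked invisibly and disallowed in robots.txt) and a per-IP request-rate check, since "too fast to be human" is the one pattern the logs do show. A framework-agnostic sketch in Python; the path, window, and threshold are illustrative values, not recommendations:

import time
from collections import defaultdict, deque

HONEYPOT_PATH = "/do-not-follow/"    # hypothetical trap URL, hidden from humans
WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 20

_hits = defaultdict(deque)           # ip -> timestamps of recent requests
_blocked = set()

def should_block(ip, path):
    if ip in _blocked:
        return True
    if path.startswith(HONEYPOT_PATH):        # only crawlers follow this link
        _blocked.add(ip)
        return True
    now = time.time()
    hits = _hits[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) > MAX_REQUESTS_PER_WINDOW:   # flipping through pages too fast
        _blocked.add(ip)
        return True
    return False

print(should_block("203.0.113.7", "/songs/"))           # False
print(should_block("203.0.113.7", "/do-not-follow/x"))  # True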