web-crawler

automatic crawling using selenium

Submitted by 不羁的心 on 2019-12-11 01:19:23
Question: from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC OUTPUT_FILE_NAME = 'output0.txt' driver = webdriver.Chrome() wait = WebDriverWait(driver, 10) def get_text(): driver.get("http://law.go.kr/precSc.do?tabMenuId=tab67") elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#viewHeightDiv > table > tbody > " "tr:nth-child(1) > td.s_tit >
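The question's full CSS selector is cut off in the excerpt above, so the sketch below stops at the td.s_tit cell; apart from that it is a minimal, runnable version of the same wait-then-read pattern, not the asker's complete script.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

def get_text():
    driver.get("http://law.go.kr/precSc.do?tabMenuId=tab67")
    # Wait until the first result row's title cell is visible, then read it.
    elem = wait.until(EC.visibility_of_element_located((
        By.CSS_SELECTOR,
        "#viewHeightDiv > table > tbody > tr:nth-child(1) > td.s_tit")))
    return elem.text

if __name__ == "__main__":
    print(get_text())
    driver.quit()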

WebClient download string is different than WebBrowser View source

Submitted by 主宰稳场 on 2019-12-10 21:05:40
Question: I am creating a C# 4.0 application to download webpage content using WebClient. WebClient function: public static string GetDocText(string url) { string html = string.Empty; try { using (ConfigurableWebClient client = new ConfigurableWebClient()) { /* Set timeout for webclient */ client.Timeout = 600000; /* Build url */ Uri innUri = null; if (!url.StartsWith("http://")) url = "http://" + url; Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out innUri); try { client.Headers.Add("User-Agent",
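The usual causes of WebClient seeing different HTML than the browser's View Source are request headers (User-Agent, Accept-Language, cookies) and JavaScript-rendered content. A quick way to check the header theory, sketched in Python with the requests library (an illustration of the idea, not the asker's C# stack; the URL is a placeholder):

import requests

URL = "http://example.com/"  # placeholder, not the asker's actual URL

plain = requests.get(URL, timeout=30).text
browser_like = requests.get(
    URL,
    timeout=30,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US,en;q=0.9",
    },
).text

print(len(plain), len(browser_like))
# If the content still differs from the browser's View Source even with
# browser-like headers, the page is probably assembled by JavaScript,
# which neither WebClient nor requests executes.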

How can I scrape LinkedIn company pages with cURL and PHP? No CSRF token found in headers error

Submitted by 送分小仙女□ on 2019-12-10 20:56:39
Question: I want to scrape some LinkedIn company pages with cURL and PHP. The LinkedIn API is not built for that, so I have to do this with PHP. If there are any other options, please let me know... Before scraping the company page I have to sign in to LinkedIn with a personal account via cURL, but it doesn't seem to work. I get a 'No CSRF token found in headers' error. Could someone help me out? Thanks! <?php require_once 'dom/simple_html_dom.php'; $linkedin_login_page = "https://www.linkedin
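The generic fix for this class of error is to fetch the login form first, read its hidden CSRF field, and send that value (within the same cookie session) in the login POST. A sketch of the pattern in Python/requests; the URLs and field names are hypothetical, not LinkedIn's actual ones, and the asker's stack is PHP/cURL:

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps cookies between the GET and the POST

login_page = session.get("https://example.com/login")  # placeholder URL
soup = BeautifulSoup(login_page.text, "html.parser")

# Read the hidden CSRF field exactly as the login form names it.
token_input = soup.find("input", {"name": "csrfToken"})  # hypothetical field name
payload = {
    "username": "me@example.com",
    "password": "secret",
    "csrfToken": token_input["value"] if token_input else "",
}

resp = session.post("https://example.com/login-submit", data=payload)
print(resp.status_code)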

scrapy, how to separate text within a HTML tag element

Submitted by 久未见 on 2019-12-10 20:14:08
Question: Code containing my data: <div id="content"><!-- InstanceBeginEditable name="EditRegion3" --> <div id="content_div"> <div class="title" id="content_title_div"><img src="img/banner_outlets.jpg" width="920" height="157" alt="Outlets" /></div> <div id="menu_list"> <table border="0" cellpadding="5" cellspacing="5" width="100%"> <tbody> <tr> <td valign="top"> <p> <span class="foodTitle">Century Square</span><br /> 2 Tampines Central 5<br /> #01-44-47 Century Square<br /> Singapore 529509</p> <p>
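Because each address line in that markup is a separate text node between <br /> tags, selecting text() nodes splits the pieces without any string surgery. A minimal sketch with a scrapy Selector over a fragment of the HTML above:

from scrapy.selector import Selector

html = """
<p>
  <span class="foodTitle">Century Square</span><br />
  2 Tampines Central 5<br />
  #01-44-47 Century Square<br />
  Singapore 529509
</p>
"""

sel = Selector(text=html)
title = sel.xpath('//p/span[@class="foodTitle"]/text()').get()
# text() returns each text node separately, so the <br />-separated lines
# come back as individual strings.
lines = [t.strip() for t in sel.xpath('//p/text()').getall() if t.strip()]

print(title)  # Century Square
print(lines)  # ['2 Tampines Central 5', '#01-44-47 Century Square', 'Singapore 529509']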

How to get meta description content using Goutte

Submitted by 狂风中的少年 on 2019-12-10 19:28:54
Question: Can you please help me find a way to get the content of the meta description, meta keywords, and robots tags using Goutte. Also, how can I target <link rel="stylesheet" href=""> and <script>? Below is the PHP I used to get the <title> content: require_once 'goutte.phar'; use Goutte\Client; $client = new Client(); $crawler = $client->request('GET', 'http://stackoverflow.com/'); $crawler->filter('title')->each(function ($node) { $content .= "Title: ".$node->text().""; echo $content; }); Here is
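For meta tags the value lives in the content attribute rather than in the node text, and stylesheets and scripts are reached through link[rel="stylesheet"] and script[src]. The targets are shown below with Python/BeautifulSoup purely to make the selectors concrete (the asker's stack is PHP/Goutte, where the crawler's filter() call and attribute accessors play the same role):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://stackoverflow.com/").text, "html.parser")

for name in ("description", "keywords", "robots"):
    tag = soup.find("meta", attrs={"name": name})
    # The value is in the content attribute, not in the element's text.
    print(name, "=>", tag["content"] if tag and tag.has_attr("content") else None)

stylesheets = [l.get("href") for l in soup.find_all("link")
               if "stylesheet" in (l.get("rel") or [])]
scripts = [s.get("src") for s in soup.find_all("script") if s.get("src")]
print(stylesheets[:3], scripts[:3])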

How to detect web crawlers for SEO, using Express?

Submitted by …衆ロ難τιáo~ on 2019-12-10 19:10:24
Question: I've been searching for npm packages, but they all seem unmaintained and rely on outdated user-agent databases. Is there a reliable and up-to-date package out there that helps me detect crawlers? (mostly from Google, Facebook, ... for SEO) Or, if there are no packages, can I write it myself? (probably based on an up-to-date user-agent database) To be clearer, I'm trying to make an isomorphic/universal React website and I want it to be indexed by search engines and its title/meta data can be
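When no maintained package fits, the common fallback is a user-agent check against a short allowlist of the crawlers that matter for SEO. Sketched here in Python for illustration (the asker's stack is Express, where the same test would sit in a middleware); the bot tokens listed are the commonly documented ones and any real deployment should keep its own list current:

import re

BOT_PATTERN = re.compile(
    r"googlebot|bingbot|yandex|duckduckbot|baiduspider|"
    r"facebookexternalhit|twitterbot|linkedinbot|slackbot",
    re.IGNORECASE,
)

def is_crawler(user_agent: str) -> bool:
    # True if the User-Agent header matches any known crawler token.
    return bool(user_agent) and bool(BOT_PATTERN.search(user_agent))

print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"))                    # False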

How to get JavaScript object in JavaScript code?

Submitted by 泪湿孤枕 on 2019-12-10 17:10:31
Question: TL;DR I want a parseParameter function that parses JSON as in the following code, where someCrawledJSCode is crawled JavaScript code. const data = parseParameter(someCrawledJSCode); console.log(data); // data1: {...} Problem: I'm crawling some JavaScript code with puppeteer and I want to extract a JSON object from it, but I don't know how to parse the given JavaScript code. Crawled JavaScript code example: const somecode = 'somevalue'; arr.push({ data1: { prices: [{ prop1: 'hi', prop2: 'hello', }, { prop1: 'foo'
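One language-agnostic approach: locate the arr.push( call, brace-match to the end of the object literal, and hand that literal to a JSON5 parser, since JS object literals (single quotes, unquoted keys, trailing commas) are usually valid JSON5 even when they are not valid JSON. A sketch in Python, assuming the third-party json5 package; it ignores the complication of braces inside string values:

import json5  # pip install json5 (assumption: a third-party package, not stdlib)

crawled = """
const somecode = 'somevalue';
arr.push({
  data1: {
    prices: [{ prop1: 'hi', prop2: 'hello', }, { prop1: 'foo' }],
  },
});
"""

def extract_pushed_object(js_code):
    start = js_code.index("arr.push(") + len("arr.push(")
    depth = 0
    for i in range(start, len(js_code)):
        if js_code[i] == "{":
            depth += 1
        elif js_code[i] == "}":
            depth -= 1
            if depth == 0:
                # Slice out the object literal and parse it leniently.
                return json5.loads(js_code[start:i + 1])
    raise ValueError("unbalanced braces in crawled code")

data = extract_pushed_object(crawled)
print(data["data1"]["prices"][0]["prop1"])  # hi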

Python, Selenium : 'Element is no longer attached to the DOM'

Submitted by 北城以北 on 2019-12-10 16:49:06
Question: I am scraping a website, www.lipperleaders.com. I want to extract the fund details for Singapore. I have successfully implemented the drop-down selection and extracted the content of the first page that appears after the options are submitted. But when I try to go to the next pages (by making the code click the next button) I get the error 'Element is no longer attached to the DOM'. My code is about 100 lines, but I can give a general idea of its flow of execution: ... # creating driver
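'Element is no longer attached to the DOM' (a stale element reference) usually means WebElement handles found before the click are being reused after the page or table has re-rendered; the fix is to re-locate elements on every iteration and wait for the old content to go stale before reading the new page. A sketch of that loop; the row and next-button locators are placeholders, not lipperleaders.com's real ones:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("http://www.lipperleaders.com")
# ... drop-down selection and submission would happen here ...

while True:
    # Re-locate the rows freshly on every pass; never reuse the old list.
    rows = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tr")))
    for row in rows:
        print(row.text)

    next_buttons = driver.find_elements(By.LINK_TEXT, "Next")  # placeholder locator
    if not next_buttons:
        break
    next_buttons[0].click()
    # Wait for the old table to be torn down before touching the new one.
    wait.until(EC.staleness_of(rows[0]))

driver.quit()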

Is there a way or a tool to automatically visit all pages of my site

Submitted by 守給你的承諾、 on 2019-12-10 16:18:17
Question: I want to automatically visit / crawl all the pages on my site in order to generate a cache file. Is there any way or tool to do this? Answer 1: Just use any robot that downloads your entire site: https://superuser.com/questions/14403/how-can-i-download-an-entire-website For example wget: wget -r --no-parent http://site.com/songs/ Answer 2: You can use wget's recursive option to do this. Change example.com to your domain: wget --recursive --no-parent --domains=example.com --level=inf --delete-after
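If wget is not an option, the same cache-warming walk is a short same-domain crawl in Python (a sketch assuming the requests and beautifulsoup4 packages; http://site.com/ is the placeholder from the answer above):

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START = "http://site.com/"          # placeholder; replace with your own site
HOST = urlparse(START).netloc

seen, queue = {START}, deque([START])
while queue:
    url = queue.popleft()
    resp = requests.get(url, timeout=30)
    print(resp.status_code, url)
    if "text/html" not in resp.headers.get("Content-Type", ""):
        continue
    # Queue every same-domain link that has not been visited yet.
    for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == HOST and link not in seen:
            seen.add(link)
            queue.append(link)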

BOT/Spider Trap Ideas

Submitted by 血红的双手。 on 2019-12-10 14:59:42
Question: I have a client whose domain seems to be getting hit pretty hard by what appears to be a DDoS. In the logs it's normal-looking user agents with random IPs, but they're flipping through pages too fast to be human. They also don't appear to be requesting any images. I can't seem to find any pattern, and my suspicion is that it's a fleet of Windows zombies. The client had issues in the past with spam attacks--they even had to point MX at Postini to get the 6.7 GB/day of junk to stop server-side. I want to
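Two of the standard trap building blocks are a honeypot URL that no human should ever request (linked invisibly and disallowed in robots.txt) and a per-IP request-rate check, since "too fast to be human" is the one pattern the logs do show. A framework-agnostic sketch in Python; the path, window, and threshold are illustrative values, not recommendations:

import time
from collections import defaultdict, deque

HONEYPOT_PATH = "/do-not-follow/"    # hypothetical trap URL, hidden from humans
WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 20

_hits = defaultdict(deque)           # ip -> timestamps of recent requests
_blocked = set()

def should_block(ip, path):
    if ip in _blocked:
        return True
    if path.startswith(HONEYPOT_PATH):        # only crawlers follow this link
        _blocked.add(ip)
        return True
    now = time.time()
    hits = _hits[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) > MAX_REQUESTS_PER_WINDOW:   # flipping through pages too fast
        _blocked.add(ip)
        return True
    return False

print(should_block("203.0.113.7", "/songs/"))           # False
print(should_block("203.0.113.7", "/do-not-follow/x"))  # True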