web-crawler

Getting the list of likers for an Instagram post - Python & Selenium

与世无争的帅哥 submitted on 2019-12-11 05:09:17
Question: I'm learning web crawling. As a challenge, I set out to get the list of all the people who have liked a post on Instagram. My problem is that I'm stuck at the point where I only get the first 11 likers' usernames; I cannot find the right way to automate the scrolling process while collecting the likes. Here is my process in a Jupyter Notebook (it doesn't work as a script yet):

    from selenium import webdriver
    import pandas as pd

    driver = webdriver.Chrome()
    driver.get('https://www.instagram.com
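
A minimal sketch of the missing scrolling step, assuming the likers dialog is already open: scroll the dialog's inner container (not the page) until its height stops growing, collecting usernames along the way. The CSS selectors, the post URL, and the sleep interval are assumptions; Instagram's markup changes frequently, so they will likely need adjusting.

    import time
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('https://www.instagram.com/p/POST_ID/')  # hypothetical post URL
    # ... log in and click the likes count to open the dialog first ...

    dialog = driver.find_element_by_css_selector('div[role="dialog"]')  # assumed selector
    scroll_box = dialog.find_element_by_css_selector('div > div')       # assumed inner scrollable div

    usernames = set()
    last_height = -1
    while True:
        # Scroll the dialog itself to trigger lazy loading of the next batch.
        height = driver.execute_script(
            "arguments[0].scrollTo(0, arguments[0].scrollHeight);"
            "return arguments[0].scrollHeight;", scroll_box)
        time.sleep(1.5)  # give the next batch of likers time to load
        for link in scroll_box.find_elements_by_css_selector('a'):
            if link.text:
                usernames.add(link.text)
        if height == last_height:  # height stopped growing -> end of the list
            break
        last_height = height

    print(len(usernames), 'likers collected')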

Why is the Bing crawler not fetching the dynamic content of my Angular web page?

白昼怎懂夜的黑 submitted on 2019-12-11 05:09:14
Question: I have my SPA website (based on Node/Express/Mongo/Angular X) up and running. I created a sitemap.xml and submitted it to Microsoft Bing, and from the server log I can see they have started crawling. However, I noticed that the page URLs are requested but the associated API calls for those pages are not, so Bing is basically indexing only the static skeleton of each page, not the real dynamic content. I googled this and found people saying "Google can't index dynamic content", as suggested in this article. However, I also see other
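
The usual workaround when a crawler does not execute JavaScript is "dynamic rendering": serve bots a prerendered HTML snapshot while browsers get the JS app. Below is a minimal sketch of the idea, in Python/Flask for brevity (the question's stack is Express, but the routing logic is the same); the user-agent list and file paths are illustrative assumptions.

    from flask import Flask, request, send_file

    app = Flask(__name__)
    BOT_AGENTS = ('bingbot', 'googlebot', 'duckduckbot')  # illustrative list

    @app.route('/')
    def index():
        agent = request.headers.get('User-Agent', '').lower()
        if any(bot in agent for bot in BOT_AGENTS):
            # Prerendered snapshot with the API data already baked into the HTML.
            return send_file('snapshots/index.html')
        return send_file('app/index.html')  # the Angular skeleton + JS bundle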

Finding links with a .pdf extension

北战南征 submitted on 2019-12-11 04:37:22
Question: I need to get the links that have a .pdf extension. My code is:

    <?php
    set_time_limit(0);
    $ch = curl_init(); // this call was missing from the original snippet, leaving $ch undefined
    curl_setopt($ch, CURLOPT_URL, "http://example.com");
    curl_setopt($ch, CURLOPT_TIMEOUT, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    preg_match_all('/<a href="(http:\/\/www.[^0-9].+?)"/', $result, $output, PREG_SET_ORDER);
    // read all links
    foreach ($output as $item) {
        $n = strlen($item);
        $m = $n - 3;
        $buffer_n = $item;
        $buffer_m = "";
        $buffer_m = $buffer_n[$m].$buffer_n[$m+1].
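
Rather than slicing the last three characters off each match by hand, it is simpler to test whether the href ends in ".pdf". A hedged sketch of that approach in Python (the target URL is a placeholder):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get('http://example.com').text
    soup = BeautifulSoup(html, 'html.parser')

    # Keep only anchors whose href ends in ".pdf" (case-insensitive).
    pdf_links = [a['href'] for a in soup.find_all('a', href=True)
                 if a['href'].lower().endswith('.pdf')]
    print(pdf_links)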

Why can't I play the MIDI files I have downloaded programmatically, but I can play them when I download them manually?

試著忘記壹切 submitted on 2019-12-11 04:27:55
Question: I want to download the MIDI files from this website for a project. I have written the following code to download the files:

    from bs4 import BeautifulSoup
    import requests
    import re, os
    import urllib.request
    import string

    base_url = "http://www.midiworld.com/files/"
    base_path = 'path/where/I/will/save/the/downloaded/MIDI/files'
    os.chdir(base_path + '/MIDI Files')

    for i in range(1, 2386):
        page = requests.get(base_url + str(i))
        soup = BeautifulSoup(page.text, "html.parser")
        li_box = soup.select(
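
A common reason programmatically downloaded MIDI files will not play is that the script saves the response as text, or saves an HTML page instead of the file itself. A minimal sketch that writes the raw bytes in binary mode, assuming each page contains a direct link whose anchor text is "download" -- the selector and anchor text are assumptions to verify against the actual page:

    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    base_url = "http://www.midiworld.com/files/"
    out_dir = "midi_files"
    os.makedirs(out_dir, exist_ok=True)

    for i in range(1, 4):  # small range for demonstration
        page = requests.get(base_url + str(i))
        soup = BeautifulSoup(page.text, "html.parser")
        link = soup.find("a", string="download")  # assumed anchor text
        if link is None:
            continue
        midi = requests.get(urljoin(base_url, link["href"]))  # resolve relative hrefs
        with open(os.path.join(out_dir, f"{i}.mid"), "wb") as f:  # binary mode
            f.write(midi.content)  # raw bytes, not midi.text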

Apache Nutch 2.1 - How to get the complete source code

给你一囗甜甜゛ submitted on 2019-12-11 04:23:24
Question: I am trying to write my own Nutch plugin for crawling webpages. The problem is that I need to identify whether a particular special tag, e.g. <foo>, is present on the webpage. There is a note in the official documentation that this should be possible using Document.getElementsByTagName("foo"), but it is not working for me. Do you have any idea? My second question: once I have identified the tag above, I would like to get some other tags from the webpage where that tag was identified... is there any way to store complete source
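
For comparison, the check the question describes -- test whether a given tag is present, then collect other tags from the same page -- looks like this in Python with BeautifulSoup. The tag names reuse the placeholder "foo" from the question, plus a hypothetical "bar":

    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get('http://example.com').text, 'html.parser')
    if soup.find('foo'):  # the "special tag" is present on the page
        related = [t.get_text(strip=True) for t in soup.find_all('bar')]  # hypothetical related tag
        print(related)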

File system crawler - iteration bugs

余生颓废 submitted on 2019-12-11 04:16:54
Question: I'm currently building a file system crawler with the following code:

    require 'find'
    require 'spreadsheet'

    Spreadsheet.client_encoding = 'UTF-8'
    count = 0

    Find.find('/Users/Anconia/crawler/') do |file|
      if file =~ /\b.xls$/ # check if filename ends in desired format
        contents = Spreadsheet.open(file).worksheets
        contents.each do |row|
          if row =~ /regex/
            puts file
            count += 1
          end
        end
      end
    end

    puts "#{count} files were found"

and I am receiving the following output: 0 files were found. The regex is
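
Two things stand out in the snippet: contents is a list of worksheets, so the block variable named row is actually a Worksheet object, and matching a Worksheet against a regex never succeeds -- hence the count of 0 (the unescaped dot in /\b.xls$/ is also suspect). A hedged sketch of the intended logic in Python with pandas (reading .xls requires the xlrd package; the pattern is a placeholder, as in the question):

    import re
    from pathlib import Path

    import pandas as pd

    pattern = re.compile(r'regex')  # placeholder pattern
    count = 0
    for path in Path('/Users/Anconia/crawler').rglob('*.xls'):
        sheets = pd.read_excel(path, sheet_name=None, header=None)  # all worksheets
        for sheet in sheets.values():
            # Match the pattern against every cell, rendered as a string.
            if sheet.astype(str).apply(lambda col: col.str.contains(pattern)).any().any():
                print(path)
                count += 1
                break  # count each file only once
    print(f'{count} files were found')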

PHPCrawl fails to create SSL socket

丶灬走出姿态 submitted on 2019-12-11 04:08:30
Question: I'm trying to use PHPCrawl (http://sourceforge.net/projects/phpcrawl/) to crawl a website delivered over HTTPS. I can see that there is support for SSL in the PHPCrawlerHTTPRequest class (openSocket method):

    // If ssl -> perform Server name indication
    if ($this->url_parts["protocol"] == "https://")
    {
        $context = stream_context_create(array('ssl' => array('SNI_server_name' => $this->url_parts["host"])));
        $this->socket = @stream_socket_client($protocol_prefix.$ip_address.":".$this->url_parts[
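
For reference, the Server Name Indication step the PHP snippet performs looks like this in Python's standard library: the server_hostname argument plays the role of 'SNI_server_name', letting a server that hosts several sites on one IP present the right certificate. The host name is illustrative.

    import socket
    import ssl

    context = ssl.create_default_context()
    with socket.create_connection(('example.com', 443)) as raw_sock:
        # server_hostname drives SNI, like 'SNI_server_name' in the PHP snippet.
        with context.wrap_socket(raw_sock, server_hostname='example.com') as tls:
            print(tls.version(), tls.getpeercert()['subject'])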

Using Python Requests to pass through a login/password

笑着哭i submitted on 2019-12-11 03:53:13
Question: I have looked at related answers and haven't found anything that quite works. I'm trying to scrape some fantasy baseball information from my team's CBS Sportsline page. I want to POST the login and password and then, when I issue the GET request, see the data specific to my account. Here's what I attempted:

    import requests

    myurl = 'http://bbroto.baseball.cbssports.com/transactions'
    payload = {'userid': '(my username)', 'Password': '(my password)', 'persistent': '1'}
    session = requests
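
A minimal sketch of a session-based login with requests, assuming the form posts to a login URL like the hypothetical one below and that the field names match the form's actual input names (check them in the page source or browser dev tools):

    import requests

    LOGIN_URL = 'https://www.cbssports.com/login'  # assumption, not verified
    DATA_URL = 'http://bbroto.baseball.cbssports.com/transactions'

    payload = {
        'userid': 'my_username',
        'password': 'my_password',  # field names must match the real form
        'persistent': '1',
    }

    with requests.Session() as session:
        session.post(LOGIN_URL, data=payload)  # cookies are stored on the session
        resp = session.get(DATA_URL)           # later requests stay logged in
        print(resp.status_code, len(resp.text))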

Resolving absolute path from relative path

僤鯓⒐⒋嵵緔 submitted on 2019-12-11 03:07:29
Question: I'm making a web crawler and I'm trying to figure out a way to resolve an absolute path from a relative path. I took two test sites, one in RoR and one made using Pyro CMS. In the latter, I found href tags with the link "index.php". So, if I'm currently crawling http://example.com/xyz, my crawler will append it and produce http://example.com/xyz/index.php. But the problem is that I should be appending to the root instead, i.e. it should have been http://example.com/index.php. So if I crawl http:/
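
Python's urllib.parse.urljoin implements exactly this RFC 3986 resolution logic: a relative href replaces the last path segment of the base URL unless the base ends with a slash, and a leading slash resolves against the site root. (One caveat: if the page sets a <base> tag, that href must be used as the base instead.)

    from urllib.parse import urljoin

    print(urljoin('http://example.com/xyz', 'index.php'))    # http://example.com/index.php
    print(urljoin('http://example.com/xyz/', 'index.php'))   # http://example.com/xyz/index.php
    print(urljoin('http://example.com/xyz/', '/index.php'))  # http://example.com/index.php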

Can search engine bots crawl pages requiring login?

馋奶兔 submitted on 2019-12-11 02:53:55
Question: If a website's homepage shows one set of content when a user is not logged in and different content when the user logs in, would a search engine bot be able to crawl the user-specific content? If it cannot, then I can duplicate the content from another part of the website to make it easily accessible to users who stated their needs at registration time. My guess is no, but I would rather make sure before I do something stupid.

Answer 1: You cannot assume that crawlers support cookies