web-crawler

Getting the list of likers for an Instagram post - Python & Selenium

与世无争的帅哥 submitted on 2019-12-11 05:09:17
Question: I'm learning web crawling. As a challenge, I set out to get the list of all the people who have liked a post on Instagram. My problem is that I'm stuck at the point where I only get the first 11 likers' usernames; I cannot find the right way to automate the scrolling process while collecting the likes. Here is my process in a Jupyter Notebook (it doesn't work as a script yet):

    from selenium import webdriver
    import pandas as pd

    driver = webdriver.Chrome()
    driver.get('https://www.instagram.com
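
A minimal sketch of the missing scrolling step, assuming the likers dialog is already open: scroll the dialog's inner container (not the page) until its height stops growing, collecting usernames along the way. The CSS selectors, the post URL, and the sleep interval are assumptions; Instagram's markup changes frequently, so they will likely need adjusting.

    import time
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('https://www.instagram.com/p/POST_ID/')  # hypothetical post URL
    # ... log in and click the likes count to open the dialog first ...

    dialog = driver.find_element_by_css_selector('div[role="dialog"]')  # assumed selector
    scroll_box = dialog.find_element_by_css_selector('div > div')       # assumed inner scrollable div

    usernames = set()
    last_height = -1
    while True:
        # Scroll the dialog itself to trigger lazy loading of the next batch.
        height = driver.execute_script(
            "arguments[0].scrollTo(0, arguments[0].scrollHeight);"
            "return arguments[0].scrollHeight;", scroll_box)
        time.sleep(1.5)  # give the next batch of likers time to load
        for link in scroll_box.find_elements_by_css_selector('a'):
            if link.text:
                usernames.add(link.text)
        if height == last_height:  # height stopped growing -> end of the list
            break
        last_height = height

    print(len(usernames), 'likers collected')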

Why is the Bing crawler not fetching the dynamic content of my Angular web page?

白昼怎懂夜的黑 submitted on 2019-12-11 05:09:14
Question: I have my SPA website (based on Node/Express/Mongo/Angular X) up and running. I created a sitemap.xml and submitted it to Microsoft Bing, and from the server log I can see they have started crawling. However, I noticed that the page URLs are requested but the associated API calls for those pages are not, so Bing is basically indexing only the static skeleton of each page, not the real dynamic content. I googled this and found people saying "Google can't index dynamic content", as suggested in this article. However, I also see other
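
The usual workaround when a crawler does not execute JavaScript is "dynamic rendering": serve bots a prerendered HTML snapshot while browsers get the JS app. Below is a minimal sketch of the idea, in Python/Flask for brevity (the question's stack is Express, but the routing logic is the same); the user-agent list and file paths are illustrative assumptions.

    from flask import Flask, request, send_file

    app = Flask(__name__)
    BOT_AGENTS = ('bingbot', 'googlebot', 'duckduckbot')  # illustrative list

    @app.route('/')
    def index():
        agent = request.headers.get('User-Agent', '').lower()
        if any(bot in agent for bot in BOT_AGENTS):
            # Prerendered snapshot with the API data already baked into the HTML.
            return send_file('snapshots/index.html')
        return send_file('app/index.html')  # the Angular skeleton + JS bundle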

Finding links with a .pdf extension

北战南征 submitted on 2019-12-11 04:37:22
Question: I need to get the links that have a .pdf extension. My code is:

    <?php
    set_time_limit(0);
    $ch = curl_init(); // this call was missing from the original snippet, leaving $ch undefined
    curl_setopt($ch, CURLOPT_URL, "http://example.com");
    curl_setopt($ch, CURLOPT_TIMEOUT, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    preg_match_all('/<a href="(http:\/\/www.[^0-9].+?)"/', $result, $output, PREG_SET_ORDER);
    // read all links
    foreach ($output as $item) {
        $n = strlen($item);
        $m = $n - 3;
        $buffer_n = $item;
        $buffer_m = "";
        $buffer_m = $buffer_n[$m].$buffer_n[$m+1].
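
Rather than slicing the last three characters off each match by hand, it is simpler to test whether the href ends in ".pdf". A hedged sketch of that approach in Python (the target URL is a placeholder):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get('http://example.com').text
    soup = BeautifulSoup(html, 'html.parser')

    # Keep only anchors whose href ends in ".pdf" (case-insensitive).
    pdf_links = [a['href'] for a in soup.find_all('a', href=True)
                 if a['href'].lower().endswith('.pdf')]
    print(pdf_links)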

Why can't I play the MIDI files I have downloaded programmatically, but I can play them when I download them manually?

試著忘記壹切 submitted on 2019-12-11 04:27:55
Question: I want to download the MIDI files from this website for a project. I have written the following code to download the files:

    from bs4 import BeautifulSoup
    import requests
    import re, os
    import urllib.request
    import string

    base_url = "http://www.midiworld.com/files/"
    base_path = 'path/where/I/will/save/the/downloaded/MIDI/files'
    os.chdir(base_path + '/MIDI Files')

    for i in range(1, 2386):
        page = requests.get(base_url + str(i))
        soup = BeautifulSoup(page.text, "html.parser")
        li_box = soup.select(
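
A common reason programmatically downloaded MIDI files will not play is that the script saves the response as text, or saves an HTML page instead of the file itself. A minimal sketch that writes the raw bytes in binary mode, assuming each page contains a direct link whose anchor text is "download" -- the selector and anchor text are assumptions to verify against the actual page:

    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    base_url = "http://www.midiworld.com/files/"
    out_dir = "midi_files"
    os.makedirs(out_dir, exist_ok=True)

    for i in range(1, 4):  # small range for demonstration
        page = requests.get(base_url + str(i))
        soup = BeautifulSoup(page.text, "html.parser")
        link = soup.find("a", string="download")  # assumed anchor text
        if link is None:
            continue
        midi = requests.get(urljoin(base_url, link["href"]))  # resolve relative hrefs
        with open(os.path.join(out_dir, f"{i}.mid"), "wb") as f:  # binary mode
            f.write(midi.content)  # raw bytes, not midi.text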

Apache Nutch 2.1 - How to get the complete source code

给你一囗甜甜゛ submitted on 2019-12-11 04:23:24
Question: I am trying to write my own Nutch plugin for crawling webpages. The problem is that I need to identify whether a particular special tag, e.g. <foo>, is present on the webpage. There is a note in the official documentation that this should be possible using Document.getElementsByTagName("foo"), but it is not working for me. Do you have any idea? My second question: once I have identified the tag above, I would like to get some other tags from the webpage where that tag was identified... is there any way to store complete source
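
For comparison, the check the question describes -- test whether a given tag is present, then collect other tags from the same page -- looks like this in Python with BeautifulSoup. The tag names reuse the placeholder "foo" from the question, plus a hypothetical "bar":

    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get('http://example.com').text, 'html.parser')
    if soup.find('foo'):  # the "special tag" is present on the page
        related = [t.get_text(strip=True) for t in soup.find_all('bar')]  # hypothetical related tag
        print(related)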

File system crawler - iteration bugs

余生颓废 submitted on 2019-12-11 04:16:54
Question: I'm currently building a file system crawler with the following code:

    require 'find'
    require 'spreadsheet'

    Spreadsheet.client_encoding = 'UTF-8'
    count = 0

    Find.find('/Users/Anconia/crawler/') do |file|
      if file =~ /\b.xls$/ # check if filename ends in desired format
        contents = Spreadsheet.open(file).worksheets
        contents.each do |row|
          if row =~ /regex/
            puts file
            count += 1
          end
        end
      end
    end

    puts "#{count} files were found"

and I am receiving the following output: 0 files were found. The regex is
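
Two things stand out in the snippet: contents is a list of worksheets, so the block variable named row is actually a Worksheet object, and matching a Worksheet against a regex never succeeds -- hence the count of 0 (the unescaped dot in /\b.xls$/ is also suspect). A hedged sketch of the intended logic in Python with pandas (reading .xls requires the xlrd package; the pattern is a placeholder, as in the question):

    import re
    from pathlib import Path

    import pandas as pd

    pattern = re.compile(r'regex')  # placeholder pattern
    count = 0
    for path in Path('/Users/Anconia/crawler').rglob('*.xls'):
        sheets = pd.read_excel(path, sheet_name=None, header=None)  # all worksheets
        for sheet in sheets.values():
            # Match the pattern against every cell, rendered as a string.
            if sheet.astype(str).apply(lambda col: col.str.contains(pattern)).any().any():
                print(path)
                count += 1
                break  # count each file only once
    print(f'{count} files were found')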

PHPCrawl fails to create SSL socket

丶灬走出姿态 submitted on 2019-12-11 04:08:30
Question: I'm trying to use PHPCrawl (http://sourceforge.net/projects/phpcrawl/) to crawl a website delivered over HTTPS. I can see that there is support for SSL in the PHPCrawlerHTTPRequest class (openSocket method):

    // If ssl -> perform Server name indication
    if ($this->url_parts["protocol"] == "https://")
    {
        $context = stream_context_create(array('ssl' => array('SNI_server_name' => $this->url_parts["host"])));
        $this->socket = @stream_socket_client($protocol_prefix.$ip_address.":".$this->url_parts[
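
For reference, the Server Name Indication step the PHP snippet performs looks like this in Python's standard library: the server_hostname argument plays the role of 'SNI_server_name', letting a server that hosts several sites on one IP present the right certificate. The host name is illustrative.

    import socket
    import ssl

    context = ssl.create_default_context()
    with socket.create_connection(('example.com', 443)) as raw_sock:
        # server_hostname drives SNI, like 'SNI_server_name' in the PHP snippet.
        with context.wrap_socket(raw_sock, server_hostname='example.com') as tls:
            print(tls.version(), tls.getpeercert()['subject'])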

Using Python Requests to pass through a login/password

笑着哭i submitted on 2019-12-11 03:53:13
Question: I have looked at related answers and haven't found anything that quite works. I'm trying to scrape some fantasy baseball information from my team's CBS Sportsline page. I want to POST the login and password and then, when I issue the GET request, see the data specific to my account. Here's what I attempted:

    import requests

    myurl = 'http://bbroto.baseball.cbssports.com/transactions'
    payload = {'userid': '(my username)', 'Password': '(my password)', 'persistent': '1'}
    session = requests
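
A minimal sketch of a session-based login with requests, assuming the form posts to a login URL like the hypothetical one below and that the field names match the form's actual input names (check them in the page source or browser dev tools):

    import requests

    LOGIN_URL = 'https://www.cbssports.com/login'  # assumption, not verified
    DATA_URL = 'http://bbroto.baseball.cbssports.com/transactions'

    payload = {
        'userid': 'my_username',
        'password': 'my_password',  # field names must match the real form
        'persistent': '1',
    }

    with requests.Session() as session:
        session.post(LOGIN_URL, data=payload)  # cookies are stored on the session
        resp = session.get(DATA_URL)           # later requests stay logged in
        print(resp.status_code, len(resp.text))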

Resolving absolute path from relative path

僤鯓⒐⒋嵵緔 submitted on 2019-12-11 03:07:29
Question: I'm making a web crawler and I'm trying to figure out a way to resolve an absolute path from a relative path. I took two test sites, one in RoR and one made using Pyro CMS. In the latter, I found href tags with the link "index.php". So, if I'm currently crawling http://example.com/xyz, my crawler will append it and produce http://example.com/xyz/index.php. But the problem is that I should be appending to the root instead, i.e. it should have been http://example.com/index.php. So if I crawl http:/
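
Python's urllib.parse.urljoin implements exactly this RFC 3986 resolution logic: a relative href replaces the last path segment of the base URL unless the base ends with a slash, and a leading slash resolves against the site root. (One caveat: if the page sets a <base> tag, that href must be used as the base instead.)

    from urllib.parse import urljoin

    print(urljoin('http://example.com/xyz', 'index.php'))    # http://example.com/index.php
    print(urljoin('http://example.com/xyz/', 'index.php'))   # http://example.com/xyz/index.php
    print(urljoin('http://example.com/xyz/', '/index.php'))  # http://example.com/index.php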

Can search engine bots crawl pages requiring login?

馋奶兔 submitted on 2019-12-11 02:53:55
Question: If a website's homepage shows one set of content when a user is not logged in and different content when the user logs in, would a search engine bot be able to crawl the user-specific content? If it cannot, then I can duplicate the content from another part of the website to make it easily accessible to users who stated their needs at registration time. My guess is no, but I would rather make sure before I do something stupid.

Answer 1: You cannot assume that crawlers support cookies