web-crawler

How to input values and click button with Requests?

Submitted by 寵の児 on 2019-12-12 01:54:21
Question: With the requests module I eventually want to download a song. If you head to youtube-mp3.org, there is one input bar and one convert button. Shortly after the conversion finishes, a download button appears. Now I want to go through that process with my Python script. So far I have this:

def download_song(song_name):
    import requests
    with requests.Session() as c:
        url = r"http://www.youtube-mp3.org/"
        c.get(url)

It is barely anything... I have tried to check the documentation on their website…
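No answer is shown in this excerpt, but the usual technique for automating a form like this with requests is to find the form's action URL and field names (in the page source or the browser's network tab) and replay that request. A minimal sketch under that assumption; the endpoint and the "url" parameter below are placeholders, not the site's real API:

import requests

def download_song(video_url):
    # Placeholder endpoint and field name; inspect the real form/XHR request
    # in the browser's network tab and substitute what the site actually uses.
    convert_endpoint = "http://www.youtube-mp3.org/"
    with requests.Session() as session:
        session.get("http://www.youtube-mp3.org/")   # pick up any cookies first
        response = session.get(convert_endpoint, params={"url": video_url})
        response.raise_for_status()
        return response.content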

Why is this condition not working? Div with class

Submitted by 岁酱吖の on 2019-12-12 01:39:25
Question: I have a condition where I want to retrieve text from a specific tag, but it does not seem to return true. Any help?

#!/usr/bin/perl
use HTML::TreeBuilder;
use warnings;
use strict;

my $URL = "http://prospectus.ulster.ac.uk/modules/index/index/selCampus/JN/selProgramme/2132/hModuleCode/COM137";
my $tree = HTML::TreeBuilder->new_from_content($URL);
if (my $div = $tree->look_down(_tag => "div ", class => "col col60 moduledetail")) {
    printf $div->as_text();
    print "test";
    open (FILE, '…
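Two likely culprits here are that new_from_content is being given a URL string rather than HTML markup, and that the tag name "div " contains a trailing space. For comparison only, a rough Python equivalent of the intended lookup (requests plus BeautifulSoup), not the asker's Perl:

import requests
from bs4 import BeautifulSoup

url = ("http://prospectus.ulster.ac.uk/modules/index/index/"
       "selCampus/JN/selProgramme/2132/hModuleCode/COM137")
html = requests.get(url).text                         # fetch the markup first
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="col col60 moduledetail")
if div is not None:
    print(div.get_text(strip=True))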

Perl print matched content only

Submitted by 那年仲夏 on 2019-12-12 01:28:00
Question: I am developing a web crawler in Perl. It extracts content from the page, and then a pattern match is done to check the language of the content. Unicode values are used to match the content. Sometimes the extracted content contains text in multiple languages. The pattern match I use here prints all the text, but I want to print only the text that matches the Unicode values specified in the pattern.

my $uu = LWP::UserAgent->new('Mozilla 1.3');
my $extractorr = HTML::ContentExtractor->new();
#…
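The excerpt cuts off before the pattern itself, but the general fix is to print the captured matches rather than the whole extracted string. A small illustration of the idea in Python (not the asker's Perl); the Unicode block used here (Tamil, U+0B80 to U+0BFF) is only an example:

import re

text = "English text கலந்த தமிழ் உரை mixed together"
# Keep only runs of characters from the target Unicode block instead of
# printing the full string whenever the pattern matches somewhere in it.
matches = re.findall(r"[\u0B80-\u0BFF]+", text)
print(" ".join(matches))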

Calling Controller.Start in a loop in Crawler4j?

Submitted by 烂漫一生 on 2019-12-11 21:59:03
Question: I asked one question here, but this is a different question that sounds similar. Using crawler4j, I want to crawl multiple seed URLs with a restriction on domain name (that is, a domain-name check in shouldVisit). Here is an example of how to do it. In short, you set a list of domain names using customData and pass it to the crawler class (from the controller); then, in the shouldVisit function, we loop through this data (which is a list, see the linked URL) to check whether the domain name is in the list, and if so return…
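The pattern being described can be sketched independently of crawler4j in a few lines of Python: keep the allowed domains from all seeds in one shared collection and consult it from a single shouldVisit-style check, rather than starting a separate controller per seed. The domain names below are placeholders:

from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "example.org"}      # placeholder seed domains

def should_visit(url):
    # Visit only URLs whose host matches, or is a subdomain of, an allowed domain.
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

print(should_visit("http://news.example.com/page"))   # True
print(should_visit("http://other.net/page"))          # False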

Getting Text From Tweets

Submitted by 江枫思渺然 on 2019-12-11 20:36:35
Question: I am trying to read my tweets from a CSV file (which I downloaded previously), and I am having some problems:

sia.list <- searchTwitter('#singaporeair', n=10, since=NULL, until=NULL, cainfo="cacert.pem")
sia.df = twListToDF(sia.list)
write.csv(sia.df, file='C:/temp/siaTweets.csv', row.names=F)

I am trying to extract the text from the list, and the problem is with the third line below:

sia.df <- read.csv(file=paste(path,"siaTweets.csv",sep=""))
sia.list <- as.list(t(sia.df))
sia_txt =…
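For comparison, the same extraction sketched in Python with the standard csv module, assuming the file written by twListToDF keeps the tweet body in a column named "text" (the path below is the one from the question):

import csv

with open("C:/temp/siaTweets.csv", newline="", encoding="utf-8") as f:
    sia_txt = [row["text"] for row in csv.DictReader(f)]

print(sia_txt[:3])   # first few tweet texts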

Sites are crawled even when the URL is removed from seed.txt (Nutch 2.1)

Submitted by 随声附和 on 2019-12-11 20:34:31
Question: I performed a successful crawl with url-1 in seed.txt and could see the crawled data in the MySQL database. When I then tried to perform another fresh crawl by replacing url-1 with url-2 in seed.txt, the new crawl started with the fetching step, and the URLs it was trying to fetch were those of the old, replaced URL from seed.txt. I am not sure where it picked up the old URL. I checked for hidden seed files and didn't find any; there is only one seed file, urls/seed.txt, in NUTCH_HOME/runtime/local…
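One plausible explanation is that Nutch 2.x keeps every injected URL in its storage backend (here the MySQL-backed webpage table), so the generate/fetch steps keep seeing url-1 even after seed.txt changes. If a truly fresh crawl is wanted, clearing that table is one option; a rough sketch with PyMySQL, using placeholder credentials and the default table name, which may differ in your Gora configuration:

import pymysql

# Placeholder connection details; adjust to match your gora.properties.
conn = pymysql.connect(host="localhost", user="nutch",
                       password="nutch", database="nutchdb")
with conn.cursor() as cur:
    cur.execute("TRUNCATE TABLE webpage")   # default Nutch 2.x Gora table name
conn.commit()
conn.close()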

Should a web-crawler pick up queries? [closed]

Submitted by 廉价感情. on 2019-12-11 18:47:23
Question: Over the last few days I have been coding a web crawler. The only question I have left is: do "standard" web crawlers crawl links with queries, like this one: https://www.google.se/?q=stackoverflow, or do they skip the query and pick them up like this: https://www.google.se

Answer 1: In case you are referring to crawling for some…
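Either policy is easy to implement; a crawler that decides to drop query strings typically normalizes URLs with something like the following standard-library sketch:

from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    # Keep scheme, host, and path; drop the query string and fragment.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_query("https://www.google.se/?q=stackoverflow"))   # https://www.google.se/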

Implementation of crawler4j

Submitted by 被刻印的时光 ゝ on 2019-12-11 18:30:44
Question: I am attempting to get the basic form of crawler4j running as seen here. I have modified the first few lines by defining the rootFolder and numberOfCrawlers as follows:

public class BasicCrawlController {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Needed parameters: ");
            System.out.println("\t rootFolder (it will contain intermediate crawl data)");
            System.out.println("\t numberOfCralwers (number of concurrent threads)");
            return;
        }
        /* *…

How to run a program in WCF?

Submitted by 孤者浪人 on 2019-12-11 18:22:48
Question: I am new to WCF, and I am designing a project in which I want to run a crawler program (written in C#) that crawls some websites and stores the crawled data in database tables (a SQL Server database). I want the crawler to run repeatedly every 30 minutes and update the database. I then want to use the service on my hosted platform so that I can use the data from the tables in a web form (i.e. an .aspx page). Is it okay to use WCF for this purpose? Please suggest how I should proceed. Thanks.

Answer 1: You…
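The answer is cut off here, but whatever the hosting choice, the "every 30 minutes" requirement is just a scheduled job (a Windows service or scheduled task is typical on that stack). A minimal, language-agnostic illustration of the loop, shown in Python rather than C#, with a placeholder crawl function:

import time

CRAWL_INTERVAL_SECONDS = 30 * 60

def crawl_and_store():
    # Placeholder for the real crawl and database write.
    print("crawling...")

while True:
    crawl_and_store()
    time.sleep(CRAWL_INTERVAL_SECONDS)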

Data Scraping, aspx

Submitted by 你。 on 2019-12-11 18:22:18
Question: I have written web crawlers before using Python, but the page I am scraping has resisted my efforts so far. I am scraping data from a website using Python and BeautifulSoup. The way I'm doing it, there are two steps: generate a list of pages to be indexed, then parse those pages. The parsing part is easy, but I haven't figured out how to navigate the .aspx pages so that I can generate the links using Python. I can currently save the search pages manually in order to scrape them, but I…
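The usual way to script .aspx search pages is to replay the form post, carrying along ASP.NET's hidden state fields (__VIEWSTATE and friends). A rough sketch with requests and BeautifulSoup; the URL and the two control names are placeholders for whatever the real page uses:

import requests
from bs4 import BeautifulSoup

SEARCH_URL = "http://example.com/Search.aspx"   # placeholder URL

with requests.Session() as session:
    page = session.get(SEARCH_URL)
    soup = BeautifulSoup(page.text, "html.parser")

    # ASP.NET postbacks fail without the hidden state fields, so copy them all.
    form_data = {
        field["name"]: field.get("value", "")
        for field in soup.find_all("input", type="hidden")
        if field.get("name")
    }
    form_data["ctl00$SearchBox"] = "search term"        # placeholder control name
    form_data["__EVENTTARGET"] = "ctl00$SearchButton"   # placeholder control name

    results = session.post(SEARCH_URL, data=form_data)
    links = [a["href"] for a in BeautifulSoup(results.text, "html.parser").find_all("a", href=True)]
    print(links[:10])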