scrape

Renaming HTML files using <title> tags

无人久伴 submitted on 2019-12-01 13:17:34
Question: I'm relatively new to programming. I have a folder, with subfolders, which contain several thousand HTML files that are generically named, e.g. 1006.htm, 1007.htm, that I would like to rename using the <title> tag from within each file. For example, if file 1006.htm contains <title>Page Title</title>, I would like to rename it Page Title.htm. Ideally spaces are replaced with dashes. I've been working in the shell with a bash script with no luck. How do I do this, with either bash or python? This is what I have so …
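A minimal sketch of one way to do this in Python 3, assuming a simple regular expression is enough to pull the <title> out of these pages (it is not a general HTML parser) and with /path/to/html/folder standing in for the real root directory:

import os
import re

root = "/path/to/html/folder"  # placeholder for the real top-level folder
for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        if not name.lower().endswith(".htm"):
            continue
        path = os.path.join(dirpath, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            match = re.search(r"<title[^>]*>(.*?)</title>", f.read(), re.I | re.S)
        if not match:
            continue  # no <title>: leave the original name alone
        # Spaces become dashes; anything else unsafe in a file name is dropped.
        safe = re.sub(r"[^\w\-]+", "", match.group(1).strip().replace(" ", "-"))
        new_path = os.path.join(dirpath, safe + ".htm")
        if safe and not os.path.exists(new_path):  # skip duplicate titles rather than overwrite
            os.rename(path, new_path)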

How can I scrape a website via PHP that requires POST data?

…衆ロ難τιáo~ submitted on 2019-12-01 09:27:36
Question: I'm trying to scrape a website that takes in POST data to return the correct page (without POST data it returns 15 results; with POST data it returns all results). Currently my code looks like this: $curl = curl_init(); curl_setopt($curl, CURLOPT_URL, "http://www.thisismyurl.com/awesome"); curl_setopt($curl, CURLOPT_POST, true); curl_setopt($curl, CURLOPT_POSTFIELDS, XXXXXX); curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); $result = curl_exec($curl); I know that I need to put my postfields into the …
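The usual fix, sketched below, is to give CURLOPT_POSTFIELDS either an associative array or a URL-encoded string; the field names here are placeholders, since the real ones have to come from the site's form or from the request visible in the browser's network tools:

<?php
// Sketch only: 'category' and 'showAll' are made-up field names.
$fields = array('category' => 'awesome', 'showAll' => 'true');

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "http://www.thisismyurl.com/awesome");
curl_setopt($curl, CURLOPT_POST, true);
// http_build_query() turns the array into "category=awesome&showAll=true"
// (application/x-www-form-urlencoded), which is what most form handlers expect.
curl_setopt($curl, CURLOPT_POSTFIELDS, http_build_query($fields));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($curl);
curl_close($curl);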

Web page scraping gems/tools available in Ruby [closed]

北慕城南 submitted on 2019-12-01 03:16:13
I'm trying to scrape web pages in a Ruby script that I'm working on. The purpose of the project is to show which ETFs and stock mutual funds are most compatible with the value investing philosophy. Some examples of pages I'd like to scrape are: http://finance.yahoo.com/q/pr?s=SPY+Profile http://finance.yahoo.com/q/hl?s=SPY+Holdings http://www.marketwatch.com/tools/mutual-fund/list/V What web scraping tools do you recommend for Ruby, and why? Keep in mind that there are thousands of stock funds out there, so any tool I use has to be reasonably quick. I am new to Ruby, but I have experience …
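Nokogiri (usually together with open-uri) is one of the gems most often suggested for this kind of job; a minimal sketch, where the CSS selector is only a placeholder that would have to be adapted to each page's actual markup:

# Sketch: fetch one of the profile pages and print cell text (placeholder selector).
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open('http://finance.yahoo.com/q/pr?s=SPY+Profile'))
doc.css('table td').each do |cell|   # placeholder selector, not the real page structure
  puts cell.text.strip
end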

curl 302 redirect not working (command line)

陌路散爱 submitted on 2019-11-30 06:17:36
In the browser, navigating to this URL triggers a 302 (Moved Temporarily) redirect which in turn downloads a file. http://www.targetsite.com/target.php/?event=download&task_id=123 When I look at what is actually happening via the Chrome network tools, I see that the redirect goes to a dynamically generated path that cancels itself immediately after the download. In other words, even if I knew that full path, I would not have time to call it manually. So, how can I mimic the browser's actions from the command line? I tried curl --cookies bin/cookies.txt -O -L " http://www.targetsite.com/target.php/ …
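One common approach, sketched below, is to let curl carry the cookies and follow the redirect itself instead of trying to call the generated path by hand; -J (honor the server's Content-Disposition filename) needs a reasonably recent curl, and the URL is the one from the question:

curl -c cookies.txt -b cookies.txt -L -O -J \
     "http://www.targetsite.com/target.php/?event=download&task_id=123"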

Html-Agility-Pack not loading the page with full content?

谁说胖子不能爱 submitted on 2019-11-30 06:01:32
Question: I am using Html Agility Pack to fetch data from a website (scraping). My problem is that the website I am fetching the data from loads some of its content a few seconds after the page itself loads. So whenever I try to read that particular data from a particular div, it gives me null; in var page I just don't get the reviewBox division, because it hasn't been loaded yet. public void FetchAllLinks(String Url) { Url = "http://www.tripadvisor.com/"; HtmlDocument page = new HtmlWeb().Load(Url); var link_list = …
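Html Agility Pack only parses the HTML the server initially returns; it does not execute Javascript, so markup that the page injects afterwards (like the reviewBox division here) will never be in the loaded document. A minimal sketch of guarding against that, with a hypothetical XPath; the usual way forward is to call the JSON/AJAX endpoint the page itself requests (visible in the browser's network tools) or to drive a real browser:

// Sketch: the XPath is a placeholder; nodes generated client-side will not appear.
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var page = new HtmlWeb().Load("http://www.tripadvisor.com/");
        var reviewBox = page.DocumentNode.SelectSingleNode("//div[contains(@class,'reviewBox')]");
        if (reviewBox == null)
        {
            // Not a parsing failure: the content is added by Javascript after load,
            // so fetch the underlying data endpoint instead, or use browser automation.
            Console.WriteLine("reviewBox is not present in the served HTML.");
            return;
        }
        Console.WriteLine(reviewBox.InnerText.Trim());
    }
}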

How to properly use mechanize to scrape AJAX sites

半腔热情 submitted on 2019-11-29 15:16:44
So I am fairly new to web scraping. There is a site that has a table on it, and the values of the table are controlled by Javascript. Those values determine the addresses of the future pages that the Javascript tells my browser to request; those new pages have JSON responses that the script uses to update the table in my browser. So I wanted to build a class with a mechanize method that takes in a URL and spits out the response body: HTML the first time, and JSON for the remaining iterations. I have something that works, but I want to know if I am doing it …
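A rough sketch of such a class, assuming the plain mechanize Browser API (open plus read) and leaving the JSON decoding to the caller; the URLs are placeholders:

# Sketch: one method returns whatever body the server sends, HTML first, JSON later.
import json
import mechanize

class Scraper(object):
    def __init__(self):
        self.browser = mechanize.Browser()
        self.browser.set_handle_robots(False)          # many such sites disallow robots
        self.browser.addheaders = [('User-Agent', 'Mozilla/5.0')]

    def fetch(self, url):
        """Return the raw response body for url, whether it is HTML or JSON."""
        response = self.browser.open(url)
        return response.read()

scraper = Scraper()
html = scraper.fetch('http://example.com/table-page')                 # placeholder URL
data = json.loads(scraper.fetch('http://example.com/data?page=2'))    # placeholder URL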

Scraping attempts getting 403 error

不羁的心 submitted on 2019-11-29 13:08:53
I am trying to scrape a website and I am getting a 403 Forbidden error no matter what I try: wget, cURL (command line and PHP), Perl WWW::Mechanize, and PhantomJS. I tried all of the above with and without proxies, changing the user-agent, and adding a Referer header. I even copied the request headers from my Chrome browser and tried sending them with my request using PHP cURL, and I am still getting a 403 Forbidden error. Any input or suggestions on what is triggering the website to block the request and how to get around it? PHP cURL example: $url = 'https://www.vitacost.com/productResults.aspx?allCategories=true&N …
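For reference, this is roughly what the copied Chrome headers amount to as a single command-line request; the header values are examples, the URL is shortened to the untruncated part of the one above, and if the block is based on cookies issued by an earlier page, TLS fingerprinting, or a Javascript check, headers alone will not get past it:

curl -v --compressed \
     -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36" \
     -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
     -H "Accept-Language: en-US,en;q=0.9" \
     -e "https://www.vitacost.com/" \
     -c cookies.txt -b cookies.txt \
     "https://www.vitacost.com/productResults.aspx?allCategories=true"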

Find next siblings until a certain one using beautifulsoup

喜欢而已 submitted on 2019-11-29 10:05:18
The webpage is something like this: <h2>section1</h2> <p>article</p> <p>article</p> <p>article</p> <h2>section2</h2> <p>article</p> <p>article</p> <p>article</p> How can I find each section with the articles within it? That is, after finding an h2, find its next siblings until the next h2. If the webpage were like this (which is normally the case): <div> <h2>section1</h2> <p>article</p> <p>article</p> <p>article</p> </div> <div> <h2>section2</h2> <p>article</p> <p>article</p> <p>article</p> </div> I could write code like: for section in soup.findAll('div'): ... for post in section.findAll('p') But what should …
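One way that works in current BeautifulSoup versions, sketched below: walk each h2's following tag siblings and stop at the next h2 (find_next_siblings does the walking, itertools.takewhile expresses the "until the next h2" part); the sample markup mirrors the question:

# Sketch: group the <p> articles that follow each <h2> until the next <h2>.
from itertools import takewhile
from bs4 import BeautifulSoup

html = """
<h2>section1</h2><p>a1</p><p>a2</p><p>a3</p>
<h2>section2</h2><p>b1</p><p>b2</p><p>b3</p>
"""
soup = BeautifulSoup(html, "html.parser")

for heading in soup.find_all("h2"):
    posts = takewhile(lambda tag: tag.name != "h2", heading.find_next_siblings())
    articles = [p.get_text() for p in posts if p.name == "p"]
    print(heading.get_text(), articles)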
