html-parsing | 易学教程

How to scrape the next pages in python using Beautifulsoup

Question: Suppose I am scraping the URL http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha, and the listing spans a number of pages containing the data I want to scrape. How can I scrape the data from all of the following pages? I am using Python 3.5.1 and Beautifulsoup. Note: I can't use scrapy or lxml because they give me an installation error. Answer 1: Determine the last page by extracting the page argument of the "Go to the last page" element. And loop over
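The approach in the answer (read the page number from the "last page" link, then loop) can be sketched as follows. This is a minimal sketch under stated assumptions, not the answerer's code: it uses the standard library's `html.parser` instead of Beautifulsoup so that it runs without extra installs, and the link title `"Go to last page"`, the query parameter name `page`, and zero-based page numbering are all assumptions about the site's markup. The names `LastPageFinder` and `page_urls` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


class LastPageFinder(HTMLParser):
    """Remembers the href of the pager link whose title marks the last page."""

    def __init__(self, last_title="Go to last page"):
        super().__init__()
        self.last_title = last_title  # assumed title text of the pager link
        self.last_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("title") == self.last_title:
            self.last_href = attrs.get("href")


def page_urls(base_url, listing_html, page_param="page"):
    """Return one URL per listing page, from page 0 up to the last page."""
    finder = LastPageFinder()
    finder.feed(listing_html)
    if finder.last_href is None:
        # No pager link found: treat the listing as a single page.
        return [base_url]

    # Read the page number out of the last-page link's query string.
    last_query = parse_qs(urlparse(finder.last_href).query)
    last_page = int(last_query[page_param][0])

    parts = urlparse(base_url)
    base_query = parse_qs(parts.query)
    urls = []
    for page in range(last_page + 1):  # assumes zero-based page numbers
        query = dict(base_query, **{page_param: [str(page)]})
        urls.append(urlunparse(parts._replace(query=urlencode(query, doseq=True))))
    return urls


sample = '<a title="Go to last page" href="/list?sort_filter=alpha&amp;page=2">last</a>'
for url in page_urls("http://example.com/list?sort_filter=alpha", sample):
    print(url)
```

Each generated URL would then be fetched and parsed in turn with whatever parser is available; that part is omitted here because it depends on the page structure.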

Should I use regex or just DOM/string manipulation? [closed]

Question: Sometimes I am not sure when I should use one or the other. I usually parse all sorts of things with Python, but I would like to focus this question on HTML parsing. Personally I find DOM manipulation really useful when having to parse more than two regular elements (i.e.

How to get the count of tables in an html file with C# and html-agility-pack

Question: This is a newbie question, so please provide working code. How do I count the tables in an HTML file using C# and the html-agility-pack? (I will need to get values from specific tables in an HTML file based on the count of tables. I will then perform some math on the values retrieved.) Here is a sample file with three tables for your convenience: <html> <head> <title>Tables</title> </head> <body> <table border="1"> <tr> <th>Name</th> <th>Phone</th> <th>City</th> <th>Number</th> </tr> <tr> <td
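The underlying idea (walk the parsed document and count every `<table>` start tag) can be illustrated in Python. This is only a language-neutral sketch of the counting step, not html-agility-pack code; `TableCounter` and `count_tables` are hypothetical names, and the sample markup is a shortened stand-in for the file in the question. Note that nested tables are counted as well.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts every <table> start tag, including nested tables."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1


def count_tables(html_text):
    counter = TableCounter()
    counter.feed(html_text)
    counter.close()
    return counter.count


sample = (
    "<html><body>"
    "<table border='1'><tr><th>Name</th></tr></table>"
    "<table><tr><td>a</td></tr></table>"
    "<table><tr><td>b</td></tr></table>"
    "</body></html>"
)
print(count_tables(sample))
```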

Processing HTML files Python

Question: I don't know much about HTML... How do you extract just the text from the page? For example, if the HTML page reads as: <meta name="title" content="How can I make money at home online? No gimmacks please? - Yahoo! Answers"> <title>How can I make money at home online? No gimmicks please? - Yahoo! Answers</title> I just want to extract this: How can I make money at home online? No gimmicks please? - Yahoo! Answers I am using the re module: def striphtml(data): p = re.compile(r'<.*?>') return p.sub(' '
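As an alternative to stripping tags with a regular expression, the text of the `<title>` element can be collected with the standard library's `html.parser`. The following is a minimal sketch, assuming the goal is only the title text; `TitleExtractor` and `extract_title` are hypothetical names, not part of the question's code.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text that appears inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text outside <title> (and attribute values such as the meta
        # tag's content) never reaches this list.
        if self._in_title:
            self.parts.append(data)


def extract_title(html_text):
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()


page = (
    '<meta name="title" content="ignored attribute text"> '
    "<title>How can I make money at home online? "
    "No gimmicks please? - Yahoo! Answers</title>"
)
print(extract_title(page))
```

Because the parser tracks which element it is inside, the `content` attribute of the `<meta>` tag is skipped without any extra pattern matching.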

MySQL like Query fails in PHP

Question: $result = mysql_query("SELECT * FROM MasjidMaster WHERE MasjidName LIKE ('%moh%')") or die mysql_error(); The error I get is: Parse error: syntax error, unexpected T_STRING in /home/maximtec/public_html/masjid_folder/MasjidFinderScripts/find_by_name.php on line 24 This query works when I run it in MySQL, but it doesn't when I place it in a PHP script. Please suggest a solution. EDIT (after changing the query based on the received answers): Well I updated

Why would this regex return an error?

Question: Why does the following evaluate to true? if(preg_match_all('%<tr.*?>.*?<b>.*?</b>.*?</tr>%ims', $contents, $x)===FALSE) {...} $contents is retrieved using file_get_contents() from this source. The regex was simplified to troubleshoot the problem. The code I was actually using was: if(preg_match( '%Areas of Study: </P>.*?<TABLE BORDER="0">(.*?)<TBODY>.*?</TBODY>.*? </TABLE>%ims', $contents, $course_list) ) { if(preg_match_all('%<TR>.*?<TD.*?>.*?<B>(.*?)</B>.*?</TD>.*?<TD.*?>.*?</TD>.*?<TD.*?

He!p with PHP DOM elements

Question: I'm trying to automatically get synonyms for words using cURL, but I'm having trouble. This is the part of the HTML downloaded with cURL where the synonyms are: "vagabunda", "piriguete", "vagabundagem", "gandaia", etc. <div class="box_palavra_escolhida"> <img src="../img/icone-livro.png" width="41px" height="35px" border="0" alt="imagem icone livro" /> <a class="link_escolhida" href="dicsin_edicao.php?id=26708" title="Vagabunda"> Vagabunda </a> <a class="link_escolhida_sinonimo" href="dicsin

Trying to get inputs / getelementbyID or Class and put into richtextbox

Question: I am currently using HtmlAgilityPack to parse some HTML, first for a form's input tags and then to get the name of each ID or class, so that I can list each input with its id="something here" or class="something here" in a RichTextBox for review. Here is my code: Dim web As HtmlAgilityPack.HtmlWeb = New HtmlWeb() Dim doc As HtmlAgilityPack.HtmlDocument = web.Load(TextBox1.Text) Dim threadLinks As IEnumerable(Of HtmlNode) = doc.DocumentNode.SelectNodes("/input") For Each link In threadLinks Dim str As

HTML::TableExtract: applying the right attribs to specify the attributes of interest

Question: I tried to run the following Perl script on the HTML further below. My problem is how to define the correct hash reference, with attribs that specify the attributes of interest within my HTML <table> tag itself. #!/usr/bin/perl use strict; use warnings; use HTML::TableExtract; use YAML; my $table = HTML::TableExtract->new(keep_html=>0, depth => 1, count => 1, br_translate => 0 ); $table->parse($html); foreach my $row ($table->rows) sub cleanup { for ( @_ ) { s/\s+//; s/[\xa0 ]+\z//; s/\s+/ /g; }

Extracting anchor tag from html using Java

Question: I have several anchor tags in a text. Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a> Output: http://stackoverflow.com How can I find all those input strings and convert them to the output string in Java, without using a third-party API? Answer 1: public static void main(String[] args) { String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd" + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf"; String regex =
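For comparison, the same extraction (pulling the `href` out of every anchor tag without a third-party library) can be sketched in Python with the standard library's `html.parser`. This swaps the answer's regular-expression approach for a tag parser and is not a translation of the Java code; `HrefCollector` and `extract_hrefs` are hypothetical names, and the test string mirrors the one in the answer.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def extract_hrefs(text):
    collector = HrefCollector()
    collector.feed(text)
    collector.close()
    return collector.hrefs


test = (
    'qazwsx<a href="http://stackoverflow.com">Take me to StackOverflow</a>fdgfdhgfd'
    '<a href="http://stackoverflow2.com">Take me to StackOverflow2</a>dcgdf'
)
print(extract_hrefs(test))
```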