html-parsing

How to scrape the next pages in python using Beautifulsoup

ⅰ亾dé卋堺 submitted on 2019-12-12 03:36:23
Question: Suppose I am scraping the URL http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha and it contains a number of pages holding the data I want to scrape. How can I scrape the data from all the following pages? I am using Python 3.5.1 and BeautifulSoup. Note: I can't use Scrapy or lxml, as they give me installation errors. Answer 1: Determine the last page by extracting the page argument of the "Go to the last page" element. And loop over
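The approach in the answer above can be sketched roughly as follows. This is only an illustration using Python's standard library (`html.parser` and `urllib.parse`) rather than BeautifulSoup. It assumes the pager link has a `title` containing "last page", that the page number is carried in a `page` query argument, and that pages are numbered from 0; none of these details are confirmed by the excerpt. The names `LastPageFinder`, `last_page_number`, and `page_urls` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


class LastPageFinder(HTMLParser):
    """Remembers the href of the first <a> whose title mentions 'last page'."""

    def __init__(self):
        super().__init__()
        self.last_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        title = (attrs.get("title") or "").lower()
        # Assumption: the pager link is identified by its title attribute.
        if tag == "a" and "last page" in title and self.last_href is None:
            self.last_href = attrs.get("href")


def last_page_number(html):
    """Return the 'page' argument of the last-page link, or 0 if there is none."""
    finder = LastPageFinder()
    finder.feed(html)
    if finder.last_href is None:
        return 0
    query = parse_qs(urlparse(finder.last_href).query)
    return int(query.get("page", ["0"])[0])


def page_urls(base_url, last_page):
    """Yield base_url with page=0 .. page=last_page (zero-based numbering assumed)."""
    parts = urlparse(base_url)
    query = parse_qs(parts.query)
    for page in range(last_page + 1):
        query["page"] = [str(page)]
        yield urlunparse(parts._replace(query=urlencode(query, doseq=True)))
```

In practice, the first page would be downloaded once, passed to `last_page_number`, and each URL from `page_urls` would then be fetched and parsed in turn.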

Should I use regex or just DOM/string manipulation? [closed]

微笑、不失礼 submitted on 2019-12-12 03:11:51
Question: Sometimes I am not sure when I should use one or the other. I usually parse all sorts of things with Python, but I would like to focus this question on HTML parsing. Personally, I find DOM manipulation really useful when having to parse more than two regular elements (i.e.

How to get the count of tables in an html file with C# and html-agility-pack

一笑奈何 submitted on 2019-12-12 02:54:15
Question: This is a newbie question, so please provide working code. How do I count the tables in an HTML file using C# and the html-agility-pack? (I will need to get values from specific tables in an HTML file based on the count of tables. I will then perform some math on the values retrieved.) Here is a sample file with three tables for your convenience: <html> <head> <title>Tables</title> </head> <body> <table border="1"> <tr> <th>Name</th> <th>Phone</th> <th>City</th> <th>Number</th> </tr> <tr> <td

Processing HTML files Python

两盒软妹~` submitted on 2019-12-12 02:47:04
Question: I don't know much about HTML. How do you extract just the text from a page? For example, if the HTML page reads: <meta name="title" content="How can I make money at home online? No gimmacks please? - Yahoo! Answers"> <title>How can I make money at home online? No gimmicks please? - Yahoo! Answers</title> I just want to extract this: How can I make money at home online? No gimmicks please? - Yahoo! Answers. I am using the re module: def striphtml(data): p = re.compile(r'<.*?>') return p.sub(' '
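As an alternative to stripping tags with a regex, the title text in the example above could be pulled out with the standard library's `html.parser`. The following is a minimal sketch, not the asker's code or an answer from the original thread; `TitleExtractor` and `extract_title` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only kept while the parser is inside the <title> element.
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the document title as plain text, or '' if there is no <title>."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip()
```

Because the parser only collects text while inside `<title>`, the `content` attribute of the `<meta>` tag is ignored, whereas a regex that deletes everything matching `<.*?>` would keep any other text on the page as well.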

MySQL like Query fails in PHP

有些话、适合烂在心里 submitted on 2019-12-12 02:45:51
Question: $result = mysql_query("SELECT * FROM MasjidMaster WHERE MasjidName LIKE ('%moh%')") or die mysql_error(); The error I get is: Parse error: syntax error, unexpected T_STRING in /home/maximtec/public_html/masjid_folder/MasjidFinderScripts/find_by_name.php on line 24. This query works when I run it in MySQL, but not when I place it in a PHP script. Please suggest a solution. EDIT: After changing the query based on the answers received: Well, I updated

Why would this regex return an error?

我们两清 submitted on 2019-12-12 02:39:21
Question: Why does the following evaluate to true? if(preg_match_all('%<tr.*?>.*?<b>.*?</b>.*?</tr>%ims', $contents, $x)===FALSE) {...} $contents is retrieved using file_get_contents() from this source. The regex was simplified to troubleshoot the problem. The code I was actually using was: if(preg_match( '%Areas of Study: </P>.*?<TABLE BORDER="0">(.*?)<TBODY>.*?</TBODY>.*? </TABLE>%ims', $contents, $course_list) ) { if(preg_match_all('%<TR>.*?<TD.*?>.*?<B>(.*?)</B>.*?</TD>.*?<TD.*?>.*?</TD>.*?<TD.*?

He!p with PHP DOM elements

一笑奈何 submitted on 2019-12-12 02:26:34
Question: I'm trying to automatically get synonyms for words using cURL, but I'm having trouble. This is the part of the HTML downloaded with cURL where the synonyms are: "vagabunda", "piriguete", "vagabundagem", "gandaia", etc. <div class="box_palavra_escolhida"> <img src="../img/icone-livro.png" width="41px" height="35px" border="0" alt="imagem icone livro" /> <a class="link_escolhida" href="dicsin_edicao.php?id=26708" title="Vagabunda"> Vagabunda </a> <a class="link_escolhida_sinonimo" href="dicsin

Trying to get inputs / getelementbyID or Class and put into richtextbox

一个人想着一个人 submitted on 2019-12-12 02:15:35
Question: I am currently using Html Agility Pack to parse some HTML, first for a form's input tags, and then to get the name of each ID or class and list every input with its id="something here" or class="something here" in a RichTextBox for review. Here is my code: Dim web As HtmlAgilityPack.HtmlWeb = New HtmlWeb() Dim doc As HtmlAgilityPack.HtmlDocument = web.Load(TextBox1.Text) Dim threadLinks As IEnumerable(Of HtmlNode) = doc.DocumentNode.SelectNodes("/input") For Each link In threadLinks Dim str As

HTML::TableExtract: applying the right attribs to specify the attributes of interest

て烟熏妆下的殇ゞ submitted on 2019-12-12 01:27:11
Question: I tried to run the following Perl script on the HTML further below. My problem is how to define the correct hash reference, with attribs that specify the attributes of interest within the HTML <table> tag itself. #!/usr/bin/perl use strict; use warnings; use HTML::TableExtract; use YAML; my $table = HTML::TableExtract->new(keep_html=>0, depth => 1, count => 1, br_translate => 0 ); $table->parse($html); foreach my $row ($table->rows) sub cleanup { for ( @_ ) { s/\s+//; s/[\xa0 ]+\z//; s/\s+/ /g; }

Extracting anchor tag from html using Java

对着背影说爱祢 submitted on 2019-12-12 01:03:35
Question: I have several anchor tags in a text. Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a> Output: http://stackoverflow.com How can I find all such input strings and convert each one to the output string in Java, without using a third-party API? Answer 1: public static void main(String[] args) { String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd" + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf"; String regex =