html-parsing

Selenium into android project

夙愿已清 submitted on 2019-12-13 04:44:17
Question: I want to parse an HTML page whose content is loaded by JavaScript after I click buttons. I built my application on a PC in Java using the Jsoup and Selenium libraries, and I want it to work on Android. I added Selenium to my new Android application project and added the dependency in Gradle: compile 'org.seleniumhq.selenium:selenium-htmlunit-driver:2.48.2' But I see many messages like this one: Warning:Dependency org.apache.httpcomponents:httpclient:4.5.1 is ignored for debug as it may be

HTML Version 4 vs 5

穿精又带淫゛_ submitted on 2019-12-13 04:43:16
Question: Is there a way to avoid the HTML5 parser? My app has the following doctype: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> I want it to be interpreted with HTML4 definitions, not HTML5. EDIT: The reason for my question is to solve this: Chrome popup Please Fill Out this Field Answer 1: New answer based on updated question: It isn't the HTML 5 parsing rules you have a problem with, it is support for HTML 5 attributes. No, you can't override this. If you

Get BeautifulSoup to correctly parse php tags or ignore them

不羁岁月 submitted on 2019-12-13 04:12:13
Question: I currently need to parse a lot of .phtml files, get specific HTML tags, and add a custom data attribute to them. I'm using Python's BeautifulSoup to parse the entire document and add the attributes, and this part works just fine. The problem is that the view files (phtml) contain PHP tags that get parsed too. Below is an example of input and output. INPUT <?php $stars = $this->getData('sideBarCoStars', []); if (!$stars) return; $sideBarCoStarsCount = $this->getData('sideBarCoStarsCount'); $title = $this-
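One possible workaround, sketched here as an assumption and not as the answer given to the original question, is to mask each PHP block with an HTML comment placeholder before parsing and to restore the blocks after the document has been serialized again. The helper names `mask_php` and `unmask_php` and the placeholder format are hypothetical, and the parsing step itself is left out.

```python
import re

# Matches <?php ... ?> and <?= ... ?>, including an unterminated block at end of file.
PHP_BLOCK = re.compile(r"<\?(?:php|=).*?(?:\?>|\Z)", re.DOTALL)


def mask_php(source):
    """Replace every PHP block with a numbered HTML comment placeholder."""
    blocks = []

    def _stash(match):
        blocks.append(match.group(0))
        return "<!--PHP_BLOCK_%d-->" % (len(blocks) - 1)

    return PHP_BLOCK.sub(_stash, source), blocks


def unmask_php(masked, blocks):
    """Put the original PHP blocks back in place of their placeholders."""
    return re.sub(
        r"<!--PHP_BLOCK_(\d+)-->",
        lambda m: blocks[int(m.group(1))],
        masked,
    )
```

The masked string can be handed to the HTML parser, modified, serialized, and then passed to `unmask_php`. A PHP block that sits inside an attribute value may have its placeholder escaped by the serializer, so that case would need separate handling.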

Selenium grid sessions not applied

こ雲淡風輕ζ submitted on 2019-12-13 03:36:23
Question: I'm using Selenium standalone + Chrome headless + PHP + UwAmp server on my computer to parse some data (system: WIN7_32bit, 4GB RAM). I need to start 22 Chrome sessions at the same time, so I'm using Selenium Grid with these settings: java -jar selenium-server-standalone-2.53.1.jar -role hub java -jar selenium-server-standalone-2.53.1.jar -role node -hub http://localhost:4444/grid/register -browser "browserName=chrome,maxInstances=22,seleniumProtocol=WebDriver" -maxSession 22 My problem is that

How to scrape a website which redirects for some time

三世轮回 submitted on 2019-12-13 03:31:28
Question: I am trying to scrape a website that shows a DDoS prevention page for 5 seconds before displaying its content. The website is Koinex. I am using Python 3 and BeautifulSoup. I think I need to introduce a time delay after sending the request and before reading the content. Here is what I have done so far: import requests from bs4 import BeautifulSoup url = 'https://koinex.in/' response = requests.get(url) html = response.content Answer 1: It uses JavaScript to generate some value which is sent to page https:/

Trimming whitespace from HTML content?

巧了我就是萌 submitted on 2019-12-13 03:29:28
Question: I have a CRUD maintenance screen with a custom rich text editor control (FCKEditor, actually), and the program extracts the formatted text as HTML from the control for saving to the database. However, part of our standards is that leading and trailing whitespace must be stripped from the content before saving, so I have to remove extraneous &nbsp; and <br> and such from the beginning and end of the HTML string. I can opt to either do it on the client side (using Javascript) or on the server side
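A minimal sketch of the server-side variant follows, assuming Python purely for illustration since the excerpt does not name the server language. The function name `trim_html` is hypothetical. It removes runs of whitespace, `&nbsp;` (entity or literal U+00A0), and `<br>` tags from both ends of the string and leaves the interior untouched.

```python
import re

# A run of whitespace, &nbsp; (entity or U+00A0 character), or <br> / <br/> tags.
_JUNK = r"(?:\s|&nbsp;|\u00a0|<br\s*/?>)+"
_LEADING = re.compile(r"^" + _JUNK, re.IGNORECASE)
_TRAILING = re.compile(_JUNK + r"$", re.IGNORECASE)


def trim_html(fragment):
    """Strip whitespace-like markup from both ends of an HTML fragment."""
    return _TRAILING.sub("", _LEADING.sub("", fragment))
```

This only handles padding that sits directly at the ends of the string. Padding wrapped in its own element, such as `<p>&nbsp;</p>`, is not covered and would need a parser-based pass.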

Extract html tags from a text file through iteration and append them to a list and ignore all other characters in python

我只是一个虾纸丫 submitted on 2019-12-13 03:15:04
Question: I want to be able to read an HTML file and extract only the tags from it. Read one character at a time from the file, ignoring everything until "<" (ignore the "<" as well). Then read one character at a time, appending each to a string until ">" or whitespace (ignore the ">" as well). <html> <body> <h1>This is test</h1> <h2> This is test 2<h2> </body> <html> with open('doc.txt', 'r') as f: all_lines = [] # loop through all lines using f.readlines() method for line in f.readlines(): new_line = [] # this is
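The two steps described in the question can be sketched as a small state machine. This is one possible reading of the requirements, not the original poster's code. The function name `extract_tags` is hypothetical, and closing tags keep their leading "/" because the steps only say to drop "<" and ">".

```python
def extract_tags(text):
    """Collect tag names by scanning the text one character at a time."""
    tags = []
    current = None  # None while outside a tag, a list of characters while inside one
    for ch in text:
        if current is None:
            if ch == "<":
                current = []  # start a new tag, dropping the "<" itself
        elif ch == ">" or ch.isspace():
            if current:
                tags.append("".join(current))
            current = None  # anything after the name (attributes) is skipped
        else:
            current.append(ch)
    return tags
```

To run it on a file, read the contents with `open('doc.txt').read()` and pass the resulting string to `extract_tags`. Reading the whole file at once avoids losing a tag that happens to span a line break.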

Can I read an iframe through WebClient (I want the outer HTML)?

蓝咒 submitted on 2019-12-13 03:09:56
Question: My program reads a target web page that contains, somewhere in the body, the iframe I want to read. My HTML source: <html> ... <iframe src="http://www.mysite.com" ></iframe> ... </html> In my program I have a method that returns the source as a string: public static string get_url_source(string url) { using (WebClient client = new WebClient()) { return client.DownloadString(url); } } My problem is that I want to get the source of the iframe when it's reading the source, as it would
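Downloading a page returns only the outer document, so the iframe's content has to be fetched with a second request to its `src` URL. The sketch below shows the first half of that in Python (chosen only for illustration; the question itself uses C#): collecting the iframe URLs from the outer HTML. The names `IframeSrcParser` and `iframe_urls` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Record the src attribute of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_html, base_url):
    """Return absolute URLs for all iframes found in page_html."""
    parser = IframeSrcParser()
    parser.feed(page_html)
    # Resolve relative src values against the URL of the outer page.
    return [urljoin(base_url, src) for src in parser.sources]
```

Each returned URL can then be downloaded the same way as the outer page. Iframes that are injected by JavaScript after load will not appear in the downloaded source and are outside the scope of this sketch.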

parsing invalid anchor tag with BeautifulSoup or Regex

醉酒当歌 submitted on 2019-12-13 02:53:52
Question: I want to parse a raw document containing HTML anchor tags, but unfortunately it contains invalid tags such as: <a href="A 4"drive bay">some text here</a> I know the href value may not be an actual link, but let's leave it that way. What I need is to retrieve the href value 'A 4"drive bay' and the link text 'some text here'. I am using Python and have tried the BeautifulSoup library, which works pretty well for retrieving all the anchor tags. The problem, though, is that
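A regex-based sketch for this specific malformed shape is shown below. It assumes that `href` is the only attribute on the tag and that the value ends at the last double quote before the closing `>`. Both are assumptions about the input, not something the excerpt confirms. The names `ANCHOR` and `parse_anchors` are hypothetical.

```python
import re

# The lazy href group keeps extending past stray quotes until it finds
# a quote that is immediately followed by the tag's closing ">".
ANCHOR = re.compile(r'<a\s+href="(.*?)"\s*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def parse_anchors(raw):
    """Return a list of (href, link_text) tuples found in raw."""
    return ANCHOR.findall(raw)
```

For the example tag this yields the pair `'A 4"drive bay'` and `'some text here'`. An href value that itself contains `">` would be cut short, so the pattern is not a general HTML parser.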

Navigating to second string text using BeautifulSoup

随声附和 submitted on 2019-12-13 02:35:22
Question: Here is the lxml; it's saved as sample.html. <html> <body> <div class ="ecopyramid"> <ul id ="producers"> <li class ="producerlist"> <div class ="name">A1</div> <div class ="number">100000</div> </li> <li class ="producerlist"> <div class ="name">B1</div> <div class ="number">100000</div> </li> </ul> <ul id ="primaryconsumers"> <li class ="primaryconsumerlist"> <div class ="name">A2</div> <div class ="number">1000</div> </li> <li class ="primaryconsumerlist"> <div class ="name">B2</div> <div