xml2

Can R read html-encoded emoji characters?

霸气de小男生 submitted on 2020-01-13 09:34:52
Question: My question, explained below, is how R can be used to read a string that includes HTML emoji codes like &#55358;&#56599;, and either (1) represent the emoji symbol (e.g., as a Unicode symbol: 🤗) in the parsed string, or (2) convert it into its text equivalent (":hugging face:"). Background: I have an XML dataset of text messages (from the Android/iOS app [Signal](https://signal.org/)) that I am reading into R for a text mining project. The data look like this, with each text
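The entity pair in the question, &#55358;&#56599;, looks like a UTF-16 surrogate pair written as two decimal character references. The following is a minimal base-R sketch of option (1) under that assumption; `decode_surrogate_entities()` is a hypothetical helper name, it handles a single string, and it does not attempt option (2), the text equivalent.

```r
# Decode pairs of numeric HTML entities that encode a UTF-16 surrogate pair
# (e.g. "&#55358;&#56599;") into the single Unicode character they represent.
# decode_surrogate_entities() is a hypothetical helper, not part of any package.
decode_surrogate_entities <- function(x) {
  # Two adjacent five-digit decimal entities; exact ranges are checked below.
  pattern <- "&#(5[56][0-9]{3});&#(5[67][0-9]{3});"
  pairs <- unique(regmatches(x, gregexpr(pattern, x))[[1]])
  for (pair in pairs) {
    nums <- as.integer(regmatches(pair, gregexpr("[0-9]+", pair))[[1]])
    hi <- nums[1]
    lo <- nums[2]
    # High surrogates are U+D800..U+DBFF, low surrogates are U+DC00..U+DFFF.
    is_pair <- hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF
    if (is_pair) {
      code_point <- 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
      x <- gsub(pair, intToUtf8(code_point), x, fixed = TRUE)
    }
  }
  x
}

# Hypothetical message text, used only for illustration.
msg <- "Thanks &#55358;&#56599; see you soon"
decoded <- decode_surrogate_entities(msg)
print(decoded)
```

For the pair above the arithmetic gives 0x10000 + (0xD83E - 0xD800) * 0x400 + (0xDD17 - 0xDC00) = 0x1F917, the code point of the hugging face emoji. Running the helper on the raw file text before handing it to an XML parser is one possible design; whether that fits the Signal export is untested here.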

Parsing large and complicated XML file to data.frame

送分小仙女 submitted on 2019-12-21 23:09:22
Question: I have a large XML file with lots of reports. I created the data example below to approximate the size of the XML and its structure: x <- "<Report><Agreements><AgreementList /></Agreements><CIP><RecordList><Record><Date>2017-05-26T00:00:00</Date><Grade>2</Grade><ReasonsList><Reason><Code>R</Code><Description>local</Description></Reason></ReasonsList><Score>xxx</Score></Record><Record><Date>2017-04-30T00:00:00</Date><Grade>2</Grade><ReasonsList><Reason><Code>R</Code><Description/></Reason><

xml2 in R: extract children attributes from parents (everything is named the same)

隐身守侯 submitted on 2019-12-13 08:43:35
Question: I have the following XML, in which nodes can have the same names but their attributes may differ. <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5"> <protein> <sequence md5="6e7e4fcef214ab5cf97714e899af0b96"

Short XPath in R for use with RSelenium

戏子无情 submitted on 2019-12-11 10:18:57
Question: I have a problem when using findElement() from RSelenium with XPath on this page, where the XPath expression for an element is very long, i.e. the element is deeply nested (I use Firefox for the remote driver). findElement() works fine on the page if I use a short XPath expression that I get from inspecting the element, e.g. in Google Chrome. However, in R (as far as I know) I can only retrieve the long XPath expression, using for example xml_path() from package xml2. Is there a way to get a

Parsing large XML to dataframe in R

回眸只為那壹抹淺笑 submitted on 2019-12-11 06:12:23
Question: I have large XML files that I want to turn into data frames for further processing within R and other programs. This is all being done on macOS. Each monthly XML file is around 1 GB, has 150k records and 191 different variables. In the end I might not need all 191 variables, but I'd like to keep them and decide later. The XML files can be accessed here (scroll to the bottom for the monthly zips; when uncompressed, one should look at the "dming" XMLs). I've made some progress but processing for

Difference between read_html(url) and read_html(content(GET(url), "text"))

别等时光非礼了梦想. submitted on 2019-12-11 03:59:07
Question: I am looking at this great answer: https://stackoverflow.com/a/58211397/3502164. The beginning of the solution includes: library(httr) library(xml2) gr <- GET("https://nzffdms.niwa.co.nz/search") doc <- read_html(content(gr, "text")) xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value") The output is constant across multiple requests: "59243d3a233492e9461f8f73136118f9" My default way so far would have been: doc <- read_html("https://nzffdms.niwa.co.nz/search") xml_attr(xml

Adding whitespace to text elements

余生长醉 submitted on 2019-12-11 03:54:49
Question: Is there a way to add whitespace to each element that contains text? For this example: movie <- read_html("http://www.imdb.com/title/tt1490017/") cast <- html_nodes(movie, "#titleCast span.itemprop") cast %>% html_structure() [[1]] <span.itemprop [itemprop]> {text} [[2]] <span.itemprop [itemprop]> {text} I would like to add a trailing whitespace to each text element before using html_text(). I have another use case where I want to use html_text() higher up in the document hierarchy. The

Why does XPath find excluded nodes again?

江枫思渺然 submitted on 2019-12-10 21:05:00
Question: Consider this page: <n1 class="a"> 1 </n1> <n1 class="b"> <b>bold</b> 2 </n1> If I first select the second n1 using class="b", I should be excluding the first n1, and indeed this appears true: library(rvest) b_nodes = read_html('<n1 class="a">1</n1> <n1 class="b"><b>bold</b>2</n1>') %>% html_nodes(xpath = '//n1[@class="b"]') b_nodes # {xml_nodeset (1)} # [1] <n1 class="b"><b>bold</b>2</n1> However, if we now use this "subsetted" page: b_nodes %>% html_nodes(xpath = '//n1') # {xml_nodeset (2)
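One likely explanation, offered here as a hedged sketch rather than the accepted answer: in XPath, an expression starting with `//` is absolute and is evaluated from the document root regardless of the context node, while `.//` is relative to the context node. The example below uses xml2's `xml_find_all()` directly instead of rvest's `html_nodes(xpath = ...)`, on the assumption that both evaluate the expression the same way; the `<root>` wrapper is added only to make the snippet well-formed XML.

```r
library(xml2)

# Hypothetical wrapper element <root> added so the snippet is well-formed XML.
doc <- read_xml('<root><n1 class="a">1</n1><n1 class="b"><b>bold</b>2</n1></root>')
b_nodes <- xml_find_all(doc, '//n1[@class="b"]')

# "//n1" is an absolute path: it restarts from the document root,
# so it sees both n1 elements again.
absolute_hits <- xml_find_all(b_nodes, "//n1")

# ".//n1" and ".//b" are relative: they search only below each context node.
relative_n1 <- xml_find_all(b_nodes, ".//n1")
relative_b <- xml_find_all(b_nodes, ".//b")

print(length(absolute_hits))
print(length(relative_n1))
print(length(relative_b))
```

Selecting a node set does not copy or prune the document; each selected node still belongs to the full tree, which is why an absolute path can reach nodes outside the selection.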

Parsing XML in R: Incorrect namespaces

我的梦境 submitted on 2019-12-07 02:19:53
Question: I have a bunch of XML files and an R script that reads their content into a data frame. However, I now have files that I want to parse as usual, but something in their namespace definition prevents me from selecting their values with the usual XPath expressions. The XML files look like this: xml_nons.xml <?xml version="1.0" encoding="UTF-8"?> <XML> <Node> <Name>Name 1</Name> <Title>Title 1</Title> <Date>2015</Date> </Node> </XML> And the other: xml_ns.xml <?xml version="1.0" encoding=

Can R read html-encoded emoji characters?

半腔热情 submitted on 2019-12-05 07:32:27
Question: My question, explained below, is how R can be used to read a string that includes HTML emoji codes like &#55358;&#56599;, and either (1) represent the emoji symbol (e.g., as a Unicode symbol: 🤗) in the parsed string, or (2) convert it into its text equivalent (":hugging face:"). Background: I have an XML dataset of text messages (from the Android/iOS app [Signal](https://signal.org/)) that I am reading into R for a text mining project. The data look like this, with each text message represented in an sms node: <?xml version="1.0" encoding="UTF-8" standalone="yes" ?> <!-- File