Scraping HTML Text from a <dl> Tag

Question

I have a description list containing an agenda that I downloaded from a website, and I am trying, without success, to turn it into a data.frame. The description list has the following structure:

<dl>
<dt> (which contains a <p class="day"> for the day)
<dd> (which contains a <p class="hour"> for the hour and a <p class="event"> for the event)

I managed to extract this data with the following code:

library(rvest)
url <- read_html("http://www.mypage.com")
day <- data.frame(day = html_text(html_nodes(url, '.day')))
hour <- data.frame(hour = html_text(html_nodes(url, '.hour')))
event <- data.frame(event = html_text(html_nodes(url, '.event')))

day$ID <- seq.int(nrow(day))
hour$ID <- seq.int(nrow(hour))
event$ID <- seq.int(nrow(event))

Then I created a data frame by joining them by ID.

The problem is when I have this:

<dl>
<dt>
<dd>
<dd>
<dd>

That is, there can be more than one event per day. How can I create my data.frame, taking into account that a single <dt> might be followed by several <dd> elements? Thanks!
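To make the misalignment concrete, here is a small sketch with made-up values (not taken from the real page), assuming the join was done with base merge(). With two days but three events, the positional ID pairs the second event with the wrong day and leaves the third without one:

```r
# Hypothetical scraped columns: 2 <dt> days, 3 <dd> events
day  <- data.frame(day  = c("Monday", "Tuesday"), stringsAsFactors = FALSE)
hour <- data.frame(hour = c("09:00", "11:00", "10:00"), stringsAsFactors = FALSE)

# Positional IDs, as in the question
day$ID  <- seq.int(nrow(day))
hour$ID <- seq.int(nrow(hour))

# Join by ID (assumption: base merge(), keeping unmatched rows)
joined <- merge(day, hour, by = "ID", all = TRUE)
print(joined)
```

In this toy data the 11:00 event belongs to Monday, yet it lands next to Tuesday, which is why the row position alone cannot link a <dd> to its <dt>.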

Answer 1:

Scraping dl/dt/dd is one of those "why did the HTML creators do this to us" kinds of things. This should get you what you want:

library(rvest)
library(tidyverse)

pg <- read_html("http://www.presidencia.pt/?idc=11&fano=2016")

# grab ALL the dt/dd elements under each dl into one big node list
entries <- html_nodes(pg, xpath=".//dl[@id='ms_agend3']/*")

# this finds all of the "dt" elements
starts <- which(xml_name(entries) == "dt")

# for each "dt", the index of its last "dd" (just before the next "dt", or the end of the list)
ends <- c(starts[-1]-1, length(entries))

# it took 30s for me, so progress bars make the time pass visually
pb <- progress_estimated(length(starts))

# now we iterate over the start/end pairs
map2_df(starts, ends, ~{

  pb$tick()$print() # tick off the progress bar

  # we're only going to work on the part of the node list for this dt/dd set
  start <- .x
  end <- .y

  # get the day
  dt <- html_text(entries[start], trim=TRUE)

  # now iterate over each associated dd and pull out the info
  map_df((start+1):end, ~{
    tibble( # data_frame() is deprecated in current tibble releases
      hour = html_text(html_node(entries[.x], "div.hora"), trim=TRUE),
      text = html_text(html_node(entries[.x], "div.texto"), trim=TRUE)
    )
  }) %>% 
    mutate(day = dt) # add the day in

}) %>% 
  select(day, hour, text) -> agenda # rearrange and store

It's a tad slow due to the way it builds data frames, but it will capture the day/hour/text of each agenda entry (including the blank hours, which I assume are informational or all-day events).
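If you want to try the start/end bookkeeping without hitting the network, here is a minimal offline sketch. The HTML string, the "agenda" id and the .hour/.event class names are invented for illustration and only loosely mirror the structure in the question; the indexing logic is the same as above.

```r
library(rvest)
library(xml2)
library(purrr)
library(dplyr)   # map2_df() relies on dplyr::bind_rows()
library(tibble)

# Hypothetical markup: one <dt> per day, one or more <dd> per event
html <- '
<dl id="agenda">
  <dt>Monday</dt>
  <dd><div class="hour">09:00</div><div class="event">Breakfast</div></dd>
  <dd><div class="hour">11:00</div><div class="event">Meeting</div></dd>
  <dt>Tuesday</dt>
  <dd><div class="hour">10:00</div><div class="event">Visit</div></dd>
</dl>'

pg <- read_html(html)

# all element children of the <dl>, in document order: dt dd dd dt dd
entries <- html_nodes(pg, xpath = ".//dl[@id='agenda']/*")

starts <- which(xml_name(entries) == "dt")    # 1 4
ends   <- c(starts[-1] - 1, length(entries))  # 3 5

agenda <- map2_df(starts, ends, function(start, end) {
  dds <- entries[(start + 1):end]             # the <dd> nodes for this day
  tibble(
    day   = html_text(entries[start], trim = TRUE),
    # html_node() returns exactly one result per <dd> (NA text if missing),
    # so hour and event always stay the same length
    hour  = html_text(html_node(dds, ".hour"), trim = TRUE),
    event = html_text(html_node(dds, ".event"), trim = TRUE)
  )
})

print(agenda)
```

Like the code above, this assumes every <dt> is followed by at least one <dd>; an empty day would need a guard around (start + 1):end.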

This:

pb <- progress_estimated(length(starts))
map2_df(starts, ends, ~{

  pb$tick()$print()

  start <- .x
  end <- .y

  tibble( # data_frame() is deprecated in current tibble releases
    hour = html_text(html_nodes(entries[(start+1):end], "div.hora"), trim=TRUE),
    text = html_text(html_nodes(entries[(start+1):end], "div.texto"), trim=TRUE),
    day = html_text(entries[start], trim=TRUE)
  )

}) %>% 
  select(day, hour, text) -> agenda

is a bit faster and produces the same results as far as my eyes tell me.
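A different technique, not used in the code above, is to skip the loop entirely and let XPath look up the nearest preceding <dt> for each <dd>. This is only a sketch on invented markup (the "agenda" id and the .hour/.event classes are assumptions), and I have not timed it against the real page:

```r
library(rvest)
library(tibble)

# Hypothetical markup with the same dt/dd layout as before
html <- '
<dl id="agenda">
  <dt>Monday</dt>
  <dd><div class="hour">09:00</div><div class="event">Breakfast</div></dd>
  <dd><div class="hour">11:00</div><div class="event">Meeting</div></dd>
  <dt>Tuesday</dt>
  <dd><div class="hour">10:00</div><div class="event">Visit</div></dd>
</dl>'

pg  <- read_html(html)
dds <- html_nodes(pg, xpath = ".//dl[@id='agenda']/dd")

agenda <- tibble(
  # preceding-sibling is a reverse axis, so [1] picks the closest <dt> before each <dd>
  day   = html_text(html_node(dds, xpath = "preceding-sibling::dt[1]"), trim = TRUE),
  hour  = html_text(html_node(dds, ".hour"), trim = TRUE),
  event = html_text(html_node(dds, ".event"), trim = TRUE)
)

print(agenda)
```

Because html_node() yields one result per <dd>, all three columns have one value per event, and no start/end indices are needed.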

Source: https://stackoverflow.com/questions/47572750/scraping-html-text-from-a-dl-tag

Tags

web-scraping

rvest
