How do I find html_node on search form?

Submitted by 会有一股神秘感。 on 2020-05-28 05:40:28

Question


I have a list of names (first name, last name, and date-of-birth) that I need to search the Fulton County Georgia (USA) Jail website to determine if a person is in or released from jail.

The website is http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400

The site requires you enter a last name and first name, then it gives you a list of results.

I have found some Stack Overflow posts that have given me some direction, but I'm still struggling to figure this out. I'm using this post as an example to follow, and I am using SelectorGadget to help figure out the CSS selectors.

Here is the code I have so far. Right now I can't figure out what html_node to use.

library(rvest)

# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"

# start session
jail <- html_session(fc.url)

# Grab initial form
form.unfilled <- jail %>% html_node("form")

form.unfilled

The result I get from form.unfilled is {xml_missing} <NA>, which I know isn't right.

I think if I can figure out the html_node value, I can proceed to using set_values and submit_form.

Thanks.


Answer 1:


It appears that the initial request lands on "http://justice.fultoncountyga.gov/PAJailManager/default.aspx" rather than on the search page. Once the session has been started there, you should be able to jump to the search page:

library(rvest)

# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"

# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
# Jump to the search page (fc.url is defined above)
jail2 <- jail %>% jump_to(fc.url)

#list the form's fields
html_form(jail2)[[1]]

# Grab initial form
form.unfilled <- jail2 %>% html_node("form")
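
The {xml_missing} result in the question is what html_node returns when the selector matches nothing, which is consistent with the first request having landed on a page that has no form. Here is a minimal offline sketch of that check. The two HTML snippets are made-up stand-ins for the landing page and the search page, not the real site's markup, and the variable names are only illustrative.

```r
library(rvest)

# Hypothetical stand-in for a landing page that contains no <form>
landing <- read_html("<p>Click <a href='#'>Jail Records</a> to continue.</p>")

# Hypothetical stand-in for a search page that does contain a <form>
search <- read_html(
  "<form action='/search' method='post'>
     <input type='text' name='LastName'>
     <input type='text' name='FirstName'>
     <input type='submit' name='SearchSubmit' value='Search'>
   </form>"
)

# html_node() returns an xml_missing object when nothing matches
no.form  <- landing %>% html_node("form")
has.form <- search %>% html_node("form")

# TRUE means the selector found nothing on that page
inherits(no.form, "xml_missing")
inherits(has.form, "xml_missing")
```

On a live session, comparing jail$url with the URL you requested is one way to see whether you were sent somewhere else before looking for the form.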

Note: Verify that your actions are within the website's terms of service. Many sites have policies against scraping.




Answer 2:


The website relies heavily on JavaScript to render itself. When you open the link you provided in a fresh browser instance, you are redirected to http://justice.fultoncountyga.gov/PAJailManager/default.aspx, where you have to click the "Jail Records" link. This executes a bit of JavaScript that sends you to the page with the form.

rvest is unable to execute arbitrary JavaScript. You might have to look at RSelenium. Selenium basically remote-controls a browser (for example Firefox or Chrome), which executes the JavaScript as intended.




Answer 3:


Thanks to Dave2e.

Here is the code that works. This question is answered, but I'll post another one because I'm not getting a table of data as a result.

Note: I cannot find any Terms of Service on the site that I'm querying.

library(rvest)

# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")

#list the form's fields
html_form(jail2)[[1]]


# Grab initial form
form.unfilled <- jail2 %>% html_node("form") %>% html_form()

form.unfilled

#name values
lname <- "DOE"
fname <- "JOHN"

# Fill the form with name values
form.filled <- form.unfilled %>% 
        set_values("LastName" = lname,
                   "FirstName" = fname)

#Submit form
r <- submit_form(jail2, form.filled,
            submit = "SearchSubmit")

#grab tables from submitted form
# "tables" avoids shadowing base R's table() function
tables <- r %>% html_nodes("table")

# Grab a table with some data
tables[[5]] %>% html_table()

# resulting text in this table:
# " An error occurred while processing your request.Please contact your system administrator."
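
Before submitting, it can help to check what the filled form actually contains. Below is a minimal offline sketch that assumes a made-up form whose visible field names mirror the ones used above. The markup and the HiddenToken field are hypothetical, not the real site's HTML, and this does not by itself explain the error message.

```r
library(rvest)

# Hypothetical form markup; HiddenToken stands in for any hidden field
page <- read_html(
  '<form action="/search" method="post">
     <input type="hidden" name="HiddenToken" value="abc123">
     <input type="text" name="LastName">
     <input type="text" name="FirstName">
     <input type="submit" name="SearchSubmit" value="Search">
   </form>'
)

# Parse the form, then fill in the two name fields
form.unfilled <- page %>% html_node("form") %>% html_form()
form.filled <- form.unfilled %>%
  set_values(LastName = "DOE", FirstName = "JOHN")

# Collect the value each field currently holds (NA when a field has none)
field.values <- vapply(
  form.filled$fields,
  function(f) if (is.null(f$value)) NA_character_ else as.character(f$value),
  character(1)
)

field.values
```

The hidden field keeps its original value alongside the two names that were set, so printing field.values shows everything the form would carry when submitted.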


Source: https://stackoverflow.com/questions/61819491/how-do-i-find-html-node-on-search-form
