web-scraping

LoadError: cannot load such file -- capybara (stand-alone code)

独自空忆成欢 submitted on 2021-02-08 15:32:53

Question: I'm working on building a simple post miner using Ruby, following this tutorial: http://ngauthier.com/2014/06/scraping-the-web-with-ruby.html. Here is the code I currently have:

#!/usr/bin/ruby
require 'capybara'
require 'capybara/poltergeist'
include Capybara::DSL

Capybara.default_driver = :poltergeist

visit "http://dilloncarter.com"

all(".posts .post ").each do |post|
  title = post.find("h1 a").text
  url = post.find("h1 a")["href"]
  date = post.find("a")["datetime"]
  summary = post.find("p

Passing web data into Beautiful Soup - Empty list

守給你的承諾、 submitted on 2021-02-08 13:47:22

Question: I've rechecked my code and compared it against similar examples of opening a URL and passing web data into Beautiful Soup, but for some reason my code just doesn't return anything, even though it appears to be in the correct form:

>>> from bs4 import BeautifulSoup
>>> from urllib3 import poolmanager
>>> connectBuilder = poolmanager.PoolManager()
>>> content = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')
>>> content
<urllib3.response.HTTPResponse object at 0x00000000032EC390>
>>> soup =
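For reference, the usual cause of an empty result in this situation is passing the `HTTPResponse` object itself to BeautifulSoup rather than the response body: the object's string form is just `<urllib3.response.HTTPResponse object at ...>`, which contains no tags to find. A minimal sketch of the distinction, using a canned HTML string in place of the live page (with urllib3 the body would come from `response.data`):

```python
from bs4 import BeautifulSoup

# Stand-in for the body bytes a urllib3 response exposes as `.data`;
# with a live request this would be connectBuilder.urlopen(...).data.
body = b"<html><head><title>Beautiful Soup</title></head><body></body></html>"

# Parse the *body*, not the response object itself.
soup = BeautifulSoup(body, "html.parser")
print(soup.title.string)  # -> Beautiful Soup
```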

Python - find a substring between two strings based on the last occurrence of the latter string

£可爱£侵袭症+ submitted on 2021-02-08 12:12:07

Question: I am trying to find a substring which is between two strings. The first string is <br> and the last string is <br><br>. The first string I look for is repetitive, while the latter string can serve as an anchor. Here is an example:

<div class="linkTabBl" style="float:left;padding-top:6px;width:240px">
  Anglo American plc <br>
  20 Carlton House Terrace <br>
  SW1Y 5AN London <br>
  United Kingdom <br><br>
  Phone : +44 (0)20 7968 8888 <br>
  Fax : +44 (0)20 7968 8500 <br>
  Internet : <a class="pageprofil
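One way to express "the text between the last single <br> and the <br><br> anchor" without a parser is to split on the anchor first and then take the last <br>-separated chunk. A sketch on a shortened version of the snippet above:

```python
# Shortened version of the HTML fragment from the question.
html = ("Anglo American plc <br> 20 Carlton House Terrace <br> "
        "SW1Y 5AN London <br> United Kingdom <br><br> "
        "Phone : +44 (0)20 7968 8888")

# Everything before the first "<br><br>" anchor...
head = html.split("<br><br>")[0]
# ...then the last single-<br>-separated chunk within it.
country = head.split("<br>")[-1].strip()
print(country)  # -> United Kingdom
```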

Python Selenium Traceback (most recent call last):

帅比萌擦擦* submitted on 2021-02-08 12:01:32

Question: I'm trying to use Selenium for a Python web scraper, but when I try to run the program I get the following error:

"/Applications/Python 3.8/IDLE.app/Contents/MacOS/Python" "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py" --multiproc --qt-support=auto --client 127.0.0.1 --port 52548 --file /Users/xxxx/git/python/python_crawler_example_01/naver_crawling.py
pydev debugger: process 3004 is connecting
Connected to pydev debugger (build 192.7142.56)
Traceback (most recent call last):

How to print paragraphs and headings simultaneously while scraping in Python?

坚强是说给别人听的谎言 submitted on 2021-02-08 11:52:06

Question: I am a beginner in Python. I am currently using Beautiful Soup to scrape a website.

str = ''  # my_url
source = urllib.request.urlopen(str)
soup = bs.BeautifulSoup(source, 'lxml')
match = soup.find('article', class_='xyz')
for paragraph in match.find_all('p'):
    str += paragraph.text + "\n"

My tag structure:

<article class="xyz">
  <h4>dr</h4>
  <p>efkl</p>
  <h4>dr</h4>
  <p>efkl</p>
  <h4>dr</h4>
  <p>efkl</p>
  <h4>dr</h4>
  <p>efkl</p>
</article>

I am getting output like this (as I am able to extract the
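Since the loop above only iterates over <p> tags, the <h4> headings are skipped. With BeautifulSoup (as in the question), `find_all` also accepts a list of tag names and returns matches in document order, which interleaves headings and paragraphs. A sketch on a shortened version of the tag structure above:

```python
from bs4 import BeautifulSoup

html = """<article class="xyz">
<h4>dr</h4><p>efkl</p>
<h4>dr</h4><p>efkl</p>
</article>"""

soup = BeautifulSoup(html, "html.parser")
match = soup.find("article", class_="xyz")

# find_all accepts a list of tag names and preserves document order,
# so headings and paragraphs come out interleaved.
out = "\n".join(tag.text for tag in match.find_all(["h4", "p"]))
print(out)
```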

R: Webscraping various <div>-classes into lists with (sub-)elements

♀尐吖头ヾ submitted on 2021-02-08 11:49:16

Question: I use rvest to scrape this website. It contains data in the following (simplified) form:

<div class="editor-type">Editors</div>
<div class="editor">
  <div class="editor-name"><h3>Otto Heath</h3></div>
  <span class="editor-affiliation">Royal Holloway University of London</span>
</div>
<div class="editor">
  <div class="editor-name"><h3>Kathrin Smets</h3></div>
  <span class="editor-affiliation">Royal Holloway University of London</span>
</div>
<div class="editor-type">Associate Editor</div>
<div class=
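The core of the task is grouping: attach each editor block to the most recent preceding editor-type header. The same walk can be sketched in Python (regex-based for brevity, on the simplified HTML above); in R/rvest the equivalent would iterate over sibling nodes, but the grouping logic is identical:

```python
import re

# Simplified HTML from the question.
html = """<div class="editor-type">Editors</div>
<div class="editor">
<div class="editor-name"><h3>Otto Heath</h3></div>
<span class="editor-affiliation">Royal Holloway University of London</span>
</div>
<div class="editor">
<div class="editor-name"><h3>Kathrin Smets</h3></div>
<span class="editor-affiliation">Royal Holloway University of London</span>
</div>
<div class="editor-type">Associate Editor</div>"""

# Match type headers, names, and affiliations in document order.
pattern = (r'editor-type">([^<]+)</div>'
           r'|<h3>([^<]+)</h3>'
           r'|editor-affiliation">([^<]+)</span>')

groups, current, name = {}, None, None
for typ, nm, aff in re.findall(pattern, html):
    if typ:        # a new section header starts a new group
        current = typ
        groups[current] = []
    elif nm:       # remember the editor's name...
        name = nm
    else:          # ...and pair it with the matching affiliation
        groups[current].append((name, aff))

print(groups["Editors"])
```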

Trying to use rvest to loop a command to scrape tables from multiple pages

穿精又带淫゛_ submitted on 2021-02-08 11:31:30

Question: I'm trying to scrape HTML tables for different football teams. Here is the table I want to scrape; I want to pull that same table for all of the teams to ultimately create a single CSV file with the player names and their data. http://www.pro-football-reference.com/teams/tam/2016_draft.htm

# teams
teams <- c("ATL", "TAM", "NOR", "CAR", "GNB", "DET", "MIN", "CHI", "SEA", "CRD",
           "RAM", "NWE", "MIA", "BUF", "NYJ", "KAN", "RAI", "DEN", "SDG", "PIT",
           "RAV", "SFO", "CIN", "CLE",
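The looping pattern itself (build one URL per team code from the example page's pattern, scrape each, accumulate, write one CSV) can be sketched as follows. This is a Python sketch of the same structure the R code is aiming for; the `scrape_rows` function is a placeholder, not a real fetch:

```python
import csv
import io

teams = ["ATL", "TAM", "NOR", "CAR"]  # shortened; the question lists all teams

def draft_url(team):
    # URL pattern taken from the example page in the question.
    return f"http://www.pro-football-reference.com/teams/{team.lower()}/2016_draft.htm"

def scrape_rows(url):
    # Placeholder for the real scrape (fetch `url` and parse the draft table);
    # returns dummy rows so the loop/accumulate pattern is runnable offline.
    team = url.rsplit("/teams/", 1)[1].split("/")[0]
    return [{"team": team, "player": "..."}]

# One loop, one accumulated list of rows, one CSV at the end.
rows = [row for team in teams for row in scrape_rows(draft_url(team))]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["team", "player"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```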

Why an XPath copied from Chrome does not work

隐身守侯 submitted on 2021-02-08 11:19:27

Question: I am trying to scrape data from Web of Science, and here is the specific page I am going to work with. Below is the code I use to extract the abstract:

import requests
from lxml import etree

url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
d = s.get(url)
soup1 = etree.HTML(d.text)

And here is the XPath I got through Copy XPath in Chrome:

//*[@id="records_form"]/div
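A common reason Chrome-copied XPaths fail is that Chrome builds them against the rendered DOM, while the raw HTML that requests fetches may lack those nodes (script-generated content, or elements like <tbody> that the browser inserts). A shorter relative XPath keyed on a stable attribute is usually sturdier. A sketch on canned HTML (the element names and classes here are illustrative, not taken from the Web of Science page):

```python
from lxml import etree

# Canned stand-in for the fetched page.
html = ('<form id="records_form">'
        '<div class="abstract">Some abstract text</div>'
        '</form>')
tree = etree.HTML(html)

# Absolute, Chrome-style path - brittle if the fetched DOM differs:
abs_hit = tree.xpath('//*[@id="records_form"]/div/text()')
# Relative path keyed on a class attribute - sturdier:
rel_hit = tree.xpath('//div[@class="abstract"]/text()')
print(abs_hit, rel_hit)
```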