bs4 again from website and save to text file

Submitted by 久未见 on 2020-01-15 15:38:46

Question


I am learning how to extract data from websites and have managed to get a lot of information so far. However, my code fails on the next website for some unknown reason: nothing is saved to the text file, and I get no output from print. Here is my code:

import json
import urllib.request
from bs4 import BeautifulSoup
import requests


url = 'https://www.jaffari.org/'
request = urllib.request.Request(url,headers={'User-Agent': 'Mozilla/5.0'})
response = urllib.request.urlopen(request)
html = response.read()
soup = BeautifulSoup(html.decode("utf-8"), "html.parser")

table = soup.find('div', attrs={"class":"textwidget"})
name = table.text.encode('utf-8').strip()

with open('/home/pi/test.txt', 'w') as outfile:
    json.dump(name, outfile)
print (name)

Can anyone help please?


Answer 1:


The prayer times are rendered by JavaScript, so you need a browser automation tool such as Selenium to load the page, and then use Beautiful Soup to extract the data.

You need to download a compatible ChromeDriver and pass the ChromeDriver path as I have done in the code.

The code below fetches the names and prayer times and saves them to a text file.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import re

options = Options()
# Run Chrome in headless mode.
options.add_argument("--headless")
# Path of the ChromeDriver executable (raw string so the backslashes stay literal).
# Selenium 4 takes the path through Service; Selenium 3 used executable_path=... instead.
driver = webdriver.Chrome(service=Service(executable_path=r"D:\Software\chromedriver.exe"), options=options)
driver.get('https://www.jaffari.org/')
# Wait up to 20 seconds for the JavaScript-rendered table to become visible.
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.sidebar-widget.widget_text>div>table')))
print("Data rendered successfully!!!")
# Get the page source
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
# Shut down the browser and the driver process.
driver.quit()

# The with block closes the file automatically.
with open('testPrayers.txt', 'w') as outfile:
    for row in soup.select("div.sidebar-widget.widget_text>div>table tr"):
        name = row.select("td")[0].text.strip()
        # Match a trailing time such as "5:43 AM" in the second cell.
        time = re.findall(r'(\d{1,2}:?\d{1,2}\W[AP]M$)', row.select("td")[1].text.strip())
        outfile.write(name + " " + time[0] + "\n")
        print(name + " " + time[0])

print('Done')
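
To see what the regular expression in the loop does without launching a browser, here is a minimal sketch that runs the same pattern on a few sample strings. The helper name extract_time and the sample cell texts are made up for illustration; they are not taken from the real page. The character class is written as [AP], because [A|P] would also accept a literal "|".

```python
import re

# Same pattern as in the answer, written as a raw string.
TIME_RE = re.compile(r'(\d{1,2}:?\d{1,2}\W[AP]M$)')

def extract_time(cell_text):
    """Return the trailing time such as '5:43 AM', or None if there is none."""
    matches = TIME_RE.findall(cell_text.strip())
    return matches[0] if matches else None

# Hypothetical cell contents, not copied from the real page.
print(extract_time("Fajr 5:43 AM"))   # 5:43 AM
print(extract_time("  12:31 PM "))    # 12:31 PM
print(extract_time("Prayer"))         # None
```

Returning None instead of indexing time[0] directly is one way to skip rows that have no time in them (a header row, for example) rather than stopping with an IndexError.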

Updated version: each prayer time is written to its own file, named after the prayer.


from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import re

options = Options()
# Run Chrome in headless mode.
options.add_argument("--headless")
# Path of the ChromeDriver executable (raw string so the backslashes stay literal).
# Selenium 4 takes the path through Service; Selenium 3 used executable_path=... instead.
driver = webdriver.Chrome(service=Service(executable_path=r"D:\Software\chromedriver.exe"), options=options)
driver.get('https://www.jaffari.org/')
# Wait up to 20 seconds for the JavaScript-rendered table to become visible.
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.sidebar-widget.widget_text>div>table')))
print("Data rendered successfully!!!")
# Get the page source
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
# Shut down the browser and the driver process.
driver.quit()

for row in soup.select("div.sidebar-widget.widget_text>div>table tr"):
    name = row.select("td")[0].text.strip()
    # Match a trailing time such as "5:43 AM" in the second cell.
    time = re.findall(r'(\d{1,2}:?\d{1,2}\W[AP]M$)', row.select("td")[1].text.strip())

    print(name + " " + time[0])
    # One file per prayer; the with block closes the file automatically.
    with open(name + '.txt', 'w') as outfile:
        outfile.write(time[0])

print('Done')
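
Because the file name comes straight from text on the page, it may contain characters that are not valid in a file name (a slash, for instance). A possible way to guard against that is sketched below; safe_filename and write_time are hypothetical helpers, and the sample row is invented, so treat this as an illustration rather than part of the original answer.

```python
import os
import re
import tempfile

def safe_filename(name):
    """Replace runs of characters that are unsafe in file names with '_'."""
    cleaned = re.sub(r'[^\w.-]+', '_', name.strip()).strip('_')
    return (cleaned or 'unnamed') + '.txt'

def write_time(directory, name, time_text):
    """Write one time to <directory>/<sanitized name>.txt and return the path."""
    path = os.path.join(directory, safe_filename(name))
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(time_text)
    return path

# Hypothetical row, written to a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as tmp:
    path = write_time(tmp, "Maghrib / Isha", "6:05 PM")
    print(os.path.basename(path))  # Maghrib_Isha.txt
```

In the loop above, this would mean calling write_time('.', name, time[0]) in place of the open(name + '.txt', 'w') block.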



Answer 2:


The name variable needs to be a string rather than a bytes object. Try:

with open('/home/pi/test.txt', 'w') as outfile:
    json.dump(name.decode(), outfile)
print (name.decode())
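
The difference can be reproduced without touching the website. The sketch below uses an invented sample string and an in-memory buffer in place of /home/pi/test.txt; it only illustrates why json.dump rejects bytes and accepts str.

```python
import io
import json

text = "Fajr 5:43 AM"            # hypothetical scraped text (a str)
as_bytes = text.encode('utf-8')  # what the question's code passes to json.dump

# json cannot serialize bytes, so this raises TypeError.
try:
    json.dump(as_bytes, io.StringIO())
except TypeError as exc:
    print("bytes failed:", type(exc).__name__)

# Decoding (or never encoding in the first place) gives a str, which works.
buffer = io.StringIO()
json.dump(as_bytes.decode('utf-8'), buffer)
print(buffer.getvalue())
```

In Python 3, table.text is already a str, so dropping the .encode('utf-8') call in the question's code should have the same effect as decoding afterwards.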

Hope it helps.



Source: https://stackoverflow.com/questions/59698151/bs4-again-from-website-and-save-to-text-file
