findall

How to print paragraphs and headings simultaneously while scraping in Python?

坚强是说给别人听的谎言 submitted on 2021-02-08 11:52:06
Question: I am a beginner in Python. I am currently using BeautifulSoup to scrape a website.

    str = ''  # my_url
    source = urllib.request.urlopen(str)
    soup = bs.BeautifulSoup(source, 'lxml')
    match = soup.find('article', class_='xyz')
    for paragraph in match.find_all('p'):
        str += paragraph.text + "\n"

My tag structure:

    <article class="xyz">
      <h4>dr</h4>
      <p>efkl</p>
      <h4>dr</h4>
      <p>efkl</p>
      <h4>dr</h4>
      <p>efkl</p>
      <h4>dr</h4>
      <p>efkl</p>
    </article>

I am getting output like this (as I am able to extract the

soup.findAll returning empty list

梦想的初衷 submitted on 2021-01-29 19:10:15
Question: I am trying to scrape with soup and am getting an empty list when I call findAll.

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup
    my_url='https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=70KutR16JmLgr7Ka%2F385RFXrzDpOkSqx%2FRC3DnlU09%2BYcw0pR5cfIfC0kOlQywiD%2BTEe7ppq8ENXglbpqA8sDUtif1h3ZjrEoQkV29%2B90iqljHi2gm2T%2BDZHH2%2FCNeKB%2BkVglbz%2BNx1bKsSfE5L6SVtckHxg%2FM%2F

re.findall() where I want all unique instances of the regex on the page

最后都变了- submitted on 2021-01-27 16:21:18
Question: As the title suggests, I want to run code like this (top_url_list is just a list of URLs I'm looping through to find instances of the filename conventions I'm looking for with regex):

    name_files = []
    for i in top_url_list:
        result = re.findall("\/([a-z]+[0-9][0-9]\W[a-z]+)", str(urlopen(i).read()))

The objective is to grab every instance that matches the regex, hence the findall() function. The problem is, it's important that I only get distinct/uniques of each instance
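One possible way to keep only distinct matches, sketched under assumptions: the pattern is the one from the question, while the helper name `unique_matches` and the sample text are made up for illustration. `re.findall()` returns every match, repeats included, so the list is de-duplicated afterwards with `dict.fromkeys`, which preserves first-seen order.

```python
import re

# Pattern copied from the question; everything else here is illustrative.
PATTERN = re.compile(r"/([a-z]+[0-9][0-9]\W[a-z]+)")


def unique_matches(text):
    """Return each distinct match once, in order of first appearance."""
    # findall() yields all matches, including repeats.
    # dict.fromkeys() drops the repeats and keeps insertion order (Python 3.7+).
    return list(dict.fromkeys(PATTERN.findall(text)))


# Hypothetical page content with a repeated filename.
sample = "/data/abc12.csv /data/abc12.csv /data/xyz34.txt /data/abc12.csv"
print(unique_matches(sample))  # ['abc12.csv', 'xyz34.txt']
```

If order does not matter, `set(PATTERN.findall(text))` is a shorter alternative. To de-duplicate across several pages, the same idea applies to the combined list of matches.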

re.findall() where I want all unique instances of the regex on the page

拟墨画扇 submitted on 2021-01-27 16:17:38
Question: As the title suggests, I want to run code like this (top_url_list is just a list of URLs I'm looping through to find instances of the filename conventions I'm looking for with regex):

    name_files = []
    for i in top_url_list:
        result = re.findall("\/([a-z]+[0-9][0-9]\W[a-z]+)", str(urlopen(i).read()))

The objective is to grab every instance that matches the regex, hence the findall() function. The problem is, it's important that I only get distinct/uniques of each instance

How to get the hidden input's value by using python?

随声附和 submitted on 2020-07-05 08:37:05
Question: How can I get an input value from an HTML page like

    <input type="hidden" name="captId" value="AqXpRsh3s9QHfxUb6r4b7uOWqMT" ng-model="captId">

I have the input name [ name="captId" ] and need its value.

    import re, urllib, urllib2
    a = urllib2.urlopen('http://www.example.com/', '').read()

Thanks.

Update 1: I installed BeautifulSoup and used it, but there are some errors. Code:

    import re, urllib, urllib2
    a = urllib2.urlopen('http://www.example.com/', '').read()
    soup = BeautifulSoup(a)
    value = soup.find('input', {
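As a rough sketch of one way to read the attribute without any third-party package, the snippet below uses the standard library's `html.parser` instead of BeautifulSoup (a deliberate swap, not the approach from the question). It is written for Python 3, whereas the question uses Python 2's `urllib2`. The class name `InputValueFinder` is made up, and the HTML is passed in as a string rather than fetched.

```python
from html.parser import HTMLParser


class InputValueFinder(HTMLParser):
    """Remember the value attribute of the first <input> with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute, value) pairs.
        if tag == "input" and self.value is None:
            attr_map = dict(attrs)
            if attr_map.get("name") == self.name:
                self.value = attr_map.get("value")


# The tag from the question, wrapped in a hypothetical form element.
html = ('<form><input type="hidden" name="captId" '
        'value="AqXpRsh3s9QHfxUb6r4b7uOWqMT" ng-model="captId"></form>')

finder = InputValueFinder("captId")
finder.feed(html)
print(finder.value)  # AqXpRsh3s9QHfxUb6r4b7uOWqMT
```

`finder.value` stays `None` when no matching input exists, which may be worth checking before using the result.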

Python regex findall returns empty string when not asked

狂风中的少年 submitted on 2020-05-23 06:42:28
Question: I'm trying to extract salaries from a list of strings. I'm using the regex findall() function, but it's returning many empty strings as well as the salaries, and this is causing me problems later in my code.

    sal = '41 000€ à 63 000€ / an'  # this is a sample string for which I have errors
    regex = ' ?([0-9]* ?[0-9]?[0-9]?[0-9]?)'  # this is my regex
    re.findall(regex, sal)[0]  # returns '41 000' as expected

but:

    re.findall(regex, sal)[1]  # returns: ''
    # Desired result: '63 000'
    # the whole list of matches is
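Every part of the pattern in the question is optional, so it can match the empty string at each position that does not start a number, which is where the `''` entries come from. A minimal sketch of one alternative is to require at least one digit. It assumes that amounts are groups of digits separated by an ordinary space; real data may use a non-breaking space instead, which this pattern would not cover.

```python
import re

# Sample string from the question.
sal = '41 000€ à 63 000€ / an'

# One to three leading digits, then zero or more " ddd" thousands groups.
# At least one digit is mandatory, so empty matches cannot occur.
salary_pattern = re.compile(r'\d{1,3}(?: \d{3})*')

salaries = salary_pattern.findall(sal)
print(salaries)  # ['41 000', '63 000']
```

The group is non-capturing (`(?: ...)`) on purpose: with a capturing group, `findall()` would return only the group's text instead of the whole amount.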

Finding regex in PDF with PDFminer (python) not working

两盒软妹~` submitted on 2020-04-16 02:54:50
Question: I'm trying to find occurrences of a regular expression in a short PDF. However, it doesn't work. I don't understand why, because if I search for a simple string I have no problems. The text is rendered correctly. Here is my code:

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO
    import re

    def convert_pdf_to_txt(path): #\[\s

Finding regex in PDF with PDFminer (python) not working

这一生的挚爱 submitted on 2020-04-16 02:54:01
Question: I'm trying to find occurrences of a regular expression in a short PDF. However, it doesn't work. I don't understand why, because if I search for a simple string I have no problems. The text is rendered correctly. Here is my code:

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO
    import re

    def convert_pdf_to_txt(path): #\[\s

Regex separate urls in text that has no separators

。_饼干妹妹 submitted on 2020-01-30 11:28:30
Question: Apologies for yet another regex question! I have some input text which rather unhelpfully has multiple URLs (only URLs) all on one line with no separators:

    https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n

This example contains just two URLs, but there could be more. I'm trying to separate the URLs into a list using Python. I've
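A minimal sketch of one way to split such a line, assuming every URL starts with `http://` or `https://` and that no URL contains another scheme inside it (for example in a query string), since that would be split incorrectly. The function name `split_urls` and the shortened example URLs are made up for illustration.

```python
import re

# Match a scheme, then lazily take characters until the next scheme,
# a whitespace character, or the end of the text.
URL_PATTERN = re.compile(r"https?://.*?(?=https?://|\s|$)")


def split_urls(text):
    """Split a run of concatenated URLs into a list."""
    return URL_PATTERN.findall(text)


# Hypothetical stand-ins for the long URLs in the question.
line = "https://a.example/x?y=1https://b.example/z\n"
print(split_urls(line))  # ['https://a.example/x?y=1', 'https://b.example/z']
```

The lookahead `(?=...)` checks for the next boundary without consuming it, so the following URL still begins with its own scheme.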

Python regular expression findall *

☆樱花仙子☆ submitted on 2019-12-31 07:44:07
Question: I am not able to understand the following code behavior.

    >>> import re
    >>> text = 'been'
    >>> r = re.compile(r'b(e)*')
    >>> r.search(text).group()
    'bee' #makes sense
    >>> r.findall(text)
    ['e'] #makes no sense

I have read some existing questions and answers about capturing groups, but I am still confused. Could someone please explain?

Answer 1: The answer is simplified in the Regex Howto. As you can read here, group returns the string matched by the Regular Expression. group() returns the
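The difference can be seen side by side in a short sketch. When a pattern contains one capturing group, `findall()` returns that group's text for each match rather than the whole match. A repeated group keeps only its last repetition. The non-capturing variant below is an addition for comparison and is not part of the question.

```python
import re

text = 'been'

# Pattern from the question: one capturing group, repeated.
capturing = re.compile(r'b(e)*')
whole_match = capturing.search(text).group()   # the entire match
last_group = capturing.search(text).group(1)   # last repetition of group 1
group_list = capturing.findall(text)           # group 1 of each match

# Same pattern with a non-capturing group: findall() returns whole matches.
non_capturing = re.compile(r'b(?:e)*')
match_list = non_capturing.findall(text)

print(whole_match)  # bee
print(last_group)   # e
print(group_list)   # ['e']
print(match_list)   # ['bee']
```

So `['e']` is the content of group 1 for the single match `'bee'`. Switching to `(?:...)`, or wrapping the whole pattern in its own group, makes `findall()` return the full match.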