findall | 易学教程

How to print paragraphs and headings simultaneously while scraping in Python?

Question: I am a beginner in Python and am currently using BeautifulSoup to scrape a website.

    str='' #my_url
    source = urllib.request.urlopen(str);
    soup = bs.BeautifulSoup(source,'lxml');
    match=soup.find('article',class_='xyz');
    for paragraph in match.find_all('p'):
        str+=paragraph.text+"\n"

My tag structure:

    <article class="xyz" >
    <h4>dr</h4> <p>efkl</p>
    <h4>dr</h4> <p>efkl</p>
    <h4>dr</h4> <p>efkl</p>
    <h4>dr</h4> <p>efkl</p>
    </article>

I am getting output like this (as I am able to extract the

soup.findAll returning empty list

Question: I am trying to scrape with soup and am getting an empty list when I call findAll.

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup
    my_url='https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=70KutR16JmLgr7Ka%2F385RFXrzDpOkSqx%2FRC3DnlU09%2BYcw0pR5cfIfC0kOlQywiD%2BTEe7ppq8ENXglbpqA8sDUtif1h3ZjrEoQkV29%2B90iqljHi2gm2T%2BDZHH2%2FCNeKB%2BkVglbz%2BNx1bKsSfE5L6SVtckHxg%2FM%2F

re.findall() where I want all unique instances of the regex on the page

Question: As the title suggests, I want to run code like this (top_url_list is just a list of URLs I'm looping through to find instances of the filename conventions I'm looking for with regex):

    name_files = []
    for i in top_url_list:
        result = re.findall("\/([a-z]+[0-9][0-9]\W[a-z]+)", str(urlopen(i).read()))

The objective is to grab all of the instances where the regex matches, hence the findall() function. The problem is, it's important that I only get distinct/unique values of each instance
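
One possible way to keep only distinct matches is to pass the findall() result through dict.fromkeys(), which drops repeats while preserving first-seen order. This is a minimal sketch, not taken from the original thread: the `unique_matches` helper and the `sample_page` text are hypothetical stand-ins for the downloaded page content.

```python
import re

# Pattern from the question; the unnecessary escape before "/" is dropped.
PATTERN = re.compile(r"/([a-z]+[0-9][0-9]\W[a-z]+)")


def unique_matches(text):
    """Return each distinct match once, in order of first appearance."""
    return list(dict.fromkeys(PATTERN.findall(text)))


# Hypothetical stand-in for str(urlopen(i).read())
sample_page = "/abc12.txt /abc12.txt /xyz34-log /abc12.txt"
print(unique_matches(sample_page))  # ['abc12.txt', 'xyz34-log']
```

If order does not matter, set(PATTERN.findall(text)) would do the same job.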

How to get the hidden input's value by using python?

Question: How can I get an input value from an HTML page like this?

    <input type="hidden" name="captId" value="AqXpRsh3s9QHfxUb6r4b7uOWqMT" ng-model="captId">

I have the input name (name="captId") and need its value.

    import re , urllib , urllib2
    a = urllib2.urlopen('http://www.example.com/','').read()

Thanks.

Update 1: I installed BeautifulSoup and used it, but there are some errors. Code:

    import re , urllib , urllib2
    a = urllib2.urlopen('http://www.example.com/','').read()
    soup = BeautifulSoup(a)
    value = soup.find('input', {
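
For illustration, here is a sketch that reads the value with Python 3's standard-library html.parser instead of the urllib2/BeautifulSoup combination the question uses. This is a swapped-in technique, not the asker's approach. The `InputValueParser` class is a hypothetical helper, and the page is an inline string rather than a fetched URL.

```python
from html.parser import HTMLParser


class InputValueParser(HTMLParser):
    """Collect the value attribute of every <input> with a given name."""

    def __init__(self, target_name):
        super().__init__()
        self.target_name = target_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if attr_map.get("name") == self.target_name:
            self.values.append(attr_map.get("value"))


# Inline copy of the tag from the question, wrapped in a made-up <form>
page = ('<form><input type="hidden" name="captId" '
        'value="AqXpRsh3s9QHfxUb6r4b7uOWqMT" ng-model="captId"></form>')

parser = InputValueParser("captId")
parser.feed(page)
print(parser.values[0])  # AqXpRsh3s9QHfxUb6r4b7uOWqMT
```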

Python regex findall returns empty string when not asked

Question: I'm trying to extract salaries from a list of strings. I'm using the regex findall() function, but it returns many empty strings as well as the salaries, and this causes problems later in my code.

    sal= '41 000€ à 63 000€ / an' #this is a sample string for which i have errors
    regex = ' ?([0-9]* ?[0-9]?[0-9]?[0-9]?)'#this is my regex
    re.findall(regex,sal)[0] #returns '41 000' as expected

but:

    re.findall(regex,sal)[1] #returns: ''
    #Desired result : '63 000'
    #the whole list of matches is
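
The empty strings appear because every part of the pattern is optional, so it also matches at positions that hold no digits. A minimal sketch of two possible fixes follows. It assumes thousands are separated by a single ordinary space, as in the sample string; real data may use non-breaking spaces instead.

```python
import re

sal = '41 000€ à 63 000€ / an'

# Original pattern: everything is optional, so it can match the empty string.
loose = r' ?([0-9]* ?[0-9]?[0-9]?[0-9]?)'

# Stricter sketch: at least one digit, then optional groups of " ddd".
strict = r'[0-9]+(?: [0-9]{3})*'

strict_matches = re.findall(strict, sal)
print(strict_matches)  # ['41 000', '63 000']

# Alternative: keep the original pattern and drop the empty matches.
filtered = [m for m in re.findall(loose, sal) if m]
print(filtered)
```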

Finding regex in PDF with PDFminer (python) not working

Question: I'm trying to find occurrences of a regular expression in a short PDF. However, it doesn't work. I don't understand why, because if I search for a simple string I don't have problems. The text is rendered correctly. Here is my code:

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO
    import re
    def convert_pdf_to_txt(path): #\[\s

Regex separate urls in text that has no separators

Question: Apologies for yet another regex question! I have some input text which rather unhelpfully has multiple URLs (only URLs) all on one line with no separators:

    https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n

This example contains just two URLs, but there could be more. I'm trying to separate the URLs into a list using Python. I've
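
One way to approach this, sketched here with a hypothetical `split_urls` helper and made-up example.com URLs, is to match lazily from each scheme up to the next "http://" or "https://" using a lookahead. The sketch assumes no URL carries another full URL inside its own query string, since that would be split as well.

```python
import re


def split_urls(text):
    """Split a run of concatenated URLs at each "http://" or "https://"."""
    # Lazy match that stops just before the next scheme or at the end of the text
    return re.findall(r'https?://.+?(?=https?://|$)', text.strip())


# Made-up input in the same shape as the question's example
joined = ('https://a.example.com/download?x=1'
          'https://b.example.com/project/?authuser=1\n')
urls = split_urls(joined)
print(urls)
# ['https://a.example.com/download?x=1', 'https://b.example.com/project/?authuser=1']
```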

Python regular expression findall *

Question: I am not able to understand the behavior of the following code.

    >>> import re
    >>> text = 'been'
    >>> r = re.compile(r'b(e)*')
    >>> r.search(text).group()
    'bee' #makes sense
    >>> r.findall(text)
    ['e'] #makes no sense

I have read some existing questions and answers about capturing groups, but I am still confused. Could someone please explain?

Answer 1: The answer is simplified in the Regex HOWTO. As you can read here, group returns the string matched by the regular expression. group() returns the
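
The difference can be reproduced in a few lines. When the pattern has exactly one capturing group, findall() returns what that group captured, and a repeated group keeps only its last repetition. The variable names below are chosen for illustration only.

```python
import re

text = 'been'

# One capturing group: findall() returns the group, i.e. its last repetition.
with_group = re.findall(r'b(e)*', text)
print(with_group)  # ['e']

# Non-capturing group: findall() returns the whole match instead.
without_group = re.findall(r'b(?:e)*', text)
print(without_group)  # ['bee']

# Or keep the capturing group and ask each match object for group(0).
via_finditer = [m.group() for m in re.finditer(r'b(e)*', text)]
print(via_finditer)  # ['bee']
```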