Question
I am trying to extract and print URLs and their names (the text between <a href='url' title='smth'> and </a>)
from an HTML file saved on disk, without using BeautifulSoup or any other library. Just beginner-level Python code.
The desired print format is:
http://..filepath/filename.pdf
File's Name
and so on...
I was able to extract and print either all the URLs or all the names on their own, but I fail to collect the names, which appear a little further along in the markup just before the closing tag, and print each one below its URL. My code gets messy and I am pretty stuck. This is my code so far:
import os

with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
    txt = html.read()
# the "with" block closes the file, so no html.close() is needed

# for urls
nolp = 0
urlarrow = []
while nolp == 0:
    pos = txt.find("href")
    if pos >= 0:
        # drop everything before "href"
        txtcount = len(txt)
        txt = txt[pos:txtcount]
        # skip to just after the opening double quote
        pos = txt.find('"')
        txtcount = len(txt)
        txt = txt[pos+1:txtcount]
        # the url runs up to the closing double quote
        pos = txt.find('"')
        url = txt[0:pos]
        if url.startswith("http") and url.endswith("pdf"):
            urlarrow.append(url)
    else:
        nolp = 1

for item in urlarrow:
    print(item)

# for names
# (almost identical code to the above)
How can I make it work? I think I need to combine the two loops into one function (def), but how? P.S. I posted an answer below, but I think there may be a simpler and more Pythonic solution.
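One way to combine the two loops, sketched here under some assumptions: walk the text once, and for each <a ...> tag take the href value and the text up to the next </a> at the same time, so every name stays attached to its own URL. The names extract_links and print_links are made up for this illustration. The sketch assumes lowercase tags written as "<a ", quoted href values (single or double quotes), no ">" inside attribute values, and no nested markup inside the link text.

```python
import os


def extract_links(html_text):
    """Return (url, name) pairs for <a href="...">NAME</a> links to http...pdf URLs."""
    pairs = []
    pos = 0
    while True:
        # Find the next opening <a ...> tag; stop when there is none left.
        tag_start = html_text.find('<a ', pos)
        if tag_start == -1:
            break
        tag_end = html_text.find('>', tag_start)
        if tag_end == -1:
            break
        close = html_text.find('</a>', tag_end)
        if close == -1:
            break

        tag = html_text[tag_start:tag_end]           # e.g. <a href="..." title="..."
        name = html_text[tag_end + 1:close].strip()  # text between > and </a>
        pos = close + len('</a>')                    # continue after this link

        # Pull the quoted href value out of the tag, if it has one.
        href = tag.find('href=')
        if href == -1:
            continue
        quote = tag[href + 5:href + 6]               # opening quote: ' or "
        if quote not in ('"', "'"):
            continue
        url_end = tag.find(quote, href + 6)
        if url_end == -1:
            continue
        url = tag[href + 6:url_end]

        # Same filter as in the code above: keep only http...pdf links.
        if url.startswith('http') and url.endswith('pdf'):
            pairs.append((url, name))
    return pairs


def print_links(path):
    """Read the HTML file at path and print each URL followed by its name."""
    with open(os.path.expanduser(path), 'r') as html:
        for url, name in extract_links(html.read()):
            print(url)
            print(name)
```

Calling print_links('~/SomeFolder/page.html') would then print each URL with its name on the line below it. Because the URL and the name come from the same tag, there is no need to keep two lists aligned afterwards.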
Answer 1:
This produces the output I need, but I am sure there is a better way.
import os

# Read the file once; the "with" block closes it, and both loops below
# start from their own copy of the same text.
with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
    txt = html.read()
text = txt

# for urls
nolp = 0
urlarrow = []
while nolp == 0:
    pos = txt.find("href")
    if pos >= 0:
        txtcount = len(txt)
        txt = txt[pos:txtcount]
        pos = txt.find('"')
        txtcount = len(txt)
        txt = txt[pos+1:txtcount]
        pos = txt.find('"')
        url = txt[0:pos]
        if url.startswith("http") and url.endswith("pdf"):
            urlarrow.append(url)
    else:
        nolp = 1

# for names
noloop = 0
namearrow = []
while noloop == 0:
    posB = text.find("title")
    if posB >= 0:
        textcount = len(text)
        text = text[posB:textcount]
        posB = text.find('"')
        textcount = len(text)
        text = text[posB+19:textcount]  # because string starts 19 chars after the posB
        posB = text.find('</')
        name = text[1:posB]
        if text.startswith('>'):
            namearrow.append(name)
    else:
        noloop = 1

# interleave the two lists: each url is followed by its name
fullarrow = []
for pair in zip(urlarrow, namearrow):
    for item in pair:
        fullarrow.append(item)

for instance in fullarrow:
    print(instance)
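As for a "better way": the code above collects URLs and names in two separate passes and pairs them with zip, which only lines up if every title found corresponds to a URL that passed the pdf filter. A possible alternative, offered as a sketch and not as the definitive solution, is the html.parser module, which ships with Python and needs no installation; whether that counts as "no library" for the asker is an assumption. PdfLinkParser and extract_pdf_links are names invented for this example.

```python
from html.parser import HTMLParser


class PdfLinkParser(HTMLParser):
    """Collect (url, name) pairs for <a> tags whose href is an http...pdf URL."""

    def __init__(self):
        super().__init__()
        self.links = []    # collected (url, name) pairs
        self._url = None   # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == 'a':
            self._url = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> tag.
        if self._url is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._url is not None:
            url = self._url
            if url.startswith('http') and url.endswith('pdf'):
                self.links.append((url, ''.join(self._text).strip()))
            self._url = None


def extract_pdf_links(html_text):
    """Parse html_text and return the list of (url, name) pairs."""
    parser = PdfLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

With txt read from the file as above, the output format from the question would come from looping over extract_pdf_links(txt) and printing url and then name for each pair. The parser deals with quoting, attribute order and upper-case tags, so the fixed 19-character offset is no longer needed.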
Source: https://stackoverflow.com/questions/40952892/extract-url-their-names-of-an-html-file-stored-on-disk-and-print-them-respecti