Extract URLs and their names from an HTML file stored on disk and print each URL with its name - Python

Question

I am trying to extract and print the URLs and their names (the text between <a href='url' title='smth'>NAME</a>) found in an HTML file saved on disk, without using BeautifulSoup or any other library. Just beginner-level Python code. The desired print format is:

http://..filepath/filename.pdf
File's Name
so on...

I was able to extract and print either all the URLs or all the names on their own, but I fail to collect the names, which appear a little further along in the markup, just before the closing tag, and print each one below its URL. My code is getting messy and I am pretty stuck. This is my code so far:

import os

with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
    txt = html.read()

# for urls
nolp = 0
urlarrow = []
while nolp == 0:
    pos = txt.find("href")
    if pos >= 0:
        txt = txt[pos:]        # skip ahead to the next href
        pos = txt.find('"')
        txt = txt[pos + 1:]    # skip past the opening quote
        pos = txt.find('"')
        url = txt[0:pos]       # everything up to the closing quote
        if url.startswith("http") and url.endswith("pdf"):
            urlarrow.append(url)
    else:
        nolp = 1
for item in urlarrow:
    print(item)

# for names
# (almost identical code to the above)

# no html.close() needed: the with block already closed the file

How can I make it work? I think I need to combine them into one function (def), but how? P.S. I posted an answer below, but I think there may be a simpler and more Pythonic solution.
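One possible direction, since the question rules out BeautifulSoup and other third-party libraries: the standard library ships html.parser, which needs no installation. The following is only a sketch under assumptions. The class name LinkCollector, the function extract_pdf_links and the sample markup are made up for illustration, and the http/pdf filter is copied from the code above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, name) pairs from <a href="...">NAME</a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (url, name) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_pdf_links(html_text):
    """Return (url, name) pairs whose url starts with http and ends with pdf."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return [(url, name) for url, name in parser.links
            if url.startswith("http") and url.endswith("pdf")]


# Hypothetical sample markup standing in for the contents of page.html
sample = ('<p><a href="http://example.com/docs/a.pdf" title="smth">First File</a>'
          '<a href="/local/page.html">Skip me</a>'
          "<a href='http://example.com/docs/b.pdf'>Second <b>File</b></a></p>")

for url, name in extract_pdf_links(sample):
    print(url)
    print(name)
```

To use it on the real page, pass the txt string read from the file instead of sample. Because each name is collected while its own <a> tag is open, URL and name stay paired even when some links are filtered out.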

Answer 1:

This produces the output I need, but I am sure there is a better way.

import os

with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
    txt = html.read()
    text = txt  # second reference to the same page text, used for the names

# for urls
nolp = 0
urlarrow = []
while nolp == 0:
    pos = txt.find("href")
    if pos >= 0:
        txt = txt[pos:]        # skip ahead to the next href
        pos = txt.find('"')
        txt = txt[pos + 1:]    # skip past the opening quote
        pos = txt.find('"')
        url = txt[0:pos]       # everything up to the closing quote
        if url.startswith("http") and url.endswith("pdf"):
            urlarrow.append(url)
    else:
        nolp = 1

# for names
noloop = 0
namearrow = []
while noloop == 0:
    posB = text.find("title")
    if posB >= 0:
        text = text[posB:]
        posB = text.find('"')
        text = text[posB + 19:]  # because the string starts 19 chars after posB
        posB = text.find('</')
        name = text[1:posB]
        if text.startswith('>'):
            namearrow.append(name)
    else:
        noloop = 1

# print each url followed by its name
for url, name in zip(urlarrow, namearrow):
    print(url)
    print(name)
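The two loops can also be merged so that each URL and its name are read from the same tag in a single pass, which avoids the fixed 19-character offset and the need to zip two separate lists. This is a rough sketch using only str.find. The function name extract_links and the sample string are invented for illustration, and it assumes simple markup: quoted href values, no ">" inside attribute values, and plain text between the tags.

```python
def extract_links(html_text):
    """Return (url, name) pairs for <a ... href="...">NAME</a> elements."""
    pairs = []
    pos = 0
    while True:
        start = html_text.find("<a ", pos)
        if start == -1:
            break
        tag_end = html_text.find(">", start)
        if tag_end == -1:
            break
        close = html_text.find("</a>", tag_end)
        if close == -1:
            break

        tag = html_text[start:tag_end]        # '<a href="..." title="..."'
        name = html_text[tag_end + 1:close]   # text between '>' and '</a>'
        pos = close + len("</a>")             # continue after this link

        href = tag.find("href=")
        if href == -1:
            continue
        quote = tag[href + 5:href + 6]        # the quote character in use
        if quote not in ('"', "'"):
            continue
        url_start = href + 6
        url_end = tag.find(quote, url_start)
        if url_end == -1:
            continue
        url = tag[url_start:url_end]

        # same filter as in the code above
        if url.startswith("http") and url.endswith("pdf"):
            pairs.append((url, name.strip()))
    return pairs


# Hypothetical sample markup standing in for the contents of page.html
sample = ('<a href="http://example.com/a.pdf" title="smth">File A</a> '
          '<a href="page.html">x</a> '
          "<a title='t' href='http://example.com/b.pdf'>File B</a>")

for url, name in extract_links(sample):
    print(url)
    print(name)
```

For the real page, call extract_links(txt) with the string read from the file. Plain string searching like this is fragile on irregular HTML, so treat it as a starting point only.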

Source: https://stackoverflow.com/questions/40952892/extract-url-their-names-of-an-html-file-stored-on-disk-and-print-them-respecti

Tags

python

html-parsing

extract