Converting html to text with Python

一生所求 2020-12-12 17:49

I am trying to convert an html block to text using Python.

Input:

9 answers
  •  臣服心动
    2020-12-12 18:40

    I was in need of a way of doing this on a client's system without having to download additional libraries. I never found a good solution, so I created my own. Feel free to use this if you like.

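    import urllib.request

    def html2text(strText):
        str1 = strText
        # Keep only the document body. The tag literals in this function were
        # lost from the post and are reconstructed from context and slice lengths.
        int2 = str1.lower().find("<body")
        if int2 > 0:
            str1 = str1[int2:]
        int2 = str1.lower().find("</body>")
        if int2 > 0:
            str1 = str1[:int2]
        # Tags that become a line break (chr(13)) or a tab (chr(9)).
        # Only 'span>' and 'li>' survive in the post; the other six are assumed.
        list1 = ['<br>', '<tr', '<td', '</p>', 'span>', 'li>', '</h', 'div>']
        list2 = [chr(13), chr(13), chr(9), chr(13), chr(13), chr(13), chr(13), chr(13)]
        bolFlag1 = True   # False while inside a script or noscript element
        bolFlag2 = True   # False while inside a tag, between '<' and '>'
        strReturn = ""
        for int1 in range(len(str1)):
            str2 = str1[int1]
            for int2 in range(len(list1)):
                if str1[int1:int1+len(list1[int2])].lower() == list1[int2]:
                    strReturn = strReturn + list2[int2]
            # Assumed: the opening-tag check that turns output off was lost from the post.
            if str1[int1:int1+7].lower() == '<script' or str1[int1:int1+9].lower() == '<noscript':
                bolFlag1 = False
            if str1[int1:int1+9].lower() == '</script>' or str1[int1:int1+11].lower() == '</noscript>':
                bolFlag1 = True
            if str2 == '<':
                bolFlag2 = False
            # Copy visible characters, dropping line feeds from the source markup.
            if bolFlag1 and bolFlag2 and (ord(str2) != 10):
                strReturn = strReturn + str2
            if str2 == '>':
                bolFlag2 = True
            if bolFlag1 and bolFlag2:
                # Trim whitespace around line breaks and collapse repeated breaks.
                strReturn = strReturn.replace(chr(32)+chr(13), chr(13))
                strReturn = strReturn.replace(chr(9)+chr(13), chr(13))
                strReturn = strReturn.replace(chr(13)+chr(32), chr(13))
                strReturn = strReturn.replace(chr(13)+chr(9), chr(13))
                strReturn = strReturn.replace(chr(13)+chr(13), chr(13))
        strReturn = strReturn.replace(chr(13), '\n')
        return strReturn

    # Python 3 form of the original urllib.urlopen(url).read() and print statement.
    url = "http://www.theguardian.com/world/2014/sep/25/us-air-strikes-islamic-state-oil-isis"
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    print(html2text(html))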
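
For comparison, the same "no additional libraries" goal can probably also be met with Python 3's standard `html.parser` module, which handles tag boundaries and character references itself. This is not the method from the answer above, only a minimal sketch. The names `TextExtractor` and `html_to_text` and the choice of tags in `BLOCK_TAGS` are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips script/style/noscript content."""

    # Hypothetical tag sets chosen for this sketch; adjust to taste.
    SKIP_TAGS = {"script", "style", "noscript"}
    BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # text fragments in document order
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            # End of a block-level element starts a new line.
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Strip each line and drop the empty ones.
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "<body><h1>Title</h1><p>Hello <b>world</b></p><script>x = 1</script></body>"
    print(html_to_text(sample))
```

Letting the parser track tags avoids the character-by-character flag handling, at the cost of less control over how individual tags map to whitespace.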
