How to remove non-ASCII characters in Python

梦如初夏 2020-12-22 01:00

This is my code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import          


        
3 Answers
  • 2020-12-22 01:38

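    To remove non-ASCII characters from text, keep only the characters that appear in string.printable:

    import string
    
    # string.printable holds the printable ASCII characters plus whitespace,
    # so this also drops ASCII control characters.
    text = ''.join(ch for ch in text if ch in string.printable)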
    
  • 2020-12-22 01:52

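    Try normalizing the string and then encoding it as ASCII, ignoring errors. NFKD normalization splits an accented letter into its base letter plus a combining mark, so the base letter survives the encoding; characters with no ASCII base, such as §, are dropped.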

    # -*- coding: utf-8 -*-
    from unicodedata import normalize
    
    string = 'úäô§'
    
    if isinstance(string, str):
        string = string.decode('utf-8')
    
    print normalize('NFKD', string).encode('ASCII', 'ignore')
    # prints: uao
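
For readers on Python 3, where str is already Unicode and print is a function, the same normalize-then-encode idea might look like the sketch below. The helper name to_ascii is hypothetical and not part of the answer above.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop whatever is still non-ASCII.

    to_ascii is a hypothetical helper name used only for illustration.
    """
    # NFKD splits e.g. U+00FA into 'u' followed by a combining acute accent.
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' silently drops every character that ASCII cannot represent,
    # which includes the combining marks and symbols such as the section sign.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Same input as the answer above, written with escapes: u-acute, a-umlaut,
# o-circumflex, section sign.
print(to_ascii('\u00fa\u00e4\u00f4\u00a7'))  # prints: uao
```

The final decode turns the ASCII bytes back into a str, which is usually what later string handling expects.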
    
  • 2020-12-22 01:59

    In Python 2, each character of a str is one 8-bit byte (0-255), while ASCII characters fit in 7 bits (0-127), so you can simply keep the characters whose ord value is below 128 and drop the rest.

    chr converts an integer to a character; ord converts a character to an integer.

    text = ''.join(c for c in str(div) if ord(c) < 128)
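
To see what the ord filter does on its own, here is a minimal Python 3 sketch that works on a plain string instead of the scraped div. The function name strip_non_ascii is an assumption made for the example.

```python
def strip_non_ascii(text):
    """Keep only characters whose code point is in the ASCII range (0-127).

    strip_non_ascii is a hypothetical helper name used only for illustration.
    """
    return ''.join(c for c in text if ord(c) < 128)


# \u00ef is i-diaeresis and \u00e9 is e-acute; both are removed outright.
print(strip_non_ascii('na\u00efve caf\u00e9'))  # prints: nave caf
```

Unlike normalizing first, this filter removes accented letters entirely instead of reducing them to their base letter.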
    

    This should be your final code:

    #!C:/Python27/python
    # -*- coding: utf-8 -*-
    import requests
    from bs4 import BeautifulSoup
    import urllib2
    import sys
    import urlparse
    import io
    
    url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
    #url = "http://eqa.unibo.it/article/view/4554"
    #r = requests.get(url)
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    #soup = BeautifulSoup(r.text,'lxml')
    
    if url.find("http://www.dlib.org") != -1:
        div = soup.find('td', valign='top')
    else:
        div = soup.find('div',id='content')
    
    # Keep only ASCII characters (ord value below 128) and write them out.
    text = ''.join(c for c in str(div) if ord(c) < 128)
    with open('path/file_name.html', 'w') as f:
        f.write(text)
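
As an alternative to filtering before the write step, Python 3 can drop non-ASCII characters while writing if the file is opened with an ASCII encoding and errors='ignore'. This is a different technique from the answer's ord filter, sketched here with a temporary directory and a hypothetical helper name write_ascii_only.

```python
import os
import tempfile


def write_ascii_only(path, text):
    """Write text to path, silently dropping characters ASCII cannot encode.

    write_ascii_only is a hypothetical helper name used only for illustration.
    """
    # errors='ignore' tells the text layer to skip unencodable characters
    # instead of raising UnicodeEncodeError.
    with open(path, 'w', encoding='ascii', errors='ignore') as f:
        f.write(text)


# A temporary directory stands in for the answer's 'path/file_name.html'.
path = os.path.join(tempfile.mkdtemp(), 'file_name.html')
write_ascii_only(path, '<td>caf\u00e9</td>')

with open(path, encoding='ascii') as f:
    print(f.read())  # prints: <td>caf</td>
```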
    