Gibberish text output because of encoding in web scraping

Submitted by 人盡茶涼 on 2020-07-09 14:20:37

Question


I'm trying to get Persian text from Google Translate, and the appropriate encoding for Persian is UTF-8.

Google Translate renders its HTML with JavaScript, so I'm using the requests-html module (imported as requests_html) for this.

My problem is the output I get each time, whether I use print() or write it to a file. Either way I get gibberish, non-Persian text, and I suspect the encoding is the cause.

So I tried setting the encoding to UTF-8 wherever I could. This is my code:

import requests_html
from bs4 import BeautifulSoup as BS

url = "https://translate.google.com/#view=home&op=translate&sl=en&tl=hy&text={}"
text = input("text: ")

session = requests_html.HTML(url=url.format(text), html='str')

session.render() # for executing js scripts
content = session.raw_html
            
soup = BS(content, "html.parser", from_encoding='utf-8')
table_rows = soup.find("table", "gt-baf-table").find_all('span')

# this is how I write the output to a file
with open('file.txt', 'wb') as file:
    for table_row in table_rows:
        file.write(table_row.text.encode('utf-8'))

This is the output I got for the word "space":

nounտարածությունտարածությունspacedistanceareaspreadroomtractծավալծավալvolumesizemagnitudebulkspacecontentնստելատեղնստելատեղsiegespaceհեռավորությունհեռավորությունdistancelengthspaceintervalwayտևողությունտևողությունspacestanding

Note: I also wrote all of the HTML from session.raw_html to a file and searched it for the Persian text, but the result was the same as the output above: gibberish, nonsensical text.


Answer 1:


In your URL, &sl=en&tl=hy means English to Armenian, so the text you got is Armenian, not mis-encoded Persian. Use &tl=fa for Persian. See the complete list in Google Translate Two-Letter Language Codes:

No. Language Name         Native Language Name Code 
--- -------------         -------------------- ---- 
1   Afrikaans             Afrikaans            af   
2   Albanian              Shqip                sq   
3   Arabic                عربي                 ar   
4   Armenian              Հայերէն              hy   
5   Azerbaijani           آذربایجان دیلی       az   
6   Basque                Euskara              eu   
7   Belarusian            Беларуская           be   
8   Bulgarian             Български            bg   
9   Catalan               Català               ca   
10  Chinese (Simplified)  中文简体                 zh-CN
11  Chinese (Traditional) 中文繁體                 zh-TW
12  Croatian              Hrvatski             hr   
13  Czech                 Čeština              cs   
14  Danish                Dansk                da   
15  Dutch                 Nederlands           nl   
16  English               English              en   
17  Estonian              Eesti keel           et   
18  Filipino              Filipino             tl   
19  Finnish               Suomi                fi   
20  French                Français             fr   
21  Galician              Galego               gl   
22  Georgian              ქართული              ka   
23  German                Deutsch              de   
24  Greek                 Ελληνικά             el   
25  Haitian Creole        Kreyòl ayisyen       ht   
26  Hebrew                עברית                iw   
27  Hindi                 हिन्दी               hi   
28  Hungarian             Magyar               hu   
29  Icelandic             Íslenska             is   
30  Indonesian            Bahasa Indonesia     id   
31  Irish                 Gaeilge              ga   
32  Italian               Italiano             it   
33  Japanese              日本語                  ja   
34  Korean                한국어                  ko   
35  Latvian               Latviešu             lv   
36  Lithuanian            Lietuvių kalba       lt   
37  Macedonian            Македонски           mk   
38  Malay                 Malay                ms   
39  Maltese               Malti                mt   
40  Norwegian             Norsk                no   
41  Persian               فارسی                fa   
42  Polish                Polski               pl   
43  Portuguese            Português            pt   
44  Romanian              Română               ro   
45  Russian               Русский              ru   
46  Serbian               Српски               sr   
47  Slovak                Slovenčina           sk   
48  Slovenian             Slovensko            sl   
49  Spanish               Español              es   
50  Swahili               Kiswahili            sw   
51  Swedish               Svenska              sv   
52  Thai                  ไทย                  th   
53  Turkish               Türkçe               tr   
54  Ukrainian             Українська           uk   
55  Urdu                  اردو                 ur   
56  Vietnamese            Tiếng Việt           vi   
57  Welsh                 Cymraeg              cy   
58  Yiddish               ייִדיש               yi   
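
As a minimal sketch (not part of the original script), the target language can be made an explicit parameter so that a wrong tl value is easier to spot. The names LANG_CODES and build_translate_url are hypothetical, the dictionary holds only a few rows of the table above, and the URL pattern is the one from the question, which Google may change at any time.

```python
from urllib.parse import quote

# A few rows from the table above; extend as needed.
LANG_CODES = {"Armenian": "hy", "English": "en", "Persian": "fa", "Russian": "ru"}

# URL pattern copied from the question (an assumption that it is still valid).
URL_TEMPLATE = (
    "https://translate.google.com/"
    "#view=home&op=translate&sl={sl}&tl={tl}&text={text}"
)


def build_translate_url(text, target, source="English"):
    """Build the translate URL from language names instead of raw codes."""
    return URL_TEMPLATE.format(
        sl=LANG_CODES[source],
        tl=LANG_CODES[target],
        text=quote(text),  # percent-encode spaces and non-ASCII characters
    )


print(build_translate_url("space", "Persian"))
```

An unknown language name raises KeyError here, which surfaces a typo immediately instead of silently requesting a different language.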

FYI, the following script works for me:

import requests_html
from bs4 import BeautifulSoup as BS

url = "https://translate.google.com/#view=home&op=translate&sl=en&tl=fa&text={}"
#text = input("text: ")
text = 'I have a problem with the output that I get each time.'

session = requests_html.HTML(url=url.format(text), html='str')

session.render() # for executing js scripts
content = session.raw_html
            
soup = BS(content, "html.parser", from_encoding='utf-8')
table_rows = soup.find('span', attrs={'class':'tlid-translation translation'}).find_all('span')

for table_row in table_rows:
    print(table_row.text)

Output:

D:\bat\SO\62499600.py
من با خروجی که هر بار می گیرم مشکلی دارم.

Unfortunately, I don't understand Farsi, so I also tried &tl=ru (Russian):

У меня проблема с выводом, который я получаю каждый раз.
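
If you are unsure whether some output is mis-encoded or simply in a different language, the standard unicodedata module can report which scripts its characters belong to. This is only a rough sketch: scripts_in is a hypothetical helper, and it uses the first word of each character's Unicode name as a stand-in for the script.

```python
import unicodedata
from collections import Counter


def scripts_in(text):
    """Count letters per script, using the first word of the Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts


# Start of the output from the question, then the start of the Persian output.
print(scripts_in("nounտարածություն"))
print(scripts_in("من با خروجی"))
```

Persian letters are reported as ARABIC because Persian is written in the Arabic script. Text that decodes cleanly into a single unexpected script usually points to a wrong language, not to a wrong encoding.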



Answer 2:


As @JosefZ already explained, you need to change the target language from Armenian to Persian. To extract the desired content, which I assume is the translated part, I suggest using the snippet below and then writing the results to a file with the proper encoding.

desired_rows = soup.find_all("span", {"class": "gt-baf-cell"})
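
For the file-writing part, a minimal sketch is to open the file in text mode with an explicit encoding instead of encoding each string by hand. The helper write_lines_utf8 and the file name are hypothetical, and the sample string stands in for something like [row.get_text() for row in desired_rows].

```python
import os
import tempfile


def write_lines_utf8(lines, path):
    """Write each string on its own line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


# Stand-in for the extracted rows; this sentence is the Persian output above.
sample = ["من با خروجی که هر بار می گیرم مشکلی دارم."]
path = os.path.join(tempfile.gettempdir(), "translation.txt")
write_lines_utf8(sample, path)

# Read the file back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8") as f:
    round_trip_ok = f.read().splitlines() == sample
print(round_trip_ok)
```

Writing one row per line also avoids the run-together output shown in the question, where all spans were concatenated without separators.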

Hope this helps!



Source: https://stackoverflow.com/questions/62499600/gibberish-text-output-because-of-encoding-in-web-scraping
