encoding

Beautiful Soup default decode charset?

旧时模样 posted on 2021-02-05 08:44:07
Question: I have a huge set of web pages with different encodings, and I am trying to parse them using Beautiful Soup. As far as I can tell, BS detects the encoding from meta-charset or xml-encoding tags. But some documents have no such tags, or have typos in the charset name, and BS fails on all of them. I suppose its default guess is utf-8, which is wrong. Luckily, all such pages (or nearly all of them) have the same encoding. Is there any way to set it as the default? I've also tried to grep charset and use iconv to
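
One way to approach the question above, sketched with the standard library only: decode the raw bytes yourself, using the declared charset when Python recognizes it and a chosen fallback otherwise, then hand the resulting text to the parser. The helper name decode_with_default, the regular expression, and the windows-1251 fallback are illustrative assumptions, not taken from the question. Beautiful Soup also accepts a from_encoding argument that forces a specific encoding, which may be enough when every page shares one.

```python
import codecs
import re

# Hypothetical helper: the name, the regex and the fallback are assumptions.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def decode_with_default(raw: bytes, default: str = "windows-1251") -> str:
    """Decode raw HTML bytes, falling back to `default` when the declared
    charset is missing or is not a codec name Python knows."""
    encoding = default
    match = META_CHARSET.search(raw[:4096])
    if match:
        declared = match.group(1).decode("ascii")
        try:
            codecs.lookup(declared)  # raises LookupError for misspelled names
            encoding = declared
        except LookupError:
            pass
    return raw.decode(encoding, errors="replace")


# Three made-up pages: a correct declaration, a typo, and no declaration.
page_ok = '<meta charset="utf-8"><p>\u041f\u0440\u0438\u0432\u0435\u0442</p>'.encode("utf-8")
page_typo = '<meta charset="utf8x"><p>\u041f\u0440\u0438\u0432\u0435\u0442</p>'.encode("windows-1251")
page_bare = '<p>\u041f\u0440\u0438\u0432\u0435\u0442</p>'.encode("windows-1251")

texts = [decode_with_default(p) for p in (page_ok, page_typo, page_bare)]
```

The decoded strings can then be passed to the parser as text, so no further encoding guess is needed at that stage.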

Diamonds with question marks

自作多情 posted on 2021-02-04 18:58:07
Question: I'm getting these little diamonds with question marks in them in my HTML attributes when I present data from my database. I'm using EPiServer and a few custom properties. This is the information I've gathered: I save my data as an XML document, since I use custom EPiServer properties which need more than one defined value. This is saved as UTF-8. Only attributes in element tags have this problem; for example, align=left becomes align=�left�. There is no " character there, but I get the
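
The diamond with a question mark is U+FFFD, the replacement character a decoder emits for bytes that are not valid in the encoding it assumes. A minimal Python sketch of how that can arise, assuming (the question does not confirm this) that the attribute value is wrapped in Windows-1252 curly quotes and is later decoded as UTF-8:

```python
# Assumption: the stored bytes contain Windows-1252 curly quotes (0x93, 0x94).
raw = b"align=\x93left\x94"

# Neither 0x93 nor 0x94 can start a UTF-8 sequence, so each becomes U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Read as Windows-1252, the same bytes are left and right double quotes.
as_cp1252 = raw.decode("cp1252")

print(ascii(as_utf8))
print(ascii(as_cp1252))
```

If this is the cause, the fix lies in where the bytes are written or read with mismatched encodings, not in the attribute text itself.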

Perl regular expression matching on large Unicode code points

折月煮酒 posted on 2021-02-04 18:16:26
Question: I am trying to replace various characters with either a single quote or a double quote. Here is my test file:

    # Replace all with double quotes
    " fullwidth
    “ left
    ” right
    „ low
    " normal

    # Replace all with single quotes
    ' normal
    ‘ left
    ’ right
    ‚ low
    ‛ reverse
    ` backtick

I'm trying to do this...

    perl -Mutf8 -pi -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/ug" test.txt
    perl -Mutf8 -pi -e 's/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/ug' text.txt

But only the backtick character gets replaced
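
For comparison, the same substitution can be sketched in Python, where the input is already decoded text and the code points can be listed directly. This is not a diagnosis of the Perl one-liners; the function name normalize_quotes and the sample string are assumptions for illustration.

```python
# Code points copied from the character classes in the question.
SINGLE = "\u2018\u201A\u201B\uFF07\u2019\u0060"
DOUBLE = "\uFF02\u201C\u201D\u201E"

# str.maketrans maps each character of the first string to the
# character at the same position in the second string.
TABLE = str.maketrans(SINGLE + DOUBLE, "'" * len(SINGLE) + '"' * len(DOUBLE))


def normalize_quotes(text: str) -> str:
    """Replace typographic quotes and backticks with plain ASCII quotes."""
    return text.translate(TABLE)


sample = "\u201Cleft\u201D \u2018left\u2019 `backtick`"
result = normalize_quotes(sample)
print(result)
```

The key point is that the replacement only works on decoded characters; if the file is processed as raw bytes, multi-byte characters never match a single code point.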

R- Changing encoding of column in dataframe?

最后都变了- posted on 2021-02-04 15:32:29
Question: I am trying to change the encoding of a column in a dataframe.

    stri_enc_mark(data_updated$text)
    # [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
    # [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
    # [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
    # [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"

When I try to convert it, it does not throw an error, but still has no effect on the
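
One possible reading of the output above: strings marked ASCII contain only 7-bit characters, and such strings have identical bytes in ASCII and in UTF-8, so converting them to UTF-8 would not be expected to change anything. A small Python sketch of that point; enc_mark is a hypothetical, simplified stand-in for stri_enc_mark, not its implementation, and the sample values are made up.

```python
def enc_mark(text: str) -> str:
    """Label a string 'ASCII' if every character is 7-bit, else 'UTF-8'."""
    return "ASCII" if text.isascii() else "UTF-8"


values = ["plain text", "caf\u00e9", "na\u00efve", "abc"]
marks = [enc_mark(v) for v in values]

# Pure-ASCII text encodes to the same bytes under both encodings.
same_bytes = "plain text".encode("ascii") == "plain text".encode("utf-8")

print(marks, same_bytes)
```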

Open URL that contains umlaut with batch

我的未来我决定 posted on 2021-02-04 08:11:50
Question: I want to open a URL in Chrome with a batch file. This works for normal URLs, but not for URLs with umlauts.

    start chrome.exe https://trends.google.de/trends/explore?q=mähroboter

I cannot use "ae" as a replacement for "ä", as it gives me different results on Google Trends. When I keep it like this, the URL in my browser changes to

    https://trends.google.de/trends/explore?q=mA4hroboter

which again gives me the wrong results. It needs to be "ä". I tried playing around with the file
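
A common way to sidestep the script file's own encoding is to percent-encode the non-ASCII character, so the URL contains only ASCII. The Python sketch below only shows how that encoded form is derived; whether it resolves the asker's case is an assumption. Note that in a batch file a literal percent sign generally has to be written as %%.

```python
from urllib.parse import quote, urlencode

base = "https://trends.google.de/trends/explore"
query = "m\u00e4hroboter"  # "mähroboter"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character.
encoded = quote(query, safe="")

# urlencode() builds the whole query string the same way.
url = base + "?" + urlencode({"q": query})

print(encoded)
print(url)
```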

Parsing HTML - PHP DOMDocument loadHTML UTF-8 encoding

£可爱£侵袭症+ posted on 2021-01-29 14:55:06
Question: Previous posts here and here both suggest appending a resource with the correct encoding, i.e. UTF-8. Additionally, in reading similar articles here and here, the recommendation is to use <?xml version="1.0" encoding="UTF-8"?> instead. It isn't immediately clear (to me) whether, if a page already includes <meta charset="UTF-8">, loadHTML can be limited to $output = $dom->loadHTML($output, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);. I am assuming yes. The page being parsed is HTML. Equally,

parse an xml file without changing the encoding and preserving the file format

感情迁移 posted on 2021-01-29 12:53:43
Question: The original XML file is encoded with UTF-8 without BOM:

    <?xml version="1.0" encoding="UTF-8"?>
    <some_text>
      <ada/>
      <file/>
      <title><![CDATA[]]></title>
      <code/>
      <parathrhseis/>
    </some_text>

I try to set text to title in this function:

    Dim myXmlDocument As XmlDocument = New XmlDocument()
    Dim node As XmlNode
    Dim s As String
    s = "name.xml"
    If System.IO.File.Exists(s) = False Then
        Return False
    End If
    myXmlDocument.Load(s)
    node = myXmlDocument.DocumentElement
    Try
        For Each node In node.ChildNodes
            If
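
When checking whether a save step altered the file, the byte order mark is one property that can be compared directly on the bytes. The sketch below is in Python, not the VB.NET of the question, and the XML string is a stand-in; it only illustrates the difference between UTF-8 with and without a BOM.

```python
import codecs

# Stand-in document, not the asker's file.
xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n<some_text><title/></some_text>\n'

without_bom = xml_text.encode("utf-8")   # plain UTF-8, no byte order mark
with_bom = xml_text.encode("utf-8-sig")  # prepends the bytes EF BB BF


def has_utf8_bom(data: bytes) -> bool:
    """Return True when the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


print(has_utf8_bom(without_bom), has_utf8_bom(with_bom))
```

Running the same check on the file before and after saving shows whether the writer added a BOM.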

parse an xml file without changing the encoding and preserving the file format

好久不见. posted on 2021-01-29 11:56:02
Question: The original XML file is encoded with UTF-8 without BOM:

    <?xml version="1.0" encoding="UTF-8"?>
    <some_text>
      <ada/>
      <file/>
      <title><![CDATA[]]></title>
      <code/>
      <parathrhseis/>
    </some_text>

I try to set text to title in this function:

    Dim myXmlDocument As XmlDocument = New XmlDocument()
    Dim node As XmlNode
    Dim s As String
    s = "name.xml"
    If System.IO.File.Exists(s) = False Then
        Return False
    End If
    myXmlDocument.Load(s)
    node = myXmlDocument.DocumentElement
    Try
        For Each node In node.ChildNodes
            If

Performing one hot encoding on two columns of string data

邮差的信 posted on 2021-01-29 10:52:15
Question: I am trying to predict 'Full_Time_Home_Goals'. My code is:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.ensemble import RandomForestRegressor
    import os
    import xlrd
    import datetime
    import numpy as np

    # Set option to display all the rows and columns in the dataset. If there are more rows, adjust number accordingly.
    pd.set_option('display.max_rows', 5000)
    pd.set
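
To make the title's idea concrete, here is a dependency-free sketch of one-hot encoding two string columns. The column names Home_Team and Away_Team and the rows are hypothetical, since the snippet is cut off before the data is shown. In practice pandas.get_dummies or scikit-learn's OneHotEncoder would typically do this work; the plain-Python version only shows what the transformation produces.

```python
# Hypothetical rows; 'Home_Team' and 'Away_Team' are assumed column names.
rows = [
    {"Home_Team": "Arsenal", "Away_Team": "Chelsea"},
    {"Home_Team": "Chelsea", "Away_Team": "Everton"},
    {"Home_Team": "Arsenal", "Away_Team": "Everton"},
]


def one_hot(rows, columns):
    """Return (names, matrix) with one 0/1 feature per (column, value) pair."""
    # Collect the distinct values of each column in a stable, sorted order.
    features = [(col, value)
                for col in columns
                for value in sorted({row[col] for row in rows})]
    names = [f"{col}_{value}" for col, value in features]
    matrix = [[1 if row[col] == value else 0 for col, value in features]
              for row in rows]
    return names, matrix


names, matrix = one_hot(rows, ["Home_Team", "Away_Team"])
print(names)
print(matrix)
```

Each string column expands into as many numeric columns as it has distinct values, which is the form tree-based regressors in scikit-learn expect.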

Python EncodeDecode error: UnicodeDecodeError: 'charmap' codec can't decode byte [duplicate]

怎甘沉沦 posted on 2021-01-29 09:29:09
Question: This question already has answers here: error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte (18 answers). Closed 1 year ago. I'm trying to manipulate images but I can't get rid of this error:

    fichier=open("photo.jpg","r")
    lignes=fichier.readlines()

    Traceback (most recent call last):
      File "<ipython-input-32-87422df77ac2>", line 1, in <module>
        lignes=fichier.readlines()
      File "C:\Winpython\python-3.5.4.amd64\lib\encodings\cp1252.py", line 23, in
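
The traceback passes through cp1252.py because the file was opened in text mode ("r"), so Python tries to decode the image bytes as text. A likely fix is to open the file in binary mode. The sketch below writes a small stand-in file so that it is self-contained; the byte content is made up and is not a real image.

```python
import os
import tempfile

# Stand-in for photo.jpg: JPEG files begin with the bytes FF D8.
fake_jpeg = b"\xff\xd8\xff\xe0" + bytes(range(256))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "photo.jpg")
    with open(path, "wb") as f:
        f.write(fake_jpeg)

    # "rb" returns raw bytes and skips text decoding entirely,
    # so no codec is involved and no UnicodeDecodeError can occur.
    with open(path, "rb") as f:
        data = f.read()

print(len(data), data[:2])
```

For actual image manipulation, the bytes would then go to an image library; readlines() is not meaningful for binary image data.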