encoding | 易学教程

Beautiful Soup default decode charset?

Question: I have a huge set of web pages with different encodings, and I am trying to parse them using Beautiful Soup. As far as I can tell, BS detects the encoding from meta-charset or xml-encoding tags. But some documents have no such tags, or have typos in the charset name, and BS fails on all of them. I suppose its default guess is utf-8, which is wrong. Luckily, all such pages (or nearly all of them) have the same encoding. Is there any way to set it as the default? I've also tried to grep charset and use iconv to
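One possible way to get a "default encoding" is sketched below using only Python's standard library: look for a declared charset, check that Python recognises the name, and otherwise fall back to a fixed default. The names `pick_encoding`, `decode_html`, and `DEFAULT_ENCODING`, as well as the choice of windows-1251, are assumptions made for illustration; the question does not say which encoding the untagged pages use. Beautiful Soup's constructor also accepts a `from_encoding` argument, which may be the more direct route when the encoding is known in advance.

```python
import codecs
import re

# Hypothetical default; replace with the encoding your untagged pages actually use.
DEFAULT_ENCODING = "windows-1251"

# Matches charset=... in a <meta> tag or encoding="..." in an XML declaration.
_DECLARED = re.compile(rb"""(?:charset|encoding)\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I)


def pick_encoding(raw: bytes, default: str = DEFAULT_ENCODING) -> str:
    """Return the declared encoding if Python recognises it, else the default."""
    match = _DECLARED.search(raw[:2048])  # declarations normally sit near the top
    if match:
        name = match.group(1).decode("ascii")
        try:
            return codecs.lookup(name).name  # raises LookupError on typos
        except LookupError:
            pass
    return default


def decode_html(raw: bytes, default: str = DEFAULT_ENCODING) -> str:
    """Decode a page, replacing undecodable bytes instead of failing."""
    return raw.decode(pick_encoding(raw, default), errors="replace")
```

The decoded string can then be handed to the parser, so the parser no longer has to guess.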

Diamonds with question marks

Question: I'm getting these little diamonds with question marks in them in my HTML attributes when I present data from my database. I'm using EPiServer and a few custom properties. This is the information I've gathered: I save my data as an XML document, since I use custom EPiServer properties which need more than one defined value. This is saved as UTF-8. Only attributes in element tags have this problem; for example, align=left becomes align=�left�. There is no " character there, but I get the
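The diamond with a question mark is U+FFFD, the Unicode replacement character, which a decoder emits when it meets bytes that are invalid in the encoding it was told to use. The Python sketch below reproduces the symptom under an assumed scenario (quotes stored as Windows-1252 "smart quotes" and later read as UTF-8). It illustrates the mechanism only and is not a diagnosis of the EPiServer setup.

```python
# Assumed input: 0x94 is the right double quotation mark in Windows-1252,
# but on its own it is not a valid UTF-8 sequence.
raw = b"align=\x94left\x94"

as_utf8 = raw.decode("utf-8", errors="replace")
as_cp1252 = raw.decode("cp1252")

print(as_utf8)    # each invalid byte becomes U+FFFD
print(as_cp1252)  # the same bytes read with the code page they were written in
```

If the real data behaves the same way, the fix would be on the side that writes or reads the bytes with the wrong encoding, not in the markup itself.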

Perl regular expression matching on large Unicode code points

Question: I am trying to replace various characters with either a single quote or a double quote. Here is my test file: # Replace all with double quotes ＂ fullwidth “ left ” right „ low " normal # Replace all with single quotes ' normal ‘ left ’ right ‚ low ‛ reverse ` backtick I'm trying to do this... perl -Mutf8 -pi -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/ug" test.txt perl -Mutf8 -pi -e 's/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/ug' text.txt But only the backtick character gets replaced
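A commonly suggested explanation for this symptom is that `-Mutf8` only declares the script source to be UTF-8 and does not decode the file's contents, so the pattern is compared against raw bytes and only the ASCII backtick matches. That is an assumption here, since the excerpt is cut off before any answer. For comparison, a minimal Python sketch of the same replacement on decoded text follows; `normalize_quotes` is a hypothetical helper name, and the code points are copied from the two character classes in the one-liners.

```python
# Code points taken from the two character classes in the Perl one-liners.
SINGLE = "\u2018\u201A\u201B\uFF07\u2019\u0060"
DOUBLE = "\uFF02\u201C\u201D\u201E"

# One translation table: every listed character maps to its plain ASCII quote.
TABLE = str.maketrans({**{c: "'" for c in SINGLE}, **{c: '"' for c in DOUBLE}})


def normalize_quotes(text: str) -> str:
    """Replace typographic quotes and backticks with ASCII quotes."""
    return text.translate(TABLE)
```

The key point is that `text` is a decoded string, for example one read with `open(path, encoding="utf-8")`, so each quote is a single character and not a multi-byte sequence.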

R- Changing encoding of column in dataframe?

Question: I am trying to change the encoding of a column in a data frame. stri_enc_mark(data_updated$text) # [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8" # [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" # [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII" # [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII" When I try to convert it, it does not throw an error, but still has no effect on the

Open URL that contains umlaut with batch

Question: I want to open a URL in Chrome with a batch file. This works for normal URLs, but not for URLs with umlauts. start chrome.exe https://trends.google.de/trends/explore?q=mähroboter I cannot use "ae" as a replacement for "ä", as it will give me different results on Google Trends. When I keep it like this, the URL in my browser changes to https://trends.google.de/trends/explore?q=mA4hroboter which again gives me the wrong results. It needs to be "ä". I tried playing around with the file
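One way to sidestep the code page of the batch file is to percent-encode the query term as UTF-8 so that the URL contains only ASCII. The Python sketch below shows what that encoding looks like; `trends_url` and `BASE` are hypothetical names introduced for illustration.

```python
from urllib.parse import quote

BASE = "https://trends.google.de/trends/explore?q="


def trends_url(term: str) -> str:
    """Build the URL with the search term percent-encoded as UTF-8."""
    return BASE + quote(term, safe="")


print(trends_url("mähroboter"))
# https://trends.google.de/trends/explore?q=m%C3%A4hroboter
```

If the encoded URL is pasted back into a batch file, the percent signs will most likely need to be doubled (`%%C3%%A4`), because a single `%` is treated as the start of a variable reference there.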

Parsing HTML - PHP DOMDocument loadHTML UTF-8 encoding

Question: Previous posts here and here both suggest appending a resource with the correct encoding, i.e. UTF-8. Additionally, in reading similar articles here and here, the recommendation is to use <?xml version="1.0" encoding="UTF-8"?> instead. It isn't immediately clear (to me) whether, if a page already includes <meta charset="UTF-8">, loadHTML can be limited to $output = $dom->loadHTML($output, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); I am assuming yes. The page being parsed is HTML. Equally,

parse an xml file without changing the encoding and preserving the file format

Question: The original XML file is encoded as UTF-8 without a BOM: <?xml version="1.0" encoding="UTF-8"?> <some_text> <ada/> <file/> <title><![CDATA[]]></title> <code/> <parathrhseis/> </some_text> I try to set the text of title in this function: Dim myXmlDocument As XmlDocument = New XmlDocument() Dim node As XmlNode Dim s As String s = "name.xml" If System.IO.File.Exists(s) = False Then Return False End If myXmlDocument.Load(s) node = myXmlDocument.DocumentElement Try For Each node In node.ChildNodes If

Performing one hot encoding on two columns of string data

Question: I am trying to predict 'Full_Time_Home_Goals'. My code is: import pandas as pd from sklearn.model_selection import train_test_split from sklearn.tree import DecisionTreeRegressor from sklearn.metrics import mean_absolute_error from sklearn.ensemble import RandomForestRegressor import os import xlrd import datetime import numpy as np # Set option to display all the rows and columns in the dataset. If there are more rows, adjust number accordingly. pd.set_option('display.max_rows', 5000) pd.set

Python EncodeDecode error: UnicodeDecodeError: 'charmap' codec can't decode byte [duplicate]

Question: I'm trying to manipulate images, but I can't get rid of this error: fichier=open("photo.jpg","r") lignes=fichier.readlines() Traceback (most recent call last): File "<ipython-input-32-87422df77ac2>", line 1, in <module> lignes=fichier.readlines() File "C:\Winpython\python-3.5.4.amd64\lib\encodings\cp1252.py", line 23, in
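The traceback shows the file being decoded with cp1252, which happens because mode "r" opens it as text. A JPEG is binary data, so a likely fix is to open it in binary mode ("rb"), where no codec is involved. The sketch below uses a small temporary file as a stand-in for photo.jpg; `read_binary` and the sample bytes are assumptions for illustration.

```python
import os
import tempfile


def read_binary(path: str) -> bytes:
    """Read a file as raw bytes; no codec is involved in "rb" mode."""
    with open(path, "rb") as handle:
        return handle.read()


# Stand-in for photo.jpg: the first bytes of a JPEG plus 0x81, which cp1252 cannot decode.
sample = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x81"
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(sample)

data = read_binary(tmp.name)
os.remove(tmp.name)
print(data[:3].hex())  # ffd8ff
```

For actual image manipulation, an imaging library would normally take the path or the bytes directly; reading lines from an image file is rarely meaningful.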