Excel CSV output appearing wrong after calling Python File

问题

Currently struggling with the following output in .csv where their is various random character within the is the players names and values where there shouldn't be

(I've given a picture below of the output)

I'm wondering where I'm going wrong in the code where I'm struggling to eliminate the random characters

I'm trying to remove the characters below such as Â, Ã, ©, ‰ and so on. Any suggestions?

Python Code

#importing

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 
       'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like 
Gecko) Chrome/47.0.2526.106 Safari/537.36'}

#calling websites
page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

#calling players names
Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
#Let's look at the first name in the Players list.
Players[0].text

#calling value of players
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})
#Let's look at the first name in the Values list.
Values[0].text

PlayersList = []
ValuesList = []

for i in range(0,25):
   PlayersList.append(Players[i].text)
   ValuesList.append(Values[i].text)

df = pd.DataFrame({"Players":PlayersList,"Values":ValuesList})

df.to_csv('2000.csv', index=False)

df.head()

====================================================================

My Excel output

回答1:

...
utf8_bom = '\xEF\xBB\xBF'
with open('2000.csv', 'w') as csv_file:
    csv_file.write(utf8_bom)
    df.to_csv(csv_file, index=False, mode='a')

Explanation: The BOM is the byte order mark (q.v.). If Excel finds it at the beginning of the CSV file, it uses it to determine the encoding, which in your case is UTF-8 (the default encoding – correctly – for Python 3).

EDIT

As Mark Tolonen pointed out, the compact version of the above is the following code:

df.to_csv('2000.csv', encoding='utf-8-sig', index=False)

The -sig in the name of the encoding stands for “signature”, i.e., the BOM at the beginning which is used by Microsoft software to detect the encoding. See also the Encodings and Unicode section of the codecs manual.

回答2:

You system seems to be writing the file encoded as UTF-8. Excel expects UTF-8 files to have a BOM signature, else it assumes a text file is encoded in a locale-specific ANSI encoding. This is for backward compatibility due to Windows existing before UTF-8 did.

Python has an encoding that writes the UTF-8 BOM signature, utf-8-sig, so simply use:

df.to_csv('2000.csv', encoding='utf-8-sig', index=False)

回答3:

UPDATE:

I've fixed this situation from the following answer in the link below..

https://stackoverflow.com/a/6488070/10675615

Save the exported file as a csv in the cmd prompt

Open Excel

Import the data using Data-->Import External Data/ Get Text/CSV --> Import Data

Select the file type of "csv" and browse to your file

In the import wizard change the File_Origin to "65001 UTF" (or choose the correct language character identifier)

Change the Delimiter to comma

Select where to import to and Finish This way the special characters should show correctly.

来源：https://stackoverflow.com/questions/53797028/excel-csv-output-appearing-wrong-after-calling-python-file

标签

python

excel

csv

web-scraping

export-to-csv