Web scraping urlopen in python

小蘑菇 2021-01-06 07:09

I am trying to get the data from this website: http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS

It seems that urlopen doesn't get the [...]

3 answers

无人及你

2021-01-06 07:29

Personally, I write:

# Python 2.7

import urllib

url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
sock = urllib.urlopen(url)
content = sock.read() 
sock.close()

print content
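
The snippet above is Python 2 only. As a rough Python 3 sketch of the same fetch, assuming the standard urllib.request module (the helper name fetch_url and the 30-second timeout are illustrative choices, not part of the original answer):

```python
from urllib.request import urlopen


def fetch_url(url, timeout=30):
    """Return the raw bytes of the response body for url."""
    # The with-block closes the response, replacing the explicit sock.close().
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Usage (needs network access; the page may no longer exist):
# content = fetch_url('http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS')
# print(content)
```

Note that in Python 3 read() returns bytes rather than str, which matters for any regex applied to the result.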

And if you speak French... welcome to stackoverflow.com!

update 1

In fact, I now prefer the following code, because it is faster:

# Python 2.7

import httplib

conn = httplib.HTTPConnection(host='www.boursorama.com',timeout=30)

req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'

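try:
    conn.request('GET', req)
    content = conn.getresponse().read()
except Exception as exc:
    # Report the failure, then re-raise: there is no content to print
    print 'connection failed:', exc
    raise
finally:
    conn.close()

print content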

To adapt this code to Python 3, change httplib to http.client and call print as a function, e.g. print(content).
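
For reference, a Python 3 version of the httplib code might look roughly like this. It is only a sketch: the function name fetch and the port parameter are hypothetical additions for illustration, and the host and path are the ones used above.

```python
import http.client


def fetch(host, path, port=80, timeout=30):
    """Send a GET request and return the response body as bytes."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request('GET', path)
        return conn.getresponse().read()
    finally:
        # Always release the socket, even if the request fails.
        conn.close()


# Usage (needs network access; the page may no longer exist):
# content = fetch('www.boursorama.com',
#                 '/includes/cours/last_transactions.phtml?symbole=1xEURUS')
# print(content)
```

Errors such as a refused connection or a timeout propagate to the caller as exceptions instead of being swallowed by a bare except.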

I confirm that both snippets return the page source, and that it contains the data you are interested in.

update 2

Adding the following snippet to the above code will allow you to extract the data I suppose you want:

for i,line in enumerate(content.splitlines(True)):
    print str(i)+' '+repr(line)

print '\n\n'


import re

regx = re.compile('\t\t\t\t\t\t(\d\d:\d\d:\d\d)\r\n'
                  '\t\t\t\t\t\t([\d.]+)\r\n'
                  '\t\t\t\t\t\t(\d+)\r\n')

print regx.findall(content)

result (only the end)

.......................................
.......................................
.......................................
.......................................
98 'window.config.graphics = {};\n'
99 'window.config.accordions = {};\n'
100 '\n'
101 "window.addEvent('domready', function(){\n"
102 '});\n'
103 '\n'
104 '
\n'
114 '\n'
128 '\n'
129 ''



[('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]

I hope you don't plan to "play" at trading on the Forex: it's one of the best ways to lose money rapidly.

update 3

Sorry! I forgot you are using Python 3. So I think you must define the regex like this:

regx = re.compile(b'\t\t\t\t\t......)

that is, with b before the string literal. In Python 3, read() returns bytes, and applying a str pattern to bytes raises a TypeError like the one in this question
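
A minimal Python 3 sketch of the extraction with a bytes pattern, assuming the row layout implied by the regex above (six tabs, then time, price and volume, each on its own \r\n-terminated line). The names ROW, extract_rows and sample are hypothetical, and the sample is made-up input imitating one row rather than real page content:

```python
import re

# Bytes pattern (rb'...'): required because the response body is bytes.
# Each row is three lines indented by six tabs: time, price, volume.
ROW = re.compile(
    rb'\t{6}(\d\d:\d\d:\d\d)\r\n'
    rb'\t{6}([\d.]+)\r\n'
    rb'\t{6}(\d+)\r\n'
)


def extract_rows(content):
    """Return (time, price, volume) tuples of str found in bytes content."""
    return [tuple(field.decode('ascii') for field in match)
            for match in ROW.findall(content)]


# Hypothetical sample imitating one row of the response body.
sample = b'\t\t\t\t\t\t12:25:36\r\n\t\t\t\t\t\t1.4478\r\n\t\t\t\t\t\t0\r\n'
print(extract_rows(sample))  # [('12:25:36', '1.4478', '0')]
```

Decoding the captured fields keeps the output as ordinary strings, matching the shape of the Python 2 result shown earlier.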
