urllib2

Extract News article content from stored .html pages

拈花ヽ惹草 submitted on 2019-12-03 03:55:21
Question: I am reading text from HTML files and doing some analysis. These .html files are news articles. Code:

    html = open(filepath, 'r').read()
    raw = nltk.clean_html(html)
    raw.unidecode(item.decode('utf8'))

Now I just want the article content and not the rest of the text like advertisements, headings, etc. How can I do so relatively accurately in Python? I know some tools like Jsoup (a Java API) and Boilerpipe, but I want to do this in Python. I could find some techniques using bs4, but they're limited to one
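One common heuristic is to strip obvious boilerplate tags and keep the container whose paragraphs carry the most text. Here is a minimal sketch of that idea; the tag names and the scoring rule are assumptions, not a general-purpose extractor (libraries such as readability-lxml or newspaper do this far more robustly):

    # Rough heuristic: the container with the largest run of <p> text
    # is usually the article body. Assumes BeautifulSoup 4 is installed.
    from bs4 import BeautifulSoup

    def extract_article(html):
        soup = BeautifulSoup(html, 'html.parser')
        # Drop tags that rarely contain article text
        for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
            tag.decompose()
        # Score candidate containers by total direct-paragraph length
        best, best_len = None, 0
        for node in soup.find_all(['div', 'article']):
            text = ' '.join(p.get_text() for p in node.find_all('p', recursive=False))
            if len(text) > best_len:
                best, best_len = text, len(text)
        return best

    html = open('article.html').read()
    print extract_article(html)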

Python and urllib2: how to make a GET request with parameters

。_饼干妹妹 submitted on 2019-12-03 03:33:09
Question: I'm building an "API API": it's basically a wrapper for an in-house REST web service that the web app will be making a lot of requests to. Some of the web service calls need to be GET rather than POST, but still pass parameters. Is there a "best practice" way to encode a dictionary into a query string? e.g. ?foo=bar&bla=blah I'm looking at the urllib2 docs, and it looks like it decides by itself whether to use POST or GET based on whether you pass params or not, but maybe someone knows how to make it
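urllib2 issues a GET when the data argument is None and a POST when it is given, so for a parameterised GET the usual approach is to encode the dictionary with urllib.urlencode and append it to the URL yourself. A short sketch (the endpoint URL is a placeholder):

    import urllib
    import urllib2

    params = urllib.urlencode({'foo': 'bar', 'bla': 'blah'})
    # Appending the encoded params to the URL keeps the request a GET;
    # passing them as the data argument would turn it into a POST.
    response = urllib2.urlopen('http://example.com/api?' + params)
    print response.read()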

Urllib2 & BeautifulSoup: Nice couple but too slow - urllib3 & threads?

戏子无情 submitted on 2019-12-03 03:21:52
Question: I was looking for a way to optimize my code when I heard some good things about threads and urllib3. Apparently, people disagree about which solution is best. The problem with my script below is the execution time: it is so slow!

Step 1: I fetch this page http://www.cambridgeesol.org/institutions/results.php?region=Afghanistan&type=&BULATS=on
Step 2: I parse the page with BeautifulSoup
Step 3: I put the data in an Excel doc
Step 4: I do it again, and again, and again for all the countries in my
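Since almost all of the time goes into waiting on the network, overlapping the downloads with a thread pool is the usual first optimization. A minimal sketch using multiprocessing.dummy (a thread-backed Pool available since Python 2.6); the countries list here is a hypothetical subset:

    import urllib2
    from multiprocessing.dummy import Pool  # thread pool, same API as multiprocessing.Pool

    BASE = 'http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on'
    countries = ['Afghanistan', 'Albania', 'Algeria']  # hypothetical subset

    def fetch(country):
        return country, urllib2.urlopen(BASE % country).read()

    pool = Pool(8)  # 8 concurrent downloads; tune to the server's tolerance
    for country, html in pool.map(fetch, countries):
        print country, len(html)
    pool.close()
    pool.join()

Parsing with BeautifulSoup and writing to Excel can stay sequential; it is the fetching that dominates the runtime.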

Trying to get Tor to work with Python, but keep getting connection refused?

Anonymous (unverified) submitted on 2019-12-03 03:10:03
Question: I've been trying to get Tor to work with Python, but I've been hitting a brick wall. I simply can't get any of the examples to work. Here is one from Stack Overflow:

    import urllib2
    proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
    opener = urllib2.build_opener(proxy)
    print opener.open('http://check.torproject.org/').read()

I've installed Tor and it works fine while browsing through Aurora. However, running this Python script I get:

    Traceback (most recent call last):
      File "/home/x/Tor.py", line 4, in <module>
        print opener.open('http://check
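A likely cause: port 8118 belongs to Privoxy, an HTTP-to-SOCKS bridge that is only listening if Privoxy is actually installed; Tor itself speaks SOCKS on port 9050, which plain urllib2.ProxyHandler cannot talk to. A sketch that routes all sockets through Tor directly, assuming the SocksiPy/PySocks "socks" module is installed and Tor is on its default port:

    import socket
    import socks
    import urllib2

    # Send every new socket through Tor's SOCKS5 listener on 9050.
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
    socket.socket = socks.socksocket

    print urllib2.urlopen('http://check.torproject.org/').read()

Alternatively, installing and running Privoxy makes the original 127.0.0.1:8118 ProxyHandler example work as written.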

Python multiprocessing Pool.apply_async with shared variables (Value)

Anonymous (unverified) submitted on 2019-12-03 03:04:01
Question: For my college project I am trying to develop a Python-based traffic generator. I have created 2 CentOS machines on VMware and I am using one as my client and one as my server machine. I have used the IP aliasing technique to increase the number of clients and servers using just a single client/server machine. Up to now I have created 50 IP aliases on my client machine and 10 IP aliases on my server machine. I am also using the multiprocessing module to generate traffic concurrently from all 50 clients to all 10 servers. I have also developed a few profiles (1kb, 10kb
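One pitfall with Pool.apply_async and a shared multiprocessing.Value: synchronized objects cannot be pickled into the task arguments, so the Value has to reach the workers through inheritance, typically via the Pool's initializer. A minimal sketch of that pattern (the request function is a stand-in for the actual traffic code):

    import multiprocessing

    counter = None

    def init(shared):
        # Workers inherit the shared Value here, since it cannot be
        # passed through apply_async's pickled argument list.
        global counter
        counter = shared

    def send_request(i):
        with counter.get_lock():
            counter.value += 1
        return i

    if __name__ == '__main__':
        counter = multiprocessing.Value('i', 0)
        pool = multiprocessing.Pool(4, initializer=init, initargs=(counter,))
        results = [pool.apply_async(send_request, (i,)) for i in range(50)]
        for r in results:
            r.get()  # propagate any worker exceptions
        pool.close()
        pool.join()
        print 'requests sent:', counter.value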

Why do I get urllib2.HTTPError with urllib2 and no errors with urllib?

Anonymous (unverified) submitted on 2019-12-03 03:00:02
Question: I have the following simple code:

    import urllib2
    import sys
    sys.path.append('../BeautifulSoup/BeautifulSoup-3.1.0.1')
    from BeautifulSoup import *
    page = 'http://en.wikipedia.org/wiki/Main_Page'
    c = urllib2.urlopen(page)

This code generates the following error messages:

    c=urllib2.urlopen(page)
      File "/usr/lib64/python2.4/urllib2.py", line 130, in urlopen
        return _opener.open(url, data)
      File "/usr/lib64/python2.4/urllib2.py", line 364, in open
        response = meth(req, response)
      File "/usr/lib64/python2.4/urllib2.py", line 471, in http_response
        response
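The difference between the two modules: urllib2 raises HTTPError for any non-2xx status, while the old urllib quietly returns the error page's body. Wikipedia in particular has historically answered the default Python-urllib User-Agent with a 403, which would explain the exception here. A sketch that sets a real User-Agent (the header string is an arbitrary example):

    import urllib2

    page = 'http://en.wikipedia.org/wiki/Main_Page'
    # Wikipedia rejects the default Python-urllib User-Agent with a 403,
    # which urllib2 (unlike urllib) surfaces as an HTTPError.
    req = urllib2.Request(page, headers={'User-Agent': 'Mozilla/5.0 (compatible; my-script)'})
    try:
        c = urllib2.urlopen(req)
        print c.read()[:200]
    except urllib2.HTTPError as e:
        print 'server answered with status', e.code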

HTTPS log in with urllib2

Anonymous (unverified) submitted on 2019-12-03 02:52:02
Question: I currently have a little script that downloads a webpage and extracts some data I'm interested in. Nothing fancy. Currently I'm downloading the page like so:

    import commands
    command = 'wget --output-document=- --quiet --http-user=USER --http-password=PASSWORD https://www.example.ca/page.aspx'
    status, text = commands.getstatusoutput(command)

Although this works perfectly, I thought it'd make sense to remove the dependency on wget. I thought it should be trivial to convert the above to urllib2, but thus far I've had zero success. The Internet
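The wget --http-user/--http-password flags correspond to HTTP Basic authentication, which urllib2 handles with a password manager and HTTPBasicAuthHandler. A minimal sketch of that translation:

    import urllib2

    url = 'https://www.example.ca/page.aspx'
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    # None as the realm makes the credentials apply to any realm on this URL.
    password_mgr.add_password(None, url, 'USER', 'PASSWORD')
    handler = urllib2.HTTPBasicAuthHandler(password_mgr)
    opener = urllib2.build_opener(handler)

    text = opener.open(url).read()

One caveat: HTTPBasicAuthHandler only answers an explicit 401 challenge. If the server expects the credentials up front, set the Authorization header on the Request directly with a base64-encoded "USER:PASSWORD" instead.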

Python urllib2 URLError HTTP status code

Anonymous (unverified) submitted on 2019-12-03 02:49:01
Question: I want to grab the HTTP status code once it raises a URLError exception. I tried this, but it didn't help:

    except URLError, e:
        logger.warning('It seems like the server is down. Code:' + str(e.code))

Answer 1: You shouldn't check for a status code after catching URLError, since that exception can be raised in situations where there's no HTTP status code available, for example when you're getting connection-refused errors. Use HTTPError to check for HTTP-specific errors, and then use URLError to check for other problems:
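A minimal sketch of that pattern, assuming a logger configured elsewhere:

    import urllib2

    try:
        urllib2.urlopen('http://example.com/')
    except urllib2.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first;
        # only HTTPError carries an HTTP status code.
        logger.warning('server returned status %d' % e.code)
    except urllib2.URLError as e:
        # No status code here: DNS failure, refused connection, etc.
        logger.warning('failed to reach server: %s' % e.reason)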

Tell urllib2 to use custom DNS

Anonymous (unverified) submitted on 2019-12-03 02:45:02
Question: I'd like to tell urllib2.urlopen (or a custom opener) to use 127.0.0.1 (or ::1) to resolve addresses. I wouldn't change my /etc/resolv.conf, however. One possible solution is to use a tool like dnspython to query addresses and httplib to build a custom URL opener. I'd prefer telling urlopen to use a custom nameserver, though. Any suggestions?

Answer 1: It looks like name resolution is ultimately handled by socket.create_connection: urllib2.urlopen -> httplib.HTTPConnection -> socket.create_connection. Though once the "Host:" header has been set
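Since that call chain funnels through socket.getaddrinfo, one workable sketch is to monkey-patch getaddrinfo so lookups go to the chosen nameserver via dnspython. This assumes dnspython is installed; resolver.query() is its old (1.x) API, renamed resolve() in later versions:

    import socket
    import urllib2
    import dns.resolver

    # configure=False skips /etc/resolv.conf; resolve only via 127.0.0.1.
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ['127.0.0.1']

    _orig_getaddrinfo = socket.getaddrinfo

    def getaddrinfo(host, port, *args, **kwargs):
        try:
            ip = resolver.query(host, 'A')[0].address
        except Exception:
            ip = host  # fall back, e.g. when host is already an IP literal
        return _orig_getaddrinfo(ip, port, *args, **kwargs)

    socket.getaddrinfo = getaddrinfo
    print urllib2.urlopen('http://example.com/').read()[:100]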

Python handling socket.error: [Errno 104] Connection reset by peer

Anonymous (unverified) submitted on 2019-12-03 02:44:02
Question: When using Python 2.7 with urllib2 to retrieve data from an API, I get the error [Errno 104] Connection reset by peer. What's causing the error, and how should the error be handled so that the script does not crash?

ticker.py:

    def urlopen(url):
        response = None
        request = urllib2.Request(url=url)
        try:
            response = urllib2.urlopen(request).read()
        except urllib2.HTTPError as err:
            print "HTTPError: {} ({})".format(url, err.code)
        except urllib2.URLError as err:
            print "URLError: {} ({})".format(url, err.reason)
        except httplib.BadStatusLine as err:
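Errno 104 (ECONNRESET) means the remote end dropped the TCP connection, commonly because of rate limiting or server overload, and it surfaces as a socket.error rather than one of the urllib2 exceptions above; depending on where the reset happens it can also arrive wrapped in urllib2.URLError, so inspecting err.reason there is prudent. A sketch of catching it with a back-off retry (the retry count is an arbitrary choice):

    import errno
    import socket
    import time
    import urllib2

    def urlopen_with_retry(url, retries=3):
        for attempt in range(retries):
            try:
                return urllib2.urlopen(url).read()
            except socket.error as err:
                # Errno 104 (ECONNRESET): the server dropped the connection,
                # often due to rate limiting; back off and try again.
                if err.errno != errno.ECONNRESET:
                    raise
                time.sleep(2 ** attempt)
        return None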