Fetch data from variables inside a script tag in Python (content added by JS)

礼貌的吻别 2020-12-14 11:23

I want to fetch data from another URL, for which I am using urllib and Beautiful Soup. My data is inside a table tag (which I have figured out using Firefox co

2 answers
  • 2020-12-14 11:45

    EDIT

    This will do the trick, using the re module to extract the data and then loading it as JSON:

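    # Python 2 code (urllib.urlopen and the print statement below).
    import urllib
    import json
    import re
    from bs4 import BeautifulSoup
    
    web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
    soup = BeautifulSoup(web.read(), 'lxml')
    # The 20th <script> tag on the page held the data when this was written.
    data = soup.find_all("script")[19].string
    # Capture everything between "var table_body = " and the first semicolon.
    # re.DOTALL lets "." span newlines in case the array is split across lines.
    p = re.compile(r'var table_body = (.*?);', re.DOTALL)
    # search() finds the assignment anywhere in the script, not only at the start.
    m = p.search(data)
    stocks = json.loads(m.group(1))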
    
    >>> for stock in stocks:
    ...     print stock
    ... 
    [u'ASPS', u'Altisource Portfolio Solutions S.A.', 116.96, 2.2, 1.92, 86635, u'N', u'N']
    [u'AGNC', u'American Capital Agency Corp.', 23.76, 0.13, 0.55, 3184303, u'N', u'N']
    .
    .
    .
    [u'ZION', u'Zions Bancorporation', 29.79, 0.46, 1.57, 2154017, u'N', u'N']
    

    The problem with this is that the script tag offset is hard-coded and there is not a reliable way to locate it within the page. Changes to the page could break your code.

    ORIGINAL answer

    Rather than try to screen scrape the data, you can download a CSV representation of the same data from http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx?render=download.

    Then use the Python csv module to parse and process it. Not only is this more convenient, it will be a more resilient solution because any changes to the HTML could easily break your screen scraping code.
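
A minimal sketch of the csv step, written for Python 3. The header names (`Symbol`, `Name`, `LastSale`) and the `SAMPLE_CSV` text are assumptions made up for illustration; the real download may use different columns, so check its header row first. `parse_quotes` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample; the real file's header names and columns may differ.
SAMPLE_CSV = """Symbol,Name,LastSale
ATVI,"Activision Blizzard, Inc",20.92
ADBE,Adobe Systems Incorporated,66.91
"""


def parse_quotes(csv_text):
    """Return one dict per data row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


rows = parse_quotes(SAMPLE_CSV)
for row in rows:
    # csv yields every field as a string, so convert numbers explicitly.
    print(row["Symbol"], float(row["LastSale"]))
```

The csv module handles the quoted comma in "Activision Blizzard, Inc" on its own, which a plain `split(",")` would not.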

    Otherwise, if you look at the actual HTML you will find that the data is available within the page in the following script tag:

    <script type="text/javascript">var table_body = [["ATVI", "Activision Blizzard, Inc", 20.92, 0.21, 1.01, 6182877,  .1, "N", "N"],
    ["ADBE", "Adobe Systems Incorporated", 66.91, 1.44, 2.2, 3629837,  .6, "N", "N"],
    ["AKAM", "Akamai Technologies, Inc.", 57.47, 1.57, 2.81, 2697834,  .3, "N", "N"],
    ["ALXN", "Alexion Pharmaceuticals, Inc.", 170.2, 0.7, 0.41, 659817,  .1, "N", "N"],
    ["ALTR", "Altera Corporation", 33.82, -0.06, -0.18, 1928706,  .0, "N", "N"],
    ["AMZN", "Amazon.com, Inc.", 329.67, 6.1, 1.89, 5246300,  2.5, "N", "N"],
    ....
    ["YHOO", "Yahoo! Inc.", 35.92, 0.98, 2.8, 18705720,  .9, "N", "N"]];
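
Note that the sample above contains bare decimals such as `.1`. JSON requires a digit before the decimal point, so `json.loads` would reject this particular array. One possible workaround, sketched here for Python 3, is `ast.literal_eval`, which reads the same text as a Python list literal. This assumes the array holds only string and number literals (JavaScript `true`, `false` or `null` would not parse). `SAMPLE_HTML` and `extract_table_body` are hypothetical names for illustration.

```python
import ast
import re

# Illustrative page fragment built from two of the rows shown above.
SAMPLE_HTML = '''<script type="text/javascript">var table_body = [["ATVI", "Activision Blizzard, Inc", 20.92, 0.21, 1.01, 6182877,  .1, "N", "N"],
["ALTR", "Altera Corporation", 33.82, -0.06, -0.18, 1928706,  .0, "N", "N"]];
</script>'''

# Lazily capture from the opening "[" to the first "];" after it.
# re.DOTALL lets "." cross the line breaks between rows.
TABLE_BODY = re.compile(r'var table_body = (\[.*?\]);', re.DOTALL)


def extract_table_body(page_text):
    """Return the rows as Python lists, or None if the variable is absent."""
    match = TABLE_BODY.search(page_text)
    if match is None:
        return None
    # literal_eval only evaluates literals, so no page code is executed.
    return ast.literal_eval(match.group(1))


rows = extract_table_body(SAMPLE_HTML)
print(rows[0][0], rows[0][6])
```

Because the pattern is run against the whole page text, this version does not depend on the position of the script tag.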
    
  • 2020-12-14 11:46

    Just to add to @mhawke's answer: rather than hard-coding the offset of the script tag, you can loop through all the script tags and pick the one that matches your pattern:

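    # Python 2 code (urllib.urlopen and the print statement).
    import urllib
    import json
    import re
    from bs4 import BeautifulSoup
    
    web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
    pattern = re.compile(r'var table_body = (.*?);', re.DOTALL)
    
    soup = BeautifulSoup(web.read(), "lxml")
    for script in soup.find_all('script'):
        # script.string is None for tags without inline text (e.g. src= scripts).
        if script.string is None:
            continue
        # search() matches anywhere in the script; match() only at the start.
        data = pattern.search(script.string)
        if data:
            stocks = json.loads(data.group(1))
            print stocks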
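
The loop above is Python 2. A rough Python 3 sketch of the same idea follows. To keep it runnable without network access or bs4, it works on a plain list of script texts; `find_table_body` and the `scripts` list are hypothetical stand-ins, and the one sample row is copied from the output shown in the first answer.

```python
import json
import re

PATTERN = re.compile(r'var table_body = (.*?);', re.DOTALL)


def find_table_body(script_texts):
    """Return the parsed array from the first script text that defines it."""
    for text in script_texts:
        if text is None:  # e.g. <script src="..."> has no inline text
            continue
        match = PATTERN.search(text)
        if match:
            return json.loads(match.group(1))
    return None


# Stand-in for [s.string for s in soup.find_all("script")].
scripts = [
    None,
    "var other = 1;",
    'var table_body = [["ZION", "Zions Bancorporation", 29.79, 0.46, 1.57, 2154017, "N", "N"]];',
]
stocks = find_table_body(scripts)
print(stocks[0][0])
```

With BeautifulSoup, the call would look like `find_table_body(s.string for s in soup.find_all("script"))`. The pattern stops at the first semicolon, so a semicolon inside one of the quoted strings would cut the match short.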
    