How to retrieve data from an HTML canvas using Python?

Posted by 穿精又带淫゛_ on 2020-05-25 08:55:13

Question


I am new to web scraping, and for a project I am working on I need to retrieve data on bitcoin transactions over time from an interactive chart (https://bitinfocharts.com/comparison/bitcoin-transactions.html) using Python 2.7. The data I want appears to be hidden in the 855x455 canvas rather than sitting directly in the HTML. However, I can find the data in the page source in the form [new Date("2018/02/18"),159333]]. Why is that, and how can I scrape it? Thanks for the help!


Answer 1:


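Looking at the HTML response, I found a script tag that contains all the entries drawn on the canvas. The canvas itself is only a drawing surface: the page passes the data inline to Dygraph, which renders it, and that is why the values show up in the page source rather than as HTML elements.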

<script>
var gIsLog = 0;
var gIsZoomed = "";
var d;
$(function() {
            $(".average").each(function() {
                $(this).html('Average ' + $(this).html());
            });
            $(".simple").each(function() {
                $(this).html('Simple ' + $(this).html());
            });
            $(".exponential").each(function() {
                $(this).html('Exponential ' + $(this).html());
            });
            $(".weighted").each(function() {
                $(this).html('Weighted ' + $(this).html());
            });
            $("#container").height(($(window).height() - 355 - $('#buttonsHDiv').height() > 200) ? $(window).height() - 355 - $('#buttonsHDiv').height() : 200);
            $(window).resize(function() {
                $("#container").height(($(window).height() - 355 - $('#buttonsHDiv').height() > 200) ? $(window).height() - 355 - $('#buttonsHDiv').height() : 200);
            });
            d = new Dygraph(document.getElementById("container"), [
                        [new Date("2009/01/03"), null],
                        [new Date("2009/01/04"), null],
                        [new Date("2009/01/05"), null],
                        [new Date("2009/01/06"), null],
                        [new Date("2009/01/07"), null],
                        [new Date("2009/01/08"), null],

Using this fact, I wrote the code below, which relies on a regex. It does what you want: I parse the response text, find the script tag with the required data, and apply the regex to it. Please have a look.

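import re
import requests
from bs4 import BeautifulSoup


url = 'https://bitinfocharts.com/comparison/bitcoin-transactions.html'
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'lxml')

# Pick the script that builds the chart by its content rather than by a
# fixed position such as [5], which breaks if the page layout changes.
script_text = ''
for script_tag in soup.find_all('script'):
    if script_tag.string and 'new Dygraph' in script_tag.string:
        script_text = script_tag.string
        break

# Each record looks like [new Date("2009/01/03"),null] or
# [new Date("2018/02/18"),159333]; capture the date and the value.
pattern = re.compile(r'\[new Date\("(\d{4}/\d{2}/\d{2})"\),\s*(\d+(?:\.\d+)?|null)\]')
records = pattern.findall(script_text)


def parse_record(record):
    # record is a (date, value) tuple of strings; value may be 'null'
    date, value = record
    return [date, value]

transactions = []

for record in records:
    transactions.append(parse_record(record))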
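
The [date, value] pairs collected above are plain strings, and missing days come through as the text 'null'. As a follow-up, here is a minimal sketch of turning them into typed values. It runs on a small hard-coded sample rather than on the live page; SAMPLE, RECORD and to_typed are names made up for this illustration, not part of the original answer.

```python
import re
from datetime import datetime

# Hypothetical sample in the same shape as the page's inline data.
SAMPLE = '[new Date("2009/01/03"),null],[new Date("2018/02/18"),159333]'

# Capture the date and the value (digits or the literal null) of each record.
RECORD = re.compile(r'\[new Date\("(\d{4}/\d{2}/\d{2})"\),\s*(\d+|null)\]')


def to_typed(date_text, value_text):
    """Convert one (date, value) string pair to (date, int or None)."""
    day = datetime.strptime(date_text, '%Y/%m/%d').date()
    count = None if value_text == 'null' else int(value_text)
    return day, count


typed = [to_typed(d, v) for d, v in RECORD.findall(SAMPLE)]
```

Mapping 'null' to None keeps days without data distinguishable from days with zero transactions, which matters if the series is later plotted or summed.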


Source: https://stackoverflow.com/questions/48862653/how-to-retrieve-data-from-html-canvas-using-python
