I can't get the whole source code of an HTML page

清酒与你 2021-01-22 19:47

Using Python, I want to crawl data from a web page whose source is quite big (it is a user's Facebook page).

Say the URL is the URL I am trying to crawl. I run the

2 answers
  •  误落风尘
    2021-01-22 20:08

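    The page probably runs JavaScript that generates part of its content, so the HTML you download is not the full page you see in a browser.
    Try twill.
    It is based on Mechanize. (Note: twill's own documentation says it does not interpret JavaScript, so check that it actually returns the content you need.)
    Sample in Python: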

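    from twill.commands import go, fv, submit, info, show

    go("http://google.com/")   # open the page
    fv("f", "q", "test")       # in form "f", set field "q" to "test"
    submit("btnG")             # submit via the "btnG" button
    info()                     # show page info
    show()                     # show the page's HTML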
    

    Another option is to use Zombie.js on Node.js.
    This library works even better than twill, and it is a headless solution: it needs no real browser.
    Sample in CoffeeScript:

    zombie = require "zombie"
    browser = new zombie()
    browser.visit "https://www.google.ru/", =>
        browser.fill "q", "node.js"
        browser.pressButton "Поиск в Google", ->
            for item in browser.queryAll "h3.r a"
                console.log item.innerHTML
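
Before switching tools, it can help to confirm that the missing content really is script-generated. The following is a minimal sketch using only the Python standard library: it counts `<script>` tags and the visible text in HTML you have already downloaded. The names `ScriptVsTextCounter`, `summarize_static_html`, and `looks_script_generated`, as well as the 200-character threshold, are illustrative assumptions, not part of twill or Zombie.js.

```python
from html.parser import HTMLParser


class ScriptVsTextCounter(HTMLParser):
    """Count <script> tags and visible text outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_tags += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text a reader would see, ignoring pure whitespace.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def summarize_static_html(html):
    """Return script-tag and visible-text counts for a downloaded page."""
    parser = ScriptVsTextCounter()
    parser.feed(html)
    parser.close()
    return {
        "script_tags": parser.script_tags,
        "visible_chars": parser.visible_chars,
    }


def looks_script_generated(html, min_visible_chars=200):
    """Heuristic (assumed threshold): scripts present but little visible text."""
    summary = summarize_static_html(html)
    return (
        summary["script_tags"] > 0
        and summary["visible_chars"] < min_visible_chars
    )
```

Pass in the HTML string your crawler already fetched. If `looks_script_generated` returns `True`, the page is likely filled in by JavaScript after loading, and a tool that executes scripts is the more promising route. This is only a rough heuristic, so treat a `False` result as inconclusive.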
    
