Screen scraping using Ghost.py

南笙酒味 submitted on 2019-12-07 16:00:27

It seems other people are reporting issues similar to yours without getting any real explanation (for example: https://github.com/jeanphix/Ghost.py/issues/26).

Change the evaluate call to the following, which follows the Ghost.py documentation:

    # evaluate() hands back a (result, resources) pair, so unpack it
    links, resources = gh.evaluate("""
        var links = document.querySelectorAll("a");
        var listRet = [];
        for (var i = 0; i < links.length; i++) {
            listRet.push(links[i].href);
        }
        listRet;
    """)

I was getting this error with every page I tried when I first got Ghost.py. I solved it by scrapping PyQt and installing PySide instead. That fixed it for me, anyway.

I had to add extra logic to the wait_for_page_loaded function in ghost.py:

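    # Under PyQt4, str(resource.url) can come back as "PyQt4.QtCore.QUrl(u'...')"
    reTmp = str(resource.url)
    if "PyQt4" in reTmp:
        # Strip the wrapper so only the bare URL is left
        reTmp = reTmp.replace("PyQt4.QtCore.QUrl(u'", "").replace("')", "")
    if url == reTmp:
        page = resource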

PyQt was wrapping resource.url in extra text, so the url == resource.url comparison never matched and the page was never picked up as loaded.
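The same cleanup can be pulled out into a small helper, which is a bit easier to test than chained replace calls. This is only a sketch: `normalize_resource_url` is a made-up name, not part of Ghost.py, and the pattern assumes the wrapper looks exactly like the `PyQt4.QtCore.QUrl(u'...')` text described above.

```python
import re

# Matches the wrapper text described above, e.g.
# "PyQt4.QtCore.QUrl(u'http://example.com/')"; the u prefix is optional
_QURL_REPR = re.compile(r"^PyQt4\.QtCore\.QUrl\(u?'(.*)'\)$")


def normalize_resource_url(raw):
    """Return the bare URL from a plain URL string or a PyQt4 QUrl repr."""
    text = str(raw)
    match = _QURL_REPR.match(text)
    # Anything that does not look like the wrapper is passed through unchanged
    return match.group(1) if match else text
```

Inside wait_for_page_loaded the comparison would then read `if url == normalize_resource_url(resource.url):`.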

ghost.py requires either PySide (preferred) or PyQt Qt bindings:

    pip install pyside
    pip install ghost.py --pre
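To see which binding is actually importable in your environment before running Ghost.py, a quick check along these lines may help. The function name `available_qt_bindings` is made up for illustration, the module names `PySide` and `PyQt4` are the ones mentioned in this thread, and `importlib.util.find_spec` means this assumes Python 3.4 or later.

```python
import importlib.util


def available_qt_bindings(candidates=("PySide", "PyQt4")):
    """Return the candidate module names that can be found, in the given order."""
    found = []
    for name in candidates:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is not None:
            found.append(name)
    return found


if __name__ == "__main__":
    bindings = available_qt_bindings()
    print("Importable Qt bindings:", bindings or "none")
```

If PySide is missing from the result, that would be the one to install first, since it is the preferred binding.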

Try installing PySide instead of PyQt. This worked for me.
