Saving html page from MATLAB web browser

Submitted by 心不动则不痛 on 2020-01-05 05:50:29

Question


Following this question, I get a message on the retrieved page: "Your browser does not support JavaScript so some functionality may be missing!"

If I open this page with web(url) in the MATLAB web browser and accept the certificate (once per session), the page opens properly.

How can I save the page source from the browser with a script? Or from the system browser? Or maybe there is a way to get the page without a browser at all?

url='https://cgwb.nci.nih.gov/cgi-bin/hgTracks?position=chr7:55054218-55242525';

Answer 1:


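From what I could tell, the page source gets downloaded just fine. The message sits inside a <noscript> element, so it is always present in the source and is only displayed when JavaScript is disabled. Just make sure to let JavaScript run when you open the saved page locally.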

[...]
<script type='text/javascript' src='../js/hgTracks.js'></script>
<noscript><b>Your browser does not support JavaScript so some functionality may be missing!</b></noscript>
[...]

Note that the solution you are using downloads only the web page itself, without any of the attached resources (images, .css, .js, etc.).

What you can do is call wget to get the page with all of its files:

url = 'https://cgwb.nci.nih.gov/cgi-bin/hgTracks?position=chr7:55054218-55242525';
% Quote the URL so the shell does not interpret '?' or other special characters
command = ['wget --no-check-certificate --page-requisites "' url '"'];
system( command );

If you are on a Windows machine, you can always get wget from the GnuWin32 project or from one of the many other implementations.
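
As a minimal sketch of how the wget call above could be checked from a script: system can return the exit status and the console output, so a failed download does not go unnoticed. The folder name saved_page and the use of --directory-prefix are assumptions added for illustration and are not part of the original answer. Note that wget may report a non-zero status when any single page requisite fails to download, even if the main page was saved.

```matlab
% Sketch: run wget from MATLAB and check whether it succeeded.
url    = 'https://cgwb.nci.nih.gov/cgi-bin/hgTracks?position=chr7:55054218-55242525';
outDir = 'saved_page';   % hypothetical output folder, not from the original answer

% Quote the URL so the shell does not interpret '?' or other special characters.
command = ['wget --no-check-certificate --page-requisites ' ...
           '--directory-prefix=' outDir ' "' url '"'];

% system returns the exit status and the captured console output.
[status, output] = system(command);
if status ~= 0
    warning('wget exited with status %d:\n%s', status, output);
else
    fprintf('Page and its requisites saved under "%s".\n', outDir);
end
```

wget recreates the host and path as subfolders under the chosen directory, so the saved HTML file ends up a few levels below saved_page.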




Answer 2:


Would saving cookies be sufficient to solve your problem? wget can do that with --keep-session-cookies and --save-cookies filename; you then pass --load-cookies filename to send the cookies back on subsequent requests. Something like the following (note that I have not tested this from MATLAB, so the quoting, etc., might not be exactly right, but I do use a similar shell construction in other contexts):

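% init_url is the URL that sets up the session (e.g. a login page); url is the page to fetch.
% MATLAB continues a line with "...", and the pieces of the command are concatenated as strings.
command_init = ['wget --no-check-certificate ' ...
                '--page-requisites ' ...
                '--keep-session-cookies ' ...
                '--save-cookies cookie_file.txt ' ...
                '--post-data "user=X&pass=Y&whatever=TRUE" ' ...
                '"' init_url '"'];
% Reuse the cookies saved by the first request
command_get  = ['wget --no-check-certificate ' ...
                '--page-requisites ' ...
                '--load-cookies cookie_file.txt ' ...
                '"' url '"'];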

If you don't have any POST data, and subsequent GET requests update the cookies instead, you can simply use --keep-session-cookies and --save-cookies on successive GET requests.
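
The GET-only variant just described might look roughly like the sketch below. It is untested against this particular site; the variable names cookieFile, baseCmd, cmdFirst and cmdNext are made up for illustration, and whether cookies alone are enough for the page to render is an assumption.

```matlab
% Sketch: keep and save cookies across successive GET requests (no POST data).
url        = 'https://cgwb.nci.nih.gov/cgi-bin/hgTracks?position=chr7:55054218-55242525';
cookieFile = 'cookie_file.txt';

% Options shared by every request; cookies are written to cookieFile each time.
baseCmd = ['wget --no-check-certificate --page-requisites ' ...
           '--keep-session-cookies --save-cookies ' cookieFile ' '];

% First request: no cookie file exists yet, so only save cookies.
cmdFirst = [baseCmd '"' url '"'];
% Later requests: load the saved cookies, then save any updates back.
cmdNext  = [baseCmd '--load-cookies ' cookieFile ' "' url '"'];

[status, output] = system(cmdFirst);
if status == 0
    [status, output] = system(cmdNext);
end
if status ~= 0
    warning('wget exited with status %d:\n%s', status, output);
end
```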



Source: https://stackoverflow.com/questions/2656624/saving-html-page-from-matlab-web-browser
