Using MATLAB to parse HTML for URL in anchors, help fast

Posted by 丶灬走出姿态 on 2019-12-11 03:34:16

Question


I'm on a strict time limit and really need a regex to parse anchors of this type (they are all in this format):

<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>

to extract the URL

20120620_0512_c2_1024.jpg

I know it's not a full URL; it's relative. Please help.

Here's my code so far:

year = datestr(now,'yyyy');
timestamp = datestr(now,'yyyymmdd');
html = urlread(['http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed/' year '/c2/' timestamp '/']);
links = regexprep(html, '<a href=.*?>', '');

Answer 1:


Try the following:

url = 'http://sohowww.nascom.nasa.gov/data/REPROCESSING/Completed/2012/c2/20120620/';
html = urlread(url);
t = regexp(html, '<a href="([^"]*\.jpg)">', 'tokens');
t = [t{:}]'

The resulting cell array (truncated):

t = 
    '20120620_0512_c2_1024.jpg'
    '20120620_0512_c2_512.jpg'
    ...
    '20120620_2200_c2_1024.jpg'
    '20120620_2200_c2_512.jpg'
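
Since the question notes that the extracted names are relative, a minimal sketch of turning them into full URLs may help. It assumes the names resolve against the directory URL that was fetched; the sample html string and the variable names baseUrl, names and fullUrls are illustrative and not part of the original answer.

```matlab
% Hypothetical fragment standing in for the urlread output,
% modelled on the anchor format shown in the question.
html = ['<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>' ...
        '<a href="20120620_0512_c2_512.jpg">20120620_0512_c2_512.jpg</a>'];
baseUrl = 'http://sohowww.nascom.nasa.gov/data/REPROCESSING/Completed/2012/c2/20120620/';

% Extract the relative file names with the same pattern as above.
t = regexp(html, '<a href="([^"]*\.jpg)">', 'tokens');
names = [t{:}]';

% strcat with a char vector and a cell array prepends the
% directory URL to every relative name and returns a cell array.
fullUrls = strcat(baseUrl, names);
```

Each element of fullUrls can then be passed to urlread or urlwrite; on newer MATLAB releases, webread and websave are the recommended replacements.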



Answer 2:


I think this is what you are looking for:

htmlLink = '<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>';

link = regexprep(htmlLink, '(<a href=")(.*\.jpg)(">.*</a>)', '$2');

link =
20120620_0512_c2_1024.jpg

regexprep also works on cell arrays of strings, so this works too:

htmlLinksCellArray = { '<a href="20120620_0512_c2_1024.jpg">20120620_0512_c2_102..&gt;</a>', '<a href="20120620_0512_c2_1025.jpg">20120620_0512_c2_102..&gt;</a>', '<a href="20120620_0512_c2_1026.jpg">20120620_0512_c2_102..&gt;</a>' };

linksCellArray = regexprep(htmlLinksCellArray, '(<a href=")(.*\.jpg)(">.*</a>)', '$2')

linksCellArray = 
'20120620_0512_c2_1024.jpg'  '20120620_0512_c2_1025.jpg'  '20120620_0512_c2_1026.jpg'
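
One caveat: the pattern above uses greedy .* quantifiers, which is fine when each string holds exactly one anchor, as in these examples. If the whole page is passed as a single string, a greedy match can run from the first anchor to the last. The sketch below illustrates this on a made-up two-anchor input and shows a lazy variant; the input string and the variable names greedy and lazy are hypothetical.

```matlab
% Two anchors in one string (hypothetical input).
html = ['<a href="a_1024.jpg">a</a>' '<a href="b_512.jpg">b</a>'];

% Greedy pattern from the answer: .* extends to the last ".jpg",
% so both anchors are consumed by a single match.
greedy = regexprep(html, '(<a href=")(.*\.jpg)(">.*</a>)', '$2');

% Lazy quantifiers (.*?) stop at the first possible match, so each
% anchor is replaced separately by its href followed by a space.
lazy = strtrim(regexprep(html, '<a href="(.*?\.jpg)">.*?</a>', '$1 '));
```

Here lazy holds the two file names separated by a space, while greedy still contains leftover markup between them. For whole-page input, the token-based regexp call in Answer 1 is probably the simpler choice.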


Source: https://stackoverflow.com/questions/11126721/using-matlab-to-parse-html-for-url-in-anchors-help-fast
