Parse HTML into a multidimensional array grouped by date using regex (data scraping)?

Submitted by 蹲街弑〆低调 on 2019-12-25 07:20:20

Question


I'm trying to group the results I get by date.

Please refer to my previous question: How to ignore http link in string and return everything else?

Right now I get the schedule list, but it doesn't include any dates. That makes it hard to tell which event goes live on which date and at what time. It confuses people, because the list shows the same time for several events that actually go live on different dates.

From the previous question I got a solution that works perfectly (thanks to Denomales for the solution!), but it doesn't capture the date.

Here's the solution regex:

<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?green['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(?:Stream\s*)?((?:(?!<\/font>).)*)<\/font>\s*[^<]*?([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2})[^<]*?<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?((?:(?!\s*https?:|<\/font>).)*)

And here's the sample data:

<font color="black" size="6">---</font><p>
<font color="red" size="6">FRIDAY 6TH SEPTEMBER</font><p>
<font color="gold"> *ENGLISH* </font> Some event with quotes, comma, slashes, dots and more 9.00pm-5.00pm <font color="red">Channel 18</font><p>
<font color="gold"> *ITALIAN* </font> Some event with quotes, comma, slashes, dots and more 9.50pm-10.00pm <font color="red">Channel 02</font><p>
<font color="gold"> *ENGLISH* </font> Some event with quotes, comma, slashes, dots and more 10:00AM-12:00pm <font color="red">Channel 05</font><p>
<font color="gold"> *JAPANESE* </font> Some Event Name 11.20am-1.20pm <font color="red">CHANNEL IP 2 STREAM http://domain.com/abc/channel2.html</font><p>
<font color="black" size="6">---</font><p>
<font color="red" size="6">FRIDAY 7TH SEPTEMBER</font><p>
<font color="gold"> *ENGLISH* </font> Some event with quotes, comma, slashes, dots and more 9.00pm-5.00pm <font color="red">Channel 18</font><p>
<font color="gold"> *ITALIAN* </font> Some event with quotes, comma, slashes, dots and more 9.50pm-10.00pm <font color="red">Channel 02</font><p>
<font color="gold"> *ENGLISH* </font> Some event with quotes, comma, slashes, dots and more 10:00AM-12:00pm <font color="red">Channel 05</font><p>
<font color="gold"> *JAPANESE* </font> Some Event Name 11.20am-1.20pm <font color="red">CHANNEL IP 2 STREAM http://domain.com/abc/channel2.html</font><p>

Now I'm trying to get each date heading (e.g. FRIDAY 6TH SEPTEMBER) in YYYY-MM-DD format, followed by the event schedule for that date.
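
To make the date conversion concrete, here is a minimal sketch in Python (used for illustration only; the regex itself is not Python-specific). The heading carries no year, so the year has to come from somewhere else; passing it in as a parameter is an assumption of this sketch, as are the names MONTHS, HEADING_RE and heading_to_iso.

```python
import re
from datetime import date

# Month names as they appear in the headings (upper-case in the sample data).
MONTHS = {name: number for number, name in enumerate(
    ["JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
     "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"], start=1)}

# Weekday name, day number with an ordinal suffix, then the month name.
HEADING_RE = re.compile(
    r"^\s*[A-Za-z]+\s+(\d{1,2})(?:ST|ND|RD|TH)\s+([A-Za-z]+)\s*$",
    re.IGNORECASE)


def heading_to_iso(heading, year):
    """Convert a heading such as 'FRIDAY 6TH SEPTEMBER' to 'YYYY-MM-DD'.

    Returns None when the text does not look like a date heading.
    The year is supplied by the caller because the heading has none.
    An impossible date (e.g. 31ST FEBRUARY) raises ValueError.
    """
    match = HEADING_RE.match(heading)
    if not match:
        return None
    month = MONTHS.get(match.group(2).upper())
    if month is None:
        return None
    return date(year, month, int(match.group(1))).isoformat()


print(heading_to_iso("FRIDAY 6TH SEPTEMBER", 2013))  # 2013-09-06
```

The weekday is matched but ignored, so a heading whose weekday does not agree with the calendar still converts.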

Expected output (example):

Array(
  ['2013-09-06'] => Array (
    [0] => Array (
      'language'   => 'ENGLISH',
      'title'      => 'Some event name',
      'startTime'  => '9:00pm',
      'endTime'    => '5:00pm',
      'channel'    => 'channel 18',
      'channelNum' => '18'
    ),
    [1] => Array (
      'language'   => 'ITALIAN',
      'title'      => 'Some event name',
      'startTime'  => '12:00pm',
      'endTime'    => '2:00pm',
      'channel'    => 'Channel IP 2',
      'channelNum' => '2'
    ),
    [2] => Array (
      'language'   => 'ENGLISH',
      'title'      => 'Some event name',
      'startTime'  => '6:00pm',
      'endTime'    => '8:00pm',
      'channel'    => 'channel 20',
      'channelNum' => '20'
    ),
  ),
  ['2013-09-07'] => Array (
    [0] => Array (
      'language'   => 'ENGLISH',
      'title'      => 'Some event name',
      'startTime'  => '9:00pm',
      'endTime'    => '5:00pm',
      'channel'    => 'channel 18',
      'channelNum' => '18'
    ),
    [1] => Array (
      'language'   => 'ITALIAN',
      'title'      => 'Some event name',
      'startTime'  => '12:00pm',
      'endTime'    => '2:00pm',
      'channel'    => 'Channel IP 2',
      'channelNum' => '2'
    ),
    [2] => Array (
      'language'   => 'ENGLISH',
      'title'      => 'Some event name',
      'startTime'  => '6:00pm',
      'endTime'    => '8:00pm',
      'channel'    => 'channel 20',
      'channelNum' => '20'
    ),
  ),
)

The example output is made up for illustration; it is not real data.

Can anyone help? I would really appreciate it.

Note: I don't want to use any HTML parsing libraries, so please don't recommend one unless your solution is much better than the regex I have right now.
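
For reference, the grouping I'm after could be sketched roughly like this in Python (for illustration only). It is written against the sample data above, where the language tag is gold and the channel tag is red, and it uses its own simplified patterns rather than the regex from the previous question. The year is passed in because the headings don't contain one, and all names here (group_by_date, HEADING_RE, EVENT_RE, normalise_time) are hypothetical.

```python
import re
from datetime import date

MONTHS = ["JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
          "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"]

# Date heading: red font with size="6", e.g. "FRIDAY 6TH SEPTEMBER".
HEADING_RE = re.compile(
    r'<font color="red" size="6">\s*\w+\s+(\d{1,2})(?:ST|ND|RD|TH)\s+(\w+)\s*</font>',
    re.IGNORECASE)

# Event line: gold language tag, free-text title, time range, red channel tag.
# A trailing "STREAM http://..." inside the channel tag is matched but dropped.
EVENT_RE = re.compile(
    r'<font color="gold">\s*\*(?P<language>[^*]+)\*\s*</font>\s*'
    r'(?P<title>.+?)\s+'
    r'(?P<start>\d{1,2}[.:]\d{2}\s*[ap]m)\s*-\s*(?P<end>\d{1,2}[.:]\d{2}\s*[ap]m)\s*'
    r'<font color="red">\s*(?P<channel>.*?)(?:\s+STREAM\s+https?://\S+?)?\s*</font>',
    re.IGNORECASE)


def normalise_time(text):
    """Turn '9.00pm' or '10:00AM' into '9:00pm' / '10:00am'."""
    return text.replace(".", ":").replace(" ", "").lower()


def group_by_date(html, year):
    """Return {'YYYY-MM-DD': [event, ...]} for line-oriented schedule HTML."""
    schedule = {}
    current = None  # ISO date of the most recent heading seen
    for line in html.splitlines():
        heading = HEADING_RE.search(line)
        if heading:
            # An unknown month name raises ValueError here.
            month = MONTHS.index(heading.group(2).upper()) + 1
            current = date(year, month, int(heading.group(1))).isoformat()
            schedule.setdefault(current, [])
            continue
        event = EVENT_RE.search(line)
        if event and current is not None:
            channel = event.group("channel")
            number = re.search(r"(\d+)\s*$", channel)  # trailing digits, if any
            schedule[current].append({
                "language": event.group("language").strip(),
                "title": event.group("title").strip(),
                "startTime": normalise_time(event.group("start")),
                "endTime": normalise_time(event.group("end")),
                "channel": channel,
                "channelNum": number.group(1) if number else None,
            })
    return schedule
```

Calling group_by_date(html, 2013) on the sample data gives one key per date heading, each holding the events listed under it. Events that appear before the first heading are skipped, and the sketch assumes one heading or event per line, as in the sample.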

Source: https://stackoverflow.com/questions/18664311/parse-html-and-get-multidimensional-array-with-date-wise-using-regex-scraping-d
