Crawling an HTML page using PHP?

时光说笑 2020-12-03 23:38

This website lists over 250 courses in one list. I want to get the name of each course and insert it into my MySQL database using PHP. The courses are listed like this:

5 answers
  • 2020-12-04 00:19

    Regular expressions work well.

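    // Fetch the page (URL as given in the other answers in this thread)
    $url = 'http://courses.westminster.ac.uk/CourseList.aspx';
    $page = file_get_contents($url);
    $lines = preg_split("/\r?\n/", $page);
    foreach ($lines as $text) {
        $matches = array();
        // Only a line consisting of a single <td>...</td> cell will match
        if (preg_match("/^<td>(.*)<\/td>$/", trim($text), $matches)) {
            // insert $matches[1] into the database
        }
    }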
    

    See the documentation for preg_match.
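
    As a variation on the loop above, preg_match_all can collect every cell in one pass instead of splitting the page into lines. This is only a sketch: the function name extractCells and the sample HTML are made up for illustration, and it assumes plain <td> tags with no attributes and no nested cells.

```php
<?php
// Hypothetical helper: returns the text of every <td>...</td> cell in $html.
// Assumes <td> tags carry no attributes and cells are not nested.
function extractCells(string $html): array
{
    $matches = [];
    // Non-greedy (.*?) stops at the first closing tag;
    // the "s" modifier lets "." match across newlines.
    preg_match_all('/<td>(.*?)<\/td>/s', $html, $matches);

    // $matches[1] holds the first capture group of every match.
    return array_map(function ($cell) {
        // Strip inner markup such as <a> links and surrounding whitespace.
        return trim(strip_tags($cell));
    }, $matches[1]);
}

// Made-up sample input, not the real course page.
$sample = "<tr><td><a href=\"#\">Accounting</a></td></tr>\n<tr><td>Biology</td></tr>";
print_r(extractCells($sample));
```

    If the real markup puts attributes on the <td> tags, this pattern will miss those cells, which is one reason a parser is usually the safer choice.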

  • 2020-12-04 00:27

    How to parse HTML has been asked and answered countless times before. While regular expressions will work for your specific use case, it is generally better and more reliable to use a proper parser for this task. Below is how to do it with DOM:

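    $dom = new DOMDocument;
    // Suppress warnings caused by malformed markup on real-world pages
    libxml_use_internal_errors(true);
    $dom->loadHTMLFile('http://courses.westminster.ac.uk/CourseList.aspx');
    foreach ($dom->getElementsByTagName('td') as $title) {
        // Print one cell per line
        echo trim($title->nodeValue), PHP_EOL;
    }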
    

    For inserting the data into MySQL, you should use the mysqli extension. Examples are plentiful on Stack Overflow, so please use the search function.
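
    For what it's worth, a minimal sketch of the DOM-plus-mysqli combination might look like the following. The helper names courseNamesFromHtml and insertCourses, the table courses, its column name, and the connection details are all assumptions made for illustration, not something taken from the question.

```php
<?php
// Hypothetical helper: pulls the trimmed text of every <td> out of an HTML string.
function courseNamesFromHtml(string $html): array
{
    // Collect parse warnings from imperfect HTML instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = [];
    foreach ($dom->getElementsByTagName('td') as $cell) {
        $name = trim($cell->nodeValue);
        if ($name !== '') {
            $names[] = $name;
        }
    }
    // Drop duplicates and reindex the array.
    return array_values(array_unique($names));
}

// Hypothetical helper: inserts each name through a prepared statement.
// Assumes a table `courses` with a text column `name` already exists.
function insertCourses(mysqli $db, array $names): void
{
    $stmt = $db->prepare('INSERT INTO courses (name) VALUES (?)');
    if ($stmt === false) {
        throw new RuntimeException($db->error);
    }
    foreach ($names as $name) {
        // Binding the value keeps it out of the SQL text (no manual escaping).
        $stmt->bind_param('s', $name);
        $stmt->execute();
    }
    $stmt->close();
}

// Usage with placeholder credentials (not run here):
// $db = new mysqli('localhost', 'user', 'password', 'mydb');
// $html = file_get_contents('http://courses.westminster.ac.uk/CourseList.aspx');
// insertCourses($db, courseNamesFromHtml($html));
```

    Keeping the parsing separate from the database code means the extraction step can be checked against a saved copy of the page before anything is written to MySQL.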

  • 2020-12-04 00:30

    You can use this HTML parsing PHP library to achieve this: http://simplehtmldom.sourceforge.net/

  • 2020-12-04 00:30

    I encountered the same problem. There is a good class library called Simple HTML DOM: http://simplehtmldom.sourceforge.net/. It works much like jQuery.

  • 2020-12-04 00:36

    Just for fun, here's a quick shell script to do the same thing.

    curl http://courses.westminster.ac.uk/CourseList.aspx \
    | sed '/<td>\(.*\)<\/td>/ { s/.*">\(.*\)<\/a>.*/\1/; b }; d;' \
    | uniq > courses.txt
    