Parsing specific data items from a website

Question

I am trying to retrieve the following data items from this webpage:

Address
City
State
Zip Code
Store Phone
Pharmacy Phone
Open Hours
Pharmacy Hours
Pickup Options
At this store/location
Site to Store Hours

I tried the approach below, but I can't separate out some of the data to store in the variables above, so I need some help and suggestions from a PHP expert.

$html = file_get_html('http://www.walmart.com/storeLocator/ca_storefinder_results.do?serviceName=&rx_title=com.wm.www.apps.storelocator.page.serviceLink.title.default&rx_dest=%2Findex.gsp&sfrecords=50&sfsearch_single_line_address=K6T');
foreach ($html->find('div[class=StoreAddress] div[1]') as $name)
{
    echo $name->innertext . '<br>';
}

The HTML of this website makes it hard to identify each data item by its tag, because there are no proper ids assigned to the tags. Can anyone please suggest an easy and scalable way to parse the above data items from this website?

Thanks

Answer 1:

The HTML isn't really that complex. PHP's iterators and DOM/regex functions are clumsy for tasks like this, but it can be done:

$dom = new DOMDocument();
// @ suppresses the warnings libxml raises for malformed HTML
@$dom->loadHTMLFile('http://www.walmart.com/storeLocator/ca_storefinder_details_short.do?rx_dest=/index.gsp&rx_title=com.wm.www.apps.storelocator.page.serviceLink.title.default&edit_object_id=2092&sfsearch_single_line_address=K6T');
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//div[@class="StoreAddress"]') as $div) {
  // title
  echo $xpath->query(".//div[1]", $div)->item(0)->nodeValue . "\n";
  // street
  echo $xpath->query(".//div[2]", $div)->item(0)->nodeValue . "\n";
  // city, state and zip; skip the output if the line does not match
  if (preg_match('/(.*), ([A-Z]{2}) (\d{5})/', $xpath->query(".//div[3]", $div)->item(0)->nodeValue, $m)) {
    // city
    echo $m[1] . "\n";
    // state
    echo $m[2] . "\n";
    // zip
    echo $m[3] . "\n";
  }
}
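
The regular expression above expects a five-digit ZIP code. The search term K6T looks like a Canadian postal-code prefix, so the third line might instead read something like "Cornwall, ON K6H 5R6". Here is a minimal sketch that accepts both forms, assuming those two line formats. `parseCityLine()` is a hypothetical helper, and the sample input is made up for illustration.

```php
<?php
/**
 * Split a "City, ST 12345" or "City, ON K6H 5R6" line into its parts.
 * Hypothetical helper; both line formats are assumptions about the page.
 * Returns null when the line matches neither format.
 */
function parseCityLine(string $line): ?array
{
    // City, then a two-letter state/province, then either a US ZIP
    // (optionally ZIP+4) or a Canadian postal code such as "K6H 5R6".
    $pattern = '/^\s*(.+?),\s*([A-Z]{2})\s+'
             . '(\d{5}(?:-\d{4})?|[A-Z]\d[A-Z]\s?\d[A-Z]\d)\s*$/';

    if (!preg_match($pattern, $line, $m)) {
        return null;
    }

    return ['city' => $m[1], 'state' => $m[2], 'zip' => $m[3]];
}

// Sample input, made up for illustration.
print_r(parseCityLine('Cornwall, ON K6H 5R6'));
```

Returning null on a failed match lets the caller decide what to do with an unexpected line, rather than reading undefined match groups.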

Answer 2:

I see that they put an hr tag right before the address. Explode the HTML on the hr tag and use the remaining part, which contains the address, to rebuild the HTML object. Then iterate through the divs and use preg_match to check whether each one contains a reference to the data you want.

foreach ($html->find('div') as $test)
    {
    if (preg_match('/Address/', $test->innertext))
        {
        // extract the address from $test->innertext here
        }
    }
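
The split-then-scan idea can be sketched with PHP's built-in DOM extension instead of simple_html_dom. This is only an illustration under stated assumptions: the sample markup, the "Label: value" layout, and the function name `extractFields()` are all made up, and the real page will differ.

```php
<?php
/**
 * Keep the part of $html after the first <hr>, then collect every div
 * whose text looks like "Label: value" for one of the wanted labels.
 * Hypothetical helper; the "Label: value" layout is an assumption.
 */
function extractFields(string $html, array $labels): array
{
    // Split once on the first <hr> tag and keep what follows it.
    $parts = preg_split('/<hr\b[^>]*>/i', $html, 2);
    $fragment = $parts[1] ?? $parts[0];

    $dom = new DOMDocument();
    // @ suppresses libxml warnings about imperfect markup.
    @$dom->loadHTML('<html><body>' . $fragment . '</body></html>');

    // Build an alternation such as "Address|Store Phone".
    $quoted = array_map(fn($label) => preg_quote($label, '/'), $labels);
    $alternatives = implode('|', $quoted);

    $fields = [];
    foreach ($dom->getElementsByTagName('div') as $div) {
        $pattern = '/^\s*(' . $alternatives . '):\s*(.+?)\s*$/';
        if (preg_match($pattern, $div->textContent, $m)) {
            $fields[$m[1]] = $m[2];
        }
    }
    return $fields;
}

// Sample markup, made up for illustration.
$sample = '<div>Header</div><hr/>'
        . '<div>Address: 1 Example Street</div>'
        . '<div>Store Phone: (613) 555-0100</div>';

print_r(extractFields($sample, ['Address', 'Store Phone']));
```

Matching on the visible label rather than on the position of a div keeps the code working if the page adds or reorders blocks, at the cost of breaking if the label text itself changes.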

Answer 3:

Try out the simple_html_dom library. Its page has straightforward examples that will get you up to speed.

I have been using it successfully for exactly the kind of thing you are trying to do.

Source: https://stackoverflow.com/questions/10762051/parsing-specific-data-items-from-website

Tags

php

parsing

screen-scraping

web-scraping