Question
I'm parsing an external document and making all of the links in it absolute. For instance:
<link rel="stylesheet" type="text/css" href="/css/style.css" />
would be replaced with:
<link rel="stylesheet" type="text/css" href="http://www.hostsite.com/css/style.css" />
where http://www.hostsite.com is the base URL for the document.
This is what I've tried and failed at:
$linkfix1 = str_replace('href=\"\/', 'href=\"$url\/', $code);
There are several questions on the site related to doing this replacement on a single URL string, but I couldn't find any that work on URLs embedded in a document. Are there any good suggestions on how to make all these links absolute?
Answer 1:
You don't need to escape double quotes in a string that uses single quotes.
You also don't need to escape forward slashes at all.
You simply want:
str_replace('href="', 'href="http://hostsite.com', $replace_me);
To be safe, so that you don't prefix every link (including ones that are already absolute) with hostsite, match a more specific path:
str_replace('href="/css/', 'href="http://hostsite.com/css/', $replace_me);
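As a minimal sketch of the quoting points above: the $code sample and the base URL come from the question, while $broken and $fixed are just illustrative names. It compares the attempted call with one that uses plain quotes and concatenation, and it only targets root-relative links written with double quotes.

```php
<?php
$url  = 'http://www.hostsite.com';
$code = '<link rel="stylesheet" type="text/css" href="/css/style.css" />';

// Inside single quotes, \" and \/ stay literal backslash sequences and
// $url is not interpolated, so the search string never occurs in $code.
$broken = str_replace('href=\"\/', 'href=\"$url\/', $code);

// Plain quotes in the search string, concatenation for the variable.
// Caveat: this would also rewrite protocol-relative links (href="//cdn...").
$fixed = str_replace('href="/', 'href="' . $url . '/', $code);

echo $fixed, "\n";
```

Here $broken ends up identical to $code, which is why the original attempt appeared to do nothing.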
Answer 2:
Public service announcement: do not use regexes to rewrite elements of a formatted document.
The correct way to do this is to load the document as an entity (either DOMDocument or SimpleXMLElement) and do your processing based on nodes and values. The original solution also didn't handle src attributes or resolution of base-relative URLs (e.g. /css/style.css).
Here's a mostly proper solution that could be expanded upon if need be:
# Example URL
$url = "http://www.stackoverflow.com/";
# Get the root and current directory
$pattern = "/(.*\/\/[^\/]+\/)([^?#]*\/)?/";
/* The pattern has two groups: one for the root (the scheme, the two
slashes, the host, and the slash that follows it) and one for the
current directory (everything up to the last slash before any query
string or anchor). For http://stackoverflow.com/question/123412341234
this yields:
- [0]: http://stackoverflow.com/question/
- [1]: http://stackoverflow.com/
- [2]: question/
We only need [0] (the entire match) and [1] (just the first group).
*/
$matches = array();
preg_match($pattern, $url, $matches);
$cd = $matches[0];
$root = $matches[1];
# Normalizes the URL on the provided element's attribute
function normalizeAttr($element, $attr){
    global $pattern, $cd, $root;
    # Skip elements that lack the attribute (e.g. an inline <script>)
    if(!$element->hasAttribute($attr))
        return;
    $href = $element->getAttribute($attr);
    # If this is an external URL, has a scheme (mailto:, http://host), is
    # protocol-relative, or is only a fragment, ignore
    if(preg_match($pattern, $href) || preg_match('~^([a-z][a-z0-9+.-]*:|//|#)~i', $href))
        return;
    # If this is a base-relative URL, prepend the root
    elseif(substr($href, 0, 1) == '/')
        $element->setAttribute($attr, $root . substr($href, 1));
    # If this is a relative URL, prepend the current directory
    elseif(substr($href, 0, strlen($cd)) != $cd)
        $element->setAttribute($attr, $cd . $href);
}
# Load in the data, ignoring HTML5 errors
$page = new DOMDocument();
libxml_use_internal_errors(true);
$page->loadHTMLFile($url);
libxml_use_internal_errors(false);
$page->normalizeDocument();
# Normalize <link href="..."/>
foreach($page->getElementsByTagName('link') as $link)
normalizeAttr($link, 'href');
# Normalize <a href="...">...</a>
foreach($page->getElementsByTagName('a') as $anchor)
normalizeAttr($anchor, 'href');
# Normalize <img src="..."/>
foreach($page->getElementsByTagName('img') as $image)
normalizeAttr($image, 'src');
# Normalize <script src="..."></script>
foreach($page->getElementsByTagName('script') as $script)
normalizeAttr($script, 'src');
# Render normalized data
print $page->saveHTML();
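For trying the node-based approach without fetching a remote page, here is a rough, self-contained sketch that works on an HTML string. It is not the code above: resolveUrl() and absolutizeHtml() are hypothetical helper names, it uses parse_url() instead of the regex, and it assumes PHP 7 or later with the DOM extension. It deliberately ignores `..` segments, query-only links, and a <base> tag, so treat it as a starting point rather than a full URL resolver.

```php
<?php
// Hypothetical helper: resolve $href against $base (an absolute URL).
function resolveUrl(string $href, string $base): string
{
    // Empty, already has a scheme (http:, mailto:, ...), or a fragment: keep.
    if ($href === '' || preg_match('~^([a-z][a-z0-9+.-]*:|#)~i', $href)) {
        return $href;
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host']
           . (isset($parts['port']) ? ':' . $parts['port'] : '');
    // Protocol-relative (//host/path): reuse the scheme of the base URL.
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }
    // Root-relative (/css/style.css): prepend scheme and host.
    if ($href[0] === '/') {
        return $root . $href;
    }
    // Relative (about.html): prepend the directory part of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

// Hypothetical helper: rewrite every href and src attribute in $html.
function absolutizeHtml(string $html, string $base): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate HTML5 tags
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors(false);

    $xpath = new DOMXPath($doc);
    foreach (['href', 'src'] as $attr) {
        // Select only elements that actually carry the attribute.
        foreach ($xpath->query('//*[@' . $attr . ']') as $el) {
            $el->setAttribute($attr, resolveUrl($el->getAttribute($attr), $base));
        }
    }
    return $doc->saveHTML();
}

$html = '<html><head><link rel="stylesheet" href="/css/style.css"></head>'
      . '<body><a href="about.html">About</a>'
      . '<img src="http://cdn.example.com/a.png"></body></html>';
$out = absolutizeHtml($html, 'http://www.hostsite.com/blog/post.html');
echo $out;
```

With this sample input, /css/style.css should come out as http://www.hostsite.com/css/style.css and about.html as http://www.hostsite.com/blog/about.html, while the already-absolute image URL is left alone. Selecting by attribute with XPath avoids listing each tag name, at the cost of also touching any other element that carries href or src.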
Source: https://stackoverflow.com/questions/15394198/fixing-relative-links-in-php