How to find the final destination (url) of an ad (programmatically)

Submitted by 爷,独闯天下 on 2019-12-04 12:09:38

Sample PHP Implementation:

$k = curl_init('http://goo.gl');
curl_setopt($k, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($k, CURLOPT_USERAGENT, 
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 ' .
'(KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7'); // imitate chrome
curl_setopt($k, CURLOPT_NOBODY, true); // HEAD request only (faster)
curl_setopt($k, CURLOPT_RETURNTRANSFER, true); // don't echo results
curl_exec($k);
$final_url = curl_getinfo($k, CURLINFO_EFFECTIVE_URL); // get last URL followed
curl_close($k);
echo $final_url;

This should print something like https://www.google.com/accounts/ServiceLogin?service=urlshortener&continue=http://goo.gl/?authed%3D1&followup=http://goo.gl/?authed%3D1&passive=true&go=true

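Note: If certificate verification fails while following redirects across HTTPS/SSL, you can use curl_setopt() to turn off CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER. Doing so disables certificate checking entirely, so only do it if you accept that risk.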

curl --head -L -s -o /dev/null -w '%{url_effective}' <some-short-url>

  • --head restricts curl to HEAD requests, so that you don't have to actually download the pages
  • -L tells curl to keep following redirects
  • -s suppresses the progress meter and error messages
  • -o /dev/null tells curl to throw away the headers retrieved (we don't care about them)
  • -w '%{url_effective}' tells curl to write the last URL fetched to stdout

The result will be that the effective url is written to stdout, and nothing else.

You're talking about following the redirection of the URL until it either times out, gets into a loop or resolves to a final address.

The Net::HTTP library has a Following Redirection example.
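
As a rough illustration of that idea, here is a minimal sketch of a redirect follower built on Net::HTTP. The method name final_url, the limit of 10 hops, the 5-second timeouts and the fetch parameter are all assumptions made for this example (fetch exists only so the logic can be exercised without a network); it is not the example from the Net::HTTP documentation.

```ruby
require 'net/http'
require 'uri'

# Follow HTTP redirects starting at +url+ and return the final URL as a string.
# +fetch+ is a hypothetical hook: any callable that takes a URI and returns a
# Net::HTTPResponse. By default a HEAD request with short timeouts is issued.
def final_url(url, limit: 10, fetch: nil)
  fetch ||= lambda do |uri|
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https',
                    open_timeout: 5, read_timeout: 5) do |http|
      http.head(uri.request_uri)
    end
  end

  uri = URI.parse(url)
  seen = {}
  limit.times do
    # Stop if we have already visited this URL (redirect loop).
    raise "redirect loop at #{uri}" if seen[uri.to_s]
    seen[uri.to_s] = true

    response = fetch.call(uri)
    return uri.to_s unless response.is_a?(Net::HTTPRedirection)

    location = response['location']
    return uri.to_s if location.nil?    # 3xx without a Location header
    uri = URI.join(uri.to_s, location)  # Location may be relative
  end
  raise "too many redirects (limit #{limit})"
end
```

Calling final_url('http://goo.gl') would then return the last URL reached, or raise if the chain loops or exceeds the hop limit. Some servers answer HEAD differently from GET, so swapping the default for a GET request may be necessary.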

Also, Ruby's open-uri module follows redirects automatically, so after you retrieve a page you can ask it for the URL where it landed.

require 'open-uri'

io = URI.open('http://google.com') # Kernel#open no longer accepts URLs as of Ruby 3.0
body = io.read
io.base_uri.to_s # => "http://www.google.com/"

Notice that base_uri reports http://www.google.com/, the URL the request was redirected to, rather than the one originally requested.

Both approaches handle only server-side (HTTP) redirects. For meta-refresh redirects you'll have to parse the HTML, find the URL it redirects to and request that yourself.

This will get you started:

require 'nokogiri'

doc = Nokogiri::HTML('<meta http-equiv="REFRESH" content="0;url=http://www.the-domain-you-want-to-redirect-to.com">')

meta = doc.at('meta[http-equiv="REFRESH"]')
# capture everything after "url=" so that URLs containing "=" are not truncated
redirect_url = meta && meta['content'][/url\s*=\s*(.+)/i, 1]
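
A meta-refresh target can also be relative, in which case it has to be resolved against the URL the page came from. The following is a dependency-free sketch of that step; the method name meta_refresh_target is made up for this example, and it uses regular expressions instead of Nokogiri purely to stay within the standard library, so it will miss unusual markup that a real HTML parser would handle.

```ruby
require 'uri'

# Return the absolute target of a <meta http-equiv="refresh"> tag in +html+,
# or nil if there is none. +base_url+ is the URL the page was fetched from.
def meta_refresh_target(html, base_url)
  html.scan(/<meta\b[^>]*>/i).each do |tag|
    next unless tag =~ /http-equiv\s*=\s*["']?refresh\b/i

    # The content attribute looks like "0;url=http://example.com/".
    content = tag[/content\s*=\s*"([^"]*)"/i, 1] ||
              tag[/content\s*=\s*'([^']*)'/i, 1]
    next unless content

    target = content[/url\s*=\s*(.+)/i, 1]
    next unless target

    # Drop optional quotes around the URL, then resolve relative targets.
    target = target.strip.gsub(/\A['"]|['"]\z/, '')
    return URI.join(base_url, target).to_s
  end
  nil
end
```

The returned URL can then be fed back into whatever follows the server-side redirects, repeating until neither kind of redirect is found.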

cURL can retrieve HTTP headers. Keep stepping through the chain until a response no longer carries a Location: header; the last Location: value you received is the final URL.
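
To make the header-stepping idea concrete, here is a small sketch that walks a raw header dump and tracks each Location: value in turn. It assumes the dump was captured beforehand, for instance from curl -s -I -L <url>, which prints the headers of every hop; the method name last_location is invented for this example.

```ruby
require 'uri'

# Given the URL originally requested and the raw response headers of every hop,
# return the URL that the last Location: header points to. If no Location:
# header is present, the starting URL is returned unchanged.
def last_location(start_url, raw_headers)
  current = start_url
  raw_headers.each_line do |line|
    # Header names are case-insensitive; lines end in "\r\n".
    value = line[/\Alocation:\s*(.+?)\s*\z/i, 1]
    # A Location value may be relative, so resolve it against the current URL.
    current = URI.join(current, value).to_s if value
  end
  current
end
```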

The Mechanize gem is handy for this:

  require 'mechanize'

  # url is the address whose final destination you want to find
  agent = Mechanize.new {|a| a.user_agent_alias = 'Windows IE 7'}
  page = agent.get(url)
  final_url = page.uri.to_s

The solution I ended up using was to simulate a browser, load the ad, and click it. The click was the key ingredient. The solutions offered by others worked for a given URL but would not handle Flash, JavaScript, etc. I appreciate everyone's help.
