Easy way to parse a url in C++ cross platform?

Anonymous (unverified), submitted on 2019-12-03 02:11:02

Question:

I need to parse a URL to get the protocol, host, path, and query in an application I am writing in C++. The application is intended to be cross-platform. I'm surprised I can't find anything that does this in the Boost or POCO libraries. Is it somewhere obvious that I'm not looking? Any suggestions for appropriate open-source libraries? Or is this something I just have to do myself? It's not super complicated, but it seems like such a common task that I'm surprised there isn't a common solution.

Answer 1:

There is a library that's proposed for Boost inclusion and allows you to parse HTTP URIs easily. It uses Boost.Spirit and is also released under the Boost Software License. The library is cpp-netlib; its documentation is at http://cpp-netlib.github.com/ and you can download the latest release from http://github.com/cpp-netlib/cpp-netlib/downloads .

The relevant type you'll want to use is boost::network::http::uri, which is covered in the cpp-netlib documentation.



Answer 2:

Terribly sorry, couldn't help it. :s

url.hh

    #ifndef URL_HH_
    #define URL_HH_
    #include <string>
    struct url {
        url(const std::string& url_s); // omitted copy, ==, accessors, ...
    private:
        void parse(const std::string& url_s);
    private:
        std::string protocol_, host_, path_, query_;
    };
    #endif /* URL_HH_ */

url.cc

    #include "url.hh"
    #include <string>
    #include <algorithm>
    #include <cctype>
    #include <iterator>   // back_inserter
    using namespace std;

    // ctors, copy, equality, ...

    void url::parse(const string& url_s)
    {
        // Lambda instead of ptr_fun, which was removed in C++17.
        const auto lower = [](unsigned char c) { return static_cast<char>(tolower(c)); };

        const string prot_end("://");
        string::const_iterator prot_i = search(url_s.begin(), url_s.end(),
                                               prot_end.begin(), prot_end.end());
        protocol_.reserve(distance(url_s.begin(), prot_i));
        transform(url_s.begin(), prot_i,
                  back_inserter(protocol_),
                  lower); // protocol is icase
        if( prot_i == url_s.end() )
            return;
        advance(prot_i, prot_end.length());
        string::const_iterator path_i = find(prot_i, url_s.end(), '/');
        host_.reserve(distance(prot_i, path_i));
        transform(prot_i, path_i,
                  back_inserter(host_),
                  lower); // host is icase
        string::const_iterator query_i = find(path_i, url_s.end(), '?');
        path_.assign(path_i, query_i);
        if( query_i != url_s.end() )
            ++query_i;   // skip the '?'
        query_.assign(query_i, url_s.end());
    }

main.cc

    // ...
    url u("HTTP://stackoverflow.com/questions/2616011/parse-a.py?url=1");
    cout << u.protocol() << '\t' << u.host() << ...


Answer 3:

A std::wstring version of the above, with the other fields I needed added. It could definitely be refined, but it is good enough for my purposes.

    #include <string>
    #include <algorithm>    // find

    struct Uri
    {
    public:
        std::wstring QueryString, Path, Protocol, Host, Port;

        static Uri Parse(const std::wstring &uri)
        {
            Uri result;

            typedef std::wstring::const_iterator iterator_t;

            if (uri.length() == 0)
                return result;

            iterator_t uriEnd = uri.end();

            // get query start
            iterator_t queryStart = std::find(uri.begin(), uriEnd, L'?');

            // protocol
            iterator_t protocolStart = uri.begin();
            iterator_t protocolEnd = std::find(protocolStart, uriEnd, L':');            //"://");

            if (protocolEnd != uriEnd)
            {
                std::wstring prot = &*(protocolEnd);
                if ((prot.length() > 3) && (prot.substr(0, 3) == L"://"))
                {
                    result.Protocol = std::wstring(protocolStart, protocolEnd);
                    protocolEnd += 3;   //      ://
                }
                else
                    protocolEnd = uri.begin();  // no protocol
            }
            else
                protocolEnd = uri.begin();  // no protocol

            // host
            iterator_t hostStart = protocolEnd;
            iterator_t pathStart = std::find(hostStart, uriEnd, L'/');  // get pathStart

            iterator_t hostEnd = std::find(protocolEnd,
                (pathStart != uriEnd) ? pathStart : queryStart,
                L':');  // check for port

            result.Host = std::wstring(hostStart, hostEnd);

            // port
            if ((hostEnd != uriEnd) && ((&*(hostEnd))[0] == L':'))  // we have a port
            {
                hostEnd++;
                iterator_t portEnd = (pathStart != uriEnd) ? pathStart : queryStart;
                result.Port = std::wstring(hostEnd, portEnd);
            }

            // path
            if (pathStart != uriEnd)
                result.Path = std::wstring(pathStart, queryStart);

            // query
            if (queryStart != uriEnd)
                result.QueryString = std::wstring(queryStart, uri.end());

            return result;

        }   // Parse
    };  // uri

Tests/Usage

    Uri u0 = Uri::Parse(L"http://localhost:80/foo.html?&q=1:2:3");
    Uri u1 = Uri::Parse(L"https://localhost:80/foo.html?&q=1");
    Uri u2 = Uri::Parse(L"localhost/foo");
    Uri u3 = Uri::Parse(L"https://localhost/foo");
    Uri u4 = Uri::Parse(L"localhost:8080");
    Uri u5 = Uri::Parse(L"localhost?&foo=1");
    Uri u6 = Uri::Parse(L"localhost?&foo=1:2:3");

    u0.QueryString, u0.Path, u0.Protocol, u0.Host, u0.Port....


Answer 4:

For completeness, there is one written in C that you could use (with a little wrapping, no doubt): http://uriparser.sourceforge.net/

[RFC-compliant and supports Unicode]


Here's a very basic wrapper I've been using for simply grabbing the results of a parse.

    #include <string>
    #include <uriparser/Uri.h>

    namespace uriparser
    {
        class Uri //: boost::noncopyable
        {
            public:
                Uri(std::string uri)
                    : uri_(uri)
                {
                    UriParserStateA state_;
                    state_.uri = &uriParse_;
                    isValid_   = uriParseUriA(&state_, uri_.c_str()) == URI_SUCCESS;
                }

                ~Uri() { uriFreeUriMembersA(&uriParse_); }

                bool isValid() const { return isValid_; }

                std::string scheme()   const { return fromRange(uriParse_.scheme); }
                std::string host()     const { return fromRange(uriParse_.hostText); }
                std::string port()     const { return fromRange(uriParse_.portText); }
                std::string path()     const { return fromList(uriParse_.pathHead, "/"); }
                std::string query()    const { return fromRange(uriParse_.query); }
                std::string fragment() const { return fromRange(uriParse_.fragment); }

            private:
                std::string uri_;
                UriUriA     uriParse_;
                bool        isValid_;

                std::string fromRange(const UriTextRangeA & rng) const
                {
                    return std::string(rng.first, rng.afterLast);
                }

                std::string fromList(UriPathSegmentA * xs, const std::string & delim) const
                {
                    UriPathSegmentStructA * head(xs);
                    std::string accum;

                    while (head)
                    {
                        accum += delim + fromRange(head->text);
                        head = head->next;
                    }

                    return accum;
                }
        };
    }


Answer 5:

POCO's URI class can parse URLs for you. The following example is a shortened version of the one in the POCO URI and UUID slides:

    #include "Poco/URI.h"
    #include <iostream>

    int main(int argc, char** argv)
    {
        Poco::URI uri1("http://www.appinf.com:88/sample?example-query#frag");

        std::string scheme(uri1.getScheme()); // "http"
        std::string auth(uri1.getAuthority()); // "www.appinf.com:88"
        std::string host(uri1.getHost()); // "www.appinf.com"
        unsigned short port = uri1.getPort(); // 88
        std::string path(uri1.getPath()); // "/sample"
        std::string query(uri1.getQuery()); // "example-query"
        std::string frag(uri1.getFragment()); // "frag"
        std::string pathEtc(uri1.getPathEtc()); // "/sample?example-query#frag"

        return 0;
    }


Answer 6:

The Poco library now has a class for dissecting URIs and returning the host, path segments, query string, etc.

http://www.appinf.com/docs/poco/Poco.URI.html



Answer 7:

Facebook's Folly library can do the job for you easily. Simply use the Uri class:

    #include <folly/Uri.h>

    int main() {
        folly::Uri uri("https://code.facebook.com/posts/177011135812493/");

        uri.scheme(); // https
        uri.host();   // code.facebook.com
        uri.path();   // /posts/177011135812493/
    }


Answer 8:

    // sudo apt-get install libboost-all-dev   # install boost
    // g++ urlregex.cpp -lboost_regex          # compile
    #include <string>
    #include <iostream>
    #include <boost/regex.hpp>

    using namespace std;

    int main(int argc, char* argv[])
    {
        string url="https://www.google.com:443/webhp?gws_rd=ssl#q=cpp";
        boost::regex ex("(http|https)://([^/ :]+):?([^/ ]*)(/?[^ #?]*)\\x3f?([^ #]*)#?([^ ]*)");
        boost::cmatch what;
        if(regex_match(url.c_str(), what, ex))
        {
            cout << "protocol: " << string(what[1].first, what[1].second) << endl;
            cout << "domain:   " << string(what[2].first, what[2].second) << endl;
            cout << "port:     " << string(what[3].first, what[3].second) << endl;
            cout << "path:     " << string(what[4].first, what[4].second) << endl;
            cout << "query:    " << string(what[5].first, what[5].second) << endl;
            cout << "fragment: " << string(what[6].first, what[6].second) << endl;
        }
        return 0;
    }
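If pulling in Boost is not an option, the same idea can probably be expressed with std::regex from the standard library. The following is only a sketch, assuming a C++11 compiler; `UrlParts` and `parse_url` are hypothetical names made up for this example, and the pattern is the one from the Boost version with `\x3f` written as `\?`. Like the original, it only accepts http and https URLs.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch (not part of any library).
struct UrlParts {
    std::string protocol, domain, port, path, query, fragment;
};

// Hypothetical helper: returns true and fills `out` if `url` matches the pattern.
bool parse_url(const std::string& url, UrlParts& out)
{
    // Same pattern as the Boost.Regex version, with \x3f spelled as \?.
    static const std::regex ex(
        R"re((http|https)://([^/ :]+):?([^/ ]*)(/?[^ #?]*)\??([^ #]*)#?([^ ]*))re");

    std::smatch what;
    if (!std::regex_match(url, what, ex))
        return false;

    out.protocol = what[1].str();
    out.domain   = what[2].str();
    out.port     = what[3].str();   // empty when no port is given
    out.path     = what[4].str();
    out.query    = what[5].str();   // without the leading '?'
    out.fragment = what[6].str();   // without the leading '#'
    return true;
}
```

Calling `parse_url` on the URL from the example above should fill the six fields with the same values the Boost program prints.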


Answer 9:

Qt has QUrl for this. GNOME has SoupURI in libsoup, which you'll probably find a little more lightweight.



Answer 10:

Also of interest could be http://code.google.com/p/uri-grammar/ which, like Dean Michael's cpp-netlib, uses Boost.Spirit to parse a URI. I came across it in the question "Simple expression parser example using Boost::Spirit?"



Answer 11:

There is the newly released google-url lib:

http://code.google.com/p/google-url/

The library provides a low-level url parsing API as well as a higher-level abstraction called GURL. Here's an example using that:

    #include <googleurl/src/gurl.h>   // forward slashes keep the include portable

    wchar_t url[] = L"http://www.facebook.com";
    GURL parsedUrl(url);
    assert(parsedUrl.DomainIs("facebook.com"));

Two small complaints I have with it: (1) it wants to use ICU by default to deal with different string encodings and (2) it makes some assumptions about logging (but I think they can be disabled). In other words, the library is not completely stand-alone as it exists, but I think it's still a good basis to start with, especially if you are already using ICU.



Answer 12:

This library is very tiny and lightweight: https://github.com/corporateshark/LUrlParser

However, it is parsing only, no URL normalization/validation.
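To give a sense of what normalization would add on top of a parsing-only library, here is a minimal sketch, assuming the scheme, host, and port have already been extracted by some parser. None of these functions belong to LUrlParser; `to_lower_ascii`, `normalize_port`, and `normalize_origin` are hypothetical helpers, and lower-casing the scheme and host and dropping a default port are only two of the normalization steps described in RFC 3986.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: lower-case an ASCII string.
// Scheme and host are case-insensitive, so this is a safe normalization for them.
std::string to_lower_ascii(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: drop the port when it is the default for the scheme.
// Only http and https are handled in this sketch; `scheme` must already be lower-case.
std::string normalize_port(const std::string& scheme, const std::string& port)
{
    if ((scheme == "http" && port == "80") || (scheme == "https" && port == "443"))
        return "";
    return port;
}

// Hypothetical helper: rebuild "scheme://host[:port]" in normalized form.
std::string normalize_origin(const std::string& scheme,
                             const std::string& host,
                             const std::string& port)
{
    const std::string s = to_lower_ascii(scheme);
    const std::string p = normalize_port(s, port);
    std::string out = s + "://" + to_lower_ascii(host);
    if (!p.empty())
        out += ":" + p;
    return out;
}
```

The path and query are left out on purpose, since they are case-sensitive and need different treatment (for example, percent-encoding rules).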



Answer 13:

I was looking for an easy standalone URI library for C++ too. Being unable to find one, I took the URI class from Poco, recommended in this topic, and made it independent by making a few modifications to the original source files. It consists of only 2 source files and doesn't require any external libraries; it only uses a few headers from the STL. I've done some testing with GCC and MS compilers and put it on my website: http://ikk.byethost9.com/index.php?MainMenu=hef_uri_syntax The namespace has been renamed Poco -> hef and the main class URI -> HfURISyntax. You are encouraged to rename these when using it in your own projects. (The original copyright is included. There is a text document that contains a summary of the modifications.)



Answer 14:

You could try the open-source C++ REST SDK (created by Microsoft, distributed under the Apache License 2.0). It can be built for several platforms, including Windows, Linux, OS X, iOS, and Android. There is a class called web::uri: you pass in a string and can then retrieve the individual URL components. Here is a code sample (tested on Windows):

    #include <cpprest/base_uri.h>
    #include <iostream>
    #include <ostream>

    int main()
    {
        // Wide string literals as tested on Windows.
        web::uri sample_uri( L"http://dummyuser@localhost:7777/dummypath?dummyquery#dummyfragment" );
        std::wcout << L"scheme: "   << sample_uri.scheme()     << std::endl;
        std::wcout << L"user: "     << sample_uri.user_info()  << std::endl;
        std::wcout << L"host: "     << sample_uri.host()       << std::endl;
        std::wcout << L"port: "     << sample_uri.port()       << std::endl;
        std::wcout << L"path: "     << sample_uri.path()       << std::endl;
        std::wcout << L"query: "    << sample_uri.query()      << std::endl;
        std::wcout << L"fragment: " << sample_uri.fragment()   << std::endl;
    }

The output will be:

    scheme: http
    user: dummyuser
    host: localhost
    port: 7777
    path: /dummypath
    query: dummyquery
    fragment: dummyfragment

There are also other easy-to-use methods, e.g. to access individual attribute/value pairs from the query, split the path into components, etc.
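What splitting a query into attribute/value pairs involves can be sketched with the standard library alone. This is not the C++ REST SDK API: `split_query` is a hypothetical helper written for illustration, it uses narrow strings rather than the wide strings from the sample above, and it assumes pairs are separated by `&`, that the first `=` separates key from value, and that no percent-decoding is needed.

```cpp
#include <map>
#include <string>

// Hypothetical helper: split "a=1&b=2" into {"a": "1", "b": "2"}.
// A key without '=' maps to an empty value; empty segments are skipped.
std::map<std::string, std::string> split_query(const std::string& query)
{
    std::map<std::string, std::string> result;
    std::string::size_type pos = 0;

    while (pos <= query.size()) {
        // Find the end of the current "key=value" segment.
        std::string::size_type amp = query.find('&', pos);
        if (amp == std::string::npos)
            amp = query.size();

        const std::string pair = query.substr(pos, amp - pos);
        if (!pair.empty()) {
            const std::string::size_type eq = pair.find('=');
            if (eq == std::string::npos)
                result[pair] = "";
            else
                result[pair.substr(0, eq)] = pair.substr(eq + 1);
        }
        pos = amp + 1;  // continue after the '&'
    }
    return result;
}
```

A std::map keeps only the last value for a repeated key; a std::multimap or a vector of pairs would be the better choice if duplicates matter.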



Answer 15:

There is yet another library, https://snapwebsites.org/project/libtld, which handles all possible top-level domains and URI schemes.


