How to get webpage title without downloading all the page source

小鲜肉 2020-12-31 23:29

I'm looking for a method that will allow me to get the title of a webpage and store it as a string.

However all the solutions I have found so far involve downloadin

2 answers
  •  无人及你
    2021-01-01 00:03

    As the <title> tag is in the HTML itself, there is no way to avoid downloading the file to find "just the title". You should be able to download a portion of the file until you've read in the <title> tag, or the </head> tag, and then stop, but you'll still need to download (at least a portion of) the file.

    This can be accomplished with HttpWebRequest/HttpWebResponse by reading data from the response stream until we've read either a <title></title> block or the </head> tag. I added the </head> tag check because, in valid HTML, the title block must appear within the head block, so with this check we will never parse the entire file in any case (unless there is no head block, of course).

    The following should be able to accomplish this task:

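    // Namespaces used: System, System.IO, System.Net, System.Text, System.Text.RegularExpressions.
    // `url` is assumed to already hold the address of the page.
    string title = "";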
    try {
        HttpWebRequest request = (HttpWebRequest.Create(url) as HttpWebRequest);
        HttpWebResponse response = (request.GetResponse() as HttpWebResponse);
    
        using (Stream stream = response.GetResponseStream()) {
            // compiled regex to check for a <title></title> block
            Regex titleCheck = new Regex(@"<title>\s*(.+?)\s*</title>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
            int bytesToRead = 8092;
            byte[] buffer = new byte[bytesToRead];
            string contents = "";
            int length = 0;
            while ((length = stream.Read(buffer, 0, bytesToRead)) > 0) {
                // convert the byte-array to a string and add it to the rest of the
                // contents that have been downloaded so far
                contents += Encoding.UTF8.GetString(buffer, 0, length);
    
                Match m = titleCheck.Match(contents);
                if (m.Success) {
                    // we found a <title></title> match =]
                    title = m.Groups[1].Value;
                    break;
                } else if (contents.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) {
                    // reached end of head-block; no title found =[
                    break;
                }
            }
        }
    } catch (Exception e) {
        Console.WriteLine(e);
    }
    

    UPDATE: Updated the original source-example to use a compiled Regex and a using statement for the Stream for better efficiency and maintainability.
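
The stop-early logic described above can also be separated from the network code, which makes it easier to try out. The following is only a sketch under assumptions: the names TitleScanner and ReadTitle are hypothetical, and it reads from a TextReader rather than raw bytes so that a multi-byte character split across two chunks is decoded correctly. The regex is also slightly looser than the one in the answer (it allows attributes on the tag and line breaks inside the title).

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class TitleScanner
{
    // Matches a <title> element, allowing attributes on the tag and
    // line breaks inside the title text.
    private static readonly Regex TitleRegex = new Regex(
        @"<title\b[^>]*>\s*(.*?)\s*</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Reads from `reader` in chunks and stops as soon as a complete title
    // or the closing </head> tag has been seen.
    // Returns the title text, or null if no title was found before </head>
    // or before the end of the input.
    public static string ReadTitle(TextReader reader, int chunkSize = 4096)
    {
        var buffer = new char[chunkSize];
        var contents = new StringBuilder();
        int length;
        while ((length = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            contents.Append(buffer, 0, length);
            string soFar = contents.ToString();

            Match m = TitleRegex.Match(soFar);
            if (m.Success)
                return m.Groups[1].Value;

            // End of the head block reached without a title: give up early.
            if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0)
                return null;
        }
        return null;
    }
}
```

To use this against a real page, one option is to wrap the response stream in a StreamReader and pass it to ReadTitle. With HttpClient, requesting the page via GetAsync(url, HttpCompletionOption.ResponseHeadersRead) returns as soon as the headers arrive, so the body is only read as far as the scanner asks for it.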
