Scraping HTML from Google Translate

Question

I want to translate a string using Google Translate.

My sample string is "this is my string".

I want to use HTML Agility Pack to parse HTML documents.

I tried this:

using HtmlAgilityPack; 

........

var webGet = new HtmlWeb();
var document = webGet.Load(
    "http://translate.google.com/#en/bn/this%20is%20my%20string");

var node = document.DocumentNode.SelectNodes(
    "//span[@class='short_text' and @id='result_box']");

if (node != null)
{
    foreach (var xx in node)
    {
        string x = xx.InnerText; // declare x here; it was never declared in the snippet
        MessageBox.Show(x);
    }
}

But I get no results.

My aim is to translate a complete string using Google Translate and to show the translated string in a label in Windows Forms.

How can I do this?

Answer 1:

This is a bad idea. As commenters have pointed out, Google offers a programmatic interface as a paid service. Google surely has security features in place to try to block exactly what you are doing, and that is why it isn't working. Perhaps you could get it working somehow, but even then you would always be in danger of Google improving its security and your script being blocked again. In addition, you are almost certainly breaking the Google terms of use.

2017 Update: Microsoft Translator API now supports Bengali, and is free for up to two million characters per month.

Answer 2:

Basic example using HTML Agility Pack

using System;
using HtmlAgilityPack;

class Translator
{
    private readonly string url;
    private readonly HtmlWeb web;

    // langPair = "SL|TL" (source language | target language), e.g. "en|pt"
    public Translator(string langPair)
    {
        // {0} is replaced with the escaped input text in Translate()
        this.url = "http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair=" + langPair;
        this.web = new HtmlWeb();
    }

    public string Translate(string input)
    {
        // EscapeDataString also escapes '&', '#' and '?', so the input cannot break the query string
        HtmlDocument htmlDoc = web.Load(String.Format(this.url, Uri.EscapeDataString(input)));
        HtmlNode htmlNode = htmlDoc.DocumentNode.SelectSingleNode("//*[@id=\"result_box\"]");

        // SelectSingleNode returns null when no element matches
        return htmlNode != null ? htmlNode.InnerText : null;
    }
}

What's wrong in your example: just the URL you used. Try inspecting the document.Text property to see the HTML that webGet actually received; you will see that the result_box span is empty.

Answer 3:

Rather than relying on screen-scraping, you should consider using the API that Google makes available for the translate service.

Some documentation can be found here

Update:

I believe the problem with your screen-scraping approach may be that the translate application uses Ajax to call the server and retrieve the translation. The page you get when downloading with HtmlWeb is merely the JavaScript application; it doesn't actually contain the translation. The result box isn't filled in until the page itself has made a call to the server.
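One way to see why the downloaded page cannot contain the translation: in the URL from the question, the language codes and the text all sit after the #, and that fragment part of a URL is handled by the browser and is not sent in the HTTP request. Here is a minimal sketch using only System.Uri. The class and method names (FragmentDemo, RequestTarget, ClientOnlyPart) are made up for illustration and are not part of HTML Agility Pack.

```csharp
using System;

static class FragmentDemo
{
    // Path and query: the part of the URL that goes on the HTTP request line.
    // The fragment (everything from '#' onward) is never included.
    public static string RequestTarget(string url)
    {
        return new Uri(url).PathAndQuery;
    }

    // Fragment: kept on the client side, including the leading '#'.
    public static string ClientOnlyPart(string url)
    {
        return new Uri(url).Fragment;
    }

    static void Main()
    {
        // URL taken from the question
        const string url = "http://translate.google.com/#en/bn/this%20is%20my%20string";

        Console.WriteLine("Sent to server: " + RequestTarget(url));
        Console.WriteLine("Kept in client: " + ClientOnlyPart(url));
    }
}
```

For the question's URL, RequestTarget returns just "/", so the server only ever sees a request for the home page. The text to translate never reaches it unless the page's own script runs, which HtmlWeb does not do.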

Source: https://stackoverflow.com/questions/14339190/scraping-html-from-google-translate

Tags

.net

winforms

html-agility-pack

google-translate