Getting IMDb movie titles in a specific language

Submitted by 穿精又带淫゛_ on 2019-12-12 06:43:28

Question


I am writing a crawler in Java that examines an IMDb movie page and extracts some info such as the name, year, etc. The user types (or copy/pastes) the link to the title and my program should do the rest.

After examining the HTML source of several IMDb pages and reading up on how crawlers work, I managed to write some code.

The info I get (for example, the title) is in my mother tongue. If there is no info in my mother tongue, I get the original title. What I want is to get the title in a specific language of my choosing.

I'm fairly new to this, so correct me if I'm wrong, but I think I get the results in my mother tongue because IMDb "sees" that I'm from Serbia and then customizes the results for me. So basically I need to tell it somehow that I prefer results in English? Is that possible (I imagine it is), and how do I do it?

Edit: The program crawls like this: it takes the URL path as a String, converts it to a URL, reads the whole page source with a BufferedReader and inspects what it gets. I'm not sure if that is the right way to do it, but it works (minus the language problem). Code:

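    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.StandardCharsets;

    public static Info crawlUrl(String urlPath) throws IOException {
        Info info = new Info();

        // Open a connection to the page and read its HTML line by line.
        URL url = new URL(urlPath);
        URLConnection uc = url.openConnection();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                uc.getInputStream(), StandardCharsets.UTF_8))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                // Print any line that contains the page's <title> tag.
                if (inputLine.contains("<title>")) System.out.println(inputLine);
            }
        }

        return info;
    }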

This code goes through a page and prints the main title to the console.


Answer 1:


Look at the request headers your crawler sends. My browser sends Accept-Language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4, so I get the title in French.

EDIT:

I checked with the ModifyHeaders add-on in Google Chrome, and setting the value to en-US gets me the English title for the movie =)
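As a rough illustration of how the same header could be sent from the crawler, here is a minimal Java sketch. The class name LocalizedTitleFetcher, the helpers extractTitle and fetchTitle, and the regular expression are hypothetical names introduced for this example. It also assumes the server honors Accept-Language as described above, which may not hold for every page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocalizedTitleFetcher {

    // Hypothetical helper pattern: captures the text of the first <title> element.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed <title> text, or null if the HTML has no title element.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Downloads the page while asking for the given language(s), then extracts the title.
    static String fetchTitle(String urlPath, String acceptLanguage) throws IOException {
        URLConnection uc = new URL(urlPath).openConnection();
        // Request headers must be set before getInputStream() opens the connection.
        uc.setRequestProperty("Accept-Language", acceptLanguage);

        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(uc.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return extractTitle(html.toString());
    }

    public static void main(String[] args) throws IOException {
        // Usage: pass the title page URL as the first argument.
        if (args.length > 0) {
            System.out.println(fetchTitle(args[0], "en-US,en;q=0.8"));
        }
    }
}
```

Keeping extractTitle separate from the network call makes the parsing step easy to check against a saved HTML string without contacting the site.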




Answer 2:


You don't need to crawl IMDb; you can use the data dumps they provide: http://www.imdb.com/interfaces

There's also a parser for that data: https://code.google.com/p/imdbdumpimport/ It's not perfect, but it may help (expect to spend some effort getting it to work).

An alternative parser: https://github.com/dedeler/imdb-data-parser

EDIT: You say you want to crawl IMDb anyway for learning purposes. In that case you'll probably have to use content negotiation (http://en.wikipedia.org/wiki/Content_negotiation), as suggested in the other answer, by setting the Accept-Language header on the connection before reading from it:

uc.setRequestProperty("Accept-Language", "de; q=1.0, en; q=0.5");
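To see how the q values in such a header rank the languages, the standard java.util.Locale.LanguageRange.parse method (Java 8 and later) can be used to inspect it. This is only a sketch for understanding the header format: the class name AcceptLanguageDemo and the helper topRange are hypothetical, and the server's own negotiation logic is not something this code can confirm.

```java
import java.util.List;
import java.util.Locale;

class AcceptLanguageDemo {

    // Returns the language range with the highest weight in the header value.
    static String topRange(String header) {
        // parse() returns the ranges ordered by descending weight (q value).
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(header);
        return ranges.get(0).getRange();
    }

    public static void main(String[] args) {
        // Print each range with its weight; a range without q defaults to 1.0.
        for (Locale.LanguageRange r : Locale.LanguageRange.parse("de;q=1.0,en;q=0.5")) {
            System.out.println(r.getRange() + " -> " + r.getWeight());
        }
    }
}
```

With the header from the line above, German is the preferred language and English is the fallback, so swapping the two weights would ask for English first.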


Source: https://stackoverflow.com/questions/20913728/getting-imdb-movie-titles-in-a-specific-language
