Easy way to export Wikipedia's translated titles

Posted by 泄露秘密 on 2019-12-12 09:53:17

Question


Is there an easy way to export Wikipedia's translated titles to get a set like this:
russian_title -> english_title?

I tried to extract them from ruwiki-latest-pages-meta-current.xml.bz2 and ruwiki-latest-pages-articles.xml.bz2; however, those dumps contain fewer than 25k translations.

I found out that some are missing. E.g., one can see a link to the English wiki here, but there is no [[en:Yandex]] link in the dump.

Maybe I should try to parse English Wikipedia, but I'm sure there is a nicer solution.

BTW, I'm using wikixmlj, and I also tried to find en:Yandex with grep.

UPD: link to the data for @svick's solution: http://dumps.wikimedia.org/[language code]wiki/latest/, e.g. http://dumps.wikimedia.org/ruwiki/latest/


Answer 1:


Most of the links between Wikipedia articles in various languages are now on Wikidata. So, if you wanted to get to the source, you could download the dump of Wikidata and parse that (it's in JSON).
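
To illustrate the Wikidata route, here is a minimal sketch. It assumes the JSON dump layout in which the file is one large array with one entity per line, and each entity carries a `sitelinks` object keyed by site id (`ruwiki`, `enwiki`, and so on). The function name `title_pairs` and the sample lines are made up for the example; they are not taken from a real dump.

```python
import json

def title_pairs(lines, src="ruwiki", dst="enwiki"):
    """Yield (src_title, dst_title) for every entity that has both sitelinks."""
    for line in lines:
        # Each entity sits on its own line, followed by a comma;
        # the first and last lines are the array brackets.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        links = entity.get("sitelinks", {})
        if src in links and dst in links:
            yield links[src]["title"], links[dst]["title"]

# Hypothetical sample in the assumed layout (ids are placeholders).
sample = [
    "[",
    '{"id": "Q0", "sitelinks": {"ruwiki": {"title": "Яндекс"}, "enwiki": {"title": "Yandex"}}},',
    '{"id": "Q1", "sitelinks": {"ruwiki": {"title": "Пример"}}}',
    "]",
]
pairs = dict(title_pairs(sample))
```

For a real dump, the same function could be fed a file object opened with `bz2.open(path, "rt", encoding="utf-8")`, so that the dump is streamed line by line instead of being loaded into memory.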

But I think a better way would be to use the dump of the langlinks table. This contains exactly the information you want, both for links from Wikidata and links that are still in the old form.

This dump is in SQL format. You can import that dump into a MySQL database, or you can parse it directly (I have written a .NET library that does that).
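
If importing into MySQL is not an option, parsing the SQL text directly could look roughly like the sketch below. It assumes that each row of the langlinks dump has the form `(ll_from,'ll_lang','ll_title')` inside multi-row `INSERT INTO` statements, with quotes and backslashes escaped by a backslash, as mysqldump usually writes them. The names `ROW`, `unescape` and `parse_langlinks`, and the sample statement, are illustrative only.

```python
import re

# One row: (123,'en','Some title'); string fields may contain \' or \\ escapes.
ROW = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'\)")

def unescape(value):
    # Drops the backslash in front of an escaped character (\' and \\).
    # Control-character escapes such as \n are not handled; they should
    # not normally occur in page titles.
    return re.sub(r"\\(.)", r"\1", value)

def parse_langlinks(lines, lang="en"):
    """Yield (page_id, target_title) for links pointing to the given language."""
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip comments, CREATE TABLE, LOCK statements, etc.
        for page_id, ll_lang, ll_title in ROW.findall(line):
            if ll_lang == lang:
                yield int(page_id), unescape(ll_title)

# Hypothetical sample statement (page ids are placeholders).
sample = [
    "INSERT INTO `langlinks` VALUES (10,'en','Yandex'),(10,'de','Yandex'),(11,'en','O\\'Reilly Media');"
]
rows = list(parse_langlinks(sample))
```

The dump files are typically gzip-compressed, so a real run would probably pass `gzip.open(path, "rt", encoding="utf-8", errors="replace")` as `lines`.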

The table contains mappings from the page id in your wiki (in your case the Russian Wikipedia) to page titles in other wikis. This means you will need the page ids of the pages you're interested in. For a small number of pages, you can look them up manually using the “Page information” link, or you could use the API. But if you need this for a large number of pages, you should download the dump of the page table, which contains this mapping.
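
Once both tables have been parsed, joining them on the page id gives the russian_title -> english_title set from the question. The sketch below assumes the rows are already available as plain tuples: `(page_id, namespace, title)` for the page table and `(page_id, target_title)` for the langlinks rows of one target language. It also assumes that page titles are stored with underscores instead of spaces and that namespace 0 holds the articles. The function name `build_title_map` and all sample values are hypothetical.

```python
def build_title_map(page_rows, langlink_rows):
    """Join page rows with langlinks rows on page id.

    page_rows:     iterable of (page_id, namespace, title)
    langlink_rows: iterable of (page_id, target_title)
    Returns a dict mapping local article titles to target titles.
    """
    # Keep articles only (namespace 0) and turn underscores into spaces.
    titles = {
        page_id: title.replace("_", " ")
        for page_id, namespace, title in page_rows
        if namespace == 0
    }
    return {
        titles[page_id]: target
        for page_id, target in langlink_rows
        if page_id in titles
    }

# Hypothetical sample rows (ids are placeholders, not real page ids).
page_rows = [
    (10, 0, "Яндекс"),
    (11, 1, "Яндекс"),              # talk page, filtered out
    (12, 0, "Поисковая_система"),
]
langlink_rows = [
    (10, "Yandex"),
    (11, "Talk:Yandex"),
    (12, "Search engine"),
    (99, "No such page"),            # id missing from page_rows, skipped
]
title_map = build_title_map(page_rows, langlink_rows)
```

Holding the page id to title dictionary in memory keeps the join to a single pass over the langlinks rows; for a very large wiki, importing both dumps into a database and joining there may be more practical.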



Source: https://stackoverflow.com/questions/21000834/easy-way-to-export-wikipedias-translated-titles
