wikipedia

Parser for Wikipedia

Submitted by  ̄綄美尐妖づ on 2019-12-20 12:14:07
Question: I downloaded a Wikipedia dump and I want to convert the wiki format into my own object format. Is there a wiki parser available that converts it into XML? Answer 1: See java-wikipedia-parser. I have never used it, but according to the docs: "The parser comes with an HTML generator. You can however control the output that is being generated by passing your own implementation of the be.devijver.wikipedia.Visitor interface." Answer 2: I do not know exactly what the XML format of the Wikipedia dump looks like. But, if…
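As a rough illustration of working with the dump directly (not the java-wikipedia-parser API, whose classes beyond the quoted Visitor interface I have not verified), here is a minimal StAX sketch that streams pages out of a pages-articles XML dump and hands the raw wikitext to your own converter; the commented-out MyConverter.convert call is a hypothetical placeholder, not part of any library.

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.FileInputStream;

    public class DumpReader {
        public static void main(String[] args) throws Exception {
            // Stream the dump instead of loading it into memory (it is huge).
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream(args[0]));

            String title = null;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("title".equals(name)) {
                        title = reader.getElementText();
                    } else if ("text".equals(name)) {
                        String wikitext = reader.getElementText();
                        // Hand off to your own object model / converter here,
                        // e.g. (hypothetical): MyConverter.convert(title, wikitext);
                        System.out.println(title + " -> " + wikitext.length() + " chars");
                    }
                }
            }
            reader.close();
        }
    }

Streaming matters here because a full English dump does not fit in memory; the parser library can then be applied per page to the wikitext strings this loop produces.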

How to get all Wikipedia article titles?

Submitted by 不问归期 on 2019-12-20 08:11:05
Question: How can I get all Wikipedia article titles in one place, without extra characters and page ids, just the article titles? Something like this: … When I download the Wikipedia dump, I get this: … Maybe there is a method that would get me all pages, but I wanted to get all pages in one take. Answer 1: You'll find it on https://dumps.wikimedia.org. The latest "List of page titles in main namespace" for English Wikipedia as a database dump is here (69 MB). If you would rather get it through the API, you use query and list…
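For the API route, a sketch of paging through the standard allpages list is shown below; it assumes the usual api.php parameters (list=allpages, aplimit, and following the continuation token from each response), with JSON parsing left to whatever library you prefer.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class AllTitles {
        public static void main(String[] args) throws Exception {
            // list=allpages returns batches of titles; follow the "apcontinue"
            // value from each response to fetch the next batch.
            String url = "https://en.wikipedia.org/w/api.php"
                    + "?action=query&list=allpages&aplimit=500&format=json";
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("User-Agent", "title-lister-example/0.1");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw JSON; extract the "title" fields
                }
            }
        }
    }

For the full English Wikipedia, the dump file linked above is far faster than paging the API in 500-title batches.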

XPath to get markup between two headings

Submitted by 三世轮回 on 2019-12-20 04:06:06
Question: I am trying to write a small application to extract content from Wikipedia pages. When I first thought of it, I figured I could just target the divs containing content with XPath, but after looking into how Wikipedia builds its articles, I quickly discovered it wouldn't be so easy. The best way to separate content once I have the page is to select what's between two sets of h2 tags. Example: <h2>Title</h2> <div>Some Content</div> <h2>Title</h2> Here I would want to get the div between…
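One common XPath technique for this is to select every element whose nearest preceding h2 sibling is the heading you want. A small sketch with javax.xml.xpath follows; the sample document and the "History" heading are made up, and real Wikipedia HTML would first need to be tidied into well-formed XML.

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import java.io.ByteArrayInputStream;

    public class BetweenHeadings {
        public static void main(String[] args) throws Exception {
            String html = "<root>"
                    + "<h2>History</h2><div>Some content</div>"
                    + "<h2>Career</h2><div>Other content</div>"
                    + "</root>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Elements after the "History" h2 but before the next h2:
            // the nearest preceding h2 sibling of each match must be "History".
            String expr = "//*[not(self::h2) and preceding-sibling::h2[1] = 'History']";
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                System.out.println(nodes.item(i).getTextContent()); // prints "Some content"
            }
        }
    }

Because preceding-sibling is a reverse axis, preceding-sibling::h2[1] is the closest preceding h2, which is exactly the "between two headings" boundary.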

Example using WikipediaTokenizer in Lucene

Submitted by ε祈祈猫儿з on 2019-12-19 11:42:31
Question: I want to use WikipediaTokenizer (http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html) in a Lucene project, but I have never used Lucene. I just want to convert a Wikipedia string into a list of tokens. However, I see that there are only four methods available in this class: end, incrementToken, reset, and reset(Reader). Can someone point me to an example of how to use it? Thank you. Answer 1: In Lucene 3.0, the next() method was removed. Now you should use…
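A minimal sketch of the usual TokenStream consumption loop follows, assuming Lucene 3.0.x with contrib-wikipedia on the classpath; attribute class names changed in later releases (TermAttribute became CharTermAttribute), so adjust for your version.

    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    public class TokenizeWiki {
        public static void main(String[] args) throws Exception {
            String wikitext = "'''Albert Einstein''' was a [[physicist]].";
            WikipediaTokenizer tokenizer =
                    new WikipediaTokenizer(new StringReader(wikitext));
            TermAttribute termAtt = tokenizer.addAttribute(TermAttribute.class);

            List<String> tokens = new ArrayList<String>();
            // incrementToken() replaces the old next(): it advances the stream
            // and fills the registered attributes for the current token.
            while (tokenizer.incrementToken()) {
                tokens.add(termAtt.term());
            }
            tokenizer.end();
            tokenizer.close();
            System.out.println(tokens);
        }
    }

The same loop works for any Lucene TokenStream; WikipediaTokenizer just adds wiki-markup-aware token types on top of it.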

Indexing wikipedia dump with solr

Submitted by 心已入冬 on 2019-12-19 11:31:13
Question: I have Solr 3.6.2 installed on my machine, running perfectly with Tomcat. I want to index a Wikipedia dump file using Solr. How do I do this using DataImportHandler? Is there any other way? I don't have any knowledge of XML. The file I mentioned is around 45 GB when extracted. Any help would be greatly appreciated. Update: I tried doing what is described on the DataImportHandler page, but I get an error, maybe because their version of Solr is much older. My data-config: <dataConfig>…
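If DataImportHandler keeps failing, another option is to push documents from a small client program using SolrJ. This is only a rough sketch: it assumes a Solr 3.x server at a typical Tomcat URL and id/title/text fields in your schema, and SolrJ class names differ across Solr versions.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexPages {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8080/solr");

            // In a real run you would stream pages out of the 45 GB dump
            // (for example with StAX) instead of hard-coding one document.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "12");             // page id taken from the dump
            doc.addField("title", "Anarchism");   // <title> element
            doc.addField("text", "…wikitext…");   // <text> element

            solr.add(doc);
            solr.commit();
            solr.shutdown();
        }
    }

Batching a few thousand documents per add() call and committing occasionally keeps the import reasonably fast for a dump of this size.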

Create a HashMap with a fixed Key corresponding to a HashSet. point of departure

Submitted by 泪湿孤枕 on 2019-12-19 11:17:35
Question: My aim is to create a HashMap with a String as the key and a HashSet of Strings as the value. OUTPUT: This is what the output looks like now: Hudson+(surname)=[Q2720681], Hudson,+Quebec=[Q141445], Hudson+(given+name)=[Q5928530], Hudson,+Colorado=[Q2272323], Hudson,+Illinois=[Q2672022], Hudson,+Indiana=[Q2710584], Hudson,+Ontario=[Q5928505], Hudson,+Buenos+Aires+Province=[Q10298710], Hudson,+Florida=[Q768903]] According to my idea, it should look like this: [Hudson+(surname)=[Q2720681…
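The usual pattern for this is a Map<String, Set<String>> where a new HashSet is created the first time a key is seen and later values are added to the same set. A small sketch using a couple of the keys quoted above (the second id added for "Hudson+(surname)" is hypothetical, only there to show grouping):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class TitleToIds {
        public static void main(String[] args) {
            Map<String, Set<String>> index = new HashMap<String, Set<String>>();

            add(index, "Hudson+(surname)", "Q2720681");
            add(index, "Hudson,+Quebec", "Q141445");
            add(index, "Hudson+(surname)", "Q1158570"); // hypothetical second id, same key

            System.out.println(index);
            // e.g. {Hudson+(surname)=[Q2720681, Q1158570], Hudson,+Quebec=[Q141445]}
        }

        static void add(Map<String, Set<String>> index, String key, String value) {
            // Create the set lazily the first time the key appears.
            Set<String> ids = index.get(key);
            if (ids == null) {
                ids = new HashSet<String>();
                index.put(key, ids);
            }
            ids.add(value);
        }
    }

On Java 8 or later the helper collapses to index.computeIfAbsent(key, k -> new HashSet<>()).add(value).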

API to get Wikipedia revision id by date [closed]

Submitted by 冷暖自知 on 2019-12-19 10:04:46
Question (closed as off-topic on Stack Overflow): Is there any API to get a Wikipedia revision id by date, instead of checking the entire revision history and extracting the most recent revision before that date? Thank you! Answer 1: The revision query API allows you to pass timestamps to get only revisions from a specified interval. Use api.php?action=query&prop…
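A sketch of that revisions query: prop=revisions with rvstart and rvdir=older returns the newest revision at or before a given timestamp. The article title and timestamp here are arbitrary examples, and the JSON is printed raw rather than parsed.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class RevisionAtDate {
        public static void main(String[] args) throws Exception {
            String title = URLEncoder.encode("Albert Einstein", "UTF-8");
            // rvdir=older + rvlimit=1 = the latest revision no newer than rvstart.
            String url = "https://en.wikipedia.org/w/api.php?action=query"
                    + "&prop=revisions&titles=" + title
                    + "&rvprop=ids%7Ctimestamp"
                    + "&rvstart=2015-01-01T00:00:00Z"
                    + "&rvdir=older&rvlimit=1&format=json";
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL(url).openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON containing the "revid" of that revision
                }
            }
        }
    }

The "revid" field in the response is the revision id the question asks for.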

Get first lines of Wikipedia Article

Submitted by 本小妞迷上赌 on 2019-12-18 11:53:32
Question: I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words; it doesn't matter) from the article. The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly of the print version), but how can I find the first lines actually displayed? Normally the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code. For example: Albert…
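One common route nowadays is the TextExtracts API (prop=extracts), which returns the lead section with infoboxes and markup already stripped; this is a sketch assuming the extension is enabled on the wiki (it is on Wikipedia), with Albert_Einstein used only as an example title.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class LeadText {
        public static void main(String[] args) throws Exception {
            // exintro   = only the text before the first heading
            // explaintext = strip HTML/wiki markup from the extract
            String url = "https://en.wikipedia.org/w/api.php?action=query"
                    + "&prop=extracts&exintro&explaintext"
                    + "&titles=Albert_Einstein&format=json";
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL(url).openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON; the "extract" field holds the lead text
                }
            }
        }
    }

From the returned extract you can then cut the first z lines, x characters, or y words yourself, since it is plain display text rather than wikitext.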

Getting hyperlinks of a Wikipedia page using DBpedia

Submitted by 耗尽温柔 on 2019-12-18 07:17:29
Question: I have two resources in DBpedia: dbr:Diabetes_mellitus and dbr:Hyperglycemia. In Wikipedia, the corresponding pages are wikipedia-en:Diabetes_mellitus and wikipedia-en:Hyperglycemia. In Wikipedia there is a hyperlink from the Diabetes_mellitus page to the Hyperglycemia page, but when I try to find the link between the two resources in DBpedia, I cannot find it. I tried to find the link using the following SPARQL query: SELECT ?prop WHERE { { dbr:Diabetes_mellitus ?prop dbr:Hyperglycemia } UNION { dbr…
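For reference, the page-link relation in DBpedia is dbo:wikiPageWikiLink, but it comes from the separate page-links dataset, which may not be loaded on every endpoint, and that could be why the query above finds nothing. Below is a small Apache Jena sketch of running the same kind of query against an endpoint; it assumes the ARQ library is on the classpath (older Jena 2.x uses the com.hp.hpl.jena packages instead).

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;

    public class PageLinks {
        public static void main(String[] args) {
            // List every predicate connecting the two resources on this endpoint;
            // if the page-links dataset is loaded, dbo:wikiPageWikiLink should appear.
            String query =
                  "PREFIX dbr: <http://dbpedia.org/resource/> "
                + "SELECT ?prop WHERE { dbr:Diabetes_mellitus ?prop dbr:Hyperglycemia }";

            QueryExecution qexec = QueryExecutionFactory.sparqlService(
                    "http://dbpedia.org/sparql", query);
            try {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.getResource("prop"));
                }
            } finally {
                qexec.close();
            }
        }
    }

If no properties come back at all, the link simply is not in the loaded datasets, and the page-links dump would have to be loaded into a local triple store instead.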