wikipedia

Parser for Wikipedia

Submitted by  ̄綄美尐妖づ on 2019-12-20 12:14:07
Question: I downloaded a Wikipedia dump and I want to convert the wiki format into my own object format. Is there a wiki parser available that converts it into XML? Answer 1: See java-wikipedia-parser. I have never used it, but according to the docs: "The parser comes with an HTML generator. You can however control the output that is being generated by passing your own implementation of the be.devijver.wikipedia.Visitor interface." Answer 2: I do not know exactly what the XML format of the Wikipedia dump looks like. But, if…
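As a rough illustration of working with the dump directly (not the java-wikipedia-parser API, whose classes beyond the quoted Visitor interface I have not verified), here is a minimal StAX sketch that streams pages out of a pages-articles XML dump and hands the raw wikitext to your own converter; the commented-out MyConverter.convert call is a hypothetical placeholder, not part of any library.

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.FileInputStream;

    public class DumpReader {
        public static void main(String[] args) throws Exception {
            // Stream the dump instead of loading it into memory (it is huge).
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream(args[0]));

            String title = null;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("title".equals(name)) {
                        title = reader.getElementText();
                    } else if ("text".equals(name)) {
                        String wikitext = reader.getElementText();
                        // Hand off to your own object model / converter here,
                        // e.g. (hypothetical): MyConverter.convert(title, wikitext);
                        System.out.println(title + " -> " + wikitext.length() + " chars");
                    }
                }
            }
            reader.close();
        }
    }

Streaming matters here because a full English dump does not fit in memory; the parser library can then be applied per page to the wikitext strings this loop produces.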

How to get all Wikipedia article titles?

Submitted by 不问归期 on 2019-12-20 08:11:05
Question: How can I get all Wikipedia article titles in one place, without extra characters and page ids, just the article titles? Something like this: … When I download the Wikipedia dump, I get this: … Maybe there is a method that would get me all pages, but I wanted to get all pages in one take. Answer 1: You'll find it on https://dumps.wikimedia.org. The latest "List of page titles in main namespace" for English Wikipedia as a database dump is here (69 MB). If you would rather get it through the API, you use query and list…
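For the API route, a sketch of paging through the standard allpages list is shown below; it assumes the usual api.php parameters (list=allpages, aplimit, and following the continuation token from each response), with JSON parsing left to whatever library you prefer.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class AllTitles {
        public static void main(String[] args) throws Exception {
            // list=allpages returns batches of titles; follow the "apcontinue"
            // value from each response to fetch the next batch.
            String url = "https://en.wikipedia.org/w/api.php"
                    + "?action=query&list=allpages&aplimit=500&format=json";
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("User-Agent", "title-lister-example/0.1");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw JSON; extract the "title" fields
                }
            }
        }
    }

For the full English Wikipedia, the dump file linked above is far faster than paging the API in 500-title batches.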

XPath to get markup between two headings

Submitted by 三世轮回 on 2019-12-20 04:06:06
Question: I am trying to write a small application to extract content from Wikipedia pages. When I first thought of it, I figured I could just target the divs containing content with XPath, but after looking into how Wikipedia builds its articles, I quickly discovered it wouldn't be so easy. The best way to separate content once I have the page is to select what's between two sets of h2 tags. Example: <h2>Title</h2> <div>Some Content</div> <h2>Title</h2> Here I would want to get the div between…
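One common XPath technique for this is to select every element whose nearest preceding h2 sibling is the heading you want. A small sketch with javax.xml.xpath follows; the sample document and the "History" heading are made up, and real Wikipedia HTML would first need to be tidied into well-formed XML.

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import java.io.ByteArrayInputStream;

    public class BetweenHeadings {
        public static void main(String[] args) throws Exception {
            String html = "<root>"
                    + "<h2>History</h2><div>Some content</div>"
                    + "<h2>Career</h2><div>Other content</div>"
                    + "</root>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Elements after the "History" h2 but before the next h2:
            // the nearest preceding h2 sibling of each match must be "History".
            String expr = "//*[not(self::h2) and preceding-sibling::h2[1] = 'History']";
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                System.out.println(nodes.item(i).getTextContent()); // prints "Some content"
            }
        }
    }

Because preceding-sibling is a reverse axis, preceding-sibling::h2[1] is the closest preceding h2, which is exactly the "between two headings" boundary.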

Example using WikipediaTokenizer in Lucene

Submitted by ε祈祈猫儿з on 2019-12-19 11:42:31
Question: I want to use WikipediaTokenizer (http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html) in a Lucene project, but I have never used Lucene. I just want to convert a Wikipedia string into a list of tokens. However, I see that there are only four methods available in this class: end, incrementToken, reset, and reset(Reader). Can someone point me to an example of how to use it? Thank you. Answer 1: In Lucene 3.0, the next() method was removed. Now you should use…
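A minimal sketch of the usual TokenStream consumption loop follows, assuming Lucene 3.0.x with contrib-wikipedia on the classpath; attribute class names changed in later releases (TermAttribute became CharTermAttribute), so adjust for your version.

    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    public class TokenizeWiki {
        public static void main(String[] args) throws Exception {
            String wikitext = "'''Albert Einstein''' was a [[physicist]].";
            WikipediaTokenizer tokenizer =
                    new WikipediaTokenizer(new StringReader(wikitext));
            TermAttribute termAtt = tokenizer.addAttribute(TermAttribute.class);

            List<String> tokens = new ArrayList<String>();
            // incrementToken() replaces the old next(): it advances the stream
            // and fills the registered attributes for the current token.
            while (tokenizer.incrementToken()) {
                tokens.add(termAtt.term());
            }
            tokenizer.end();
            tokenizer.close();
            System.out.println(tokens);
        }
    }

The same loop works for any Lucene TokenStream; WikipediaTokenizer just adds wiki-markup-aware token types on top of it.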

Indexing wikipedia dump with solr

Submitted by 心已入冬 on 2019-12-19 11:31:13
Question: I have Solr 3.6.2 installed on my machine, running perfectly with Tomcat. I want to index a Wikipedia dump file using Solr. How do I do this using DataImportHandler? Is there any other way? I don't have any knowledge of XML. The file I mentioned is around 45 GB when extracted. Any help would be greatly appreciated. Update: I tried doing what is described on the DataImportHandler page, but I get an error, maybe because their version of Solr is much older. My data-config: <dataConfig>…
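If DataImportHandler keeps failing, another option is to push documents from a small client program using SolrJ. This is only a rough sketch: it assumes a Solr 3.x server at a typical Tomcat URL and id/title/text fields in your schema, and SolrJ class names differ across Solr versions.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexPages {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8080/solr");

            // In a real run you would stream pages out of the 45 GB dump
            // (for example with StAX) instead of hard-coding one document.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "12");             // page id taken from the dump
            doc.addField("title", "Anarchism");   // <title> element
            doc.addField("text", "…wikitext…");   // <text> element

            solr.add(doc);
            solr.commit();
            solr.shutdown();
        }
    }

Batching a few thousand documents per add() call and committing occasionally keeps the import reasonably fast for a dump of this size.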

Create a HashMap with a fixed Key corresponding to a HashSet. point of departure

Submitted by 泪湿孤枕 on 2019-12-19 11:17:35
Question: My aim is to create a HashMap with a String as the key and a HashSet of Strings as the value. OUTPUT: This is what the output looks like now: Hudson+(surname)=[Q2720681], Hudson,+Quebec=[Q141445], Hudson+(given+name)=[Q5928530], Hudson,+Colorado=[Q2272323], Hudson,+Illinois=[Q2672022], Hudson,+Indiana=[Q2710584], Hudson,+Ontario=[Q5928505], Hudson,+Buenos+Aires+Province=[Q10298710], Hudson,+Florida=[Q768903]] According to my idea, it should look like this: [Hudson+(surname)=[Q2720681…
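The usual pattern for this is a Map<String, Set<String>> where a new HashSet is created the first time a key is seen and later values are added to the same set. A small sketch using a couple of the keys quoted above (the second id added for "Hudson+(surname)" is hypothetical, only there to show grouping):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class TitleToIds {
        public static void main(String[] args) {
            Map<String, Set<String>> index = new HashMap<String, Set<String>>();

            add(index, "Hudson+(surname)", "Q2720681");
            add(index, "Hudson,+Quebec", "Q141445");
            add(index, "Hudson+(surname)", "Q1158570"); // hypothetical second id, same key

            System.out.println(index);
            // e.g. {Hudson+(surname)=[Q2720681, Q1158570], Hudson,+Quebec=[Q141445]}
        }

        static void add(Map<String, Set<String>> index, String key, String value) {
            // Create the set lazily the first time the key appears.
            Set<String> ids = index.get(key);
            if (ids == null) {
                ids = new HashSet<String>();
                index.put(key, ids);
            }
            ids.add(value);
        }
    }

On Java 8 or later the helper collapses to index.computeIfAbsent(key, k -> new HashSet<>()).add(value).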

API to get Wikipedia revision id by date [closed]

Submitted by 冷暖自知 on 2019-12-19 10:04:46
Question (closed as off-topic on Stack Overflow): Is there any API to get a Wikipedia revision id by date, instead of checking the entire revision history and extracting the most recent revision before that date? Thank you! Answer 1: The revision query API allows you to pass timestamps to get only revisions from a specified interval. Use api.php?action=query&prop…
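A sketch of that revisions query: prop=revisions with rvstart and rvdir=older returns the newest revision at or before a given timestamp. The article title and timestamp here are arbitrary examples, and the JSON is printed raw rather than parsed.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class RevisionAtDate {
        public static void main(String[] args) throws Exception {
            String title = URLEncoder.encode("Albert Einstein", "UTF-8");
            // rvdir=older + rvlimit=1 = the latest revision no newer than rvstart.
            String url = "https://en.wikipedia.org/w/api.php?action=query"
                    + "&prop=revisions&titles=" + title
                    + "&rvprop=ids%7Ctimestamp"
                    + "&rvstart=2015-01-01T00:00:00Z"
                    + "&rvdir=older&rvlimit=1&format=json";
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL(url).openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON containing the "revid" of that revision
                }
            }
        }
    }

The "revid" field in the response is the revision id the question asks for.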

Get first lines of Wikipedia Article

Submitted by 本小妞迷上赌 on 2019-12-18 11:53:32
Question: I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words; it doesn't matter) from the article. The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly of the print version), but how can I find the first lines actually displayed? Normally the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code. For example: Albert…
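One common route nowadays is the TextExtracts API (prop=extracts), which returns the lead section with infoboxes and markup already stripped; this is a sketch assuming the extension is enabled on the wiki (it is on Wikipedia), with Albert_Einstein used only as an example title.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class LeadText {
        public static void main(String[] args) throws Exception {
            // exintro   = only the text before the first heading
            // explaintext = strip HTML/wiki markup from the extract
            String url = "https://en.wikipedia.org/w/api.php?action=query"
                    + "&prop=extracts&exintro&explaintext"
                    + "&titles=Albert_Einstein&format=json";
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL(url).openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON; the "extract" field holds the lead text
                }
            }
        }
    }

From the returned extract you can then cut the first z lines, x characters, or y words yourself, since it is plain display text rather than wikitext.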

Getting hyperlinks of a Wikipedia page using DBpedia

Submitted by 耗尽温柔 on 2019-12-18 07:17:29
Question: I have two resources in DBpedia: dbr:Diabetes_mellitus and dbr:Hyperglycemia. In Wikipedia, the corresponding pages are wikipedia-en:Diabetes_mellitus and wikipedia-en:Hyperglycemia. In Wikipedia there is a hyperlink from the Diabetes_mellitus page to the Hyperglycemia page, but when I try to find the link between the two resources in DBpedia, I cannot find it. I tried to find the link using the following SPARQL query: SELECT ?prop WHERE { { dbr:Diabetes_mellitus ?prop dbr:Hyperglycemia } UNION { dbr…
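For reference, the page-link relation in DBpedia is dbo:wikiPageWikiLink, but it comes from the separate page-links dataset, which may not be loaded on every endpoint, and that could be why the query above finds nothing. Below is a small Apache Jena sketch of running the same kind of query against an endpoint; it assumes the ARQ library is on the classpath (older Jena 2.x uses the com.hp.hpl.jena packages instead).

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;

    public class PageLinks {
        public static void main(String[] args) {
            // List every predicate connecting the two resources on this endpoint;
            // if the page-links dataset is loaded, dbo:wikiPageWikiLink should appear.
            String query =
                  "PREFIX dbr: <http://dbpedia.org/resource/> "
                + "SELECT ?prop WHERE { dbr:Diabetes_mellitus ?prop dbr:Hyperglycemia }";

            QueryExecution qexec = QueryExecutionFactory.sparqlService(
                    "http://dbpedia.org/sparql", query);
            try {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.getResource("prop"));
                }
            } finally {
                qexec.close();
            }
        }
    }

If no properties come back at all, the link simply is not in the loaded datasets, and the page-links dump would have to be loaded into a local triple store instead.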