Question
My aim is to create a HashMap with a String as the key and a HashSet of Strings as the value.
OUTPUT
This is what the output looks like now:
Hudson+(surname)=[Q2720681], Hudson,+Quebec=[Q141445], Hudson+(given+name)=[Q5928530], Hudson,+Colorado=[Q2272323], Hudson,+Illinois=[Q2672022], Hudson,+Indiana=[Q2710584], Hudson,+Ontario=[Q5928505], Hudson,+Buenos+Aires+Province=[Q10298710], Hudson,+Florida=[Q768903]]
According to my idea, it should look like this:
[Hudson+(surname)=[Q2720681,Q141445,Q5928530,Q2272323,Q2672022]]
The purpose is to store a particular name in Wikidata together with all of the Q values associated with its disambiguation. For example:
This is the page for "Bush".
I want Bush to be the key. Then, for all of the different points of departure (all of the different ways that Bush could be associated with a terminal page of Wikidata), I want to store the corresponding "Q value", or unique alphanumeric identifier.
What I'm actually doing is scraping the different names from the Wikipedia disambiguation page and then looking up the unique alphanumeric identifier associated with each name in Wikidata.
For example, with Bush we have:
George H. W. Bush
George W. Bush
Jeb Bush
Bush family
Bush (surname)
Accordingly the Q values are:
George H. W. Bush (Q23505)
George W. Bush (Q207)
Jeb Bush (Q221997)
Bush family (Q2743830)
Bush (Q1484464)
My idea is that the data structure should be structured in the following way:
Key:Bush
Entry Set: Q23505, Q207, Q221997, Q2743830, Q1484464
But the code I have now doesn't do that.
It creates a separate entry for each name and Q value, i.e.
Key:Jeb Bush
Entry Set: Q221997
Key:George W. Bush
Entry Set: Q207
and so on.
The full code in all its glory can be seen on my GitHub page, but I'll summarize it below as well.
This is what I'm using to add values to my data structure:
// add Q values to their arrayList in the hash map at the index of the appropriate entity
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key))
    {
        return q_valMap.put(key, new HashSet<String>() );
    }
    HashSet<String> list = q_valMap.get(key);
    list.add(value);
    return q_valMap.put(key, list);
}
This is how I fetch the content:
while ((line_by_line = wiki_data_pagecontent.readLine()) != null)
{
    // if we can determine it's a disambig page we need to send it off to get all
    // the possible senses in which it can be used.
    Pattern disambig_pattern = Pattern.compile("<div class=\"wikibase-entitytermsview-heading-description \">Wikipedia disambiguation page</div>");
    Matcher disambig_indicator = disambig_pattern.matcher(line_by_line);
    if (disambig_indicator.matches())
    {
        //off to get the different usages
        Wikipedia_Disambig_Fetcher.all_possibilities( variable_entity );
    }
    else
    {
        //get the Q value off the page by matching
        Pattern q_page_pattern = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
            "wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
            "href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
        Matcher match_Q_component = q_page_pattern.matcher(line_by_line);
        if ( match_Q_component.matches() )
        {
            String Q = match_Q_component.group(1);
            // 'Q' should be appended to an array, since each entity can hold multiple
            // Q values on that basis of disambig
            put_to_hash( variable_entity, Q );
        }
    }
}
And this is how I deal with a disambiguation page:
public static void all_possibilities( String variable_entity ) throws Exception
{
    System.out.println("this is a disambig page");
    //if it's a disambig page we know we can go right to the wikipedia
    //get its normal wiki disambig page
    Document docx = Jsoup.connect( "https://en.wikipedia.org/wiki/" + variable_entity ).get();
    //this can handle the less structured ones.
    Elements linx = docx.select( "p:contains(" + variable_entity + ") ~ ul a:eq(0)" );
    for (Element linq : linx)
    {
        System.out.println(linq.text());
        String linq_nospace = linq.text().replace(' ', '+');
        Wikidata_Q_Reader.getQ( linq_nospace );
    }
}
I was thinking maybe I could pass the key value around, but I really don't know. I'm kind of stuck. Maybe someone can see how I can implement this functionality.
Answer 1:
I'm not clear from your question what isn't working, or if you're seeing actual errors. But while your basic data structure idea (HashMap of String to Set<String>) is sound, there's a bug in the "add" function.
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key))
    {
        return q_valMap.put(key, new HashSet<String>() );
    }
    HashSet<String> list = q_valMap.get(key);
    list.add(value);
    return q_valMap.put(key, list);
}
In the case where a key is seen for the first time (if (!q_valMap.containsKey(key))), it vivifies a new HashSet for that key, but it doesn't add value to it before returning. (And the returned value is the old value for that key, so it'll be null.) So you're going to lose the first Q-value added for every key.
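To make that concrete, here is a minimal sketch that reproduces the behavior. The question doesn't show how q_valMap is declared, so the field declaration, the class name FirstValueLostDemo, and the method name putToHashOriginal are assumptions for illustration; the method body follows the same logic as the original put_to_hash.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

class FirstValueLostDemo {
    // Assumed declaration; the question does not show the real one.
    static final Map<String, HashSet<String>> qValMap = new HashMap<>();

    // Same logic as the put_to_hash from the question.
    static HashSet<String> putToHashOriginal(String key, String value) {
        if (!qValMap.containsKey(key)) {
            // Map.put returns the PREVIOUS mapping (null here), and value is never added.
            return qValMap.put(key, new HashSet<String>());
        }
        HashSet<String> list = qValMap.get(key);
        list.add(value);
        return qValMap.put(key, list);
    }

    public static void main(String[] args) {
        System.out.println(putToHashOriginal("Bush", "Q23505")); // prints null
        System.out.println(putToHashOriginal("Bush", "Q207"));   // prints [Q207]
        System.out.println(qValMap);                              // prints {Bush=[Q207]}
    }
}
```

Q23505 never makes it into the set because the first call only creates the empty set and returns.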
For multi-layered data structures like this, I usually special-case just the vivification of the intermediate structure, and then do the adding and return in a single code path. I think this would fix it. (I'm also going to call it valSet because it's a set and not a list. And there's no need to re-add the set to the map each time; it's a reference type and gets added the first time you encounter that key.)
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key)) {
        q_valMap.put(key, new HashSet<String>());
    }
    HashSet<String> valSet = q_valMap.get(key);
    valSet.add(value);
    return valSet;
}
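If you're on Java 8 or later, the same "create on first sight, then add" pattern can be written with Map.computeIfAbsent. This is an alternative sketch, not the code from the question: the class name QValueIndex and its put and size methods are made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class QValueIndex {
    private final Map<String, Set<String>> qValMap = new HashMap<>();

    // Creates the set the first time a key is seen, then adds in a single code path.
    Set<String> put(String key, String value) {
        Set<String> valSet = qValMap.computeIfAbsent(key, k -> new HashSet<>());
        valSet.add(value);
        return valSet;
    }

    // Number of Q values stored under key, or 0 if the key is unknown.
    int size(String key) {
        Set<String> valSet = qValMap.get(key);
        return valSet == null ? 0 : valSet.size();
    }

    public static void main(String[] args) {
        QValueIndex index = new QValueIndex();
        index.put("Bush", "Q23505");
        index.put("Bush", "Q207");
        index.put("Bush", "Q207"); // duplicate, ignored by the set
        System.out.println(index.size("Bush")); // prints 2
    }
}
```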
Also be aware that the Set you return is a reference to the live Set for that key, so you need to be careful about modifying it in callers, and if you're doing multithreading you're going to have concurrent access issues.
Or just use a Guava Multimap so you don't have to worry about writing the implementation yourself.
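One possible way to guard against both problems, sketched here under the assumption that callers only need to read the sets: hand out an unmodifiable view, and back the structure with ConcurrentHashMap. The class SafeQValueIndex and its method names are hypothetical and not part of the original code.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SafeQValueIndex {
    private final Map<String, Set<String>> qValMap = new ConcurrentHashMap<>();

    // computeIfAbsent on ConcurrentHashMap is atomic, and newKeySet() gives a concurrent set.
    void put(String key, String value) {
        qValMap.computeIfAbsent(key, k -> ConcurrentHashMap.newKeySet()).add(value);
    }

    // Callers get a read-only view, so they cannot change the live set by accident.
    Set<String> get(String key) {
        Set<String> valSet = qValMap.get(key);
        if (valSet == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(valSet);
    }

    public static void main(String[] args) {
        SafeQValueIndex index = new SafeQValueIndex();
        index.put("Bush", "Q23505");
        index.put("Bush", "Q207");
        System.out.println(index.get("Bush").size()); // prints 2
        try {
            index.get("Bush").add("Q1");
        } catch (UnsupportedOperationException e) {
            System.out.println("read-only view"); // prints read-only view
        }
    }
}
```

The view still reflects later additions made through put; if callers need a stable snapshot, returning a copy such as new HashSet<>(valSet) would be the alternative.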
Source: https://stackoverflow.com/questions/29814038/create-a-hashmap-with-a-fixed-key-corresponding-to-a-hashset-point-of-departure