问题
newbie in apache nutch - writing a client to use it via REST. succeed in all the steps (INJECT, FETCH...) - in the last step - when trying to index to solr - it fails to pass the parameter. The Request (I formatted it in some website)
{
"args": {
"batch": "1463743197862",
"crawlId": "sample-crawl-01",
"solr.server.url": "http:\/\/x.x.x.x:8081\/solr\/"
},
"confId": "default",
"type": "INDEX",
"crawlId": "sample-crawl-01"
}
The Nutch logs:
java.lang.Exception: java.lang.RuntimeException: Missing SOLR URL. Should be set via -D solr.server.url
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Was that implemented? the param passing to solr plugin?
回答1:
You need to create/update a configuration using the /config/create/
endpoint, with a POST request and a payload similar to:
{
"configId":"solr-config",
"force":"true",
"params":{"solr.server.url":"http://127.0.0.1:8983/solr/"}
}
In this case I'm creating a new configuration and specifying the solr.server.url
parameter. You can verify this is working with a GET request to /config/solr-config
(solr-config
is the previously specified configId
), the output should contain all the default parameters see https://gist.github.com/jorgelbg/689b1d66d116fa55a1ee14d7193d71b4 for an example/default output. If everything worked fine in the returned JSON you should see the solr.server.url
option with the desired value https://gist.github.com/jorgelbg/689b1d66d116fa55a1ee14d7193d71b4#file-nutch-solr-config-json-L464.
After this just hit the /job/create
endpoint to create a new INDEX
Job, the payload should be something like:
{
"type":"INDEX",
"confId":"solr-config",
"crawlId":"crawl01",
"args": {}
}
The idea is that need to you pass the configId
that you created with the solr.server.url
specified along with the crawlId
and other args. This should return something similar to:
{
"id": "crawl01-solr-config-INDEX-1252914231",
"type": "INDEX",
"confId": "solr-config",
"args": {},
"result": null,
"state": "RUNNING",
"msg": "OK",
"crawlId": "crawl01"
}
Bottom line you need to create a new configuration with the solr.server.url
setted instead of specifying it through the args
key in the JSON payload.
来源:https://stackoverflow.com/questions/37345524/apache-nutch-to-index-to-solr-via-rest