Nutch problems executing crawl

十年热恋 提交于 2019-12-08 06:03:46

问题


I am trying to get nutch 1.11 to execute a crawl. I am using cygwin to run these commands in windows 7.

Nutch is running, I am getting results from running bin/nutch, but I keep getting error messages when I try to run a crawl.

I am getting the following error when I try to run a crawl execute with nutch:

Error running: /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt

Failed with exit value 127.

I have my JAVA_HOME classpath set, and I have altered the host file to include the 127.0.0.1 as the localhost.

I am curious if I am calling the write directory correctly, if maybe that is the problem.

The full printout looks like:

User5@User5-PC /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local
$ bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/ TestCrawl/  2

Injecting seed URLs
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Injector: starting at 2015-12-23 17:48:21
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
        at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
        at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
        at org.apache.nutch.crawl.Injector.run(Injector.java:379)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:369)

Error running:
  /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Failed with exit value 127.

The hadoop log that I think may have something to do with the error I am getting is:

2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
    at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:432)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:478)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
2016-01-07 12:24:40,450 ERROR crawl.Injector - Injector: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.checkChars(URI.java:3021)
    at java.net.URI$Parser.parse(URI.java:3048)
    at java.net.URI.<init>(URI.java:746)
    at org.apache.hadoop.fs.Path.initialize(Path.java:203)
    ... 4 more

回答1:


You are running linux commands from Cygwin and there is no C:\ path in linux systems. Correct command should be something like

/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch1.11/runtime/local/bin/nutch inject TestCrawl/crawldb /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch1.11/runtime/local/urls/seed.txt



回答2:


You have answer to your problem in this message:

2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

This is happening because hadoop version included with nutch 1.11 is designed to work in linux out of the box and not on windows.

I had same situation and I ended up using nutch1.11 in ubuntu virtual box.




回答3:


hadoop-core jar file is needed when you are working with nutch

  • with nutch 1.11 compatible hadoop-core jar is 0.20.0
  • please download jar from this link : http://www.java2s.com/Code/Jar/h/Downloadhadoop0200corejar.htm

  • paste that jar into "C:\cygwin64\home\apache-nutch-1.11\lib" folder and it will run successfully.




回答4:


The problem is pretty clear. According to your hadoop log, it cannnot find the winutils.exe file. Include winutils.exe in %HADOOP_HOME%/bin folder



来源:https://stackoverflow.com/questions/34445121/nutch-problems-executing-crawl

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!