carrot2 - can I cluster documents from a folder?

Submitted by 冷暖自知 on 2019-12-06 12:32:21
Stanislaw Osinski

Currently Carrot2 Workbench does not support clustering files directly from a local folder. There are a few solutions here:

  1. Convert all your text files to Carrot2 XML format and cluster the XML file in Carrot2 Workbench.

  2. Index your files in Apache Solr and query your Solr index from Carrot2 Workbench.

  3. Convert your files to a Lucene index and query the index from Carrot2 Workbench. I wrote a simple utility for that task called folder2index (source code).

    Assuming you're on Windows, the indexing process is the following:

    1. Unzip the folder2index tool somewhere; let's assume you unzipped it to c:\carrot2\folder2index-0.0.2.

    2. To index text files from some directory (let's assume c:\txt-input) and create the index in c:\txt-input-index, do this:

      a. Open a command-line console (Start menu -> Run program -> type cmd and press Enter).

      b. In the console, type:

      cd c:\carrot2\folder2index-0.0.2
      java -jar folder2index-0.0.2.jar --index c:\txt-input-index --folders c:\txt-input --use-tika
      

      After a short while you should see something like:

      ...
      Index created: c:\txt-input-index
      
    3. Once you've indexed the files, you can cluster them in Carrot2 Workbench using the Lucene document source. Use the content field name to refer to the content of your text file; the name of the file is stored in the fileName field.

    A couple of notes:

    • Currently only PDF, HTML and TXT files are indexed; other files are ignored.

    • If the index already exists, files are added to the index. This means that if you run the command twice with the same parameters, the index will contain duplicate documents. To re-index a folder to which you've just added some files, it's best to delete the index directory first.

    • You can use the Query field in Carrot2 Workbench to select specific files from the index, e.g.:

      *:* -- retrieves all the content (up to the requested number of results)

      mining -- retrieves all the documents that contain the word "mining" in them (again, up to the requested number of results)

      "data mining" -- retrieves documents that contain the exact phrase "data mining"

      fileName:92* -- retrieves contents of files whose names start with "92"
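
For option 1 in the list above (converting text files to Carrot2 XML), a minimal sketch in Python might look like the following. It assumes the Carrot2 3.x XML input layout (a searchresult root with a query element and document elements holding title, url and snippet), so check the element names against the documentation of your Carrot2 version. The function name folder_to_carrot2_xml and its parameters are made up for illustration, and only .txt files are handled.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def folder_to_carrot2_xml(folder, output_file, query=""):
    """Write each .txt file in `folder` as one <document> of a Carrot2-style XML file.

    The element names (searchresult, query, document, title, url, snippet)
    are an assumption based on the Carrot2 3.x XML input format.
    Returns the number of documents written.
    """
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query

    # sorted() keeps document ids stable between runs
    txt_files = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".txt"
    )
    for doc_id, path in enumerate(txt_files):
        doc = ET.SubElement(root, "document", id=str(doc_id))
        ET.SubElement(doc, "title").text = path.name
        ET.SubElement(doc, "url").text = path.resolve().as_uri()
        # errors="replace" avoids a crash on files that are not valid UTF-8
        ET.SubElement(doc, "snippet").text = path.read_text(
            encoding="utf-8", errors="replace"
        )

    ET.ElementTree(root).write(output_file, encoding="utf-8", xml_declaration=True)
    return len(txt_files)
```

With the folder names used in this answer, a call such as folder_to_carrot2_xml(r"c:\txt-input", r"c:\txt-input.xml") would produce a single XML file to open in Carrot2 Workbench. Files containing control characters that XML does not allow may need cleaning first.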

I recently built a document clustering tool called Document Organizer. It is written in Java and is completely free. It can cluster a large collection of documents with the following extensions:

  • txt
  • pdf
  • doc
  • docx
  • xls
  • xlsx
  • ppt
  • pptx

If this software doesn't fulfill your requirements, please let me know.

Here's the link: http://www.computergodzilla.com

If you want to read more, refer here: http://computergodzilla.blogspot.com/2013/07/document-organizer-software.html
