carrot2

carrot2 - can I cluster documents from a folder?

泪湿孤枕 submitted on 2020-01-02 10:26:45
Question: I'm trying to cluster documents I have collected as part of a research project. I am trying to use Carrot2 workbench and can't find out how to point carrot at the folder containing the documents. How do I do this please? (I have a small number of documents (.txt) to compare and they're on a standalone research machine so I can't connect to the web and process them there). Any help gratefully received! (I am trying to identify similarities/themes/groups across the documents; if Carrot2 isn't

Carrot2 workbench not able to process large data

空扰寡人 submitted on 2019-12-25 04:25:02
Question: I wanted to cluster my data set using the Carrot2 workbench. I have an input XML file with 65536 documents. I am using the Lingo clustering algorithm. But when I start the process, the workbench returns the result within a few seconds, with all the documents in the "other topics" cluster. I have checked the clustering with smaller data sets and I do get results there. Answer 1: Carrot2's Lingo algorithm was designed for small data sets, up to a thousand or so documents. For larger data sets, you may

Searching over documents stored in Hadoop - which tool to use?

前提是你 submitted on 2019-12-18 12:38:22
Question: I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI... When you read about one of them, you can often be sure that each of the other tools is going to be mentioned. I don't expect you to explain every tool to me, certainly not. If you could help me narrow this set down for my particular scenario, that would be great. So far I'm not sure which of the above will fit, and it looks like (as always) there is more than one way of doing what's to be done. The

Carrot2+ElasticSearch Basic Flow of Information

随声附和 submitted on 2019-12-08 09:52:31
Question: I am using Carrot2 and ElasticSearch. I had an ElasticSearch server running with a lot of data when I installed the carrot2 plugin. I wanted to get answers to a few basic questions: Will clustering work only on newly indexed documents, or on old documents as well? How can I specify which fields to look at for clustering? The curl command is working and giving some results. How can I get the curl command which takes a JSON as input to a REST API url of the form localhost:9200/article-index/article/_search

carrot2 - can I cluster documents from a folder?

冷暖自知 submitted on 2019-12-06 12:32:21
I'm trying to cluster documents I have collected as part of a research project. I am trying to use Carrot2 workbench and can't find out how to point carrot at the folder containing the documents. How do I do this please? (I have a small number of documents (.txt) to compare and they're on a standalone research machine so I can't connect to the web and process them there). Any help gratefully received! (I am trying to identify similarities/themes/groups across the documents; if Carrot2 isn't the right tool then would be grateful for alternative suggestions!) Many thanks, John Stanislaw Osinski
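One way to approach the folder question, offered only as a rough sketch: convert the .txt files into a single XML file and open that file in the Workbench as an XML document source. The sketch below assumes the Carrot2 3.x XML input layout (a `searchresult` root with `query` and `document` elements holding `title`, `url` and `snippet`); that layout should be checked against the documentation of the installed Carrot2 version. The function name `folder_to_carrot2_xml` and the choice to use each file name as the document title are illustrative assumptions, not part of Carrot2.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def folder_to_carrot2_xml(folder, query=""):
    """Build an XML tree with one <document> per .txt file in `folder`.

    Assumption: the target is the Carrot2 3.x XML input layout
    (<searchresult>/<query>/<document>); verify it for your version.
    """
    root = ET.Element("searchresult")
    # Optional hint describing what the collection is about.
    ET.SubElement(root, "query").text = query

    # Sort the paths so document ids are stable between runs.
    for doc_id, path in enumerate(sorted(Path(folder).glob("*.txt"))):
        doc = ET.SubElement(root, "document", id=str(doc_id))
        # Hypothetical choice: file name (without extension) as the title.
        ET.SubElement(doc, "title").text = path.stem
        ET.SubElement(doc, "url").text = path.resolve().as_uri()
        # The whole file body goes into the snippet; undecodable bytes
        # are replaced rather than aborting the conversion.
        ET.SubElement(doc, "snippet").text = path.read_text(
            encoding="utf-8", errors="replace"
        )
    return ET.ElementTree(root)
```

For example, `folder_to_carrot2_xml("my_docs").write("my_docs.xml", encoding="utf-8", xml_declaration=True)` would write the combined file, where `my_docs` is a placeholder folder name. Everything runs locally, so it should work on a machine with no network connection.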

Searching over documents stored in Hadoop - which tool to use?

泪湿孤枕 submitted on 2019-11-30 07:34:20
I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI... When you read about one of them, you can often be sure that each of the other tools is going to be mentioned. I don't expect you to explain every tool to me, certainly not. If you could help me narrow this set down for my particular scenario, that would be great. So far I'm not sure which of the above will fit, and it looks like (as always) there is more than one way of doing what's to be done. The scenario is: 500GB - ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc,