lucene

How to find similar documents

你说的曾经没有我的故事 posted on 2020-01-01 00:45:13
Question: How do you find documents similar to a given document in Lucene? I do not know what the text is; I only know which document it is. Is there a way to find similar documents in Lucene? I am a newbie, so I may need some hand-holding. Answer 1: You may want to check the MoreLikeThis feature of Lucene. MoreLikeThis constructs a Lucene query based on terms within a document to find other similar documents in the index. http://lucene.apache.org/java/3_0_1/api/contrib-queries/org/apache/lucene/search
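
The idea behind the answer can be sketched outside Lucene. The Python toy below is not Lucene's MoreLikeThis implementation; it only illustrates the general approach the answer describes: select the most distinctive terms of a source document and use them as a query against the other documents. The corpus, the function names, the whitespace tokenizer, and the TF-IDF weighting are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; Lucene uses analyzers).
    return text.lower().split()


def top_terms(doc_id, docs, max_terms=3):
    """Return the highest-scoring TF-IDF terms of docs[doc_id]."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for text in docs.values() for t in set(tokenize(text)))
    tf = Counter(tokenize(docs[doc_id]))
    scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_terms]


def more_like_this(doc_id, docs, max_terms=3):
    """Rank the other documents by how many selected terms they contain."""
    terms = set(top_terms(doc_id, docs, max_terms))
    hits = []
    for other_id, text in docs.items():
        if other_id == doc_id:
            continue
        overlap = len(terms & set(tokenize(text)))
        if overlap:
            hits.append((overlap, other_id))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [other_id for _, other_id in hits]


# Hypothetical three-document corpus keyed by document id.
DOCS = {
    "a": "lucene index search query",
    "b": "lucene query parser syntax",
    "c": "cooking pasta recipes",
}

print(more_like_this("a", DOCS))
```

Note that the caller only supplies a document id, not query text, which matches the situation in the question. In Lucene itself the term selection and query construction are handled by the MoreLikeThis class the answer points to.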

How does Lucene/Solr achieve high performance in multi-field / faceted search?

白昼怎懂夜的黑 posted on 2019-12-31 22:24:31
Question. Context: This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which search can happen along multiple independent dimensions (facets) of objects (for example the size, speed, or price of a car). When implemented with a relational database, multi-field indices are not useful for a large number of facets: facets can be searched in any order, so any specific ordered multi-field index has a low chance of being used, and creating all possible orderings of indices

How does Lucene/Solr achieve high performance in multi-field / faceted search?

折月煮酒 posted on 2019-12-31 22:23:24
Question. Context: This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which search can happen along multiple independent dimensions (facets) of objects (for example the size, speed, or price of a car). When implemented with a relational database, multi-field indices are not useful for a large number of facets: facets can be searched in any order, so any specific ordered multi-field index has a low chance of being used, and creating all possible orderings of indices

What is the easiest way to implement terms association mining in Solr?

亡梦爱人 posted on 2019-12-31 21:44:20
Question: Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a co-occurrence matrix of docs × terms and find the terms that occur in the same documents most often. In my previous projects I implemented it directly in Lucene by iterating over TermDocs (which I got by calling IndexReader.termDocs(Term)). But I can't see anything similar in
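
The co-occurrence approach described in the question can be sketched independently of Lucene or Solr. The Python below is a minimal illustration under assumed inputs (a tiny in-memory corpus and whitespace tokenization, both hypothetical); for each pair of terms it counts the number of documents that contain both, then lists the terms most often seen together with a given term.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count, for every unordered term pair, the documents containing both."""
    counts = Counter()
    for text in docs:
        # Each term counts once per document; sorting gives a canonical pair order.
        terms = sorted(set(text.lower().split()))
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1
    return counts


def related_terms(term, counts, top_n=3):
    """Return the terms that share the most documents with `term`."""
    scored = []
    for (a, b), count in counts.items():
        if a == term:
            scored.append((count, b))
        elif b == term:
            scored.append((count, a))
    # Highest count first, ties broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [other for _, other in scored[:top_n]]


# Hypothetical corpus; in practice the terms would come from the index.
CORPUS = [
    "solr lucene search",
    "lucene index search",
    "solr facet",
]
COUNTS = cooccurrence(CORPUS)

print(related_terms("lucene", COUNTS))
```

This pairwise count grows quadratically with the number of distinct terms per document, so a real corpus would need pruning (for example, ignoring very rare or very common terms).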

What is the easiest way to implement terms association mining in Solr?

自闭症网瘾萝莉.ら posted on 2019-12-31 21:43:10
Question: Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a co-occurrence matrix of docs × terms and find the terms that occur in the same documents most often. In my previous projects I implemented it directly in Lucene by iterating over TermDocs (which I got by calling IndexReader.termDocs(Term)). But I can't see anything similar in

Full-text search in .NET Core using Lucene.Net and the PanGu (盘古) word segmenter

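和自甴很熟 posted on 2019-12-31 20:11:21
Lucene.net: Lucene.net is the .NET port of Lucene, an open-source toolkit for building full-text search engines. It is not a complete full-text search engine but a search engine architecture that provides a complete query engine and indexing engine; it is a high-performance, scalable text search library. Its job is to split text into terms using some word segmentation algorithm and store the result in an index, from which data can be retrieved very quickly. Lucene.net requires an index and can only perform site-internal search. (From Baidu Baike.) Screenshot. PanGu segmentation. How to use: add PanGu.dll and PanGu.Lucene.Analyzer.dll to the project, then copy the Dict folder into the project's Bin folder. Dictionary folder download: https://pan.baidu.com/s/1HNiLp6bCcodN8vqlck066g extraction code: xydc. Test: PanGu segmentation works better than the unigram segmentation bundled with Lucene.net, because unigram segmentation is not suitable for Chinese retrieval. Unigram segmentation splits text character by character; for the test sentence, the result is: "有","一","种","方","言","叫","做","不","老","盖","儿". A search for the word "方言" then finds no results. This does not match how people search, so unigram segmentation is rarely used. Extension: to have "不老盖儿" (Henan dialect) from the sentence above treated as a single word, create the phrase "不老盖儿" and add it to the dictionary. Use the DictManage tool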
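
The contrast between unigram and dictionary-based segmentation can be illustrated with a small Python sketch. This is not PanGu's algorithm: it uses forward maximum matching, a simple textbook technique swapped in for illustration, and the dictionary contents are hypothetical. It shows that "方言" is not among the unigram tokens, and that "不老盖儿" only becomes a single token once it is added to the dictionary.

```python
def unigram_tokens(text):
    # One token per character, as unigram segmentation produces.
    return list(text)


def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word.

    Falls back to a single character when no dictionary word matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


SENTENCE = "有一种方言叫做不老盖儿"
# Hypothetical dictionary; "不老盖儿" plays the role of the user-added entry.
DICTIONARY = {"一种", "方言", "叫做", "不老盖儿"}

print(unigram_tokens(SENTENCE))
print(forward_max_match(SENTENCE, DICTIONARY))
```

Removing "不老盖儿" from the dictionary makes the last four characters fall back to single-character tokens, which is the situation the dictionary tool is meant to fix.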

Natural language processing applied in real life

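浪尽此生 posted on 2019-12-31 15:48:09
Build your own search tool. Author: 白宁超, 2016-04-12 16:31:48. Abstract: Search has become an indispensable part of daily life, whether on Baidu or Google, when finding friends on WeChat, or when looking up a keyword in a piece of text. E-commerce sites such as Amazon, JD.com, Tmall, and Suning also put a great deal into search (faceted search and so on). For developers, search is often a key feature of an application, especially one driven by large-scale text data. Another class of search, such as intelligent voice retrieval, uses classification, clustering, neural networks, and similar methods for model evaluation and returns reasonably good matches to the user. What needs emphasizing is that it returns fuzzy, approximate results ranked by a scoring mechanism, which differs from traditional exact-match queries. The scoring of such results relies mainly on precision and recall. The advantages of building your own search tool are flexibility, low development cost, a better understanding of your own tool, and of course it is free. The author spent a great deal of time collecting material, researching, and experimenting, and shares the result here. (This article is original; please cite the source when reposting: Build your own search tool.) Contents: [Text Mining (0)] A quick look at what natural language processing is. [Text Mining (1)] OpenNLP: mastering text and word segmentation. [Text Mining (2)] [NLP] Tika text preprocessing: extracting content from files of various formats. [Text Mining (3)] Build your own search tool. 1 Introduction to the Apache Solr search server. 1.1. What is Solr? Solr is an open-source search server based on Lucene Java that is easy to add to Web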

Kibana query exact match

人走茶凉 posted on 2019-12-31 11:02:13
Question: I would like to know how to query a field so that it exactly matches a string. I'm currently trying to query like this: url : "http://www.domain_name.com" which returns all strings starting with http://www.domain_name.com. Answer 1: I had a similar issue, and I found that ".raw" fixed it. In your example, try url.raw : "http://www.domain_name.com" Answer 2: Just giving more visibility to @dezhi's comment: in newer versions of ES (5.x, 6.x), you should use `url.keyword` instead, as they have changed to a new keyword
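
The same exact-match lookup can be written in the Elasticsearch Query DSL. The Python sketch below only builds the request body as a dictionary and prints it; it does not contact a cluster. It assumes the default dynamic mapping of ES 5.x and later, in which a string field such as `url` is indexed as analyzed text with an un-analyzed `keyword` sub-field. With a custom mapping that sub-field may not exist. The helper name is hypothetical.

```python
import json


def exact_match_query(field, value):
    """Build a term query against the un-analyzed keyword sub-field.

    Assumes the index has a `<field>.keyword` sub-field (default dynamic
    mapping in ES 5.x and later); this helper is illustrative only.
    """
    return {"query": {"term": {f"{field}.keyword": value}}}


body = exact_match_query("url", "http://www.domain_name.com")
print(json.dumps(body, indent=2))
```

A `term` query compares the stored value as a whole, which is why it is paired with the `keyword` sub-field rather than the analyzed `url` field.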

Passing a Lucene query to Neo4j REST API using Cypher 2.0

无人久伴 posted on 2019-12-31 04:29:33
Question: If I have a Lucene query such as (title:"foo bar" AND body:baz*) OR title:bat, is there any straightforward way to pass this into a Cypher query? It looks like this sort of thing used to work with START and the old node_auto_index, but I'm not sure how to do this properly with Cypher 2.0. I've tried sticking it in the MATCH clause, but I get invalid syntax errors: MATCH (item:Item {...}) RETURN item I'm about to write a parser that converts a Lucene query to a parameterized Cypher query, but I thought

Elasticsearch overview

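狂风中的少年 posted on 2019-12-31 03:51:25
From first encountering Elasticsearch in 2012 to using it at scale at work in 2013, I have something of a history with it. Starting in 2014 I read through its source code on and off. My overall impression was that at first it felt like being lost in deep clouds, then the clouds gradually parted, and finally everything became clear. After reading it I could not help but admire the author's skill: building such a system single-handedly and writing such elegant code is no small feat. This series is my notes from reading the source, and I hope it helps anyone who wants to read, or is already reading, the ES source. I will first cover the concepts and theory of Elasticsearch and then analyze the source code. As the first distributed full-text search system based on Lucene (distributed Solr came after it), Elasticsearch has been widely adopted and recognized around the world since its release in 2010 and is one of the most successful open-source systems today. Driven by explosive data growth, Lucene (and Solr), despite steadily improving performance, still struggled to meet demand. Combining Lucene's excellent full-text search with distribution, partitioning the Lucene index to scale it out, and managing the indices in a distributed fashion gave rise to this distributed full-text search system. Lucene, an excellent open-source full-text search library, is widely used by companies worldwide and is implemented strictly according to full-text retrieval theory. Its index stores the inverted mapping from terms to documents as well as the forward mapping from documents to terms. Its implementation is also quite ingenious, which is why it delivers such good performance and search quality. For details, see the official documentation or related blogs [1]. Some Elasticsearch concepts: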