lucene

Solr Fundamentals [Inverted Index, Fuzzy Query]

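北战南征 submitted on 2021-02-16 06:51:23
1. Introduction
Many different kinds of technology systems exist today, such as relational databases, key-value stores, map-reduce engines that operate on files on disk, and graph databases, all designed to help users solve challenging data storage and retrieval problems. Search engines, and Solr in particular, address one specific class of problem: searching large volumes of unstructured text and returning the most relevant results.
2. Documents
Solr is a document storage and retrieval engine. Every piece of data submitted to Solr for processing is a document. A document can be a news article, a resume, a social media user profile, or even an entire book.
Each document contains one or more fields, and each field is assigned a specific field type: string, tokenized text, Boolean, date/time, latitude/longitude, and so on. The number of potential field types is unlimited, because a field type is composed of a number of analysis steps that determine how data in the field is processed and how it is mapped into the Solr index. Each field is assigned its field type in Solr's schema file, which tells Solr how to handle content of that kind when it arrives.
To run a query against Solr, you can search one or more fields across documents, even if a field is not present in a given document. Solr returns the documents whose field content matches the query. Note that although Solr offers a flexible schema for documents, flexible does not mean schemaless. In Solr's schema file every field must be defined, and every field name [including dynamic field naming patterns] must be assigned a type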
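The rule that every field name, including names matched by a dynamic-field pattern, must resolve to a type can be illustrated with a small toy model. This is only a sketch under stated assumptions: `SchemaSketch`, its `FieldType` enum, and the `*_dt` suffix pattern are hypothetical names invented for illustration, and none of this is Solr's actual API or schema file format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of "flexible but not schemaless"; not Solr's real API.
public class SchemaSketch {
    // Hypothetical field types, loosely mirroring the kinds listed in the text.
    public enum FieldType { STRING, TEXT, BOOLEAN, DATE, LOCATION }

    // Fields declared by exact name.
    private final Map<String, FieldType> explicitFields = new LinkedHashMap<>();
    // Dynamic-field patterns of the assumed form "*suffix", stored by suffix.
    private final Map<String, FieldType> dynamicSuffixes = new LinkedHashMap<>();

    public void defineField(String name, FieldType type) {
        explicitFields.put(name, type);
    }

    public void defineDynamicField(String pattern, FieldType type) {
        if (!pattern.startsWith("*") || pattern.length() < 2) {
            throw new IllegalArgumentException("pattern must look like *_suffix: " + pattern);
        }
        dynamicSuffixes.put(pattern.substring(1), type);
    }

    // Returns the type for a field name, or null when the schema does not cover it.
    public FieldType resolve(String fieldName) {
        FieldType exact = explicitFields.get(fieldName);
        if (exact != null) {
            return exact;
        }
        for (Map.Entry<String, FieldType> entry : dynamicSuffixes.entrySet()) {
            if (fieldName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Rejects a document that contains any field the schema does not define.
    public void validate(Map<String, Object> document) {
        for (String name : document.keySet()) {
            if (resolve(name) == null) {
                throw new IllegalArgumentException("undefined field: " + name);
            }
        }
    }

    public static void main(String[] args) {
        SchemaSketch schema = new SchemaSketch();
        schema.defineField("title", FieldType.TEXT);
        schema.defineDynamicField("*_dt", FieldType.DATE);

        // Accepted: "title" is explicit, "created_dt" matches the dynamic pattern.
        schema.validate(Map.of("title", "Solr basics", "created_dt", "2021-02-16"));
        System.out.println("created_dt resolves to " + schema.resolve("created_dt"));

        // Rejected: "color" is covered by neither an explicit nor a dynamic field.
        try {
            schema.validate(Map.of("color", "red"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Documents may omit any defined field, which is the flexible part; what they may not do is introduce a field name that no definition covers.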

SOLR and Natural Language Parsing - Can I use it?

大兔子大兔子 submitted on 2021-02-15 08:18:53
Question: Requirements: Word frequency algorithm for natural language processing. Using Solr: While the answer for that question is excellent, I was wondering if I could make use of all the time I spent getting to know SOLR for my NLP. I thought of SOLR because: It's got a bunch of tokenizers and performs a lot of NLP. It's pretty easy to use out of the box. It's a RESTful distributed app, so it's easy to hook up. I've spent some time with it, so using it could save me time. Can I use Solr? Although the above

SOLR and Natural Language Parsing - Can I use it?

若如初见. submitted on 2021-02-15 08:16:45
Question: Requirements: Word frequency algorithm for natural language processing. Using Solr: While the answer for that question is excellent, I was wondering if I could make use of all the time I spent getting to know SOLR for my NLP. I thought of SOLR because: It's got a bunch of tokenizers and performs a lot of NLP. It's pretty easy to use out of the box. It's a RESTful distributed app, so it's easy to hook up. I've spent some time with it, so using it could save me time. Can I use Solr? Although the above

Introduction to elasticsearch and elasticsearch_dsl

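岁酱吖の submitted on 2021-02-13 07:19:48
elasticsearch
ES is a near-real-time distributed search engine built on Lucene, with data stored in shards.
Terminology:
Lucene: a storage and query framework written in Java. It builds an inverted index from the relationships between documents and their text and stores it internally as multiple segments. It is the core component of ES but has no distributed capability of its own.
segment: the smallest storage unit inside Lucene, and therefore also in ES. Several small segments can be merged into a larger one, but a segment cannot be split.
shard: to handle massive data volumes, ES adds the concept of shards on top of Lucene. Each shard stores part of the data and can have multiple replicas; an internal routing algorithm routes data to the shards to support distributed storage and querying.
Near real time: strictly speaking, ES is not a database in which data becomes visible the moment it is indexed. Data is first written to memory on the machine holding the primary shard; a refresh then produces a new segment, and only data in such a segment is searchable. If a search lands on a replica that has not yet caught up, the data will not be found. The default refresh interval is 1s, and it can be adjusted through the refresh_interval parameter (typically set to 30s-60s to improve performance and user experience).
Distributed: ES is distributed by design, and configuring and using it differs little from running a single machine. It can scale quickly to clusters of more than a thousand machines and supports retrieval over PB-scale data. An internal routing algorithm stores data on shards across different nodes; when a user issues a query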
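The idea of routing a document to a shard can be sketched as a hash of a routing key taken modulo the number of primary shards. This is a minimal illustration under assumptions: `ShardRoutingSketch` and `shardFor` are hypothetical names, and `String.hashCode` is only a stand-in, since Elasticsearch uses its own hash function and additional factors internally.

```java
// Toy hash-mod routing; not Elasticsearch's actual implementation.
public class ShardRoutingSketch {

    // Maps a routing key (for example a document id) to a shard number
    // in the range [0, numPrimaryShards).
    public static int shardFor(String routingKey, int numPrimaryShards) {
        if (numPrimaryShards <= 0) {
            throw new IllegalArgumentException("numPrimaryShards must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(routingKey.hashCode(), numPrimaryShards);
    }

    public static void main(String[] args) {
        int shards = 5; // assumed shard count for the example
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, shards));
        }
    }
}
```

Because the result depends on the shard count, the same key maps to a different shard if that count changes, which is one reason a scheme like this needs the number of primary shards to stay fixed once data has been written.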

extracting all fields from a Lucene8 index

喜你入骨 submitted on 2021-02-11 14:54:28
Question: Given an index created with Lucene-8, but without knowledge of the fields used, how can I programmatically extract all the fields? (I'm aware that the Luke browser can be used interactively (thanks to @andrewjames) Examples for using latest version of Lucene.) The scenario is that, during a development phase, I have to read indexes without prescribed schemas. I'm using IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index))); IndexSearcher searcher = new IndexSearcher

How to make modifications to SOLR's tfidf similarity?

瘦欲@ submitted on 2021-02-11 13:32:10
Question: I am searching titles, so the mere presence of a word is sufficient and its frequency is not relevant, at least for my use case. For example, the search query is: "board early with my pets". The results I got are: Result 1: Pets 2.3924026; Result 2: Pets Counts against in cabin pet limit 2.0538325; Result 3: Pets Preboarding allowed 1.6092906. Ideally I want result 3 to come out on top, which needs some external work. Result 1 is obvious and acceptable, but result 2 has

How to make modifications to SOLR's tfidf similarity?

≯℡__Kan透↙ submitted on 2021-02-11 13:30:44
Question: I am searching titles, so the mere presence of a word is sufficient and its frequency is not relevant, at least for my use case. For example, the search query is: "board early with my pets". The results I got are: Result 1: Pets 2.3924026; Result 2: Pets Counts against in cabin pet limit 2.0538325; Result 3: Pets Preboarding allowed 1.6092906. Ideally I want result 3 to come out on top, which needs some external work. Result 1 is obvious and acceptable, but result 2 has

Unhandled exception in AppDomain - read past EOF error in Lucene indexing

放肆的年华 submitted on 2021-02-11 13:21:15
Question: We've been having a problem with Lucene indexing for a while. Quite often, when we try to publish content, the indexing throws an error like this: 2020-01-13 22:22:38,068 [P36840/D2/TLucene Merge Thread #0] ERROR Umbraco.Core.UmbracoApplicationBase - Unhandled exception in AppDomain (terminating) Lucene.Net.Index.MergePolicy+MergeException: Exception of type 'Lucene.Net.Index.MergePolicy+MergeException' was thrown. ---> System.IO.IOException: read past EOF at Lucene.Net

find nodes with a specific child association

僤鯓⒐⒋嵵緔 submitted on 2021-02-11 07:47:30
Question: I am looking for a query (lucene, fts-alfresco or ...) that returns all the documents that have a specific child association (that is not null). Some context: Documents of type abc:document have a child association abc:linkedDocument. Not all documents have another document linked to them: some have none, some have one or several. I need a fast and easy way to get an overview of all the documents that have at least one document linked to them. Currently I have a webscript that does what I

How to group results of a Lucene query, count the hits by group and highlight the documents in some selected group?

限于喜欢 submitted on 2021-02-10 14:21:53
Question: I have different types of documents, each of which may have multiple authors. Upon searching, I would like results to be grouped by author, so that I can count the number of documents of each type by each author, and I would like to use the highlighter to highlight the documents belonging to a selected author. How should I index the documents and search them to achieve this? In particular, how do I perform grouping when a document has multiple authors and the documents are of different types? Source: