lucene | 易学教程

2. Elasticsearch Quick Start: Elasticsearch Core Concepts (NRT, Index, Shard, Replica, etc.)

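1. The history of Lucene and Elasticsearch

Lucene is an advanced and very powerful search library. Developing directly against Lucene is complex: its API is complicated (even simple features take a lot of Java code), and it requires a deep understanding of the underlying principles (the various index structures).

Elasticsearch is built on Lucene and hides that complexity behind simple, easy-to-use RESTful and Java APIs (APIs for other languages are available too). It is (1) a distributed document store, (2) a distributed search and analytics engine, and (3) distributed, with support for PB-scale data. It works out of the box with good default parameters, needs no extra setup, and is fully open source.

There is a legend about Elasticsearch. A programmer lost his job and accompanied his wife to London, England, where she was taking a cooking course. While unemployed, he wanted to write a recipe search engine for her. He found Lucene far too complex, so he developed an open-source project that wrapped Lucene, called Compass. Later he found a job working on distributed, high-performance projects, felt that Compass was not enough, and wrote Elasticsearch, turning Lucene into a distributed system.

2. Core concepts of Elasticsearch

(1) Near Realtime (NRT): this means two things. There is a small delay (about 1 second) between writing data and that data becoming searchable, and searches and analytics run on ES return within seconds.
(2) Cluster: a cluster contains multiple nodes; which cluster a node belongs to is determined by a configuration setting (the cluster name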
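
The first meaning of NRT above (a write becomes searchable only after a short delay) can be illustrated with a toy model. This is a minimal sketch and not how Elasticsearch is implemented: the class `ToyNearRealTimeIndex` and its methods are hypothetical names, and the explicit `refresh()` call stands in for the periodic step that makes recent writes visible after roughly one second.

```python
class ToyNearRealTimeIndex:
    """Toy model: written documents become searchable only after a refresh."""

    def __init__(self):
        self._buffer = []      # documents written but not yet searchable
        self._searchable = []  # documents visible to search

    def index(self, doc):
        # A write only lands in the buffer; search cannot see it yet.
        self._buffer.append(doc)

    def refresh(self):
        # Stand-in for the periodic refresh that publishes buffered writes.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        # Return searchable documents containing the word (case-insensitive).
        return [d for d in self._searchable if word.lower() in d.lower().split()]


idx = ToyNearRealTimeIndex()
idx.index("tomato soup recipe")
before_refresh = idx.search("soup")  # nothing visible yet
idx.refresh()
after_refresh = idx.search("soup")   # the document is now searchable
```

The gap between `index()` and `refresh()` plays the role of the small write-to-search delay described in the text.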

Lucene Tutorial in Chinese

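What is Lucene? Lucene is a Java-based, open-source full-text indexing toolkit. "Open source" is easy to understand: the source code is open. But what is a full-text index? A full-text index is one kind of index; readers unfamiliar with indexes can read this article (What is an index). An index is data with a special data structure that, in certain situations, lets you query data quickly. In short, Lucene is a jar containing many utility classes that help you create a kind of index called a full-text index and then use that index for fast retrieval.

How does Lucene work? The question is imprecise; it should be: why can Lucene retrieve data so quickly? The definition of Lucene, "a Java-based, open-source full-text indexing toolkit", already gives the answer. With the tools Lucene provides, you index the source data you want to search (documents, web pages, databases, and so on) and produce full-text index data. At search time you search the index data rather than the source data, which is why retrieval becomes so much more efficient (the index data has a structure that favors retrieval).

What is a full-text index? A full-text index, also called an inverted index, has a counterpart called the forward index. Forward index: textbooks, magazines, and newspapers all have a table of contents. If we want to read a particular article, the table of contents lets us quickly find the page it is on, instead of clumsily flipping through the book page by page. Here, the table of contents is an index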
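
The difference between a forward index and an inverted index can be sketched in a few lines. This is a minimal illustration in plain Python, not Lucene's actual data structures; the function names and the three sample documents are made up for the example.

```python
from collections import defaultdict


def build_forward_index(docs):
    """Forward index: document id -> list of terms in that document."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def build_inverted_index(forward_index):
    """Inverted index: term -> sorted ids of the documents containing it."""
    inverted = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}


def search(inverted_index, term):
    """Look the term up directly instead of scanning every document."""
    return inverted_index.get(term.lower(), [])


# Hypothetical sample data for illustration only.
docs = {
    1: "Lucene is a search library",
    2: "Elasticsearch is built on Lucene",
    3: "A cookbook of search recipes",
}
forward = build_forward_index(docs)
inverted = build_inverted_index(forward)
hits = search(inverted, "Lucene")
```

A search becomes a single dictionary lookup on the term, which is why searching the index is faster than scanning the source data.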

Configuring ES on Linux

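1. ES is built on Lucene, and Lucene is an Apache Software Foundation project written in Java, so both Lucene and ES need a JDK environment.
2. How to set up a newly provisioned Linux server
2.1 Change the hostname
2.2 Update the hostname-to-IP mapping
2.3 Either turn off the firewall or open the required ports in the firewall
2.4 Disable the firewall at boot
2.5 Reboot
3. Configure the JDK
4. Configure ES
4.1 Upload the ES archive to the Linux server with the xftp tool
4.2 Extract the ES archive
4.3 Increase the memory and disk limits for software deployed on Linux
4.4 Maximum number of threads
4.5 Configure the per-user maximum number of threads
4.6 Make the settings permanent
4.7 Go to the ES bin directory and start ES
4.8 Verify that the configuration succeeded
4.9 Configure a Chinese analyzer (Sogou, IK)
4.9.1 Upload the IK analyzer archive to the Linux server with xftp
4.9.2 Integrate the IK analyzer into ES
4.9.3 Create an IK directory inside the ES plugins directory
4.9.4 Extract the IK analyzer
4.9.5 Start ES

Problems encountered while configuring ES:
1. can not run elasticsearch as root
2. CONFIG_SECCOMP not compiled into kernel, CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER are needed
3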

Lucene (Full-Text Search)

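Contents: data classification; querying unstructured data; full-text search with Lucene; using analyzers; querying the index; maintaining the index.

Data classification
Structured data: data with a fixed format or bounded length (for example databases, metadata).
Unstructured data: data of variable length or with no fixed format (for example email, Word documents).

Querying unstructured data
Sequential scan: scan from beginning to end to find the matching files.
Full-text search: build an index first, then search the index.

Full-text search with Lucene
Indexing and search flow
Creating the index
Obtain the original documents.
Create document objects: a file on disk can be treated as one Document, and a Document contains Fields (file_name for the file name, file_path for the file path, file_size for the file size, file_content for the file content).
Analyze the documents: once the original content has been turned into a Document containing Fields, the content of the fields still has to be analyzed. Analysis extracts words from the original document, converts letters to lowercase, removes punctuation, and removes stop words, producing the final tokens. A token can be thought of as a single word.
Create the index (inverted index structure): index the tokens obtained from analyzing all documents. The purpose of indexing is search: in the end, only the indexed tokens are searched, and they lead back to the Document.
Index creation code

Using analyzers
Analyzers bundled with Lucene
StandardAnalyzer: splits Chinese text into single characters
SmartChineseAnalyzer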
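
The analysis and indexing steps described above can be sketched without Lucene itself. This is a minimal illustration in plain Python, not Lucene's analyzer API: the stop-word list, the helper names, and the two sample documents are assumptions made for the example, and only the `file_name` and `file_content` fields from the text are used.

```python
import re

# Illustrative stop-word list; real analyzers ship their own.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def analyze(text):
    """Lowercase the text, drop punctuation, split into words, remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def make_document(file_name, file_content):
    """A document is modeled as a dict of fields."""
    return {"file_name": file_name, "file_content": file_content}


def index_documents(documents):
    """Build token -> set of document positions from the analyzed content field."""
    index = {}
    for doc_id, doc in enumerate(documents):
        for token in analyze(doc["file_content"]):
            index.setdefault(token, set()).add(doc_id)
    return index


documents = [
    make_document("a.txt", "Lucene is a full-text search library."),
    make_document("b.txt", "The index of Lucene is an inverted index!"),
]
index = index_documents(documents)
tokens = analyze("The Quick, brown fox is HERE!")
```

Only the tokens that survive analysis end up in the index, so a search for a stop word such as "the" finds nothing in this sketch.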

Compile failed when running a lucene example

Question: After fixing errors such as tools.jar and junit.jar not being found (thanks to Stack Overflow), I tried to compile the example given in the book "Lucene in Action". But when I compile, I get the output below. Can you tell me what the error is and how to fix it?
Total time: 0 seconds
E:\LuceneInAction>ant Indexer
Buildfile: E:\LuceneInAction\build.xml
check-environment:
compile:
[javac] E:\LuceneInAction\build.xml:66: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last;

Symfony with Zend Lucene and related models (with foreign keys)

Question: I was developing an application using Symfony 1.4 and Doctrine when I realized a major drawback in my Zend Lucene implementation. I have a model called Publication that is related (via foreign key relations) to a few other models (subjects, genres, languages, authors, etc.), and I fetch their names when adding a new document to the index (following the Jobeet tutorial approach) so that I can search for publications with a given subject, genre, language, author, etc. The problem is if for

Elasticsearch - higher scoring if higher frequency of term

Question: I have 2 documents and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field. Document A has only 1 term in the "tags" field, and it is "Twitter". Document B has 100 terms in the "tags" field, and 3 of them are "Twitter". Elasticsearch gives the higher score to Document A even though Document B has a higher term frequency, because B's score is "diluted" by its larger number of terms. How do I give Document B a higher score, since it has a higher frequency of the
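
The "dilution" described in the question can be reproduced with a simplified scoring function. This is a sketch that assumes a classic TF-IDF style similarity, where the term-frequency factor is the square root of the frequency and the field-length norm is one over the square root of the number of terms; the IDF factor is omitted because it is identical for both documents. The function names are hypothetical, and the similarity actually configured in a given Elasticsearch version may differ.

```python
import math


def classic_field_score(term_freq, field_length):
    """Simplified score: sqrt(term frequency) times the field-length norm."""
    tf = math.sqrt(term_freq)
    length_norm = 1.0 / math.sqrt(field_length)
    return tf * length_norm


def tf_only_score(term_freq):
    """The same score with field-length normalization left out."""
    return math.sqrt(term_freq)


# Document A: 1 tag, which is "Twitter". Document B: 100 tags, 3 are "Twitter".
score_a = classic_field_score(term_freq=1, field_length=1)
score_b = classic_field_score(term_freq=3, field_length=100)

# Without the length norm, the higher frequency wins.
score_a_no_norm = tf_only_score(1)
score_b_no_norm = tf_only_score(3)
```

Under these assumptions Document A scores 1.0 and Document B scores about 0.17, while dropping the length norm reverses the order. Disabling norms for the field is one commonly suggested direction; the mapping documentation for the version in use should confirm the exact option.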

Elasticsearch: shingles with stop words elimination

Question: I am trying to implement an Elasticsearch mapping to optimize phrase search in a large body of text. As per the suggestions in this article, I am using a shingle filter to build multiple unigrams per phrase. Two questions: In the article mentioned, the stop words are filtered out and the shingles account for the resulting gaps by inserting "_" tokens. These tokens should be eliminated from the unigram that is indexed by the engine. The point of this elimination is to be able to respond to phrase
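
The interaction between stop-word removal and shingles can be pictured with a toy emulation. This is a sketch in plain Python, not the behavior of the real shingle token filter: the helper names are hypothetical, the "_" filler follows the question's description, and dropping every shingle that contains a filler is only one possible reading of the "elimination" the question asks for.

```python
def remove_stop_words(tokens, stop_words, filler="_"):
    """Replace each stop word with a filler so its position is preserved."""
    return [filler if t.lower() in stop_words else t for t in tokens]


def shingles(tokens, size=2):
    """Join every run of `size` consecutive tokens into one shingle."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def drop_filler_shingles(shingle_list, filler="_"):
    """Keep only the shingles that contain no filler token."""
    return [s for s in shingle_list if filler not in s.split()]


tokens = "return of jedi knights".split()
with_gaps = remove_stop_words(tokens, {"of"})
bigrams = shingles(with_gaps)
clean = drop_filler_shingles(bigrams)
```

Here `bigrams` contains "return _" and "_ jedi" alongside "jedi knights", which shows how the filler tokens leak into the indexed terms unless they are filtered out afterwards.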
