lucene

PDFBox adding white spaces within words

我的梦境 submitted on 2020-01-10 23:37:40
Question: When I try to extract text from my PDF files, it seems to insert white spaces between several words randomly. I am using pdfbox-app-1.6.0.jar (latest version) on the following sample file in the Downloads section of this page: http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training I've tried with several other PDF files and it seems to be doing the same on several pages. I do the following: java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/ped training pdf.pdf on

Java Lucene NGramTokenizer

人盡茶涼 submitted on 2020-01-10 22:43:40
Question: I am trying to tokenize strings into ngrams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual ngrams that were tokenized. In fact, I only see two methods in the NGramTokenizer class that return String objects. Here is the code that I have: Reader reader = new StringReader("This is a test string"); NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3); Where are the ngrams that were tokenized? How can I get the output in Strings
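To make concrete what the question is after, here is a minimal plain-Java sketch that enumerates the character n-grams of a string for a minimum and maximum gram size. It is not Lucene's implementation: `NGramSketch` and `charNGrams` are hypothetical names, and the order in which a real tokenizer emits its grams may differ. In Lucene itself, tokens are normally read by iterating the token stream with `incrementToken()` and reading a term attribute, rather than through methods that return Strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: plain-Java character n-grams.
final class NGramSketch {

    /** Returns every character n-gram of text with minGram <= n <= maxGram, shortest grams first. */
    static List<String> charNGrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // Slide a window of width n across the text.
            for (int start = 0; start + n <= text.length(); start++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Same input and gram sizes as in the question.
        for (String gram : charNGrams("This is a test string", 1, 3)) {
            System.out.println("[" + gram + "]");
        }
    }
}
```

Spaces are treated like any other character here, so grams such as "s i" appear in the output.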

Is there a fast, accurate Highlighter for Lucene?

自作多情 submitted on 2020-01-10 16:32:17
Question: I've been using the (Java) Highlighter for Lucene (in the Sandbox package) for some time. However, this isn't really very accurate when it comes to matching the correct terms in search results. It works well for simple queries; for example, searching for two separate words will highlight both code fragments in the results. However, it doesn't work well with more complicated queries. In the simplest case, phrase queries such as "Stack Overflow" will match all occurrences of Stack or Overflow in

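Using Solr

跟風遠走 submitted on 2020-01-10 02:46:10
Definition: Solr is a standalone enterprise search server that exposes a web-service-like API. Users can submit XML files in a prescribed format to the search server over HTTP to build the index, and can issue queries with HTTP GET, with results returned as XML.
Features: Solr is a full-text search server built on Lucene and developed in Java 5. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and provides a complete administration interface, making it an excellent full-text search engine.
How it works: Documents are added to a search collection as XML over HTTP. Queries against the collection are also made over HTTP and return an XML/JSON response. Its main features include efficient, flexible caching; vertical search; highlighted search results; index replication for higher availability; a powerful Data Schema for defining fields, types, and text analysis settings; and a web-based administration interface.
Full-Text Search Engine Solr series: basic principles of full-text search http://www.importnew.com/12707.html
Full-text search can be summarized as two processes: 1. index creation (indexing) and 2. searching the index (search). Solr/Lucene uses an inverted index. An inverted index is a mapping from keywords to documents; an index that stores this mapping information is called an inverted index.
Index creation: 1) Hand the original document to the tokenizer component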
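The two processes described above, index creation and index search, can be sketched in a few lines of plain Java. This is an illustrative toy, not Solr's or Lucene's code: `InvertedIndexSketch`, `add`, and `search` are hypothetical names, and splitting on non-word characters stands in for a real tokenizer component.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy index: maps each keyword to the IDs of the documents containing it.
final class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Index creation: tokenize the text and record docId under every token. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Index search: returns the IDs of documents that contain all of the given keywords. */
    SortedSet<Integer> search(String... keywords) {
        SortedSet<Integer> result = null;
        for (String keyword : keywords) {
            SortedSet<Integer> docs =
                postings.getOrDefault(keyword.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs); // copy so the index itself is never modified
            } else {
                result.retainAll(docs);       // intersect with the next keyword's documents
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a search library");
        index.add(2, "Solr is built on Lucene");
        System.out.println(index.search("lucene"));
        System.out.println(index.search("lucene", "solr"));
    }
}
```

The lookup goes from keyword to documents, which is exactly the direction of the mapping that the definition of an inverted index describes.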

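Full-text search with Lucene: the inverted index

天涯浪子 submitted on 2020-01-09 23:50:09
An index can be compared to a book's table of contents: through it we can quickly locate the content we are looking for. Likewise, a database stores massive amounts of data, and by creating indexes we can query specific content fairly quickly.
Typically, in traditional relational databases (RDBMS) such as MySQL and Oracle, the indexes we create are forward indexes: when storing (key, value) data, we index the key in order to find the value. Lucene does the opposite: it builds its index so that the key is found from the value.
Lucene is developed in Java. It stores resources as Document objects, and each document consists of a series of Fields. Indexes are built on the fields and associated with the document's unique ID. Typically, when a new document is stored, the document data is tokenized (that is, the Fields are obtained), and then the inverted index is created. Its storage structure is: the term string + the term's document frequency + the term's frequency information + the term's position information + skip offsets. Put simply, this forms a structure whose parts represent, respectively, the term, the IDs of the documents in which the term appears, the term's frequency in each document, and its positions within the document. When we search for a term, we first find the IDs of all documents in which it appears, and then use those IDs to fetch the documents' contents.
Of course, the terms in Lucene's dictionary are sorted alphabetically, which makes compressed storage possible: suppose we have term, termagancy, termagant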
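A minimal sketch of the structure described above (term, document IDs, per-document frequency, and positions), written in plain Java. The class and method names are hypothetical, skip offsets and compression are left out, and the sketch assumes each document is added exactly once with a single `add` call; it does not reflect Lucene's actual on-disk format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical toy model of a positional inverted index.
final class PositionalIndexSketch {

    /** One posting: a document ID plus the positions at which the term occurs in that document. */
    static final class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();

        Posting(int docId) {
            this.docId = docId;
        }

        /** Term frequency within this document. */
        int frequency() {
            return positions.size();
        }
    }

    // A TreeMap keeps the terms in sorted order, like the sorted dictionary in the text.
    private final SortedMap<String, List<Posting>> dictionary = new TreeMap<>();

    /** Tokenizes on whitespace and records the position of each token (assumes one call per document). */
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue;
            }
            List<Posting> list = dictionary.computeIfAbsent(tokens[pos], k -> new ArrayList<>());
            // Start a new posting the first time this term is seen in this document.
            if (list.isEmpty() || list.get(list.size() - 1).docId != docId) {
                list.add(new Posting(docId));
            }
            list.get(list.size() - 1).positions.add(pos);
        }
    }

    /** Postings for a term, or an empty list if the term is not in the dictionary. */
    List<Posting> postings(String term) {
        return dictionary.getOrDefault(term.toLowerCase(Locale.ROOT), List.of());
    }

    /** Document frequency: the number of documents that contain the term. */
    int documentFrequency(String term) {
        return postings(term).size();
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.add(1, "to be or not to be");
        index.add(2, "to do");
        for (Posting p : index.postings("to")) {
            System.out.println("doc " + p.docId + " freq " + p.frequency() + " positions " + p.positions);
        }
    }
}
```

A search for a term returns its postings, and the document IDs in those postings are then used to fetch the stored documents.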

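Learning Elasticsearch (1): Introduction

只愿长相守 submitted on 2020-01-09 14:27:55
[Introduction] Elasticsearch (ES) is a Lucene-based, real-time, distributed, open-source full-text search and analytics engine. It is stable, reliable, and fast, scales well horizontally, and is designed specifically for distributed environments. Elasticsearch is commonly used in single-page application (SPA) projects, where the application, like Google or Baidu, provides a search box for entering keywords and then returns a list of search results. Elasticsearch is written in Java, is released under Elastic's open-source license, and is used by large companies and organizations around the world. It is accessed through a RESTful web service interface and stores data as JSON documents.
[Advantages of Elasticsearch]
Cross-platform: Elasticsearch is written in Java, so it can run on any platform with a Java runtime.
Near real-time: a filesystem cache sits between Elasticsearch and the disk. The in-memory buffer produces a new segment, which is written to the filesystem cache, at which point Lucene can search the new segment. By default, Elasticsearch performs this write to the filesystem cache at a 1 s interval, which makes search effectively real-time.
Sharding for better distribution: a single index is split into multiple shards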
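The near-real-time behaviour described above can be modelled with a small plain-Java toy: documents go into an in-memory buffer and only become searchable once a refresh turns the buffer into a segment. This is a conceptual sketch only; `NearRealTimeSketch` and its methods are hypothetical names and not Elasticsearch or Lucene APIs, and the periodic 1 s trigger is replaced by an explicit `refresh()` call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model: buffered documents are invisible to search until refresh() is called.
final class NearRealTimeSketch {
    private final List<String> buffer = new ArrayList<>();         // in-memory buffer, not searchable
    private final List<List<String>> segments = new ArrayList<>(); // refreshed segments, searchable

    /** Accepts a document; it is only buffered at this point. */
    void index(String doc) {
        buffer.add(doc);
    }

    /** Turns the current buffer into a new searchable segment (done periodically in a real system). */
    void refresh() {
        if (!buffer.isEmpty()) {
            segments.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    /** Searches refreshed segments only, using a simple substring match. */
    List<String> search(String keyword) {
        List<String> hits = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc.contains(keyword)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NearRealTimeSketch es = new NearRealTimeSketch();
        es.index("trying out Elastic Search");
        System.out.println(es.search("Elastic")); // nothing visible before the refresh
        es.refresh();
        System.out.println(es.search("Elastic")); // visible after the refresh
    }
}
```

The gap between `index` and `refresh` is the reason the text calls search "near" real-time rather than real-time.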

elasticsearch - Return the tokens of a field

烈酒焚心 submitted on 2020-01-09 12:52:12
Question: How can I have the tokens of a particular field returned in the result? For example, a GET request curl -XGET 'http://localhost:9200/twitter/tweet/1' returns { "_index" : "twitter", "_type" : "tweet", "_id" : "1", "_source" : { "user" : "kimchy", "postDate" : "2009-11-15T14:12:12", "message" : "trying out Elastic Search" } } I would like to have the tokens of the '_source.message' field included in the result. Answer 1: There is also another way to do it using the following script_fields script: curl

Lucene in Android

纵然是瞬间 submitted on 2020-01-09 10:29:45
Question: I'm new to Android and Lucene. Can I use Lucene for search in an Android list view? I have tried importing the package 2.3.2 and also used the jar files in the library. However, there is an error in SearchFiles.java. The error is: The type java.rmi.Remote cannot be resolved. It is indirectly referenced from .class files. There is a possibility that this file doesn't exist for Android. Is this the problem? Answer 1: You may want to use the native Full Text Search feature called FTS3 in SQLite instead, which