lucene

Efficient substring search in a large text file containing 100 million strings (no duplicates)

泄露秘密 submitted on 2020-01-22 12:53:06
Question: I have a large text file (1.5 GB) containing 100 million strings (no duplicates), arranged one per line. I want to build a web application in Java so that when a user enters a keyword (substring), they get the count of all strings in the file that contain that keyword. I already know one technique, Lucene. Is there any other way to do this? I want the result within 3-4 seconds. My system has 4 GB RAM and a dual-core configuration... need to do this in
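To see why an index (such as Lucene's n-gram or wildcard support) is needed at this scale, consider the brute-force baseline: scan every line and test `contains`. A minimal sketch (class and method names are my own, not from the question):

```java
import java.util.List;

public class SubstringCounter {
    // Brute-force baseline: count lines containing the keyword.
    // On 100 million lines this is O(total text size) per query,
    // which is why an n-gram index is needed to hit 3-4 s latency.
    public static long countContaining(List<String> lines, String keyword) {
        return lines.stream().filter(l -> l.contains(keyword)).count();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("valve seat", "ball valve", "pump");
        System.out.println(countContaining(lines, "valve")); // prints 2
    }
}
```

In practice the lines would be streamed from the file rather than held in a list; the point is that every query touches all 1.5 GB, so an inverted index over character n-grams trades index-build time for fast lookups.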

Elasticsearch server discovery configuration

可紊 submitted on 2020-01-22 06:45:24
Question: I've installed an Elasticsearch server, which I'm running with: $ ./elasticsearch -f {0.18.2}[11698]: initializing ... loaded [], sites [] {0.18.2}[11698]: initialized {0.18.2}[11698]: starting ... bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.106:9300]} new_master [Stingray][ocw4qPdmSfWuD9pUxHoN1Q][inet[/192.168.1.106:9300]], reason: zen-disco-join (elected_as_master) elasticsearch/ocw4qPdmSfWuD9pUxHoN1Q recovered [0] indices into cluster_state bound_address {inet[
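In the 0.x series shown in that log, node discovery was configured through zen discovery settings in elasticsearch.yml. A sketch of the usual approach (host addresses are placeholders, not from the question):

```yaml
# Pin the cluster name so nodes only join the intended cluster,
# and switch from the default multicast discovery to explicit unicast hosts.
cluster.name: my-cluster
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["192.168.1.106", "192.168.1.107"]
```

With multicast disabled, a node will only contact the listed hosts when joining, which makes discovery behavior predictable across subnets.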

Solr Query Syntax

♀尐吖头ヾ submitted on 2020-01-21 03:31:22
Question: I just got started looking at using Solr as my search web service. I don't know whether Solr supports these query types: starts-with, exact match, contains, doesn't contain, in a range. Could anyone guide me on how to implement those features in Solr? Cheers, Samnang Answer 1: Solr is capable of all those things, but adequately explaining how to do each of them would turn an answer into a mini-manual for Solr. I'd suggest you read the actual manual and the tutorials linked from the Solr homepage. In short, though:
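For orientation, the standard Lucene/Solr query syntax covers each of the asked-for query types roughly as follows (field names are placeholders; exact behavior depends on the field's analyzer):

```text
title:valu*              starts-with (trailing wildcard)
title:"exact value"      exact phrase; true exact match needs a non-tokenized string field
title:*term*             contains (leading wildcards are allowed but expensive)
-title:term              doesn't contain (negation; often combined with *:* as the base query)
price:[10 TO 100]        in a range (inclusive; use {10 TO 100} for exclusive bounds)
```

Whether "exact match" really means exact depends on the schema: a tokenized text field will match analyzed terms, while a string field compares the whole value.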

Using a Combination of Wildcards and Stemming

时光怂恿深爱的人放手 submitted on 2020-01-21 01:50:07
Question: I'm using a Snowball analyzer to stem the titles of multiple documents. Everything works well, but there are some quirks. Example: a search for "valv", "valve", or "valves" returns the same number of results. This makes sense, since the Snowball analyzer reduces everything down to "valv". I run into problems when using a wildcard: a search for "valve*" or "valves*" does not return any results, while searching for "valv*" works as expected. I understand why this is happening, but I don't know how to
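The quirk arises because wildcard queries are not analyzed, so the prefix "valve" never gets stemmed to match the indexed term "valv". One common workaround is to stem the prefix yourself before re-appending the wildcard. A deliberately naive sketch (real code would run the same analyzer used at index time over the prefix, not this toy suffix-stripper):

```java
public class WildcardStemmer {
    // Toy illustration only: strip a couple of English suffixes the way a
    // Snowball stemmer happens to for this example, then the caller
    // re-appends '*'. Not a real stemmer.
    static String stemPrefix(String prefix) {
        if (prefix.endsWith("es")) return prefix.substring(0, prefix.length() - 2);
        if (prefix.endsWith("e"))  return prefix.substring(0, prefix.length() - 1);
        if (prefix.endsWith("s"))  return prefix.substring(0, prefix.length() - 1);
        return prefix;
    }

    public static void main(String[] args) {
        System.out.println(stemPrefix("valves") + "*"); // prints valv*
        System.out.println(stemPrefix("valve") + "*");  // prints valv*
    }
}
```

The other common fix is to index the titles twice, once stemmed and once unstemmed, and to run wildcard queries only against the unstemmed field.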

Spelling correction for data normalization in Java

帅比萌擦擦* submitted on 2020-01-20 04:08:21
Question: I am looking for a Java library to do some initial spell checking / data normalization on user-generated text content; imagine the interests entered in a Facebook profile. This text will be tokenized at some point (before or after spell correction, whichever works better), and some of it will be used as keys to search for (exact match). It would be nice to cut down on misspellings and the like to produce more matches. It would be even better if the correction performed well on tokens longer than just
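The core of most such libraries is edit distance against a dictionary of known-good tokens. A self-contained sketch of that idea (names are mine; real libraries add frequency weighting and faster candidate generation):

```java
import java.util.Comparator;
import java.util.List;

public class SpellNormalizer {
    // Classic dynamic-programming Levenshtein edit distance.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Normalize a token to the nearest dictionary word, if close enough;
    // otherwise leave it unchanged.
    static String correct(String token, List<String> dictionary, int maxDistance) {
        return dictionary.stream()
                .min(Comparator.comparingInt(w -> distance(token, w)))
                .filter(w -> distance(token, w) <= maxDistance)
                .orElse(token);
    }

    public static void main(String[] args) {
        List<String> dict = List.of("guitar", "piano", "photography");
        System.out.println(correct("gutiar", dict, 2)); // prints guitar
    }
}
```

Longer tokens actually help this approach: the same edit budget is a smaller fraction of the word, so false corrections become less likely.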

Plain-talk Elasticsearch 67: why not to casually tune the JVM and thread pools & best practices for dividing memory between the JVM and the server

♀尐吖头ヾ submitted on 2020-01-17 09:36:52
Contents: why not to casually tune the JVM and thread pools (JVM GC; thread pool); best practices for dividing memory between the JVM and the server (JVM heap allocation; give ES less than half of the machine's memory; don't give the JVM more than 32 GB of heap; within 32 GB, how large should the heap actually be?; how should memory be allocated on a huge machine with 1 TB of RAM?; swapping). Why not to casually tune the JVM and thread pools: many ES settings tempt people into tuning them, perhaps because we are all a little too enamored of performance optimization and believe that tweaking a few settings can dramatically boost performance, as if tuning were magic. In reality, in 99.99% of cases most ES parameters should simply be left at their defaults. These parameters are frequently misused and mis-tuned, which leads to serious stability problems and sharp performance drops. JVM GC: the JVM uses a garbage collector to free unused memory; never change the default garbage collection behavior. ES's default collector is CMS. CMS is a concurrent collector that runs alongside the application's worker threads to minimize pause times during collection. It still has two stop-the-world phases, and it also has some problems when collecting very large heaps. Despite these drawbacks, CMS remains the best garbage collector for software that must serve low-latency requests, so the official recommendation is to use CMS. There is a newer collector called G1, which can offer shorter collection pauses than CMS
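The heap-sizing advice in the contents above is usually expressed as a single environment setting. A sketch for a hypothetical 64 GB machine (the value is illustrative; the variable name matches the 1.x/2.x-era startup scripts):

```shell
# Give ES just under half the RAM, and stay below ~32 GB so the JVM can
# keep using compressed object pointers; the remaining memory is left to
# the OS filesystem cache, which Lucene relies on heavily.
export ES_HEAP_SIZE=31g

# Pair this with locking the heap in RAM so it is never swapped out
# (elasticsearch.yml):  bootstrap.mlockall: true
```

Both halves of the rule serve the same goal: the heap stays small enough for compressed oops, and the other half of the machine's memory does useful work caching segment files instead of sitting inside an oversized heap.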

Hit highlighting in Lucene

早过忘川 submitted on 2020-01-17 01:11:36
Question: I am searching for strings indexed in Lucene as documents, and I give it a long string to match. Example: the search string is "iamrohitbanga is a stackoverflow user". Documents: document 1, field value: rohit; document 2, field value: banga. I use fuzzy matching to find the search strings in the documents, and the two documents match. I want to retrieve the position at which the string "rohit" occurs in the search string. How do I do this using the Lucene Java API? Also note that the fuzzy matching would lead to
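The underlying task, locating the character offset of a matched token inside the query string, can be illustrated without Lucene. A crude stand-in (Lucene itself surfaces this via term positions/offsets and its highlighter contrib; this sketch does exact matching only, whereas a fuzzy version would compare tokens with an edit-distance threshold):

```java
public class TokenLocator {
    // Scan whitespace-separated tokens and return the character offset of
    // the first token equal to the term, or -1 if none matches.
    static int offsetOf(String text, String term) {
        int pos = 0;
        for (String tok : text.split("\\s+")) {
            int start = text.indexOf(tok, pos);
            if (tok.equalsIgnoreCase(term)) return start;
            pos = start + tok.length();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(offsetOf("iamrohitbanga is a stackoverflow user", "is")); // prints 14
    }
}
```

Note that "rohit" would not be found here, because it occurs only as a substring of the token "iamrohitbanga"; that is exactly the gap fuzzy or substring matching has to bridge, and why the match position must come from the matcher rather than a plain string search.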

Solr multi-phrase query not able to generate a result even when the token was present

依然范特西╮ submitted on 2020-01-16 21:54:45
Question: Hi, I'm stuck on an issue. I have a field splited_data of field type text_split (in my schema.xml): <field name="splited_data" type="text_split" indexed="true" stored="false" /> <fieldType name="text_split" class="solr.TextField" autoGeneratePhraseQueries="true" omitNorms="true"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory" /> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr
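Since the snippet above is cut off, here is what a well-formed fieldType of this shape typically looks like. The filter choices are illustrative, not the asker's exact schema; a mismatch between the index-time and query-time analyzer chains is a classic cause of phrase queries missing tokens that are demonstrably in the index:

```xml
<fieldType name="text_split" class="solr.TextField" autoGeneratePhraseQueries="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Solr admin UI shows both chains side by side for a given input, which is the quickest way to spot where the index-time and query-time token streams diverge.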

Redis and ES

爷，独闯天下 submitted on 2020-01-16 20:52:46
Redis clusters. Interview question: what is the difference between a cluster and a distributed system? (understand) (1) Similarity: both address high-concurrency and big-data problems. (2) Difference: a cluster is multiple servers implementing the same function; a distributed system is different servers implementing different functions. 1 What is a cluster? (understand) Multiple servers grouped together to handle the same business; cluster vs. distributed concepts (understand). 2 Why do we need clusters, and what are their benefits? (understand) They solve high-concurrency and big-data problems. 3 Characteristics of a cluster (understand): (1) scalability: servers can be added dynamically; (2) high availability: fault tolerance (error recovery); (3) load balancing: requests are distributed across different servers. 4 If you want to build a cluster, how should you do it? Option 1: master-slave replication. Is that good enough? Pros: replication itself works. Cons: it cannot do failover or recovery on its own (keepalive) and requires manual restarts, and adding servers is cumbersome. Option 2: sentinel mode. Redis 2.8 ships a sentinel tool that provides automated monitoring and failover. Pros: automatic master-slave switchover. Cons: Redis struggles to support online scaling, and expanding online once the cluster reaches capacity becomes very complicated. Option 3: Redis-Cluster (set it up). Pros: solves the distributed-storage and capacity problems. Redis
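For the Redis-Cluster option above, each node is started from a config file with cluster mode enabled. A minimal sketch (port and file names are illustrative; the directives are standard redis.conf cluster settings):

```conf
# Minimal redis.conf for one node of a Redis Cluster.
port 7000
cluster-enabled yes
cluster-config-file nodes-7000.conf   # written and maintained by the node itself
cluster-node-timeout 5000             # ms before a node is considered failing
appendonly yes                        # durable AOF persistence
```

Several such nodes (commonly six: three masters plus one replica each) are then joined into a cluster, after which key space is sharded across the masters in 16384 hash slots, which is what gives Redis-Cluster its distributed-storage and capacity benefits.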

Lucene.NET TokenStream.Next method disappeared

若如初见. submitted on 2020-01-16 20:34:06
Question: I have to update a project that uses Lucene.NET. It's the first time I've encountered this library, and I need to update the references to a new version. I did that via NuGet, but now I also have to replace some methods that disappeared in the new versions. public abstract class AnalyzerView { public abstract string Name { get; } public virtual string GetView(TokenStream tokenStream, out int numberOfTokens) { StringBuilder sb = new StringBuilder(); Token token = tokenStream.Next();
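For context on the missing method: in later Lucene.NET versions the Token-returning Next() was replaced by a bool IncrementToken() loop that reads typed attributes from the stream. A hedged sketch of the equivalent loop, continuing the question's C# code (attribute names follow the 3.x API, where the term text lives on ITermAttribute; 4.x renames it to ICharTermAttribute, so adapt to the exact version in use):

```csharp
// Replaces: Token token = tokenStream.Next(); while (token != null) { ... }
var termAttr = tokenStream.AddAttribute<ITermAttribute>();
int numberOfTokens = 0;
tokenStream.Reset();
while (tokenStream.IncrementToken())
{
    sb.Append(termAttr.Term).Append(' ');  // current token's text
    numberOfTokens++;
}
```

The key conceptual shift is that the stream no longer allocates a Token per call; instead one attribute object is registered up front and its contents are overwritten on each IncrementToken().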