Fuzzy

The Elasticsearch database

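断了今生、忘了曾经 submitted on 2019-11-28 22:14:33
1. What is Elasticsearch?
1.1 Concepts and features
(1) Like MongoDB, Redis, and Memcache, Elasticsearch is a non-relational database. It is a near-real-time search platform: there is only a slight delay between indexing a document and that document becoming searchable. Its enterprise positioning: a scalable, highly available full-text search tool for real-time data analysis that follows the RESTful API standard.
(2) Scalable: it supports one master with multiple replicas and scales out easily; a node joins the current cluster automatically as long as its cluster.name matches and it is on the same network. It is open-source software and also supports many open-source third-party plugins.
(3) Highly available: data is stored in a distributed fashion across the nodes of a cluster, and indices support shards and replication, so even if some nodes go down, data recovery and master/replica failover happen automatically.
(4) RESTful API: data is manipulated in JSON format over an HTTP interface.
(5) The smallest unit of data storage is the document, which is essentially a JSON text.
2. Why use it in a project (search first, analytics second, storage last)
2.1 Search engine
In real-world development, almost every system has a search feature. When the data volume is small, you can search the primary database (MySQL, for example) directly. But once the data grows large enough, say to 1 billion or 10 billion records, the I/O and statistical-analysis performance of a traditional relational database can no longer meet users' needs. So many companies split search out into an independent module implemented with Elasticsearch or a similar tool. Although in-memory caching databases offer very high read/write performance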

How ZooKeeper persistence works

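℡╲_俬逩灬. submitted on 2019-11-28 19:44:38
The moment the transaction log file is rolled over is, in practice, the moment a snapshot file is generated.
Several points deserve particular attention in ZK's data and storage:
Relationship between in-memory data and on-disk data:
  In-memory data is the data that actually serves requests.
  On-disk data serves two purposes: restoring the in-memory data (recovering the state), and synchronizing data between the nodes of a cluster (see also the in-memory proposal cache queue, proposals).
Why does the on-disk data include both snapshots and transaction logs? For reasons of data granularity:
  With snapshots alone, recovery would lose data, because the interval between snapshots is too long; in other words, snapshots are too coarse-grained.
  Every committed transaction is flushed to disk in the transaction log, so the log is fine-grained and recovery can restore state down to the individual transaction.
When snapshots are generated: based on a threshold, with a random factor added.
  Key problem solved: preventing all nodes from dumping a snapshot at the same time. Dumping a snapshot consumes a lot of disk I/O and CPU, so simultaneous dumps on every node would seriously degrade the cluster's ability to serve clients.
  A snapshot is triggered when countLog > snapCount/2 + randRoll, where countLog is the cumulative number of executed transactions, snapCount is the configured threshold, and randRoll is the random factor (range: 0 to snapCount/2).
ZK snapshot files are fuzzy snapshots: not a snapshot exact to a single instant, but a snapshot taken over a period of time.
ZK generates snapshots on an asynchronous thread; threads share the same memory space, which is what makes the snapshot fuzzy.
This requires ZK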
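The snapshot trigger condition above can be illustrated with a small simulation. This is only a sketch in Python of the inequality as the excerpt states it, not ZooKeeper's actual code; the function names are hypothetical, and the assumption that each node draws randRoll once per snapshot cycle is mine.

```python
import random


def draw_rand_roll(snap_count, rng):
    """Draw the random factor, assumed uniform over 0..snap_count/2."""
    return rng.randint(0, snap_count // 2)


def should_snapshot(count_log, snap_count, rand_roll):
    """The condition from the excerpt: countLog > snapCount/2 + randRoll."""
    return count_log > snap_count / 2 + rand_roll


def transactions_until_snapshot(snap_count, rng):
    """Count transactions on one hypothetical node until a snapshot triggers."""
    rand_roll = draw_rand_roll(snap_count, rng)
    count_log = 0
    while True:
        count_log += 1  # one more transaction executed
        if should_snapshot(count_log, snap_count, rand_roll):
            return count_log


if __name__ == "__main__":
    rng = random.Random()
    # Five simulated nodes sharing the same configured threshold.
    for node in range(5):
        print(node, transactions_until_snapshot(1000, rng))
```

Because each node draws its own randRoll, the trigger points spread over the range snapCount/2 to snapCount instead of all nodes dumping at the same transaction count.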

Fuzzy matching deduplication in less than exponential time?

荒凉一梦 submitted on 2019-11-28 04:34:48
I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc). I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once. The former would be a linear time problem (comparing a value against a million other values, calculating some similarity measure each time). The latter is an exponential
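For context, comparing every record against every other record takes n(n-1)/2 comparisons, which is quadratic in the number of records. One common way to cut that down is blocking: group records by a cheap key and run the fuzzy comparison only inside each group. The following is a minimal Python sketch of that idea, not an answer taken from the thread; the blocking key (first three characters of the normalized string) and the 0.85 threshold are arbitrary assumptions, and records whose first characters differ will never be compared.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return " ".join(text.lower().split())


def block_key(text):
    """Hypothetical blocking key: first three characters after normalizing."""
    return normalize(text)[:3]


def find_fuzzy_duplicates(records, threshold=0.85):
    """Return index pairs of records that look like inexact duplicates."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[block_key(record)].append(index)

    pairs = []
    for indices in blocks.values():
        # Pairwise comparison happens only within a block.
        for i, j in combinations(indices, 2):
            ratio = SequenceMatcher(
                None, normalize(records[i]), normalize(records[j])
            ).ratio()
            if ratio >= threshold:
                pairs.append((i, j))
    return sorted(pairs)


if __name__ == "__main__":
    sample = ["12 Main Street", "12 Main Stret", "98 Oak Avenue", "12 main street "]
    print(find_fuzzy_duplicates(sample))
```

The cost then depends on the size of the largest block rather than on the whole table, so the choice of key is the main design decision.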

Fuzzy Date Time Picker Control in C# .NET?

断了今生、忘了曾经 submitted on 2019-11-27 19:32:48
I am implementing a Fuzzy Date control in C# for a WinForms application. The Fuzzy Date should be able to take fuzzy values like: Last June, 2 Hours ago, 2 Months ago, Last week, Yesterday, Last year, and the like. Are there any sample implementations of "Fuzzy" Date Time Pickers? Any ideas for implementing such a control would be appreciated. PS: I am aware of the fuzzy date algorithm spoken about here and here; I am really looking for ideas and inspiration for developing such a control.
Answer (Piotr Czapla): The parsing is quite easy. It can be implemented as a bunch of regexps and some date calculations. The
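The "regexps and some date calculations" approach mentioned in the answer might look roughly like this. It is a sketch in Python rather than C#, it is not the answerer's code, and it covers only a few of the listed phrases; inputs such as "Last June" or "Last year" are deliberately left out and return None.

```python
import calendar
import re
from datetime import datetime, timedelta

# Matches inputs such as "2 hours ago" or "3 months ago".
AGO_PATTERN = re.compile(r"^(\d+)\s+(hour|day|week|month)s?\s+ago$")


def subtract_months(moment, months):
    """Step back whole months, clamping the day to the target month's length."""
    total = moment.year * 12 + (moment.month - 1) - months
    year, month_index = divmod(total, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return moment.replace(year=year, month=month_index + 1,
                          day=min(moment.day, last_day))


def parse_fuzzy_date(text, now):
    """Turn a fuzzy phrase into a datetime relative to now, or None."""
    cleaned = " ".join(text.lower().split())
    if cleaned == "yesterday":
        return now - timedelta(days=1)
    if cleaned == "last week":
        return now - timedelta(weeks=1)
    match = AGO_PATTERN.match(cleaned)
    if match is None:
        return None  # phrase not covered by this sketch
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "month":
        return subtract_months(now, amount)
    return now - timedelta(**{unit + "s": amount})


if __name__ == "__main__":
    now = datetime(2019, 11, 27, 12, 0)
    for phrase in ["2 Hours ago", "2 Months ago", "Last week", "Yesterday"]:
        print(phrase, "->", parse_fuzzy_date(phrase, now))
```

Passing now in explicitly keeps the parser deterministic, which makes the special cases easy to test.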

Fuzzy date algorithm

风格不统一 submitted on 2019-11-27 03:13:24
I'm looking for a fuzzy date algorithm. I just started writing one and realised what a tedious task it is. It quickly degenerated into a lot of horrid code to cope with special cases like the difference between "yesterday", "last week" and "late last month" all of which can (in some cases) refer to the same day but are individually correct based on today's date. I feel sure there must be an open source fuzzy date formatter but I can't find it. Ideally I'd like something using NSDate (OSX/iPhone) and its formatters but that isn't the difficult bit. Does anyone know of a fuzzy date formatter
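To show why the special cases pile up, here is a minimal Python sketch of a formatter that handles "yesterday", "last week", and "late last month". It is not an existing open-source formatter and does not use NSDate; the phrase set, the Monday-based week, and the early/mid/late cut-offs at days 10 and 20 are all assumptions made for illustration.

```python
from datetime import date, timedelta


def fuzzy_format(day, today):
    """Describe day relative to today with a coarse, human-style phrase."""
    days_back = (today - day).days
    if days_back < 0:
        raise ValueError("this sketch only handles past dates")
    if days_back == 0:
        return "today"
    if days_back == 1:
        return "yesterday"

    # Weeks are assumed to start on Monday.
    this_monday = today - timedelta(days=today.weekday())
    last_monday = this_monday - timedelta(days=7)
    if day >= this_monday:
        return "earlier this week"
    if day >= last_monday:
        return "last week"

    months_back = (today.year - day.year) * 12 + (today.month - day.month)
    if months_back == 0:
        return "earlier this month"
    if months_back == 1:
        part = "early" if day.day <= 10 else "mid" if day.day <= 20 else "late"
        return part + " last month"
    return str(months_back) + " months ago"


if __name__ == "__main__":
    today = date(2019, 11, 4)
    for day in [date(2019, 11, 3), date(2019, 10, 31), date(2019, 10, 25)]:
        print(day, "->", fuzzy_format(day, today))
```

The order of the checks decides which phrase wins when several are correct for the same day, which is exactly the ambiguity the question describes.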

Fuzzy matching deduplication in less than exponential time?

痴心易碎 submitted on 2019-11-27 00:31:10
Question: I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc). I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once. The former would be a linear time problem (comparing a value against

A brief analysis of Oracle's resetlogs mechanism

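二次信任 submitted on 2019-11-26 14:10:11
A brief analysis of Oracle's resetlogs mechanism
Most of us are familiar with the command alter database open resetlogs, but have you ever wondered why the resetlogs option is needed, when to use it, how it works, and what it actually does?
As we know, the database performs consistency checks at startup. Oracle performs two checks during the open stage:
1. It checks whether the checkpoint count (checkpoint cnt) in each data file header matches the checkpoint count (checkpoint cnt) in the control file. The purpose is to confirm that the data file is from the same version and was not restored from a backup. If this check passes, the second check runs.
2. It checks whether the start SCN in the data file header matches the stop SCN recorded for that file in the control file. If the two are equal, the data file needs no recovery; otherwise the file must be recovered.
If both checks pass, the database can be opened normally: the data files are locked, and the stop SCN of every data file in the control file is set to infinity.
Under some conditions, opening the database prompts us to open it with the resetlogs option. Why is resetlogs needed, and what is it for? That raises a lot of questions, so let's analyze it in detail.
What resetlogs does
It prevents stale data from entering the database (guaranteeing database consistency), which is why you must take a full backup of the database immediately after opening it with resetlogs.
In the control file, data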

Fuzzy date algorithm

好久不见. submitted on 2019-11-26 12:37:49
Question: I'm looking for a fuzzy date algorithm. I just started writing one and realised what a tedious task it is. It quickly degenerated into a lot of horrid code to cope with special cases like the difference between "yesterday", "last week" and "late last month" all of which can (in some cases) refer to the same day but are individually correct based on today's date. I feel sure there must be an open source fuzzy date formatter but I can't find it. Ideally I'd like something using NSDate

Learning the Elasticsearch DSL syntax

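两盒软妹~` submitted on 2019-11-25 22:55:59
DSL syntax notes
(1) term and terms queries
(2) match queries: match_all matches every document; multi_match searches several fields; match_phrase matches a phrase
(3) range queries
(4) wildcard queries: allow the wildcards * and ?; * stands for zero or more characters, ? for any single character
(5) fuzzy queries: value is the search keyword; boost is the query weight, default 1.0
(6) highlight: fields
(7) bool queries: must, conditions that have to match (AND); should, conditions that may or may not match (OR); must_not, conditions that must not match (NOT)
(8) aggregations: sum, the total; avg, the average; count, the number of values (the value_count aggregation); cardinality, the number of distinct values

Query with GET:
GET /_search { "query": {"term": {"user": "kimchy"}} }
Querying documents
# Sort by age in descending order
POST /pigg/_search { "query": {"match_all": {}}, "sort": [ {"age": {"order": "desc"}} ] }
# Fetch the first 2 documents; from starts at 0
POST /pigg/_search { "query": {"match_all": {}}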
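Several of the query types listed above can be written out as request bodies. The sketch below builds them as plain Python dictionaries and prints the JSON; it does not call any Elasticsearch client, the helper names are hypothetical, and the field names user and age are simply reused from the excerpt's examples.

```python
import json


def fuzzy_query(field, value, boost=1.0):
    """Body for a fuzzy query: value is the keyword, boost the query weight."""
    return {"query": {"fuzzy": {field: {"value": value, "boost": boost}}}}


def bool_query(must=None, should=None, must_not=None):
    """Body for a bool query; empty clause lists are left out."""
    clauses = {"must": must, "should": should, "must_not": must_not}
    return {"query": {"bool": {name: items
                               for name, items in clauses.items() if items}}}


def paged_match_all(start, size):
    """Body for match_all with paging; from is zero-based."""
    return {"query": {"match_all": {}}, "from": start, "size": size}


if __name__ == "__main__":
    bodies = [
        fuzzy_query("user", "kimchy"),
        bool_query(
            must=[{"term": {"user": "kimchy"}}],
            must_not=[{"range": {"age": {"lt": 18}}}],
        ),
        paged_match_all(0, 2),
    ]
    for body in bodies:
        print(json.dumps(body, indent=2))
```

Each printed body could be sent with a request such as POST /pigg/_search, as in the examples above.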