neo4j

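Integrating TinkerPop with Hadoop+Spark

最后都变了- submitted on 2020-01-26 04:01:39
Preface: An earlier post covered how to configure TinkerPop with Neo4j, including HA operation. One prominent problem remains: whether you use Neo4j or the bundled TinkerGraph, you inevitably run into large-data scenarios, that is, the need for distribution. For this, TinkerPop also provides an integration with Hadoop+Spark that removes the single-node limitation. However, because of data-consistency issues in Spark, this approach can neither modify existing data nor add new data; it is suitable only for querying and computation, which is admittedly a major drawback. If anyone has a better solution, feel free to leave a comment below. As before, all operations in this post use TinkerPop Server 3.4.4. Integrating TinkerPop with Hadoop+Spark: The TinkerPop website already describes how to integrate with Hadoop+Spark, but there are two problems. First, all the steps are carried out in the console; there are no steps for the server. Second, wherever SparkGraphComputer is used, master is always in local mode; when YARN is the resource manager, following the official material often fails. There are three main reasons: SparkGraphComputer creates its own SparkContext rather than obtaining configuration through spark-submit. Running Spark on YARN is supported only from TinkerPop 3.2.7/3.3.1 onward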

Neo4j PostingsFormat with name 'BlockTreeOrds' does not exist

邮差的信 submitted on 2020-01-25 20:28:27
Question: I tried to package my project, but when I run the jar file I get an error. Exception in thread "main" java.lang.RuntimeException: Error starting org.neo4j.kernel.impl.factory.CommunityFacadeFactory, D:\f ... Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine@5483163c' failed to initialize. Please see attached cause exception. ... Caused by: java.lang.IllegalArgumentException: An SPI class of type org

How to efficiently find multiple relationship size

拟墨画扇 submitted on 2020-01-25 09:14:04
Question: We have a large graph (over 1 billion edges) that has multiple relationship types between nodes. In order to check the number of nodes that have a single unique relationship between them (i.e. a single relationship of one type between two nodes, which otherwise would not be connected), we are running the following query: MATCH (n)-[:REL_TYPE]-(m) WHERE size((n)-[]-(m))=1 AND id(n)>id(m) RETURN COUNT(DISTINCT n) + COUNT(DISTINCT m) To demonstrate a similar result, the below sample code can run on

How to group or merge virtual relationship created using apoc.create.vRelationship among nodes in neo4j?

无人久伴 submitted on 2020-01-25 08:00:06
Question: There is a set of artists, some of whom form a temporary group and organize an event in a city. Afterwards, different groups organize events in different cities, or in the same city as some other group. I want to query the events that Artist A participates in and then the events held in the same city by Artist B over a series of dates, using the Cypher query below, but I get duplicate virtual relationships between Artist A and Event and also between Event and City. MATCH seriesB = (bArtist:Artist)-[:HAS

Neo4J cypher: collect intermediate node properties (path)

放肆的年华 submitted on 2020-01-25 07:58:09
Question: I have a data-lineage graph in Neo4j with variable-length paths containing intermediate nodes (tables): match p=(s)-[r:airflow_loads_to*]->(t) where s.database_name='hive' and s.schema_name='test' and s.name="source_table" return s.name,collect(nodes(p)),t.name Instead of returning the nodes between s.name and t.name as a path, I want to return an array of the name property of all nodes in the path (in traversal order). I probably have to use collect, but that is not possible on
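One possible way to get the names in traversal order is a list comprehension over `nodes(p)`. This is a sketch, not the accepted answer: it assumes every node on the path carries a `name` property, and the alias `path_names` is introduced here only for illustration.

```cypher
// Sketch: one row per matched path, with node names in traversal order.
// Assumes every node on the path has a `name` property;
// `path_names` is an illustrative alias, not from the question.
MATCH p = (s)-[:airflow_loads_to*]->(t)
WHERE s.database_name = 'hive'
  AND s.schema_name = 'test'
  AND s.name = 'source_table'
RETURN s.name, [n IN nodes(p) | n.name] AS path_names, t.name
```

`nodes(p)` returns the nodes in path order, so the list starts with the source node and ends with the target node; no `collect` is needed because the list is built per path rather than aggregated across rows.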

How to set Neo4j conf in docker?

浪尽此生 submitted on 2020-01-25 07:27:27
Question: I used to run Neo4j separately, and my application interacted with it as required. Every time I did a fresh install of Neo4j, I had to go to /etc/neo4j/neo4j.conf and comment out this one line: dbms.directories.import=/var/lib/neo4j/import by putting a # at the start of it to make things work for me. By default this line wasn't commented out. Anyway, I am moving to Docker now, and I want to know how to change that line in a Docker environment. Here's the neo4j portion of my Docker file. neo4j: container_name:

simple lookup takes several minutes despite using an index

℡╲_俬逩灬. submitted on 2020-01-25 06:46:08
Question: I have a decently sized graph (~600 million nodes, 3.5 billion edges) that I imported into neo4j. The graph is also quite dense (median edge count around 10), though I'm not sure if that affects performance. For one type of node (:Authors) - there are roughly 200 million nodes of this type - I would like to run a query for a specific name, which is stored in the property normalizedName. Here is the (very simple) query: MATCH (a:AUTHOR) WHERE a.normalizedName = "jonathan smith" RETURN a As

How to use SQL-like GROUP BY in Cypher query language, in Neo4j?

不问归期 submitted on 2020-01-25 02:17:04
Question: I want to find the number of all users in a company and the number of its men and women. My query is: start n=node:company(name:"comp") match n<-[:Members_In]-x, n<-[:Members_In]-y where x.Sex='Male' and y.Sex='Female' return n.name as companyName, count(distinct x) as NumOfMale, count(distinct y) as NumOfFemale My query is correct, but I know I shouldn't use n<-[:Members_In]-y in the match clause. How can I get the number of males, the number of females, and the total number of users? Answer 1: Peter
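Cypher has no explicit GROUP BY clause; any non-aggregated expression in RETURN acts as a grouping key. The sketch below illustrates that idea and is not the truncated answer above. It assumes a Neo4j version with labels and uses a hypothetical `:Company` label in place of the legacy index lookup from the question; the aliases are illustrative.

```cypher
// Sketch 1: one row per (company, sex). The non-aggregated columns
// n.name and x.Sex are the implicit grouping keys.
// `:Company` is a hypothetical label standing in for the legacy index lookup.
MATCH (n:Company {name: 'comp'})<-[:Members_In]-(x)
RETURN n.name AS companyName, x.Sex AS sex, count(x) AS members;

// Sketch 2: a single row per company with total, male, and female counts,
// using conditional sums so the pattern is matched only once.
MATCH (n:Company {name: 'comp'})<-[:Members_In]-(x)
RETURN n.name AS companyName,
       count(x) AS total,
       sum(CASE WHEN x.Sex = 'Male' THEN 1 ELSE 0 END) AS numOfMale,
       sum(CASE WHEN x.Sex = 'Female' THEN 1 ELSE 0 END) AS numOfFemale;
```

Both forms match each member once, which avoids the cross product created by matching `x` and `y` separately. If a user can have more than one `Members_In` relationship to the same company, `count(DISTINCT x)` would be needed for the totals.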

Neo4j java: Traversal from multiple start points

北城余情 submitted on 2020-01-25 01:43:26
Question: My task in Neo4j 2.0 embedded is to find the paths from multiple nodes to the root of the tree in which all the nodes are located. Thus, if we assume I have start nodes A, B, and C, I'd like to find the paths A-->...-->root, B-->...-->root, C-->...-->root. For this task, I defined a TraversalDescription which works just fine when applied to each of the start nodes individually. Now I saw that the TraversalDescription's traverse method can take not only one start node but multiple. So I put all my start