bigdata

Inserting a large number of nodes into Neo4j

痴心易碎 submitted on 2019-11-30 20:43:29
Question: I have a table stored in a typical MySQL database and I've built a small parser tool in Java to parse it out and build a Neo4j database. This database will have ~40 million nodes, each with one or more edges (with a possible maximum of 10 edges). The problem comes from the way I have to create certain nodes. There is a user node, a comment node, and a hashtag node. The user nodes and hashtag nodes must each be unique. I'm using code from the following example to ensure uniqueness: public Node …
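The excerpt cuts off at the Java example. As a rough illustration of the same idea in Python rather than the question's Java code, a MERGE plus a uniqueness constraint is one common way to keep user and hashtag nodes unique; the connection URI, credentials, label, and property name below are placeholders, and the constraint syntax varies across Neo4j versions.

# Hypothetical sketch with the official neo4j Python driver (not the question's Java code).
# MERGE matches an existing node or creates it atomically; the constraint lets Neo4j
# enforce and index uniqueness. Constraint syntax shown is for recent Neo4j versions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # One-time schema setup: uniqueness constraint on :User(id).
    session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (u:User) REQUIRE u.id IS UNIQUE")
    # Creates the user only if it does not already exist.
    session.run("MERGE (u:User {id: $id})", id=42)

driver.close()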

Spark job execution time increases exponentially with a very wide dataset and number of columns [duplicate]

末鹿安然 submitted on 2019-11-30 20:26:53
Question: This question already exists: "Spark Fixed Width File Import Large number of columns causing high Execution time" (closed last year). I have created a fixed-width file import parser in Spark and performed a few execution tests on various datasets. It works fine up to 1,000 columns, but as the number of columns and the fixed-width record length increase, Spark job performance degrades rapidly. It takes a very long time to execute on 20k columns with a fixed-width record length of more than 100 thousand characters. What are the …
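The excerpt ends before any answer. One pattern that often helps with very wide fixed-width files (a sketch, not the question's parser) is to build all columns in a single select over the raw line instead of chaining thousands of withColumn calls, which keeps the query plan from ballooning; the file path and column layout below are made up.

# Hypothetical PySpark sketch: slice every fixed-width field in one select()
# rather than many chained withColumn() calls, so the logical plan stays small
# as the column count grows. Layout and path are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fixed-width-demo").getOrCreate()

# Placeholder layout: (column name, 1-based start position, length).
layout = [("col1", 1, 10), ("col2", 11, 5), ("col3", 16, 8)]

raw = spark.read.text("/path/to/fixed_width_file.txt")  # single column named "value"

parsed = raw.select([
    F.substring(F.col("value"), start, length).alias(name)
    for name, start, length in layout
])
parsed.show(truncate=False)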

NumPy: reading a file while filtering lines on the fly

ぐ巨炮叔叔 submitted on 2019-11-30 19:57:20
I have a large array of numbers written in a CSV file and need to load only a slice of that array. Conceptually, I want to call np.genfromtxt() and then row-slice the resulting array, but the file is so large that it may not fit in RAM, and the number of relevant rows might be small, so there is no need to parse every line. MATLAB has the function textscan() that can take a file descriptor and read only a chunk of the file. Is there anything like that in NumPy? For now, I defined the following function, which reads only the lines that satisfy the given condition: def genfromtxt_cond(fname, cond= …
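The excerpt cuts off at the function definition. A common way to implement the idea (a sketch, not the question's original genfromtxt_cond) is to feed np.genfromtxt a generator that yields only the lines passing the predicate, since genfromtxt accepts any iterable of lines.

# Sketch of the filtered-read idea: np.genfromtxt accepts any iterable of lines,
# so a generator can skip irrelevant rows before they are parsed or kept in memory.
import numpy as np

def filtered_lines(fname, cond):
    """Yield only the lines of `fname` for which cond(line_number, line) is True."""
    with open(fname) as fh:
        for i, line in enumerate(fh):
            if cond(i, line):
                yield line

# Example: load only every 100th row of a large CSV (file name is a placeholder).
# data = np.genfromtxt(filtered_lines("big.csv", lambda i, _: i % 100 == 0),
#                      delimiter=",")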

Memory limits in data.table: negative length vectors are not allowed

牧云@^-^@ submitted on 2019-11-30 18:11:42
I have a data.table with several social media users and their followers. The original table has the following format:

X.USERID FOLLOWERS
1081 4053807021,2476584389,4713715543, ...

So each row contains a user ID together with a vector of followers (separated by commas). In total I have 24,000 unique user IDs together with 160,000,000 unique followers. I wish to convert my original table into the following format:

X.USERID FOLLOWERS
1: 1081 4053807021
2: 1081 2476584389
3: 1081 4713715543
4: 1081 580410695
5: 1081 4827723557
6: 1081 704326016165142528

In order to get this data …
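The excerpt stops before the conversion code. As a rough illustration of the same wide-to-long reshape outside R (the question itself is about data.table, not pandas), the sketch below splits the comma-separated follower string and explodes it row by row, processing the file in chunks so 160 million rows never have to sit in memory at once; file names and chunk size are placeholders.

# Rough pandas illustration of the desired reshape (not the R data.table approach
# the question asks about): split FOLLOWERS on commas and explode to one row per
# follower, writing output chunk by chunk to bound peak memory.
import pandas as pd

def explode_followers(in_csv, out_csv, chunksize=10_000):
    first = True
    for chunk in pd.read_csv(in_csv, chunksize=chunksize, dtype=str):
        long_form = (chunk.assign(FOLLOWERS=chunk["FOLLOWERS"].str.split(","))
                          .explode("FOLLOWERS"))
        long_form.to_csv(out_csv, mode="w" if first else "a",
                         header=first, index=False)
        first = False

# explode_followers("followers_wide.csv", "followers_long.csv")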

How to sort word count by value in Hadoop? [duplicate]

杀马特。学长 韩版系。学妹 submitted on 2019-11-30 17:44:23
This question already has an answer here: hadoop map reduce secondary sorting (5 answers). Hi, I wanted to learn how to sort the word count by value in Hadoop. I know Hadoop takes care of sorting keys, but not values. I know that to sort by values we need a Partitioner, a GroupingComparator and a SortComparator, but I am a bit confused about how to apply these concepts together to sort the word count by value. Do we need another MapReduce job to achieve this, or else a Combiner to count the occurrences, sort there, and emit the result to the reducer? Can anyone explain how to sort the word count example by …
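The excerpt ends before an answer. One common approach (a sketch, not taken from the question) is a second pass that swaps (word, count) into (count, word), so the framework's sort-by-key becomes a sort by value. Expressed as a Hadoop Streaming script in Python, with the two-job setup and zero-padding trick as assumptions:

# Hypothetical second-pass mapper/reducer for Hadoop Streaming (the question is
# about the Java API). The first word-count job emits "word<TAB>count"; this pass
# emits "count<TAB>word" so the shuffle sorts by count. Zero-padding makes the
# default lexicographic key sort behave numerically.
import sys

def swap_mapper():
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        print(f"{int(count):010d}\t{word}")

def identity_reducer():
    for line in sys.stdin:
        count, word = line.rstrip("\n").split("\t")
        print(f"{word}\t{int(count)}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        swap_mapper()
    else:
        identity_reducer()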

Can Azure Search be used as a primary database for some data?

十年热恋 submitted on 2019-11-30 17:37:39
Microsoft promotes Azure Search as "cloud search", but doesn't necessarily say it's a "database" or "data storage", and it stops short of calling it big data. Can/should Azure Search be used as the primary database for some data? Or should there always be some "primary" datastore that is "duplicated" in Azure Search for search purposes? If so, under what circumstances/scenarios does it make sense to use Azure Search as a primary database? Although we generally don't recommend it, you might consider using Azure Search as a primary store if: your app can tolerate some data inconsistency. Azure Search …

How to get an array/bag of elements from the Hive GROUP BY operator?

£可爱£侵袭症+ submitted on 2019-11-30 17:35:22
I want to group by a given field and get the output with the grouped values collected. Below is an example of what I am trying to achieve. Imagine a table named 'sample_table' with two columns, as below:

F1 F2
001 111
001 222
001 123
002 222
002 333
003 555

I want to write a Hive query that will give the output below:

001 [111, 222, 123]
002 [222, 333]
003 [555]

In Pig, this can be achieved very easily with something like: grouped_relation = GROUP sample_table BY F1; Can somebody please suggest whether there is a simple way to do so in Hive? What I can think of is to write a User Defined Function (UDF) for …
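The excerpt ends before the answer. In Hive itself the built-in aggregates collect_set() (distinct values) and collect_list() (keeps duplicates) do exactly this, so no custom UDF is needed, e.g. SELECT F1, collect_list(F2) FROM sample_table GROUP BY F1. The sketch below shows the equivalent call through PySpark's Hive-compatible functions, with an in-memory sample standing in for 'sample_table'.

# Equivalent of Hive's collect_list() expressed in PySpark; the tiny in-memory
# DataFrame stands in for the question's 'sample_table'.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("group-to-array").getOrCreate()

df = spark.createDataFrame(
    [("001", "111"), ("001", "222"), ("001", "123"),
     ("002", "222"), ("002", "333"), ("003", "555")],
    ["F1", "F2"])

grouped = df.groupBy("F1").agg(F.collect_list("F2").alias("F2_values"))
grouped.show(truncate=False)
# Expected: 001 -> [111, 222, 123], 002 -> [222, 333], 003 -> [555]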

What should be considered before choosing HBase?

旧巷老猫 submitted on 2019-11-30 16:21:39
I am very new to the big data space. We got a suggestion from the team that we should use HBase instead of an RDBMS for high performance. We have no idea what should/must be considered before switching from an RDBMS to HBase. Any ideas? One of my favourite books describes... Coming to @Whitefret's last point: there is something called the CAP theorem, on which the decision can be based: Consistency (all nodes see the same data at the same time), Availability (every request receives a response about whether it succeeded or failed), Partition tolerance (the system continues to operate despite arbitrary partitioning due …

How to set the data block size in Hadoop? Is it advantageous to change it?

痴心易碎 submitted on 2019-11-30 16:21:34
If we can change the data block size in Hadoop, please let me know how to do that. Is it advantageous to change the block size? If yes, let me know why and how; if not, then also let me know why. There seems to be much confusion about this topic and also wrong advice going around. To lift the confusion it helps to think about how HDFS is actually implemented: HDFS is an abstraction over distributed disk-based file systems, so the words "block" and "block size" have a different meaning than generally understood. For HDFS, a "file" is just a collection of blocks; each "block" in turn is …
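The excerpt cuts off mid-explanation. For the "how" part of the question, the block size is normally set cluster-wide through the dfs.blocksize property in hdfs-site.xml, and it can also be overridden per file at write time. A hedged sketch driving the hdfs CLI from Python (assumes a working Hadoop install; paths and the 256 MB value are placeholders):

# Sketch: override the HDFS block size for a single upload by passing
# dfs.blocksize to the hdfs CLI. The cluster-wide default lives in
# hdfs-site.xml under the same property name. Paths/size are placeholders.
import subprocess

block_size_bytes = 256 * 1024 * 1024  # 256 MB

subprocess.run(
    ["hdfs", "dfs",
     "-D", f"dfs.blocksize={block_size_bytes}",
     "-put", "/local/path/bigfile.dat", "/user/demo/bigfile.dat"],
    check=True,
)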

How does the Apache Spark scheduler split files into tasks?

非 Y 不嫁゛ submitted on 2019-11-30 14:40:27
At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals; slide 17 shows a stage being split into 4 tasks, as below. Here I want to know three things about how a stage is split into tasks: 1) In the example above, it seems that the number of tasks is based on the number of files; am I right? 2) If I'm right in point 1, then if there were just 3 files under the 'names' directory, would it just create 3 tasks? 3) If I'm right in point 2, what if there is just one but very large file? Does it just split this stage into 1 task? And what if the data is coming …
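The excerpt is cut off. A quick empirical way to explore these three questions (a sketch, not from the talk) is to check the partition count of the input RDD, since Spark launches one task per partition for the reading stage; the path and minPartitions value below are placeholders.

# Quick check of how an input directory or file would be split: one task is
# launched per partition, so getNumPartitions() shows the task count for the
# read stage. Path and minPartitions are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-count-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile("hdfs:///user/demo/names/", minPartitions=4)
print("partitions (≈ tasks for the read stage):", rdd.getNumPartitions())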