bigdata

Why does Spark's OneHotEncoder drop the last category by default?

社会主义新天地 submitted on 2019-12-01 03:24:45
I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default. For example:

>>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
>>> ss = StringIndexer(inputCol="c", outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+

By default, the OneHotEncoder will drop the last category:

>>> oe = OneHotEncoder(inputCol="c_idx", outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>>
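The excerpt above is cut off. The usual rationale given is that with every category encoded the one-hot columns always sum to 1, making them linearly dependent (the "dummy variable trap"), which can hurt models that include an intercept; dropping one category removes that redundancy. A minimal sketch, assuming the same ff DataFrame as above, of keeping every category via the dropLast parameter:

# Keep all categories instead of dropping the last one (dropLast defaults to True).
oe_full = OneHotEncoder(inputCol="c_idx", outputCol="c_idx_vec", dropLast=False)
fe_full = oe_full.transform(ff)
fe_full.show()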

Finding the minimum Hamming distance of a set of strings in Python

丶灬走出姿态 submitted on 2019-12-01 02:25:22
Question: I have a set of n (~1,000,000) strings (DNA sequences) stored in a list trans. I have to find the minimum Hamming distance of all sequences in the list. I implemented a naive brute-force algorithm, which has been running for more than a day and has not yet given a solution. My code is:

dmin = len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i + 1, len(trans)):
        dist = hamdist(trans[i][:-1], trans[j][:-1])
        if dist < dmin:
            dmin = dist

Is there a more efficient method to do this? Here hamdist is
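The excerpt is cut off above. As a rough illustration only (not from the original answer), the inner comparison can at least be vectorized with NumPy by packing the equal-length sequences into a byte matrix; this is still quadratic in the number of sequences, so it only improves the constant factor:

import numpy as np

def min_hamming(seqs):
    # Pack the equal-length sequences into one 2-D array of byte codes.
    arr = np.frombuffer("".join(seqs).encode("ascii"), dtype=np.uint8)
    arr = arr.reshape(len(seqs), -1)
    best = arr.shape[1]
    for i in range(len(arr) - 1):
        # Compare row i against every later row in a single vectorized step.
        dists = (arr[i] != arr[i + 1:]).sum(axis=1)
        best = min(best, int(dists.min()))
        if best == 0:   # cannot get smaller than zero, stop early
            break
    return best

# Usage matching the question's slicing, e.g.: min_hamming([s[:-1] for s in trans])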

Which function in Spark is used to combine two RDDs by keys

谁说我不能喝 submitted on 2019-12-01 02:22:41
Let us say I have the following two RDDs, with the following key-pair values.

rdd1 = [(key1, [value1, value2]), (key2, [value3, value4])]

and

rdd2 = [(key1, [value5, value6]), (key2, [value7])]

Now, I want to join them by key values, so for example I want to return the following:

ret = [(key1, [value1, value2, value5, value6]), (key2, [value3, value4, value7])]

How can I do this in Spark using Python or Scala? One way is to use join, but join would create a tuple inside the tuple. But I want to only have one tuple per key-value pair.

I would union the two RDDs and do a reduceByKey to
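A minimal PySpark sketch of that union + reduceByKey idea, assuming rdd1 and rdd2 hold (key, list) pairs as in the question:

# Concatenate the value lists that share a key.
combined = rdd1.union(rdd2).reduceByKey(lambda a, b: a + b)
# combined.collect() -> [(key1, [value1, value2, value5, value6]), (key2, [value3, value4, value7])]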

Oozie s3 as job folder

Deadly submitted on 2019-12-01 01:50:38
Oozie is failing with the following error when workflow.xml is provided from S3, but the same worked when workflow.xml was provided from HDFS. The same setup worked with earlier versions of Oozie; has anything changed since Oozie 4.3?

Env:
HDP 3.1.0
Oozie 4.3.1
oozie.service.HadoopAccessorService.supported.filesystems=*

Job.properties:
nameNode=hdfs://ambari-master-1a.xdata.com:8020
jobTracker=ambari-master-2a.xdata.com:8050
queue=default
#OOZIE job details
basepath=s3a://mybucket/test/oozie
oozie.use.system.libpath=true
oozie.wf.application.path=${basepath}/jobs/test-hive
#(works with this

Is it good practice to do sync database queries or RESTful calls in Kafka Streams jobs?

柔情痞子 submitted on 2019-12-01 01:47:44
I use Kafka Streams to process real-time data. In the Kafka Streams tasks I need to access MySQL to query data, and I also need to call another RESTful service. All of these operations are synchronous. I'm afraid the sync calls will reduce the processing capability of the streams tasks. Is this a good practice, or is there a better way to do it?

A better way to do it would be to stream your MySQL table(s) into Kafka, and access the data there. This has the advantage of decoupling your streams app from the MySQL database. If you moved away from MySQL in the future, so long as the data were still written to

Inserting a large number of nodes into Neo4j

我怕爱的太早我们不能终老 submitted on 2019-12-01 01:14:26
I have a table stored in a typical MySQL database, and I've built a small parser tool in Java to parse it out and build a Neo4j database. This database will have ~40 million nodes, each with one or more edges (with a possible maximum of 10 edges). The problem comes from the way I have to create certain nodes. There is a user node, a comment node, and a hashtag node. The user nodes and hashtag nodes must each be unique. I'm using code from the following example to ensure uniqueness:

public Node getOrCreateUserWithUniqueFactory( String username, GraphDatabaseService graphDb )
{
    UniqueFactory<Node>
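The Java snippet above is cut off. As a separate, hedged illustration of the same get-or-create idea (not the UniqueFactory approach from the question), batched MERGE statements through the official neo4j Python driver give equivalent uniqueness semantics; the URI, credentials, label, and batch size below are hypothetical:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def insert_users(usernames, batch_size=10000):
    # MERGE acts as get-or-create; a uniqueness constraint/index on :User(name)
    # keeps the lookups fast (constraint syntax varies by Neo4j version).
    query = "UNWIND $names AS name MERGE (u:User {name: name})"
    with driver.session() as session:
        for i in range(0, len(usernames), batch_size):
            session.run(query, names=usernames[i:i + batch_size])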

Spark job execution time increases exponentially with a very wide dataset and number of columns [duplicate]

 ̄綄美尐妖づ submitted on 2019-12-01 01:12:00
This question is an exact duplicate of: Spark Fixed Width File Import Large number of columns causing high Execution time

I have created a fixed-width file import parser in Spark and performed a few execution tests on various datasets. It works fine up to 1000 columns, but as the number of columns and the fixed-width record length increase, Spark job performance decreases rapidly. It takes a lot of time to execute on 20k columns with a fixed-width record length of more than 100 thousand. What are the possible reasons for this? How can I improve the performance? One of the similar issues I found: http://apache-spark
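The linked thread is cut off above. One commonly reported cause of this kind of blow-up (an assumption here, not confirmed in the post) is building the wide DataFrame with thousands of chained withColumn calls, which makes Catalyst query planning extremely expensive; doing all the fixed-width slicing in a single select keeps the plan flat. A minimal sketch with hypothetical field widths and column names:

from pyspark.sql import functions as F

widths = [10, 5, 8]                                           # hypothetical field widths
names = ["col_{}".format(i) for i in range(len(widths))]
starts = [sum(widths[:i]) + 1 for i in range(len(widths))]    # substring() is 1-based

# One select() with all slices instead of thousands of withColumn() calls.
cols = [F.substring("value", s, w).alias(n) for s, w, n in zip(starts, widths, names)]
wide_df = spark.read.text("fixed_width.txt").select(*cols)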

Inserting a big array of objects into MongoDB from Node.js

坚强是说给别人听的谎言 submitted on 2019-11-30 21:32:00
I need to insert a big array of objects (about 1.5-2 million) into MongoDB from Node.js. How can I improve my inserting? This is my code:

var sizeOfArray = arrayOfObjects.length; // sizeOfArray is about 1.5-2 million
for (var i = 0; i < sizeOfArray; ++i) {
    newKey = {
        field_1: arrayOfObjects[i][1],
        field_2: arrayOfObjects[i][2],
        field_3: arrayOfObjects[i][3]
    };
    collection.insert(newKey, function(err, data) {
        if (err) {
            log.error('Error insert: ' + err);
        }
    });
}

You can use bulk inserts. There are two types of bulk operations:

Ordered bulk operations. These operations execute all the operations in order and error out on the
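The answer above is cut off. The question is Node.js, but as a hedged, language-consistent illustration of the same unordered bulk-insert idea, here is a pymongo sketch; the connection string, database, and collection names are hypothetical:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
collection = client["mydb"]["mycollection"]         # hypothetical database/collection

def bulk_insert(docs, batch_size=10000):
    # ordered=False lets the server keep inserting past individual failures,
    # i.e. the "unordered" bulk behaviour mentioned above.
    for i in range(0, len(docs), batch_size):
        collection.insert_many(docs[i:i + batch_size], ordered=False)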

Skipping the first line of a .csv in MapReduce (Java)

对着背影说爱祢 submitted on 2019-11-30 21:24:46
As the mapper function runs for every line, how can I skip the first line? For some files the first line is a column header, which I want to ignore.

In the mapper, while reading the file, the data is read in as a key-value pair. The key is the byte offset at which the line starts; for line 1 it is always zero. So in the mapper function do the following:

@Override
public void map(LongWritable key, Text value, Context context) throws IOException {
    try {
        if (key.get() == 0 && value.toString().contains("header") /* some condition identifying the header */)
            return;
        else {
            // For rest of data it goes here