bigdata

R: Expanding an R factor into dummy columns for every factor level

Submitted by 拥有回忆 on 2019-12-02 01:55:48
I have a fairly big data frame in R with two columns. I am trying to turn the Code column (a factor with 858 levels) into dummy variables, but RStudio always crashes when I try.

    > str(d)
    'data.frame': 649226 obs. of 2 variables:
     $ User: int 210 210 210 210 269 317 317 317 317 326 ...
     $ Code: Factor w/ 858 levels "AA02","AA03",..: 164 494 538 626 464 496 435 464 475 163 ...

The User column is not unique, meaning there can be several rows with the same User. It doesn't matter whether the amount of rows remains the same in the end, or whether the rows with the …
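
A dense 649,226 × 858 indicator matrix is the likely cause of the crash; a sparse encoding avoids materializing all those zeros (in R, Matrix::sparse.model.matrix is the usual route). Since most of this page is Python-based, here is a minimal Python/pandas sketch of the same idea; the toy data frame below is a hypothetical stand-in for the asker's d, not their actual data.

    import pandas as pd

    # Hypothetical stand-in for the asker's data frame `d`
    d = pd.DataFrame({
        "User": [210, 210, 269, 317],
        "Code": pd.Categorical(["AA02", "BB10", "AA03", "AA02"]),
    })

    # sparse=True keeps the 0/1 columns as sparse arrays, so expanding to
    # hundreds of dummy columns does not require dense memory.
    dummies = pd.get_dummies(d["Code"], prefix="Code", sparse=True)
    result = pd.concat([d[["User"]], dummies], axis=1)
    print(result.dtypes.head())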

Unique Key generation in Hive/Hadoop

Submitted by 淺唱寂寞╮ on 2019-12-02 00:43:50
While selecting a set of records from a big-data Hive table, a unique key needs to be created for each record. In a sequential mode of operation it is easy to generate a unique id by calling something like max(id). Since Hive runs its tasks in parallel, how can we generate a unique key as part of a select query without compromising Hadoop's performance? Is this really a map-reduce problem, or do we need to fall back to a sequential approach? If for some reason you do not want to deal with UUIDs, then this solution (based on numeric values) does not require your parallel units to …
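
The usual answers come down to UUIDs (in Hive, for example, the reflect UDF calling java.util.UUID.randomUUID) or ids that are unique without being sequential. As a rough illustration in Python/PySpark (an assumed Spark 2.3+ environment, not the asker's Hive setup), the two approaches look like this:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(5).toDF("value")      # hypothetical stand-in for the real table

    # Option 1: a UUID per record -- globally unique, no coordination between tasks.
    with_uuid = df.withColumn("id", F.expr("uuid()"))

    # Option 2: ids that are unique but not consecutive, generated independently
    # per partition, so there is no sequential max(id) bottleneck.
    with_num = df.withColumn("id", F.monotonically_increasing_id())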

Iterating an RDD and updating a mutable collection returns an empty collection

Submitted by 允我心安 on 2019-12-01 23:17:58
I am new to Scala and Spark and would like some help understanding why the code below isn't producing my desired outcome. I am comparing two tables. My desired output schema is:

    case class DiscrepancyData(fieldKey: String, fieldName: String, val1: String, val2: String, valExpected: String)

When I run the code below step by step manually, I actually end up with my desired outcome, which is a List[DiscrepancyData] completely populated with my desired output. However, I must be missing something in the code below, because it returns an empty list (before this code gets called there are other codes …
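
The title already points at the usual cause: closures passed to RDD actions run on the executors, so appending to a driver-side mutable collection inside them never updates the driver's copy. The asker's Scala code is not included in the excerpt, so here is only a minimal Python/PySpark sketch of the same trap and the usual fix:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize([1, 2, 3])

    # Anti-pattern: the lambda runs on the executors against a copy of the
    # list, so the driver-side `results` stays empty.
    results = []
    rdd.foreach(lambda x: results.append(x))
    print(results)                        # []

    # Instead, express the work as a transformation and bring the data back.
    results = rdd.map(lambda x: x * 2).collect()
    print(results)                        # [2, 4, 6]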

Spark::KMeans calls takeSample() twice?

Submitted by 百般思念 on 2019-12-01 22:48:35
I have a lot of data and I have experimented with partitions of cardinality [20k, 200k+]. I call it like this:

    from pyspark.mllib.clustering import KMeans, KMeansModel
    C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
    C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)

and I see that initRandom() calls takeSample() once. The takeSample() implementation doesn't seem to call itself recursively or anything like that, so I would expect KMeans() to call takeSample() once. So why does the monitor show two takeSample()s per KMeans()?
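
One plausible explanation (an inference, not confirmed by the excerpt) is that RDD.takeSample internally runs a count() job before the actual sampling/collect job, so a single takeSample() call can surface as two jobs in the Spark UI. A quick way to check this on a plain RDD:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize(range(1_000_000), 100)

    # Watch the Spark UI while this runs: the single takeSample() call below
    # typically shows up as a count job plus a sampling/collect job.
    sample = rdd.takeSample(False, 8192, seed=42)
    print(len(sample))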

How to drop duplicated rows using pandas in a big data file?

Submitted by 試著忘記壹切 on 2019-12-01 22:15:33
I have a csv file that is too big to load into memory. I need to drop the duplicated rows of the file, so I followed this approach:

    chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                            encoding='utf-8', chunksize=10000000)
    for chunk in chunker:
        chunk.drop_duplicates(['Author ID'])

But if the duplicated rows are spread across different chunks, the script above can't produce the expected result. Is there a better way? You could try something like this. First, create your chunker.

    chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                            encoding='utf-8', chunksize=10000000)

Now …
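
The answer is cut off above, so here is a sketch of one common way to finish it (an assumption, not the answerer's actual code): keep a running set of Author IDs seen in earlier chunks and append only unseen rows to an output file. This assumes the set of unique Author IDs itself fits in memory; the two paths below are placeholders.

    import pandas as pd

    AUTHORS_PATH = 'authors.txt'        # placeholder for the asker's input path
    OUT_PATH = 'authors_dedup.txt'      # hypothetical output path

    chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                            encoding='utf-8', chunksize=10000000)

    seen = set()          # Author IDs observed in earlier chunks
    first = True
    for chunk in chunker:
        # Drop duplicates inside the chunk, then drop rows whose key already
        # appeared in a previous chunk.
        chunk = chunk.drop_duplicates(['Author ID'])
        chunk = chunk[~chunk['Author ID'].isin(seen)]
        seen.update(chunk['Author ID'])
        chunk.to_csv(OUT_PATH, sep='\t', index=False,
                     mode='w' if first else 'a', header=first)
        first = False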

In spark join, does table order matter like in pig?

Submitted by 流过昼夜 on 2019-12-01 15:53:04
Related to Spark - Joining 2 PairRDD elements. When doing a regular join in Pig, the last table in the join is not brought into memory but streamed through instead, so if A has small cardinality per key and B large cardinality, it is significantly better to do join A, B than join B, A from a performance perspective (avoiding spills and OOM). Is there a similar concept in Spark? I didn't see any such recommendation, and I wonder how that can be. The implementation looks to me pretty much the same as in Pig: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd …
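
Spark's closest explicit handle on which side stays in memory is the broadcast join: you mark the small side and Spark ships it to every executor while streaming the large side past it. A minimal Python/PySpark sketch (the DataFrames below are made-up stand-ins for A and B from the question):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    small = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "v"])   # "A"
    large = spark.range(1_000_000).withColumnRenamed("id", "key")       # "B"

    # The broadcast hint pins the small side in memory on every executor and
    # streams the large side, independent of the order the tables appear in.
    joined = large.join(F.broadcast(small), on="key", how="inner")
    joined.explain()        # the plan should show a BroadcastHashJoin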

Hive Table returning empty result set on all queries

Submitted by 我只是一个虾纸丫 on 2019-12-01 14:28:10
I created a Hive table which loads data from a text file, but it returns an empty result set for all queries. I tried the following command:

    CREATE TABLE table2(
        id1 INT, id2 INT, id3 INT, id4 STRING, id5 INT,
        id6 STRING, id7 STRING, id8 STRING, id9 STRING, id10 STRING,
        id11 STRING, id12 STRING, id13 STRING, id14 STRING, id15 STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION '/user/biadmin/lineitem';

The command executes and the table gets created, but it always returns 0 rows for all queries, including SELECT * FROM table2;. Sample data: a single line of the …
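
The excerpt is cut off before the sample data, but the usual culprits for a text-backed table that returns 0 rows are an empty or wrong LOCATION directory and a delimiter mismatch. A hypothetical sanity check from Python (nothing here is from the original post; it only shells out to the standard hdfs dfs commands):

    import subprocess

    location = "/user/biadmin/lineitem"

    # Does the LOCATION directory actually contain the data file?
    print(subprocess.run(["hdfs", "dfs", "-ls", location],
                         capture_output=True, text=True).stdout)

    # Are the fields really separated by '|', giving 15 columns per line?
    head = subprocess.run(["hdfs", "dfs", "-cat", location + "/*"],
                          capture_output=True, text=True).stdout
    print(head.splitlines()[0].count("|") + 1 if head else "no data found")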

How to avoid reading old files from S3 when appending new data?

Submitted by 心不动则不痛 on 2019-12-01 14:04:53
Once every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:

    df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")

In the spark-submit output I can see that significant time is spent reading the old Parquet files, for example:

    16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading
    16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key 'foo.parquet/id=123/day=2016-11-26 …
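
One workaround commonly used for this pattern (an assumption, not necessarily the asker's eventual solution) is to write each batch directly into its own partition path instead of appending at the dataset root, so Spark never has to list or touch the older partitions. A sketch in Python, where the id and day values are hypothetical:

    # `df` is the asker's DataFrame from the snippet above. Write straight into
    # the target partition directory; the partition values live in the path, so
    # the corresponding columns are dropped from the data itself.
    out = "s3://myBucket/foo.parquet/id={}/day={}".format(123, "2016-11-27")
    (df.drop("id", "day")
       .write.mode("overwrite")
       .parquet(out))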