bigdata

Spark 1.0.2 (also 1.1.0) hangs on a partition

Posted by ≯℡__Kan透↙ on 2019-12-23 09:49:23
Question: I've got a weird problem in Apache Spark and I would appreciate some help. After reading data from HDFS (and doing some conversion from JSON to objects), the next stage (processing said objects) fails after 2 partitions have been processed (out of 512 in total). This happens on largish datasets (the smallest I have noticed is about 700 MB, but it could be lower; I haven't narrowed it down yet). EDIT: 700 MB is the tgz file size; uncompressed it is 6 GB. EDIT 2: The same thing happens on …
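For reference, a hypothetical reconstruction of the pipeline shape being described; parse(), process(), and all paths are placeholders, not the asker's actual code:

```scala
// Hypothetical reconstruction of the described job, assuming an existing
// SparkContext sc; parse() and process() stand in for the real functions.
val raw    = sc.textFile("hdfs:///data/input", 512)  // ~512 partitions
val parsed = raw.map(parse)                          // JSON -> object
val result = parsed.map(process)                     // the stage that stalls
result.saveAsTextFile("hdfs:///data/output")
```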

Getting exception : java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;) while using data frames

Posted by 本小妞迷上赌 on 2019-12-23 09:16:50
Question: I am receiving a "java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)" error while using DataFrames in a Scala app run on Spark. However, if I work only with RDDs and not DataFrames, no such error comes up with the same pom and settings. While going through other posts with the same error, it is mentioned that the Scala version has to be 2.10 because Spark is not compatible with Scala 2.11, and I am using Scala 2.10 with Spark 2.0.0. Below …
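This particular NoSuchMethodError is the classic symptom of mixing artifacts built for different Scala binary versions (for example, an application compiled against Scala 2.10 running on a Spark build compiled for 2.11). A minimal sbt sketch of a consistent build, assuming Spark 2.0.0 on Scala 2.11; the pom equivalent is to give every Spark artifact the same _2.11 suffix:

```scala
// build.sbt -- a minimal sketch; the point is that every Spark artifact
// shares one Scala binary version. Version numbers are illustrative.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",  // %% appends _2.11
  "org.apache.spark" %% "spark-sql"  % "2.0.0"   // never mix _2.10 and _2.11
)
```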

Sorting a data stream before writing to file in nodejs

Posted by 馋奶兔 on 2019-12-23 07:25:17
Question: I have an input file which may potentially contain up to 1M records, and each record looks like this: field1 field2 field3 \n. I want to read this input file and sort it based on field3 before writing it to another file. Here is what I have so far: var fs = require('fs'), readline = require('readline'), stream = require('stream'); var start = Date.now(); var outstream = new stream; outstream.readable = true; outstream.writable = true; var rl = readline.createInterface({ input: fs …
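Since 1M short records comfortably fit in memory, one hedged alternative is to skip streaming entirely: buffer the file, sort, and write once. File names and the string comparison below are assumptions:

```javascript
// A minimal sketch: buffer the whole file, sort by the third field, write once.
const fs = require('fs');

const lines = fs.readFileSync('input.txt', 'utf8').split('\n').filter(Boolean);
lines.sort((a, b) => {
  const fa = a.split(/\s+/)[2]; // field3 = third whitespace-separated field
  const fb = b.split(/\s+/)[2];
  return fa.localeCompare(fb);  // use Number(fa) - Number(fb) if numeric
});
fs.writeFileSync('sorted.txt', lines.join('\n') + '\n');
```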

How to convert r data frame to h2o object

Posted by ε祈祈猫儿з on 2019-12-23 06:49:14
Question: I'm new to R and H2O, and I have tried to find a way to convert an R data frame to an H2O object. I have spent some time researching how to do this with no luck. The other way around is possible and well documented, as follows: prosPath = system.file("extdata", "prostate.csv", package="h2o") prostate.hex = h2o.importFile(localH2O, path = prosPath) prostate.data.frame <- as.data.frame(prostate.hex) But what I want is the complete opposite of this. I want to convert the R "prostate.data.frame" data object …
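The inverse conversion is as.h2o(), which uploads an R data frame to the running H2O instance (the exact signature varies across h2o package versions; older releases also took the client handle and a key argument). A minimal sketch:

```r
# A minimal sketch: as.h2o() is the inverse of as.data.frame().
# Assumes an H2O instance is already running (h2o.init()).
library(h2o)
h2o.init()

df  <- data.frame(x = 1:10, y = rnorm(10))  # an ordinary R data frame
hex <- as.h2o(df)                           # upload it as an H2OFrame
class(hex)                                  # "H2OFrame"
```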

Sqoop Import from Hive to Hive

Posted by 扶醉桌前 on 2019-12-23 04:54:37
Question: Can we import tables from a Hive data source to a Hive data source using Sqoop, with a query like: sqoop import --connect jdbc:hive2://localhost:10000/default --driver org.apache.hive.jdbc.HiveDriver --username root --password root --table student1 -m 1 --target-dir hdfs://localhost:9000/user/dummy/hive2result Right now it's throwing the exception below: 15/07/19 19:50:18 ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Method not supported java.sql.SQLException: Method not …
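"Method not supported" here usually means the Hive JDBC driver does not implement the JDBC metadata calls Sqoop relies on; Sqoop is aimed at RDBMS-to-Hadoop transfers. For a Hive-to-Hive copy, a hedged alternative is plain HiveQL:

```sql
-- A hedged alternative sketch: copy a Hive table with plain HiveQL
-- instead of Sqoop. The target table name is illustrative.
CREATE TABLE student1_copy AS
SELECT * FROM student1;
```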

Read, format, then write large CSV files

Posted by 柔情痞子 on 2019-12-23 04:43:18
Question: I have fairly large CSV files that I need to manipulate/amend line by line (as each line may require different amending rules), then write out to another CSV with the proper formatting. Currently, I have: import multiprocessing def read(buffer): pool = multiprocessing.Pool(4) with open("/path/to/file.csv", 'r') as f: while True: lines = pool.map(format_data, f.readlines(buffer)) if not lines: break yield lines def format_data(row): row = row.split(',') # Because readlines() returns a …
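A hedged sketch of a streaming variant of the same idea (a worker pool amending rows), using csv.reader and Pool.imap so memory stays flat regardless of file size; the paths and the amend() rule are placeholders:

```python
# A minimal sketch (not the asker's exact rules): stream a large CSV
# line by line, amend each row in a worker pool, and write the result.
import csv
import multiprocessing

def amend(row):
    # apply per-row amendment rules here; this example just strips whitespace
    return [field.strip() for field in row]

def main():
    with open("in.csv", newline="") as src, \
         open("out.csv", "w", newline="") as dst, \
         multiprocessing.Pool(4) as pool:
        writer = csv.writer(dst)
        # imap keeps memory flat: rows stream through the pool in order
        for row in pool.imap(amend, csv.reader(src), chunksize=1000):
            writer.writerow(row)

if __name__ == "__main__":
    main()
```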

select multiple elements with group by in spark.sql

Posted by 情到浓时终转凉″ on 2019-12-23 03:47:20
Question: Is there any way to GROUP BY a table in Spark SQL while selecting multiple columns? The code I am using: val df = spark.read.json("//path") df.createOrReplaceTempView("GETBYID") Now doing a group by like: val sqlDF = spark.sql("SELECT count(customerId) FROM GETBYID group by customerId"); works, but when I try: val sqlDF = spark.sql("SELECT count(customerId),customerId,userId FROM GETBYID group by customerId"); Spark gives an error: org.apache.spark.sql.AnalysisException: expression 'getbyid.userId' is …
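In Spark SQL, every selected column that is not wrapped in an aggregate must appear in the GROUP BY clause, which is exactly what the AnalysisException is complaining about for userId. Two hedged variants, depending on the grouping actually wanted:

```scala
// Every non-aggregate column must be grouped on...
val byBoth = spark.sql(
  """SELECT customerId, userId, count(customerId)
    |FROM GETBYID
    |GROUP BY customerId, userId""".stripMargin)

// ...or wrapped in an aggregate such as first() if it should not
// affect the grouping.
val byCustomer = spark.sql(
  "SELECT customerId, first(userId), count(customerId) FROM GETBYID GROUP BY customerId")
```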

how to compare two data frames in scala

Posted by 南楼画角 on 2019-12-23 03:17:29
Question: I have two exactly identical DataFrames for a comparison test.

df1
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ    | 12332  | 54322  | 53422 |
2014 | NJ    | 12332  | 53255  | 55324 |
2015 | CO    | 12332  | 53255  | 55324 |
2015 | MD    | 14463  | 76543  | 66433 |
2016 | CT    | 14463  | 76543  | 66433 |
2016 | CT    | 55325  | 76543  | 66433 |
------------------------------------------

df2
------------------------------------------
year | state | count2 | count3 | count4|
2014 | …
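One common way to compare DataFrames in Spark is a two-way except(): if both differences are empty, the frames contain the same rows. Note that except() is set-based, so duplicate rows collapse; exceptAll() in Spark 2.4+ also compares multiplicities. A minimal sketch:

```scala
// A minimal sketch: two-way set difference. except() ignores duplicate
// multiplicity; use exceptAll() (Spark 2.4+) to compare duplicates too.
val onlyInDf1 = df1.except(df2)
val onlyInDf2 = df2.except(df1)
val sameRows  = onlyInDf1.count == 0 && onlyInDf2.count == 0
```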

Create hive table error to load Twitter data

Posted by 萝らか妹 on 2019-12-23 02:40:44
Question: I am trying to create an external table and load Twitter data into it. While creating the table I am getting the following error and have not been able to track it down. hive> ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar; Added [/usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar] to class path Added resources: [/usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar] hive> CREATE EXTERNAL TABLE tweets ( > id BIGINT, > created_at STRING, > source STRING, > favorited BOOLEAN, > …
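For reference, hive-serdes-1.0-SNAPSHOT.jar is the SerDe from Cloudera's well-known Twitter-analysis example, whose table definition pairs these JSON columns with that jar's SerDe class. A hedged sketch of a complete statement along those lines (the HDFS location is illustrative, and the real example declares more columns):

```sql
-- A hedged sketch modeled on Cloudera's Twitter example, which is where
-- hive-serdes-1.0-SNAPSHOT.jar comes from. The LOCATION is illustrative.
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
```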