bigdata

Spark 1.0.2 (also 1.1.0) hangs on a partition

Posted by ≯℡__Kan透↙ on 2019-12-23 09:49:23
Question: I've got a weird problem in Apache Spark and I would appreciate some help. After reading data from HDFS (and doing some conversion from JSON to objects), the next stage (processing said objects) fails after 2 partitions have been processed (out of 512 in total). This happens on largish datasets (the smallest I have noticed is about 700 MB, but it could be lower; I haven't narrowed it down yet). EDIT: 700 MB is the tgz file size; uncompressed it is 6 GB. EDIT 2: The same thing happens on …
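For reference, a hypothetical reconstruction of the pipeline shape being described; parse(), process(), and all paths are placeholders, not the asker's actual code:

```scala
// Hypothetical reconstruction of the described job, assuming an existing
// SparkContext sc; parse() and process() stand in for the real functions.
val raw    = sc.textFile("hdfs:///data/input", 512)  // ~512 partitions
val parsed = raw.map(parse)                          // JSON -> object
val result = parsed.map(process)                     // the stage that stalls
result.saveAsTextFile("hdfs:///data/output")
```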

Getting exception : java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;) while using data frames

Posted by 本小妞迷上赌 on 2019-12-23 09:16:50
Question: I am receiving a "java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)" error while using DataFrames in a Scala app run on Spark. However, if I work only with RDDs and not DataFrames, no such error comes up with the same pom and settings. While going through other posts with the same error, it is mentioned that the Scala version has to be 2.10 because Spark is not compatible with Scala 2.11, and I am using Scala 2.10 with Spark 2.0.0. Below …
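This particular NoSuchMethodError is the classic symptom of mixing artifacts built for different Scala binary versions (for example, an application compiled against Scala 2.10 running on a Spark build compiled for 2.11). A minimal sbt sketch of a consistent build, assuming Spark 2.0.0 on Scala 2.11; the pom equivalent is to give every Spark artifact the same _2.11 suffix:

```scala
// build.sbt -- a minimal sketch; the point is that every Spark artifact
// shares one Scala binary version. Version numbers are illustrative.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",  // %% appends _2.11
  "org.apache.spark" %% "spark-sql"  % "2.0.0"   // never mix _2.10 and _2.11
)
```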

Sorting a data stream before writing to file in nodejs

Posted by 馋奶兔 on 2019-12-23 07:25:17
Question: I have an input file which may potentially contain up to 1M records, and each record looks like this: field1 field2 field3 \n. I want to read this input file and sort it based on field3 before writing it to another file. Here is what I have so far: var fs = require('fs'), readline = require('readline'), stream = require('stream'); var start = Date.now(); var outstream = new stream; outstream.readable = true; outstream.writable = true; var rl = readline.createInterface({ input: fs …
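Since 1M short records comfortably fit in memory, one hedged alternative is to skip streaming entirely: buffer the file, sort, and write once. File names and the string comparison below are assumptions:

```javascript
// A minimal sketch: buffer the whole file, sort by the third field, write once.
const fs = require('fs');

const lines = fs.readFileSync('input.txt', 'utf8').split('\n').filter(Boolean);
lines.sort((a, b) => {
  const fa = a.split(/\s+/)[2]; // field3 = third whitespace-separated field
  const fb = b.split(/\s+/)[2];
  return fa.localeCompare(fb);  // use Number(fa) - Number(fb) if numeric
});
fs.writeFileSync('sorted.txt', lines.join('\n') + '\n');
```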

How to convert r data frame to h2o object

Posted by ε祈祈猫儿з on 2019-12-23 06:49:14
Question: I'm new to R and H2O, and I have tried to find a way to convert an R data frame to an H2O object. I have spent some time researching how to do this with no luck. The other way around is possible and well documented, as follows: prosPath = system.file("extdata", "prostate.csv", package="h2o") prostate.hex = h2o.importFile(localH2O, path = prosPath) prostate.data.frame <- as.data.frame(prostate.hex) But what I want is the complete opposite of this. I want to convert the R "prostate.data.frame" data object …
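The inverse conversion is as.h2o(), which uploads an R data frame to the running H2O instance (the exact signature varies across h2o package versions; older releases also took the client handle and a key argument). A minimal sketch:

```r
# A minimal sketch: as.h2o() is the inverse of as.data.frame().
# Assumes an H2O instance is already running (h2o.init()).
library(h2o)
h2o.init()

df  <- data.frame(x = 1:10, y = rnorm(10))  # an ordinary R data frame
hex <- as.h2o(df)                           # upload it as an H2OFrame
class(hex)                                  # "H2OFrame"
```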

Sqoop Import from Hive to Hive

Posted by 扶醉桌前 on 2019-12-23 04:54:37
Question: Can we import tables from a Hive data source to a Hive data source using Sqoop, with a query like: sqoop import --connect jdbc:hive2://localhost:10000/default --driver org.apache.hive.jdbc.HiveDriver --username root --password root --table student1 -m 1 --target-dir hdfs://localhost:9000/user/dummy/hive2result Right now it's throwing the exception below: 15/07/19 19:50:18 ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Method not supported java.sql.SQLException: Method not …
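"Method not supported" here usually means the Hive JDBC driver does not implement the JDBC metadata calls Sqoop relies on; Sqoop is aimed at RDBMS-to-Hadoop transfers. For a Hive-to-Hive copy, a hedged alternative is plain HiveQL:

```sql
-- A hedged alternative sketch: copy a Hive table with plain HiveQL
-- instead of Sqoop. The target table name is illustrative.
CREATE TABLE student1_copy AS
SELECT * FROM student1;
```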

Read, format, then write large CSV files

Posted by 柔情痞子 on 2019-12-23 04:43:18
Question: I have fairly large CSV files that I need to manipulate/amend line by line (as each line may require different amending rules), then write out to another CSV with the proper formatting. Currently, I have: import multiprocessing def read(buffer): pool = multiprocessing.Pool(4) with open("/path/to/file.csv", 'r') as f: while True: lines = pool.map(format_data, f.readlines(buffer)) if not lines: break yield lines def format_data(row): row = row.split(',') # Because readlines() returns a …
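A hedged sketch of a streaming variant of the same idea (a worker pool amending rows), using csv.reader and Pool.imap so memory stays flat regardless of file size; the paths and the amend() rule are placeholders:

```python
# A minimal sketch (not the asker's exact rules): stream a large CSV
# line by line, amend each row in a worker pool, and write the result.
import csv
import multiprocessing

def amend(row):
    # apply per-row amendment rules here; this example just strips whitespace
    return [field.strip() for field in row]

def main():
    with open("in.csv", newline="") as src, \
         open("out.csv", "w", newline="") as dst, \
         multiprocessing.Pool(4) as pool:
        writer = csv.writer(dst)
        # imap keeps memory flat: rows stream through the pool in order
        for row in pool.imap(amend, csv.reader(src), chunksize=1000):
            writer.writerow(row)

if __name__ == "__main__":
    main()
```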

select multiple elements with group by in spark.sql

Posted by 情到浓时终转凉″ on 2019-12-23 03:47:20
Question: Is there any way to GROUP BY a table in Spark SQL while selecting multiple columns? The code I am using: val df = spark.read.json("//path") df.createOrReplaceTempView("GETBYID") Now doing a group by like: val sqlDF = spark.sql("SELECT count(customerId) FROM GETBYID group by customerId"); works, but when I try: val sqlDF = spark.sql("SELECT count(customerId),customerId,userId FROM GETBYID group by customerId"); Spark gives an error: org.apache.spark.sql.AnalysisException: expression 'getbyid.userId' is …
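In Spark SQL, every selected column that is not wrapped in an aggregate must appear in the GROUP BY clause, which is exactly what the AnalysisException is complaining about for userId. Two hedged variants, depending on the grouping actually wanted:

```scala
// Every non-aggregate column must be grouped on...
val byBoth = spark.sql(
  """SELECT customerId, userId, count(customerId)
    |FROM GETBYID
    |GROUP BY customerId, userId""".stripMargin)

// ...or wrapped in an aggregate such as first() if it should not
// affect the grouping.
val byCustomer = spark.sql(
  "SELECT customerId, first(userId), count(customerId) FROM GETBYID GROUP BY customerId")
```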

how to compare two data frames in scala

Posted by 南楼画角 on 2019-12-23 03:17:29
Question: I have two exactly identical DataFrames for a comparison test.

df1
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ    | 12332  | 54322  | 53422 |
2014 | NJ    | 12332  | 53255  | 55324 |
2015 | CO    | 12332  | 53255  | 55324 |
2015 | MD    | 14463  | 76543  | 66433 |
2016 | CT    | 14463  | 76543  | 66433 |
2016 | CT    | 55325  | 76543  | 66433 |
------------------------------------------

df2
------------------------------------------
year | state | count2 | count3 | count4|
2014 | …
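One common way to compare DataFrames in Spark is a two-way except(): if both differences are empty, the frames contain the same rows. Note that except() is set-based, so duplicate rows collapse; exceptAll() in Spark 2.4+ also compares multiplicities. A minimal sketch:

```scala
// A minimal sketch: two-way set difference. except() ignores duplicate
// multiplicity; use exceptAll() (Spark 2.4+) to compare duplicates too.
val onlyInDf1 = df1.except(df2)
val onlyInDf2 = df2.except(df1)
val sameRows  = onlyInDf1.count == 0 && onlyInDf2.count == 0
```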

Create hive table error to load Twitter data

Posted by 萝らか妹 on 2019-12-23 02:40:44
Question: I am trying to create an external table and load Twitter data into it. While creating the table I am getting the following error and have not been able to track it down. hive> ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar; Added [/usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar] to class path Added resources: [/usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar] hive> CREATE EXTERNAL TABLE tweets ( > id BIGINT, > created_at STRING, > source STRING, > favorited BOOLEAN, > …
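For reference, hive-serdes-1.0-SNAPSHOT.jar is the SerDe from Cloudera's well-known Twitter-analysis example, whose table definition pairs these JSON columns with that jar's SerDe class. A hedged sketch of a complete statement along those lines (the HDFS location is illustrative, and the real example declares more columns):

```sql
-- A hedged sketch modeled on Cloudera's Twitter example, which is where
-- hive-serdes-1.0-SNAPSHOT.jar comes from. The LOCATION is illustrative.
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
```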