bigdata

How to append keys to values for a {Key,Value} pair RDD and how to convert it to an RDD? [duplicate]

青春壹個敷衍的年華 submitted on 2019-12-08 11:42:20
Question: This question already has answers here: Spark - Obtaining file name in RDDs (7 answers). Closed 2 years ago.
Suppose I have two files, file1 and file2, in the dataset directory:
    val file = sc.wholeTextFiles("file:///root/data/dataset").map((x,y) => y + "," + x)
In the code above I am trying to get an RDD whose elements are value,key joined into a single value. Suppose the file name is file1 and it has two records:
file1:
    1,30,ssr
    2,43,svr
And file2:
    1,30,psr
    2,43,pvr
The desired RDD output is: (1,30,ssr,file1),
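For reference, one hedged way to get per-record value,filename pairs from wholeTextFiles: each element is a (path, fullContent) tuple, so pattern-match on the pair, split the content into lines, and append the file name. The path and output layout below are placeholders mirroring the question, not the asker's actual solution:

    val file = sc.wholeTextFiles("file:///root/data/dataset")
      .flatMap { case (path, content) =>
        // wholeTextFiles yields (fullPath, wholeFileContent); keep just the file name.
        val fileName = path.split("/").last
        content.split("\n")
          .filter(_.trim.nonEmpty)
          .map(line => line.trim + "," + fileName)   // e.g. "1,30,ssr,file1"
      }

    file.collect().foreach(println)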

How to change SpoutConfig from the default settings?

吃可爱长大的小学妹 submitted on 2019-12-08 11:29:55
Question: I'm trying to pull Facebook page data using the Graph API. Each post is larger than 1 MB, while Kafka's default fetch.message.max.bytes is 1 MB. I raised the Kafka limits from 1 MB to 3 MB by adding the lines below to the Kafka consumer.properties and server.properties files:
    fetch.message.max.bytes=3048576    (consumer.properties)
    message.max.bytes=3048576          (server.properties)
    replica.fetch.max.bytes=3048576    (server.properties)
Now, after adding the above lines in Kafka, 3 MB message data is going into
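The title mentions SpoutConfig, which suggests the posts are read through the Storm Kafka spout; if so, the spout has its own fetch-size fields that default to 1 MB independently of the broker and consumer properties. A minimal sketch under that assumption, using the old storm-kafka module (package storm.kafka; newer releases use org.apache.storm.kafka) with a hypothetical ZooKeeper address, topic, and client id:

    import storm.kafka.{KafkaSpout, SpoutConfig, ZkHosts}

    // Hypothetical ZooKeeper address, topic, zkRoot and client id -- replace with real values.
    val hosts = new ZkHosts("localhost:2181")
    val spoutConfig = new SpoutConfig(hosts, "fb_pages", "/fb_pages", "fb-page-spout")

    // These two fields default to 1 MB; raise them so a post larger than 1 MB fits in one fetch.
    spoutConfig.fetchSizeBytes = 3 * 1024 * 1024
    spoutConfig.bufferSizeBytes = 3 * 1024 * 1024

    val kafkaSpout = new KafkaSpout(spoutConfig)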

Unable to count words using the reduceByKey((v1,v2) => v1 + v2) Scala function in Spark

半世苍凉 submitted on 2019-12-08 11:25:50
Question: I just started learning Spark. I am using Spark in standalone mode and trying to do a word count in Scala. The issue I have observed is that reduceByKey() is not grouping the words as expected; a NULL array is printed. The steps I followed are: create a text file containing some words separated by spaces, then execute the commands below in the Spark shell.
    scala> import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext
    scala> import org.apache.spark.SparkContext._
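For reference, a minimal word-count sketch that runs in the Spark shell (the input path is a placeholder). Note that reduceByKey is a lazy transformation, so nothing is printed until an action such as collect() runs:

    val lines = sc.textFile("file:///tmp/words.txt")   // placeholder path
    val counts = lines
      .flatMap(_.split("\\s+"))        // split each line on whitespace
      .filter(_.nonEmpty)              // drop empty tokens
      .map(word => (word, 1))          // pair each word with a count of 1
      .reduceByKey((v1, v2) => v1 + v2)

    counts.collect().foreach(println)  // collect() triggers the job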

Converting 900 MB .csv into ROOT (CERN) TTree

点点圈 submitted on 2019-12-08 09:23:57
Question: I am new to programming and ROOT (CERN), so go easy on me. Simply put, I want to convert a ~900 MB (11M lines x 10 columns) .csv file into a nicely organized .root TTree. Could someone suggest the best way to go about this? Here is an example line of data with headers (it's 2010 US census block population and population density data): "Census County Code","Census Tract Code","Census Block Code","County/State","Block Centroid Latitude (degrees)","Block Centroid W Longitude (degrees)","Block Land

What is the proper way to declare a simple Timestamp in Avro

一曲冷凌霜 submitted on 2019-12-08 08:51:28
Question: How can I declare a simple timestamp in Avro? type: timestamp doesn't work, so for now I am using a plain string, but I want it to be a timestamp. (This is my value: 27/01/1999 08:45:34.) Thank you.
Answer 1: Use Avro's logical types:
    {"name": "timestamp", "type": {"type": "string", "logicalType": "timestamp-millis"}}
A few useful links: Avro timestamp-millis, Avro logical types, Hortonworks community question about Avro timestamps.
Source: https://stackoverflow.com/questions/55607244/what-is-the-proper-way-to
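For what it's worth, the Avro specification defines timestamp-millis as a logical type that annotates a long (milliseconds since the Unix epoch), so a schema along the lines below is the more conventional form. A minimal sketch using the Avro Java library from Scala; the record and field names are made up:

    import org.apache.avro.{LogicalTypes, Schema}

    object TimestampSchemaCheck {
      def main(args: Array[String]): Unit = {
        // timestamp-millis annotating a long, per the Avro spec
        val json =
          """{
            |  "type": "record",
            |  "name": "Event",
            |  "fields": [
            |    {"name": "ts", "type": {"type": "long", "logicalType": "timestamp-millis"}}
            |  ]
            |}""".stripMargin

        val schema = new Schema.Parser().parse(json)
        val tsField = schema.getField("ts").schema()
        println(LogicalTypes.fromSchema(tsField).getName) // prints "timestamp-millis"
      }
    }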

Spark Dataframe count function and many more functions throw IndexOutOfBoundsException

核能气质少年 submitted on 2019-12-08 07:37:23
Question:
1) I initially filtered null values out of the RDD: val rddWithOutNull2 = rddSlices.filter(x => x(0) != null)
2) Then I converted this RDD to an RDD of Row.
3) After converting the RDD to a DataFrame using Scala:
    val df = spark.createDataFrame(rddRow, schema)
    df.printSchema()
Output:
    root
     |-- name: string (nullable = false)
    println(df.count())
Output: Error: count: [Stage 11:==================================> (3 + 2) / 5][error] o.a.s.e.Executor - Exception in task 4.0 in stage 11.0 (TID 16) java.lang
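A common cause of this pattern (printSchema works, count fails) is that the exception is thrown lazily: createDataFrame never touches the data, so malformed rows only surface when an action such as count runs. A spark-shell style sketch of a more defensive version, assuming rddSlices is an RDD of sequences and that some slices are empty or shorter than the schema; the names mirror the question, but the guard logic is my assumption, not the asker's code:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Single-column schema matching the printSchema output in the question.
    val schema = StructType(Seq(StructField("name", StringType, nullable = false)))

    // Keep only slices that have at least as many fields as the schema expects
    // and whose first field is non-null; x(0) on an empty slice would throw.
    val rddWithOutNull2 = rddSlices.filter(x => x.length >= schema.length && x(0) != null)

    val rddRow = rddWithOutNull2.map(x => Row(x(0)))
    val df = spark.createDataFrame(rddRow, schema)

    df.printSchema()
    println(df.count())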

Why do we need a coarse quantizer?

Deadly submitted on 2019-12-08 07:29:24
Question: In Product Quantization for Nearest Neighbor Search, when it comes to section IV.A, the authors say they will also use a coarse quantizer (which, the way I see it, is just a much smaller quantizer, smaller w.r.t. k, the number of centroids). I don't really get why this helps the search procedure, and the cause might be that I don't understand how they use it. Any ideas, please?
Answer 1: As mentioned in the NON EXHAUSTIVE SEARCH section, approximate nearest neighbor search with product
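For intuition, the paper's non-exhaustive scheme (IVF-ADC) uses the coarse quantizer to partition the database into k' Voronoi cells, and the product quantizer then only encodes the residual between a vector and its cell centroid; at query time only a few cells are scanned instead of the whole database. A rough Scala sketch of the indexing side under those assumptions (plain arrays, no real PQ codebook training, purely illustrative):

    object IvfSketch {
      type Vec = Array[Float]

      def sub(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

      def dist2(a: Vec, b: Vec): Float =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      // Coarse quantization: index of the nearest of the k' coarse centroids.
      def coarseAssign(v: Vec, coarseCentroids: Array[Vec]): Int =
        coarseCentroids.indices.minBy(i => dist2(v, coarseCentroids(i)))

      // Indexing one vector: put it in the inverted list of its coarse cell,
      // keeping only the PQ code of the residual (vector minus cell centroid).
      def index(v: Vec,
                coarseCentroids: Array[Vec],
                pqEncode: Vec => Array[Byte]): (Int, Array[Byte]) = {
        val cell = coarseAssign(v, coarseCentroids)
        val residual = sub(v, coarseCentroids(cell))
        (cell, pqEncode(residual))
      }
    }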

Sqoop import error - File does not exist:

徘徊边缘 submitted on 2019-12-08 07:28:44
Question: I am trying to import data from MySQL into HDFS using Sqoop, but I am getting the following error. How can I solve this?
Command:
    sqoop import --connect jdbc:mysql://localhost/testDB --username root --password password --table student --m 1
Error:
    ERROR tool.ImportTool: Encountered IOException running import job: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/usr/lib/sqoop/lib/parquet-format-2.0.0.jar at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall

Less-than comparison for a date in a Spark Scala RDD

 ̄綄美尐妖づ submitted on 2019-12-08 07:16:58
Question: I want to print the data of employees who joined before 1991. Below is my sample data:
    69062,FRANK,ANALYST,5646,1991-12-03,3100.00,,2001
    63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001
Initial RDD for loading the data:
    val rdd = sc.textFile("file:////home/hduser/Desktop/Employees/employees.txt").filter(p => { p != null && p.trim.length > 0 })
UDF for converting the string column to a date column:
    def convertStringToDate(s: String): Date = {
      val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
      dateFormat.parse(s)
    }
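Continuing in that style, one hedged way to finish the filter, assuming the hire date is the fifth comma-separated field (as in the sample rows) and reusing the question's convertStringToDate helper:

    import java.text.SimpleDateFormat
    import java.util.Date

    def convertStringToDate(s: String): Date = {
      val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
      dateFormat.parse(s)
    }

    val cutoff = convertStringToDate("1991-01-01")

    // Field 4 (0-based) is the hire date in the sample records.
    val joinedBefore1991 = rdd
      .map(_.split(","))
      .filter(fields => fields.length > 4 && convertStringToDate(fields(4)).before(cutoff))
      .map(_.mkString(","))

    joinedBefore1991.collect().foreach(println)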

Distinct values of a key in a sub-document MongoDB (100 million records)

怎甘沉沦 submitted on 2019-12-08 07:02:26
Question: I have 100 million records in my "sample" collection. I want another collection containing all of the distinct user names ("user.screen_name"). The documents in the "sample" collection of my MongoDB database have the following structure:
    {
      "_id" : ObjectId("515af34297c2f607b822a54b"),
      "text" : "random text goes here",
      "user" : {
        "id" : 972863366,
        "screen_name" : "xname",
        "verified" : false,
        "time_zone" : "Amsterdam",
      }
    }
When I try things like distinct('user.id').length I get the following error: "errmsg" :
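At 100 million documents a plain distinct commonly hits MongoDB's 16 MB result-size limit, which is the usual reason this kind of query errors out; a standard workaround is an aggregation that groups on the field and writes the result to another collection with $out. A hedged sketch using the MongoDB Java sync driver from Scala; the connection string, database name, and output collection name are placeholders:

    import com.mongodb.client.MongoClients
    import com.mongodb.client.model.Aggregates
    import java.util.Arrays

    object DistinctScreenNames {
      def main(args: Array[String]): Unit = {
        // Hypothetical connection string and database name -- adjust to your setup.
        val client = MongoClients.create("mongodb://localhost:27017")
        val sample = client.getDatabase("test").getCollection("sample")

        // One result document per distinct user.screen_name, written to a new
        // collection instead of being returned to the client in a single result.
        sample.aggregate(Arrays.asList(
            Aggregates.group("$user.screen_name"),
            Aggregates.out("distinct_screen_names")
          ))
          .allowDiskUse(true)  // grouping 100M documents may spill to disk
          .toCollection()      // forces the $out pipeline to run

        client.close()
      }
    }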