spark-streaming

Print parquet schema using Spark Streaming

Submitted by 独自空忆成欢 on 2019-12-13 03:33:09
Question: Following is an extract of the Scala code written to read Parquet files and print the schema and the first few records from them, but nothing is getting printed.

    val batchDuration = 2
    val inputDir = "file:///home/samplefiles"
    val conf = new SparkConf().setAppName("gpParquetStreaming").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("spark.streaming.fileStream.minRememberDuration", "600000")
    val ssc = new StreamingContext(sc, Seconds(batchDuration))
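One common reason nothing is printed with a file-based DStream is that it only processes files that appear in the directory while the stream is running (the remember window only stretches that slightly). For simply checking the schema and a few rows, a plain batch read is enough; a minimal sketch, assuming Spark 1.x and that the directory contains valid .parquet files:

    import org.apache.spark.sql.SQLContext

    // Reusing the SparkContext from the snippet above for a one-off batch read.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet(inputDir)
    df.printSchema()   // prints the Parquet schema
    df.show(5)         // prints the first few records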

Apache Spark: YARN logs analysis

Submitted by 心不动则不痛 on 2019-12-13 02:26:06
Question: I have a Spark Streaming application and I want to analyse the job's logs using Elasticsearch/Kibana. My job runs on a YARN cluster, so the logs are written to HDFS because I have set yarn.log-aggregation-enable to true. But when I try to do this:

    hadoop fs -cat ${yarn.nodemanager.remote-app-log-dir}/${user.name}/logs/<application ID>

I see some encrypted/compressed data. What file format is this? How can I read the logs from this file? Can I use Logstash to read it?
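For context, aggregated YARN logs are not plain text: on Hadoop 2.x the NodeManagers write them into Hadoop's binary TFile container format, which is why hadoop fs -cat shows unreadable bytes. The usual way to get a readable dump is the YARN CLI, e.g. yarn logs -applicationId <application ID>, whose plain-text output can then be fed to Logstash or another shipper for Elasticsearch/Kibana.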

Spark Streaming: when does the job fail after multiple task retries?

Submitted by 匆匆过客 on 2019-12-13 02:18:51
Question: I am running a Spark Streaming job that streams data from HDFS. The job fails frequently, once or twice a day, showing multiple errors in the log files. I want to know under which conditions, and after how many retries, a Spark Streaming job actually fails/exits. Exception in the YARN log:

    16/05/10 02:22:35 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks (after 3 retries)
    java.io.IOException: Failed to connect to spark-prod-02-w-8.c.orion-0010.internal/10.240…
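The thresholds that decide when retries give up are configurable rather than fixed; a rough sketch of the settings that typically govern this on YARN (the values shown are illustrative, not the poster's configuration):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // A stage (and with it the job) is aborted once a single task has failed this many times.
      .set("spark.task.maxFailures", "4")
      // On YARN, the application is failed after this many executor failures.
      .set("spark.yarn.max.executor.failures", "8")
      // How many times YARN restarts the application master before giving up.
      .set("spark.yarn.maxAppAttempts", "2")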

Spark Streaming with Python - class not found exception

Submitted by 走远了吗. on 2019-12-13 02:12:53
Question: I'm working on a project to bulk load data from a CSV file into HBase using Spark Streaming. The code I'm using is as follows (adapted from here):

    def bulk_load(rdd):
        conf = {#removed for brevity}
        keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
        valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
        load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
            .flatMap(csv_to_key_value)
        load_rdd.saveAsNewAPIHadoopDataset(conf…
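Judging by the title, the class that cannot be found is most likely one of the org.apache.spark.examples.pythonconverters classes referenced above; those converters ship in the Spark examples jar, which is not on the classpath by default. A common fix is to pass that jar explicitly when submitting, e.g. via spark-submit's --jars option pointing at the spark-examples jar of the installed Spark distribution (the exact path depends on the installation).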

Queries with streaming sources must be executed with writeStream.start();;

Submitted by 旧城冷巷雨未停 on 2019-12-12 20:37:41
Question: I am trying to read data from Kafka using Spark Structured Streaming and to make predictions on the incoming data. I'm using a model which I have trained using Spark ML.

    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .master("local")
      .getOrCreate()
    import spark.implicits._
    val toString = udf((payload: Array[Byte]) => new String(payload))
    val sentenceDataFrame = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topicname1")
      .load(…
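The error in the title is thrown when a batch-style action such as show() or collect() is called on a streaming DataFrame; with Structured Streaming the query has to be materialised through a sink started via writeStream. A minimal sketch of scoring the stream and printing results to the console (the model variable and column names are illustrative assumptions, not from the original post):

    // Kafka delivers the payload in the binary `value` column; cast it to a string
    // column the ML pipeline can consume.
    val sentences = sentenceDataFrame.selectExpr("CAST(value AS STRING) AS sentence")

    // `model` is assumed to be a previously fitted PipelineModel, e.g. loaded with
    // PipelineModel.load("/path/to/model"); transform() also works on streaming DataFrames.
    val predictions = model.transform(sentences)

    val query = predictions.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()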

Inappropriate output while creating a DataFrame

Submitted by 独自空忆成欢 on 2019-12-12 19:54:04
Question: I'm trying to stream data from a Kafka topic using a Scala application. I'm able to get the data from the topic, but how do I create a DataFrame out of it? Here is the data (in string, string format):

    { "action": "AppEvent", "tenantid": 298, "lat": 0.0, "lon": 0.0, "memberid": 16390, "event_name": "CATEGORY_CLICK", "productUpccd": 0, "device_type": "iPhone", "device_os_ver": "10.1", "item_name": "CHICKEN" }

I tried a few ways to do it, but it is not yielding satisfactory results.

    +------------------…
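One common pattern for turning JSON strings received from Kafka into a DataFrame is to let Spark SQL infer the schema from each micro-batch inside foreachRDD; a minimal sketch, assuming messages is a DStream[String] holding the JSON payloads (the variable name is illustrative):

    import org.apache.spark.sql.SparkSession

    messages.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val spark = SparkSession.builder().getOrCreate()
        // The schema (action, tenantid, lat, lon, ...) is inferred from the JSON keys.
        val df = spark.read.json(rdd)
        df.show(5, truncate = false)
      }
    }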

Spark Streaming: read a Kafka stream and provide it as an RDD for further processing

Submitted by 别来无恙 on 2019-12-12 19:06:29
Question: I currently have the following setup: an application writes data to Kafka -> Spark Streaming reads the stored data (always reading from the earliest entry) and applies conversions to the stream -> the application needs an RDD of this result to train an MLlib model. I basically want to achieve something similar to https://github.com/keiraqz/anomaly-detection - but my data does not come from a file, it comes from Kafka and needs some preprocessing in Spark to extract the training data from the input. Reading the data…
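Each micro-batch of a DStream is already exposed as an RDD through foreachRDD, so the RDD-based MLlib trainers can be called from there directly; a minimal sketch, assuming the stream has already been mapped to MLlib feature Vectors (the stream name and the KMeans parameters are illustrative):

    import org.apache.spark.mllib.clustering.KMeans

    // featureStream: DStream[org.apache.spark.mllib.linalg.Vector]
    featureStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Train (or re-train) on the current batch; cache because KMeans iterates over the data.
        val model = KMeans.train(rdd.cache(), 2, 20)   // k = 2 clusters, 20 iterations
        println(model.clusterCenters.mkString(", "))
      }
    }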

Spark Streaming application gives OOM after running for 24 hours

Submitted by 瘦欲@ on 2019-12-12 16:28:29
Question: I am using Spark 1.5.0 and working on a Spark Streaming application. The application reads files from HDFS, converts each RDD into a DataFrame and executes multiple queries on each DataFrame. The application runs perfectly for around 24 hours and then crashes. The application master logs / driver logs show:

    Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class…
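A driver-side OOM that only shows up after many hours usually points to state that accumulates over time (for example job/stage/batch metadata kept for the web UI) rather than to one oversized batch; besides raising the driver memory via spark-submit's --driver-memory, the amount of retained metadata can be capped. A sketch of the relevant settings (the values are illustrative, not a diagnosis of this particular crash):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Limit how much job/stage/batch metadata the driver keeps for the web UI.
      .set("spark.ui.retainedJobs", "200")
      .set("spark.ui.retainedStages", "200")
      .set("spark.streaming.ui.retainedBatches", "200")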

NoSuchMethodError while running Spark Streaming job on HDP 2.2

Submitted by 淺唱寂寞╮ on 2019-12-12 16:25:54
Question: I am trying to run a simple streaming job on the HDP 2.2 Sandbox but am facing a java.lang.NoSuchMethodError. I am able to run the SparkPi example on this machine without an issue. The following are the versions I am using:

    <kafka.version>0.8.2.0</kafka.version>
    <twitter4j.version>4.0.2</twitter4j.version>
    <spark-version>1.2.1</spark-version>
    <scala.version>2.11</scala.version>

Code snippet:

    val sparkConf = new SparkConf().setAppName("TweetSenseKafkaConsumer").setMaster("yarn-cluster");
    val ssc = new…
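A NoSuchMethodError in this situation usually indicates a binary-compatibility mismatch rather than a coding error: the Spark 1.2.x shipped with HDP 2.2 is built against Scala 2.10, so a job compiled with <scala.version>2.11</scala.version> and _2.11 Spark/Kafka artifacts ends up calling methods whose signatures do not exist in the cluster's jars. Aligning the build on Scala 2.10 with the matching _2.10 artifacts (or running against a Spark build that matches the project's Scala version) is the usual remedy.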

What is Starvation scenario in Spark streaming?

Submitted by ≯℡__Kan透↙ on 2019-12-12 16:18:21
Question: In the famous word-count example for Spark Streaming, the Spark configuration object is initialised as follows:

    /* Create a local StreamingContext with two working threads and a batch interval of 1 second.
       The master requires 2 cores to prevent a starvation scenario. */
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("WordCount")

Here, if I change the master from local[2] to local, or do not set the master at all, I do not get the expected output; in fact the word counting doesn't…
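The starvation comes from how receiver-based input streams use threads: each receiver permanently occupies one core/thread of its own. With setMaster("local") there is a single thread, the receiver takes it, and no thread is left to process the batches, so data keeps arriving but no word counts are ever computed or printed. With local[2] one thread runs the receiver and the other runs the processing; in general a local master needs local[n] with n larger than the number of receivers.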