apache-spark

Hadoop Capacity Scheduler and Spark

╄→гoц情女王★ Submitted on 2021-02-20 04:22:05
Question: If I define CapacityScheduler queues in YARN as explained here http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html, how do I make Spark use this? I want to run Spark jobs, but they should not take up the whole cluster; instead they should execute in a CapacityScheduler queue that has a fixed set of resources allocated to it. Is that possible, specifically on the Cloudera platform (given that Spark on Cloudera runs on YARN)? Answer 1: You should configure the
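The answer is truncated above. Independent of whatever it goes on to say, the documented knobs for targeting a YARN queue from Spark are the spark.yarn.queue setting or spark-submit's --queue flag. A minimal PySpark sketch, assuming a queue named myqueue is already defined in capacity-scheduler.xml and HADOOP_CONF_DIR points at the cluster config:

    # Sketch only: submit to a specific CapacityScheduler queue.
    # "myqueue" is a placeholder for whatever queue is defined in capacity-scheduler.xml.
    # Equivalent spark-submit flag: --queue myqueue
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("queued-job")
             .master("yarn")                      # assumes HADOOP_CONF_DIR is set
             .config("spark.yarn.queue", "myqueue")
             .getOrCreate())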

Spark Streaming not reading from Kafka topics

孤街醉人 Submitted on 2021-02-20 02:49:25
Question: I have set up Kafka and Spark on Ubuntu. I am trying to read Kafka topics through Spark Streaming using PySpark (Jupyter notebook). Spark is neither reading the data nor throwing any error. The Kafka producer and consumer are communicating with each other in the terminal. Kafka is configured with 3 partitions on ports 9092, 9093 and 9094, and messages are getting stored in the Kafka topics. Now I want to read them through Spark Streaming. I am not sure what I am missing. I have even searched for it on the internet, but
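The excerpt ends before the poster's code. For comparison, here is a minimal Structured Streaming read from Kafka in PySpark; the topic name and broker list are placeholders, and the spark-sql-kafka connector must be on the classpath (e.g. via --packages when launching pyspark):

    # Sketch with placeholder topic/broker names; requires the
    # spark-sql-kafka-0-10 package matching your Spark/Scala version.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-read").getOrCreate()

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094")
              .option("subscribe", "mytopic")
              .option("startingOffsets", "earliest")
              .load())

    query = (stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
             .writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()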

How to detect Parquet files?

家住魔仙堡 Submitted on 2021-02-20 02:13:40
Question: I have a script I am writing that will use either plain text or Parquet files. If it is a Parquet file, it will read it in using a DataFrame and do a few other things. On the cluster I am working on, the first solution was the easiest: check whether the extension of the file was .parquet:

    if (parquetD(1) == "parquet") {
      if (args.length != 2) {
        println(usage2)
        System.exit(1)
        println(args)
      }
    }

and it would read it in with the DataFrame. The problem is that I have a bunch of files some people have created with no
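The listing cuts off, but for the general problem of detecting Parquet data without relying on the extension, one option (not from the post, and sketched in Python rather than the question's Scala) is to look for Parquet's 4-byte magic marker at the start of the file:

    # Sketch: identify Parquet by its magic bytes instead of the file extension.
    # Parquet files start (and end) with the 4-byte marker b"PAR1". This reads a
    # local path; on HDFS the first bytes would have to be read through the
    # Hadoop FileSystem API instead. The path here is hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def looks_like_parquet(path):
        with open(path, "rb") as f:
            return f.read(4) == b"PAR1"

    path = "/tmp/somefile"  # hypothetical path
    df = spark.read.parquet(path) if looks_like_parquet(path) else spark.read.text(path)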

How to skip the first and last line of a .dat file and turn it into a dataframe using Scala in Databricks

老子叫甜甜 Submitted on 2021-02-19 08:59:30
Question:

    H|*|D|*|PA|*|BJ|*|S|*|2019.05.27 08:54:24|##|
    H|*|AP_ATTR_ID|*|AP_ID|*|OPER_ID|*|ATTR_ID|*|ATTR_GROUP|*|LST_UPD_USR|*|LST_UPD_TSTMP|##|
    779045|*|Sar|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|128|*|2019.05.14 16:48:16|##|
    779048|*|KK|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|116|*|2019.05.14 16:59:02|##|
    779054|*|Nisha - A|*|EXACT|*|CustomColumnRow120|*|2|*|1165|*|2019.05.15 12:11:48|##|
    T|*||*|2019.05.27 08:54:28|##|

The file name is PA.dat. I need to skip the first line and also the last line of the
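The question asks for Scala on Databricks; the usual approach is to index the lines, drop the header and trailer by index, split on the |*| delimiter, and build a DataFrame. A PySpark sketch of that idea, with a hypothetical path; the column names are pulled from the file's second line as shown in the sample:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    lines = spark.sparkContext.textFile("/path/PA.dat")  # hypothetical location

    indexed = lines.zipWithIndex()          # (line, index) pairs
    n = indexed.count()

    def parse(line):
        # drop the trailing record delimiter, then split on the field delimiter
        if line.endswith("|##|"):
            line = line[:-4]
        return line.split("|*|")

    # The file's second line carries the column names (after the leading "H" marker).
    header = indexed.filter(lambda p: p[1] == 1).map(lambda p: parse(p[0])).first()
    cols = header[1:]

    data = (indexed
            .filter(lambda p: p[1] > 1 and p[1] != n - 1)   # skip the H lines and the T trailer
            .map(lambda p: parse(p[0])))

    df = data.toDF(cols)
    df.show(truncate=False)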

How can I run Spark in headless mode with my custom version on HDP?

自古美人都是妖i Submitted on 2021-02-19 08:26:32
Question: How can I run Spark in headless mode? Currently I am executing Spark on an HDP 2.6.4 cluster (i.e. Spark 2.2 is installed by default). I have downloaded a Spark 2.4.1 Scala 2.11 release in headless mode (i.e. no Hadoop jars are built in) from https://spark.apache.org/downloads.html. The exact name is: pre-built with Scala 2.11 and user-provided Hadoop. Now, when trying to run it, I follow https://spark.apache.org/docs/latest/hadoop-provided.html: export SPARK_DIST_CLASSPATH=$(hadoop classpath)
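The excerpt stops after the classpath export. As a quick sanity check (not part of the original post), one can verify that a "Hadoop free" build actually picked up the cluster's Hadoop jars; the sketch below assumes SPARK_DIST_CLASSPATH was exported in the shell that launches it, and uses PySpark's internal _jvm gateway to ask the JVM for its Hadoop version:

    # Check that the headless Spark build sees the cluster's Hadoop jars.
    # Assumes: export SPARK_DIST_CLASSPATH=$(hadoop classpath) was run first.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("headless-check").getOrCreate()
    print("Spark version :", spark.version)
    # Hadoop version as seen by the JVM; if this fails, the Hadoop jars
    # were not placed on the classpath.
    print("Hadoop version:", spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
    spark.stop()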

Error with spark Row.fromSeq for a text file

久未见 Submitted on 2021-02-19 08:25:07
Question:

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql._

    object fixedLength {
      def main(args: Array[String]) {
        def getRow(x: String): Row = {
          val columnArray = new Array[String](4)
          columnArray(0) = x.substring(0, 3)
          columnArray(1) = x.substring(3, 13)
          columnArray(2) = x.substring(13, 18)
          columnArray(3) = x.substring(18, 22)
          Row.fromSeq(columnArray)
        }
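The excerpt breaks off inside getRow, so the actual error is not shown here. For comparison, a PySpark sketch of the same fixed-width parsing (column widths 3/10/5/4, matching the substrings above); the input path and column names are placeholders:

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.getOrCreate()

    def get_row(line):
        # same fixed-width slices as the Scala getRow above
        return Row(c0=line[0:3], c1=line[3:13], c2=line[13:18], c3=line[18:22])

    df = (spark.sparkContext
          .textFile("/path/fixed_width.txt")   # hypothetical input
          .map(get_row)
          .toDF())
    df.show()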

Pyspark Schema for Json file

你。 Submitted on 2021-02-19 08:14:06
Question: I am trying to read a complex JSON file into a Spark DataFrame. Spark recognizes the schema but mistakes a field for a string when it happens to be an empty array. (Not sure why it is String type when it has to be an array type.) Below is a sample of what I am expecting: arrayfield:[{"name":"somename"},{"address" : "someadress"}] Right now the data is as below: arrayfield:[] What this does to my code is that whenever I try querying arrayfield.name it fails. I know I can input a schema while reading
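The excerpt ends right where the explicit-schema idea comes up. A sketch of that approach (field names taken from the sample above, file path assumed), declaring arrayfield as an array of structs so that an empty array keeps its type instead of being inferred as a string:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    spark = SparkSession.builder.getOrCreate()

    # Explicit schema: arrayfield is an array of structs even when the data is [].
    schema = StructType([
        StructField("arrayfield", ArrayType(
            StructType([
                StructField("name", StringType(), True),
                StructField("address", StringType(), True),
            ])
        ), True)
    ])

    df = spark.read.schema(schema).json("/path/data.json")  # hypothetical path
    df.select("arrayfield.name").show()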

java.lang.OutOfMemoryError in rdd.collect() when all memory settings are set to huge

被刻印的时光 ゝ Submitted on 2021-02-19 08:09:05
Question: I run the following Python script with spark-submit:

    r = rdd.map(list).groupBy(lambda x: x[0]).map(lambda x: x[1]).map(list)
    r_labeled = r.map(f_0).flatMap(f_1)
    r_labeled.map(lambda x: x[3]).collect()

It gets java.lang.OutOfMemoryError, specifically on the collect() action of the last line:

    java.lang.OutOfMemoryError
        at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
        at java.io
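Not from the original post, but relevant to the error: collect() pulls the entire result into the driver, which is what overflows here even when executor memory is large. A sketch of the usual alternatives, with a toy RDD standing in for the question's r_labeled and a hypothetical output path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Toy stand-in for the question's r_labeled RDD.
    r_labeled = sc.parallelize([("a", 1, 2, "x"), ("b", 3, 4, "y")] * 1000)

    # Instead of collect() on everything:
    preview = r_labeled.map(lambda x: x[3]).take(20)                  # only 20 elements reach the driver
    r_labeled.map(lambda x: x[3]).saveAsTextFile("/tmp/labels_out")   # full result written by the executors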

Check for empty rows within a Spark dataframe?

梦想与她 Submitted on 2021-02-19 07:55:06
Question: I am running over several CSV files and trying to do some checks, and for some reason for one file I am getting a NullPointerException; I suspect that there are some empty rows. So I am running the following, and for some reason it gives me an OK output:

    check_empty = lambda row: not any([False if k is None else True for k in row])
    check_empty_udf = sf.udf(check_empty, BooleanType())
    df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()

I am missing
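Not the original poster's code: a UDF-free way to find rows where every column is null, built from plain column expressions. A toy DataFrame with one all-null row stands in for the question's df:

    from functools import reduce
    from pyspark.sql import SparkSession, functions as sf

    spark = SparkSession.builder.getOrCreate()

    # Toy DataFrame with one completely empty row.
    df = spark.createDataFrame([("a", 1), (None, None)], ["name", "value"])

    # Rows where every column is null (no UDF needed).
    all_null = reduce(lambda a, b: a & b, [sf.col(c).isNull() for c in df.columns])
    df.filter(all_null).show()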