apache-spark

Hadoop Capacity Scheduler and Spark

╄→гoц情女王★ Submitted on 2021-02-20 04:22:05
Question: If I define CapacityScheduler queues in YARN as explained here http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html, how do I make Spark use this? I want to run Spark jobs, but they should not take up the whole cluster; instead they should execute in a CapacityScheduler queue that has a fixed set of resources allocated to it. Is that possible, specifically on the Cloudera platform (given that Spark on Cloudera runs on YARN)? Answer 1: You should configure the
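The answer is truncated above. Independent of whatever it goes on to say, the documented knobs for targeting a YARN queue from Spark are the spark.yarn.queue setting or spark-submit's --queue flag. A minimal PySpark sketch, assuming a queue named myqueue is already defined in capacity-scheduler.xml and HADOOP_CONF_DIR points at the cluster config:

    # Sketch only: submit to a specific CapacityScheduler queue.
    # "myqueue" is a placeholder for whatever queue is defined in capacity-scheduler.xml.
    # Equivalent spark-submit flag: --queue myqueue
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("queued-job")
             .master("yarn")                      # assumes HADOOP_CONF_DIR is set
             .config("spark.yarn.queue", "myqueue")
             .getOrCreate())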

Spark Streaming not reading from Kafka topics

孤街醉人 Submitted on 2021-02-20 02:49:25
Question: I have set up Kafka and Spark on Ubuntu. I am trying to read Kafka topics through Spark Streaming using PySpark (Jupyter notebook). Spark is neither reading the data nor throwing any error. The Kafka producer and consumer are communicating with each other in the terminal. Kafka is configured with 3 partitions on ports 9092, 9093 and 9094, and messages are getting stored in the Kafka topics. Now I want to read them through Spark Streaming. I am not sure what I am missing. I have even searched for it on the internet, but
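The excerpt ends before the poster's code. For comparison, here is a minimal Structured Streaming read from Kafka in PySpark; the topic name and broker list are placeholders, and the spark-sql-kafka connector must be on the classpath (e.g. via --packages when launching pyspark):

    # Sketch with placeholder topic/broker names; requires the
    # spark-sql-kafka-0-10 package matching your Spark/Scala version.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-read").getOrCreate()

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094")
              .option("subscribe", "mytopic")
              .option("startingOffsets", "earliest")
              .load())

    query = (stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
             .writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()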

How to detect Parquet files?

家住魔仙堡 Submitted on 2021-02-20 02:13:40
Question: I have a script I am writing that will use either plain text or Parquet files. If it is a Parquet file, it will read it in using a DataFrame and do a few other things. On the cluster I am working on, the first solution was the easiest: check whether the extension of the file was .parquet:

    if (parquetD(1) == "parquet") {
      if (args.length != 2) {
        println(usage2)
        System.exit(1)
        println(args)
      }
    }

and it would read it in with the DataFrame. The problem is that I have a bunch of files some people have created with no
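The listing cuts off, but for the general problem of detecting Parquet data without relying on the extension, one option (not from the post, and sketched in Python rather than the question's Scala) is to look for Parquet's 4-byte magic marker at the start of the file:

    # Sketch: identify Parquet by its magic bytes instead of the file extension.
    # Parquet files start (and end) with the 4-byte marker b"PAR1". This reads a
    # local path; on HDFS the first bytes would have to be read through the
    # Hadoop FileSystem API instead. The path here is hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def looks_like_parquet(path):
        with open(path, "rb") as f:
            return f.read(4) == b"PAR1"

    path = "/tmp/somefile"  # hypothetical path
    df = spark.read.parquet(path) if looks_like_parquet(path) else spark.read.text(path)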

How to skip the first and last line of a .dat file and turn it into a dataframe using Scala in Databricks

老子叫甜甜 Submitted on 2021-02-19 08:59:30
Question:

    H|*|D|*|PA|*|BJ|*|S|*|2019.05.27 08:54:24|##|
    H|*|AP_ATTR_ID|*|AP_ID|*|OPER_ID|*|ATTR_ID|*|ATTR_GROUP|*|LST_UPD_USR|*|LST_UPD_TSTMP|##|
    779045|*|Sar|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|128|*|2019.05.14 16:48:16|##|
    779048|*|KK|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|116|*|2019.05.14 16:59:02|##|
    779054|*|Nisha - A|*|EXACT|*|CustomColumnRow120|*|2|*|1165|*|2019.05.15 12:11:48|##|
    T|*||*|2019.05.27 08:54:28|##|

The file name is PA.dat. I need to skip the first line and also the last line of the
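The question asks for Scala on Databricks; the usual approach is to index the lines, drop the header and trailer by index, split on the |*| delimiter, and build a DataFrame. A PySpark sketch of that idea, with a hypothetical path; the column names are pulled from the file's second line as shown in the sample:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    lines = spark.sparkContext.textFile("/path/PA.dat")  # hypothetical location

    indexed = lines.zipWithIndex()          # (line, index) pairs
    n = indexed.count()

    def parse(line):
        # drop the trailing record delimiter, then split on the field delimiter
        if line.endswith("|##|"):
            line = line[:-4]
        return line.split("|*|")

    # The file's second line carries the column names (after the leading "H" marker).
    header = indexed.filter(lambda p: p[1] == 1).map(lambda p: parse(p[0])).first()
    cols = header[1:]

    data = (indexed
            .filter(lambda p: p[1] > 1 and p[1] != n - 1)   # skip the H lines and the T trailer
            .map(lambda p: parse(p[0])))

    df = data.toDF(cols)
    df.show(truncate=False)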

How can I run Spark in headless mode with my custom version on HDP?

自古美人都是妖i Submitted on 2021-02-19 08:26:32
Question: How can I run Spark in headless mode? Currently I am executing Spark on an HDP 2.6.4 cluster (i.e. Spark 2.2 is installed by default). I have downloaded a Spark 2.4.1 Scala 2.11 release in headless mode (i.e. no Hadoop jars are built in) from https://spark.apache.org/downloads.html. The exact name is: pre-built with Scala 2.11 and user-provided Hadoop. Now, when trying to run it, I follow https://spark.apache.org/docs/latest/hadoop-provided.html: export SPARK_DIST_CLASSPATH=$(hadoop classpath)
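The excerpt stops after the classpath export. As a quick sanity check (not part of the original post), one can verify that a "Hadoop free" build actually picked up the cluster's Hadoop jars; the sketch below assumes SPARK_DIST_CLASSPATH was exported in the shell that launches it, and uses PySpark's internal _jvm gateway to ask the JVM for its Hadoop version:

    # Check that the headless Spark build sees the cluster's Hadoop jars.
    # Assumes: export SPARK_DIST_CLASSPATH=$(hadoop classpath) was run first.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("headless-check").getOrCreate()
    print("Spark version :", spark.version)
    # Hadoop version as seen by the JVM; if this fails, the Hadoop jars
    # were not placed on the classpath.
    print("Hadoop version:", spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
    spark.stop()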

Error with spark Row.fromSeq for a text file

久未见 Submitted on 2021-02-19 08:25:07
Question:

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql._

    object fixedLength {
      def main(args: Array[String]) {
        def getRow(x: String): Row = {
          val columnArray = new Array[String](4)
          columnArray(0) = x.substring(0, 3)
          columnArray(1) = x.substring(3, 13)
          columnArray(2) = x.substring(13, 18)
          columnArray(3) = x.substring(18, 22)
          Row.fromSeq(columnArray)
        }
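The excerpt breaks off inside getRow, so the actual error is not shown here. For comparison, a PySpark sketch of the same fixed-width parsing (column widths 3/10/5/4, matching the substrings above); the input path and column names are placeholders:

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.getOrCreate()

    def get_row(line):
        # same fixed-width slices as the Scala getRow above
        return Row(c0=line[0:3], c1=line[3:13], c2=line[13:18], c3=line[18:22])

    df = (spark.sparkContext
          .textFile("/path/fixed_width.txt")   # hypothetical input
          .map(get_row)
          .toDF())
    df.show()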

Pyspark Schema for Json file

你。 Submitted on 2021-02-19 08:14:06
Question: I am trying to read a complex JSON file into a Spark DataFrame. Spark recognizes the schema but mistakes a field for a string when it happens to be an empty array. (Not sure why it is String type when it has to be an array type.) Below is a sample of what I am expecting: arrayfield:[{"name":"somename"},{"address" : "someadress"}] Right now the data is as below: arrayfield:[] What this does to my code is that whenever I try querying arrayfield.name it fails. I know I can input a schema while reading
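The excerpt ends right where the explicit-schema idea comes up. A sketch of that approach (field names taken from the sample above, file path assumed), declaring arrayfield as an array of structs so that an empty array keeps its type instead of being inferred as a string:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    spark = SparkSession.builder.getOrCreate()

    # Explicit schema: arrayfield is an array of structs even when the data is [].
    schema = StructType([
        StructField("arrayfield", ArrayType(
            StructType([
                StructField("name", StringType(), True),
                StructField("address", StringType(), True),
            ])
        ), True)
    ])

    df = spark.read.schema(schema).json("/path/data.json")  # hypothetical path
    df.select("arrayfield.name").show()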

java.lang.OutOfMemoryError in rdd.collect() when all memory settings are set to huge

被刻印的时光 ゝ Submitted on 2021-02-19 08:09:05
Question: I run the following Python script with spark-submit:

    r = rdd.map(list).groupBy(lambda x: x[0]).map(lambda x: x[1]).map(list)
    r_labeled = r.map(f_0).flatMap(f_1)
    r_labeled.map(lambda x: x[3]).collect()

It gets java.lang.OutOfMemoryError, specifically on the collect() action of the last line:

    java.lang.OutOfMemoryError
        at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
        at java.io
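Not from the original post, but relevant to the error: collect() pulls the entire result into the driver, which is what overflows here even when executor memory is large. A sketch of the usual alternatives, with a toy RDD standing in for the question's r_labeled and a hypothetical output path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Toy stand-in for the question's r_labeled RDD.
    r_labeled = sc.parallelize([("a", 1, 2, "x"), ("b", 3, 4, "y")] * 1000)

    # Instead of collect() on everything:
    preview = r_labeled.map(lambda x: x[3]).take(20)                  # only 20 elements reach the driver
    r_labeled.map(lambda x: x[3]).saveAsTextFile("/tmp/labels_out")   # full result written by the executors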

Check for empty rows within a Spark dataframe?

梦想与她 Submitted on 2021-02-19 07:55:06
Question: I am running over several CSV files and trying to do some checks, and for some reason for one file I am getting a NullPointerException; I suspect that there are some empty rows. So I am running the following, and for some reason it gives me an OK output:

    check_empty = lambda row: not any([False if k is None else True for k in row])
    check_empty_udf = sf.udf(check_empty, BooleanType())
    df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()

I am missing
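Not the original poster's code: a UDF-free way to find rows where every column is null, built from plain column expressions. A toy DataFrame with one all-null row stands in for the question's df:

    from functools import reduce
    from pyspark.sql import SparkSession, functions as sf

    spark = SparkSession.builder.getOrCreate()

    # Toy DataFrame with one completely empty row.
    df = spark.createDataFrame([("a", 1), (None, None)], ["name", "value"])

    # Rows where every column is null (no UDF needed).
    all_null = reduce(lambda a, b: a & b, [sf.col(c).isNull() for c in df.columns])
    df.filter(all_null).show()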