pyspark

Error: Must specify a primary resource (JAR or Python or R file) - IPython notebook

风格不统一 submitted on 2019-12-22 04:59:17
Question: I am trying to run Apache Spark in an IPython Notebook, following this instruction (and all the advice in the comments) - link. But when I run IPython Notebook with this command:

ipython notebook --profile=pyspark

I get this error:

Error: Must specify a primary resource (JAR or Python or R file)

If I run pyspark in the shell, everything is OK. That means I have some trouble connecting Spark and IPython. By the way, this is my bash_profile:

export SPARK_HOME="$HOME/spark-1.4.0"
export PYSPARK_SUBMIT_ARGS='--conf
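A minimal sketch of the usual fix, assuming the error comes from a PYSPARK_SUBMIT_ARGS value that lacks a trailing pyspark-shell token (without it, spark-submit expects an explicit primary resource). The --conf value below is a placeholder; the same effect can be had by appending pyspark-shell to the export in bash_profile.

import os

# Placeholder settings; the key point is the trailing "pyspark-shell",
# which tells spark-submit that no separate primary resource is needed.
# Set this before pyspark is initialised by the notebook profile.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--conf spark.driver.memory=2g pyspark-shell"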

How to get the table name from Spark SQL Query [PySpark]?

早过忘川 submitted on 2019-12-22 04:49:08
Question: To get the table names from a SQL query such as

select * from table1 as t1 full outer join table2 as t2 on t1.id = t2.id

I found a solution in Scala (How to get table names from SQL query?):

def getTables(query: String): Seq[String] = {
  val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}

which gives me the correct table names when I iterate over the return
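A rough PySpark sketch of the same idea, assuming access to the JVM parser through the private _jsparkSession gateway and a Spark version where UnresolvedRelation exposes tableName (both assumptions, not a documented public API):

def get_tables(spark, query):
    # Parse the SQL text into an (unanalysed) logical plan via py4j.
    plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    leaves = plan.collectLeaves()          # Scala Seq of leaf plan nodes
    return [
        leaves.apply(i).tableName()
        for i in range(leaves.size())
        if leaves.apply(i).nodeName() == "UnresolvedRelation"
    ]

tables = get_tables(spark, "select * from table1 as t1 full outer join table2 as t2 on t1.id = t2.id")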

How can you calculate the size of an Apache Spark data frame using PySpark?

☆樱花仙子☆ submitted on 2019-12-22 04:33:08
Question: Is there a way to calculate the size in bytes of an Apache Spark DataFrame using PySpark?

Answer 1: Why don't you just cache the df, then look in the Spark UI under Storage and convert the units to bytes?

df.cache()

Source: https://stackoverflow.com/questions/38180140/how-can-you-calculate-the-size-of-an-apache-spark-data-frame-using-pyspark
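A hedged sketch of reading the same storage numbers programmatically instead of from the UI, assuming the private _jsc gateway and the developer-level getRDDStorageInfo API (availability may vary by Spark version):

df.cache()
df.count()   # materialise the cache so the storage numbers are populated

# Each RDDInfo exposes name, memSize and diskSize (both in bytes).
for rdd_info in spark.sparkContext._jsc.sc().getRDDStorageInfo():
    print(rdd_info.name(), rdd_info.memSize(), rdd_info.diskSize())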

How to add multiple columns using UDF?

Deadly submitted on 2019-12-22 03:48:46
Question: I want to add the return values of a UDF to an existing dataframe in separate columns. How do I achieve this in an efficient way? Here's an example of what I have so far.

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType

df = spark.createDataFrame([("Alive", 4)], ["Name", "Number"])
df.show(1)

+-----+------+
| Name|Number|
+-----+------+
|Alive|     4|
+-----+------+

def example(n):
    return [[n+2], [n-2]]

# schema =
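One common pattern, sketched under the assumption that the two return values should land in two new columns: have the UDF return a single struct and then split its fields out with select. The field names plus_two and minus_two are made up for illustration.

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("plus_two", IntegerType(), False),
    StructField("minus_two", IntegerType(), False),
])

# Return one struct (tuple) per row instead of a list of lists.
example_udf = udf(lambda n: (n + 2, n - 2), schema)

df = (df.withColumn("tmp", example_udf(col("Number")))
        .select("Name", "Number", "tmp.plus_two", "tmp.minus_two"))
df.show()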

Python spark extract characters from dataframe

假如想象 submitted on 2019-12-22 03:46:03
Question: I have a dataframe in Spark, something like this:

ID | Column
-- | ------
1  | STRINGOFLETTERS
2  | SOMEOTHERCHARACTERS
3  | ANOTHERSTRING
4  | EXAMPLEEXAMPLE

What I would like to do is extract the first 5 characters from the column plus the 8th character and create a new column, something like this:

ID | New Column
-- | ----------
1  | STRIN_F
2  | SOMEO_E
3  | ANOTH_S
4  | EXAMP_E

I can't use the following code, because the values in the columns differ, and I don't want to split on a specific
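A short sketch of one way to do this with the built-in substring and concat functions, assuming the source column is literally named Column:

from pyspark.sql.functions import concat, lit, substring

# First 5 characters, an underscore, then the 8th character (positions are 1-based).
df = df.withColumn(
    "New Column",
    concat(substring("Column", 1, 5), lit("_"), substring("Column", 8, 1)),
)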

How do I collect a single column in Spark?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-22 03:20:55
Question: I would like to perform an action on a single column. Unfortunately, after I transform that column, it is no longer part of the dataframe it came from but a Column object, and as such it cannot be collected. Here is an example:

df = sqlContext.createDataFrame([Row(array=[1,2,3])])
df['array'].collect()

This produces the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable

How can I use the collect() function on a
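A minimal sketch of the usual workaround: keep the projection as a one-column DataFrame via select(), which can be collected (the column name array is taken from the question):

rows = df.select("array").collect()           # list of Row objects
values = [r.array for r in rows]              # plain Python values: [[1, 2, 3]]

# Or go through the RDD if only the bare values are wanted:
values = df.select("array").rdd.map(lambda r: r[0]).collect()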

How to get csv on s3 with pyspark (No FileSystem for scheme: s3n)

大憨熊 submitted on 2019-12-22 01:30:05
Question: There are many similar questions on SO, but I simply cannot get this to work. I'm obviously missing something. I am trying to load a simple test csv file from my S3. Doing it locally, like below, works.

from pyspark.sql import SparkSession
from pyspark import SparkContext as sc

logFile = "sparkexamplefile.csv"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter
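A hedged sketch of one common fix: pull in the hadoop-aws package and read through the s3a scheme instead of s3n. The package version, credentials and bucket path below are placeholders and must match your own Spark/Hadoop build.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SimpleApp")
         # Placeholder version; pick the hadoop-aws release matching your Hadoop.
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
         .getOrCreate())

hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

logData = spark.read.text("s3a://your-bucket/sparkexamplefile.csv").cache()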

Why is my Spark streaming app so slow?

♀尐吖头ヾ submitted on 2019-12-22 01:06:01
Question: I have a cluster with 4 nodes: 3 Spark nodes and 1 Solr node. My CPU is 8-core, my memory is 32 GB, and my discs are SSDs. I use Cassandra as my database. My data volume is 22 GB after 6 hours, and I now have around 3.4 million rows, which should be read in under 5 minutes. But it already cannot complete the task in that amount of time. My future plan is to read 100 million rows in under 5 minutes. I am not sure what I can increase or do better to achieve this result now as well as to achieve my

What is the recommended way to distribute a scikit learn classifier in spark?

不羁的心 submitted on 2019-12-21 22:18:50
Question: I have built a classifier using scikit-learn and now I would like to use Spark to run predict_proba on a large dataset. I currently pickle the classifier once using:

import pickle
pickle.dump(clf, open('classifier.pickle', 'wb'))

and then in my Spark code I broadcast this pickle using sc.broadcast, so each cluster node has to load it in. This works, but the pickle is large (about 0.5 GB) and it seems very inefficient. Is there a better way to do this?

Answer 1: This
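For context, a minimal sketch of the broadcast-and-score pattern the question describes; features_rdd is a hypothetical RDD of numeric feature lists, and the pickle path is the one from the question.

import pickle

with open("classifier.pickle", "rb") as f:
    clf = pickle.load(f)
clf_bc = sc.broadcast(clf)                 # shipped to each executor once

def score_partition(rows):
    model = clf_bc.value                   # deserialised once per partition
    batch = list(rows)
    if batch:
        for probs in model.predict_proba(batch):
            yield probs.tolist()

# features_rdd is assumed to be an RDD of feature vectors (lists of floats).
probabilities = features_rdd.mapPartitions(score_partition)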

PySpark: Search For substrings in text and subset dataframe

落花浮王杯 submitted on 2019-12-21 22:00:30
Question: I am brand new to PySpark and want to translate my existing pandas / Python code to PySpark. I want to subset my dataframe so that only rows containing the specific key words I'm looking for in the 'original_problem' field are returned. Below is the Python code I tried in PySpark:

def pilot_discrep(input_file):
    df = input_file
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    df = df[df['original_problem'].str.contains('|'.join(searchfor))]
    return df

When I try to run the above, I get the following
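A short sketch of the PySpark equivalent, assuming original_problem is a string column: Column.rlike takes a regular expression, so the same '|'-joined keyword pattern carries over from the pandas version.

from pyspark.sql import functions as F

def pilot_discrep(df):
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    # Keep only rows whose original_problem matches any of the keywords.
    return df.where(F.col('original_problem').rlike('|'.join(searchfor)))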