pyspark

Read SAS sas7bdat data with Spark

与世无争的帅哥 submitted on 2020-06-16 11:53:06
Question: I have a SAS table and I am trying to read it with Spark. I tried to use https://github.com/saurfang/spark-sas7bdat, but I couldn't get it to work. Here is the code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.github.saurfang.sas.spark").load("my_table.sas7bdat")

It returns this error:

Py4JJavaError: An error occurred while calling o878.load. : java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark.
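This ClassNotFoundException usually means the spark-sas7bdat jar is not on the classpath. A minimal sketch of one way to attach it, assuming the saurfang:spark-sas7bdat:2.1.0-s_2.11 coordinates (the version and Scala suffix are an assumption; match them to your Spark build):

```python
from pyspark.sql import SparkSession

# Sketch: pull the spark-sas7bdat package when the session (and JVM) starts.
# The coordinates/version are an assumption; the same effect can be achieved
# with `spark-submit --packages saurfang:spark-sas7bdat:2.1.0-s_2.11`.
# Note: this config must be set before the JVM is launched, so it has no
# effect on an already-running session.
spark = (SparkSession.builder
         .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.1.0-s_2.11")
         .getOrCreate())

df = (spark.read
      .format("com.github.saurfang.sas.spark")
      .load("my_table.sas7bdat"))
df.show()
```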

Pandas scalar UDF failing, IllegalArgumentException

落爺英雄遲暮 submitted on 2020-06-16 07:58:16
Question: First off, I apologize if my issue is simple; I did spend a lot of time researching it. I am trying to set up a scalar Pandas UDF in a PySpark script as described here. Here is my code:

from pyspark import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SQLContext
sc.install_pypi_package("pandas")
import pandas as pd
sc.install_pypi_package("PyArrow")
df = spark.createDataFrame(
    [("a", 1, 0), ("a", -1, 42), ("b", 3, -1), ("b", 10, -2
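For reference, a self-contained scalar Pandas UDF looks roughly like the sketch below; the DataFrame and column names are made up for illustration and are not the ones from the question. A frequently reported cause of IllegalArgumentException with Spark 2.4.x is a PyArrow version at or above 0.15 (see the last entry on this page).

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 3)], ["key", "val"])  # hypothetical data

@pandas_udf(LongType(), PandasUDFType.SCALAR)
def plus_one(v: pd.Series) -> pd.Series:
    # Receives a pandas Series per Arrow batch and must return a Series
    # of the same length.
    return v + 1

df.withColumn("val_plus_one", plus_one(df["val"])).show()
```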

How do I download a large list of URLs in parallel in pyspark?

孤人 submitted on 2020-06-16 05:07:22
Question: I have an RDD containing 10000 URLs to be fetched.

list = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com', 'http://google.com', 'http://twitter.com']
urls = sc.parallelize(list)

I need to check which URLs are broken, and preferably fetch the results into a corresponding RDD in Python. I tried this:

import asyncio
import concurrent.futures
import requests
async def get(url):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        loop = asyncio.get_event_loop()
        futures = [
            loop.run_in_executor(
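Since each fetch is independent, one straightforward approach is to let Spark provide the parallelism and issue plain blocking requests from the executors. A sketch, assuming the requests library is installed on every executor:

```python
import requests

def check_partition(urls_in_partition):
    # Runs on an executor for each partition; yields (url, status) pairs,
    # with None for URLs that could not be reached at all.
    for url in urls_in_partition:
        try:
            status = requests.head(url, timeout=5, allow_redirects=True).status_code
        except requests.RequestException:
            status = None
        yield (url, status)

url_list = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com', 'http://google.com', 'http://twitter.com']
urls = sc.parallelize(url_list, 8)   # more partitions => more concurrent fetches
results = urls.mapPartitions(check_partition)
broken = results.filter(lambda pair: pair[1] is None or pair[1] >= 400)
print(broken.collect())
```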

How to import Delta Lake module in Zeppelin notebook and pyspark?

左心房为你撑大大i submitted on 2020-06-14 07:56:11
Question: I am trying to use Delta Lake in a Zeppelin notebook with pyspark, and it seems it cannot import the module successfully. e.g.

%pyspark
from delta.tables import *

It fails with the following error: ModuleNotFoundError: No module named 'delta'. However, there is no problem saving/reading the data frame using the delta format, and the module can be loaded successfully when using Scala Spark (%spark). Is there any way to use Delta Lake in Zeppelin and pyspark?

Answer 1: Finally managed to load it on zeppelin pyspark
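A hedged sketch (not the accepted answer verbatim) of the usual way to make the `delta` module importable from pyspark: the Python package ships inside the delta-core jar, so it becomes importable once that jar is on the interpreter's classpath. The coordinates, version, and path below are assumptions.

```python
%pyspark
# Sketch: in the Zeppelin Spark interpreter settings, add
#   spark.jars.packages = io.delta:delta-core_2.11:0.6.1
# (coordinates/version are an assumption; match them to your Spark/Scala
# version), then restart the interpreter. After that the import works:
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/tmp/delta-table")  # hypothetical Delta path
dt.toDF().show()
```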

Encountering “ WARN ProcfsMetricsGetter: Exception when trying to compute pagesize” error when running Spark

笑着哭i submitted on 2020-06-13 06:14:48
Question: I installed Spark, and when trying to run it I am getting the error:

WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped

Can someone help me with that?

Answer 1: The same problem occurred for me because the Python path was not added to the system environment. I added it to the environment and now it works perfectly. Add a PYTHONPATH environment variable with the value:

%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip:
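As a hedged illustration (not the exact accepted fix), the value the answer describes can also be assembled from inside Python; whether this silences the warning depends on the environment, and setting PYTHONPATH system-wide before launching Spark, as the answer says, is the safer route.

```python
import glob
import os
import sys

# Sketch: mirror the PYTHONPATH value from the answer by locating the
# pyspark sources and the bundled py4j zip under SPARK_HOME.
spark_home = os.environ.get("SPARK_HOME", r"C:\spark")  # hypothetical install path
sys.path.append(os.path.join(spark_home, "python"))
for py4j_zip in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
    sys.path.append(py4j_zip)
```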

Spark assign value if null to column (python)

匆匆过客 submitted on 2020-06-13 06:05:07
Question: Assume that I have the following data:

+--------------------+-----+--------------------+
|              values|count|             values2|
+--------------------+-----+--------------------+
|              aaaaaa|  249|                null|
|              bbbbbb|  166|                  b2|
|              cccccc| 1680|           something|
+--------------------+-----+--------------------+

If there is a null value in the values2 column, how do I assign the values column to it? The result should be:

+--------------------+-----+--------------------+
|              values|count|             values2|
+--------------------+-----+-
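A minimal sketch of the usual approach, using coalesce, which returns the first non-null argument per row:

```python
from pyspark.sql import functions as F

# Where values2 is null, fall back to the values column.
df = df.withColumn("values2", F.coalesce(F.col("values2"), F.col("values")))
df.show()
```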

Airflow/Luigi for AWS EMR automatic cluster creation and pyspark deployment

℡╲_俬逩灬. submitted on 2020-06-13 05:36:48
Question: I am new to Airflow automation. I don't know whether this is possible with Apache Airflow (or Luigi etc.), or whether I should just write a long bash file to do it. I want to build a DAG for this:

Create/clone a cluster on AWS EMR
Install python requirements
Install pyspark related libraries
Get latest code from github
Submit spark job
Terminate cluster on finish

For the individual steps, I can make .sh files like below (not sure if it is good to do this or not), but I don't know how to do it in Airflow. 1)
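A hedged sketch of how such a pipeline can be expressed with Airflow's EMR operators (import paths shown are the Airflow 1.10 contrib ones); the job-flow overrides, S3 path, and connection IDs are placeholders, and dependency installation would normally go into EMR bootstrap actions rather than separate .sh files:

```python
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator

# Placeholders: cluster spec (including bootstrap actions that pip-install
# requirements) and a spark-submit step pointing at code synced to S3.
JOB_FLOW_OVERRIDES = {"Name": "pyspark-cluster"}
SPARK_STEPS = [{"Name": "spark-job",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {"Jar": "command-runner.jar",
                                  "Args": ["spark-submit", "s3://my-bucket/job.py"]}}]

with DAG("emr_pyspark", start_date=days_ago(1), schedule_interval=None) as dag:
    create = EmrCreateJobFlowOperator(task_id="create_cluster",
                                      job_flow_overrides=JOB_FLOW_OVERRIDES,
                                      aws_conn_id="aws_default",
                                      emr_conn_id="emr_default")
    add_step = EmrAddStepsOperator(task_id="add_step",
                                   job_flow_id="{{ ti.xcom_pull(task_ids='create_cluster') }}",
                                   steps=SPARK_STEPS)
    watch = EmrStepSensor(task_id="watch_step",
                          job_flow_id="{{ ti.xcom_pull(task_ids='create_cluster') }}",
                          step_id="{{ ti.xcom_pull(task_ids='add_step')[0] }}")
    terminate = EmrTerminateJobFlowOperator(task_id="terminate_cluster",
                                            job_flow_id="{{ ti.xcom_pull(task_ids='create_cluster') }}",
                                            trigger_rule="all_done")
    create >> add_step >> watch >> terminate
```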

PySpark 2.4.5: IllegalArgumentException when using PandasUDF

送分小仙女□ submitted on 2020-06-12 08:01:13
Question: I am trying a Pandas UDF and facing an IllegalArgumentException. I also tried replicating examples from the PySpark GroupedData documentation to check, but I still get the error. The environment configuration is:

python 3.7
PySpark==2.4.5 installed using pip
PyArrow==0.16.0 installed using pip

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('int', PandasUDFType.GROUPED_AGG)
def min_udf(v):
    return v.min()

sorted(gdf.agg(min_udf(df.age)).collect())

Output
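A widely documented cause for this combination (Spark 2.4.x with PyArrow >= 0.15) is the Arrow IPC format change; the documented workaround is setting ARROW_PRE_0_15_IPC_FORMAT=1 in conf/spark-env.sh, or pinning pyarrow below 0.15. A sketch of setting it per session (the YARN-specific config is an assumption about the cluster manager):

```python
from pyspark.sql import SparkSession

# Sketch: ask Spark's Arrow code to keep the pre-0.15 IPC format so that
# pandas UDFs keep working with PyArrow 0.16 on Spark 2.4.5. Must be set
# before the session/JVM starts to take effect.
spark = (SparkSession.builder
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .config("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())

# Alternative: pin pyarrow to a pre-0.15 release, e.g. pyarrow==0.14.1,
# which is within the range Spark 2.4.x was tested against.
```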