pyspark

Failed to find data source: org.apache.dsext.spark.datasource.rest.RestDataSource

Submitted by 不打扰是莪最后的温柔 on 2021-01-07 02:32:15
Question: I'm using the Rest Data Source and I keep running into an issue with the output saying the following:

    hope_prms = {
        'url': search_url,
        'input': 'new_view',
        'method': 'GET',
        'readTimeout': '10000',
        'connectionTimeout': '2000',
        'partitions': '10'
    }
    sodasDf = spark.read.format('org.apache.dsext.spark.datasource.rest.RestDataSource') \
        .options(**hope_prms).load()

    An error occurred while calling o117.load.
    : java.lang.ClassNotFoundException: Failed to find data source:
    org.apache.dsext.spark.datasource.rest.RestDataSource
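
A ClassNotFoundException like this usually means the REST data source jar was never put on the Spark classpath. A minimal sketch of one way to attach it when building the session (the jar path below is hypothetical, not from the original question):

    from pyspark.sql import SparkSession

    # Hypothetical local path to the REST data source jar; adjust to
    # wherever the built jar actually lives on your machine.
    rest_jar = "/path/to/spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar"

    spark = (
        SparkSession.builder
        .appName("rest-datasource-demo")
        # Ships the jar to the driver and executors so the class resolves.
        .config("spark.jars", rest_jar)
        .getOrCreate()
    )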

Converting query from SQL to pyspark

Submitted by 霸气de小男生 on 2021-01-07 01:37:08
Question: I am trying to convert the following SQL query into pyspark:

    SELECT COUNT(
        CASE WHEN COALESCE(data.pred, 0) != 0
              AND COALESCE(data.val, 0) != 0
              AND (ABS(COALESCE(data.pred, 0) - COALESCE(data.val, 0)) / COALESCE(data.val, 0)) > 0.1
             THEN data.pred
        END
    ) / COUNT(*) AS Result

The code I have in PySpark right now is this:

    Result = data.select(
        count(
            (coalesce(data["pred"], lit(0)) != 0)
            & (coalesce(data["val"], lit(0)) != 0)
            & (abs(coalesce(data["pred"], lit(0)) - coalesce(data["val"], lit(0))) /
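
The key point in a conversion like this is that count() only counts non-NULL values, so the SQL CASE WHEN maps to F.when(): rows failing the condition become NULL and drop out of the count. A sketch of one way the full query could read:

    from pyspark.sql import functions as F

    pred = F.coalesce(F.col("pred"), F.lit(0))
    val = F.coalesce(F.col("val"), F.lit(0))

    # Rows where the relative error exceeds 10%, guarding both sides against 0.
    condition = (pred != 0) & (val != 0) & (F.abs(pred - val) / val > 0.1)

    result = data.select(
        # when() yields NULL where the condition fails, so count() skips those
        # rows; count(lit(1)) counts every row, mirroring SQL's COUNT(*).
        (F.count(F.when(condition, F.col("pred"))) / F.count(F.lit(1))).alias("Result")
    )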

PySpark string syntax error on UDF that returns MapType(StringType(), StringType())

Submitted by 假如想象 on 2021-01-07 01:29:08
Question: I'm getting the following syntax error while performing some aspect sentiment classification on the text column of a Spark dataframe df_text:

    pyspark.sql.utils.AnalysisException: syntax error in attribute name: No I am not.;

The dataframe looks more or less like the following:

    index  id       text
    1995   ev0oyrq  [sign up](
    2014   eugwxff  No I am not.
    2675   g9f914q  It’s hard for her to move around and even sit down, hard for her to walk and squeeze her hands. She hunches now.
    1310   echja0g  Thank you!
    2727   gc725t2
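
The error message quoting a cell value ("No I am not.") is the telltale sign: somewhere a row's string is being handed to an API that expects a Column, so Spark tries to parse the sentence as an attribute name. A minimal sketch of the safe pattern, with a hypothetical classifier standing in for the real model:

    from pyspark.sql import functions as F
    from pyspark.sql.types import MapType, StringType

    # Hypothetical stand-in for the actual aspect-sentiment classifier.
    def classify(text):
        return {"aspect": "none", "sentiment": "neutral"}

    classify_udf = F.udf(classify, MapType(StringType(), StringType()))

    # Pass the column itself, never a row's string value: F.col("text") is a
    # Column, whereas classify_udf("No I am not.") would make Spark parse the
    # sentence as a column name and raise exactly this AnalysisException.
    df_out = df_text.withColumn("aspects", classify_udf(F.col("text")))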

Spark: unusually slow data write to Cloud Storage

Submitted by 大憨熊 on 2021-01-07 01:24:25
Question: As the final stage of the pyspark job, I need to save 33 GB of data to Cloud Storage. My cluster is on Dataproc and consists of 15 n1-standard-v4 workers. I'm working with Avro, and this is the code I use to save the data:

    df = spark.createDataFrame(df.rdd, avro_schema_str)
    df \
        .write \
        .format("avro") \
        .partitionBy('<field_with_<5_unique_values>', '<field_with_lots_of_unique_values>') \
        .save(f"gs://{output_path}")

The write stage stats from the UI: (screenshot not preserved). My worker stats: (screenshot not preserved). Quite strangely for the adequate
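
Without the screenshots it is hard to diagnose for certain, but partitionBy() on a high-cardinality column is a classic cause of slow Cloud Storage writes: Spark fans the data out into one directory per value and emits huge numbers of tiny files. One common mitigation, sketched here with a hypothetical column name rather than the asker's actual schema:

    # Repartitioning by the partition column before the write caps the file
    # count, and dropping the high-cardinality field from partitionBy()
    # avoids the tiny-file explosion entirely.
    (
        df.repartition(120, "low_cardinality_field")   # hypothetical name
          .write
          .format("avro")
          .partitionBy("low_cardinality_field")
          .save(f"gs://{output_path}")
    )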

Can't install python-snappy wheel in Pycharm

Submitted by 大兔子大兔子 on 2021-01-07 01:23:07
Question: I have a question here: I followed this answer https://stackoverflow.com/a/43756412/12375559 to download the file and install it from my Windows prompt, and it seems python-snappy has been installed:

    C:\Users\xxxx\IdeaProjects\xxxx\venv>pip install python_snappy-0.5.4-cp38-cp38-win32.whl
    Processing c:\users\xxxxxx\ideaprojects\xxxxxx\venv\python_snappy-0.5.4-cp38-cp38-win32.whl
    Installing collected packages: python-snappy
    Successfully installed python-snappy-0.5.4
    WARNING: You
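
When a wheel installs successfully from a terminal but the import still fails inside PyCharm, the usual culprit is that pip targeted a different interpreter than the one the PyCharm project uses. A quick generic check (not from the original question) is to print the interpreter path from inside PyCharm and install against exactly that executable:

    import sys

    # Run this inside PyCharm: it prints the interpreter the project uses.
    # Then install with "<that path> -m pip install <wheel>" so the package
    # lands in this interpreter's site-packages rather than another pip's.
    print(sys.executable)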

How to correctly transform spark dataframe by mapInPandas

Submitted by 随声附和 on 2021-01-06 03:51:57
Question: I'm trying to transform a Spark dataframe with 10k rows using mapInPandas, new in the latest Spark 3.0.1.

Expected output: the mapped pandas_function() transforms one row into three, so the output transformed_df should have 30k rows.

Current output: I'm getting 3 rows with 1 core and 24 rows with 8 cores.

INPUT: respond_sdf has 10k rows

    +-----+-------------------------------------------------------------------+
    |url  |content                                                            |
    +-----+-------------------------------------------------------------------+
    |api_1|{
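
An output that scales with the core count (3 rows per core) suggests the function handles only the first pandas batch in each partition instead of looping over all of them. mapInPandas passes an iterator of pandas DataFrames and expects an iterator back, so every batch must be consumed and yielded. A minimal sketch of that contract, with a hypothetical one-row-to-three expansion:

    def pandas_function(iterator):
        # mapInPandas hands over an *iterator* of pandas DataFrames, one per
        # Arrow batch; returning after the first batch silently drops the rest.
        for pdf in iterator:
            # Hypothetical expansion: repeat every input row three times.
            yield pdf.loc[pdf.index.repeat(3)].reset_index(drop=True)

    transformed_df = respond_sdf.mapInPandas(pandas_function, respond_sdf.schema)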
