pyspark

pyspark Hive Context — read table with UTF-8 encoding

☆樱花仙子☆ submitted on 2019-12-24 11:34:58
Question: I have a table in Hive, and I am reading that table into a PySpark DataFrame df_sprk_df:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hive_context = HiveContext(sc)
    df_sprk_df = hive_context.sql('select * from databasename.tablename')
    df_pandas_df = df_sprk_df.toPandas()
    df_pandas_df = df_pandas_df.astype('str')

But when I try to convert df_pandas_df with astype('str'), I get an error like:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in …
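The post is cut off before any answer, but the error itself points at Python 2's default ASCII codec being used when astype('str') meets non-ASCII characters such as u'\u20ac'. A minimal sketch of one common workaround (not from the post), assuming Python 2 and the placeholder table name from the question, is to encode unicode cells to UTF-8 explicitly instead of calling astype('str'):

    # Sketch only: encode unicode values to UTF-8 bytes rather than astype('str'),
    # which implicitly uses the ASCII codec under Python 2.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hive_context = HiveContext(sc)
    df_sprk_df = hive_context.sql('select * from databasename.tablename')
    df_pandas_df = df_sprk_df.toPandas()

    df_pandas_df = df_pandas_df.applymap(
        lambda v: v.encode('utf-8') if isinstance(v, unicode) else v)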

PySpark sql compare records on each day and report the differences

怎甘沉沦 submitted on 2019-12-24 11:13:06
Question: The problem I have is this dataset: it shows which businesses are doing business on specific days. What I want to achieve is to report which businesses are added on which day. Perhaps I'm looking for an answer along those lines. I managed to tidy up all the records using this SQL:

    select [Date], Mnemonic, securityDesc,
           sum(cast(TradedVolume as money)) as TradedVolumSum
    FROM SomeTable
    group by [Date], Mnemonic, securityDesc

but I don't know how to compare each day's records with the other day and …
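The post is truncated before any answer, but if "added" means the first day a business appears, one hedged sketch (not from the post) is to take the minimum date per Mnemonic after deduplicating day/business pairs; the column and table names follow the SQL above and are otherwise assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("day-over-day-diff").getOrCreate()
    df = spark.table("SomeTable")  # assumes the data is registered as a table/view

    # One row per (Date, Mnemonic), then the first date each business is seen
    daily = df.select("Date", "Mnemonic").distinct()
    added = daily.groupBy("Mnemonic").agg(F.min("Date").alias("FirstSeenDate"))

    added.orderBy("FirstSeenDate").show()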

Read XML in spark

痴心易碎 submitted on 2019-12-24 10:45:53
Question: I am trying to read XML/nested XML in PySpark using the spark-xml jar:

    df = sqlContext.read \
        .format("com.databricks.spark.xml") \
        .option("rowTag", "hierachy") \
        .load("test.xml")

When I execute it, the DataFrame is not created properly:

    +--------------------+
    |                 att|
    +--------------------+
    |[[1,Data,[Wrapped...|
    +--------------------+

The XML format I have is mentioned below:

Answer 1: heirarchy should be the rootTag and att should be the rowTag, as in:

    df = spark.read \
        .format("com.databricks.spark.xml") \
        .option("rootTag…
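The answer is cut off mid-option, but spark-xml's rootTag/rowTag options make the intended call easy to reconstruct. A hedged sketch of the complete read, assuming the XML root element is the "heirarchy" tag named in the answer and "att" is the repeating element:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-xml").getOrCreate()

    df = spark.read \
        .format("com.databricks.spark.xml") \
        .option("rootTag", "heirarchy") \
        .option("rowTag", "att") \
        .load("test.xml")

    df.printSchema()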

pyspark flatMap error: TypeError: 'int' object is not iterable

空扰寡人 submitted on 2019-12-24 10:39:23
Question: This is the sample example code in my book:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("spark://chetan-ThinkPad-E470:7077").setAppName("FlatMap")
    sc = SparkContext(conf=conf)
    numbersRDD = sc.parallelize([1, 2, 3, 4])
    actionRDD = numbersRDD.flatMap(lambda x: x + x).collect()
    for values in actionRDD:
        print(values)

I am getting this error:

    TypeError: 'int' object is not iterable
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org…
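A hedged sketch (not part of the book example): flatMap expects the function to return an iterable for each element, and for an int x the expression x + x is just another int. Returning a list, or switching to map() when one output per input is intended, resolves the TypeError; the local master is used here instead of the question's standalone URL:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("FlatMap")
    sc = SparkContext(conf=conf)
    numbersRDD = sc.parallelize([1, 2, 3, 4])

    # flatMap: each element maps to an iterable, which is then flattened
    flat = numbersRDD.flatMap(lambda x: [x, x]).collect()   # [1, 1, 2, 2, 3, 3, 4, 4]

    # map: one output element per input element
    doubled = numbersRDD.map(lambda x: x + x).collect()     # [2, 4, 6, 8]

    for value in flat + doubled:
        print(value)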

Spark not executing tasks

坚强是说给别人听的谎言 submitted on 2019-12-24 10:23:22
Question: I can't get PySpark to work. I added the necessary paths to the system variable SPARK_HOME. I extracted data from my MongoDB database and simply converted the obtained list to a DataFrame. Then I want to view the DataFrame through show() (the last line of code), which gives the following error. My Hadoop version is 2.7, pyspark and local Spark are both 2.4.1, Python is 3.6 and Java is 8.

    import os
    import sys

    spark_path = r"C:\Tools\spark-2.4.0-bin-hadoop2.7"  # spark installed folder
    os…
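The snippet is cut off before the error, so the cause is a guess; a frequent one on Windows is the pip-installed pyspark (2.4.1 here) not matching the distribution under SPARK_HOME (a 2.4.0 folder here), or a missing winutils.exe. A hedged sketch that pins the interpreter to one installation via findspark, with the path taken from the question and everything else a placeholder:

    import os
    import findspark

    spark_path = r"C:\Tools\spark-2.4.0-bin-hadoop2.7"  # spark installed folder
    os.environ["SPARK_HOME"] = spark_path
    os.environ["HADOOP_HOME"] = spark_path              # folder containing bin\winutils.exe
    findspark.init(spark_path)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("mongo-test").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])  # stand-in for the MongoDB list
    df.show()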

Remove all rows that are duplicates with respect to some rows

浪尽此生 submitted on 2019-12-24 09:58:12
Question: I've seen a couple of questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:

    +------+-----+----+
    |    id|value|type|
    +------+-----+----+
    |283924|  1.5|   0|
    |283924|  1.5|   1|
    |982384|  3.0|   0|
    |982384|  3.0|   1|
    |892383|  2.0|   0|
    |892383|  2.5|   1|
    +------+-----+----+

I want to identify duplicates by just the "id" and "value" columns, and then remove all instances. In this case:

    Rows 1 and 2 are duplicates (again we are ignoring the "type" column)
    Rows 3 and 4 are …
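A hedged sketch (not from the post) of one way to do this: count how many rows share each ("id", "value") pair with a window function and keep only the pairs that occur exactly once, which removes every instance of a duplicate:

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.appName("drop-all-duplicates").getOrCreate()

    df = spark.createDataFrame(
        [(283924, 1.5, 0), (283924, 1.5, 1),
         (982384, 3.0, 0), (982384, 3.0, 1),
         (892383, 2.0, 0), (892383, 2.5, 1)],
        ["id", "value", "type"])

    w = Window.partitionBy("id", "value")
    result = (df.withColumn("cnt", F.count("*").over(w))
                .filter(F.col("cnt") == 1)
                .drop("cnt"))

    result.show()  # keeps only (892383, 2.0, 0) and (892383, 2.5, 1)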

Custom Evaluator during cross validation SPARK

蹲街弑〆低调 submitted on 2019-12-24 09:47:40
Question: My aim is to add a rank-based evaluator to the CrossValidator function (PySpark):

    cvExplicit = CrossValidator(estimator=cvSet, numFolds=8,
                                estimatorParamMaps=paramMap, evaluator=rnkEvaluate)

However, I need to pass the evaluated DataFrame into the function, and I do not know how to do that part.

    class rnkEvaluate():
        def __init__(self, user_col="user", rating_col="rating", prediction_col="prediction"):
            print(user_col)
            print(rating_col)
            print(prediction_col)

        def isLargerBetter():
            return True
    …
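A hedged sketch (not from the post) of the usual pattern: CrossValidator hands each fold's prediction DataFrame to its evaluator by itself, so the class needs to derive from pyspark.ml.evaluation.Evaluator and implement _evaluate(dataset) rather than receive the DataFrame manually. The RMSE-style metric below is only a placeholder for whatever rank-based measure rnkEvaluate is meant to compute:

    from pyspark.ml.evaluation import Evaluator
    from pyspark.sql import functions as F

    class RankEvaluator(Evaluator):
        def __init__(self, user_col="user", rating_col="rating", prediction_col="prediction"):
            super(RankEvaluator, self).__init__()
            self.user_col = user_col
            self.rating_col = rating_col
            self.prediction_col = prediction_col

        def _evaluate(self, dataset):
            # dataset is the prediction DataFrame for one validation fold
            return (dataset
                    .select(F.sqrt(F.avg(
                        (F.col(self.rating_col) - F.col(self.prediction_col)) ** 2))
                        .alias("metric"))
                    .first()["metric"])

        def isLargerBetter(self):
            return False  # lower error is better for this placeholder metric

    # Usage, mirroring the question's call:
    # cvExplicit = CrossValidator(estimator=cvSet, estimatorParamMaps=paramMap,
    #                             numFolds=8, evaluator=RankEvaluator())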

Spark: Variant Datatype is not supported

我的梦境 submitted on 2019-12-24 09:29:15
Question: While extracting data of the variant data type from SQL Server in PySpark, I am getting a SQLServerException: "Variant datatype is not supported". Please advise on any workaround.

Answer 1: Converting the column to varchar while fetching did the trick:

    SELECT CONVERT(varchar,Code,20) into Code from DBTable

Source: https://stackoverflow.com/questions/40786605/spark-variant-datatype-is-not-supported
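A hedged sketch (not from the answer) of how the same CONVERT can be pushed into a Spark JDBC read so the variant column arrives as varchar; the connection details, table, and column names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sqlserver-variant").getOrCreate()

    # Wrap the conversion in a subquery so Spark never sees the variant type
    query = "(SELECT CONVERT(varchar(50), Code, 20) AS Code FROM DBTable) AS src"

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
          .option("dbtable", query)
          .option("user", "<user>")
          .option("password", "<password>")
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load())

    df.printSchema()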

How to serialize PySpark GroupedData object?

♀尐吖头ヾ submitted on 2019-12-24 09:24:13
Question: I am running a groupBy() on a dataset with several million records and want to save the resulting output (a PySpark GroupedData object) so that I can deserialize it later and resume from that point (running aggregations on top of it as needed).

    df.groupBy("geo_city")
    <pyspark.sql.group.GroupedData at 0x10503c5d0>

I want to avoid converting the GroupedData object into a DataFrame or RDD in order to save it to a text file or to Parquet/Avro format (as the conversion operation is expensive).
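A hedged note (not from the post): GroupedData is only a lazy handle over the parent DataFrame plus its grouping columns, so there is no computed state to serialize. Under that assumption, a sketch of the usual workaround is to persist (or write) the DataFrame once and re-create the cheap groupBy() handle when resuming; the paths are placeholders:

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("grouped-data-resume").getOrCreate()
    df = spark.read.parquet("/path/to/source")   # placeholder input

    df.persist(StorageLevel.MEMORY_AND_DISK)     # keep the expensive part reusable
    grouped = df.groupBy("geo_city")             # recreating this handle is essentially free

    counts = grouped.count()
    counts.write.mode("overwrite").parquet("/path/to/geo_city_counts")  # save results, not the handle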

pyspark write to wasb blob storage container

有些话、适合烂在心里 submitted on 2019-12-24 09:04:04
Question: I am running an Ubuntu N-series instance on Azure to run a calculation. After the calculation I try to write to an Azure blob container using a wasb-style URL, wasb://containername/path. I am trying to use the PySpark command

    sparkSession.write.save('wasb://containername/path', format='json', mode='append')

but I receive a Java IO exception from Spark saying it doesn't support a wasb file system. I was wondering if anyone knows how to write to a wasb address while not using a …
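The question is cut off, but a hedged sketch (not from the post) of one common way to make the wasb:// scheme resolvable is to put the hadoop-azure connector on the classpath and supply the storage-account key; the account, container, key, and package versions are placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("write-to-wasb")
             .config("spark.jars.packages",
                     "org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:2.0.0")
             .config("spark.hadoop.fs.wasb.impl",
                     "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
             .config("spark.hadoop.fs.azure.account.key.<storage_account>.blob.core.windows.net",
                     "<storage_account_key>")
             .getOrCreate())

    df = spark.createDataFrame([(1, "a")], ["id", "value"])  # stand-in for the calculation output

    df.write.mode("append").json(
        "wasb://<containername>@<storage_account>.blob.core.windows.net/path")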