pyspark

pyspark Hive Context — read table with UTF-8 encoding

☆樱花仙子☆ submitted on 2019-12-24 11:34:58
Question: I have a table in Hive, and I am reading that table into a PySpark DataFrame df_sprk_df:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hive_context = HiveContext(sc)
    df_sprk_df = hive_context.sql('select * from databasename.tablename')
    df_pandas_df = df_sprk_df.toPandas()
    df_pandas_df = df_pandas_df.astype('str')

But when I try to convert df_pandas_df with astype('str'), I get an error like:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in …
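The post is cut off before any answer, but the error itself points at Python 2's default ASCII codec being used when astype('str') meets non-ASCII characters such as u'\u20ac'. A minimal sketch of one common workaround (not from the post), assuming Python 2 and the placeholder table name from the question, is to encode unicode cells to UTF-8 explicitly instead of calling astype('str'):

    # Sketch only: encode unicode values to UTF-8 bytes rather than astype('str'),
    # which implicitly uses the ASCII codec under Python 2.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hive_context = HiveContext(sc)
    df_sprk_df = hive_context.sql('select * from databasename.tablename')
    df_pandas_df = df_sprk_df.toPandas()

    df_pandas_df = df_pandas_df.applymap(
        lambda v: v.encode('utf-8') if isinstance(v, unicode) else v)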

PySpark sql compare records on each day and report the differences

怎甘沉沦 submitted on 2019-12-24 11:13:06
Question: The problem I have is this dataset: it shows which businesses are doing business on specific days. What I want to achieve is to report which businesses are added on which day. Perhaps I'm looking for an answer along those lines. I managed to tidy up all the records using this SQL:

    select [Date], Mnemonic, securityDesc,
           sum(cast(TradedVolume as money)) as TradedVolumSum
    FROM SomeTable
    group by [Date], Mnemonic, securityDesc

but I don't know how to compare each day's records with the other day and …
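The post is truncated before any answer, but if "added" means the first day a business appears, one hedged sketch (not from the post) is to take the minimum date per Mnemonic after deduplicating day/business pairs; the column and table names follow the SQL above and are otherwise assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("day-over-day-diff").getOrCreate()
    df = spark.table("SomeTable")  # assumes the data is registered as a table/view

    # One row per (Date, Mnemonic), then the first date each business is seen
    daily = df.select("Date", "Mnemonic").distinct()
    added = daily.groupBy("Mnemonic").agg(F.min("Date").alias("FirstSeenDate"))

    added.orderBy("FirstSeenDate").show()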

Read XML in spark

痴心易碎 submitted on 2019-12-24 10:45:53
Question: I am trying to read XML/nested XML in PySpark using the spark-xml jar:

    df = sqlContext.read \
        .format("com.databricks.spark.xml") \
        .option("rowTag", "hierachy") \
        .load("test.xml")

When I execute it, the DataFrame is not created properly:

    +--------------------+
    |                 att|
    +--------------------+
    |[[1,Data,[Wrapped...|
    +--------------------+

The XML format I have is mentioned below:

Answer 1: heirarchy should be the rootTag and att should be the rowTag, as in:

    df = spark.read \
        .format("com.databricks.spark.xml") \
        .option("rootTag…
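The answer is cut off mid-option, but spark-xml's rootTag/rowTag options make the intended call easy to reconstruct. A hedged sketch of the complete read, assuming the XML root element is the "heirarchy" tag named in the answer and "att" is the repeating element:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-xml").getOrCreate()

    df = spark.read \
        .format("com.databricks.spark.xml") \
        .option("rootTag", "heirarchy") \
        .option("rowTag", "att") \
        .load("test.xml")

    df.printSchema()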

pyspark flatMap error: TypeError: 'int' object is not iterable

空扰寡人 submitted on 2019-12-24 10:39:23
Question: This is the sample example code in my book:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("spark://chetan-ThinkPad-E470:7077").setAppName("FlatMap")
    sc = SparkContext(conf=conf)
    numbersRDD = sc.parallelize([1, 2, 3, 4])
    actionRDD = numbersRDD.flatMap(lambda x: x + x).collect()
    for values in actionRDD:
        print(values)

I am getting this error:

    TypeError: 'int' object is not iterable
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org…
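A hedged sketch (not part of the book example): flatMap expects the function to return an iterable for each element, and for an int x the expression x + x is just another int. Returning a list, or switching to map() when one output per input is intended, resolves the TypeError; the local master is used here instead of the question's standalone URL:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("FlatMap")
    sc = SparkContext(conf=conf)
    numbersRDD = sc.parallelize([1, 2, 3, 4])

    # flatMap: each element maps to an iterable, which is then flattened
    flat = numbersRDD.flatMap(lambda x: [x, x]).collect()   # [1, 1, 2, 2, 3, 3, 4, 4]

    # map: one output element per input element
    doubled = numbersRDD.map(lambda x: x + x).collect()     # [2, 4, 6, 8]

    for value in flat + doubled:
        print(value)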

Spark not executing tasks

坚强是说给别人听的谎言 submitted on 2019-12-24 10:23:22
Question: I can't get PySpark to work. I added the necessary paths to the system variable SPARK_HOME. I extracted data from my MongoDB database and simply converted the obtained list to a DataFrame. Then I want to view the DataFrame through show() (the last line of code), which gives the following error. My Hadoop version is 2.7, pyspark and local Spark are both 2.4.1, Python is 3.6 and Java is 8.

    import os
    import sys

    spark_path = r"C:\Tools\spark-2.4.0-bin-hadoop2.7"  # spark installed folder
    os…
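The snippet is cut off before the error, so the cause is a guess; a frequent one on Windows is the pip-installed pyspark (2.4.1 here) not matching the distribution under SPARK_HOME (a 2.4.0 folder here), or a missing winutils.exe. A hedged sketch that pins the interpreter to one installation via findspark, with the path taken from the question and everything else a placeholder:

    import os
    import findspark

    spark_path = r"C:\Tools\spark-2.4.0-bin-hadoop2.7"  # spark installed folder
    os.environ["SPARK_HOME"] = spark_path
    os.environ["HADOOP_HOME"] = spark_path              # folder containing bin\winutils.exe
    findspark.init(spark_path)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("mongo-test").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])  # stand-in for the MongoDB list
    df.show()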

Remove all rows that are duplicates with respect to some rows

浪尽此生 submitted on 2019-12-24 09:58:12
Question: I've seen a couple of questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:

    +------+-----+----+
    |    id|value|type|
    +------+-----+----+
    |283924|  1.5|   0|
    |283924|  1.5|   1|
    |982384|  3.0|   0|
    |982384|  3.0|   1|
    |892383|  2.0|   0|
    |892383|  2.5|   1|
    +------+-----+----+

I want to identify duplicates by just the "id" and "value" columns, and then remove all instances. In this case:

    Rows 1 and 2 are duplicates (again we are ignoring the "type" column)
    Rows 3 and 4 are …
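A hedged sketch (not from the post) of one way to do this: count how many rows share each ("id", "value") pair with a window function and keep only the pairs that occur exactly once, which removes every instance of a duplicate:

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.appName("drop-all-duplicates").getOrCreate()

    df = spark.createDataFrame(
        [(283924, 1.5, 0), (283924, 1.5, 1),
         (982384, 3.0, 0), (982384, 3.0, 1),
         (892383, 2.0, 0), (892383, 2.5, 1)],
        ["id", "value", "type"])

    w = Window.partitionBy("id", "value")
    result = (df.withColumn("cnt", F.count("*").over(w))
                .filter(F.col("cnt") == 1)
                .drop("cnt"))

    result.show()  # keeps only (892383, 2.0, 0) and (892383, 2.5, 1)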

Custom Evaluator during cross validation SPARK

蹲街弑〆低调 submitted on 2019-12-24 09:47:40
Question: My aim is to add a rank-based evaluator to the CrossValidator function (PySpark):

    cvExplicit = CrossValidator(estimator=cvSet, numFolds=8,
                                estimatorParamMaps=paramMap, evaluator=rnkEvaluate)

However, I need to pass the evaluated DataFrame into the function, and I do not know how to do that part.

    class rnkEvaluate():
        def __init__(self, user_col="user", rating_col="rating", prediction_col="prediction"):
            print(user_col)
            print(rating_col)
            print(prediction_col)

        def isLargerBetter():
            return True
    …
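A hedged sketch (not from the post) of the usual pattern: CrossValidator hands each fold's prediction DataFrame to its evaluator by itself, so the class needs to derive from pyspark.ml.evaluation.Evaluator and implement _evaluate(dataset) rather than receive the DataFrame manually. The RMSE-style metric below is only a placeholder for whatever rank-based measure rnkEvaluate is meant to compute:

    from pyspark.ml.evaluation import Evaluator
    from pyspark.sql import functions as F

    class RankEvaluator(Evaluator):
        def __init__(self, user_col="user", rating_col="rating", prediction_col="prediction"):
            super(RankEvaluator, self).__init__()
            self.user_col = user_col
            self.rating_col = rating_col
            self.prediction_col = prediction_col

        def _evaluate(self, dataset):
            # dataset is the prediction DataFrame for one validation fold
            return (dataset
                    .select(F.sqrt(F.avg(
                        (F.col(self.rating_col) - F.col(self.prediction_col)) ** 2))
                        .alias("metric"))
                    .first()["metric"])

        def isLargerBetter(self):
            return False  # lower error is better for this placeholder metric

    # Usage, mirroring the question's call:
    # cvExplicit = CrossValidator(estimator=cvSet, estimatorParamMaps=paramMap,
    #                             numFolds=8, evaluator=RankEvaluator())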

Spark: Variant Datatype is not supported

我的梦境 submitted on 2019-12-24 09:29:15
Question: While extracting data of the variant data type from SQL Server in PySpark, I am getting a SQLServerException: "Variant datatype is not supported". Please advise on any workaround.

Answer 1: Converting the column to varchar while fetching did the trick:

    SELECT CONVERT(varchar,Code,20) into Code from DBTable

Source: https://stackoverflow.com/questions/40786605/spark-variant-datatype-is-not-supported
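A hedged sketch (not from the answer) of how the same CONVERT can be pushed into a Spark JDBC read so the variant column arrives as varchar; the connection details, table, and column names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sqlserver-variant").getOrCreate()

    # Wrap the conversion in a subquery so Spark never sees the variant type
    query = "(SELECT CONVERT(varchar(50), Code, 20) AS Code FROM DBTable) AS src"

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
          .option("dbtable", query)
          .option("user", "<user>")
          .option("password", "<password>")
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load())

    df.printSchema()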

How to serialize PySpark GroupedData object?

♀尐吖头ヾ submitted on 2019-12-24 09:24:13
Question: I am running a groupBy() on a dataset with several million records and want to save the resulting output (a PySpark GroupedData object) so that I can deserialize it later and resume from that point (running aggregations on top of it as needed).

    df.groupBy("geo_city")
    <pyspark.sql.group.GroupedData at 0x10503c5d0>

I want to avoid converting the GroupedData object into a DataFrame or RDD in order to save it to a text file or to Parquet/Avro format (as the conversion operation is expensive).
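A hedged note (not from the post): GroupedData is only a lazy handle over the parent DataFrame plus its grouping columns, so there is no computed state to serialize. Under that assumption, a sketch of the usual workaround is to persist (or write) the DataFrame once and re-create the cheap groupBy() handle when resuming; the paths are placeholders:

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("grouped-data-resume").getOrCreate()
    df = spark.read.parquet("/path/to/source")   # placeholder input

    df.persist(StorageLevel.MEMORY_AND_DISK)     # keep the expensive part reusable
    grouped = df.groupBy("geo_city")             # recreating this handle is essentially free

    counts = grouped.count()
    counts.write.mode("overwrite").parquet("/path/to/geo_city_counts")  # save results, not the handle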

pyspark write to wasb blob storage container

有些话、适合烂在心里 submitted on 2019-12-24 09:04:04
Question: I am running an Ubuntu N-series instance on Azure to run a calculation. After the calculation I try to write to an Azure blob container using a wasb-style URL, wasb://containername/path. I am trying to use the PySpark command

    sparkSession.write.save('wasb://containername/path', format='json', mode='append')

but I receive a Java IO exception from Spark saying it doesn't support a wasb file system. I was wondering if anyone knows how to write to a wasb address while not using a …
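The question is cut off, but a hedged sketch (not from the post) of one common way to make the wasb:// scheme resolvable is to put the hadoop-azure connector on the classpath and supply the storage-account key; the account, container, key, and package versions are placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("write-to-wasb")
             .config("spark.jars.packages",
                     "org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:2.0.0")
             .config("spark.hadoop.fs.wasb.impl",
                     "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
             .config("spark.hadoop.fs.azure.account.key.<storage_account>.blob.core.windows.net",
                     "<storage_account_key>")
             .getOrCreate())

    df = spark.createDataFrame([(1, "a")], ["id", "value"])  # stand-in for the calculation output

    df.write.mode("append").json(
        "wasb://<containername>@<storage_account>.blob.core.windows.net/path")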