pyspark

PySpark - Create DataFrame from Numpy Matrix

Submitted by 你说的曾经没有我的故事 on 2019-12-22 08:34:50
Question: I have a numpy matrix:

arr = np.array([[2,3], [2,8], [2,3], [4,5]])

I need to create a PySpark DataFrame from arr. I cannot enter the values manually, because the length and values of arr change dynamically, so I need to convert arr into a DataFrame. I tried the following code with no success:

df = sqlContext.createDataFrame(arr, ["A", "B"])

However, I get the following error:

TypeError: Can not infer schema for type: <type 'numpy.ndarray'>

Answer 1: Hope this helps!

import numpy as np #sample
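
The answer above is truncated. A minimal sketch of one common approach, assuming an active SparkSession named spark: createDataFrame cannot infer a schema from a numpy array, but it can from plain Python lists, so convert the matrix first.

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

arr = np.array([[2, 3], [2, 8], [2, 3], [4, 5]])

# tolist() turns numpy scalars into native Python ints, so schema inference works
df = spark.createDataFrame(arr.tolist(), ["A", "B"])
df.show()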

Upper triangle of cartesian in spark for symmetric operations: `x*(x+1)//2` instead of `x**2`

Submitted by 与世无争的帅哥 on 2019-12-22 08:34:38
Question: I need to compute pairwise symmetric scores for the items of a list in Spark, i.e. score(x[i], x[j]) = score(x[j], x[i]). One solution is to use x.cartesian(x), but that performs x**2 operations instead of the minimal necessary x*(x+1)//2. What is the most efficient remedy for this issue in Spark?

PS. In pure Python I would use an iterator like:

class uptrsq_range(object):
    def __init__(self, n):
        self._n_ = n
        self._length = n*(n+1) // 2
    def __iter__(self):
        for ii in range(self._n_):
            for jj in
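
A minimal sketch of one workaround, assuming an RDD named x and a user-defined score function: index each element with zipWithIndex, take the cartesian product, and keep only the upper-triangle pairs. This avoids computing each unordered pair's score twice, although the full cartesian product is still generated before the filter.

indexed = x.zipWithIndex()  # (item, index) pairs
pairs = (indexed.cartesian(indexed)
                .filter(lambda p: p[0][1] <= p[1][1])   # keep only i <= j
                .map(lambda p: (p[0][0], p[1][0])))     # drop the indices
scores = pairs.map(lambda p: ((p[0], p[1]), score(p[0], p[1])))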

Memory efficient cartesian join in PySpark

Submitted by 大憨熊 on 2019-12-22 08:24:02
Question: I have a large dataset of string ids that can fit into memory on a single node in my Spark cluster, but it consumes most of that node's memory. The ids are about 30 characters long. For example:

ids
O2LWk4MAbcrOCWo3IVM0GInelSXfcG
HbDckDXCye20kwu0gfeGpLGWnJ2yif
o43xSMBUJLOKDxkYEQbAEWk4aPQHkm

I am looking to write to file a list of all of the pairs of ids. For example:

id1,id2
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,HbDckDXCye20kwu0gfeGpLGWnJ2yif
O2LWk4MAbcrOCWo3IVM0GInelSXfcG
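
A minimal sketch of one way to keep the pairs from ever being collected on a single node, assuming a hypothetical input file of one id per line and a hypothetical output directory: build the pairs with a DataFrame cross join and write them straight to disk, so each executor streams out its own part files.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ids = spark.read.text("ids.txt").withColumnRenamed("value", "id")   # hypothetical input path
pairs = (ids.withColumnRenamed("id", "id1")
            .crossJoin(ids.withColumnRenamed("id", "id2"))
            .filter(F.col("id1") < F.col("id2")))                   # each unordered pair once
pairs.write.csv("id_pairs", header=True)                            # hypothetical output path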

Convert Sparse Vector to Dense Vector in Pyspark

Submitted by 我的梦境 on 2019-12-22 08:10:19
Question: I have a sparse vector like this:

>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]

I am trying to convert this into a dense vector in PySpark 2.0.0 like this:

>>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
>>>
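
A minimal sketch of one way to do the conversion, assuming the vectors come from the DataFrame-based API (pyspark.ml; use pyspark.mllib.linalg instead if they are mllib vectors): call toArray() on each SparseVector and wrap the result in a DenseVector.

from pyspark.ml.linalg import DenseVector

frequencyVectors = countVectors.rdd.map(lambda row: row[1])
denseVectors = frequencyVectors.map(lambda sv: DenseVector(sv.toArray()))
denseVectors.collect()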

Spark with Cython

Submitted by 微笑、不失礼 on 2019-12-22 07:54:24
Question: I recently wanted to use Cython with Spark, for which I followed the following reference. I wrote the programs as described, but I am getting:

TypeError: fib_mapper_cython() takes exactly 1 argument (0 given)

spark-tools.py

def spark_cython(module, method):
    def wrapped(*args, **kwargs):
        global cython_function_
        try:
            return cython_function_(*args, **kwargs)
        except:
            import pyximport
            pyximport.install()
            cython_function_ = getattr(__import__(module), method)
            return cython_function_(
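
The snippet above is cut off. A hedged sketch of how this lazy-import wrapper usually looks when complete, using the fib module and fib_mapper_cython function named in the error; one possible cause of a "takes exactly 1 argument (0 given)" TypeError is calling the returned wrapper immediately with empty parentheses instead of handing it to map.

def spark_cython(module, method):
    def wrapped(*args, **kwargs):
        global cython_function_
        try:
            return cython_function_(*args, **kwargs)
        except:
            import pyximport
            pyximport.install()
            cython_function_ = getattr(__import__(module), method)
            return cython_function_(*args, **kwargs)
    return wrapped

# Pass the wrapper itself to map; writing spark_cython('fib', 'fib_mapper_cython')()
# would invoke it with zero arguments and raise the TypeError from the question.
mapper = spark_cython('fib', 'fib_mapper_cython')
results = lines.map(mapper)   # lines is a hypothetical RDD of inputs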

How to enable Tungsten optimization in Spark 2?

Submitted by 孤街醉人 on 2019-12-22 07:03:18
Question: I just built Spark 2 with Hive support and deployed it to a cluster running Hortonworks 2.3.4. However, I find that this Spark 2.0.3 is slower than the standard Spark 1.5.3 that comes with HDP 2.3. When I check explain, it seems that my Spark 2.0.3 is not using Tungsten. Do I need to create a special build to enable Tungsten?

Spark 1.5.3 Explain

== Physical Plan ==
TungstenAggregate(key=[id#2], functions=[], output=[id#2])
 TungstenExchange hashpartitioning(id#2)
  TungstenAggregate(key=[id#2], functions=
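
In Spark 2.x the physical plan no longer labels operators with a "Tungsten" prefix; Tungsten's whole-stage code generation is on by default and shows up as an asterisk in front of the generated operators instead. A minimal sketch of how one might verify this, assuming a SparkSession named spark:

spark.conf.get("spark.sql.codegen.wholeStage")   # "true" by default in Spark 2.x

df = spark.range(1000).groupBy("id").count()
df.explain()
# Operators prefixed with "*" (e.g. *HashAggregate, *Range) run inside
# whole-stage code generation, which is how Tungsten appears in Spark 2 plans.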

Partitions not being pruned in simple SparkSQL queries

Submitted by 爱⌒轻易说出口 on 2019-12-22 05:33:28
Question: I'm trying to efficiently select individual partitions from a SparkSQL table (Parquet in S3). However, I see evidence of Spark opening all Parquet files in the table, not just those that pass the filter. This makes even small queries expensive for tables with large numbers of partitions. Here's an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore:

# Make some data
df = pandas.DataFrame({'pk': ['a']*5+['b']*5+['c']*5, 'k': ['a', 'e', 'i', 'o',
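
A minimal sketch of how one might check whether partition pruning is happening, assuming a SparkSession named spark and a hypothetical local path in place of S3; for tables registered in a Hive metastore, the spark.sql.hive.metastorePartitionPruning setting is also relevant.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.hive.metastorePartitionPruning", "true")
         .getOrCreate())

df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["pk", "k"])
df.write.partitionBy("pk").parquet("/tmp/pruning_demo")   # hypothetical path

# The physical plan should show the filter on pk applied as a partition filter,
# meaning only the pk=a directory is scanned rather than every Parquet file.
spark.read.parquet("/tmp/pruning_demo").filter("pk = 'a'").explain(True)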

How to create a z-score in Spark SQL for each group

Submitted by 混江龙づ霸主 on 2019-12-22 05:30:16
Question: I have a dataframe which looks like this:

        dSc  TranAmount
 1:  100021       79.64
 2:  100021       79.64
 3:  100021        0.16
 4:  100022       11.65
 5:  100022        0.36
 6:  100022        0.47
 7:  100025        0.17
 8:  100037        0.27
 9:  100056        0.27
10:  100063        0.13
11:  100079        0.13
12:  100091        0.15
13:  100101        0.22
14:  100108        0.14
15:  100109        0.04

Now I want to create a third column with the z-score of each TranAmount, which will be (TranAmount - mean(TranAmount)) / StdDev(TranAmount), where the mean and standard deviation are computed within each dSc group. Now
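
A minimal sketch of one way to add such a per-group z-score with window functions, assuming the dataframe is named df and has columns dSc and TranAmount:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("dSc")
df_z = df.withColumn(
    "zscore",
    (F.col("TranAmount") - F.avg("TranAmount").over(w)) / F.stddev("TranAmount").over(w))
df_z.show()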

IllegalArgumentException thrown when calling count and collect functions in Spark

Submitted by 徘徊边缘 on 2019-12-22 05:11:44
Question: I tried to load a small dataset on local Spark, and this exception is thrown when I use count() in PySpark (take() seems to work). I tried to search for this issue but had no luck figuring out why. It seems something is wrong with the partitioning of the RDD. Any ideas? Thank you in advance!

sc.stop()
sc = SparkContext("local[4]", "temp")
testfile1 = sc.textFile(localpath('part-00000-Copy1.xml'))
testfile1.filter(lambda x: x.strip().encode('utf-8').startswith(b'<row')).take(1) ## take
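
Since count() evaluates every partition while take(1) may only touch the first one, a hedged debugging sketch is to inspect the partitions individually and see where the failure appears (reusing sc, the localpath helper, and the file name from the question):

testfile1 = sc.textFile(localpath('part-00000-Copy1.xml'))
filtered = testfile1.filter(lambda x: x.strip().encode('utf-8').startswith(b'<row'))

print(filtered.getNumPartitions())          # how many partitions count() has to evaluate
print(filtered.glom().map(len).collect())   # records per partition; a bad partition will raise here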