pyspark

PySpark - Create DataFrame from Numpy Matrix

Submitted by 你说的曾经没有我的故事 on 2019-12-22 08:34:50
Question: I have a numpy matrix:

arr = np.array([[2,3], [2,8], [2,3], [4,5]])

I need to create a PySpark DataFrame from arr. I cannot enter the values manually, because the length and values of arr change dynamically, so I need to convert arr into a DataFrame. I tried the following code with no success:

df = sqlContext.createDataFrame(arr, ["A", "B"])

However, I get the following error:

TypeError: Can not infer schema for type: <type 'numpy.ndarray'>

Answer 1: Hope this helps!

import numpy as np #sample
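
The answer above is truncated. A minimal sketch of one common approach, assuming an active SparkSession named spark: createDataFrame cannot infer a schema from a numpy array, but it can from plain Python lists, so convert the matrix first.

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

arr = np.array([[2, 3], [2, 8], [2, 3], [4, 5]])

# tolist() turns numpy scalars into native Python ints, so schema inference works
df = spark.createDataFrame(arr.tolist(), ["A", "B"])
df.show()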

Upper triangle of cartesian in spark for symmetric operations: `x*(x+1)//2` instead of `x**2`

Submitted by 与世无争的帅哥 on 2019-12-22 08:34:38
Question: I need to compute pairwise symmetric scores for the items of a list in Spark, i.e. score(x[i], x[j]) = score(x[j], x[i]). One solution is to use x.cartesian(x), but that performs x**2 operations instead of the minimal necessary x*(x+1)//2. What is the most efficient remedy for this issue in Spark?

PS. In pure Python I would use an iterator like:

class uptrsq_range(object):
    def __init__(self, n):
        self._n_ = n
        self._length = n*(n+1) // 2
    def __iter__(self):
        for ii in range(self._n_):
            for jj in
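
A minimal sketch of one workaround, assuming an RDD named x and a user-defined score function: index each element with zipWithIndex, take the cartesian product, and keep only the upper-triangle pairs. This avoids computing each unordered pair's score twice, although the full cartesian product is still generated before the filter.

indexed = x.zipWithIndex()  # (item, index) pairs
pairs = (indexed.cartesian(indexed)
                .filter(lambda p: p[0][1] <= p[1][1])   # keep only i <= j
                .map(lambda p: (p[0][0], p[1][0])))     # drop the indices
scores = pairs.map(lambda p: ((p[0], p[1]), score(p[0], p[1])))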

Memory efficient cartesian join in PySpark

Submitted by 大憨熊 on 2019-12-22 08:24:02
Question: I have a large dataset of string ids that can fit into memory on a single node in my Spark cluster, but it consumes most of that node's memory. The ids are about 30 characters long. For example:

ids
O2LWk4MAbcrOCWo3IVM0GInelSXfcG
HbDckDXCye20kwu0gfeGpLGWnJ2yif
o43xSMBUJLOKDxkYEQbAEWk4aPQHkm

I am looking to write to file a list of all of the pairs of ids. For example:

id1,id2
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,HbDckDXCye20kwu0gfeGpLGWnJ2yif
O2LWk4MAbcrOCWo3IVM0GInelSXfcG
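
A minimal sketch of one way to keep the pairs from ever being collected on a single node, assuming a hypothetical input file of one id per line and a hypothetical output directory: build the pairs with a DataFrame cross join and write them straight to disk, so each executor streams out its own part files.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ids = spark.read.text("ids.txt").withColumnRenamed("value", "id")   # hypothetical input path
pairs = (ids.withColumnRenamed("id", "id1")
            .crossJoin(ids.withColumnRenamed("id", "id2"))
            .filter(F.col("id1") < F.col("id2")))                   # each unordered pair once
pairs.write.csv("id_pairs", header=True)                            # hypothetical output path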

Convert Sparse Vector to Dense Vector in Pyspark

Submitted by 我的梦境 on 2019-12-22 08:10:19
Question: I have a sparse vector like this:

>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]

I am trying to convert this into a dense vector in PySpark 2.0.0 like this:

>>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
>>>
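
A minimal sketch of one way to do the conversion, assuming the vectors come from the DataFrame-based API (pyspark.ml; use pyspark.mllib.linalg instead if they are mllib vectors): call toArray() on each SparseVector and wrap the result in a DenseVector.

from pyspark.ml.linalg import DenseVector

frequencyVectors = countVectors.rdd.map(lambda row: row[1])
denseVectors = frequencyVectors.map(lambda sv: DenseVector(sv.toArray()))
denseVectors.collect()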

Spark with Cython

Submitted by 微笑、不失礼 on 2019-12-22 07:54:24
Question: I recently wanted to use Cython with Spark, for which I followed the following reference. I wrote the programs as described, but I am getting:

TypeError: fib_mapper_cython() takes exactly 1 argument (0 given)

spark-tools.py

def spark_cython(module, method):
    def wrapped(*args, **kwargs):
        global cython_function_
        try:
            return cython_function_(*args, **kwargs)
        except:
            import pyximport
            pyximport.install()
            cython_function_ = getattr(__import__(module), method)
            return cython_function_(
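
The snippet above is cut off. A hedged sketch of how this lazy-import wrapper usually looks when complete, using the fib module and fib_mapper_cython function named in the error; one possible cause of a "takes exactly 1 argument (0 given)" TypeError is calling the returned wrapper immediately with empty parentheses instead of handing it to map.

def spark_cython(module, method):
    def wrapped(*args, **kwargs):
        global cython_function_
        try:
            return cython_function_(*args, **kwargs)
        except:
            import pyximport
            pyximport.install()
            cython_function_ = getattr(__import__(module), method)
            return cython_function_(*args, **kwargs)
    return wrapped

# Pass the wrapper itself to map; writing spark_cython('fib', 'fib_mapper_cython')()
# would invoke it with zero arguments and raise the TypeError from the question.
mapper = spark_cython('fib', 'fib_mapper_cython')
results = lines.map(mapper)   # lines is a hypothetical RDD of inputs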

How to enable Tungsten optimization in Spark 2?

Submitted by 孤街醉人 on 2019-12-22 07:03:18
Question: I just built Spark 2 with Hive support and deployed it to a cluster running Hortonworks 2.3.4. However, I find that this Spark 2.0.3 is slower than the standard Spark 1.5.3 that comes with HDP 2.3. When I check explain, it seems that my Spark 2.0.3 is not using Tungsten. Do I need to create a special build to enable Tungsten?

Spark 1.5.3 Explain

== Physical Plan ==
TungstenAggregate(key=[id#2], functions=[], output=[id#2])
 TungstenExchange hashpartitioning(id#2)
  TungstenAggregate(key=[id#2], functions=
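
In Spark 2.x the physical plan no longer labels operators with a "Tungsten" prefix; Tungsten's whole-stage code generation is on by default and shows up as an asterisk in front of the generated operators instead. A minimal sketch of how one might verify this, assuming a SparkSession named spark:

spark.conf.get("spark.sql.codegen.wholeStage")   # "true" by default in Spark 2.x

df = spark.range(1000).groupBy("id").count()
df.explain()
# Operators prefixed with "*" (e.g. *HashAggregate, *Range) run inside
# whole-stage code generation, which is how Tungsten appears in Spark 2 plans.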

Partitions not being pruned in simple SparkSQL queries

Submitted by 爱⌒轻易说出口 on 2019-12-22 05:33:28
Question: I'm trying to efficiently select individual partitions from a SparkSQL table (Parquet in S3). However, I see evidence of Spark opening all Parquet files in the table, not just those that pass the filter. This makes even small queries expensive for tables with large numbers of partitions. Here's an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore:

# Make some data
df = pandas.DataFrame({'pk': ['a']*5+['b']*5+['c']*5, 'k': ['a', 'e', 'i', 'o',
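
A minimal sketch of how one might check whether partition pruning is happening, assuming a SparkSession named spark and a hypothetical local path in place of S3; for tables registered in a Hive metastore, the spark.sql.hive.metastorePartitionPruning setting is also relevant.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.hive.metastorePartitionPruning", "true")
         .getOrCreate())

df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["pk", "k"])
df.write.partitionBy("pk").parquet("/tmp/pruning_demo")   # hypothetical path

# The physical plan should show the filter on pk applied as a partition filter,
# meaning only the pk=a directory is scanned rather than every Parquet file.
spark.read.parquet("/tmp/pruning_demo").filter("pk = 'a'").explain(True)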

How to create a z-score in Spark SQL for each group

Submitted by 混江龙づ霸主 on 2019-12-22 05:30:16
Question: I have a dataframe which looks like this:

        dSc  TranAmount
 1:  100021       79.64
 2:  100021       79.64
 3:  100021        0.16
 4:  100022       11.65
 5:  100022        0.36
 6:  100022        0.47
 7:  100025        0.17
 8:  100037        0.27
 9:  100056        0.27
10:  100063        0.13
11:  100079        0.13
12:  100091        0.15
13:  100101        0.22
14:  100108        0.14
15:  100109        0.04

Now I want to create a third column with the z-score of each TranAmount, which will be (TranAmount - mean(TranAmount)) / StdDev(TranAmount), where the mean and standard deviation are computed within each dSc group. Now
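
A minimal sketch of one way to add such a per-group z-score with window functions, assuming the dataframe is named df and has columns dSc and TranAmount:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("dSc")
df_z = df.withColumn(
    "zscore",
    (F.col("TranAmount") - F.avg("TranAmount").over(w)) / F.stddev("TranAmount").over(w))
df_z.show()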

IllegalArgumentException thrown when calling count and collect functions in Spark

Submitted by 徘徊边缘 on 2019-12-22 05:11:44
Question: I tried to load a small dataset on local Spark, and this exception is thrown when I use count() in PySpark (take() seems to work). I tried to search for this issue but had no luck figuring out why. It seems something is wrong with the partitioning of the RDD. Any ideas? Thank you in advance!

sc.stop()
sc = SparkContext("local[4]", "temp")
testfile1 = sc.textFile(localpath('part-00000-Copy1.xml'))
testfile1.filter(lambda x: x.strip().encode('utf-8').startswith(b'<row')).take(1) ## take
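
Since count() evaluates every partition while take(1) may only touch the first one, a hedged debugging sketch is to inspect the partitions individually and see where the failure appears (reusing sc, the localpath helper, and the file name from the question):

testfile1 = sc.textFile(localpath('part-00000-Copy1.xml'))
filtered = testfile1.filter(lambda x: x.strip().encode('utf-8').startswith(b'<row'))

print(filtered.getNumPartitions())          # how many partitions count() has to evaluate
print(filtered.glom().map(len).collect())   # records per partition; a bad partition will raise here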