pyspark

Running tasks in parallel on separate Hive partitions using Scala and Spark to speed up loading Hive and writing results to Hive or Parquet

Submitted by 左心房为你撑大大i on 2019-12-24 16:12:18
Question: This question is a spin-off from [this one](saving a list of rows to a Hive table in pyspark). EDIT: please see my update edits at the bottom of this post. I have used both Scala and now PySpark to do the same task, but I am having problems with very slow saves of a dataframe to Parquet or CSV, or with converting a dataframe to a list or array type data structure. Below is the relevant Python/PySpark code and info:
#Table is a List of Rows from small Hive table I loaded using
#query = "SELECT *
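A minimal PySpark sketch of the write path under discussion, assuming Spark 2.x with a hypothetical database/table name and partition column; keeping the save inside the DataFrameWriter instead of collecting rows into a Python list is usually what avoids the very slow saves:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Load the small Hive table directly as a DataFrame (names are placeholders).
df = spark.sql("SELECT * FROM my_db.small_table")

# Write straight to Parquet, partitioned by a placeholder column, without
# first pulling the rows back to the driver as a list.
df.write.mode("overwrite").partitionBy("part_col").parquet("/tmp/small_table_parquet")

# Or write back into Hive as a managed table.
df.write.mode("overwrite").saveAsTable("my_db.results")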

Set SPARK-HOME path variable in windows and pycharm

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-24 14:42:00
Question: I am new to Spark and trying to use it on Windows. I was able to successfully download and install Spark 1.4.1 using the pre-built version with Hadoop. In the directory /my/spark/directory/bin I can run spark-shell and pyspark.cmd and everything works fine. The only problem I am dealing with is that I want to import pyspark while coding in PyCharm. Right now I am using the following code to make things work:
import sys
import os
from operator import add
os.environ['SPARK_HOME'
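A sketch of the usual workaround, with the install path and the bundled py4j version as assumptions to adjust to the local machine; it sets SPARK_HOME and puts Spark's own python directory on sys.path before importing pyspark:

import os
import sys

# Point at the local Spark install (this Windows path is only an example).
os.environ['SPARK_HOME'] = 'C:/my/spark/directory'

# Make the bundled pyspark and py4j packages importable (the py4j version
# differs between Spark releases; check the python/lib folder of your install).
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python', 'lib', 'py4j-0.8.2.1-src.zip'))

from pyspark import SparkContext

sc = SparkContext('local[*]', 'pycharm-test')
print(sc.parallelize(range(10)).sum())

In PyCharm the same two entries (SPARK_HOME and PYTHONPATH) can also be set once in the run configuration's environment variables instead of in code.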

Using LSH in spark to run nearest neighbors query on every point in dataframe

Submitted by 。_饼干妹妹 on 2019-12-24 12:34:45
Question: I need the k nearest neighbors for each feature vector in the dataframe. I'm using BucketedRandomProjectionLSHModel from pyspark. Code for creating the model:
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", seed=12345, bucketLength=n)
model = brp.fit(data_df)
df_lsh = model.transform(data_df)
Now, how do I run an approximate nearest neighbor query for each point in data_df? I have tried broadcasting the model but got a pickle error. Also, defining a udf to access the model gives
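One hedged way around the pickling problem is to avoid calling the model inside a udf at all and instead run approxSimilarityJoin of the dataframe with itself, then rank the candidate pairs per point. The id column, threshold, bucketLength, and k below are assumptions:

from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.sql import Window
import pyspark.sql.functions as F

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  seed=12345, bucketLength=2.0)
model = brp.fit(data_df)

# A self-join produces candidate neighbor pairs for every row in one
# distributed job, so the model never has to be pickled into a udf.
pairs = (model.approxSimilarityJoin(data_df, data_df, threshold=1.5, distCol="dist")
         .filter(F.col("datasetA.id") != F.col("datasetB.id")))

# Keep the k closest candidates per point (k=5 here is arbitrary).
w = Window.partitionBy("datasetA.id").orderBy("dist")
k_nearest = pairs.withColumn("rn", F.row_number().over(w)).filter(F.col("rn") <= 5)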

Using keras model in pyspark lambda map function

Submitted by 爷,独闯天下 on 2019-12-24 12:14:38
Question: I want to use the model to predict scores in a map lambda function in PySpark.
def inference(user_embed, item_embed):
    feats = user_embed + item_embed
    dnn_model = load_model("best_model.h5")
    infer = dnn_model.predict(np.array([feats]), verbose=0, steps=1)
    return infer

iu_score = iu.map(lambda x: Row(userid=x.userid, entryid=x.entryid, score=inference(x.user_embed, x.item_embed)))
The run is extremely slow and it gets stuck at the final stage shortly after the code starts running. [Stage 119:========
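A likely cause of the slowness is that load_model("best_model.h5") runs once per record inside the lambda. A hedged sketch of the usual fix, loading the model once per partition with mapPartitions; it assumes the .h5 file is available on every executor (for example shipped with --files) and that the model returns a single scalar per example:

import numpy as np
from pyspark.sql import Row

def score_partition(rows):
    # Import and load inside the function so it happens on the executor,
    # and only once per partition rather than once per row.
    from keras.models import load_model
    dnn_model = load_model("best_model.h5")
    for x in rows:
        feats = np.array([x.user_embed + x.item_embed])
        score = float(dnn_model.predict(feats, verbose=0)[0][0])  # assumes scalar output
        yield Row(userid=x.userid, entryid=x.entryid, score=score)

iu_score = iu.mapPartitions(score_partition)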

Spark structured streaming with python

Submitted by 那年仲夏 on 2019-12-24 11:56:23
Question: I am trying out Spark Structured Streaming with Kafka and Python. Requirement: I need to process streaming data from Kafka (in JSON format) in Spark (perform transformations) and then store it in a database. I have data in JSON format like:
{"a": 120.56, "b": 143.6865998138807, "name": "niks", "time": "2012-12-01 00:00:09"}
I am planning to use spark.readStream for reading from Kafka, like:
data = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option(
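A hedged end-to-end sketch of that pipeline, assuming Spark 2.4+ with the spark-sql-kafka-0-10 package on the classpath; the topic name, checkpoint path, and JDBC connection options are placeholders, and the schema mirrors the sample record above. foreachBatch lets each micro-batch be written with the ordinary batch writer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (DoubleType, StringType, StructField, StructType,
                               TimestampType)

spark = SparkSession.builder.appName("kafka-json-stream").getOrCreate()

schema = StructType([
    StructField("a", DoubleType()),
    StructField("b", DoubleType()),
    StructField("name", StringType()),
    StructField("time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "my_topic")          # topic name is a placeholder
       .load())

# Kafka delivers the payload as bytes in the `value` column; cast and parse it.
parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("d"))
          .select("d.*"))

def write_batch(batch_df, epoch_id):
    # Placeholder JDBC target; any batch writer (jdbc, parquet, ...) works here.
    (batch_df.write.mode("append").format("jdbc")
     .option("url", "jdbc:postgresql://dbhost:5432/mydb")
     .option("dbtable", "events")
     .option("user", "user").option("password", "secret")
     .save())

query = (parsed.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/kafka_stream_checkpoint")
         .start())
query.awaitTermination()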

Pyspark - How to inspect variables within RDD operations

Submitted by 白昼怎懂夜的黑 on 2019-12-24 11:44:28
Question: I used to develop in Scala Spark using IntelliJ. I was able to inspect variable contents in debug mode by setting a breakpoint, like this. I recently started a new project using pyspark with PyCharm. I found that the code does not stop at breakpoints inside Spark operations, like below. Another question is that the prompt does not give the right hint for the result of, for instance, the "map" function. It seems the IDE does not know the variable coming out of "map" is still an RDD; my guess is that it is related to the fact that the Python function does not
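Functions passed to map and friends run in separate Python worker processes on the executors, which is why the PyCharm debugger never hits breakpoints set inside them. A small sketch of a common workaround: pull a handful of records back to the driver and call the same function directly so the breakpoint fires (the transform function here is a stand-in for whatever the map is doing):

def transform(record):
    # ... the logic you actually want to step through ...
    return record

# take() runs on the driver, so a breakpoint inside transform() is hit normally.
for rec in rdd.take(5):
    result = transform(rec)

# Once it behaves as expected, apply it on the cluster as usual.
out_rdd = rdd.map(transform)

For the missing autocomplete, an explicit type hint on the result (annotating it as an RDD) may help PyCharm, since it cannot infer the return type through the dynamically typed map call.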

How to pass deep learning model data to map function in Spark

Submitted by 独自空忆成欢 on 2019-12-24 11:43:33
Question: I have a very simple use case where I am reading a large number of images as an RDD from S3 using the sc.binaryFiles method. Once this RDD is created, I pass the content inside the RDD to the VGG16 feature extractor function. I need the model data with which the feature extraction will be done, so I put the model data into a broadcast variable and then access the value in each map function. Below is the code:
s3_files_rdd = sc.binaryFiles(RESOLVED_IMAGE_PATH)
s3_files
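A hedged sketch of the broadcast pattern being described: only the weight arrays are broadcast, and the Keras graph is rebuilt once per partition on the executors, so the un-picklable model object never travels through a closure. The image decoding step is left as a hypothetical preprocess() helper:

import numpy as np
from keras.applications.vgg16 import VGG16

# On the driver: build the pretrained model once and broadcast only its weights.
weights = VGG16(weights="imagenet", include_top=False).get_weights()
bc_weights = sc.broadcast(weights)

def extract_features(partition):
    # On the executor: rebuild the architecture and install the broadcast weights.
    from keras.applications.vgg16 import VGG16 as VGG16Worker
    model = VGG16Worker(weights=None, include_top=False)
    model.set_weights(bc_weights.value)
    for path, content in partition:
        img = preprocess(content)               # hypothetical bytes -> array step
        yield path, model.predict(np.array([img]))

s3_files_rdd = sc.binaryFiles(RESOLVED_IMAGE_PATH)
features_rdd = s3_files_rdd.mapPartitions(extract_features)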

How to resolve pickle error in pyspark?

Submitted by 拈花ヽ惹草 on 2019-12-24 11:41:41
Question: I am iterating through files to gather information about the values in their columns and rows into a dictionary. I have the following code, which works locally:
def search_nulls(file_name):
    separator = ','
    nulls_dict = {}
    fp = open(file_name, 'r')
    null_cols = {}
    lines = fp.readlines()
    for n, line in enumerate(lines):
        line = line.split(separator)
        for m, data in enumerate(line):
            data = data.strip('\n').strip('\r')
            if str(m) not in null_cols:
                null_cols[str(m)] = defaultdict(lambda: 0)
            if len(data) <=
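The usual culprit in code like this is defaultdict(lambda: 0): a lambda default_factory cannot be pickled when the dictionary is shipped between the driver and the executors. A tiny sketch of the behaviour-identical, picklable replacement:

import pickle
from collections import defaultdict

broken = defaultdict(lambda: 0)   # pickle.dumps(broken) raises a PicklingError/AttributeError
fixed = defaultdict(int)          # int() == 0, so the counting behaviour is unchanged

fixed['null'] += 1
print(pickle.loads(pickle.dumps(fixed)))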

LDA model prediction nonconsistance

Submitted by 冷暖自知 on 2019-12-24 11:37:12
Question: I trained an LDA model and loaded it into the environment to transform new data:
from pyspark.ml.clustering import LocalLDAModel
lda = LocalLDAModel.load(path)
df = lda.transform(text)
The model adds a new column called topicDistribution. In my opinion, this distribution should be the same for the same input; otherwise the model is not consistent. However, in practice it is not. May I ask why, and how to fix it?
Answer 1: LDA uses randomness when training and, depending on the
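A hedged sketch of one mitigation on the training side: fix the seed so the fitted topics are reproducible, and if the same rows must always carry the same topicDistribution downstream, compute the transform once and persist the result instead of re-running it. The k, maxIter, and optimizer values are placeholders:

from pyspark.ml.clustering import LDA, LocalLDAModel

# Fixing the seed makes training reproducible; it does not by itself remove
# any sampling the chosen optimizer performs at inference time.
lda = LDA(k=10, maxIter=50, seed=1, optimizer="online")
model = lda.fit(train_df)
model.save(path)

loaded = LocalLDAModel.load(path)
topics = loaded.transform(text)

# Persist (or write out) the result once so downstream code always sees the
# same topicDistribution values instead of recomputing them.
topics.cache()
topics.write.mode("overwrite").parquet("/tmp/topic_distributions")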

Improve parallelism in spark sql

Submitted by 独自空忆成欢 on 2019-12-24 11:35:08
Question: I have the code below. I am using pyspark 1.2.1 with Python 2.7 (CPython).
for colname in shuffle_columns:
    colrdd = hive_context.sql('select %s from %s' % (colname, temp_table))
    # zip_with_random_index is expensive
    colwidx = zip_with_random_index(colrdd).map(merge_index_on_row)
    (hive_context.applySchema(colwidx, a_schema)
        .registerTempTable(a_name))
The thing about this code is that it only operates on one column at a time. I have enough nodes in my cluster that I could be operating on many
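Spark schedules jobs submitted from different driver threads concurrently, so one common way to overlap these per-column jobs is a small thread pool on the driver. A sketch under the assumption that hive_context, shuffle_columns, zip_with_random_index, merge_index_on_row, a_schema, and temp_table are as in the question; the pool size and per-column temp-table naming are placeholders:

from multiprocessing.pool import ThreadPool

def process_column(colname):
    colrdd = hive_context.sql('select %s from %s' % (colname, temp_table))
    colwidx = zip_with_random_index(colrdd).map(merge_index_on_row)
    (hive_context.applySchema(colwidx, a_schema)
        .registerTempTable('indexed_%s' % colname))

# Each thread submits its own Spark jobs; the scheduler runs them concurrently
# as long as the cluster has free executors.
pool = ThreadPool(4)
pool.map(process_column, shuffle_columns)
pool.close()
pool.join()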