pyspark

How can you push down predicates to Cassandra or limit the requested data when using PySpark / DataFrames?

眉间皱痕 submitted on 2019-12-25 04:23:20
Question: For example, the docs at docs.datastax.com mention:

    table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()

and it's the only way I know, but let's say I want to load only the last one million entries from this table. I don't want to load the whole table into memory every time, especially if the table has, for example, over 10 million entries. Thanks!

Answer 1: While you can't load the data any faster, you can load portions of the data or terminate early.
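
A minimal sketch of what "load portions or terminate early" can look like (not from the original answer): filters on partition/clustering keys are pushed down to Cassandra by the spark-cassandra-connector, and limit() stops Spark from materializing the whole table. The column name "ts" is purely illustrative.

    # Assumes the same sqlContext and "ks"."kv" table as in the question.
    table1 = (sqlContext.read
        .format("org.apache.spark.sql.cassandra")
        .options(table="kv", keyspace="ks")
        .load())

    # Filters on partition/clustering key columns can be pushed down to Cassandra,
    # so only matching rows are requested from the cluster.
    recent = table1.filter(table1.ts > "2019-01-01")

    # limit() caps how many rows Spark keeps, letting the job terminate early
    # instead of scanning everything into memory.
    sample = recent.limit(1000000)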

Parsing a JSON string column of a PySpark DataFrame where one of the fields contains an array

安稳与你 submitted on 2019-12-25 03:19:08
Question: I am trying to read a JSON file and parse the 'jsonString' field, including its nested fields (one of which is an array), into a PySpark dataframe. Here are the contents of the JSON file:

    [{"jsonString": "{\"uid\":\"value1\",\"adUsername\":\"value3\",\"courseCertifications\":[{\"uid\":\"value2\",\"courseType\":\"TRAINING\"},{\"uid\":\"TEST\",\"courseType\":\"TRAINING\"}],\"modifiedBy\":\"value4\"}","transactionId": "value5", "tableName": "X"}, {"jsonString": "{\"uid\":\"value11\",\"adUsername\":\"value13\",\
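
The post is cut off before any answer. One common approach (a sketch, not the original solution) is to define a schema for the embedded JSON string and parse it with from_json; the field names follow the sample above, while the file path and the active SparkSession named spark are assumptions.

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    # Schema for the embedded JSON string, based on the sample shown above.
    cert_schema = ArrayType(StructType([
        StructField("uid", StringType()),
        StructField("courseType", StringType()),
    ]))
    inner_schema = StructType([
        StructField("uid", StringType()),
        StructField("adUsername", StringType()),
        StructField("courseCertifications", cert_schema),
        StructField("modifiedBy", StringType()),
    ])

    # "input.json" is a hypothetical path; multiLine handles the array-of-objects layout.
    raw = spark.read.option("multiLine", True).json("input.json")

    # Parse the string column into a struct, then flatten it.
    parsed = raw.withColumn("parsed", from_json(col("jsonString"), inner_schema))
    parsed.select("transactionId", "tableName", "parsed.*").show(truncate=False)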

Batch write from Databricks to Kafka does not observe checkpoints and writes duplicates

纵饮孤独 submitted on 2019-12-25 03:12:47
Question: Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, sometimes there are errors (mostly timeouts). Retrying kicks in and processing starts over again, but this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink. So should checkpoints work in batch-writing mode at all? Or am I missing something? Config:

    EH_SASL = 'kafkashaded.org.apache.kafka
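
For reference, a hedged sketch of the batch write pattern in question (broker address, topic, and the presence of key/value columns on df are assumptions, not taken from the post). The key point it illustrates: checkpointLocation is honored by writeStream, not by a plain batch write, so the batch Kafka sink is at-least-once and a retried batch can re-send records it already produced.

    # A batch write to Kafka; df is assumed to already carry key/value columns.
    # There is no checkpoint here to "observe": .option("checkpointLocation", ...)
    # only applies to Structured Streaming queries, so duplicates on retry have
    # to be handled downstream (e.g. idempotent or deduplicating consumers).
    (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
       .write
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("topic", "my_topic")
       .save())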

Iterate over a specific column of a Spark dataframe

淺唱寂寞╮ submitted on 2019-12-25 02:28:37
Question: I want to encrypt a few columns of a Spark dataframe based on some condition. The encrypt/decrypt function below is working fine:

    from cryptography.fernet import Fernet

    def EncryptDecrypt(Encrypt, str):
        key = b'B5oRyf5Zs3P7atXIf-I5TaCeF3aM1NEILv3A7Zm93b4='
        cipher_suite = Fernet(key)
        if Encrypt is True:
            a = bytes(str, "utf-8")
            return cipher_suite.encrypt(bytes(a))
        else:
            return cipher_suite.decrypt(str)

Now, I want to iterate over a specific dataframe column to encrypt it. If the encryption condition is satisfied, I have to
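
The question is truncated before any answer. One common pattern (an assumption, not the original solution) is to wrap the function in a UDF and apply it conditionally with when/otherwise; the column names "ssn" and "needs_encryption" are hypothetical.

    from pyspark.sql.functions import udf, when, col
    from pyspark.sql.types import StringType

    # Wrap the encryption path of EncryptDecrypt so it can be applied per row;
    # decode the Fernet token bytes back to a string for storage in the column.
    encrypt_udf = udf(
        lambda s: EncryptDecrypt(True, s).decode("utf-8") if s is not None else None,
        StringType())

    # Encrypt the column only where the (hypothetical) condition column is true.
    df = df.withColumn(
        "ssn",
        when(col("needs_encryption") == True, encrypt_udf(col("ssn"))).otherwise(col("ssn")))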

PySpark OneHotEncoder vectors appear to be missing categories?

倾然丶 夕夏残阳落幕 submitted on 2019-12-25 02:21:07
Question: I'm seeing a weird problem when trying to generate one-hot encoded vectors for categorical features using PySpark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder): the one-hot vectors seem to be missing some categories (or are maybe just formatted oddly when displayed?). Having now answered this question myself, it appears that the details below are not totally relevant to understanding the problem. I have a dataset of the form 1. Wife
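
The post breaks off before the self-answer, so the following is background rather than the original explanation: with the Spark 2.x API the question links to, OneHotEncoder defaults to dropLast=True, so one category is encoded as the all-zeros vector and the vectors have length (number of categories - 1), which can look like a "missing" category. A small sketch with illustrative column names:

    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    # Toy data; "category" is an illustrative column name.
    df = spark.createDataFrame([("a",), ("b",), ("c",), ("a",)], ["category"])

    indexed = StringIndexer(inputCol="category", outputCol="category_idx").fit(df).transform(df)

    # dropLast=False keeps an explicit slot for every category instead of
    # representing the last one as the all-zeros vector.
    encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_vec", dropLast=False)
    encoder.transform(indexed).show(truncate=False)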

RDD Collect Issue

▼魔方 西西 submitted on 2019-12-25 02:19:13
Question: I configured a new system: Spark 2.3.0, Python 3.6.0. Dataframe reads and other operations work as expected, but RDD collect is failing:

    distFile = spark.sparkContext.textFile("/Users/aakash/Documents/Final_HOME_ORIGINAL/Downloads/PreloadedDataset/breast-cancer-wisconsin.csv")
    distFile.collect()

Error: py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. Traceback: Traceback (most recent call last): File "/Users/aakash
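
The traceback is cut off, so the root cause isn't visible here. A frequent source of PythonRDD.collectAndServe failures is a mismatch between the driver and worker Python interpreters; one way to rule that out (an assumption, not the original fix) is to pin both before the session is created. The interpreter path is illustrative and the file path is shortened.

    import os

    # Point driver and workers at the same interpreter; adjust the path for your machine.
    os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python3.6"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/bin/python3.6"

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-collect-check").getOrCreate()
    distFile = spark.sparkContext.textFile("breast-cancer-wisconsin.csv")
    print(distFile.take(5))  # take() avoids pulling the whole file while testing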

How to find the membership of vertices using GraphFrames, igraph, or networkx in PySpark

放肆的年华 submitted on 2019-12-25 01:49:01
Question: My input dataframe is df:

       valx    valy
    1: 600060  09283744
    2: 600131  96733110
    3: 600194  01700001

I want to create a graph treating the two columns above as an edge list, and my output should list all vertices of the graph with their membership. I have tried GraphFrames in PySpark and the networkx library too, but I am not getting the desired results. My output should look like below (basically all valx and valy values under V1, as vertices, and their membership info under V2):

    V1        V2
    600060    1
    96733110  1
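
The post is truncated before any answer. The usual GraphFrames route (offered as a sketch, not the original solution) is connectedComponents(), whose "component" column is the membership id (a numeric label per connected group, not necessarily 1, 2, 3). The checkpoint directory path is illustrative, and an active SparkSession named spark is assumed.

    from graphframes import GraphFrame
    from pyspark.sql.functions import col

    # Edges come straight from the two columns; vertices are the distinct ids from both.
    edges = df.select(col("valx").cast("string").alias("src"),
                      col("valy").cast("string").alias("dst"))
    vertices = (edges.select(col("src").alias("id"))
                     .union(edges.select(col("dst").alias("id")))
                     .distinct())

    # connectedComponents() requires a checkpoint directory to be set.
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    g = GraphFrame(vertices, edges)
    membership = g.connectedComponents()  # columns: id, component
    membership.select(col("id").alias("V1"), col("component").alias("V2")).show()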

PySpark: Filling missing values in multiple columns of one data frame with values of another data frame

末鹿安然 submitted on 2019-12-25 01:46:17
Question: I have one data frame (D1) as follows:

    col1 | col2 | col3 | col4
    22   | null | 23   | 56
    12   | 54   | 22   | 36
    48   | null | null | 45
    null | 32   | 13   | 6
    23   | null | 43   | 8
    67   | 54   | 56   | null
    null | 32   | 32   | 6
    3    | 54   | 64   | 8
    67   | 4    | 23   | null

The other data frame (D2):

    col_name | value
    col 1    | 15
    col 2    | 26
    col 3    | 38
    col 4    | 41

I want to replace the null values in each column of D1 with the value from D2 corresponding to that column. So the expected output would be:

    col1 | col2 | col3 | col4
    22
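
The post is cut off before any answer. A common approach (a sketch, not the original solution) is to collect the small lookup frame D2 into a column-to-value dict and hand it to fillna, which only touches nulls. Normalizing "col 1" to "col1" is an assumption based on the sample data.

    # Collect D2 into a plain dict of column name -> fill value.
    fill_values = {row["col_name"].replace(" ", ""): row["value"] for row in D2.collect()}

    # fillna accepts a dict keyed by column name and replaces nulls column by column.
    result = D1.fillna(fill_values)
    result.show()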

What does rdd mean in a PySpark DataFrame?

血红的双手。 submitted on 2019-12-25 01:24:36
Question: I am new to PySpark. I am wondering what rdd means for a PySpark dataframe.

    weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)

These two lines of code have the same output. I am wondering what the effect of using rdd is:

    weatherData.collect()
    weatherData.rdd.collect()

Answer 1: A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. So, a DataFrame has additional
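
The answer is truncated above; as a hedged illustration of the question itself, .rdd exposes the DataFrame's underlying RDD of Row objects, which is why both collect() calls return the same rows.

    weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)

    # .rdd gives access to the underlying RDD of Row objects.
    rows_from_df  = weatherData.collect()       # list of Row, via the DataFrame API
    rows_from_rdd = weatherData.rdd.collect()   # list of Row, via the RDD API

    print(rows_from_df == rows_from_rdd)        # same data, so typically True
    print(type(weatherData), type(weatherData.rdd))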

How do I consume a Kafka topic inside a Spark Streaming app?

ぐ巨炮叔叔 submitted on 2019-12-25 01:14:04
Question: When I create a stream from a Kafka topic and print its content:

    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="PythonStreamingKafkaWords")
    ssc = StreamingContext(sc, 10)
    lines = KafkaUtils.createDirectStream(ssc, ['sample_topic'], {"bootstrap.servers": 'localhost
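
The snippet is cut off mid-call. For reference, a hedged sketch of the full direct-stream pattern with the same (now deprecated) spark-streaming-kafka-0-8 API; the topic and broker host come from the snippet, while the port and everything after the stream creation are assumptions.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="PythonStreamingKafkaWords")
    ssc = StreamingContext(sc, 10)

    # createDirectStream yields (key, value) pairs; keep only the message values.
    stream = KafkaUtils.createDirectStream(
        ssc, ['sample_topic'], {"bootstrap.servers": "localhost:9092"})
    stream.map(lambda kv: kv[1]).pprint()

    # The streaming context must be started explicitly, or nothing is consumed.
    ssc.start()
    ssc.awaitTermination()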