pyspark

How can you push down predicates to Cassandra or limit the requested data when using PySpark / DataFrames?

眉间皱痕 submitted on 2019-12-25 04:23:20
Question: For example, the docs at docs.datastax.com mention:

    table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()

and it's the only way I know, but let's say I want to load only the last one million entries from this table. I don't want to load the whole table into memory every time, especially if the table has, for example, over 10 million entries. Thanks!

Answer 1: While you can't load the data any faster, you can load portions of the data or terminate early.
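
A minimal sketch of what "load portions or terminate early" can look like (not from the original answer): filters on partition/clustering keys are pushed down to Cassandra by the spark-cassandra-connector, and limit() stops Spark from materializing the whole table. The column name "ts" is purely illustrative.

    # Assumes the same sqlContext and "ks"."kv" table as in the question.
    table1 = (sqlContext.read
        .format("org.apache.spark.sql.cassandra")
        .options(table="kv", keyspace="ks")
        .load())

    # Filters on partition/clustering key columns can be pushed down to Cassandra,
    # so only matching rows are requested from the cluster.
    recent = table1.filter(table1.ts > "2019-01-01")

    # limit() caps how many rows Spark keeps, letting the job terminate early
    # instead of scanning everything into memory.
    sample = recent.limit(1000000)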

Parsing a JSON string column of a PySpark DataFrame where one of the fields contains an array

安稳与你 submitted on 2019-12-25 03:19:08
Question: I am trying to read a JSON file and parse the 'jsonString' field, including its nested fields (one of which is an array), into a PySpark dataframe. Here are the contents of the JSON file:

    [{"jsonString": "{\"uid\":\"value1\",\"adUsername\":\"value3\",\"courseCertifications\":[{\"uid\":\"value2\",\"courseType\":\"TRAINING\"},{\"uid\":\"TEST\",\"courseType\":\"TRAINING\"}],\"modifiedBy\":\"value4\"}","transactionId": "value5", "tableName": "X"}, {"jsonString": "{\"uid\":\"value11\",\"adUsername\":\"value13\",\
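
The post is cut off before any answer. One common approach (a sketch, not the original solution) is to define a schema for the embedded JSON string and parse it with from_json; the field names follow the sample above, while the file path and the active SparkSession named spark are assumptions.

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    # Schema for the embedded JSON string, based on the sample shown above.
    cert_schema = ArrayType(StructType([
        StructField("uid", StringType()),
        StructField("courseType", StringType()),
    ]))
    inner_schema = StructType([
        StructField("uid", StringType()),
        StructField("adUsername", StringType()),
        StructField("courseCertifications", cert_schema),
        StructField("modifiedBy", StringType()),
    ])

    # "input.json" is a hypothetical path; multiLine handles the array-of-objects layout.
    raw = spark.read.option("multiLine", True).json("input.json")

    # Parse the string column into a struct, then flatten it.
    parsed = raw.withColumn("parsed", from_json(col("jsonString"), inner_schema))
    parsed.select("transactionId", "tableName", "parsed.*").show(truncate=False)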

Batch write from Databricks to Kafka does not observe checkpoints and writes duplicates

纵饮孤独 submitted on 2019-12-25 03:12:47
Question: Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, sometimes there are errors (mostly timeouts). Retrying kicks in and processing starts over again, but this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink. So should checkpoints work in batch-writing mode at all? Or am I missing something? Config:

    EH_SASL = 'kafkashaded.org.apache.kafka
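
For reference, a hedged sketch of the batch write pattern in question (broker address, topic, and the presence of key/value columns on df are assumptions, not taken from the post). The key point it illustrates: checkpointLocation is honored by writeStream, not by a plain batch write, so the batch Kafka sink is at-least-once and a retried batch can re-send records it already produced.

    # A batch write to Kafka; df is assumed to already carry key/value columns.
    # There is no checkpoint here to "observe": .option("checkpointLocation", ...)
    # only applies to Structured Streaming queries, so duplicates on retry have
    # to be handled downstream (e.g. idempotent or deduplicating consumers).
    (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
       .write
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("topic", "my_topic")
       .save())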

Iterate over a specific column of a Spark dataframe

淺唱寂寞╮ submitted on 2019-12-25 02:28:37
Question: I want to encrypt a few columns of a Spark dataframe based on some condition. The encrypt/decrypt function below is working fine:

    from cryptography.fernet import Fernet

    def EncryptDecrypt(Encrypt, str):
        key = b'B5oRyf5Zs3P7atXIf-I5TaCeF3aM1NEILv3A7Zm93b4='
        cipher_suite = Fernet(key)
        if Encrypt is True:
            a = bytes(str, "utf-8")
            return cipher_suite.encrypt(bytes(a))
        else:
            return cipher_suite.decrypt(str)

Now, I want to iterate over a specific dataframe column to encrypt it. If the encryption condition is satisfied, I have to
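
The question is truncated before any answer. One common pattern (an assumption, not the original solution) is to wrap the function in a UDF and apply it conditionally with when/otherwise; the column names "ssn" and "needs_encryption" are hypothetical.

    from pyspark.sql.functions import udf, when, col
    from pyspark.sql.types import StringType

    # Wrap the encryption path of EncryptDecrypt so it can be applied per row;
    # decode the Fernet token bytes back to a string for storage in the column.
    encrypt_udf = udf(
        lambda s: EncryptDecrypt(True, s).decode("utf-8") if s is not None else None,
        StringType())

    # Encrypt the column only where the (hypothetical) condition column is true.
    df = df.withColumn(
        "ssn",
        when(col("needs_encryption") == True, encrypt_udf(col("ssn"))).otherwise(col("ssn")))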

PySpark OneHotEncoder vectors appear to be missing categories?

倾然丶 夕夏残阳落幕 submitted on 2019-12-25 02:21:07
Question: I'm seeing a weird problem when trying to generate one-hot encoded vectors for categorical features using PySpark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder): the one-hot vectors seem to be missing some categories (or are maybe just formatted oddly when displayed?). Having now answered this question myself, it appears that the details below are not totally relevant to understanding the problem. I have a dataset of the form 1. Wife
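
The post breaks off before the self-answer, so the following is background rather than the original explanation: with the Spark 2.x API the question links to, OneHotEncoder defaults to dropLast=True, so one category is encoded as the all-zeros vector and the vectors have length (number of categories - 1), which can look like a "missing" category. A small sketch with illustrative column names:

    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    # Toy data; "category" is an illustrative column name.
    df = spark.createDataFrame([("a",), ("b",), ("c",), ("a",)], ["category"])

    indexed = StringIndexer(inputCol="category", outputCol="category_idx").fit(df).transform(df)

    # dropLast=False keeps an explicit slot for every category instead of
    # representing the last one as the all-zeros vector.
    encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_vec", dropLast=False)
    encoder.transform(indexed).show(truncate=False)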

RDD Collect Issue

▼魔方 西西 submitted on 2019-12-25 02:19:13
Question: I configured a new system: Spark 2.3.0, Python 3.6.0. Dataframe reads and other operations work as expected, but RDD collect is failing:

    distFile = spark.sparkContext.textFile("/Users/aakash/Documents/Final_HOME_ORIGINAL/Downloads/PreloadedDataset/breast-cancer-wisconsin.csv")
    distFile.collect()

Error: py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. Traceback: Traceback (most recent call last): File "/Users/aakash
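
The traceback is cut off, so the root cause isn't visible here. A frequent source of PythonRDD.collectAndServe failures is a mismatch between the driver and worker Python interpreters; one way to rule that out (an assumption, not the original fix) is to pin both before the session is created. The interpreter path is illustrative and the file path is shortened.

    import os

    # Point driver and workers at the same interpreter; adjust the path for your machine.
    os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python3.6"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/bin/python3.6"

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-collect-check").getOrCreate()
    distFile = spark.sparkContext.textFile("breast-cancer-wisconsin.csv")
    print(distFile.take(5))  # take() avoids pulling the whole file while testing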

How to find the membership of vertices using GraphFrames, igraph, or networkx in PySpark

放肆的年华 submitted on 2019-12-25 01:49:01
Question: My input dataframe is df:

       valx    valy
    1: 600060  09283744
    2: 600131  96733110
    3: 600194  01700001

I want to create a graph treating the two columns above as an edge list, and my output should list all vertices of the graph with their membership. I have tried GraphFrames in PySpark and the networkx library too, but I am not getting the desired results. My output should look like below (basically all valx and valy values under V1, as vertices, and their membership info under V2):

    V1        V2
    600060    1
    96733110  1
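
The post is truncated before any answer. The usual GraphFrames route (offered as a sketch, not the original solution) is connectedComponents(), whose "component" column is the membership id (a numeric label per connected group, not necessarily 1, 2, 3). The checkpoint directory path is illustrative, and an active SparkSession named spark is assumed.

    from graphframes import GraphFrame
    from pyspark.sql.functions import col

    # Edges come straight from the two columns; vertices are the distinct ids from both.
    edges = df.select(col("valx").cast("string").alias("src"),
                      col("valy").cast("string").alias("dst"))
    vertices = (edges.select(col("src").alias("id"))
                     .union(edges.select(col("dst").alias("id")))
                     .distinct())

    # connectedComponents() requires a checkpoint directory to be set.
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    g = GraphFrame(vertices, edges)
    membership = g.connectedComponents()  # columns: id, component
    membership.select(col("id").alias("V1"), col("component").alias("V2")).show()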

PySpark: Filling missing values in multiple columns of one data frame with values of another data frame

末鹿安然 submitted on 2019-12-25 01:46:17
Question: I have one data frame (D1) as follows:

    col1 | col2 | col3 | col4
    22   | null | 23   | 56
    12   | 54   | 22   | 36
    48   | null | null | 45
    null | 32   | 13   | 6
    23   | null | 43   | 8
    67   | 54   | 56   | null
    null | 32   | 32   | 6
    3    | 54   | 64   | 8
    67   | 4    | 23   | null

The other data frame (D2):

    col_name | value
    col 1    | 15
    col 2    | 26
    col 3    | 38
    col 4    | 41

I want to replace the null values in each column of D1 with the value from D2 corresponding to that column. So the expected output would be:

    col1 | col2 | col3 | col4
    22
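
The post is cut off before any answer. A common approach (a sketch, not the original solution) is to collect the small lookup frame D2 into a column-to-value dict and hand it to fillna, which only touches nulls. Normalizing "col 1" to "col1" is an assumption based on the sample data.

    # Collect D2 into a plain dict of column name -> fill value.
    fill_values = {row["col_name"].replace(" ", ""): row["value"] for row in D2.collect()}

    # fillna accepts a dict keyed by column name and replaces nulls column by column.
    result = D1.fillna(fill_values)
    result.show()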

What does rdd mean in a PySpark DataFrame?

血红的双手。 submitted on 2019-12-25 01:24:36
Question: I am new to PySpark. I am wondering what rdd means for a PySpark dataframe.

    weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)

These two lines of code have the same output. I am wondering what the effect of using rdd is:

    weatherData.collect()
    weatherData.rdd.collect()

Answer 1: A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. So, a DataFrame has additional
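
The answer is truncated above; as a hedged illustration of the question itself, .rdd exposes the DataFrame's underlying RDD of Row objects, which is why both collect() calls return the same rows.

    weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)

    # .rdd gives access to the underlying RDD of Row objects.
    rows_from_df  = weatherData.collect()       # list of Row, via the DataFrame API
    rows_from_rdd = weatherData.rdd.collect()   # list of Row, via the RDD API

    print(rows_from_df == rows_from_rdd)        # same data, so typically True
    print(type(weatherData), type(weatherData.rdd))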

How do I consume a Kafka topic inside a Spark Streaming app?

ぐ巨炮叔叔 submitted on 2019-12-25 01:14:04
Question: When I create a stream from a Kafka topic and print its content:

    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="PythonStreamingKafkaWords")
    ssc = StreamingContext(sc, 10)
    lines = KafkaUtils.createDirectStream(ssc, ['sample_topic'], {"bootstrap.servers": 'localhost
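
The snippet is cut off mid-call. For reference, a hedged sketch of the full direct-stream pattern with the same (now deprecated) spark-streaming-kafka-0-8 API; the topic and broker host come from the snippet, while the port and everything after the stream creation are assumptions.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="PythonStreamingKafkaWords")
    ssc = StreamingContext(sc, 10)

    # createDirectStream yields (key, value) pairs; keep only the message values.
    stream = KafkaUtils.createDirectStream(
        ssc, ['sample_topic'], {"bootstrap.servers": "localhost:9092"})
    stream.map(lambda kv: kv[1]).pprint()

    # The streaming context must be started explicitly, or nothing is consumed.
    ssc.start()
    ssc.awaitTermination()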