pyspark

Load Spark RDD to Neo4j in Python

Submitted by 这一生的挚爱 on 2020-01-13 07:05:49
Question: I am working on a project where I am using Spark for data processing. My data is now processed and I need to load it into Neo4j, which I will then use to showcase the results. I want the whole implementation to be done in Python, but I couldn't find any library or example on the net. Can you please help with links, libraries, or any example? My RDD is a PairedRDD, and for every tuple I have to create a relationship:

    PairedRDD
    Key     Value
    Jack    [a, b, c]
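
There is no Neo4j writer bundled with PySpark, so one common pattern is to push each partition of the RDD through the official neo4j Python driver. The sketch below is only illustrative: the Bolt URI, credentials, node labels (Person, Item), and relationship type (HAS) are assumptions, not part of the original question.

    from neo4j import GraphDatabase

    NEO4J_URI = "bolt://localhost:7687"   # assumption: local Bolt endpoint
    NEO4J_AUTH = ("neo4j", "password")    # assumption: replace with real credentials

    def write_partition(rows):
        # Open one driver/session per partition, not per record, to limit connection churn.
        driver = GraphDatabase.driver(NEO4J_URI, auth=NEO4J_AUTH)
        with driver.session() as session:
            for key, values in rows:
                for value in values:
                    session.run(
                        "MERGE (p:Person {name: $key}) "
                        "MERGE (i:Item {name: $value}) "
                        "MERGE (p)-[:HAS]->(i)",
                        key=key, value=value,
                    )
        driver.close()

    # pair_rdd is the PairedRDD from the question, e.g. [("Jack", ["a", "b", "c"])]
    pair_rdd.foreachPartition(write_partition)

Note that foreachPartition runs on the executors, so the Neo4j server must be reachable from every worker node.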

Combining csv files with mismatched columns

Submitted by 有些话、适合烂在心里 on 2020-01-13 06:30:08
Question: I need to combine multiple csv files into one object (a dataframe, I assume), but they all have mismatched columns, like so:

    CSV A: store_location_key | product_key | collector_key | trans_dt | sales | units | trans_key
    CSV B: collector_key | trans_dt | store_location_key | product_key | sales | units | trans_key
    CSV C: collector_key | trans_dt | store_location_key | product_key | sales | units | trans_id

On top of that, I need these to match with two additional csv files that have a matching
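
A minimal sketch of one way to handle the mismatched columns, assuming the files have headers and using hypothetical paths: read each file separately, rename the column that differs in CSV C, and combine with unionByName (Spark 2.3+), which matches columns by name rather than position.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical file locations; the question does not give the real paths.
    df_a = spark.read.csv("csv_a.csv", header=True, inferSchema=True)
    df_b = spark.read.csv("csv_b.csv", header=True, inferSchema=True)
    df_c = spark.read.csv("csv_c.csv", header=True, inferSchema=True)

    # Align the column that is named differently in CSV C.
    df_c = df_c.withColumnRenamed("trans_id", "trans_key")

    # unionByName matches columns by name, so the differing column order
    # across the three files does not matter.
    combined = df_a.unionByName(df_b).unionByName(df_c)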

PySpark: How to use a row value from one column to access another column that has the same name as the row value

Submitted by 荒凉一梦 on 2020-01-13 06:18:11
Question: I have a PySpark df:

    +---+---+---+---+---+---+---+---+
    | id| a1| b1| c1| d1| e1| f1|ref|
    +---+---+---+---+---+---+---+---+
    |  0|  1| 23|  4|  8|  9|  5| b1|
    |  1|  2| 43|  8| 10| 20| 43| e1|
    |  2|  3| 15|  0|  1| 23|  7| b1|
    |  3|  4|  2|  6| 11|  5|  8| d1|
    |  4|  5|  6|  7|  2|  8|  1| f1|
    +---+---+---+---+---+---+---+---+

I eventually want to create another column "out" whose values are based on the "ref" column. For example, in the first row the ref column has b1 as its value. In the "out" column I would like to see column "b1"
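
A sketch of one way to build "out": generate a when(...) expression per candidate column and collapse them with coalesce, so each row picks up the value of whichever column its ref names. The column list is taken from the example above.

    from pyspark.sql import functions as F

    candidate_cols = ["a1", "b1", "c1", "d1", "e1", "f1"]

    # For each candidate column, emit its value only when "ref" names it;
    # coalesce then keeps the single non-null result per row.
    out_expr = F.coalesce(*[F.when(F.col("ref") == c, F.col(c)) for c in candidate_cols])

    df = df.withColumn("out", out_expr)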

Shipping and using virtualenv in a pyspark job

Submitted by 我只是一个虾纸丫 on 2020-01-13 01:52:13
Question: PROBLEM: I am attempting to run a spark-submit script from my local machine to a cluster of machines. The work done by the cluster uses numpy. I currently get the following error:

    ImportError: Importing the multiarray numpy extension module failed.  Most
    likely you are trying to import a failed build of numpy.  If you're working
    with a numpy git repo, try `git clean -xdf` (removes all files not under
    version control).  Otherwise reinstall numpy.

    Original error was: cannot import name multiarray
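
This usually means the Python environment on the executors does not contain a working numpy build. A common fix, sketched here under the assumption of a YARN cluster, is to pack a virtualenv (with venv-pack or conda-pack), ship the archive via --archives, and point the driver and executors at the interpreter inside it. The archive name, alias, and script name are placeholders.

    # Build and pack the environment locally (venv-pack is one option; conda-pack is another).
    python -m venv pyspark_env
    pyspark_env/bin/pip install numpy venv-pack
    pyspark_env/bin/venv-pack -o pyspark_env.tar.gz

    # Ship it with the job and point Spark at the interpreter inside the unpacked archive.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --archives pyspark_env.tar.gz#environment \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
      --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
      my_job.py

The packed interpreter must be compatible with the OS and architecture of the worker nodes, otherwise compiled extensions like numpy's multiarray will still fail to import.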

Spark ML end-to-end walkthrough: LogisticRegression (Part 1)

Submitted by ╄→гoц情女王★ on 2020-01-13 00:59:19
Spark ML end-to-end walkthrough with LogisticRegression

2.1 Load the data; create transformers and an estimator (BIRTH_PLACE is one-hot encoded; VectorAssembler accepts the following input column types: all numeric types, boolean, and vector)
2.3 Create a pipeline and fit the model; split into training and test sets with randomSplit
2.4 Evaluate the model
2.5 Save the model; save the pipeline; load the model

    # 2.1 Load the data
    import pyspark.sql.types as typ

    labels = [
        ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),
        ('BIRTH_PLACE', typ.StringType()),
        ('MOTHER_AGE_YEARS', typ.IntegerType()),
        ('FATHER_COMBINED_AGE', typ.IntegerType()),
        ('CIG_BEFORE', typ.IntegerType()),
        ('CIG_1_TRI', typ.IntegerType()),
        ('CIG_2_TRI', typ.IntegerType()),
        ('CIG_3_TRI', typ
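
For context, a hedged sketch of how the remaining outline steps (2.2 through 2.5) typically fit together in pyspark.ml. It assumes the data has been loaded into a DataFrame named births and that BIRTH_PLACE has already been cast to an integer code column BIRTH_PLACE_INT; the hyperparameters, seed, and save paths are illustrative.

    import pyspark.ml.feature as ft
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml import Pipeline, PipelineModel
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # 2.2 Transformers and estimator: one-hot encode the birth place, assemble features.
    encoder = ft.OneHotEncoder(inputCol='BIRTH_PLACE_INT', outputCol='BIRTH_PLACE_VEC')
    features_creator = ft.VectorAssembler(
        inputCols=['MOTHER_AGE_YEARS', 'FATHER_COMBINED_AGE',
                   'CIG_BEFORE', 'CIG_1_TRI', 'CIG_2_TRI', 'CIG_3_TRI',
                   encoder.getOutputCol()],
        outputCol='features')
    logistic = LogisticRegression(maxIter=10, regParam=0.01,
                                  labelCol='INFANT_ALIVE_AT_REPORT')

    # 2.3 Pipeline, train/test split, fit.
    pipeline = Pipeline(stages=[encoder, features_creator, logistic])
    births_train, births_test = births.randomSplit([0.7, 0.3], seed=666)
    model = pipeline.fit(births_train)

    # 2.4 Evaluate on the held-out set.
    evaluator = BinaryClassificationEvaluator(rawPredictionCol='probability',
                                              labelCol='INFANT_ALIVE_AT_REPORT')
    print(evaluator.evaluate(model.transform(births_test),
                             {evaluator.metricName: 'areaUnderROC'}))

    # 2.5 Persist and reload (hypothetical paths).
    pipeline.save('infant_pipeline')
    model.write().overwrite().save('infant_model')
    loaded_model = PipelineModel.load('infant_model')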

What is the most efficient way to do a sorted reduce in PySpark?

Submitted by 浪子不回头ぞ on 2020-01-12 07:35:28
Question: I am analyzing on-time performance records of US domestic flights from 2015. I need to group by tail number and store a date-sorted list of all the flights for each tail number in a database, to be retrieved by my application. I am not sure which of two options for achieving this is the best one.

    # Load the parquet file
    on_time_dataframe = sqlContext.read.parquet('../data/on_time_performance.parquet')

    # Filter down to the fields we need to identify and link to a flight
    flights = on_time
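
One way to get the date-sorted grouping without an RDD-level groupByKey, sketched under the assumption of column names such as TailNum and FlightDate (the real field names in the parquet file may differ): collect each tail number's flights as structs and sort the collected array, which orders by the leading struct field.

    from pyspark.sql import functions as F

    # Column names are assumptions based on the on-time performance dataset.
    flights = on_time_dataframe.select('TailNum', 'FlightDate', 'FlightNum', 'Origin', 'Dest')

    sorted_flights = (
        flights
        .withColumn('flight', F.struct('FlightDate', 'FlightNum', 'Origin', 'Dest'))
        .groupBy('TailNum')
        .agg(F.sort_array(F.collect_list('flight')).alias('flights'))
    )

Because FlightDate is the first field of the struct, sort_array yields each tail number's flights in date order, ready to be written out to the serving database.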

How to create dataframe from list in Spark SQL?

Submitted by 别来无恙 on 2020-01-12 06:41:40
Question: Spark version: 2.1. For example, in pyspark I create a list

    test_list = [['Hello', 'world'], ['I', 'am', 'fine']]

How do I then create a dataframe from test_list, where the dataframe's type is like below:

    DataFrame[words: array<string>]

Answer 1: Here is how:

    from pyspark.sql.types import *

    cSchema = StructType([StructField("WordList", ArrayType(StringType()))])

    # notice extra square brackets around each element of the list
    test_list = [['Hello', 'world']], [['I', 'am', 'fine']]

    df = spark.createDataFrame(test_list, schema=cSchema)
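
An alternative sketch that keeps the original test_list untouched and instead wraps each inner list in a one-element tuple, so each row has a single array-typed field matching the same schema:

    test_list = [['Hello', 'world'], ['I', 'am', 'fine']]
    df = spark.createDataFrame([(words,) for words in test_list], schema=cSchema)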

Pyspark: filter dataframe by regex with string formatting?

Submitted by 孤街醉人 on 2020-01-12 01:47:09
Question: I've read several posts on using the "like" operator to filter a spark dataframe by the condition of containing a string/expression, but was wondering whether the following is a "best practice" for using %s in the desired condition:

    input_path = <s3_location_str>
    my_expr = "Arizona.*hot"  # a regex expression
    dx = sqlContext.read.parquet(input_path)

    # "keyword" is a field in dx
    # is the following correct?
    substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
    dk = dx.filter("keyword
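
If the goal is really a regular-expression match rather than a SQL LIKE pattern, a simpler sketch is to skip the string formatting entirely and use Column.rlike, which takes the regex directly:

    from pyspark.sql import functions as F

    my_expr = "Arizona.*hot"  # the regex from the question

    # No % wildcards or formatting into the SQL string are needed;
    # rlike applies the regular expression to the column value.
    dk = dx.filter(F.col("keyword").rlike(my_expr))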

Preserve index-string correspondence with Spark's StringIndexer

Submitted by China☆狼群 on 2020-01-12 01:44:06
Question: Spark's StringIndexer is quite useful, but it's common to need to retrieve the correspondence between the generated index values and the original strings, and it seems like there should be a built-in way to accomplish this. I'll illustrate using this simple example from the Spark documentation:

    from pyspark.ml.feature import StringIndexer

    df = sqlContext.createDataFrame(
        [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
        ["id", "category"])
    indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
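
The built-in way does exist: the fitted StringIndexerModel exposes the mapping through its labels attribute (position i holds the original string for index i, ordered by descending frequency by default), and IndexToString reverses the encoding on a column. A short sketch based on the example above:

    from pyspark.ml.feature import IndexToString

    model = indexer.fit(df)
    print(model.labels)            # expected for this data: ['a', 'c', 'b']

    indexed = model.transform(df)

    # Map the numeric indices back to the original category strings.
    converter = IndexToString(inputCol="categoryIndex",
                              outputCol="originalCategory",
                              labels=model.labels)
    restored = converter.transform(indexed)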