pyspark

pyspark similar-article recommendation: Word2Vec + Tfidf + LSH

Submitted by 自古美人都是妖i on 2020-03-06 01:36:55
Purpose: I have been studying LSH methods recently and found few pyspark implementations, so I put together a local implementation based on the 黑马头条 recommendation-system practice videos. Full project source code: https://github.com/angeliababy/text_LSH Project blog post: https://blog.csdn.net/qq_29153321/article/details/104680282

Algorithm: this section describes how to measure article similarity from article keywords, combining Word2Vec + Tfidf + LSH:
1. Train word vectors for the articles with Word2Vec.
2. Use Tfidf to extract each article's keywords and their weights.
3. Multiply each keyword's weight by its word vector and average the results to form the article vector used as the training set.
4. Use LSH to compute pairwise article similarity.

For massive data, finding the articles most similar to the current one by computing the Euclidean distance between every pair of article vectors is clearly impractical, so LSH is used for the similarity search. LSH (locality-sensitive hashing) is mainly used for similarity search over massive data. Translated from the Spark documentation: the general idea of LSH is to hash data points into buckets with a family of functions, so that points close to each other land in the same bucket with high probability, while points far apart are very likely to land in different buckets. Spark's LSH supports Euclidean distance and Jaccard distance; Euclidean distance is the more widely used and is what is used here.

Practice. Part of the raw data: news_data:

Step 1: obtain the tokenized data. Only the articles of a single channel are processed, which keeps the similarity computation manageable.

# Chinese word segmentation
def
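A minimal, hedged sketch of this pipeline in pyspark is shown below (the toy corpus, the column names article_id/words, and all parameter values are illustrative, not taken from the project). To keep it short, Word2Vec's built-in per-document averaging stands in for the Tfidf keyword-weighted average of steps 2-3, while BucketedRandomProjectionLSH performs the Euclidean-distance similarity join of step 4.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import Word2Vec, BucketedRandomProjectionLSH

spark = SparkSession.builder.appName("article-lsh-sketch").getOrCreate()

# Each row: an article id and its list of tokens (already segmented).
docs = spark.createDataFrame(
    [(1, ["spark", "lsh", "similarity", "search"]),
     (2, ["spark", "word2vec", "vector", "similarity"]),
     (3, ["cooking", "recipe", "dinner", "kitchen"])],
    ["article_id", "words"],
)

# 1. Train word vectors; Word2VecModel also emits an averaged vector per document.
w2v = Word2Vec(vectorSize=64, minCount=1, inputCol="words", outputCol="features")
doc_vecs = w2v.fit(docs).transform(docs)

# 2. Hash the document vectors into buckets, then search for near neighbours.
lsh = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = lsh.fit(doc_vecs)

# Pairs of articles whose Euclidean distance is below the threshold.
pairs = model.approxSimilarityJoin(doc_vecs, doc_vecs, threshold=10.0, distCol="dist")
(pairs.filter("datasetA.article_id < datasetB.article_id")
      .select(F.col("datasetA.article_id").alias("id_a"),
              F.col("datasetB.article_id").alias("id_b"),
              "dist")
      .show())
```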

Error trying to access AWS S3 using Pyspark

Submitted by 自古美人都是妖i on 2020-03-05 03:38:09
Question: I am trying to access gzip files from AWS S3 using Spark. I have a very simple script below. I first started off with an IAM user with access permissions to the S3 bucket. Then I created an EC2 instance and installed Python and Spark. I set up the spark.properties file as below; I only copied the jar files and didn't bother to go through the entire Hadoop installation. Then I realized I have to create an IAM role for EC2 instances to access S3. So, I created an IAM role, attached an access policy and
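A hedged sketch of what the read usually ends up looking like (the bucket path is a placeholder, and choosing the instance-profile credentials provider is an assumption, not taken from the question): with the instance role attached and hadoop-aws plus the matching AWS SDK jar on the classpath, the s3a connector picks up the role's credentials and Spark decompresses .gz input transparently.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-gzip-read")
    # hadoop-aws and the matching aws-java-sdk jar must already be on the classpath.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    .getOrCreate()
)

# .gz files are decompressed transparently based on the file extension.
df = spark.read.text("s3a://my-example-bucket/logs/*.gz")
df.show(5, truncate=False)
```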

Convert Pyspark dataframe to dictionary

Submitted by  ̄綄美尐妖づ on 2020-03-05 02:53:25
Question: I'm trying to convert a Pyspark dataframe into a dictionary. Here's the sample CSV file:

Col0, Col1
A153534,BDBM40705
R440060,BDBM31728
P440245,BDBM50445050

I've come up with this code:

from rdkit import Chem
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
df = spark.read.csv("gs://my-bucket/my_file.csv")  # has two columns

# Creating list
to_list = map(lambda row:
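One common way to do this, shown as a hedged sketch with the sample rows above hard-coded instead of the gs:// file: map each row to a key/value pair and collect it as a dict on the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A153534", "BDBM40705"), ("R440060", "BDBM31728"), ("P440245", "BDBM50445050")],
    ["Col0", "Col1"],
)

# collectAsMap() pulls the whole mapping onto the driver, so it only suits small data.
mapping = df.rdd.map(lambda row: (row["Col0"], row["Col1"])).collectAsMap()
print(mapping)  # {'A153534': 'BDBM40705', ...}
```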

PySpark - Numpy Not Found in Cluster Mode - ModuleNotFoundError

Submitted by 旧街凉风 on 2020-03-05 02:08:06
Question: I'm running a job on a PySpark cluster for the first time. It runs perfectly in standalone mode on the name node. However, when it runs on the cluster:

spark-submit --master yarn \
  --deploy-mode client \
  --driver-memory 6g \
  --executor-memory 6g \
  --executor-cores 2 \
  --num-executors 10 \
  nearest_neighbor.py

it begins complaining that numpy isn't installed:

from pyspark.ml.param.shared import *
File "/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1582692915671_0024
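This usually means the Python interpreter the executors run is not the one that has numpy installed. A small hedged probe like the one below (all names are illustrative) reports, per task, which interpreter is used and whether it can import numpy; the fix is then to install numpy for that interpreter on every node, or to point spark.pyspark.python / PYSPARK_PYTHON at an environment that has it.

```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def probe(_):
    """Report the interpreter path and numpy availability on this executor."""
    try:
        import numpy
        return [(sys.executable, numpy.__version__)]
    except ImportError:
        return [(sys.executable, "numpy missing")]

# One task per partition, so each line of output comes from some executor.
for interpreter, status in (
        spark.sparkContext.parallelize(range(8), 8).mapPartitions(probe).collect()):
    print(interpreter, status)
```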

Check if an array contains an array

Submitted by 泪湿孤枕 on 2020-03-05 00:24:47
Question: I have a dataset:

number | matricule<array> | name<array> | model<array>
AA     | []               | [7]         | [7]
AA     | [9]              | [4]         | [9]
AA     | [8]              | [2]         | [8, 2]
AA     | [2]              | [3, 4]      | [3, 4]

I would like to add a new column "Falg" that contains true or false
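The question is cut off above, so the rule used here is an assumption: the sketch below adds a boolean column (called "Flag"; the excerpt spells it "Falg") that is true when every element of model also appears in matricule or name, using the Spark 2.4+ array functions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("AA", [], [7], [7]),
     ("AA", [9], [4], [9]),
     ("AA", [8], [2], [8, 2]),
     ("AA", [2], [3, 4], [3, 4])],
    "number string, matricule array<int>, name array<int>, model array<int>",
)

# model is "contained" when removing everything in matricule ∪ name leaves it empty.
df = df.withColumn(
    "Flag",
    F.size(F.array_except("model", F.array_union("matricule", "name"))) == 0,
)
df.show()
```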

Compare two datasets in pyspark

Submitted by ぐ巨炮叔叔 on 2020-03-04 15:34:23
Question: I have 2 datasets. Example dataset 1:

id   | model | first_name | last_name
1234 | 32    | 456765     | [456700, 987565]
4539 | 20    | 123211     | [893456, 123456]

Sometimes one of the columns first_name and last_name is empty. Example dataset 2:

number | matricule | name | model
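The comparison rule is cut off above, so the sketch below is only a hedged starting point under an assumed rule: explode dataset 2's model array and left-join dataset 1 on the model value, flagging which ids found a counterpart. All sample values are placeholders shaped like the excerpts.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

ds1 = spark.createDataFrame(
    [(1234, 32, "456765", ["456700", "987565"]),
     (4539, 20, "123211", ["893456", "123456"])],
    "id long, model long, first_name string, last_name array<string>",
)
ds2 = spark.createDataFrame(
    [("AA", ["8"], ["2"], [32, 2])],
    "number string, matricule array<string>, name array<string>, model array<long>",
)

# Flatten ds2.model so the two frames can meet on a scalar key.
ds2_flat = ds2.select("number", F.explode("model").alias("model"))

result = (ds1.join(ds2_flat, on="model", how="left")
             .withColumn("has_match", F.col("number").isNotNull()))
result.show(truncate=False)
```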

How to manage physical data placement of a dataframe across the cluster with pyspark?

Submitted by 佐手、 on 2020-03-04 05:07:57
Question: Say I have a pyspark dataframe 'data' as follows. I want to partition the data by "Period"; more precisely, I want each period of data to be stored in its own partition (see the example below the 'data' dataframe).

data = sc.parallelize([[1,1,0,14277.4,0], \
    [1,2,0,14277.4,0], \
    [2,1,0,4741.91,0], \
    [2,2,0,4693.03,0], \
    [3,1,2,9565.93,0], \
    [3,2,2,9566.05,0], \
    [4,2,0,462.68,0], \
    [5,1,1,3549.66,0], \
    [5,2,5,3549.66,1], \
    [6,1,1,401.52,0], \
    [6,2,0,401.52,0], \
    [7,1,0,1886.24,0], \
    [7,2,0,1886
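A hedged sketch of the two usual answers (only the "Period" column name comes from the question; the other column names are invented): repartition("Period") shuffles so that all rows of one period land in the same in-memory partition, while partitionBy("Period") on write gives each period its own directory on disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [(1, 1, 0, 14277.4, 0), (1, 2, 0, 14277.4, 0),
     (2, 1, 0, 4741.91, 0), (2, 2, 0, 4693.03, 0),
     (3, 1, 2, 9565.93, 0), (3, 2, 2, 9566.05, 0)],
    ["Period", "Unit", "Qty", "Amount", "Flag"],
)

# Hash-partition by Period; an optional numPartitions argument (or
# spark.sql.shuffle.partitions) controls how many partitions come out.
by_period = data.repartition("Period")
print(by_period.rdd.getNumPartitions())

# On disk, partitionBy writes one sub-directory per distinct Period value.
by_period.write.mode("overwrite").partitionBy("Period").parquet("/tmp/data_by_period")
```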

AWS Glue automatic job creation

Submitted by 試著忘記壹切 on 2020-03-03 10:12:10
Question: I have a pyspark script which I can run in AWS Glue, but every time I create the job from the UI and copy my code into it. Is there any way I can automatically create the job from my file in the S3 bucket? (I have all the libraries and the Glue context that will be used while running.)

Answer 1: Another alternative is to use AWS CloudFormation. You can define all the AWS resources you want to create (not only Glue jobs) in a template file and then update the stack whenever you need, from the AWS Console or using the CLI.
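As a concrete alternative to clicking through the console (the job name, IAM role, region, and S3 paths below are placeholders), a Glue job can also be created programmatically with boto3; the script stays in S3 and the job definition only references it.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

response = glue.create_job(
    Name="my-pyspark-job",
    Role="MyGlueServiceRole",                       # IAM role Glue assumes at run time
    Command={
        "Name": "glueetl",                          # Spark-based Glue job type
        "ScriptLocation": "s3://my-bucket/scripts/my_script.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--extra-py-files": "s3://my-bucket/libs/deps.zip"},
    GlueVersion="2.0",
)
print(response["Name"])
```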

Writing data into MySQL with Spark SQL

Submitted by 醉酒当歌 on 2020-03-02 17:54:07
First define the schema (the table header), then build the content: the rows go through Row and are converted into a DataFrame, the content is combined with the schema, and the result is inserted into MySQL.

#!/usr/bin/env python3
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(conf=SparkConf()).getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),      # True means the field can be null
    StructField("name", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("age", IntegerType(), True),
])

studentRDD = spark.sparkContext.parallelize(["3
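A hedged sketch of the rest of the flow described above (the sample rows, database, table name, and credentials are placeholders, and the MySQL JDBC driver jar must be on Spark's classpath): build Rows from the raw strings, apply the schema, and append through the JDBC writer. The schema is restated so the block is self-contained.

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("age", IntegerType(), True),
])

# Placeholder rows in "id name gender age" form; split and convert each field.
rows = (spark.sparkContext.parallelize(["3 Alice F 26", "4 Bob M 27"])
        .map(lambda line: line.split(" "))
        .map(lambda p: Row(int(p[0]), p[1].strip(), p[2].strip(), int(p[3]))))

df = spark.createDataFrame(rows, schema)

# Append into an existing MySQL table over JDBC.
df.write.jdbc(
    url="jdbc:mysql://localhost:3306/spark?useSSL=false",
    table="student",
    mode="append",
    properties={"user": "root", "password": "secret",
                "driver": "com.mysql.jdbc.Driver"},
)
```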