sklearn is used in industry for small and medium-sized datasets, and xgboost and lightgbm see very heavy industrial use. But once data volume genuinely reaches hundreds of millions or billions of rows, sklearn struggles to cope; a big-data tool such as Spark can solve this.
Spark can now do data processing at that scale — feature engineering, building supervised and unsupervised models — as long as the compute resources are sufficient (the big-data layer handles the distributed processing).
Note: Spark ships its ML tooling in two forms, RDD-based and DataFrame-based. Maintenance of the RDD-based library has been paused, so the DataFrame-based one is recommended.
- Continuous values: Binarizer (binarization), discretization by given boundaries, QuantileDiscretizer (by quantiles), max-absolute-value scaling, standardization, polynomial feature expansion
- Categorical values: one-hot encoding
- Text: stop-word removal, Tokenizer, CountVectorizer, TF-IDF weights, n-gram language model
- Advanced transforms: SQL transform, R formula transform
Processing continuous values
Some transformers need fit before transform; others do not.
Transformers that transform directly usually don't need to scan the data first — binarization, for example, only needs a threshold.
1.1 Binarizer / binarization
# continuous-value processing
## binarization
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer
spark = SparkSession\
    .builder\
    .appName("BinarizerExample")\
    .getOrCreate()
# build a DataFrame with Spark
continuousDataFrame = spark.createDataFrame([
    (0, 1.1),
    (1, 8.5),
    (2, 5.2)
], ['id', 'feature'])
# binarizer with threshold 5.1 as the split point
binarizer = Binarizer(threshold=5.1, inputCol="feature", outputCol="binarized_feature")
# binarize via transform
binarizedDataFrame = binarizer.transform(continuousDataFrame)
print("Binarizer output with Threshold = %f" % binarizer.getThreshold())
binarizedDataFrame.show()
spark.stop()
1.2 Bucketizer / discretization by given boundaries
# discretize by given boundaries
# e.g. user ages can be split into ranges, with certain ages as the boundaries
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer
spark = SparkSession\
    .builder\
    .appName("BucketizerExample")\
    .getOrCreate()
# the bucket boundaries
splits = [-float('inf'), -0.5, 0.0, 0.5, float('inf')]
data = [(-999.9,), (-0.5,), (-0.3,), (0.0,), (0.2,), (999.9,)]
dataFrame = spark.createDataFrame(data, ['feature'])
# initialize the bucketizer
bucketizer = Bucketizer(splits=splits, inputCol="feature", outputCol="bucketedFeature")
# bucket the data according to the boundaries
bucketedData = bucketizer.transform(dataFrame)
print("Bucketizer output with %d buckets" % (len(bucketizer.getSplits()) - 1))
bucketedData.show()
spark.stop()
1.3 QuantileDiscretizer / discretization by quantiles
from __future__ import print_function
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName('QuantileDiscretizerExample')\
.getOrCreate()
data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 9.2), (6, 14.4)]
df = spark.createDataFrame(data,['id','hour'])
df = df.repartition(1)
# discretize into three buckets; the edges are derived from the requested bucket count
discretizer = QuantileDiscretizer(numBuckets=3, inputCol='hour', outputCol='result')
# this is an equal-frequency split (roughly the same number of rows per bucket),
# so fit must scan the data first, then transform:
# only after scanning does Spark know where the three bucket edges lie
result = discretizer.fit(df).transform(df)
result.show()
spark.stop()
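The equal-frequency edges that fit derives can be approximated by hand with ordinary quantiles; a minimal pure-Python sketch (Spark computes approximate quantiles internally, so its edges may differ slightly):

```python
import statistics

hours = [18.0, 19.0, 8.0, 5.0, 2.2, 9.2, 14.4]

# two interior cut points split the data into three roughly equal buckets
edges = statistics.quantiles(hours, n=3)

# assign each value the number of edges at or below it, i.e. its bucket index
buckets = [sum(e <= h for e in edges) for h in hours]
print(edges, buckets)
```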
1.4 MaxAbsScaler / max-absolute-value scaling
from __future__ import print_function
from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName('MaxAbsScalerExample')\
.getOrCreate()
dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -8.0]),),
    (1, Vectors.dense([2.0, 1.0, -4.0]),),
    (2, Vectors.dense([4.0, 10.0, 8.0]),)
], ['id', 'feature'])
# MaxAbsScaler takes no parameters, but must be fit first to learn the max absolute values
scaler = MaxAbsScaler(inputCol='feature', outputCol='scaledFeatures')
# compute the max absolute values used for scaling
scalerModel = scaler.fit(dataFrame)
# scale every value into [-1, 1]
scalerData = scalerModel.transform(dataFrame)
scalerData.select("feature", "scaledFeatures").show()
spark.stop()
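What the fitted model does is simple: divide each column by its maximum absolute value. The same scaling, hand-rolled on the rows above:

```python
rows = [[1.0, 0.1, -8.0],
        [2.0, 1.0, -4.0],
        [4.0, 10.0, 8.0]]

# per-column maximum absolute value — this is what fit() learns
max_abs = [max(abs(r[j]) for r in rows) for j in range(3)]  # [4.0, 10.0, 8.0]

# transform(): divide each entry by its column's max-abs, landing in [-1, 1]
scaled = [[r[j] / max_abs[j] for j in range(3)] for r in rows]
print(scaled[0])  # [0.25, 0.01, -1.0]
```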
1.5 Standardization (StandardScaler)
The first example uses sparse data; the second uses dense data.
# read libsvm-format data
from __future__ import print_function
from pyspark.ml.feature import StandardScaler
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName('StandardScalerExample')\
    .getOrCreate()
dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
# compute mean, standard deviation and other statistics
scalerModel = scaler.fit(dataFrame)
# standardize
scaledData = scalerModel.transform(dataFrame)
scaledData.show()
spark.stop()
from __future__ import print_function
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("StandardScalerExample")\
    .getOrCreate()
dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -8.0]),),
    (1, Vectors.dense([2.0, 1.0, -4.0]),),
    (2, Vectors.dense([4.0, 10.0, 8.0]),)
], ["id", "features"])
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
# compute mean, standard deviation and other statistics
scalerModel = scaler.fit(dataFrame)
# standardize
scaledData = scalerModel.transform(dataFrame)
scaledData.show()
spark.stop()
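Per column, fit learns the mean and standard deviation and transform applies z-scaling; with withStd=True, withMean=False only the division by the standard deviation happens. A sketch on one column (Spark uses the unbiased sample standard deviation):

```python
import statistics

col = [1.0, 2.0, 4.0]  # first feature column of the DataFrame above

std = statistics.stdev(col)      # sample (unbiased) standard deviation
scaled = [v / std for v in col]  # withMean=False: scale only, no centering

# what withMean=True would do instead: subtract the mean, then divide by std
centered = [(v - statistics.mean(col)) / std for v in col]
```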
1.6 Polynomial feature expansion
What is the expansion rule of PolynomialExpansion(degree=3, inputCol="feature", outputCol="polyFeature")? It generates every monomial of the input features up to the given degree, excluding the constant term — for a 2-dimensional input (x, y) with degree 3: x, x², x³, y, xy, x²y, y², xy², y³, i.e. 9 output features.
# polynomial feature expansion
from __future__ import print_function
from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("PolynomialExpansionExample")\
.getOrCreate()
df = spark.createDataFrame([
(Vectors.dense([2.0, 1.0]),),
(Vectors.dense([0.0, 0.0]),),
(Vectors.dense([3.0, -1.0]),)
],["feature"])
# degree sets the highest polynomial degree
polyExpansion = PolynomialExpansion(degree=3,inputCol="feature",outputCol="polyFeature")
polyDF = polyExpansion.transform(df)
polyDF.show(truncate = False)
spark.stop()
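The rule can be reproduced in plain Python: enumerate all monomials of total degree 1 through degree (the term ordering below may differ from Spark's internal layout, but the set of values is the same):

```python
from itertools import combinations_with_replacement
from math import prod

def poly_expand(features, degree):
    # all monomials of total degree 1..degree, bias term excluded
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(features)), d):
            terms.append(prod(features[i] for i in combo))
    return terms

print(poly_expand([2.0, 1.0], 3))
# [2.0, 1.0, 4.0, 2.0, 1.0, 8.0, 4.0, 2.0, 1.0] — 9 terms, as expected
```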
Processing categorical values
2.1 One-hot encoding
Step 1: scan the data to find how many distinct values there are (e.g. how many colors, how many sizes): fit a StringIndexer, then transform to get each category's index (the encoding).
Step 2: feed the index through OneHotEncoder. Note the output: with the default dropLast=True, k categories are encoded as vectors of length k - 1, and the last index becomes the all-zero vector.
# one-hot encoding
from __future__ import print_function
from pyspark.ml.feature import OneHotEncoder,StringIndexer
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("OneHotEncoderExample")\
.getOrCreate()
df = spark.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
],["id",'category'])
stringIndexer =StringIndexer(inputCol="category",outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)
# in Spark 3.x OneHotEncoder is an estimator, so fit before transform
# (in Spark 2.x it was a transformer and could be applied directly)
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.fit(indexed).transform(indexed)
encoded.show()
spark.stop()
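To decode the output above: StringIndexer assigns indices by descending frequency ("a" appears 3 times → 0.0, "c" twice → 1.0, "b" once → 2.0), and OneHotEncoder with its default dropLast=True maps k categories to vectors of length k - 1, encoding the last index as all zeros. A sketch of that mapping:

```python
def onehot(index, num_categories, drop_last=True):
    # with drop_last, k categories produce length k-1 vectors;
    # the highest index becomes the all-zero vector
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

print(onehot(0.0, 3))  # [1.0, 0.0] -> category "a"
print(onehot(2.0, 3))  # [0.0, 0.0] -> category "b", the dropped last index
```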
Processing text
3.1 Stop-word removal
StopWordsRemover
from __future__ import print_function
from pyspark.ml.feature import StopWordsRemover
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("StopWordsRemoverExample")\
.getOrCreate()
sentenceData = spark.createDataFrame([
(0, ["I", "saw", "the", "red", "balloon"]),
(1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show(truncate=False)
spark.stop()
3.2 Tokenizer
Tokenization: Tokenizer and RegexTokenizer
from __future__ import print_function
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("TokenizerExample")\
.getOrCreate()
sentenceDataFrame = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
countTokens = udf(lambda words: len(words), IntegerType())
tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")\
.withColumn("tokens", countTokens(col("words"))).show(truncate=False)
regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
.withColumn("tokens", countTokens(col("words"))).show(truncate=False)
spark.stop()
3.3 count_vectorizer
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
spark = SparkSession\
.builder\
.appName("CountVectorizerExample")\
.getOrCreate()
df = spark.createDataFrame([
(0, "a b c".split(" ")),
(1, "a b b c a".split(" "))
], ["id", "words"])
# vocabSize=3 keeps only the three most frequent terms;
# minDF=2.0 requires a term to appear in at least two documents
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)
result = model.transform(df)
result.show(truncate=False)
spark.stop()
3.4 TF-IDF weights
Weights each term by how often it occurs in the current document (TF), discounted by how widely it occurs across the whole corpus (IDF).
from __future__ import print_function
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("TfIdfExample")\
.getOrCreate()
sentenceData = spark.createDataFrame([
(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
spark.stop()
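The weighting can be reproduced by hand. Spark MLlib's IDF uses the smoothed formula log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term; a minimal sketch on the corpus above:

```python
import math

docs = [
    "hi i heard about spark".split(),
    "i wish java could use case classes".split(),
    "logistic regression models are neat".split(),
]
N = len(docs)

# document frequency: in how many documents does each term appear
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

# Spark MLlib's smoothed IDF
idf = {t: math.log((N + 1) / (d + 1)) for t, d in df.items()}

# TF-IDF for the first document, using raw counts as TF
tfidf = {t: docs[0].count(t) * idf[t] for t in set(docs[0])}
print(round(tfidf["spark"], 4))  # log(4/2) ≈ 0.6931
```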
3.5 n-gram language model
from __future__ import print_function
from pyspark.ml.feature import NGram
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("NGramExample")\
.getOrCreate()
wordDataFrame = spark.createDataFrame([
(0, ["Hi", "I", "heard", "about", "Spark"]),
(1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
(2, ["Logistic", "regression", "models", "are", "neat"])
], ["id", "words"])
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)
spark.stop()
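NGram with n=2 simply slides a window over adjacent tokens and joins each pair with a space; the same result in plain Python:

```python
words = ["Hi", "I", "heard", "about", "Spark"]

# pair each token with its successor and join with a space
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
print(bigrams)  # ['Hi I', 'I heard', 'heard about', 'about Spark']
```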
Advanced transforms
4.1 SQL transform
Feature engineering with SQL expressions.
from __future__ import print_function
from pyspark.ml.feature import SQLTransformer
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("SQLTransformerExample")\
.getOrCreate()
df = spark.createDataFrame([
(0, 1.0, 3.0),
(2, 2.0, 5.0)
], ["id", "v1", "v2"])
# FROM __THIS__ refers to the current DataFrame
sqlTrans = SQLTransformer(
statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()
spark.stop()
4.2 R formula transform
"clicked ~ country + hour" means: predict clicked from country and hour. RFormula string-indexes and one-hot encodes the categorical country column, assembles the features into a vector, and copies clicked into the label column.
from __future__ import print_function
from pyspark.ml.feature import RFormula
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("RFormulaExample")\
.getOrCreate()
dataset = spark.createDataFrame(
[(7, "US", 18, 1.0),
(8, "CA", 12, 0.0),
(9, "NZ", 15, 0.0)],
["id", "country", "hour", "clicked"])
formula = RFormula(
formula="clicked ~ country + hour",
featuresCol="features",
labelCol="label")
output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
spark.stop()
Source: CSDN
Author: 小菜鸡一号
Link: https://blog.csdn.net/qq_38319401/article/details/104361188