Storm
------------------
Real-time computation with very low latency.
Low throughput.
Processes the stream one tuple at a time: tuple()
Spark Streaming
------------------
DStream: discretized stream computation.
Equivalent to a sequence of RDDs.
The stream is cut into RDDs by time slice (the batch interval).
DStream partitions = the partitions of the underlying RDDs.
Operates on dynamic (continuously arriving) data.
StreamingContext( , Seconds(2))
Windowed operations are an extension of batches.
High throughput.
socketTextStream()   // socket source
                     // receiver-based sources form a block (= one RDD partition) every 200 ms by default
Kafka stream         // one Kafka partition == one RDD partition
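A minimal Scala sketch of the API above (word count over a socket source; the host/port are placeholders, batch interval of 2 seconds as in the note):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCountDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("socketWordCount").setMaster("local[*]")
    // batch interval of 2 seconds: every 2 s the received data becomes one RDD of the DStream
    val ssc = new StreamingContext(conf, Seconds(2))
    // read lines from a socket (start e.g. `nc -lk 9999` first)
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}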
LocationStrategy
------------------
Location strategy: controls on which executors the topic partitions are consumed.
PreferBrokers     // prefer executors co-located with the Kafka brokers
PreferConsistent  // prefer spreading partitions evenly across executors
PreferFixed       // prefer a fixed, user-specified mapping
ConsumerStrategy
-----------------
Controls which part of the Kafka data the consumer is bound to.
Assign            // assign explicit partitions: control down to individual partitions of a topic
Subscribe         // subscribe to a collection of topics: cannot pick individual partitions
SubscribePattern  // subscribe by regular expression: an enhancement of Subscribe
See the construction sketch below.
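A minimal Scala sketch (spark-shell style) of how the three strategies are constructed; the topic names, partitions and kafkaParams entries are placeholders matching the Java demo further down:

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// the same map of consumer properties used in the Java demo below
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "s102:9092",
  "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> "g1")

// Assign: down to individual partitions of a topic
val assign = ConsumerStrategies.Assign[String, String](
  Seq(new TopicPartition("t1", 0), new TopicPartition("t1", 1)), kafkaParams)

// Subscribe: whole topics only
val subscribe = ConsumerStrategies.Subscribe[String, String](Seq("t1"), kafkaParams)

// SubscribePattern: all topics matching a regex
val subscribePattern = ConsumerStrategies.SubscribePattern[String, String](
  Pattern.compile("t.*"), kafkaParams)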
Consumption semantics
----------------
1. at most once
   commitOffset()
   consumeMessage()
2. at least once
   consumeMessage()
   commitOffset()
3. exactly once
   Relies on the transaction support of an external transactional resource (e.g. a database):
   store the offsets in the same transactional store as the results.
(A sketch of at-least-once with manual offset commits follows below.)
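A minimal Scala sketch of at-least-once with the Kafka 0-10 direct stream: the offsets of a batch are committed only after the batch has been processed. It assumes the ssc and kafkaParams from the sketches above; the processing step is a placeholder:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies}

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("t1"), kafkaParams))

stream.foreachRDD { rdd =>
  // remember the offset ranges of this batch before any shuffle
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // 1) consume the messages (placeholder for real processing)
  rdd.foreach(record => println(record.value()))
  // 2) only then commit the offsets back to Kafka -> at least once
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}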
KafkaRDD partition computation
------------------
Obtained from the consumer's assignment() method inside KafkaRDD;
the consumer object itself is created by ConsumerStrategy.onStart.
So the number of KafkaRDD partitions is determined by the consumer strategy:
in principle one topic partition maps to one RDD partition, with a few cases
to consider, e.g. partitions that are rate-limited.
Java implementation: consuming Kafka messages with Spark Streaming
---------------------------------------------
package com.oldboy.spark.java;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Created by Administrator on 2018/5/22.
 */
public class SparkStreamingKafkaJavaDemo {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf();
        conf.setAppName("stream");
        conf.setMaster("local[*]");

        // create the streaming context with a 5 second batch interval
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // location strategy
        LocationStrategy loc = LocationStrategies.PreferConsistent();

        // consumer parameters
        Map<String, Object> kafkaParams = new HashMap<String, Object>();
        kafkaParams.put("bootstrap.servers", "s102:9092");
        kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("group.id", "g1");
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", false);

        // topics to consume
        List<String> topics = new ArrayList<String>();
        topics.add("t1");

        // consumer strategy
        ConsumerStrategy<String, String> con = ConsumerStrategies.Subscribe(topics, kafkaParams);

        JavaInputDStream<ConsumerRecord<String, String>> ds1 =
                KafkaUtils.<String, String>createDirectStream(ssc, loc, con);
        JavaDStream<String> ds2 = ds1.map(new Function<ConsumerRecord<String, String>, String>() {
            public String call(ConsumerRecord<String, String> v1) throws Exception {
                return v1.value();
            }
        });
        ds2.print();

        ssc.start();
        ssc.awaitTermination();
    }
}
Machine learning
----------------------
Algorithms that learn from data (machine learning).
Math basics
---------------
1. mean
   the average value
2. median
   the middle value
3. mode
   the most frequent value
4. range
   max - min
5. variance
   the average of the squared deviations from the mean:
   variance = ((x1 - mean)^2 + (x2 - mean)^2 + ... + (xn - mean)^2) / n
6. standard deviation
   the square root of the variance:
   stddev = sqrt(variance)
7. skewness
   symmetric distribution : mean = median = mode
   left-skewed            : mean < median < mode
   right-skewed           : mean > median > mode
8. kurtosis
   normal distribution : kurtosis = 3
   more peaked         : kurtosis > 3
   flatter             : kurtosis < 3
(A small computation sketch follows this list.)
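A small plain-Scala sketch, using a hypothetical sample, that computes the first statistics above:

object BasicStats {
  def main(args: Array[String]): Unit = {
    val xs = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)   // hypothetical sample
    val n = xs.length
    val mean = xs.sum / n
    // variance = average squared deviation from the mean
    val variance = xs.map(x => math.pow(x - mean, 2)).sum / n
    val stddev = math.sqrt(variance)
    // range = max - min
    val range = xs.max - xs.min
    // for this sample: mean = 5.0, variance = 4.0, stddev = 2.0, range = 7.0
    println(s"mean=$mean variance=$variance stddev=$stddev range=$range")
  }
}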
BI
-------------
Business intelligence.
Supervised vs. unsupervised
--------------
1. supervised
   all training data is labeled.
   e.g. spam classification.
   neural networks
   SVM
   naive Bayes
2. unsupervised
   no labels.
   K-means
----------------
A事件发生时,B事件发生的概率。
P(A|B) * P(B)
P(B | A) = -----------------
P(A)
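A tiny worked example in Scala with made-up numbers, purely to illustrate the formula (the probabilities are hypothetical, not from any dataset):

// Hypothetical: P(spam) = 0.2, P("free" | spam) = 0.5, P("free") = 0.14
val pSpam = 0.2
val pFreeGivenSpam = 0.5
val pFree = 0.14
// Bayes: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
val pSpamGivenFree = pFreeGivenSpam * pSpam / pFree
println(pSpamGivenFree)   // ~ 0.714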
TF-IDF
----------
1. TF
   term frequency: computed per document.
   Essentially a word count.
   Measures how relevant a word is to the topic of that document.
   // j : the j-th document
   // i : the i-th word
   // Sum(N(j)) : total number of words in document j
                  N(i,j)
   TF(i,j) = --------------------
                 Sum(N(j))
2. IDF
   inverse document frequency: computed over the whole document collection (the corpus).
   Measures how well a word distinguishes documents within the corpus.
                                 |D|
   IDF(i) = log10 -----------------------------------------
                   number of documents containing word i + 1
   (|D| = total number of documents, e.g. 1000)
3. TF-IDF
   TF measures how important a word is within one document;
   IDF measures how well the word discriminates across the whole corpus.
   TF-IDF(i,j) = TF(i,j) * IDF(i)
Least squares
----------------
Minimize the sum of squared errors.
Linear regression
---------------
regression:
fits a straight-line relationship between the features and the target;
the regression output is a continuous value.
Logistic regression
------------------------
the output is a discrete class.
Putting a binary decision on top of a linear regression output gives logistic regression.
Vectors
------------------------
A row vector (1,2,3,4) can also be written as a column:
  1
  2
  3
  4
Example: (0,3,8,0,0,9,0,2)
Sparse vector:
-----------
sparse vector: only the non-zero entries are stored, so it uses little memory.
(size, index:value, ...)   e.g. (1000, 1:3, 2:8, 5:9, ...)
Dense vector:
-----------
dense vector: every entry is stored.
(0,1,2,0,0,5,0,6)
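A minimal Scala sketch using Spark ML's vector factory (the same Vectors class used in the regression examples below):

import org.apache.spark.ml.linalg.Vectors

// dense vector: every component is stored
val dv = Vectors.dense(0.0, 1.0, 2.0, 0.0, 0.0, 5.0, 0.0, 6.0)

// sparse vector: (size, indices, values) - only the non-zero entries are stored
val sv = Vectors.sparse(1000, Array(1, 2, 5), Array(3.0, 8.0, 9.0))

println(dv)   // [0.0,1.0,2.0,0.0,0.0,5.0,0.0,6.0]
println(sv)   // (1000,[1,2,5],[3.0,8.0,9.0])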
Worked TF example: the sentence "hello world , how are you, thank you!!" has 7 words,
so most words get TF = 1/7, while "you" appears twice: TF(you) = 2/7.

Worked document-frequency example with three documents:
1. hello tom1
2. hello tom2
3. hello tom3
"hello" appears in all three documents, so its document frequency is 3/3 = 1
(it carries no power to distinguish the documents).
y = a1*x1 + a2*x2 + ... + a11*x11 + C     // a linear model over the 11 wine features
Predicting wine quality with linear regression
--------------------------
1. Add the maven dependency
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
2. Code
/**
 * Predicting red wine quality with Spark's linear regression.
 */
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Row, SparkSession}

object SparkMLLibLinearRegress {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("ml_linearRegress")
    conf.setMaster("local[*]")
    val spark = SparkSession.builder().config(conf).getOrCreate()

    //1. define the case class
    case class Wine(FixedAcidity: Double, VolatileAcidity: Double, CitricAcid: Double,
                    ResidualSugar: Double, Chlorides: Double, FreeSulfurDioxide: Double,
                    TotalSulfurDioxide: Double, Density: Double, PH: Double,
                    Sulphates: Double, Alcohol: Double,
                    Quality: Double)

    //2. load the red wine csv file and transform it into an RDD
    val file = "file:///D:\\ml\\data\\red.csv"
    val wineDataRDD = spark.sparkContext.textFile(file)
      .map(_.split(";"))
      .map(w => Wine(w(0).toDouble, w(1).toDouble, w(2).toDouble, w(3).toDouble,
        w(4).toDouble, w(5).toDouble, w(6).toDouble, w(7).toDouble, w(8).toDouble,
        w(9).toDouble, w(10).toDouble,
        w(11).toDouble))

    //import the SparkSession implicits so the RDD can be converted to a DataFrame
    import spark.implicits._
    val trainingDF = wineDataRDD.map(w => (w.Quality, Vectors.dense(w.FixedAcidity, w.VolatileAcidity,
      w.CitricAcid, w.ResidualSugar, w.Chlorides, w.FreeSulfurDioxide, w.TotalSulfurDioxide,
      w.Density, w.PH, w.Sulphates, w.Alcohol))).toDF("label", "features")
    trainingDF.show(100, false)

    //3. create the linear regression estimator
    val lr = new LinearRegression()
    //4. set its parameters
    lr.setMaxIter(2)
    //5. fit the model
    val model = lr.fit(trainingDF)

    //6. build a test dataset
    val testDF = spark.createDataFrame(Seq(
      (5.0, Vectors.dense(7.4, 0.7, 0.0, 1.9, 0.076, 25.0, 67.0, 0.9968, 3.2, 0.68, 9.8)),
      (5.0, Vectors.dense(7.8, 0.88, 0.0, 2.6, 0.098, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4)),
      (7.0, Vectors.dense(7.3, 0.65, 0.0, 1.2, 0.065, 15.0, 18.0, 0.9968, 3.36, 0.57, 9.5))))
      .toDF("label", "features")

    //7. register the test dataset as a temp view
    testDF.createOrReplaceTempView("test")

    //8. predict on the test dataset with the trained model and select the interesting columns
    val tested = model.transform(testDF)
    tested.show(100, false)
    val tested2 = tested.select("features", "label", "prediction")
    tested2.show(100, false)

    //9. show the predictions
    tested.show()

    //10. extract only the features from the test view and use them as unlabeled prediction input
    val predictDF = spark.sql("select features from test")
    predictDF.show(100, false)
    model.transform(predictDF).show(1000, false)
  }
}
Model persistence
------------------
// save the model
val model = lr.fit(trainingDF)
model.save("file:///d:/mr/model/linreg")
// load the model
import org.apache.spark.ml.regression.LinearRegressionModel
val model = LinearRegressionModel.load("file:///d:/mr/model/linreg")
Classifying white wine as good or bad with logistic regression
-------------------------------
/**
 * Logistic regression.
 */
import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object WineLogisticRegressDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("logisticRegress")
    conf.setMaster("local[*]")
    val spark = SparkSession.builder().config(conf).getOrCreate()

    //1. define the case class
    case class Wine(FixedAcidity: Double, VolatileAcidity: Double,
                    CitricAcid: Double, ResidualSugar: Double, Chlorides: Double,
                    FreeSulfurDioxide: Double, TotalSulfurDioxide: Double, Density: Double,
                    PH: Double, Sulphates: Double, Alcohol: Double, Quality: Double)

    //2. load the white wine csv file and transform it into an RDD
    val file = "file:///D:\\ml\\data\\white.csv"
    val wineDataRDD = spark.sparkContext.textFile(file)
      .map(_.split(";"))
      .map(w => Wine(w(0).toDouble, w(1).toDouble, w(2).toDouble, w(3).toDouble,
        w(4).toDouble, w(5).toDouble, w(6).toDouble, w(7).toDouble, w(8).toDouble,
        w(9).toDouble, w(10).toDouble,
        w(11).toDouble))

    //import the SparkSession implicits so the RDD can be converted to a DataFrame;
    //quality < 7 is labeled 0 (bad), otherwise 1 (good)
    import spark.implicits._
    val trainingDF = wineDataRDD.map(w => (if (w.Quality < 7) 0D else 1D,
      Vectors.dense(w.FixedAcidity, w.VolatileAcidity, w.CitricAcid,
        w.ResidualSugar, w.Chlorides, w.FreeSulfurDioxide, w.TotalSulfurDioxide,
        w.Density, w.PH, w.Sulphates, w.Alcohol))).toDF("label", "features")

    //3. create the logistic regression estimator and set its parameters
    val lr = new LogisticRegression()
    lr.setMaxIter(10).setRegParam(0.01)

    //4. fit the training data to obtain a model
    val model = lr.fit(trainingDF)

    //5. build test data
    val testDF = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(6.1, 0.32, 0.24, 1.5, 0.036, 43, 140, 0.9894, 3.36, 0.64, 10.7)),
      (0.0, Vectors.dense(5.2, 0.44, 0.04, 1.4, 0.036, 38, 124, 0.9898, 3.29, 0.42, 12.4)),
      (0.0, Vectors.dense(7.2, 0.32, 0.47, 5.1, 0.044, 19, 65, 0.9951, 3.38, 0.36, 9)),
      (0.0, Vectors.dense(6.4, 0.595, 0.14, 5.2, 0.058, 15, 97, 0.991, 3.03, 0.41, 12.6)))
    ).toDF("label", "features")
    testDF.createOrReplaceTempView("test")

    //6. predict on the test data
    val tested = model.transform(testDF).select("features", "label", "prediction")

    //7. predict on unlabeled data (features only)
    val realData = spark.sql("select features from test")
    model.transform(realData).select("features", "prediction").show(100, false)
  }
}
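Spark ML TF-IDF example
----------------------
The two tables below look like the output of the standard Spark ML TF-IDF pipeline
(Tokenizer -> HashingTF with 20 features -> IDF). A Scala sketch that would produce
this kind of output (the sentences match the rows in the tables):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TfIdfDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tfidf").master("local[*]").getOrCreate()

    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes"),
      (1.0, "Logistic regression models are neat")
    )).toDF("label", "sentence")

    // split each sentence into words
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)

    // hash the words into a 20-dimensional raw term-frequency vector
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
    val featurizedData = hashingTF.transform(wordsData)

    // rescale the raw term frequencies by the inverse document frequency
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(featurizedData)
    val rescaledData = idfModel.transform(featurizedData)

    featurizedData.show(false)   // first table: label, sentence, words, rawFeatures
    rescaledData.show(false)     // second table: ... plus the IDF-weighted features
  }
}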
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|label|sentence |words |rawFeatures |
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|0.0 |Hi I heard about Spark |[hi, i, heard, about, spark] |(20,[0,5,9,17],[1.0,1.0,1.0,2.0]) |
|0.0 |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(20,[2,7,9,13,15],[1.0,1.0,3.0,1.0,1.0]) |
|1.0 |Logistic regression models are neat|[logistic, regression, models, are, neat] |(20,[4,6,13,15,18],[1.0,1.0,1.0,1.0,1.0])|
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+----------------------------------------------------------------------------------------------------------------------+
|label|sentence |words |rawFeatures |features |
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+----------------------------------------------------------------------------------------------------------------------+
|0.0 |Hi I heard about Spark |[hi, i, heard, about, spark] |(20,[0,5,9,17],[1.0,1.0,1.0,2.0]) |(20,[0,5,9,17],[0.6931471805599453,0.6931471805599453,0.28768207245178085,1.3862943611198906]) |
|0.0 |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(20,[2,7,9,13,15],[1.0,1.0,3.0,1.0,1.0]) |(20,[2,7,9,13,15],[0.6931471805599453,0.6931471805599453,0.8630462173553426,0.28768207245178085,0.28768207245178085]) |
|1.0 |Logistic regression models are neat|[logistic, regression, models, are, neat] |(20,[4,6,13,15,18],[1.0,1.0,1.0,1.0,1.0])|(20,[4,6,13,15,18],[0.6931471805599453,0.6931471805599453,0.28768207245178085,0.28768207245178085,0.6931471805599453])|
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+----------------------------------------------------------------------------------------------------------------------+
def aggregate[B](z: =>B)
(seqop: (B, A) => B, combop: (B, B) => B): B = foldLeft(z)(seqop)
Example expansion over several partitions (seqop folds the values inside a partition, combop merges the per-partition results):

combop(combop(combop(combop(seqop(seqop(seqop(zero,10),11),12),
                            seqop(seqop(seqop(seqop(zero,13),14),15),16)),
                     seqop(seqop(seqop(zero,7),8),9)),
              seqop(seqop(seqop(zero,1),2),3)),
       seqop(seqop(seqop(zero,4),5),6))
RDD.treeAggregate and depth:
---------------------------
Tree aggregation on an RDD. The result is the same as aggregate(zeroValue)(seqOp, combOp):
within a partition values are folded with seqOp, and partition results are merged with combOp.
Tree aggregation additionally takes a depth; depth does not change the result, only the
performance, by splitting one aggregation step into several rounds of partial aggregation.
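A minimal Scala sketch (spark-shell style, hypothetical numbers) showing that aggregate and treeAggregate return the same value and differ only in how the partial results are combined:

val rdd = sc.parallelize(1 to 16, 4)

// aggregate: seqOp adds values within a partition, combOp adds the partition sums
val sum1 = rdd.aggregate(0)((acc, x) => acc + x, (a, b) => a + b)

// treeAggregate: same zero/seqOp/combOp, but partials are merged in a tree of the given depth
val sum2 = rdd.treeAggregate(0)((acc, x) => acc + x, (a, b) => a + b, depth = 2)

println(sum1)   // 136
println(sum2)   // 136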
Spark DF
-------------
The document-frequency based IDF formula used by Spark MLlib:

                          |D| + 1
  idf = ln ---------------------------------------------
            number of documents containing the word + 1

Partition shrinking in treeAggregate, e.g. parts = 10000, depth = 4:
  // scale: the fan-in of the tree; the partition count keeps being divided by scale
  // while it is still larger than scale + ceil(parts / scale)
  scale = max(pow(10000, 1/4), 2) = 10
  while (10000 > 10 + ceil(10000 / 10)) {
    currParts = 10000 / 10 = 1000
  }
  while (1000 > 10 + ceil(1000 / 10)) {
    currParts = 1000 / 10 = 100
  }
  while (100 > 10 + ceil(100 / 10)) {
    currParts = 100 / 10 = 10
  }
Example sparse vector: size = 100, indices = [1,3,5], values = [100,200,300]
(k runs over 0, 1, 2)

Counting document frequency for one document (from the IDF aggregator):
  var k = 0
  while (k < indices.length) {
    if (values(k) > 0) {
      df(indices(k)) += 1L
    }
    k += 1
  }
  // df: a vector holding the document frequency of every word
Example documents (each hashed to a sparse vector, e.g. of size 200):
  doc1: hello tom1 -> (200, [3,5], [1,1])
  doc2: hello tom2
  doc3: hello tom3
  doc4: hello tom4
  doc5: hello tom5

For a 3-document corpus in which "hello" appears in all 3 documents:
                     3 + 1
  idf(hello) = log ---------- = log(1) = 0    // a word in every document has no discriminating power
                     3 + 1
A word that appears in none of the documents would instead get:
         3 + 1
  log ----------- = log(4)
         0 + 1
Sample training rows in LIBSVM format (label index:value index:value ..., rows truncated):

label 0:
0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 1
0 130:64 131:253 132:255 133:63 157:96 158:205 159:251 1
0 155:53 156:255 157:253 158:253 159:253 160:124 183:180
0 128:73 129:253 130:227 131:73 132:21 156:73 157:251 15
0 154:46 155:105 156:254 157:254 158:254 159:254 160:255
0 152:56 153:105 154:220 155:254 156:63 178:18 179:166 1
0 155:21 156:176 157:253 158:253 159:124 182:105 183:176

label 1:
1 100:166 101:222 102:55 128:197 129:254 130:218 131:5 1
1 159:124 160:253 161:255 162:63 186:96 187:244 188:251
1 125:145 126:255 127:211 128:31 152:32 153:237 154:253
1 153:5 154:63 155:197 181:20 182:254 183:230 184:24 209
1 152:1 153:168 154:242 155:28 180:10 181:228 182:254 18
1 159:121 160:254 161:136 186:13 187:230 188:253 189:248
1 155:178 156:255 157:105 182:6 183:188 184:253 185:216
1 130:7 131:176 132:254 133:224 158:51 159:253 160:253 1
Aggregating the training rows by label (sparse rows -> per-label dense sums):

label 0 rows:
  0 1:2 3:1 5:4
  0 0:6 2:1 4:4
  0 2:2 3:6 4:4
label 1 rows:
  1 1:3 3:2 5:1
  1 1:1 3:2 5:3

              index:  0 1 2 3 4 5 6 7      total term count   features
  0 => (3 docs, (6 2 3 7 8 4 0 0))    =          30               8
  1 => (2 docs, (0 4 0 4 0 4 0 0))    =          12               8
// number of labels, e.g. 0, 1, 2
val numLabels = aggregated.length
// total number of documents: the sum of the per-label document counts
val numDocuments = aggregated.map(_._2._1).sum
// label array
val labelArray = new Array[Double](numLabels)
// log of the prior probability for each label
val piArray = new Array[Double](numLabels)
val thetaArray = new Array[Double](numLabels * numFeatures)

// e.g. log(1000 + 2 * 1) = log(1002)
piLogDenom = log(numDocuments + numLabels * lambda)
pi(label)  = log(countOfLabel + lambda) - piLogDenom

Example: docs = 5, label 0 has 3 docs, label 1 has 2 docs, lambda = 1:
  piLogDenom = log(5 + 2 * 1) = log(7)
  pi(0) = log(3 + 1) - log(7) = log(4) - log(7) = 1.386 - 1.945 = -0.559
Per-label aggregation inside NaiveBayes.train (Spark MLlib source, simplified):

aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
  seqOp = {
    case ((weightSum: Double, featureSum: DenseVector), (weight, features)) =>
      requireValues(features)
      // BLAS.axpy(a, x, y) computes y = a * x + y; here it adds the weighted feature vector
      BLAS.axpy(weight, features, featureSum)
      (weightSum + weight, featureSum)
  },
  combOp = {
    case ((weightSum1, featureSum1), (weightSum2, featureSum2)) =>
      BLAS.axpy(1.0, featureSum2, featureSum1)
      (weightSum1 + weightSum2, featureSum1)
  })

The zero value U is (0.0, Vectors.zeros(numFeatures).toDense): a weight sum of 0 and an all-zero
feature sum. seqOp folds each (weight, features) record of a label into that pair; combOp merges
the partial (weightSum, featureSum) pairs coming from different partitions.
Source: oschina
Link: https://my.oschina.net/u/4258911/blog/3963258