Ways to read a Phoenix HBase table into a Spark DataFrame
Maven dependency:
<dependency>
    <groupId>org.apache.phoenix</groupId>
    <artifactId>phoenix-spark</artifactId>
    <version>${phoenix.version}</version>
    <scope>provided</scope>
</dependency>
Demo1:
Method 1: the generic spark.read approach that works for any data source
val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "subject_score")
  .option("zkUrl", "master,slave1,slave2,slave3,slave4")
  .load()
Method 2: sqlContext.load (an older API, deprecated in recent Spark releases)
val df = sqlContext.load(
"org.apache.phoenix.spark",
Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181")
)
Method 3: phoenixTableAsDataFrame (takes an explicit list of column names; leave the list empty to load all columns, as sketched after this example)
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._

val configuration = new Configuration()
// Phoenix-specific settings can be set here; requires 'hbase.zookeeper.quorum'
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
// Load the columns 'ID' and 'COL1' from TABLE1 as a DataFrame
val df = sqlContext.phoenixTableAsDataFrame(
  "TABLE1", Array("ID", "COL1"), conf = configuration
)
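Per the note above, the column list can also be left empty to pull every column of the table; a minimal sketch under that assumption, reusing the sqlContext and configuration defined just above:
// Assumption (from the note above): an empty column list loads all columns of TABLE1
val allColumns = sqlContext.phoenixTableAsDataFrame(
  "TABLE1", Seq.empty[String], conf = configuration
)
allColumns.printSchema()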
Method 4: phoenixTableAsRDD (takes an explicit list of column names; leave the list empty to load all columns, as sketched after this example)
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.phoenix.spark._

val sc = new SparkContext("local", "phoenix-test")
// Load the columns 'ID' and 'COL1' from TABLE1 as an RDD
val rdd: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
  "TABLE1", Seq("ID", "COL1"), zkUrl = Some("phoenix-server:2181")
)
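Each element of the returned RDD is a Map keyed by column name, so values need an explicit cast; a small usage sketch (the empty column list again relies on the note above):
// Assumption (from the note above): an empty column list loads all columns of TABLE1
val fullRdd = sc.phoenixTableAsRDD("TABLE1", Seq.empty[String], zkUrl = Some("phoenix-server:2181"))
val ids = fullRdd.map(row => row("ID").asInstanceOf[Long])
println(ids.count())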
Demo2:
Method 1:
val df = sqlContext.read.format("jdbc")
.option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
.option("url", "jdbc:phoenix:localhost:2181")
.option("dbtable", "US_POPULATION")
.load()
Method 2:
val df = sqlContext.load(
  "jdbc",
  Map(
    "driver" -> "org.apache.phoenix.jdbc.PhoenixDriver",
    "url" -> "jdbc:phoenix:localhost:2181",
    "zkUrl" -> "localhost:2181",
    "dbtable" -> "US_POPULATION"
  )
)
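Because this route goes through Spark's plain JDBC source, the standard JDBC partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) can be added to parallelize the read; a hedged sketch in which the split column and bounds are illustrative assumptions, not values from the original post:
// Sketch of a partitioned JDBC read; POPULATION and the bounds are assumed for illustration only.
val partitioned = sqlContext.read.format("jdbc")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .option("url", "jdbc:phoenix:localhost:2181")
  .option("dbtable", "US_POPULATION")
  .option("partitionColumn", "POPULATION")
  .option("lowerBound", "0")
  .option("upperBound", "40000000")
  .option("numPartitions", "4")
  .load()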
Demo3:
import org.apache.hadoop.conf.Configuration
import org.apache.phoenix.spark._
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().master("local[*]").appName("phoenix-test").getOrCreate()

  // First read method: the DataSource API
  var df = spark.read.format("org.apache.phoenix.spark").option("table", "test1").option("zkUrl", "192.168.56.11:2181").load()
  df = df.filter("name not like 'hig%'")
    .filter("password like '%0%'")
  df.show()

  val configuration = new Configuration()
  configuration.set("hbase.zookeeper.quorum", "192.168.56.11:2181")

  // Second read method: phoenixTableAsDataFrame
  df = spark.sqlContext.phoenixTableAsDataFrame("test1", Array("ID", "INFO.NAME", "INFO.PASSWORD"), conf = configuration)
  df.show()

  // First write method: the DataSource API
  df.write
    .format("org.apache.phoenix.spark")
    .mode("overwrite")
    .option("table", "test2")
    .option("zkUrl", "192.168.56.11:2181")
    .save()

  // Second write method: saveToPhoenix
  df.saveToPhoenix(Map("table" -> "test2", "zkUrl" -> "192.168.56.11:2181"))
}
phoenixTableAsDataFrame() is a method of org.apache.phoenix.spark.SparkSqlContextFunctions and saveToPhoenix() is a method of org.apache.phoenix.spark.DataFrameFunctions; both ship in phoenix-spark-4.10.0-HBase-1.2.jar. To use them you must write import org.apache.phoenix.spark._ yourself, otherwise the code will not compile and the IDE will not offer an auto-import.
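A minimal sketch of what that import buys you, reusing the SparkSession and Configuration set up in Demo3 above (the comments on how each call resolves follow the class names cited above):
// Without this import the two calls below do not compile: the methods live on
// implicit wrapper classes, not on SQLContext / DataFrame themselves.
import org.apache.phoenix.spark._

val df2 = spark.sqlContext.phoenixTableAsDataFrame(      // resolved via SparkSqlContextFunctions
  "test1", Array("ID", "INFO.NAME"), conf = configuration)
df2.saveToPhoenix(Map("table" -> "test2", "zkUrl" -> "192.168.56.11:2181"))  // resolved via DataFrameFunctions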
Demo4:
4.1 Create a table in Phoenix
CREATE TABLE TABLE1 (ID BIGINT NOT NULL PRIMARY KEY, COL1 VARCHAR);
UPSERT INTO TABLE1 (ID, COL1) VALUES (1, 'test_row_1');
UPSERT INTO TABLE1 (ID, COL1) VALUES (2, 'test_row_2');
4.2 Start spark-shell
spark-shell --jars /opt/phoenix4.8/phoenix-spark-4.8.0-HBase-1.1.jar,/opt/phoenix4.8/phoenix-4.8.0-HBase-1.1-client.jar
4.3 Use the DataSource API to load a DataFrame
import org.apache.phoenix.spark._
val df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "TABLE1", "zkUrl" -> "192.38.0.231:2181"))
df.filter(df("COL1") === "test_row_1" && df("ID") === 1L).select(df("ID")).show
For Spark 2.x, use the following (this assumes the phoenix-spark module has been modified for Spark 2.x compatibility and recompiled):
import org.apache.phoenix.spark._
val df = spark.read.format("org.apache.phoenix.spark").options(Map("table" -> "TABLE1", "zkUrl" -> "192.38.0.231:2181")).load
df.filter(df("COL1") === "test_row_1" && df("ID") === 1L).select(df("ID")).show
With spark-sql the equivalent is:
spark-sql --jars /opt/phoenix4.8/phoenix-spark-4.8.0-HBase-1.1.jar,/opt/phoenix4.8/phoenix-4.8.0-HBase-1.1-client.jar
CREATE TABLE spark_ph
USING org.apache.phoenix.spark
OPTIONS (
table "TABLE1",
zkUrl "192.38.0.231:2181"
);
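Once this mapping table exists it can be queried with ordinary SQL in the same spark-sql session; a trivial usage example (the query itself is not from the original post):
SELECT * FROM spark_ph WHERE ID = 1;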
4.4 Use the Configuration class to load a DataFrame
import org.apache.hadoop.conf.Configuration
import org.apache.phoenix.spark._
val configuration = new Configuration()
val df = sqlContext.phoenixTableAsDataFrame("TABLE1", Array("ID", "COL1"), conf = configuration)
df.show
4.5 Use the ZooKeeper URL to load an RDD
import org.apache.phoenix.spark._
// Load the ID and COL1 columns as an RDD
val rdd = sc.phoenixTableAsRDD("TABLE1", Seq("ID", "COL1"), zkUrl = Some("192.38.0.231:2181"))
rdd.count()
val firstId = rdd.first()("ID").asInstanceOf[Long]
val firstCol = rdd.first()("COL1").asInstanceOf[String]
4.6 Write data to Phoenix from Spark (RDD)
Create the table OUTPUT_TEST_TABLE in Phoenix:
CREATE TABLE OUTPUT_TEST_TABLE (id BIGINT NOT NULL PRIMARY KEY, col1 VARCHAR, col2 INTEGER);
import org.apache.phoenix.spark._
val sc = new SparkContext("local", "phoenix-test")
val dataSet = List((1L, "1", 1), (2L, "2", 2), (3L, "3", 3))
sc.parallelize(dataSet).saveToPhoenix("OUTPUT_TEST_TABLE", Seq("ID","COL1","COL2"), zkUrl = Some("192.38.0.231:2181"))
0: jdbc:phoenix:localhost> select * from output_test_table;
+-----+-------+-------+
| ID  | COL1  | COL2  |
+-----+-------+-------+
| 1   | 1     | 1     |
| 2   | 2     | 2     |
| 3   | 3     | 3     |
+-----+-------+-------+
3 rows selected (0.168 seconds)
4.7 Write data to Phoenix from Spark (DataFrame)
Create two tables in Phoenix:
CREATE TABLE INPUT_TABLE (id BIGINT NOT NULL PRIMARY KEY, col1 VARCHAR, col2 INTEGER);
upsert into input_table values(1,'col1',1);
upsert into input_table values(2,'col2',2);
CREATE TABLE OUTPUT_TABLE (id BIGINT NOT NULL PRIMARY KEY, col1 VARCHAR, col2 INTEGER);
import org.apache.phoenix.spark._
import org.apache.spark.sql.SaveMode
val df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "INPUT_TABLE", "zkUrl" -> "192.38.0.231:2181"))
df.save("org.apache.phoenix.spark", Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "192.38.0.231:2181"))
With Spark 2.x, write the data to Phoenix like this:
df.write.format("org.apache.phoenix.spark").options( Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "192.38.0.231:2181")).mode("overwrite").save
0: jdbc:phoenix:localhost> select * from output_table;
+-----+-------+-------+
| ID  | COL1  | COL2  |
+-----+-------+-------+
| 1   | col1  | 1     |
| 2   | col2  | 2     |
+-----+-------+-------+
2 rows selected (0.092 seconds)
Source: https://blog.csdn.net/An1090239782/article/details/101073804