Spark Structured Streaming using sockets, set SCHEMA, Display DATAFRAME in console

折月煮酒 submitted on 2019-11-28 13:01:21

TextSocketSource doesn't provide any integrated parsing options. It is only possible to use one of the two formats:

  • timestamp and text if includeTimestamp is set to true with the following schema:

    StructType([
        StructField("value", StringType()),
        StructField("timestamp", TimestampType())
    ])
    
  • text only if includeTimestamp is set to false with the schema as shown below:

    StructType([StructField("value", StringType())])
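
As a rough sketch of how the two formats above are selected, the helper below builds the socket reader and toggles `includeTimestamp`. The function name `read_socket_lines` and the default host and port are assumptions made for illustration; `spark` is expected to be an existing SparkSession.

```python
def read_socket_lines(spark, host="localhost", port=9999, include_timestamp=False):
    """Return a streaming DataFrame of lines read from a TCP socket.

    With include_timestamp=False the result has a single "value" column;
    with include_timestamp=True it also has a "timestamp" column.
    The host and port defaults are placeholders, not values from the question.
    """
    return (
        spark.readStream
        .format("socket")                                # text socket source
        .option("host", host)
        .option("port", port)
        .option("includeTimestamp", include_timestamp)   # picks one of the two schemas
        .load()
    )
```

Assuming a session created with `SparkSession.builder.getOrCreate()`, `lines = read_socket_lines(spark, include_timestamp=True)` gives the two-column variant, and `lines.writeStream.format("console").start()` prints each micro-batch to the console.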
    

If you want a different format, you'll have to transform the stream yourself to extract the fields of interest, for example with regular expressions:

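    from functools import partial

    from pyspark.sql.functions import regexp_extract

    # Pre-bind the source column and the pattern; only the group index varies.
    # The raw string keeps \w and \s from being read as Python string escapes.
    fields = partial(
        regexp_extract, str="value", pattern=r"^(\w*)\s*,\s*(\w*)\s*,\s*([0-9]*)$"
    )

    # `lines` is the streaming DataFrame returned by the socket source.
    lines.select(
        fields(idx=1).alias("name"),
        fields(idx=2).alias("last_name"),
        fields(idx=3).alias("phone_number")
    )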
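
To see what the pattern captures without starting a stream, it can be tried in plain Python first. This is only an approximation: Spark evaluates the pattern with Java's regex engine, which treats `\w` as ASCII-only by default, while Python's `re` also accepts Unicode word characters. The helper name `split_line` and the sample inputs are made up for illustration.

```python
import re

# Same pattern as in the regexp_extract call above.
PATTERN = re.compile(r"^(\w*)\s*,\s*(\w*)\s*,\s*([0-9]*)$")


def split_line(line):
    """Return (name, last_name, phone_number) for one input line.

    Mirrors regexp_extract's behaviour of yielding empty strings
    when the line does not match the pattern.
    """
    match = PATTERN.match(line)
    if match is None:
        return ("", "", "")
    return match.groups()


print(split_line("John , Doe, 5551234"))
print(split_line("no commas here"))
```

A line that lacks the two commas, or whose last field contains non-digits, yields three empty strings, so such rows should be filtered out downstream if they are not wanted.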