coalesce

Coalesce columns based on pattern in R [duplicate]

Submitted by 拟墨画扇 on 2020-03-22 09:21:10
Question: This question already has answers here: Merge multiple variables in R (6 answers). Closed 9 months ago. I have combined data sets in R, and each data set may use a different column name for the same data. I need to use a regular expression to identify the names of the columns I need to combine, and then run that list of column names through coalesce. I know the proper regex to identify my columns, and I know how to manually write the column names into the coalesce function to

The coalesce() Function Explained

Submitted by 不羁岁月 on 2020-02-28 15:18:54
The coalesce() function returns the value of the first non-null expression in its argument list. If all expressions evaluate to null, it returns null.

coalesce(expression_1, expression_2, ..., expression_n)

The arguments are checked in order; evaluation stops at the first non-null value, which is returned. If every expression is null, the final result is null.

Example: coalesce(sch_name, sub_name, date, '2020') checks sch_name and returns it if it is not null; if it is null, it checks sub_name and returns it if not null; if that is null, it checks date and returns it if not null; if that is also null, it returns '2020'.

Source: CSDN Author: 二楼后座Tansen Link: https://blog.csdn.net/qq_39072649/article/details/104551983
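A minimal sketch to make the evaluation order concrete; the schools table is hypothetical, and the column names follow the post's example (note that date may need quoting as a column name in some dialects):

-- Hypothetical table; sch_name, sub_name, date as in the example above.
SELECT coalesce(sch_name, sub_name, date, '2020') AS first_non_null
FROM schools;

-- With literals, the short-circuit order is visible directly:
SELECT coalesce(NULL, NULL, 'third', 'fourth');   -- returns 'third'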

Clever Tricks with WITH in Data Development

Submitted by 青春壹個敷衍的年華 on 2020-02-22 23:37:26
A bit of rambling first: in everyday data development, when a script needs to produce an intermediate table, or when you want to improve the script's performance by turning that intermediate logic into a subquery, a life of hand-writing SQL offers essentially two options:

CREATE TABLE tmp.tmpxxxxx AS
Pros: materializes a physical table, so during data validation you can trace back to the source.
Cons: one extra write to disk; bluntly, more IO, which incurs heavy disk and network overhead.

CACHE TABLE tmpxxxxx AS
Pros: the intermediate data is broadcast to every node, speeding up the next read of the intermediate table.
Cons: the intermediate data cannot be inspected, and if downstream computation uses it only once, the cache adds an extra stage and wastes compute resources.

Now the main point: what is WITH? The WITH AS clause, also called subquery factoring, defines a SQL fragment that the whole SQL statement can use. It makes the SQL more readable, and the fragment can also feed data to the different branches of a UNION ALL. When a WITH AS fragment is referenced more than twice by a UNION ALL, the optimizer automatically puts the data fetched by that WITH AS clause into a temp table; the materialize hint forces the WITH AS data into a global temporary table. Many queries can be sped up this way.

What is WITH good for? It provides one subquery that the whole SQL statement can call, which also makes the script easier to maintain.

WITH usage and scenarios: WITH a AS (), b AS
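A minimal sketch of the WITH AS pattern the post describes; the orders table and its columns (user_id, amount) are hypothetical:

-- Define the fragment once, reference it in both branches of the UNION ALL.
WITH big_orders AS (
    SELECT user_id, amount
    FROM orders
    WHERE amount > 100
)
SELECT user_id, amount FROM big_orders WHERE amount > 1000
UNION ALL
SELECT user_id, amount FROM big_orders WHERE amount <= 1000;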

Strange TSQL behavior with COALESCE when using Order By [duplicate]

Submitted by 怎甘沉沦 on 2020-02-16 05:21:06
Question: This question already has answers here: nvarchar concatenation / index / nvarchar(max) inexplicable behavior (2 answers). Closed 4 years ago. I'm seeing some very strange behavior with coalesce. When I don't specify a return amount (TOP (50)), I only get the single last result, but if I remove the ORDER BY it works... Examples below:

DECLARE @result varchar(MAX)

SELECT @result = COALESCE(@result + ',', '') + [Title]
FROM Episodes
WHERE [SeriesID] = '1480684' AND [Season] = '1'
Order by
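For context: the variable-concatenation pattern quoted above is not guaranteed to visit every row once an ORDER BY (or certain indexes) is involved, which is what the linked duplicate explains. On SQL Server 2017+, a commonly suggested alternative is STRING_AGG; a hedged sketch reusing the question's table, with [Title] assumed as the sort column (the original's ORDER BY column is cut off):

DECLARE @result varchar(MAX);

-- STRING_AGG aggregates every matching row and orders the concatenation explicitly.
SELECT @result = STRING_AGG([Title], ',') WITHIN GROUP (ORDER BY [Title])
FROM Episodes
WHERE [SeriesID] = '1480684' AND [Season] = '1';

SELECT @result;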

Spark Optimization

Submitted by 我的梦境 on 2020-02-06 10:30:45
A summary of Spark optimization.

1. Resource tuning

Specify the default resource allocation when deploying the Spark cluster (configuration file), in spark-env.sh under the conf directory of the Spark installation:
SPARK_WORKER_CORES
SPARK_WORKER_MEMORY
SPARK_WORKER_INSTANCES (number of workers started per machine)

Allocate more resources to the application when submitting it (Linux submit command), via submit-command options:
--executor-cores (if unset, by default every worker starts one executor for the application, and that executor uses all of the worker's cores and 1 GB of memory)
--executor-memory
--total-executor-cores (if unset, by default all remaining cores in the cluster are allocated to the application)

Set in the application code or in spark-defaults.conf (code-level settings):
spark.executor.cores
spark.executor.memory
spark.cores.max

Dynamic resource allocation:
spark.shuffle.service.enabled true // start the external shuffle service
spark.shuffle.service.port 7337 /

Coalescing many columns into one column

Submitted by 强颜欢笑 on 2020-02-05 03:49:05
Question: I apologize if I post a similar question to one I asked earlier, but I realized that my original question wasn't very clear. I have a dataframe with five columns and 6 rows (actually there are many more; I am just trying to simplify matters):

One    Two   Three   Four   Five
Cat    NA    NA      NA     NA
NA     Dog   NA      NA     NA
NA     NA    Mouse   NA     NA
Cat    NA    Rat     NA     NA
Horse  NA    NA      NA     NA
NA     NA    NA      NA     NA

Now, I would like to coalesce all the information into a new single column ('Summary'), like this:

Summary
Cat
Dog
Mouse
Error
Horse
NA
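Only as a cross-language aside (the question itself is about R, and the post is truncated here): the first-non-NA part of the desired Summary column is exactly what SQL's COALESCE does row-wise. A hypothetical sketch over a table t with the five columns above, ignoring the Error case for rows with conflicting values:

-- Hypothetical table t; conflicting rows are not detected here.
SELECT COALESCE(One, Two, Three, Four, Five) AS Summary
FROM t;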

Data Traceback

Submitted by 半腔热情 on 2020-01-31 12:19:40
Data traceback: a data warehouse needs to support tracing history, i.e. looking at past changes, such as the state on a particular day a month ago. For example, to trace back to 2018-05-12:

--query ".....where updated_time >= '2018-05-12 00:00:00'" ----> stage.tmp_a

# Method 1: partitions. Keep one snapshot per day.
insert overwrite table a partition (dt='2018-05-12')
select coalesce(tb.id, ta.id) as id,
       coalesce(tb.name, ta.name) as name,
       .....
       .....
from (select * from a where dt='2018-05-11') as ta
full join stage.tmp_a as tb
on ta.id = tb.id

# Method 2: full table. The tmp table holds yesterday's changes; it is merged into and overwrites the data as of the day before. Only one table is kept, so viewing historical changes is inconvenient and traceback is hard.
insert overwrite table a
select coalesce(tb.id, ta.id) as id,
       coalesce(tb.name, ta.name) as name,
       .....
       .....
from a as ta
full join stage.tmp_a as tb
on ta.id = tb.id

Source:

How to order by 2 columns combining COALESCE?

Submitted by 蓝咒 on 2020-01-26 04:22:24
Question: I have a question about ordering a SQL table, and I can't find a solution on Stack Overflow or Google. My table "Score" looks as follows:

Name  Total  Tries  Game1  Game2  Game3
------------------------------------------
Sam   65     61     10     31     24
Tom          55     11            30
Jim   65     58     9      34     22
Dan   62     52     10     30     22

Note: the "Total" column is COUNT(Game1 + Game2 + Game3). As you can see, the Total record for Tom is empty, because Tom didn't play Game2. I want to order my table as follows (highest-lowest priority): empty cells (at the
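The post is cut off before the full ordering rules, but the title's "order by 2 columns combining COALESCE" pattern usually looks something like the sketch below; the choice to sort NULL totals first is an assumption based on the visible text:

-- Rows with a NULL Total sort first (assumed), then by Total descending.
SELECT Name, Total, Tries, Game1, Game2, Game3
FROM Score
ORDER BY
    CASE WHEN Total IS NULL THEN 0 ELSE 1 END,  -- NULLs first (assumption)
    COALESCE(Total, 0) DESC;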

[Spark] (7) Understanding Spark Partitions / The Difference Between coalesce and repartition

Submitted by 牧云@^-^@ on 2020-01-24 04:24:36
Contents
1. Understanding Spark partitions
2. The difference between coalesce and repartition (everywhere below, coalesce is assumed to have its shuffle parameter set to false)
3. Examples
4. Summary

1. Understanding Spark partitions

Spark schedules tasks at the vcore level. When reading from HDFS, there are as many partitions as there are blocks. For example: if Spark SQL reads table T, and table T consists of 10,000 small files, then there are 10,000 partitions and read efficiency is low. Suppose resources are set to --executor-memory 2g --executor-cores 2 --num-executors 5. The steps are: take small files 1-10 (that is, 10 partitions) and hand them to the 5 executors to read (Spark schedules per vcore, so in practice the 5 executors run 10 tasks that read the 10 partitions). If the 5 executors ran at the same speed, files 11-20 would then be handed to the same 5 executors in turn. In reality execution speeds are never identical: whichever task finishes first claims the next partition to read and execute, and so on. As a result, the time spent scheduling the reads is often larger than the reads themselves, file handles are opened and closed constantly, relatively precious IO resources are wasted, and execution efficiency drops sharply.

2. coalesce vs
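As a hedged illustration of how the small-file partition count above can be reduced: Spark SQL (2.4 and later) accepts COALESCE and REPARTITION hints; the table name T and the target of 10 partitions are placeholders:

-- COALESCE hint: shrink to 10 partitions without a full shuffle.
SELECT /*+ COALESCE(10) */ * FROM T;

-- REPARTITION hint: full shuffle into exactly 10 evenly sized partitions.
SELECT /*+ REPARTITION(10) */ * FROM T;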