Hadoop

Exception while deleting Spark temp dir in Windows 7 64 bit

Submitted by 梦想的初衷 on 2021-02-06 15:14:31
Question: I am trying to run a unit test of a Spark job on Windows 7 64-bit. I have HADOOP_HOME=D:/winutils, and the winutils path is D:/winutils/bin/winutils.exe. I ran the commands below:

winutils ls \tmp\hive
winutils chmod -R 777 \tmp\hive

But when I run my test I get the error below:

Running com.dnb.trade.ui.ingest.spark.utils.ExperiencesUtilTest
Tests run: 17, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.132 sec
17/01/24 15:37:53 INFO Remoting: Remoting shut down
17/01/24 15:37:53 ERROR ShutdownHookManager:
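The excerpt ends before any answer. As a hedged sketch of one common mitigation (not necessarily the accepted fix), pointing spark.local.dir at a directory the test user can freely delete sometimes avoids this shutdown-hook cleanup error on Windows; the D:/tmp/spark path and app name below are placeholders, not from the original post:

from pyspark.sql import SparkSession

# Hypothetical workaround sketch: direct Spark's temp files to a
# user-writable directory so the shutdown hook can delete them.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("windows-temp-dir-test")
    .config("spark.local.dir", "D:/tmp/spark")
    .getOrCreate()
)

spark.range(10).count()  # run a trivial job, then shut down cleanly
spark.stop()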

YARN not preempting resources based on fair shares when running a Spark job

Submitted by 只愿长相守 on 2021-02-06 09:50:10
Question: I have a problem with re-balancing Apache Spark job resources on YARN Fair Scheduler queues. For the tests I configured Hadoop 2.6 (2.7 also tried) to run in pseudo-distributed mode with a local HDFS on macOS. For job submission I used the "Pre-built Spark 1.4 for Hadoop 2.6 and later" distribution (1.5 also tried) from Spark's website. When tested with a basic configuration on Hadoop MapReduce jobs, the Fair Scheduler works as expected: when the resources of the cluster exceed some maximum, fair shares
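The excerpt cuts off before any answer. A frequent cause of this symptom, offered here as a hedged sketch rather than the accepted answer, is that Fair Scheduler preemption is disabled by default and has to be enabled explicitly; the property names below are the standard Hadoop 2.6/2.7 ones, while the queue names and timeout values are placeholders:

<!-- yarn-site.xml: preemption is off unless explicitly enabled -->
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>

<!-- fair-scheduler.xml: per-queue fair-share preemption timeouts (seconds) -->
<allocations>
  <queue name="spark">
    <weight>1.0</weight>
    <fairSharePreemptionTimeout>5</fairSharePreemptionTimeout>
  </queue>
  <queue name="mapreduce">
    <weight>1.0</weight>
    <fairSharePreemptionTimeout>5</fairSharePreemptionTimeout>
  </queue>
</allocations>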

How to use NOT IN in Hive

Submitted by 不羁岁月 on 2021-02-06 04:37:27
Question: Suppose I have the two tables shown below. I want to achieve the result that SQL would give with insert into B where id not in(select id from A), which would insert 3 George into Table B. How do I implement this in Hive?

Table A
id name
1 Rahul
2 Keshav
3 George

Table B
id name
1 Rahul
2 Keshav
4 Yogesh

Answer 1: NOT IN in the WHERE clause with uncorrelated subqueries has been supported since Hive 0.13, which was released more than three years ago, on 21 April 2014:

select * from A where id not in (select id from B);
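To complete the asker's use case, here is a hedged sketch (table and column names come from the question; INSERT INTO ... SELECT assumes a Hive release that supports it, 0.8+):

-- Insert rows of A whose id is absent from B (Hive 0.13+ NOT IN form)
INSERT INTO TABLE B
SELECT a.id, a.name
FROM A a
WHERE a.id NOT IN (SELECT id FROM B);

-- Equivalent LEFT JOIN form, which also works on Hive releases older
-- than 0.13 that lack uncorrelated NOT IN subqueries
INSERT INTO TABLE B
SELECT a.id, a.name
FROM A a
LEFT JOIN B b ON a.id = b.id
WHERE b.id IS NULL;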

Spark Driver memory and Application Master memory

Submitted by 倾然丶 夕夏残阳落幕 on 2021-02-05 20:26:50
Question: Am I understanding the documentation for client mode correctly? Is client mode the opposite of cluster mode, where the driver runs within the application master? In client mode the driver and the application master are separate processes, so spark.driver.memory + spark.yarn.am.memory must be less than the machine's memory? And in client mode, is the driver memory not included in the application master memory setting?

Answer 1: client mode is the opposite of cluster mode, where the driver runs within the application master
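A hedged illustration of the distinction (the spark-submit flags are the standard ones; the memory sizes and the app.py name are arbitrary placeholders):

# Client mode: the driver runs in the submitting JVM and a small AM
# runs on YARN; driver memory plus AM memory must fit on this machine.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 4g \
  --conf spark.yarn.am.memory=1g \
  app.py

# Cluster mode: the driver runs inside the application master and is
# sized with --driver-memory; spark.yarn.am.memory is not used here.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  app.py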

sqoop: importing data from MySQL into HBase

Submitted by 不羁的心 on 2021-02-05 20:25:47
Environment:

Software   Version                            Notes
Ubuntu     19.10
sqoop      1.4.7
mysql      8.0.20-0ubuntu0.19.10.1 (Ubuntu)
hbase      2.2.4                              must be running
hadoop     3.1.2                              must be running
hive       3.0.0                              involved only because HCAT_HOME must be set in .bashrc
accumulo   2.0.0                              needed so ACCUMULO_HOME can be set in .bashrc for sqoop

Goal of the import: MySQL data -------------> HBase

##############################################################################

Prepare the MySQL data set:

mysql> create database sqoop_hbase;
mysql> use sqoop_hbase;
mysql> CREATE TABLE book(
    ->   id INT(4) PRIMARY KEY NOT NULL AUTO_INCREMENT,
    ->   NAME VARCHAR(255) NOT NULL,
    ->   price VARCHAR(255) NOT NULL);

Insert the data set:

mysql> INSERT INTO book(NAME, price) VALUES('Lie Sporting',
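The excerpt ends before the import itself. As a hedged sketch of how such an import is typically issued with sqoop 1.4.7 (the JDBC URL, credentials, and the info column family are placeholders, not from the original post):

# Hypothetical sqoop invocation for the book table above; adjust the
# connection details and column family to your environment.
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop_hbase \
  --username root -P \
  --table book \
  --hbase-table book \
  --column-family info \
  --hbase-row-key id \
  --hbase-create-table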

How to fix the sqoop error "Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR"

Submitted by 杀马特。学长 韩版系。学妹 on 2021-02-05 19:30:29
When using Sqoop to import data from a MySQL table into Hive, the following error appears:

ERROR hive.HiveConfig: Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR is set correctly.

Method 1: append the following to the end of /etc/profile:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*

then reload the configuration with source /etc/profile.

Method 2: copy hive-exec-*.jar from Hive's lib directory into sqoop's lib directory, which also resolves the problem.

Source: oschina. Link: https://my.oschina.net/xiaominmin/blog/4947382
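Method 2 as a concrete command, a sketch assuming HIVE_HOME and SQOOP_HOME are already set in the environment:

# Copy Hive's execution jar to where sqoop can load it (method 2 above)
cp $HIVE_HOME/lib/hive-exec-*.jar $SQOOP_HOME/lib/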

Hardware/OS/Network (9): Understanding common Linux server hardware configurations

Submitted by 只谈情不闲聊 on 2021-02-05 14:53:13
My recent work has involved procuring and installing servers, along with network cabling and building a cloud platform, and the hardware configurations left me somewhat lost. As a Bilibili uploader put it well: get hardware configuration, the Linux system, and operations right and everything else is simple. Developers should understand the underlying hardware and network, so this is a summary of the hardware knowledge and configurations I have encountered so far, for later reference.

1. Server parameters

The point of studying is to gain something, so let us look at a set of server configuration parameters and learn with that goal in mind. There are only a few items: CPU, RAM modules, NIC, disks, fans, USB, and motherboard. Networking multiple servers would also involve switches, routers, cabling, and security equipment (not covered here).

CPU performance depends on the fabrication process, thread count, clock frequency, cache, power draw, and so on; the model numbers are complicated and hard to decode.

RAM sits between the CPU and the disks, caching frequently read or computed hot data to keep up with the CPU's processing speed; its quality depends on frequency, caching scheme, and channel type.

Disks persist the colder data; these days they are all SSDs, and performance is usually judged by read/write capability, capacity, and RAID array type.

The motherboard, NIC, fans, and USB also matter; a motherboard should be weighed on expandability, power, and its capacity to support the storage and compute devices. 1U = 4.445 cm.

CPU: 2288H V5 with 2x Intel Xeon Gold 5218 (2.3 GHz / 16-core / 22 MB / 125 W) processors;
RAM: 8x DDR4 Registered DIMM 32 GB modules; up to 24 DIMM slots supported;
NIC: standard 2xGE + 4x10GE Ethernet ports;
Disk: SR430C-M

How to count in pyspark? [closed]

Submitted by 你。 on 2021-02-05 09:46:36
Question (closed as needing more focus; not accepting answers): I have a huge list of titles and I want to count each title across the whole data set. For example:

title
A
b
A
c
c
c

Output:

title fre
A 2
b 1
c 3

Answer 1: You can just groupBy title and then count:

import pyspark.sql.functions as f
df.groupBy('title').agg(f.count('*').alias('fre')).show()
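A minimal self-contained version of the answer's approach (the sample rows mirror the question's example; the alias fre matches the asker's desired output column):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Sample data mirroring the question
df = spark.createDataFrame(
    [("A",), ("b",), ("A",), ("c",), ("c",), ("c",)], ["title"]
)

# Count how many times each title occurs
df.groupBy("title").agg(f.count("*").alias("fre")).show()

spark.stop()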

Extracting strings between distinct characters using hive SQL

Submitted by 纵饮孤独 on 2021-02-05 08:28:06
Question: I have a field called geo_data_display which contains country, region, and dma. The three values sit between = and & characters: country between the first "=" and the first "&", region between the second "=" and the second "&", and DMA between the third "=" and the third "&". Here is a reproducible version of the table. country is always character, but region and DMA can be either numeric or character, and DMA does not exist for all countries. A few sample values are: country=us&region=tx
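The excerpt stops before any answer. One hedged way to do this in Hive (str_to_map and regexp_extract are built-in Hive functions; the table name geo is a placeholder):

-- Split the string into key=value pairs on '&', then index the map;
-- missing keys such as dma simply yield NULL.
SELECT
  str_to_map(geo_data_display, '&', '=')['country'] AS country,
  str_to_map(geo_data_display, '&', '=')['region']  AS region,
  str_to_map(geo_data_display, '&', '=')['dma']     AS dma
FROM geo;

-- Alternative: pull the text after each key with regexp_extract.
SELECT
  regexp_extract(geo_data_display, 'country=([^&]*)', 1) AS country,
  regexp_extract(geo_data_display, 'region=([^&]*)', 1)  AS region,
  regexp_extract(geo_data_display, 'dma=([^&]*)', 1)     AS dma
FROM geo;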