Hive Notes: Partitioned Tables and Bucketed Tables in Detail

tech2024-12-07 68

--- Contents of this chapter

Partitioned tables (static partitions and dynamic partitions)

Bucketed tables

Sampling queries

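Partitioned Tables

The concept of data partitioning has been around for a long time. Partitioning is commonly used to spread load horizontally, to physically move data closer to its most frequent users, and for other purposes.

The data Hive processes lives in HDFS. A query such as select * from tb_name where dt='2020-09-03' ; reads the data under the table's directory in HDFS. If that directory holds a lot of data, loading all of it and only then filtering is clearly inefficient. A Hive partitioned table simply manages the data in separate directories according to some dimension; when a query filters on that dimension, Hive reads only the matching directory, which is far more efficient.

Addendum: a partition corresponds to a separate directory in HDFS, and that directory holds all of the partition's data files. In Hive, partitioning means splitting into directories: a large dataset is divided into smaller ones according to business needs, and the expressions in a query's WHERE clause select only the partitions the query needs, which makes such queries much more efficient.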

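Partitioned tables come in two kinds: statically partitioned and dynamically partitioned.

1> Static partitions

1.1 One-level partitions

The data below is order data, with one file of orders generated per day.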

orders.txt
01,2020-02-14,100
02,2020-02-14,100
03,2020-02-14,200
04,2020-02-14,200
05,2020-02-14,30
06,2020-02-14,200

orders2.txt
11,2020-02-15,20
14,2020-02-15,200
15,2020-02-15,20
16,2020-02-15,100
17,2020-02-15,200
18,2020-02-15,200

Without partitions

-- create the table
create table tb_order(
  oid int ,
  dt string ,
  cost double
)
row format delimited fields terminated by "," ;
-- load each day's data into the table
load data local inpath "/demo/orders.txt" into table tb_order ;
load data local inpath "/demo/orders2.txt" into table tb_order ;

The data is loaded into the table's directory. When the following query runs, Hive first reads both files and only then filters:

select * from tb_order where dt='2020-02-14' ; -- scans both files

With static partitions

create table tb_part_order(
  oid int ,
  dt string ,
  cost double
)
partitioned by (dy string)
row format delimited fields terminated by "," ;
load data local inpath "/hive/data/orders.txt" into table tb_part_order partition(dy="02-14");
load data local inpath "/hive/data/orders2.txt" into table tb_part_order partition(dy="02-15");

0: jdbc:hive2://linux01:10000> select * from tb_part_order where dy="02-14";
+--------------------+-------------------+---------------------+-------------------+
| tb_part_order.oid  | tb_part_order.dt  | tb_part_order.cost  | tb_part_order.dy  |
+--------------------+-------------------+---------------------+-------------------+
| 1                  | 2020-02-14        | 100.0               | 02-14             |
| 2                  | 2020-02-14        | 100.0               | 02-14             |
| 3                  | 2020-02-14        | 200.0               | 02-14             |
| 4                  | 2020-02-14        | 200.0               | 02-14             |
| 5                  | 2020-02-14        | 30.0                | 02-14             |
| 6                  | 2020-02-14        | 200.0               | 02-14             |
+--------------------+-------------------+---------------------+-------------------+

--- Inspecting the data stored in HDFS
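To see how partitions map onto directories, something like the following can be run from the Hive CLI or beeline. This is a minimal sketch: the warehouse path /user/hive/warehouse is an assumption (the common default location for the default database), so substitute the path used by your installation.

```sql
-- List the partitions Hive has registered for the table.
show partitions tb_part_order;

-- List the table directory in HDFS; each partition appears as a
-- subdirectory named dy=<value>.
-- ASSUMPTION: default warehouse location, default database.
dfs -ls /user/hive/warehouse/tb_part_order;

-- A filter on the partition column reads only the matching subdirectory.
select * from tb_part_order where dy = '02-15';
```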

1.2 Two-level partitions

Subdivide the data further along a hierarchy.

create table tb_static_partition_2(
  id int ,
  ctime string ,
  name string
)
partitioned by (mt string , dt string)
row format delimited fields terminated by "," ;
load data local inpath "/demo/log/2020-09-02_01.log" into table tb_static_partition_2 partition(mt='09' , dt='02') ;
load data local inpath "/demo/log/2020-09-02_02.log" into table tb_static_partition_2 partition(mt='09' , dt='02') ;
load data local inpath "/demo/log/2020-09-01_01.log" into table tb_static_partition_2 partition(mt='09' , dt='01') ;
load data local inpath "/demo/log/2020-09-01_02.log" into table tb_static_partition_2 partition(mt='09' , dt='01') ;
load data local inpath "/demo/log/2020-08-31_01.log" into table tb_static_partition_2 partition(mt='08' , dt='31') ;
load data local inpath "/demo/log/2020-08-31_02.log" into table tb_static_partition_2 partition(mt='08' , dt='31') ;
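With two partition columns, the directories nest (mt=.../dt=...), and a query can filter on one level or on both. The queries below are an illustrative sketch against the table defined above, not output captured from a real run.

```sql
-- Partitions are listed as nested paths such as mt=09/dt=02.
show partitions tb_static_partition_2;

-- Filtering on the first level reads every dt subdirectory under mt=09.
select * from tb_static_partition_2 where mt = '09';

-- Filtering on both levels reads a single leaf directory.
select * from tb_static_partition_2 where mt = '09' and dt = '02';
```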

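2> Dynamic partitions

Rows are loaded into the matching partition automatically, based on the value of a column.

2.1 Create an ordinary table

create table demo(
  id int ,
  birthday string ,
  cost int
)
row format delimited fields terminated by '\t' ;

2.2 Load the data

load data local inpath '/data/demo' into table demo ;

2.3 Sample data (tab-separated: id, birthday, cost)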

1 2010-11-12 120
2 2010-11-12 121
3 2010-11-12 122
4 2010-11-12 124
5 2010-11-13 122
6 2010-11-13 120
7 2010-11-14 123
8 2010-11-13 1202
9 2010-11-12 1202
10 2010-11-12 1201

2.4 Create the partitioned table

create table dyn_demo(
  id int ,
  birthday string ,
  cost int
)
partitioned by(bt string)
row format delimited fields terminated by '\t' ;

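2.5 Set parameters

set hive.exec.dynamic.partition=true; -- enable dynamic partitioning

set hive.exec.dynamic.partition.mode=nonstrict; -- unrestricted mode; in strict mode there must be at least one static partition, and it must come first

set hive.exec.max.dynamic.partitions.pernode=10000; -- maximum number of dynamic partitions each node may create

set hive.exec.max.dynamic.partitions=100000; -- maximum number of dynamic partitions in total

set hive.exec.max.created.files=150000; -- maximum number of files a single job may create

set dfs.datanode.max.xcievers=8192; -- limit on the number of files open at the same time (an HDFS DataNode-side setting)

set hive.merge.mapfiles=true; -- merge the output files produced on the map side

set mapred.reduce.tasks=2; -- number of reduce tasks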

2.6 Load the data

insert into table dyn_demo partition(bt)

select id , birthday , cost , birthday as bt from demo ; -- mind the order: the dynamic partition column must come last in the select list
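After the insert, it is worth checking that one partition was created per distinct value of the partitioning column. The sketch below assumes the dyn_demo table from step 2.4, with partition column bt populated from birthday; the sample data contains three distinct birthday values, so three partitions would be expected.

```sql
-- One partition per distinct birthday value should be listed.
show partitions dyn_demo;

-- Count rows per partition to confirm the data landed where expected.
select bt, count(*) as row_count
from dyn_demo
group by bt;

-- Filtering on the partition column reads only that partition's directory.
select * from dyn_demo where bt = '2010-11-13';
```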

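Bucketed Tables

--- How bucketed table data is stored

Bucketing a Hive (Inceptor) table distributes its rows across multiple files according to the hash of the bucketing key (a column). These smaller files are called buckets.

Partitioning applies to the storage path of the data; bucketing applies to the data files themselves.

Partitions provide a convenient way to isolate data and optimize queries. However, not every dataset lends itself to sensible partitioning, particularly given the difficulty of choosing a suitable partition size.

Bucketing is another technique for breaking a dataset into more manageable parts.

There are two reasons to divide a table (or a partition) into buckets:

Faster queries. Buckets impose extra structure on the table, so joining two tables that are bucketed on the same column can use a map-side join, which is more efficient.

More efficient sampling. Without buckets, sampling has to scan the entire dataset.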

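1. Prepare the data

The data will be stored split across several files, much like partitioning. Sample data (tab-separated columns: id, name):

1 m1
2 m2
3 m3
4 m4
5 m5
6 m6
7 m7
8 m8
9 m9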

2. Create the bucketed table

create table tb_cluster(
  id int ,
  name string
)
clustered by(id) into 4 buckets
row format delimited fields terminated by "\t" ;

3. View the table's bucketing information

desc formatted tb_cluster ;
(excerpt of the output, showing only the bucketing rows)
+------------------+------------+----------+
| col_name         | data_type  | comment  |
+------------------+------------+----------+
| Num Buckets:     | 4          | NULL     |
| Bucket Columns:  | [id]       | NULL     |
+------------------+------------+----------+

4. Create an ordinary table and load the data into it

create table if not exists tb_cluster2(
  id int ,
  name string
)
row format delimited fields terminated by "\t" ;
load data local inpath "/demo/cluster.txt" into table tb_cluster2 ;

select * from tb_cluster2 ;
+-----------------+-------------------+
| tb_cluster2.id  | tb_cluster2.name  |
+-----------------+-------------------+
| 1               | m1                |
| 2               | m2                |
| 3               | m3                |
| 4               | m4                |
| 5               | m5                |
| 6               | m6                |
| 7               | m7                |
| 8               | m8                |
| 9               | m9                |
+-----------------+-------------------+

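5. Enable bucketing

set hive.enforce.bucketing=true; -- needed on older Hive versions; from Hive 2.0 on, bucketing is always enforced
set mapreduce.job.reduces=-1; -- let Hive decide the number of reducers, so it can match the number of buckets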

6. Load the data into the bucketed table with an insert ... select

insert into table tb_cluster
select id , name from tb_cluster2 ;
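To get a feel for how rows are spread over the 4 buckets, the bucket number can be approximated with Hive's built-in hash() and pmod() functions. Treat this as a rough sketch: it assumes the classic hashing scheme, and newer Hive versions (bucketing version 2 in Hive 3) use a different hash function when writing bucket files, so the real file assignment may not match this calculation. The warehouse path is also an assumption (the common default).

```sql
-- Approximate bucket number for each row: hash of the bucketing column
-- modulo the bucket count (4, as declared on tb_cluster).
select id, name, pmod(hash(id), 4) as approx_bucket
from tb_cluster2
order by approx_bucket, id;

-- The bucketed table's directory should contain one file per bucket.
-- ASSUMPTION: default warehouse location, default database.
dfs -ls /user/hive/warehouse/tb_cluster;
```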

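Sampling Queries on Buckets

A sample is drawn from a large dataset based on hashcode % y of some column: x (the bucket number) OUT OF y (the number of sample groups, i.e. buckets).

There is no requirement on the table being sampled: both bucketed tables and ordinary tables work.

With very large datasets, users sometimes need a representative result rather than the full result. Hive supports this by sampling the table.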

Query a sample of the data in table stu_buck:

select * from stu_buck tablesample(bucket 1 out of 4 on id);

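Addendum: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname])

y must be a multiple or a factor of the table's total number of buckets. Hive uses y to determine the sampling ratio. For example, if the table has 4 buckets, then y=2 samples (4/2=) 2 buckets of data, and y=8 samples (4/8=) 1/2 of a bucket.

x is the bucket to start from. If more than one bucket has to be read, each subsequent bucket number is the current one plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) samples (4/2=) 2 buckets of data: bucket 1 (x) and bucket 3 (x+y).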

Note: x must be less than or equal to y; otherwise the query fails with the following error:

FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
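The rules for x and y can be tried out on the tables created earlier. The queries below are a sketch using tb_cluster (4 buckets on id) and the ordinary table tb_cluster2; no output is shown because the rows returned depend on how the data was hashed into buckets.

```sql
-- 4 buckets, y = 2: reads 4/2 = 2 buckets, starting at bucket 1 (1 and 3).
select * from tb_cluster tablesample(bucket 1 out of 2 on id);

-- y = 8: reads 4/8 = half of one bucket.
select * from tb_cluster tablesample(bucket 1 out of 8 on id);

-- Sampling also works on a table that is not bucketed; Hive then hashes
-- the rows at query time, which requires scanning the whole table.
select * from tb_cluster2 tablesample(bucket 1 out of 4 on id);

-- Sampling on rand() instead of a column picks rows at random.
select * from tb_cluster2 tablesample(bucket 1 out of 4 on rand());
```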
