分区表

分区是将一个表或索引物理地分解为多个更小、更可管理的部分。
分区对应用透明，即对访问数据库的应用而言，逻辑上讲只有一个表或一个索引（相当于应用“看到”的只是一个表或索引），但在物理上这个表或索引可能由数十个物理分区组成。

在 Hadoop 中，Hive 分区表通常以特定的目录结构来存储。
每个分区对应一个独立的目录，目录名通常包含分区列的值。数据文件会存储在相应的分区目录下。

分桶表

•分区提供一个隔离数据和优化查询的便利方式。不过，并非所有的数据集都可形成合理的分区。对于一张表或者分区，Hive 可以进一步组织成桶，也就是更为细粒度的数据范围划分。

•分桶是将数据集分解成更容易管理的若干部分的另一个技术。

•分区针对的是数据的存储路径；分桶针对的是数据文件。

分桶策略：Hive的分桶采用对分桶字段的值进行哈希，然后除以桶的个数求余的方式决定该条记录存放在哪个桶当中。

	分区表	分桶表
划分依据	分区列的离散值（目录分割）	分桶列的哈希值（文件分割）
数量灵活性	分区数量可动态新增	桶数量固定（创建时指定）
优化目标	减少扫描范围（过滤查询）	优化 Join 和抽样（哈希分片）

实践练习

1.清洗excel文件超市数据.xlsx 为文本文件，按照分区字段对文件进行拆分
ps: 浮点数精确小数点后2位，不需要有表头和index ，数据分隔符不能是逗号
数据清洗脚本：

import pandas as pd
import os

# 读取Excel文件
df = pd.read_excel('超市数据.xlsx')

# 创建输出目录
output_dir = '超市数据_分区'
os.makedirs(output_dir, exist_ok=True)

# 需要处理的浮点数列
float_columns = ['销售额', '折扣', '利润']

# 按类别拆分并保存文件
for category in df['类别'].unique():
    # 使用.loc创建显式副本
    category_df = df.loc[df['类别'] == category].copy()

    # 浮点数保留两位小数（使用更安全的赋值方式）
    for col in float_columns:
        if col in category_df.columns:
            category_df[col] = category_df[col].round(2)

    # 保存为制表符分隔的文本文件
    file_path = os.path.join(output_dir, f'{category}.txt')
    category_df.to_csv(file_path, sep='\t', na_rep='nan', header=False, index=False)

print(f"数据已按类别清洗并拆分至目录: {output_dir}")

2.创建一个分区表，做分桶 clustered by （id ） into 2 buckets，将文本文件分别插入到分区中

create table supermarket_p1 (
id string, -- 行 ID  
ord_id string comment '订单 ID',  
ord_date string comment '订单日期', 
exch_date string comment '发货日期',
exch_type string comment '邮寄方式',
cust_id string comment '客户 ID ',
cust_name string comment '客户名称',
d_type string comment '细分',
city string comment '城市', 
prov string comment '省/自治区',  
country string comment'国家',
area string comment '地区', 
pro_id string comment '产品 ID',  
type1 string comment '类别',
type2 string comment '子类别', 
pro_name string comment '产品名称',
sales float comment '销售额',  
count1 int comment '数量  ',
discount float comment '折扣  ',
profit float comment '利润'
)
partitioned by (c_type1 string)
clustered by (id) into 2 buckets
row format delimited fields terminated by '\t'

上传数据到hdfs系统

hdfs dfs -mkdir -p /user/hive/warehouse/supermarket/data
hdfs dfs -put 办公用品.txt /user/hive/warehouse/supermarket/data
hdfs dfs -put 技术.txt /user/hive/warehouse/supermarket/data
hdfs dfs -put 家具.txt /user/hive/warehouse/supermarket/data

插入数据

load data inpath '/user/hive/warehouse/supermarket/data/办公用品.txt'
into table supermarket_p1 partition (c_type1='bgyp')

load data inpath '/user/hive/warehouse/supermarket/data/技术.txt'
into table supermarket_p1 partition (c_type1='jishu')

load data inpath '/user/hive/warehouse/supermarket/data/家具.txt'
into table supermarket_p1 partition (c_type1='jiajv')

3.创建一个按订单时间分区的分区表 (year_ string, month_ string)

create table supermarket_p2 (
id string, -- 行 ID  
ord_id string comment '订单 ID',  
ord_date string comment '订单日期', 
exch_date string comment '发货日期',
exch_type string comment '邮寄方式',
cust_id string comment '客户 ID ',
cust_name string comment '客户名称',
d_type string comment '细分',
city string comment '城市', 
prov string comment '省/自治区',  
country string comment'国家',
area string comment '地区', 
pro_id string comment '产品 ID',  
type1 string comment '类别',
type2 string comment '子类别', 
pro_name string comment '产品名称',
sales float comment '销售额',  
count1 int comment '数量  ',
discount float comment '折扣  ',
profit float comment '利润'
)
partitioned by (year_ string, month_ string)
row format delimited fields terminated by '\t'

动态分区配置

--开启动态分区(默认开启)
set hive.exec.dynamic.partition=true
--指定非严格模式 nonstrict模式表示允许所有的分区字段都可以使用动态分区
set hive.exec.dynamic.partition.mode=nonstrict
--在所有执行MR的节点上，最大一共可以创建多少个动态分区。默认1000
set hive.exec.max.dynamic.partitions=1000
--在每个执行MR的节点上，最大可以创建多少个动态分区(分区字段有多少种设多少个)
set hive.exec.max.dynamic.partitions.pernode=100
--整个MR Job中，最大可以创建多少个HDFS文件。默认100000
set hive.exec.max.created.files=100000
--当有空分区生成时，是否抛出异常
set hive.error.on.empty.partition=false
--打开正则查询模式`(dt|hr)?+.+`
set hive.support.quoted.identifiers=none

4.将第二部的数据使用动态分区的方法导入到第三步的表中

1
2
3

insert into supermarket_p2 partition (year_,month_)
select p1.*,date_format(ord_date,"yyyy"),date_format(ord_date,"MM")
from (select `(c_type1)?+.+` from supermarket_p1) p1