Hive HQL面试题(窗口函数over)(二)

tech2024-12-01  16

Hive HQL面试题(窗口函数over)(二)

1. 数据准备1

文本文件 a,2020-02-10,600 a,2020-03-01,200 a,2020-03-02,300 a,2020-03-03,200 a,2020-03-04,400 a,2020-03-05,600 a,2020-02-05,200 a,2020-02-06,300 a,2020-02-07,200 a,2020-02-08,400 b,2020-02-05,200 b,2020-02-06,300 b,2020-02-08,200 b,2020-02-09,400 b,2020-02-10,600 c,2020-01-31,200 c,2020-02-01,300 c,2020-02-02,200 c,2020-02-03,400 c,2020-02-10,600 hive上建表,先建立database create database db_window2; create table tb_shop( name string , ctime string , cost double ) row format delimited fields terminated by ','; load data local inpath '/root/hive_sql/hive_window2/hivewindow2.txt' into table tb_shop; +---------------+----------------+---------------+ | tb_shop.name | tb_shop.ctime | tb_shop.cost | +---------------+----------------+---------------+ | a | 2020-02-10 | 600.0 | | a | 2020-03-01 | 200.0 | | a | 2020-03-02 | 300.0 | | a | 2020-03-03 | 200.0 | | a | 2020-03-04 | 400.0 | | a | 2020-03-05 | 600.0 | | a | 2020-02-05 | 200.0 | | a | 2020-02-06 | 300.0 | | a | 2020-02-07 | 200.0 | | a | 2020-02-08 | 400.0 | | b | 2020-02-05 | 200.0 | | b | 2020-02-06 | 300.0 | | b | 2020-02-08 | 200.0 | | b | 2020-02-09 | 400.0 | | b | 2020-02-10 | 600.0 | | c | 2020-01-31 | 200.0 | | c | 2020-02-01 | 300.0 | | c | 2020-02-02 | 200.0 | | c | 2020-02-03 | 400.0 | | c | 2020-02-10 | 600.0 | +---------------+----------------+---------------+

2. 题目1

2.1 题目内容

找出连续3天有购物记录的数据,包括名字算出都有谁连续4天及以上购物

2.2解题

2.2.1. 找出连续3天有购物记录的数据,包括名字

思路: 连续有3天,并且按照名字分类直接group by的话,无法获取分组前信息,考虑窗口函数连续多少天或者多少次,考虑使用row_number函数涉及日期,需要对日期排序 sql 初步检索出数据 select *, row_number() over(partition by name order by ctime) from tb_shop; +---------------+----------------+---------------+----------------------+ | tb_shop.name | tb_shop.ctime | tb_shop.cost | row_number_window_0 | +---------------+----------------+---------------+----------------------+ | a | 2020-02-05 | 200.0 | 1 | | a | 2020-02-06 | 300.0 | 2 | | a | 2020-02-07 | 200.0 | 3 | | a | 2020-02-08 | 400.0 | 4 | | a | 2020-02-10 | 600.0 | 5 | | a | 2020-03-01 | 200.0 | 6 | | a | 2020-03-02 | 300.0 | 7 | | a | 2020-03-03 | 200.0 | 8 | | a | 2020-03-04 | 400.0 | 9 | | a | 2020-03-05 | 600.0 | 10 | | b | 2020-02-05 | 200.0 | 1 | | b | 2020-02-06 | 300.0 | 2 | | b | 2020-02-08 | 200.0 | 3 | | b | 2020-02-09 | 400.0 | 4 | | b | 2020-02-10 | 600.0 | 5 | | c | 2020-01-31 | 200.0 | 1 | | c | 2020-02-01 | 300.0 | 2 | | c | 2020-02-02 | 200.0 | 3 | | c | 2020-02-03 | 400.0 | 4 | | c | 2020-02-10 | 600.0 | 5 | +---------------+----------------+---------------+----------------------+

这里已经检索出初始数据,按照name分类,同时日期升序排列。 接下来需要判断连续日期,这时候可以假设日期都是连续且递增,row number也是连续和递增。所以2者之差,应该一致。如果日期有断开,则2者之差就不会一样。

进行日期的相减,需要查询函数,进行日期处理的,资料如下 show functions; desc function date_sub ; +----------------------------------------------------+ | tab_name | +----------------------------------------------------+ | date_sub(start_date, num_days) - Returns the date that is num_days before start_date. | +----------------------------------------------------+ 进一步对数据进行处理 select *, date_sub(ctime, rn) as date_sub from (select *, row_number() over(partition by name order by ctime) rn from tb_shop) t1; +----------+-------------+----------+--------+-------------+ | t1.name | t1.ctime | t1.cost | t1.rn | date_sub | +----------+-------------+----------+--------+-------------+ | a | 2020-02-05 | 200.0 | 1 | 2020-02-04 | | a | 2020-02-06 | 300.0 | 2 | 2020-02-04 | | a | 2020-02-07 | 200.0 | 3 | 2020-02-04 | | a | 2020-02-08 | 400.0 | 4 | 2020-02-04 | | a | 2020-02-10 | 600.0 | 5 | 2020-02-05 | | a | 2020-03-01 | 200.0 | 6 | 2020-02-24 | | a | 2020-03-02 | 300.0 | 7 | 2020-02-24 | | a | 2020-03-03 | 200.0 | 8 | 2020-02-24 | | a | 2020-03-04 | 400.0 | 9 | 2020-02-24 | | a | 2020-03-05 | 600.0 | 10 | 2020-02-24 | | b | 2020-02-05 | 200.0 | 1 | 2020-02-04 | | b | 2020-02-06 | 300.0 | 2 | 2020-02-04 | | b | 2020-02-08 | 200.0 | 3 | 2020-02-05 | | b | 2020-02-09 | 400.0 | 4 | 2020-02-05 | | b | 2020-02-10 | 600.0 | 5 | 2020-02-05 | | c | 2020-01-31 | 200.0 | 1 | 2020-01-30 | | c | 2020-02-01 | 300.0 | 2 | 2020-01-30 | | c | 2020-02-02 | 200.0 | 3 | 2020-01-30 | | c | 2020-02-03 | 400.0 | 4 | 2020-01-30 | | c | 2020-02-10 | 600.0 | 5 | 2020-02-05 | +----------+-------------+----------+--------+-------------+ 根据date_sub字段进行计数,超过3次的就是需要的数据 select name, date_sub, count(1) as count from (select *, date_sub(ctime, rn) as date_sub from (select *, row_number() over(partition by name order by ctime) rn from tb_shop) t1) t2 group by name, date_sub having count > 3; # 注意这里别名t1 t2不要遗漏,否则会导致变量访问出问题 +-------+-------------+--------+ | name | date_sub | count | +-------+-------------+--------+ | a | 2020-02-04 | 4 | | a | 2020-02-24 | 5 | | c | 2020-01-30 | 4 | +-------+-------------+--------+ 需要进一步解析,再进一步根据名字区分,distinct name select distinct name from (select name, date_sub, count(1) as count from (select *, date_sub(ctime, rn) as date_sub from (select *, row_number() over(partition by name order by ctime) rn from tb_shop) t1) t2 group by name, date_sub having count > 3) t3; +-------+ | name | +-------+ | a | | c | +-------+
最新回复(0)