列转行
使用的函数 explode(clo) 将一列复杂的array或者map结构拆分为多行 LATERAL VIEW 为侧视图,意义就是为了配合UDTF来使用,把某一行数据拆分成多行数据,不加lateral view的UDTF只能提取单个字段拆分,并不能塞会原来数据表中.加上lateral view就可以将拆分的单个字段数据与原始表数据关联上.在使用lateral view的时候需要指定视图别名和生成的新列别名 select movie, category_name from movie_info lateral view explode(category) table_tmp as category_name;
窗口函数
相关的函数: over() 指定分析函数工作的数据窗口大小窗口函数通常是分析人员使用 hive ql 进行一些复杂逻辑计算时使用的特殊函数
current row 当前行 n preceding 往前n行 n following 往后n行 unbounded 尽头 unbounded preceding 从起点 unbounded following 最后,终点 lag(clo,n) 往前第n行数据 lead(clo,n) 往后第n行数据 ntile(n) 将数据分成几组,并编号,从1开始,n为int例子
(1)查询在2017年4月份购买过的顾客及总人数 select name,count(*) over () from window1 where substring(orderdate,1,7) = '2017-04' group by name; (2)查询顾客的购买明细及月购买总额 select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from window1; (3)上述的场景,要将cost按照日期进行累加 select name,orderdate,cost, sum(cost) over() as sample1,--所有行相加 sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加 sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加 sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合 sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合 sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行 sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行 from window1; (4)查询顾客上次的购买时间 select name,orderdate,cost, lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1, lag(orderdate,2) over (partition by name order by orderdate) as time2 from window1; (5)查询前20%时间的订单信息 select * from ( select name,orderdate,cost, ntile(5) over(order by orderdate) sorted from window1 ) t where sorted = 1;Rank
排名 相关说明 ranl() 排序值相同时相同排名,总是不变 dense_rank() 排序值相同时相同排名,总数会减少 row_number() 会依次排序 例子 计算每门学科成绩排名 select name, subject, score, rank() over(partition by subject order by score desc) rp, dense_rank() over(partition by subject order by score desc) drp, row_number() over(partition by subject order by score desc) rmp from score;