Hive HQL Interview Questions (group by and case when) (Part 1)


HQL Interview Questions (Part 1)

1. Data Preparation

gz.txt -- employee salary breakdown table

Columns: user id (uid), base salary (jb), bonus (jj), commission (tc), department number (deptno)

```
1,2000,3000,1500,1
2,5000,500,1000,2
3,1500,1000,3000,2
4,3000,6000,8000,3
5,1500,2000,1800,1
6,2500,1000,1900,1
```

bm.txt -- department table (deptno, name; 销售 = Sales, 技术 = Tech, 行政 = Admin)

```
1,销售
2,技术
3,行政
```

yg.txt -- employee information table

Columns: user id (uid), name, gender, age

```
1,zs,M,28
2,ww,F,36
3,zl,F,48
4,pp,M,44
5,wb,M,32
6,TQ,F,32
```

2. Create the Tables and Load the Data into Hive

Create the database db_company:

```sql
create database db_company;
use db_company;
```

Create the tables and load the local files from the Linux node into them:

```sql
-- create the salary table in db_company
create table gz(
  uid    int,
  jb     int,
  jj     int,
  tc     int,
  deptno int
)
row format delimited fields terminated by ",";

-- load the local file into the Hive table
load data local inpath "/root/hive_sql/gz.txt" into table gz;

-- create the department table
create table bm(
  deptno string,
  name   string
)
row format delimited fields terminated by ",";

-- load the local file into the table
load data local inpath "/root/hive_sql/bm.txt" into table bm;

-- create the employee table
create table yg(
  uid    int,
  name   string,
  gender string,
  age    int
)
row format delimited fields terminated by ",";

load data local inpath "/root/hive_sql/yg.txt" into table yg;
```

After the tables are created and loaded, the data looks like this:

```
+---------+--------+--------+--------+------------+
| gz.uid  | gz.jb  | gz.jj  | gz.tc  | gz.deptno  |
+---------+--------+--------+--------+------------+
| 1       | 2000   | 3000   | 1500   | 1          |
| 2       | 5000   | 500    | 1000   | 2          |
| 3       | 1500   | 1000   | 3000   | 2          |
| 4       | 3000   | 6000   | 8000   | 3          |
| 5       | 1500   | 2000   | 1800   | 1          |
| 6       | 2500   | 1000   | 1900   | 1          |
+---------+--------+--------+--------+------------+

+------------+----------+
| bm.deptno  | bm.name  |
+------------+----------+
| 1          | 销售     |
| 2          | 技术     |
| 3          | 行政     |
+------------+----------+

+---------+----------+------------+---------+
| yg.uid  | yg.name  | yg.gender  | yg.age  |
+---------+----------+------------+---------+
| 1       | zs       | M          | 28      |
| 2       | ww       | F          | 36      |
| 3       | zl       | F          | 48      |
| 4       | pp       | M          | 44      |
| 5       | wb       | M          | 32      |
| 6       | TQ       | F          | 32      |
+---------+----------+------------+---------+
```
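A quick way to confirm the loads succeeded is to count the rows in each table and compare against the source files (6, 3 and 6 lines); a minimal sanity-check sketch, assuming the three tables above:

```sql
-- row counts should match the source files: gz = 6, bm = 3, yg = 6
select 'gz' as tbl, count(*) as cnt from gz
union all
select 'bm', count(*) from bm
union all
select 'yg', count(*) from yg;
```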

3. Questions

3.1 For each employee, return the name and which of the three income components is the highest

Idea: we only need the employee name and the salary components, no department info, so joining yg with gz is enough. The highest of the three income components can be taken with the built-in greatest() function; deciding which component that maximum belongs to is a job for case when.

SQL

Step 1: get each employee's name and the largest of the three components:

```sql
select yg.name, greatest(gz.jb, gz.jj, gz.tc) as highest
from gz join yg on gz.uid = yg.uid;
```

```
+----------+----------+
| yg.name  | highest  |
+----------+----------+
| zs       | 3000     |
| ww       | 5000     |
| zl       | 3000     |
| pp       | 8000     |
| wb       | 2000     |
| TQ       | 2500     |
+----------+----------+
```

Step 2: compare that maximum against each component with case when to label it:

```sql
select yg.name,
       greatest(gz.jb, gz.jj, gz.tc) as highest,
       case when greatest(gz.jb, gz.jj, gz.tc) = gz.jb then '基本工资'
            when greatest(gz.jb, gz.jj, gz.tc) = gz.jj then '奖金'
            when greatest(gz.jb, gz.jj, gz.tc) = gz.tc then '提成'
       end as highest_income_part
from gz join yg on gz.uid = yg.uid;
```

```
+----------+----------+----------------------+
| yg.name  | highest  | highest_income_part  |
+----------+----------+----------------------+
| zs       | 3000     | 奖金                 |
| ww       | 5000     | 基本工资             |
| zl       | 3000     | 提成                 |
| pp       | 8000     | 提成                 |
| wb       | 2000     | 奖金                 |
| TQ       | 2500     | 基本工资             |
+----------+----------+----------------------+
```
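The greatest() call can also be computed once in a CTE, so the case when branches only compare against the precomputed column; a minimal equivalent sketch against the same gz and yg tables:

```sql
-- compute the per-row maximum once, then label which component it is
with t as (
  select yg.name, gz.jb, gz.jj, gz.tc,
         greatest(gz.jb, gz.jj, gz.tc) as highest
  from gz join yg on gz.uid = yg.uid
)
select name,
       highest,
       case when highest = jb then '基本工资'
            when highest = jj then '奖金'
            else '提成'
       end as highest_income_part
from t;
```

In both versions a tie is resolved by the first matching branch, i.e. jb before jj before tc.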

3.2 Compute the total salary of each department

Idea: only the department and the salary are involved, so joining gz with bm is enough. We need the total per department, so first compute each employee's jb + jj + tc, then sum it per department.

SQL

First compute each employee's total together with the department name:

```sql
select bm.name, (gz.jb + gz.jj + gz.tc) as total
from gz join bm on gz.deptno = bm.deptno;
```

```
+----------+--------+
| bm.name  | total  |
+----------+--------+
| 销售     | 6500   |
| 技术     | 6500   |
| 技术     | 5500   |
| 行政     | 17000  |
| 销售     | 5300   |
| 销售     | 5400   |
+----------+--------+
```

Then sum per department:

```sql
select t.name, sum(t.total)
from (select bm.name, (gz.jb + gz.jj + gz.tc) as total
      from gz join bm on gz.deptno = bm.deptno) t
group by t.name;
```

```
+---------+--------+
| t.name  |  _c1   |
+---------+--------+
| 技术    | 12000  |
| 行政    | 17000  |
| 销售    | 17200  |
+---------+--------+
```
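The subquery is not strictly required; an equivalent sketch that aggregates in a single pass over the same join:

```sql
-- sum the per-row totals directly, grouped by department name
select bm.name, sum(gz.jb + gz.jj + gz.tc) as total
from gz join bm on gz.deptno = bm.deptno
group by bm.name;
```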

3.3 Compute the total salary per department and per gender

Idea: this question involves the department, the gender and the salary, so all three tables have to be joined; the joins can simply be chained as join ... on ... join ... on .... Then classify by department, classify by gender, and sum the salary within each group.

SQL

First pull out all the columns we need, then process them:

```sql
select yg.name, yg.gender, (gz.jb + gz.jj + gz.tc) as total, bm.name
from yg
join gz on yg.uid = gz.uid
join bm on bm.deptno = gz.deptno;
```

```
+----------+------------+--------+----------+
| yg.name  | yg.gender  | total  | bm.name  |
+----------+------------+--------+----------+
| zs       | M          | 6500   | 销售     |
| ww       | F          | 6500   | 技术     |
| zl       | F          | 5500   | 技术     |
| pp       | M          | 17000  | 行政     |
| wb       | M          | 5300   | 销售     |
| TQ       | F          | 5400   | 销售     |
+----------+------------+--------+----------+
```

Since the grouping is by department name and gender, alias yg.name as username so that name can unambiguously refer to the department, and keep the raw salary components:

```sql
select yg.name as username, yg.gender, gz.jb, gz.jj, gz.tc, bm.name
from yg
join gz on yg.uid = gz.uid
join bm on bm.deptno = gz.deptno;
```

Then group by department and gender with group by:

```sql
with a as (
  select yg.name as username, yg.gender, gz.jb, gz.jj, gz.tc, bm.name
  from yg
  join gz on yg.uid = gz.uid
  join bm on bm.deptno = gz.deptno
)
select name, gender, sum(jb + jj + tc)
from a
group by name, gender;
```

```
+-------+---------+--------+
| name  | gender  |  _c2   |
+-------+---------+--------+
| 技术  | F       | 12000  |
| 行政  | M       | 17000  |
| 销售  | F       | 5400   |
| 销售  | M       | 11800  |
+-------+---------+--------+
```
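As in 3.2, the CTE can be folded away by grouping directly on the joined tables; a minimal equivalent sketch:

```sql
-- group by department name and gender in one pass
select bm.name as dept, yg.gender, sum(gz.jb + gz.jj + gz.tc) as total
from yg
join gz on yg.uid = gz.uid
join bm on bm.deptno = gz.deptno
group by bm.name, yg.gender;
```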

3.4 Compute the total salary by gender and by age range (20-30, 31-40, 41-50)

Idea: we need the employee info and the salary, but not the department, so joining gz with yg is enough; then classify by gender and by age range.

SQL

Initial retrieval:

```sql
select gender, age, gz.jb, gz.jj, gz.tc
from gz join yg on gz.uid = yg.uid;
```

```
+---------+------+--------+--------+--------+
| gender  | age  | gz.jb  | gz.jj  | gz.tc  |
+---------+------+--------+--------+--------+
| M       | 28   | 2000   | 3000   | 1500   |
| F       | 36   | 5000   | 500    | 1000   |
| F       | 48   | 1500   | 1000   | 3000   |
| M       | 44   | 3000   | 6000   | 8000   |
| M       | 32   | 1500   | 2000   | 1800   |
| F       | 32   | 2500   | 1000   | 1900   |
+---------+------+--------+--------+--------+
```

Next classify each row by age range with case when:

```sql
select gender, age, gz.jb, gz.jj, gz.tc,
       case when age >= 20 and age <= 30 then '20~30'
            when age >= 31 and age <= 40 then '31~40'
            when age >= 41 and age <= 50 then '41~50'
            else 'too old'
       end as age_range
from gz join yg on gz.uid = yg.uid;
```

```
+---------+------+--------+--------+--------+------------+
| gender  | age  | gz.jb  | gz.jj  | gz.tc  | age_range  |
+---------+------+--------+--------+--------+------------+
| M       | 28   | 2000   | 3000   | 1500   | 20~30      |
| F       | 36   | 5000   | 500    | 1000   | 31~40      |
| F       | 48   | 1500   | 1000   | 3000   | 41~50      |
| M       | 44   | 3000   | 6000   | 8000   | 41~50      |
| M       | 32   | 1500   | 2000   | 1800   | 31~40      |
| F       | 32   | 2500   | 1000   | 1900   | 31~40      |
+---------+------+--------+--------+--------+------------+
```

Finally group by gender and age range and sum the salary:

```sql
with a as (
  select yg.gender, yg.age, gz.jb, gz.jj, gz.tc,
         case when age >= 20 and age <= 30 then '20~30'
              when age >= 31 and age <= 40 then '31~40'
              when age >= 41 and age <= 50 then '41~50'
              else 'too old'
         end as age_range
  from gz join yg on gz.uid = yg.uid
)
select gender, age_range, sum(jj + jb + tc) as total
from a
group by gender, age_range;
```

```
+---------+------------+--------+
| gender  | age_range  | total  |
+---------+------------+--------+
| F       | 31~40      | 11900  |
| F       | 41~50      | 5500   |
| M       | 20~30      | 6500   |
| M       | 31~40      | 5300   |
| M       | 41~50      | 17000  |
+---------+------------+--------+
```
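The intermediate CTE can also be skipped by repeating the case when expression in group by (Hive generally does not let group by reference a select-list alias); a minimal equivalent sketch:

```sql
-- repeat the bucketing expression in both select and group by
select yg.gender,
       case when yg.age between 20 and 30 then '20~30'
            when yg.age between 31 and 40 then '31~40'
            when yg.age between 41 and 50 then '41~50'
            else 'too old'
       end as age_range,
       sum(gz.jb + gz.jj + gz.tc) as total
from gz join yg on gz.uid = yg.uid
group by yg.gender,
         case when yg.age between 20 and 30 then '20~30'
              when yg.age between 31 and 40 then '31~40'
              when yg.age between 41 and 50 then '41~50'
              else 'too old'
         end;
```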

4. Looking Up Hive's Built-in Functions

```sql
show functions;
```

The output lists every built-in operator and function:

```
!  !=  $sum0  %  &  *  +  -  /  <  <=  <=>  <>  =  ==  >  >=  ^
abs  acos  add_months  aes_decrypt  aes_encrypt  and  array  array_contains
ascii  asin  assert_true  assert_true_oom  atan  avg  base64  between  bin
bloom_filter  bround  cardinality_violation  case  cbrt  ceil  ceiling
char_length  character_length  chr  coalesce  collect_list  collect_set
compute_stats  concat  concat_ws  context_ngrams  conv  corr  cos  count
covar_pop  covar_samp  crc32  create_union  cume_dist  current_authorizer
current_database  current_date  current_groups  current_timestamp
current_user  date_add  date_format  date_sub  datediff  day  dayofmonth
dayofweek  decode  degrees  dense_rank  div  e  elt  encode
enforce_constraint  exp  explode  extract_union  factorial  field
find_in_set  first_value  floor  floor_day  floor_hour  floor_minute
floor_month  floor_quarter  floor_second  floor_week  floor_year
format_number  from_unixtime  from_utc_timestamp  get_json_object
get_splits  greatest  grouping  hash  hex  histogram_numeric  hour  if  in
in_bloom_filter  in_file  index  initcap  inline  instr  internal_interval
isfalse  isnotfalse  isnotnull  isnottrue  isnull  istrue  java_method
json_tuple  lag  last_day  last_value  lcase  lead  least  length
levenshtein  like  likeall  likeany  ln  locate  log  log10  log2
logged_in_user  lower  lpad  ltrim  map  map_keys  map_values  mask
mask_first_n  mask_hash  mask_last_n  mask_show_first_n  mask_show_last_n
matchpath  max  md5  min  minute  mod  month  months_between  murmur_hash
named_struct  negative  next_day  ngrams  noop  noopstreaming  noopwithmap
noopwithmapstreaming  not  ntile  nullif  nvl  octet_length  or  parse_url
parse_url_tuple  percent_rank  percentile  percentile_approx  pi  pmod
posexplode  positive  pow  power  printf  quarter  radians  rand  rank
reflect  reflect2  regexp  regexp_extract  regexp_replace  regr_avgx
regr_avgy  regr_count  regr_intercept  regr_r2  regr_slope  regr_sxx
regr_sxy  regr_syy  repeat  replace  replicate_rows
restrict_information_schema  reverse  rlike  round  row_number  rpad  rtrim
second  sentences  sha  sha1  sha2  shiftleft  shiftright
shiftrightunsigned  sign  sin  size  sort_array  sort_array_by  soundex
space  split  sq_count_check  sqrt  stack  std  stddev  stddev_pop
stddev_samp  str_to_map  struct  substr  substring  substring_index  sum
tan  to_date  to_epoch_milli  to_unix_timestamp  to_utc_timestamp
translate  trim  trunc  ucase  udftoboolean  udftobyte  udftodouble
udftofloat  udftointeger  udftolong  udftoshort  udftostring  unbase64
unhex  unix_timestamp  upper  uuid  var_pop  var_samp  variance  version
weekofyear  when  width_bucket  windowingtablefunction  xpath
xpath_boolean  xpath_double  xpath_float  xpath_int  xpath_long
xpath_number  xpath_short  xpath_string  year  |  ~
```
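To see what any one of these functions does (for example the greatest() used in 3.1), describe function prints its documentation:

```sql
-- one-line summary
describe function greatest;

-- fuller description, with usage examples where the function provides them
describe function extended greatest;
```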