领料单(领料单号,领料人,商品号,数量,日期) 进料单(进料单号,订单号,进料人,收料人,日期) 库存(商品号,库房号,库存量,日期) 库房(库房号,仓库管理员,地点,库存商品描述)
员工(员工号,姓名,性别,年龄,文化程度,部门号) 部门(部门号,部门名称,部门主管,电话)
在每个主题当中,包含了相关主题的所有信息,同时又抛弃了与分析处理无关的数据
Hive只是为数据仓库提供了一种解决方案, 并不能说Hive就是数据仓库
hive 启动元数据服务
nohup hive --service metastore & hive 启动 hiveserver2 服务 nohup hive --service hiveserver2 &全路径启动beeline
/opt/install/hive/bin/beeline -u jdbc:hive2://localhost:10000ETL 抽取
全量抽取customer表 sqoop import \ --connect jdbc:mysql://localhost:3306/sales_source \ --username root \ --password root \ --table customer \ --hive-import \ --hive-table sales_rds.customer \ --hive-overwrite \ --target-dir temp 全量导入product表 sqoop import \ --connect jdbc:mysql://localhost:3306/sales_source \ --username root \ --password root \ --table product \ --hive-import \ --hive-table sales_rds.product \ --hive-overwrite \ --target-dir temp 增量抽取 sales_order sqoop import \ --connect jdbc:mysql://localhost:3306/sales_source \ --username root \ --password root \ --table sales_order \ --hive-import \ --hive-table sales_rds.sales_order \ --fields-terminated-by '\t' \ --lines-terminated-by '\n' \ --check-column entry_date \ --incremental append \ --last-value '1900-1-1'指定Hive表之间的字段分隔符 –fields-terminated-by ‘\t’
指定Hive表之间的行分隔符 –lines-terminated-by ‘\n’
检查列, 通过检查某一列的值实现递增 –check-column entry_date
增量的方式, 可以选择append或者是lastmodified –incremental append
上一次检查列最大的值 –last-value ‘1900-1-1’
删除job sqoop job --delete myjob
创建job
sqoop job \ --create myjob \ -- import \ --connect jdbc:mysql://localhost:3306/sales_source \ --username root \ --password root \ --table sales_order \ --hive-import \ --hive-table sales_rds.sales_order \ --fields-terminated-by '\t' \ --lines-terminated-by '\n' \ --check-column entry_date \ --incremental append \ --last-value '1900-1-1' 查看job sqoop job --list执行job sqoop job --exec myjob在mysql中新增数据 USE sales_source; SET @customer_number := FLOOR(1+RAND()*6); SET @product_code := FLOOR(1+RAND()* 3); SET @order_date := '2018-11-23'; SET @amount := FLOOR(1000+RAND()*9000); INSERT INTO sales_order VALUES(100001,@customer_number,@product_code,@order_date,@order_date,@amount); SET @customer_number := FLOOR(1+RAND()*6); SET @product_code := FLOOR(1+RAND()* 3); SET @order_date := '2018-11-24'; SET @amount := FLOOR(1000+RAND()*9000); INSERT INTO sales_order VALUES(100002,@customer_number,@product_code,@order_date,@order_date,@amount); COMMIT ;执行job sqoop job --exec myjob
在hive中查看数据条数, 可以看到多了两条数据, 一共100002条数据 ±--------±-+ | _c0 | ±--------±-+ | 100002 | ±--------±-+
删除job sqoop job --delete myjob
防止在执行的时候手动输入密码
echo -n "root" > sqoopPWD.pwd hdfs dfs -mkdir -p /sqoop/pwd hdfs dfs -put sqoopPWD.pwd /sqoop/pwd/ hdfs dfs -chmod 400 /sqoop/pwd/sqoopPWD.pwd 删除job sqoop job --delete myjob创建job sqoop job \ --create myjob \ -- import \ --connect jdbc:mysql://localhost:3306/sales_source \ --username root \ --password-file /sqoop/pwd/sqoopPWD.pwd \ --table sales_order \ --hive-import \ --hive-table sales_rds.sales_order \ --fields-terminated-by '\t' \ --lines-terminated-by '\n' \ --check-column entry_date \ --incremental append \ --last-value '1900-1-1' 执行job sqoop job --exec myjob获得数据后上传至hdfs上中date表的目录下
hdfs dfs -put dim_date.csv /hive/warehouse/sales_dw.db/dim_date想求一下 2018年10月20这一天的指标 顾客,产品,日期,当天订单个数, 当天的订单金额, 近2天的订单个数, 近2天的订单金额
create database if not exists sales_dm; CREATE TABLE `sales_dm.dm_order`( `customer_sk` int, `customer_number` int, `customer_name` varchar(128), `customer_street_address` varchar(256), `customer_zip_code` int, `customer_city` varchar(32), `customer_state` varchar(32), `product_sk` int, `product_code` int, `product_name` varchar(128), `product_category` varchar(256), `date_sk` int, `date` date, `month` tinyint, `month_name` varchar(16), `quarter` tinyint, `year` int, `one_order_cnt` bigint, `tow_order_cnt` bigint, `one_order_amount` decimal(28,2), `tow_order_amount` decimal(28,2)) ; from ( select dc.customer_sk, dc.customer_number, dc.customer_name , dc.customer_street_address, dc.customer_zip_code , dc.customer_city , dc.customer_state , dp.product_sk, dp.product_code , dp.product_name, dp.product_category, dd.date_sk , dd.`date`, dd.month , dd.month_name, dd.quarter , dd.year , sum(case when datediff('2018-10-20',dd.date)=0 then 1 else 0 end) one_order_cnt, sum(case when datediff('2018-10-20',dd.date)<=1 then 1 else 0 end) tow_order_cnt, sum(case when datediff('2018-10-20',dd.date)=0 then fso.order_amount else 0 end) one_order_amount, sum(case when datediff('2018-10-20',dd.date)<=1 then fso.order_amount else 0 end) tow_order_amount from sales_dw.fact_sales_order fso join sales_dw.dim_customer dc on fso.customer_sk=dc.customer_sk join sales_dw.dim_product dp on fso.product_sk=dp.product_sk join sales_dw.dim_date dd on fso.order_date_sk=dd.date_sk where dd.date>='2018-10-19' and dd.date<='2018-10-20' group by dc.customer_sk, dc.customer_number, dc.customer_name , dc.customer_street_address, dc.customer_zip_code , dc.customer_city , dc.customer_state , dp.product_sk, dp.product_code , dp.product_name, dp.product_category, dd.date_sk , dd.`date`, dd.month , dd.month_name, dd.quarter , dd.year ) temp insert into table sales_dm.dm_order select * from temp; ;