1、筛选出黑样本
create table account_black select * from foreigncard_account_black where dt between 20191201 and 20200115; drop table if exists event_black; create table event_black as select * from foreigncard_account_all where userid in (select userid from foreingncard_account_black ) and dt between '20191201' and '20200115'1、创建表时先drop table if exists 表名; 2、稍微要注意的地方:没有用 eventid 作为筛选黑白的条件,而user_id, 原因是 一个user_id 会对应多个event_id, 这里面会存在灰的情况。比如说,假设一个用户涉案,涉案的交易事件我们自然会认为是黑,但是用户下面非涉案事件的事件,这一部分我们不能把他们当做白的。 3、通常我们都会面临这样的样本不均匀的情况,存在少数类和多数类(这个case里面黑的record是少数类,而正常的交易即白的事件是多数类),我们这边采样的话是用了下采样,也就是我们是以数据量少的一方的样本为准,对多数类进行采样;上采样的话则是以数据量多的一方的样本为准(比如重复采黑样本),对少数类进行采样
2、筛选出白样本
drop table if exists event_white; create table event-white as select * from (foreign_account_all where userid not in (select userid from foreigncard_account_black) and dt between '20191201' and '20200115') a3、计算黑白样本的行数(假定我们已经知道要对白样本进行采样,1/27 采样以保证 黑样本浓度为10%左右),并对白样本进行采样
create table account_w select * from account_white sample(0.037); drop table if exists event_white; create table event-white as select * from (foreign_account_all where userid not in (select userid from foreigncard_account_black) and dt between '20191201' and '20200115') a where sample(27,1,eventid) = truesample 函数 表示数值会对每行数据进行随机哈希分配,sample(4,1,eventid) = true, 随机分配为4份,取第1份,这里的eventid是对eventid这个字段抽样;
4、4. 黑白事件打标并得到驱动表 (驱动表里的字段包括交易信息表的所有字段 + 黑白样本的标注字段 )
select *, 0 as sign from account_black union select *, 1 as sign from account_w; ```sql drop table if exists customer_drivertable; create table customer_drivertable as select *, 1 as target from event_black union all select * ,0 as target from event_white;