The first two articles in this NER series covered data cleaning, conversion, and automatic annotation. This article implements the Bi-LSTM+CRF model.
A hidden Markov model (HMM) describes a process in which a hidden Markov chain randomly generates an unobservable sequence of states, and each state in turn generates an observation, producing the observable sequence. An HMM is fully determined by its initial state distribution, its state transition probability matrix, and its observation (emission) probability matrix. Named entity recognition is a sequence labeling problem: what we observe is the sequence of characters (the observation sequence), and what we cannot observe is the tag of each character (the state sequence).
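To make the three HMM components concrete, here is a minimal sketch for a toy two-tag problem; all tags, characters, and probabilities below are invented for illustration and are not taken from the article's data:

```python
import numpy as np

# Toy HMM with 2 tags and a 3-character vocabulary (all numbers invented).
tags = ["O", "B-PER"]
chars = ["张", "三", "说"]

pi = np.array([0.7, 0.3])              # initial state distribution
A = np.array([[0.8, 0.2],              # transition matrix P(tag_t | tag_{t-1})
              [0.6, 0.4]])
B = np.array([[0.2, 0.3, 0.5],         # emission matrix P(char | tag)
              [0.6, 0.3, 0.1]])

def joint_prob(state_seq, obs_seq):
    """P(states, observations) under the toy HMM."""
    p = pi[state_seq[0]] * B[state_seq[0], obs_seq[0]]
    for t in range(1, len(obs_seq)):
        p *= A[state_seq[t - 1], state_seq[t]] * B[state_seq[t], obs_seq[t]]
    return p

# Score the tagging "B-PER B-PER O" for the characters "张 三 说".
print(joint_prob([1, 1, 0], [0, 1, 2]))
```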
The HMM rests on two assumptions: first, the observations are independent of one another given the states; second, the current state depends only on the previous state. A conditional random field (CRF), by introducing user-defined feature functions, can express dependencies among observations, as well as dependencies between the current observation and multiple preceding and following states.
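As a minimal sketch of what "user-defined feature functions" means (the two feature functions, weights, and example below are invented for illustration), a linear-chain CRF scores a whole tag sequence by summing weighted feature functions of the form f(y_prev, y_cur, x, i), which are free to look at the previous tag and at any position of the observation sequence:

```python
# Two hand-written feature functions for a linear-chain CRF (illustrative only).
def f_transition(y_prev, y_cur, x, i):
    # Fires when a "B-PER" tag is immediately followed by another "B-PER",
    # a pattern a trained model would learn to penalize.
    return 1.0 if (y_prev, y_cur) == ("B-PER", "B-PER") else 0.0

def f_observation(y_prev, y_cur, x, i):
    # Looks at the *next* token as well as the current one,
    # something the HMM independence assumptions do not allow.
    return 1.0 if y_cur == "B-PER" and i + 1 < len(x) and x[i + 1] == "先生" else 0.0

def sequence_score(x, y, weights):
    """Unnormalized CRF score: weighted feature functions summed over all positions."""
    feats = [f_transition, f_observation]
    score = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "<START>"
        score += sum(w * f(y_prev, y[i], x, i) for w, f in zip(weights, feats))
    return score

print(sequence_score(["张", "先生"], ["B-PER", "O"], weights=[-2.0, 3.0]))
```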
LSTM (Long Short-Term Memory) is a kind of RNN (Recurrent Neural Network) that is well suited to modeling sequential text data. A BiLSTM is a bidirectional LSTM, built by combining a forward LSTM with a backward LSTM.
LSTM model diagram. Processing the text character by character, w0, w1, … in the figure below represent the characters of a sentence. After the BiLSTM layer, the model outputs, for each character, a probability for every tag, and the tag with the highest probability is taken as that character's predicted label.
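A minimal TF 1.x sketch of this step, assuming placeholder vocabulary size, embedding size, and tag count rather than the article's exact configuration: a BiLSTM reads the character embeddings and a linear projection produces one score per tag for every character.

```python
import tensorflow as tf
from tensorflow.contrib import rnn

def bilstm_tag_scores(char_ids, num_chars=3000, char_dim=100, lstm_dim=100, num_tags=13):
    """char_ids: int32 tensor [batch, max_len]; returns ([batch, max_len, num_tags], lengths)."""
    embedding = tf.get_variable("char_embedding", [num_chars, char_dim])
    embed = tf.nn.embedding_lookup(embedding, char_ids)           # [b, L, char_dim]
    lengths = tf.reduce_sum(tf.sign(tf.abs(char_ids)), axis=1)    # real length of each sentence

    fw = rnn.BasicLSTMCell(lstm_dim)
    bw = rnn.BasicLSTMCell(lstm_dim)
    outputs, _ = tf.nn.bidirectional_dynamic_rnn(
        fw, bw, embed, sequence_length=lengths, dtype=tf.float32)
    output = tf.concat(outputs, axis=-1)                          # [b, L, 2*lstm_dim]

    # Project every timestep onto the tag space.
    logits = tf.layers.dense(output, num_tags)                    # [b, L, num_tags]
    return logits, lengths
```

The article's own `network()` function below does the same thing at larger scale: it concatenates five feature embeddings (char, bound, flag, radical, pinyin) and stacks two BiLSTM layers before the projection.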
Since a BiLSTM alone can already produce entity tags, why add a CRF layer on top?
Because the BiLSTM only models the relationship between the text sequence and the tags; it cannot model the relationships between tags themselves, and those tag-to-tag relationships are exactly what the CRF transition matrix captures. For example, "B" marks the beginning of an entity, so a "B" tag cannot be immediately followed by another "B" (only one character can start an entity); likewise, the I-Person tag in the figure above marks a character inside a person name, so the character before it cannot carry an I-Organization tag. Without these transition constraints, the LSTM can easily output an invalid tag sequence. A CRF layer is therefore added: the text sequence is processed by the BiLSTM, its outputs are fed into the CRF layer, and the CRF produces a prediction for the sequence as a whole. Bi-LSTM+CRF model diagram.
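A minimal numpy sketch of how a transition matrix rules out illegal tag sequences during Viterbi decoding; the three-tag set and all scores below are invented for illustration:

```python
import numpy as np

tags = ["O", "B-PER", "I-PER"]
NEG = -1000.0  # effectively forbids a transition

# trans[i, j] = score of moving from tag i to tag j.
# "O -> I-PER" and "B-PER -> B-PER" are forbidden, mirroring the constraints above.
trans = np.array([[0.5, 0.3, NEG],
                  [0.2, NEG, 1.0],
                  [0.4, 0.1, 0.8]])

# Per-character tag scores from the BiLSTM for a 3-character sentence (invented).
emit = np.array([[0.1, 2.0, 0.5],
                 [0.2, 1.8, 1.7],
                 [1.5, 0.2, 0.3]])

def viterbi(emit, trans):
    """Return the highest-scoring tag sequence under emission + transition scores."""
    n, k = emit.shape
    score = emit[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + trans + emit[t][None, :]  # rows: previous tag, cols: current tag
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

# -> ['B-PER', 'I-PER', 'O']; per-character argmax alone would give the illegal B-PER, B-PER, O.
print(viterbi(emit, trans))
```

The code below does the same thing with `viterbi_decode` from `tensorflow.contrib.crf`, using the transition matrix learned during training.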
Unlike a plain deep neural network, the code also implements a CRF layer, using tensorflow.contrib.crf. It builds a transition matrix of shape [num_tags+1, num_tags+1]: start_logits adds one extra dimension to the tag-score matrix so that a virtual start state is included.
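A condensed sketch of that padding trick, with the surrounding class stripped away and symbolic shapes; it mirrors the loss function in the code below rather than being a separate implementation (note that the sequence length is extended by one here to account for the prepended start step):

```python
import tensorflow as tf
from tensorflow.contrib.crf import crf_log_likelihood

def crf_loss(logits, targets, lengths, num_tags):
    """logits: [batch, steps, num_tags]; targets: [batch, steps]; lengths: [batch]."""
    batch_size = tf.shape(logits)[0]
    num_steps = tf.shape(logits)[1]
    small = -1000.0

    # One extra "start" tag: a prepended timestep whose only allowed tag is the new one.
    start_logits = tf.concat(
        [small * tf.ones([batch_size, 1, num_tags]), tf.zeros([batch_size, 1, 1])], axis=-1)
    # Every real timestep gets a large negative score for the start tag.
    pad_logits = small * tf.ones([batch_size, num_steps, 1])
    logits = tf.concat([logits, pad_logits], axis=-1)        # [b, steps, num_tags+1]
    logits = tf.concat([start_logits, logits], axis=1)       # [b, steps+1, num_tags+1]
    targets = tf.concat(
        [num_tags * tf.ones([batch_size, 1], dtype=tf.int32), targets], axis=-1)

    trans = tf.get_variable("transitions", [num_tags + 1, num_tags + 1])
    log_likelihood, trans = crf_log_likelihood(
        inputs=logits, tag_indices=targets, transition_params=trans,
        sequence_lengths=lengths + 1)                        # +1 for the start timestep
    return tf.reduce_mean(-log_likelihood)
```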
model.py is as follows (example):
# encoding = utf8 import numpy as np import tensorflow as tf from tensorflow.contrib.crf import crf_log_likelihood from tensorflow.contrib.crf import viterbi_decode from tensorflow.contrib.layers.python.layers import initializers from tensorflow.contrib import rnn from utils import result_to_json from data_utils import create_input, iobes_iob,iob_iobes def network(inputs,shapes,num_tags,lstm_dim=100,initializer = tf.truncated_normal_initializer()): ''' 接收一个批次样本的特征数据,计算网络的输出值 :param char: type of int, a tensor of shape 2-D [None,None] :param bound: a tensor of shape 2-D [None,None] with type of int :param flag: a tensor of shape 2-D [None,None] with type of int :param radical: a tensor of shape 2-D [None,None] with type of int :param pinyin: a tensor of shape 2-D [None,None] with type of int :return: ''' # -----------------------------------特征嵌入------------------------------------- #将所有特征的id转换成一个固定长度的向量 embedding=[] keys = list(shapes.keys()) for key in keys: with tf.variable_scope(key+'_embedding'): char_lookup = tf.get_variable( name=key+'_embedding', shape=shapes[key], initializer=initializer ) # 每一个char的id找到char_lookup对应饿行,即该字对应的向量 embedding.append(tf.nn.embedding_lookup(char_lookup, inputs[key]))#实现特征的嵌入 embed = tf.concat(embedding,axis=-1)#shape [None, None, char_dim+bound_dim+flag_dim+radical_dim+pinyin_dim] #拿到输入里面的字符数据,正数变成1,0变成0 sign = tf.sign(tf.abs(inputs[keys[0]])) #得到每个句子真实的长度 lengths = tf.reduce_sum(sign,reduction_indices = 1) #得到序列的长度 num_time = tf.shape(inputs[keys[0]])[1] # --------------------------------循环神经网络编码-------------------------------- with tf.variable_scope('BiLSTM_layer1'): lstm_cell = {} for name in ['forward1','backward1']: with tf.variable_scope(name): lstm_cell[name] = rnn.BasicLSTMCell( #有多少个神经元是init指定好传过来的 lstm_dim ) #双向的动态rnn,来回都是100,拼接起来是200 outputs1,finial_states1 = tf.nn.bidirectional_dynamic_rnn( lstm_cell['forward1'], lstm_cell['backward1'], embed, dtype = tf.float32, #告知实际的长度 sequence_length = lengths ) outputs1 = tf.concat(outputs1,axis = -1) #b,L,2*lstm_dim with tf.variable_scope('BiLSTM_layer2'): lstm_cell = {} for name in ['forward','backward']: with tf.variable_scope(name): lstm_cell[name] = rnn.BasicLSTMCell( #有多少个神经元是init指定好传过来的 lstm_dim ) #双向的动态rnn,来回都是100,拼接起来是200 outputs,finial_statesl = tf.nn.bidirectional_dynamic_rnn( lstm_cell['forward'], lstm_cell['backward'], outputs1, dtype = tf.float32, #告知实际的长度 sequence_length = lengths ) output = tf.concat(outputs,axis = -1) #b,L,2*lstm_dim # --------------------------------输出映射-------------------------------- #矩阵乘法只能是两维的 #reshape成二维矩阵 batch_size*maxlength,2*lstm_dim output = tf.reshape(output,[-1,2*lstm_dim]) with tf.variable_scope('project_layer1'): w = tf.get_variable( name = 'w', shape = [2*lstm_dim,lstm_dim], initializer = initializer ) b = tf.get_variable( name = 'b', shape = [lstm_dim], initializer = tf.zeros_initializer() ) output =tf.nn.relu(tf.matmul(output,w)+b) with tf.variable_scope('project_layer2'): w = tf.get_variable( name = 'w', shape = [lstm_dim,num_tags], initializer = initializer ) b = tf.get_variable( name = 'b', shape = [num_tags], initializer = tf.zeros_initializer() ) output =tf.matmul(output,w)+b output = tf.reshape(output,[-1,num_time,num_tags]) #batch_size,max_length,num_tags return output,lengths class Model(object): def __init__(self, dict,lr = 0.0001): # --------------------------------用到的参数值-------------------------------- #可以选择读字典计算长度,也可以直接给出一个数字 self.num_char = len(dict['word'][0]) self.num_bound = len(dict['bound'][0]) self.num_flag = len(dict['flag'][0]) 
self.num_radical = len(dict['radical'][0]) self.num_pinyin = len(dict['pinyin'][0]) self.num_tags = len(dict['label'][0]) #指定每一个字被映射为多少长度的向量 self.char_dim = 100 self.bound_dim = 20 self.flag_dim = 50 self.radical_dim = 50 self.pinyin_dim = 50 self.lstm_dim = 100 self.lr = lr self.map = dict # -----------------------定义接受数据的placeholder---------------------------- self.char_inputs = tf.placeholder(dtype = tf.int32, shape = [None,None], name = 'char_inputs') self.bound_inputs = tf.placeholder(dtype=tf.int32, shape=[None, None], name='bound_inputs') self.flag_inputs = tf.placeholder(dtype=tf.int32, shape=[None, None], name='flag_inputs') self.radical_inputs = tf.placeholder(dtype=tf.int32, shape=[None, None], name='radical_inputs') self.pinyin_inputs = tf.placeholder(dtype=tf.int32, shape=[None, None], name='pinyin_inputs') self.targets = tf.placeholder(dtype=tf.int32, shape=[None, None], name='targets') self.global_step = tf.Variable(0,trainable = False)#不需要训练,只是用来计算 self.batch_size = tf.shape(self.char_inputs)[0] self.num_steps = tf.shape(self.char_inputs)[-1] # ------------------------------计算模型输出值------------------------------- self.logits,self.lengths = self.get_logits(self.char_inputs, self.bound_inputs, self.flag_inputs, self.radical_inputs, self.pinyin_inputs ) # ------------------------------计算损失------------------------------- self.cost = self.loss(self.logits,self.targets,self.lengths) # ----------------------------优化器优化------------------------------- #采用梯度截断技术 with tf.variable_scope('optimizer'): opt = tf.train.AdamOptimizer(self.lr) grad_vars = opt.compute_gradients(self.cost)#计算出所有参数的导数 clip_grad_vars = [[tf.clip_by_value(g,-5,5),v] for g,v in grad_vars]#得到截断之后的梯度 self.train_op =opt.apply_gradients(clip_grad_vars,self.global_step)#使用截断后的梯度对参数进行更新 self.saver = tf.train.Saver(tf.global_variables(),max_to_keep = 5) def get_logits(self,char,bound,flag,radical,pinyin): ''' 接收一个批次样本的特征数据,计算网络的输出值 :param char: type of int, a tensor of shape 2-D [None,None] :param bound: a tensor of shape 2-D [None,None] with type of int :param flag: a tensor of shape 2-D [None,None] with type of int :param radical: a tensor of shape 2-D [None,None] with type of int :param pinyin: a tensor of shape 2-D [None,None] with type of int :return: 3-d tensor [batch_size,max_length,num_tags] ''' shapes = {} #有多少个元素*每个元素的维度 shapes['char']=[self.num_char,self.char_dim] shapes['bound']=[self.num_bound,self.bound_dim] shapes['flag']=[self.num_flag,self.flag_dim] shapes['radical']=[self.num_radical,self.radical_dim] shapes['pinyin']=[self.num_pinyin,self.pinyin_dim] inputs= {} inputs['char'] = char inputs['bound'] = bound inputs['flag'] = flag inputs['radical'] =radical inputs['pinyin'] = pinyin return network(inputs,shapes,num_tags=self.num_tags,lstm_dim=self.lstm_dim,initializer = tf.truncated_normal_initializer()) def loss(self, output, targets, lengths, initializer=None): ''' 该函数的主要功能:计算损失 :param output: :param targets: :param lengths: :param initializer: :return: ''' b = tf.shape(lengths)[0] num_steps = tf.shape(output)[1] with tf.variable_scope('crf_loss'): small = -1000.0 start_logits = tf.concat( [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1 ) pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32) logits = tf.concat([output,pad_logits],axis = -1) logits = tf.concat([start_logits,logits],axis = 1) targets = tf.concat( [tf.cast(self.num_tags * tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1 ) 
self.trans = tf.get_variable( name = 'trans', shape = [self.num_tags+1,self.num_tags+1], initializer = tf.truncated_normal_initializer() ) log_likehood,self.trans = tf.contrib.crf.crf_log_likelihood( inputs = logits, tag_indices = targets, transition_params = self.trans, sequence_lengths = lengths ) return tf.reduce_mean(-log_likehood) def run_step(self,sess,batch,istrain = True,istest=False): ''' 该函数的主要功能:判断是否为训练集,并且分批读入数据 :param sess: :param batch: :param istrain: :return: ''' if istrain: feed_dict = { self.char_inputs:batch[0], self.targets: batch[1], self.bound_inputs:batch[2], self.flag_inputs:batch[3], self.radical_inputs:batch[4], self.pinyin_inputs:batch[5] } _, loss = sess.run([self.train_op, self.cost],feed_dict = feed_dict) return loss elif istest: feed_dict = { self.char_inputs:batch[0], self.bound_inputs:batch[2], self.flag_inputs:batch[3], self.radical_inputs:batch[4], self.pinyin_inputs:batch[5], } logits,lengths = sess.run([self.logits,self.lengths],feed_dict = feed_dict) return logits,lengths else: feed_dict = { self.char_inputs: batch[0], self.bound_inputs: batch[1], self.flag_inputs: batch[2], self.radical_inputs: batch[3], self.pinyin_inputs: batch[4], } logits, lengths = sess.run([self.logits, self.lengths], feed_dict=feed_dict) return logits, lengths def decode(self, logits, lengths, matrix): ''' 该函数的主要功能:对测试集进行预测 :param logits: :param lengths: :param matrix: :return: 解码出的id ''' paths = [] small = -1000.0 start = np.asarray([[small] * self.num_tags + [0]]) for score, length in zip(logits, lengths): # 只取有效字符的输出 score = score[:length] pad = small * np.ones([length, 1]) logits = np.concatenate([score, pad], axis=-1) logits = np.concatenate([start, logits], axis=0) path, _ = viterbi_decode(logits,matrix) paths.append(path[1:]) return paths def result_to_json(self,string, tags): item = {"string": string, "entities": []} entity_name = "" entity_start = 0 idx = 0 for char, tag in zip(string, tags): if tag[0] == "S": item["entities"].append({"word": char, "start": idx, "end": idx+1, "type":tag[2:]}) elif tag[0] == "B": entity_name += char entity_start = idx elif tag[0] == "I": entity_name += char elif tag[0] == "E": entity_name += char item["entities"].append({"word": entity_name, "start": entity_start, "end": idx + 1, "type": tag[2:]}) entity_name = "" else: entity_name = "" entity_start = idx idx += 1 return item def predict(self,sess,batch,istrain=False,istest=True): ''' 该函数的主要功能:进行实际的预测,并且展示字和每个字的标记 :param sess: :param batch: :return: ''' results = [] items = [] matrix = self.trans.eval() logits,lengths = self.run_step(sess,batch,istrain,istest) paths = self.decode(logits,lengths,matrix) chars = batch[0] judge = 0 total_length = 0 if istest: for i in range(len(paths)): #第i句话对应的真实的长度 length = lengths[i] string = [self.map['word'][0][index] for index in chars[i][:length]] tags = [self.map['label'][0][index] for index in paths[i]] result = [k for k in zip(string,tags)] results.append(result) #计算准确率 labels = batch[1] # print('path[{}]:{}'.format(i,paths[i])) # print('label[{}]:{}'.format(i,labels[i])) judge += sum(np.array([paths[i][index]==labels[i][index] for index in range(length)]).astype(int)) total_length += length presicion = judge/total_length*100 return results,presicion else: for i in range(len(paths)): # 第i句话对应的真实的长度 length = lengths[i] string = [self.map['word'][0][index] for index in chars[i][:length]] tags = [self.map['label'][0][index] for index in paths[i]] result = [k for k in zip(string, tags)] results.append(result) print(result) items = 
self.result_to_json(string, tags) return results,items # encoding = utf8 import re import math import codecs import random import os import numpy as np import pandas as pd import jieba import pickle from tqdm import tqdm jieba.initialize() def get_data(name = 'train'): ''' 该函数的主要功能是:把所有的数据都放在一个文件里面一起获取,并且将数据进行不同形式的拼接,进行数据增强 :param name:所有数据所在的位置 :return: ''' with open(f'data/Prepare/dict.pkl','rb') as f: map_dict = pickle.load(f) def item2id(data,w2i): ''' 该函数的主要功能是:把字符转变成id :param data: 等待转化的数据 :param w2i: 转化的方法 :return: 如果是认识的值就返回对应的ID,如果不认识,就返回UNK的id ''' return [w2i[x] if x in w2i else w2i['UNK'] for x in data] results = [] root = os.path.join('data/prepare/',name) files = list(os.listdir(root)) fileindex=-1 file_index = [] for file in tqdm(files): #for file in files: result=[] path = os.path.join(root,file) try: #samples = pd.read_csv(path, sep=',', encoding='gbk') samples = pd.read_csv(path, sep=',' ) except UnicodeEncodeError: #samples = pd.read_csv(path, sep=',', encoding='UTF-8',errors='ignore') samples = pd.read_csv(path, sep=',' , errors='ignore') except Exception as e: print(e) num_samples = len(samples) fileindex += num_samples file_index.append(fileindex) # 存储好每个句子开始的下标 sep_index = [-1]+samples[samples['word']=='sep'].index.tolist()+[num_samples]#-1,20,40,50 # -----------------------------获取句子并且将句子全部转换成id---------------------------- for i in range(len(sep_index)-1): start = sep_index[i]+1 end = sep_index[i+1] data = [] for feature in samples.columns: #print(list(samples[feature])[start:end],map_dict[feature][1]) try: data.append(item2id(list(samples[feature])[start:end],map_dict[feature][1])) except: print(item2id(list(samples[feature])[start:end],map_dict[feature][1])) #print(data) result.append(data) #按照数据进行不同的拼接,不拼接、拼接1个、拼接2个...从而增强数据学习的能力 # ----------------------------------------数据增强------------------------------------- if name == 'task': results.extend(result) else: two=[] for i in range(len(result)-1): first = result[i] second = result[i+1] two.append([first[k]+second[k] for k in range(len(first))]) three = [] for i in range(len(result) - 2): first = result[i] second = result[i + 1] third = result[i + 2] three.append([first[k] + second[k]+third[k] for k in range(len(first))]) #应该用extend而不是append results.extend(result+two+three) with open(f'data/prepare/'+name+'.pkl','wb') as f: pickle.dump(results,f) def create_dico(item_list): """ Create a dictionary of items from a list of list of items. """ assert type(item_list) is list dico = {} for items in item_list: for item in items: if item not in dico: dico[item] = 1 else: dico[item] += 1 return dico def create_mapping(dico): """ Create a mapping (item to ID / ID to item) from a dictionary. Items are ordered by decreasing frequency. """ sorted_items = sorted(dico.items(), key=lambda x: (-x[1], x[0])) id_to_item = {i: v[0] for i, v in enumerate(sorted_items)} item_to_id = {v: k for k, v in id_to_item.items()} return item_to_id, id_to_item def zero_digits(s): """ Replace every digit in a string by a zero. """ return re.sub('\d', '0', s) def iob2(tags): """ Check that tags have a valid IOB format. Tags in IOB1 format are converted to IOB2. 
""" for i, tag in enumerate(tags): if tag == 'O': continue split = tag.split('-') if len(split) != 2 or split[0] not in ['I', 'B']: return False if split[0] == 'B': continue elif i == 0 or tags[i - 1] == 'O': # conversion IOB1 to IOB2 tags[i] = 'B' + tag[1:] elif tags[i - 1][1:] == tag[1:]: continue else: # conversion IOB1 to IOB2 tags[i] = 'B' + tag[1:] return True def iob_iobes(tags): """ IOB -> IOBES """ new_tags = [] for i, tag in enumerate(tags): if tag == 'O': new_tags.append(tag) elif tag.split('-')[0] == 'B': if i + 1 != len(tags) and \ tags[i + 1].split('-')[0] == 'I': new_tags.append(tag) else: new_tags.append(tag.replace('B-', 'S-')) elif tag.split('-')[0] == 'I': if i + 1 < len(tags) and \ tags[i + 1].split('-')[0] == 'I': new_tags.append(tag) else: new_tags.append(tag.replace('I-', 'E-')) else: raise Exception('Invalid IOB format!') return new_tags def iobes_iob(tags): """ IOBES -> IOB """ new_tags = [] for i, tag in enumerate(tags): if tag.split('-')[0] == 'B': new_tags.append(tag) elif tag.split('-')[0] == 'I': new_tags.append(tag) elif tag.split('-')[0] == 'S': new_tags.append(tag.replace('S-', 'B-')) elif tag.split('-')[0] == 'E': new_tags.append(tag.replace('E-', 'I-')) elif tag.split('-')[0] == 'O': new_tags.append(tag) else: raise Exception('Invalid format!') return new_tags def insert_singletons(words, singletons, p=0.5): """ Replace singletons by the unknown word with a probability p. """ new_words = [] for word in words: if word in singletons and np.random.uniform() < p: new_words.append(0) else: new_words.append(word) return new_words def get_seg_features(string): """ Segment text with jieba features are represented in bies format s donates single word """ #def features(self,string): #def _w2f(word): #lenth=len(word) #if lenth==1: #r=[0] #if lenth>1: #r=[2]*lenth #r[0]=1 #r[-1]=3 #return r #return list(chain.from_iterable([_w2f(word) for word in jieba.cut(string) if len(word.strip())>0])) seg_feature = [] for word in jieba.cut(string): if len(word) == 1: seg_feature.append(0) else: tmp = [2] * len(word) tmp[0] = 1 tmp[-1] = 3 seg_feature.extend(tmp) return seg_feature #return [i for word in jieba.cut(string) for i in range(1,len(word)+1) ] def create_input(data): """ Take sentence data and return an input for the training or the evaluation function. """ inputs = list() inputs.append(data['chars']) inputs.append(data["segs"]) inputs.append(data['tags']) return inputs def load_word2vec(emb_path, id_to_word, word_dim, old_weights): """ Load word embedding from pre-trained file embedding size must match """ new_weights = old_weights print('Loading pretrained embeddings from {}...'.format(emb_path)) pre_trained = {} emb_invalid = 0 for i, line in enumerate(codecs.open(emb_path, 'r', 'utf-8')): line = line.rstrip().split() if len(line) == word_dim + 1: pre_trained[line[0]] = np.array( [float(x) for x in line[1:]] ).astype(np.float32) else: emb_invalid += 1 if emb_invalid > 0: print('WARNING: %i invalid lines' % emb_invalid) c_found = 0 c_lower = 0 c_zeros = 0 n_words = len(id_to_word) # Lookup table initialization for i in range(n_words): word = id_to_word[i] if word in pre_trained: new_weights[i] = pre_trained[word] c_found += 1 elif word.lower() in pre_trained: new_weights[i] = pre_trained[word.lower()] c_lower += 1 elif re.sub('\d', '0', word.lower()) in pre_trained: new_weights[i] = pre_trained[ re.sub('\d', '0', word.lower()) ] c_zeros += 1 print('Loaded %i pretrained embeddings.' 
% len(pre_trained)) print('%i / %i (%.4f%%) words have been initialized with ' 'pretrained embeddings.' % ( c_found + c_lower + c_zeros, n_words, 100. * (c_found + c_lower + c_zeros) / n_words) ) print('%i found directly, %i after lowercasing, ' '%i after lowercasing + zero.' % ( c_found, c_lower, c_zeros )) return new_weights def full_to_half(s): """ Convert full-width character to half-width one """ n = [] for char in s: num = ord(char) if num == 0x3000: num = 32 elif 0xFF01 <= num <= 0xFF5E: num -= 0xfee0 char = chr(num) n.append(char) return ''.join(n) def cut_to_sentence(text): """ Cut text to sentences """ sentence = [] sentences = [] len_p = len(text) pre_cut = False for idx, word in enumerate(text): sentence.append(word) cut = False if pre_cut: cut=True pre_cut=False if word in u"!?\n": cut = True if len_p > idx+1: if text[idx+1] in ".\"\'?!": cut = False pre_cut=True if cut: sentences.append(sentence) sentence = [] if sentence: sentences.append("".join(list(sentence))) return sentences def replace_html(s): s = s.replace('"','"') s = s.replace('&','&') s = s.replace('<','<') s = s.replace('>','>') s = s.replace(' ',' ') s = s.replace("“", "") s = s.replace("”", "") s = s.replace("—","") s = s.replace("\xa0", " ") return(s) def get_dict(path): with open(path,'rb') as f: dict = pickle.load(f) return dict def input_from_line(line, char_to_id): """ Take sentence data and return an input for the training or the evaluation function. """ line = full_to_half(line) line = replace_html(line) inputs = list() inputs.append([line]) line.replace(" ", "$") inputs.append([[char_to_id[char] if char in char_to_id else char_to_id["<UNK>"] for char in line]]) inputs.append([get_seg_features(line)]) inputs.append([[]]) return inputs class BatchManager(object): ''' def __init__(self, data, batch_size): self.batch_data = self.sort_and_pad(data, batch_size) self.len_data = len(self.batch_data) ''' def __init__(self,batch_size,name='train'): with open(f'data/prepare/' + name + '.pkl', 'rb') as f: data = pickle.load(f) self.batch_data = self.sort_and_pad(data,batch_size,name) self.len_data = len(self.batch_data) def sort_and_pad(self, data, batch_size, name): # 总共有多少批次 num_batch = int(math.ceil(len(data) / batch_size)) # print(len(data[0][0])) # 按照句子长度进行排序 sorted_data = sorted(data, key=lambda x: len(x[0])) batch_data = list() for i in range(num_batch): batch_data.append(self.pad_data(sorted_data[i * int(batch_size):(i + 1) * int(batch_size)], name)) return batch_data @staticmethod def pad_data(data, name): if name != 'task': chars = [] targets = [] bounds = [] flags = [] radicals = [] pinyins = [] max_length = max([len(sentence[0]) for sentence in data]) # len(data[-1][0]) for line in data: char, target, bound, flag, radical, pinyin = line padding = [0] * (max_length - len(char)) chars.append(char + padding) targets.append(target + padding) bounds.append(bound + padding) flags.append(flag + padding) radicals.append(radical + padding) pinyins.append(pinyin + padding) return [chars, targets, bounds, flags, radicals, pinyins] else: chars = [] bounds = [] flags = [] radicals = [] pinyins = [] max_length = max([len(sentence[0]) for sentence in data]) # len(data[-1][0]) for line in data: char, bound, flag, radical, pinyin = line padding = [0] * (max_length - len(char)) chars.append(char + padding) bounds.append(bound + padding) flags.append(flag + padding) radicals.append(radical + padding) pinyins.append(pinyin + padding) return [chars, bounds, flags, radicals, pinyins] def iter_batch(self, shuffle=False): if 
shuffle: random.shuffle(self.batch_data) for idx in range(self.len_data): yield self.batch_data[idx] ''' def sort_and_pad(self, data, batch_size): num_batch = int(math.ceil(len(data) /batch_size)) sorted_data = sorted(data, key=lambda x: len(x[0])) batch_data = list() for i in range(num_batch): batch_data.append(self.pad_data(sorted_data[i*int(batch_size) : (i+1)*int(batch_size)])) return batch_data @staticmethod def pad_data(data): strings = [] chars = [] segs = [] targets = [] max_length = max([len(sentence[0]) for sentence in data]) #len(data[-1][0]) for line in data: string, char, seg, target = line padding = [0] * (max_length - len(string)) strings.append(string + padding) chars.append(char + padding) segs.append(seg + padding) targets.append(target + padding) return [strings, chars, segs, targets] def iter_batch(self, shuffle=False): if shuffle: random.shuffle(self.batch_data) for idx in range(self.len_data): yield self.batch_data[idx] ''' if __name__ == '__main__': get_data('train') get_data('test')简化版本:
# encoding = utf-8 import numpy as np import tensorflow as tf from tensorflow.contrib.crf import crf_log_likelihood from tensorflow.contrib.crf import viterbi_decode from tensorflow.contrib.layers.python.layers import initializers from utils import result_to_json from data_utils import create_input, iobes_iob,iob_iobes class Model(object): #初始化模型参数 def __init__(self, config): self.config = config self.lr = config["lr"] self.char_dim = config["char_dim"] self.lstm_dim = config["lstm_dim"] self.seg_dim = config["seg_dim"] self.num_tags = config["num_tags"] self.num_chars = config["num_chars"]#样本中总字数 self.num_segs = 4 self.global_step = tf.Variable(0, trainable=False) self.best_dev_f1 = tf.Variable(0.0, trainable=False) self.best_test_f1 = tf.Variable(0.0, trainable=False) self.initializer = initializers.xavier_initializer() # add placeholders for the model self.char_inputs = tf.placeholder(dtype=tf.int32, shape=[None, None], name="ChatInputs") self.seg_inputs = tf.placeholder(dtype=tf.int32, shape=[None, None], name="SegInputs") self.targets = tf.placeholder(dtype=tf.int32, shape=[None, None], name="Targets") # dropout keep prob self.dropout = tf.placeholder(dtype=tf.float32, name="Dropout") used = tf.sign(tf.abs(self.char_inputs)) length = tf.reduce_sum(used, reduction_indices=1) self.lengths = tf.cast(length, tf.int32) self.batch_size = tf.shape(self.char_inputs)[0] self.num_steps = tf.shape(self.char_inputs)[-1] #Add model type by crownpku bilstm or idcnn self.model_type = config['model_type'] #parameters for idcnn self.layers = [ { 'dilation': 1 }, { 'dilation': 1 }, { 'dilation': 2 }, ] self.filter_width = 3 self.num_filter = self.lstm_dim self.embedding_dim = self.char_dim + self.seg_dim self.repeat_times = 4 self.cnn_output_width = 0 # embeddings for chinese character and segmentation representation embedding = self.embedding_layer(self.char_inputs, self.seg_inputs, config) if self.model_type == 'bilstm': # apply dropout before feed to lstm layer model_inputs = tf.nn.dropout(embedding, self.dropout) # bi-directional lstm layer model_outputs = self.biLSTM_layer(model_inputs, self.lstm_dim, self.lengths) # logits for tags self.logits = self.project_layer_bilstm(model_outputs) elif self.model_type == 'idcnn': # apply dropout before feed to idcnn layer model_inputs = tf.nn.dropout(embedding, self.dropout) # ldcnn layer model_outputs = self.IDCNN_layer(model_inputs) # logits for tags self.logits = self.project_layer_idcnn(model_outputs) else: raise KeyError # loss of the model self.loss = self.loss_layer(self.logits, self.lengths) with tf.variable_scope("optimizer"): optimizer = self.config["optimizer"] if optimizer == "sgd": self.opt = tf.train.GradientDescentOptimizer(self.lr) elif optimizer == "adam": self.opt = tf.train.AdamOptimizer(self.lr) elif optimizer == "adgrad": self.opt = tf.train.AdagradOptimizer(self.lr) else: raise KeyError # apply grad clip to avoid gradient explosion grads_vars = self.opt.compute_gradients(self.loss) capped_grads_vars = [[tf.clip_by_value(g, -self.config["clip"], self.config["clip"]), v] for g, v in grads_vars] self.train_op = self.opt.apply_gradients(capped_grads_vars, self.global_step) # saver of the model self.saver = tf.train.Saver(tf.global_variables(), max_to_keep=5) def embedding_layer(self, char_inputs, seg_inputs, config, name=None): """ :param char_inputs: one-hot encoding of sentence :param seg_inputs: segmentation feature :param config: wither use segmentation feature :return: [1, num_steps, embedding size], """ #高:3 血:22 糖:23 和:24 高:3 血:22 
压:25 char_inputs=[3,22,23,24,3,22,25] #高血糖 和 高血压 seg_inputs 高血糖=[1,2,3] 和=[0] 高血压=[1,2,3] seg_inputs=[1,2,3,0,1,2,3] embedding = [] self.char_inputs_test=char_inputs self.seg_inputs_test=seg_inputs with tf.variable_scope("char_embedding" if not name else name), tf.device('/gpu:0'): self.char_lookup = tf.get_variable( name="char_embedding", shape=[self.num_chars, self.char_dim], initializer=self.initializer) #输入char_inputs='常' 对应的字典的索引/编号/value为:8 #self.char_lookup=[2677*100]的向量,char_inputs字对应在字典的索引/编号/key=[1] embedding.append(tf.nn.embedding_lookup(self.char_lookup, char_inputs)) #self.embedding1.append(tf.nn.embedding_lookup(self.char_lookup, char_inputs)) if config["seg_dim"]: with tf.variable_scope("seg_embedding"), tf.device('/gpu:0'): self.seg_lookup = tf.get_variable( name="seg_embedding", #shape=[4*20] shape=[self.num_segs, self.seg_dim], initializer=self.initializer) embedding.append(tf.nn.embedding_lookup(self.seg_lookup, seg_inputs)) embed = tf.concat(embedding, axis=-1) self.embed_test=embed self.embedding_test=embedding return embed #IDCNN layer def IDCNN_layer(self, model_inputs, name=None): """ :param idcnn_inputs: [batch_size, num_steps, emb_size] :return: [batch_size, num_steps, cnn_output_width] """ #tf.expand_dims会向tensor中插入一个维度,插入位置就是参数代表的位置(维度从0开始)。 model_inputs = tf.expand_dims(model_inputs, 1) self.model_inputs_test=model_inputs reuse = False if self.dropout == 1.0: reuse = True with tf.variable_scope("idcnn" if not name else name): #shape=[1*3*120*100] shape=[1, self.filter_width, self.embedding_dim, self.num_filter] print(shape) filter_weights = tf.get_variable( "idcnn_filter", shape=[1, self.filter_width, self.embedding_dim, self.num_filter], initializer=self.initializer) """ shape of input = [batch, in_height, in_width, in_channels] shape of filter = [filter_height, filter_width, in_channels, out_channels] """ layerInput = tf.nn.conv2d(model_inputs, filter_weights, strides=[1, 1, 1, 1], padding="SAME", name="init_layer",use_cudnn_on_gpu=True) self.layerInput_test=layerInput finalOutFromLayers = [] totalWidthForLastDim = 0 for j in range(self.repeat_times): for i in range(len(self.layers)): #1,1,2 dilation = self.layers[i]['dilation'] isLast = True if i == (len(self.layers) - 1) else False with tf.variable_scope("atrous-conv-layer-%d" % i, reuse=True if (reuse or j > 0) else False): #w 卷积核的高度,卷积核的宽度,图像通道数,卷积核个数 w = tf.get_variable( "filterW", shape=[1, self.filter_width, self.num_filter, self.num_filter], initializer=tf.contrib.layers.xavier_initializer()) if j==1 and i==1: self.w_test_1=w if j==2 and i==1: self.w_test_2=w b = tf.get_variable("filterB", shape=[self.num_filter]) #tf.nn.atrous_conv2d(value,filters,rate,padding,name=None) #除去name参数用以指定该操作的name,与方法有关的一共四个参数: #value: #指需要做卷积的输入图像,要求是一个4维Tensor,具有[batch, height, width, channels]这样的shape,具体含义是[训练时一个batch的图片数量, 图片高度, 图片宽度, 图像通道数] #filters: #相当于CNN中的卷积核,要求是一个4维Tensor,具有[filter_height, filter_width, channels, out_channels]这样的shape,具体含义是[卷积核的高度,卷积核的宽度,图像通道数,卷积核个数],同理这里第三维channels,就是参数value的第四维 #rate: #要求是一个int型的正数,正常的卷积操作应该会有stride(即卷积核的滑动步长),但是空洞卷积是没有stride参数的, #这一点尤其要注意。取而代之,它使用了新的rate参数,那么rate参数有什么用呢?它定义为我们在输入 #图像上卷积时的采样间隔,你可以理解为卷积核当中穿插了(rate-1)数量的“0”, #把原来的卷积核插出了很多“洞洞”,这样做卷积时就相当于对原图像的采样间隔变大了。 #具体怎么插得,可以看后面更加详细的描述。此时我们很容易得出rate=1时,就没有0插入, #此时这个函数就变成了普通卷积。 #padding: #string类型的量,只能是”SAME”,”VALID”其中之一,这个值决定了不同边缘填充方式。 #ok,完了,到这就没有参数了,或许有的小伙伴会问那“stride”参数呢。其实这个函数已经默认了stride=1,也就是滑动步长无法改变,固定为1。 
#结果返回一个Tensor,填充方式为“VALID”时,返回[batch,height-2*(filter_width-1),width-2*(filter_height-1),out_channels]的Tensor,填充方式为“SAME”时,返回[batch, height, width, out_channels]的Tensor,这个结果怎么得出来的?先不急,我们通过一段程序形象的演示一下空洞卷积。 conv = tf.nn.atrous_conv2d(layerInput, w, rate=dilation, padding="SAME") self.conv_test=conv conv = tf.nn.bias_add(conv, b) conv = tf.nn.relu(conv) if isLast: finalOutFromLayers.append(conv) totalWidthForLastDim += self.num_filter layerInput = conv finalOut = tf.concat(axis=3, values=finalOutFromLayers) keepProb = 1.0 if reuse else 0.5 finalOut = tf.nn.dropout(finalOut, keepProb) #Removes dimensions of size 1 from the shape of a tensor. #从tensor中删除所有大小是1的维度 #Given a tensor input, this operation returns a tensor of the same type with all dimensions of size 1 removed. If you don’t want to remove all size 1 dimensions, you can remove specific size 1 dimensions by specifying squeeze_dims. #给定张量输入,此操作返回相同类型的张量,并删除所有尺寸为1的尺寸。 如果不想删除所有尺寸1尺寸,可以通过指定squeeze_dims来删除特定尺寸1尺寸。 finalOut = tf.squeeze(finalOut, [1]) finalOut = tf.reshape(finalOut, [-1, totalWidthForLastDim]) self.cnn_output_width = totalWidthForLastDim return finalOut def project_layer_bilstm(self, lstm_outputs, name=None): """ hidden layer between lstm layer and logits :param lstm_outputs: [batch_size, num_steps, emb_size] :return: [batch_size, num_steps, num_tags] """ with tf.variable_scope("project" if not name else name): with tf.variable_scope("hidden"): W = tf.get_variable("W", shape=[self.lstm_dim*2, self.lstm_dim], dtype=tf.float32, initializer=self.initializer) b = tf.get_variable("b", shape=[self.lstm_dim], dtype=tf.float32, initializer=tf.zeros_initializer()) output = tf.reshape(lstm_outputs, shape=[-1, self.lstm_dim*2]) hidden = tf.tanh(tf.nn.xw_plus_b(output, W, b)) # project to score of tags with tf.variable_scope("logits"): W = tf.get_variable("W", shape=[self.lstm_dim, self.num_tags], dtype=tf.float32, initializer=self.initializer) b = tf.get_variable("b", shape=[self.num_tags], dtype=tf.float32, initializer=tf.zeros_initializer()) pred = tf.nn.xw_plus_b(hidden, W, b) return tf.reshape(pred, [-1, self.num_steps, self.num_tags]) #Project layer for idcnn by crownpku #Delete the hidden layer, and change bias initializer def project_layer_idcnn(self, idcnn_outputs, name=None): """ :param lstm_outputs: [batch_size, num_steps, emb_size] :return: [batch_size, num_steps, num_tags] """ with tf.variable_scope("project" if not name else name): # project to score of tags with tf.variable_scope("logits"): W = tf.get_variable("W", shape=[self.cnn_output_width, self.num_tags], dtype=tf.float32, initializer=self.initializer) b = tf.get_variable("b", initializer=tf.constant(0.001, shape=[self.num_tags])) pred = tf.nn.xw_plus_b(idcnn_outputs, W, b) return tf.reshape(pred, [-1, self.num_steps, self.num_tags]) def loss_layer(self, project_logits, lengths, name=None): """ calculate crf loss :param project_logits: [1, num_steps, num_tags] :return: scalar loss """ with tf.variable_scope("crf_loss" if not name else name): small = -1000.0 # pad logits for crf loss start_logits = tf.concat( [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1) pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32) logits = tf.concat([project_logits, pad_logits], axis=-1) logits = tf.concat([start_logits, logits], axis=1) targets = tf.concat( [tf.cast(self.num_tags*tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1) self.trans = tf.get_variable( "transitions", 
shape=[self.num_tags + 1, self.num_tags + 1], initializer=self.initializer) #crf_log_likelihood在一个条件随机场里面计算标签序列的log-likelihood #inputs: 一个形状为[batch_size, max_seq_len, num_tags] 的tensor, #一般使用BILSTM处理之后输出转换为他要求的形状作为CRF层的输入. #tag_indices: 一个形状为[batch_size, max_seq_len] 的矩阵,其实就是真实标签. #sequence_lengths: 一个形状为 [batch_size] 的向量,表示每个序列的长度. #transition_params: 形状为[num_tags, num_tags] 的转移矩阵 #log_likelihood: 标量,log-likelihood #transition_params: 形状为[num_tags, num_tags] 的转移矩阵 log_likelihood, self.trans = crf_log_likelihood( inputs=logits, tag_indices=targets, transition_params=self.trans, sequence_lengths=lengths+1) return tf.reduce_mean(-log_likelihood) def create_feed_dict(self, is_train, batch): """ :param is_train: Flag, True for train batch :param batch: list train/evaluate data :return: structured data to feed """ _, chars, segs, tags = batch feed_dict = { self.char_inputs: np.asarray(chars), self.seg_inputs: np.asarray(segs), self.dropout: 1.0, } if is_train: feed_dict[self.targets] = np.asarray(tags) feed_dict[self.dropout] = self.config["dropout_keep"] return feed_dict def run_step(self, sess, is_train, batch): """ :param sess: session to run the batch :param is_train: a flag indicate if it is a train batch :param batch: a dict containing batch data :return: batch result, loss of the batch or logits """ feed_dict = self.create_feed_dict(is_train, batch) if is_train: global_step, loss,_,char_lookup_out,seg_lookup_out,char_inputs_test,seg_inputs_test,embed_test,embedding_test,\ model_inputs_test,layerInput_test,conv_test,w_test_1,w_test_2,char_inputs_test= sess.run( [self.global_step, self.loss, self.train_op,self.char_lookup,self.seg_lookup,self.char_inputs_test,self.seg_inputs_test,\ self.embed_test,self.embedding_test,self.model_inputs_test,self.layerInput_test,self.conv_test,self.w_test_1,self.w_test_2,self.char_inputs], feed_dict) return global_step, loss else: lengths, logits = sess.run([self.lengths, self.logits], feed_dict) return lengths, logits def decode(self, logits, lengths, matrix): """ :param logits: [batch_size, num_steps, num_tags]float32, logits :param lengths: [batch_size]int32, real length of each sequence :param matrix: transaction matrix for inference :return: """ # inference final labels usa viterbi Algorithm paths = [] small = -1000.0 start = np.asarray([[small]*self.num_tags +[0]]) for score, length in zip(logits, lengths): score = score[:length] pad = small * np.ones([length, 1]) logits = np.concatenate([score, pad], axis=1) logits = np.concatenate([start, logits], axis=0) path, _ = viterbi_decode(logits, matrix) paths.append(path[1:]) return paths def evaluate(self, sess, data_manager, id_to_tag): """ :param sess: session to run the model :param data: list of data :param id_to_tag: index to tag name :return: evaluate result """ results = [] trans = self.trans.eval() for batch in data_manager.iter_batch(): strings = batch[0] tags = batch[-1] lengths, scores = self.run_step(sess, False, batch) batch_paths = self.decode(scores, lengths, trans) for i in range(len(strings)): result = [] string = strings[i][:lengths[i]] gold = iobes_iob([id_to_tag[int(x)] for x in tags[i][:lengths[i]]]) pred = iobes_iob([id_to_tag[int(x)] for x in batch_paths[i][:lengths[i]]]) #gold = iob_iobes([id_to_tag[int(x)] for x in tags[i][:lengths[i]]]) #pred = iob_iobes([id_to_tag[int(x)] for x in batch_paths[i][:lengths[i]]]) for char, gold, pred in zip(string, gold, pred): result.append(" ".join([char, gold, pred])) results.append(result) return results def evaluate_line(self, sess, inputs, 
id_to_tag): trans = self.trans.eval(session=sess) lengths, scores = self.run_step(sess, False, inputs) batch_paths = self.decode(scores, lengths, trans) tags = [id_to_tag[idx] for idx in batch_paths[0]] return result_to_json(inputs[0][0], tags) # encoding = utf8 import re import math import codecs import random import numpy as np import jieba jieba.initialize() def create_dico(item_list): """ Create a dictionary of items from a list of list of items. """ assert type(item_list) is list dico = {} for items in item_list: for item in items: if item not in dico: dico[item] = 1 else: dico[item] += 1 return dico def create_mapping(dico): """ Create a mapping (item to ID / ID to item) from a dictionary. Items are ordered by decreasing frequency. """ sorted_items = sorted(dico.items(), key=lambda x: (-x[1], x[0])) id_to_item = {i: v[0] for i, v in enumerate(sorted_items)} item_to_id = {v: k for k, v in id_to_item.items()} return item_to_id, id_to_item def zero_digits(s): """ Replace every digit in a string by a zero. """ return re.sub('\d', '0', s) def iob2(tags): """ Check that tags have a valid IOB format. Tags in IOB1 format are converted to IOB2. """ for i, tag in enumerate(tags): if tag == 'O': continue split = tag.split('-') if len(split) != 2 or split[0] not in ['I', 'B']: return False if split[0] == 'B': continue elif i == 0 or tags[i - 1] == 'O': # conversion IOB1 to IOB2 tags[i] = 'B' + tag[1:] elif tags[i - 1][1:] == tag[1:]: continue else: # conversion IOB1 to IOB2 tags[i] = 'B' + tag[1:] return True def iob_iobes(tags): """ IOB -> IOBES """ new_tags = [] for i, tag in enumerate(tags): if tag == 'O': new_tags.append(tag) elif tag.split('-')[0] == 'B': if i + 1 != len(tags) and \ tags[i + 1].split('-')[0] == 'I': new_tags.append(tag) else: new_tags.append(tag.replace('B-', 'S-')) elif tag.split('-')[0] == 'I': if i + 1 < len(tags) and \ tags[i + 1].split('-')[0] == 'I': new_tags.append(tag) else: new_tags.append(tag.replace('I-', 'E-')) else: raise Exception('Invalid IOB format!') return new_tags def iobes_iob(tags): """ IOBES -> IOB """ new_tags = [] for i, tag in enumerate(tags): if tag.split('-')[0] == 'B': new_tags.append(tag) elif tag.split('-')[0] == 'I': new_tags.append(tag) elif tag.split('-')[0] == 'S': new_tags.append(tag.replace('S-', 'B-')) elif tag.split('-')[0] == 'E': new_tags.append(tag.replace('E-', 'I-')) elif tag.split('-')[0] == 'O': new_tags.append(tag) else: raise Exception('Invalid format!') return new_tags def insert_singletons(words, singletons, p=0.5): """ Replace singletons by the unknown word with a probability p. """ new_words = [] for word in words: if word in singletons and np.random.uniform() < p: new_words.append(0) else: new_words.append(word) return new_words def get_seg_features(string): """ Segment text with jieba features are represented in bies format s donates single word """ seg_feature = [] for word in jieba.cut(string): if len(word) == 1: seg_feature.append(0) else: tmp = [2] * len(word) tmp[0] = 1 tmp[-1] = 3 seg_feature.extend(tmp) return seg_feature def create_input(data): """ Take sentence data and return an input for the training or the evaluation function. 
""" inputs = list() inputs.append(data['chars']) inputs.append(data["segs"]) inputs.append(data['tags']) return inputs def load_word2vec(emb_path, id_to_word, word_dim, old_weights): """ Load word embedding from pre-trained file embedding size must match """ #把字典中所有的字转化为向量,假设字在字向量文件中,那就用字向量文件中的值初始化向量, new_weights = old_weights print('Loading pretrained embeddings from {}...'.format(emb_path)) pre_trained = {} emb_invalid = 0 for i, line in enumerate(codecs.open(emb_path, 'r', 'utf-8')): line = line.rstrip().split() if len(line) == word_dim + 1: pre_trained[line[0]] = np.array( [float(x) for x in line[1:]] ).astype(np.float32) else: emb_invalid += 1 if emb_invalid > 0: print('WARNING: %i invalid lines' % emb_invalid) c_found = 0 c_lower = 0 c_zeros = 0 n_words = len(id_to_word) # Lookup table initialization for i in range(n_words): word = id_to_word[i] if word in pre_trained: new_weights[i] = pre_trained[word] c_found += 1 elif word.lower() in pre_trained: new_weights[i] = pre_trained[word.lower()] c_lower += 1 elif re.sub('\d', '0', word.lower()) in pre_trained: new_weights[i] = pre_trained[ re.sub('\d', '0', word.lower()) ] c_zeros += 1 print('Loaded %i pretrained embeddings.' % len(pre_trained)) print('%i / %i (%.4f%%) words have been initialized with ' 'pretrained embeddings.' % ( c_found + c_lower + c_zeros, n_words, 100. * (c_found + c_lower + c_zeros) / n_words) ) print('%i found directly, %i after lowercasing, ' '%i after lowercasing + zero.' % ( c_found, c_lower, c_zeros )) return new_weights def full_to_half(s): """ Convert full-width character to half-width one """ n = [] for char in s: num = ord(char) if num == 0x3000: num = 32 elif 0xFF01 <= num <= 0xFF5E: num -= 0xfee0 char = chr(num) n.append(char) return ''.join(n) def cut_to_sentence(text): """ Cut text to sentences """ sentence = [] sentences = [] len_p = len(text) pre_cut = False for idx, word in enumerate(text): sentence.append(word) cut = False if pre_cut: cut=True pre_cut=False if word in u"!?\n": cut = True if len_p > idx+1: if text[idx+1] in ".\"\'?!": cut = False pre_cut=True if cut: sentences.append(sentence) sentence = [] if sentence: sentences.append("".join(list(sentence))) return sentences def replace_html(s): s = s.replace('"','"') s = s.replace('&','&') s = s.replace('<','<') s = s.replace('>','>') s = s.replace(' ',' ') s = s.replace("“", "") s = s.replace("”", "") s = s.replace("—","") s = s.replace("\xa0", " ") return(s) def input_from_line(line, char_to_id): """ Take sentence data and return an input for the training or the evaluation function. 
""" line = full_to_half(line) line = replace_html(line) inputs = list() inputs.append([line]) line.replace(" ", "$") inputs.append([[char_to_id[char] if char in char_to_id else char_to_id["<UNK>"] for char in line]]) inputs.append([get_seg_features(line)]) inputs.append([[]]) return inputs class BatchManager(object): def __init__(self, data, batch_size): self.batch_data = self.sort_and_pad(data, batch_size) self.len_data = len(self.batch_data) def sort_and_pad(self, data, batch_size): num_batch = int(math.ceil(len(data) /batch_size)) sorted_data = sorted(data, key=lambda x: len(x[0])) batch_data = list() for i in range(num_batch): batch_data.append(self.pad_data(sorted_data[i*int(batch_size) : (i+1)*int(batch_size)])) return batch_data @staticmethod def pad_data(data): strings = [] chars = [] segs = [] targets = [] max_length = max([len(sentence[0]) for sentence in data]) for line in data: string, char, seg, target = line padding = [0] * (max_length - len(string)) strings.append(string + padding) chars.append(char + padding) segs.append(seg + padding) targets.append(target + padding) return [strings, chars, segs, targets] def iter_batch(self, shuffle=False): if shuffle: random.shuffle(self.batch_data) for idx in range(self.len_data): yield self.batch_data[idx] import os import json import shutil import logging import tensorflow as tf from conlleval import return_report models_path = "./models" eval_path = "./evaluation" eval_temp = os.path.join(eval_path, "temp") eval_script = os.path.join(eval_path, "conlleval") def get_logger(log_file): logger = logging.getLogger(log_file) logger.setLevel(logging.DEBUG) fh = logging.FileHandler(log_file) fh.setLevel(logging.DEBUG) ch = logging.StreamHandler() ch.setLevel(logging.INFO) formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s") ch.setFormatter(formatter) fh.setFormatter(formatter) logger.addHandler(ch) logger.addHandler(fh) return logger # def test_ner(results, path): # """ # Run perl script to evaluate model # """ # script_file = "conlleval" # output_file = os.path.join(path, "ner_predict.utf8") # result_file = os.path.join(path, "ner_result.utf8") # with open(output_file, "w") as f: # to_write = [] # for block in results: # for line in block: # to_write.append(line + "\n") # to_write.append("\n") # # f.writelines(to_write) # os.system("perl {} < {} > {}".format(script_file, output_file, result_file)) # eval_lines = [] # with open(result_file) as f: # for line in f: # eval_lines.append(line.strip()) # return eval_lines def test_ner(results, path): """ Run perl script to evaluate model """ output_file = os.path.join(path, "ner_predict.utf8") with open(output_file, "w",encoding='utf8') as f: to_write = [] for block in results: for line in block: to_write.append(line + "\n") to_write.append("\n") f.writelines(to_write) eval_lines = return_report(output_file) return eval_lines def print_config(config, logger): """ Print configuration of the model """ for k, v in config.items(): logger.info("{}:\t{}".format(k.ljust(15), v)) def make_path(params): """ Make folders for training and evaluation """ if not os.path.isdir(params.result_path): os.makedirs(params.result_path) if not os.path.isdir(params.ckpt_path): os.makedirs(params.ckpt_path) if not os.path.isdir("log"): os.makedirs("log") def clean(params): """ Clean current folder remove saved model and training log """ if os.path.isfile(params.vocab_file): os.remove(params.vocab_file) if os.path.isfile(params.map_file): os.remove(params.map_file) if 
os.path.isdir(params.ckpt_path): shutil.rmtree(params.ckpt_path) if os.path.isdir(params.summary_path): shutil.rmtree(params.summary_path) if os.path.isdir(params.result_path): shutil.rmtree(params.result_path) if os.path.isdir("log"): shutil.rmtree("log") if os.path.isdir("__pycache__"): shutil.rmtree("__pycache__") if os.path.isfile(params.config_file): os.remove(params.config_file) if os.path.isfile(params.vocab_file): os.remove(params.vocab_file) def save_config(config, config_file): """ Save configuration of the model parameters are stored in json format """ with open(config_file, "w", encoding="utf8") as f: json.dump(config, f, ensure_ascii=False, indent=4) def load_config(config_file): """ Load configuration of the model parameters are stored in json format """ with open(config_file, encoding="utf8") as f: return json.load(f) def convert_to_text(line): """ Convert conll data to text """ to_print = [] for item in line: try: if item[0] == " ": to_print.append(" ") continue word, gold, tag = item.split(" ") if tag[0] in "SB": to_print.append("[") to_print.append(word) if tag[0] in "SE": to_print.append("@" + tag.split("-")[-1]) to_print.append("]") except: print(list(item)) return "".join(to_print) def save_model(sess, model, path, logger): checkpoint_path = os.path.join(path, "ner.ckpt") model.saver.save(sess, checkpoint_path) logger.info("model saved") def create_model(session, Model_class, path, load_vec, config, id_to_char, logger): # create model, reuse parameters if exists model = Model_class(config) ckpt = tf.train.get_checkpoint_state(path) if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path): logger.info("Reading model parameters from %s" % ckpt.model_checkpoint_path) model.saver.restore(session, ckpt.model_checkpoint_path) else: logger.info("Created model with fresh parameters.") session.run(tf.global_variables_initializer()) if config["pre_emb"]: emb_weights = session.run(model.char_lookup.read_value()) emb_weights = load_vec(config["emb_file"],id_to_char, config["char_dim"], emb_weights) session.run(model.char_lookup.assign(emb_weights)) logger.info("Load pre-trained embedding.") return model def result_to_json(string, tags): item = {"string": string, "entities": []} entity_name = "" entity_start = 0 idx = 0 for char, tag in zip(string, tags): if tag[0] == "S": item["entities"].append({"word": char, "start": idx, "end": idx+1, "type":tag[2:]}) elif tag[0] == "B": entity_name += char entity_start = idx elif tag[0] == "I": entity_name += char elif tag[0] == "E": entity_name += char item["entities"].append({"word": entity_name, "start": entity_start, "end": idx + 1, "type": tag[2:]}) entity_name = "" else: entity_name = "" entity_start = idx idx += 1 return itemmain2.py
# encoding=utf8 import codecs import pickle import itertools from collections import OrderedDict import os import tensorflow as tf import numpy as np from model import Model from loader import load_sentences, update_tag_scheme from loader import char_mapping, tag_mapping from loader import augment_with_pretrained, prepare_dataset from utils import get_logger, make_path, clean, create_model, save_model from utils import print_config, save_config, load_config, test_ner from data_utils import load_word2vec, create_input, input_from_line, BatchManager root_path=os.getcwd()+os.sep flags = tf.app.flags flags.DEFINE_boolean("clean", True, "clean train folder") flags.DEFINE_boolean("train", False, "Whether train the model") # configurations for the model flags.DEFINE_integer("seg_dim", 20, "Embedding size for segmentation, 0 if not used") flags.DEFINE_integer("char_dim", 100, "Embedding size for characters") flags.DEFINE_integer("lstm_dim", 100, "Num of hidden units in LSTM, or num of filters in IDCNN") flags.DEFINE_string("tag_schema", "iobes", "tagging schema iobes or iob") # configurations for training flags.DEFINE_float("clip", 5, "Gradient clip") flags.DEFINE_float("dropout", 0.5, "Dropout rate") flags.DEFINE_float("batch_size", 60, "batch size") flags.DEFINE_float("lr", 0.001, "Initial learning rate") flags.DEFINE_string("optimizer", "adam", "Optimizer for training") flags.DEFINE_boolean("pre_emb", True, "Wither use pre-trained embedding") flags.DEFINE_boolean("zeros", True, "Wither replace digits with zero") flags.DEFINE_boolean("lower", False, "Wither lower case") flags.DEFINE_integer("max_epoch", 100, "maximum training epochs") flags.DEFINE_integer("steps_check", 100, "steps per checkpoint") flags.DEFINE_string("ckpt_path", "ckpt", "Path to save model") flags.DEFINE_string("summary_path", "summary", "Path to store summaries") flags.DEFINE_string("log_file", "train.log", "File for log") flags.DEFINE_string("map_file", "maps.pkl", "file for maps") flags.DEFINE_string("vocab_file", "vocab.json", "File for vocab") flags.DEFINE_string("config_file", "config_file", "File for config") flags.DEFINE_string("script", "conlleval", "evaluation script") flags.DEFINE_string("result_path", "result", "Path for results") flags.DEFINE_string("emb_file", os.path.join(root_path+"data", "vec.txt"), "Path for pre_trained embedding") flags.DEFINE_string("train_file", os.path.join(root_path+"data", "example.train"), "Path for train data") flags.DEFINE_string("dev_file", os.path.join(root_path+"data", "example.dev"), "Path for dev data") flags.DEFINE_string("test_file", os.path.join(root_path+"data", "example.test"), "Path for test data") flags.DEFINE_string("model_type", "idcnn", "Model type, can be idcnn or bilstm") #flags.DEFINE_string("model_type", "bilstm", "Model type, can be idcnn or bilstm") FLAGS = tf.app.flags.FLAGS assert FLAGS.clip < 5.1, "gradient clip should't be too much" assert 0 <= FLAGS.dropout < 1, "dropout rate between 0 and 1" assert FLAGS.lr > 0, "learning rate must larger than zero" assert FLAGS.optimizer in ["adam", "sgd", "adagrad"] # config for the model def config_model(char_to_id, tag_to_id): config = OrderedDict() config["model_type"] = FLAGS.model_type config["num_chars"] = len(char_to_id) config["char_dim"] = FLAGS.char_dim config["num_tags"] = len(tag_to_id) config["seg_dim"] = FLAGS.seg_dim config["lstm_dim"] = FLAGS.lstm_dim config["batch_size"] = FLAGS.batch_size config["emb_file"] = FLAGS.emb_file config["clip"] = FLAGS.clip config["dropout_keep"] = 1.0 - FLAGS.dropout 
config["optimizer"] = FLAGS.optimizer config["lr"] = FLAGS.lr config["tag_schema"] = FLAGS.tag_schema config["pre_emb"] = FLAGS.pre_emb config["zeros"] = FLAGS.zeros config["lower"] = FLAGS.lower return config def evaluate(sess, model, name, data, id_to_tag, logger): logger.info("evaluate:{}".format(name)) ner_results = model.evaluate(sess, data, id_to_tag) eval_lines = test_ner(ner_results, FLAGS.result_path) for line in eval_lines: logger.info(line) f1 = float(eval_lines[1].strip().split()[-1]) if name == "dev": best_test_f1 = model.best_dev_f1.eval() if f1 > best_test_f1: tf.assign(model.best_dev_f1, f1).eval() logger.info("new best dev f1 score:{:>.3f}".format(f1)) return f1 > best_test_f1 elif name == "test": best_test_f1 = model.best_test_f1.eval() if f1 > best_test_f1: tf.assign(model.best_test_f1, f1).eval() logger.info("new best test f1 score:{:>.3f}".format(f1)) return f1 > best_test_f1 def train(): # load data sets train_sentences = load_sentences(FLAGS.train_file, FLAGS.lower, FLAGS.zeros) dev_sentences = load_sentences(FLAGS.dev_file, FLAGS.lower, FLAGS.zeros) test_sentences = load_sentences(FLAGS.test_file, FLAGS.lower, FLAGS.zeros) # Use selected tagging scheme (IOB / IOBES) update_tag_scheme(train_sentences, FLAGS.tag_schema) update_tag_scheme(test_sentences, FLAGS.tag_schema) update_tag_scheme(dev_sentences, FLAGS.tag_schema) # create maps if not exist if not os.path.isfile(FLAGS.map_file): # create dictionary for word if FLAGS.pre_emb: dico_chars_train = char_mapping(train_sentences, FLAGS.lower)[0] dico_chars, char_to_id, id_to_char = augment_with_pretrained( dico_chars_train.copy(), FLAGS.emb_file, list(itertools.chain.from_iterable( [[w[0] for w in s] for s in test_sentences]) ) ) else: _c, char_to_id, id_to_char = char_mapping(train_sentences, FLAGS.lower) # Create a dictionary and a mapping for tags _t, tag_to_id, id_to_tag = tag_mapping(train_sentences) #with open('maps.txt','w',encoding='utf8') as f1: #f1.writelines(str(char_to_id)+" "+id_to_char+" "+str(tag_to_id)+" "+id_to_tag+'\n') with open(FLAGS.map_file, "wb") as f: pickle.dump([char_to_id, id_to_char, tag_to_id, id_to_tag], f) else: with open(FLAGS.map_file, "rb") as f: char_to_id, id_to_char, tag_to_id, id_to_tag = pickle.load(f) # prepare data, get a collection of list containing index train_data = prepare_dataset( train_sentences, char_to_id, tag_to_id, FLAGS.lower ) dev_data = prepare_dataset( dev_sentences, char_to_id, tag_to_id, FLAGS.lower ) test_data = prepare_dataset( test_sentences, char_to_id, tag_to_id, FLAGS.lower ) print("%i / %i / %i sentences in train / dev / test." 
% ( len(train_data), 0, len(test_data))) train_manager = BatchManager(train_data, FLAGS.batch_size) dev_manager = BatchManager(dev_data, 100) test_manager = BatchManager(test_data, 100) # make path for store log and model if not exist make_path(FLAGS) if os.path.isfile(FLAGS.config_file): config = load_config(FLAGS.config_file) else: config = config_model(char_to_id, tag_to_id) save_config(config, FLAGS.config_file) make_path(FLAGS) log_path = os.path.join("log", FLAGS.log_file) logger = get_logger(log_path) print_config(config, logger) # limit GPU memory #tf_config = tf.ConfigProto() tf_config = tf.ConfigProto(allow_soft_placement = True) tf_config.gpu_options.allow_growth = True steps_per_epoch = train_manager.len_data with tf.Session(config=tf_config) as sess: model = create_model(sess, Model, FLAGS.ckpt_path, load_word2vec, config, id_to_char, logger) logger.info("start training") loss = [] with tf.device("/gpu:0"): for i in range(100): for batch in train_manager.iter_batch(shuffle=True): step, batch_loss = model.run_step(sess, True, batch) loss.append(batch_loss) if step % FLAGS.steps_check == 0: iteration = step // steps_per_epoch + 1 logger.info("iteration:{} step:{}/{}, " "NER loss:{:>9.6f}".format( iteration, step%steps_per_epoch, steps_per_epoch, np.mean(loss))) loss = [] # best = evaluate(sess, model, "dev", dev_manager, id_to_tag, logger) if i%7==0: save_model(sess, model, FLAGS.ckpt_path, logger) #evaluate(sess, model, "test", test_manager, id_to_tag, logger) def evaluate_line(): config = load_config(FLAGS.config_file) logger = get_logger(FLAGS.log_file) # limit GPU memory #tf_config = tf.ConfigProto() tf_config = tf.ConfigProto(allow_soft_placement=True) tf_config.gpu_options.allow_growth = True with open(FLAGS.map_file, "rb") as f: char_to_id, id_to_char, tag_to_id, id_to_tag = pickle.load(f) with tf.Session(config=tf_config) as sess: model = create_model(sess, Model, FLAGS.ckpt_path, load_word2vec, config, id_to_char, logger) while True: # try: # line = input("请输入测试句子:") # result = model.evaluate_line(sess, input_from_line(line, char_to_id), id_to_tag) # print(result) # except Exception as e: # logger.info(e) line = input("请输入测试句子:") result = model.evaluate_line(sess, input_from_line(line, char_to_id), id_to_tag) print(result) def main(_): #if 1: if 0: if FLAGS.clean: clean(FLAGS) train() else: evaluate_line() if __name__ == "__main__": tf.app.run(main)运行main2.py,结果如下:
....... optimizer/Adam/update_char_embedding/seg_embedding/seg_embedding/AssignSub (AssignSub) /device:GPU:0 optimizer/Adam/update_char_embedding/seg_embedding/seg_embedding/group_deps (NoOp) /device:GPU:0 save/Assign_4 (Assign) /device:GPU:0 save/Assign_17 (Assign) /device:GPU:0 save/Assign_18 (Assign) /device:GPU:0 请输入测试句子:现患者一般情况可,双肺呼吸音清晰,未闻及啰音,律齐,各瓣膜听诊区未闻及病理性杂音,腹平坦,软,全腹无压痛、反跳痛及肌紧张,全腹未触及异常包块。右腕及右膝部压痛,表面轻度红肿,活动稍受限。 {'string': '现患者一般情况可,双肺呼吸音清晰,未闻及啰音,律齐,各瓣膜听诊区未闻及病理性杂音,腹平坦,软,全腹无压痛、反跳痛及肌紧张,全腹未触及异常包块。右腕及右膝部压痛,表面轻度红肿,活动稍受限。', 'entities': [{'word': '情况', 'start': 5, 'end': 7, 'type': 'DRU'}, {'word': '双肺呼吸音', 'start': 9, 'end': 14, 'type': 'SYM'}, {'word': '啰音', 'start': 20, 'end': 22, 'type': 'SGN'}, {'word': '瓣膜听诊', 'start': 27, 'end': 31, 'type': 'TES'}, {'word': '病理性杂音', 'start': 35, 'end': 40, 'type': 'SGN'}, {'word': '平坦', 'start': 42, 'end': 44, 'type': 'DRU'}, {'word': '全腹', 'start': 47, 'end': 49, 'type': 'REG'}, {'word': '压痛', 'start': 50, 'end': 52, 'type': 'SGN'}, {'word': '反跳痛', 'start': 53, 'end': 56, 'type': 'SGN'}, {'word': '肌紧张', 'start': 57, 'end': 60, 'type': 'SGN'}, {'word': '全腹', 'start': 61, 'end': 63, 'type': 'REG'}, {'word': '异常包块', 'start': 66, 'end': 70, 'type': 'SGN'}, {'word': '膝部', 'start': 75, 'end': 77, 'type': 'REG'}, {'word': '压痛', 'start': 77, 'end': 79, 'type': 'SGN'}, {'word': '表面', 'start': 80, 'end': 82, 'type': 'ORG'}, {'word': '轻度', 'start': 82, 'end': 84, 'type': 'DEG'}, {'word': '红肿', 'start': 84, 'end': 86, 'type': 'SYM'}, {'word': '活动稍受限', 'start': 87, 'end': 92, 'type': 'SYM'}]} 请输入测试句子: