I made a few small changes here. Thanks to help from a Google engineer, the code can now handle dense data and sparse input data (such as text classification) in a unified way. Further optimization, such as multi-threaded processing, is planned.
Sparse input is handled mainly with embedding_lookup_sparse; see
https://github.com/tensorflow/tensorflow/issues/342
binary_classification.py
The code and data have been uploaded to https://github.com/chenghuige/tensorflow-example ; for the sparse handling, sparse_tensor.py is a good starting point.
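To make the sparse path concrete, here is a minimal sketch (not from the original repo, written against the same TensorFlow/Python generation as the code below) of what embedding_lookup_sparse computes: two SparseTensors share one set of indices, one holding feature ids and one holding feature values, and the lookup produces sum_j w[id_ij] * val_ij per row, i.e. the sparse equivalent of X * w. The toy shapes and values are illustrative assumptions.

#coding=gbk
import tensorflow as tf

# Hypothetical toy model: 6 features, 1 output (the weight vector of a linear model).
w = tf.Variable(tf.random_normal([6, 1], stddev=0.01))

# Two instances; instance 0 has features {1: 0.5, 3: 2.0}, instance 1 has {4: 1.0}.
indices = tf.constant([[0, 0], [0, 1], [1, 0]], dtype=tf.int64)
shape = tf.constant([2, 2], dtype=tf.int64)
ids = tf.constant([1, 3, 4], dtype=tf.int64)
vals = tf.constant([0.5, 2.0, 1.0], dtype=tf.float32)

sp_ids = tf.SparseTensor(indices, ids, shape)       # which feature ids are present per row
sp_weights = tf.SparseTensor(indices, vals, shape)  # their values

# Per row i: sum_j w[id_ij] * val_ij -- the sparse equivalent of tf.matmul(X, w).
y = tf.nn.embedding_lookup_sparse(w, sp_ids, sp_weights, combiner="sum")

sess = tf.Session()
sess.run(tf.initialize_all_variables())
print sess.run(y)  # shape [2, 1]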
python ./binary_classification.py --tr corpus/feature.trate.0_2.normed.txt --te corpus/feature.trate.1_2.normed.txt --batch_size 200 --method mlp --num_epochs 1000
... loading dataset: corpus/feature.trate.0_2.normed.txt
finish loading train set corpus/feature.trate.0_2.normed.txt
... loading dataset: corpus/feature.trate.1_2.normed.txt
finish loading test set corpus/feature.trate.1_2.normed.txt
num_features: 4762348
trainSet size: 70968
testSet size: 17742
batch_size: 200 learning_rate: 0.001 num_epochs: 1000
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 24
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 24
0 auc: 0. cost: 0.
1 auc: 0. cost: 0.
2 auc: 0. cost: 0.
3 auc: 0. cost: 0.
4 auc: 0. cost: 0.
5 auc: 0. cost: 0.
6 auc: 0. cost: 0.
7 auc: 0. cost: 0.
8 auc: 0. cost: 0.
9 auc: 0. cost: 0.
10 auc: 0. cost: 0.
11 auc: 0. cost: 0.
12 auc: 0. cost: 0.
13 auc: 0. cost: 0.
14 auc: 0. cost: 0.
15 auc: 0. cost: 0.
16 auc: 0. cost: 0.
17 auc: 0. cost: 0.
18 auc: 0. cost: 0.
19 auc: 0. cost: 0.
20 auc: 0. cost: 0.
21 auc: 0. cost: 0.
22 auc: 0. cost: 0.
23 auc: 0. cost: 0.
24 auc: 0. cost: 0.
Judging from these results, a simple MLP easily outperforms LinearSVM.
mlt feature.trate.0_2.normed.txt -c tt -test feature.trate.1_2.normed.txt --iter 1000000
I:36.02 Melt.h:59] _cmd.randSeed --- []
I:36.02 Melt.h:1209] omp_get_num_procs() --- [24]
I:36.02 Melt.h:1221] get_num_threads() --- [22]
I:36.02 Melt.h:1224] commandStr --- [tt]
I:36.02 time_util.h:102] TrainTest! started
I:36.02 time_util.h:102] ParseInputDataFile started
I:36.02 time_util.h:113] ParseInputDataFile finished using: [298.557 ms] (0.298551 s)
I:36.02 TrainerFactory.cpp:99] Creating LinearSVM trainer
I:36.02 time_util.h:102] Train started
MinMaxNormalizer prepare [ 70968 ] (0.193283 s)100% |******************************************|
I:37.02 time_util.h:102] Normalize started
I:37.02 time_util.h:113] Normalize finished using: [31.945 ms] (0.031939 s)
LinearSVM training [ 1000000 ] (1.14643 s)100% |******************************************|
Sigmoid/PlattCalibrator calibrating [ 70968 ] (0.139669 s)100% |******************************************|
I:38.02 Trainer.h:65] Param: [numIterations:1000000 learningRate:0.001 trainerTyper:peagsos loopType:stochastic sampleSize:1 performProjection:0 ]
I:38.02 time_util.h:113] Train finished using: [1671.9 ms] (1.6719 s)
I:38.02 time_util.h:102] ParseInputDataFile started
I:38.02 time_util.h:113] ParseInputDataFile finished using: [73.094 ms] (0.073092 s)
I:38.02 Melt.h:603] Test feature.trate.1_2.normed.txt and writting instance predict file to ./result/0.inst.txt
TEST POSITIVE RATIO:            0.2876 (5103/17742)
Confusion table:
||===============================||
||           PREDICTED           ||
 TRUTH    || positive | negative ||  RECALL
||===============================||
 positive ||     3195 |     1908 ||  0.6261 (3195/5103)
 negative ||     2137 |    10502 ||  0.8309 (10502/12639)
||===============================||
PRECISION     0.5992 (3195/5332)    0.8463 (10502/12410)
LOG-LOSS/instance:              0.4843
LOG-LOSS-PROB/instance:         0.6256
TEST-SET ENTROPY (prior LL/in): 0.6000
LOG-LOSS REDUCTION (RIG):       -4.2637%
OVERALL 0/1 ACCURACY:           0.7720 (13697/17742)
POS.PRECISION:                  0.5992
POS.RECALL:                     0.6261
NEG.PRECISION:                  0.8463
NEG.RECALL:                     0.8309
F1.SCORE:                       0.6124
OuputAUC: 0.7984
AUC: [0.7984]
----------------------------------------------------------------------------------------
I:38.02 time_util.h:113] TrainTest! finished using: [2242.72 ms] (2.24272 s)
#---------------------melt.py
#!/usr/bin/env python
#coding=gbk
# ==============================================================================
# \file melt.py
# \author chenghuige
# \date          13:40:19.506009
# \Description
# ==============================================================================
import numpy as np
#---------------------------melt load data
#Now supports the melt dense and sparse input file formats; sparse input has no header,
#while for dense input any header line is ignored
#@TODO also support libsvm format
def guess_file_format(line):
    is_dense = True
    has_header = False
    if line.startswith('#'):
        has_header = True
    elif line.find(':') > 0:
        is_dense = False
    return is_dense, has_header
def guess_label_index(line):
    #lines starting with '_' carry an instance name first, so the label is the second column
    label_idx = 0
    if line.startswith('_'):
        label_idx = 1
    return label_idx
#@TODO implement [a:b] so we can use [a:b] in application code
class Features(object):
    def __init__(self):
        self.data = []

    def mini_batch(self, start, end):
        return self.data[start: end]

    def full_batch(self):
        return self.data
class SparseFeatures(object):
    def __init__(self):
        self.sp_indices = []
        self.start_indices = [0]  #prefix offsets: start_indices[i] is where instance i begins
        self.sp_ids_val = []
        self.sp_weights_val = []
        self.sp_shape = None

    def mini_batch(self, start, end):
        batch = SparseFeatures()
        start_ = self.start_indices[start]
        end_ = self.start_indices[end]
        batch.sp_ids_val = self.sp_ids_val[start_: end_]
        batch.sp_weights_val = self.sp_weights_val[start_: end_]
        max_len = 0
        #@TODO better way to construct sp_indices for each mini batch ?
        for i in xrange(start + 1, end + 1):
            len_ = self.start_indices[i] - self.start_indices[i - 1]
            if len_ > max_len:
                max_len = len_
            for j in xrange(len_):
                batch.sp_indices.append([i - start - 1, j])
        batch.sp_shape = [end - start, max_len]
        return batch

    def full_batch(self):
        if len(self.sp_indices) == 0:
            max_len = 0
            for i in xrange(1, len(self.start_indices)):
                len_ = self.start_indices[i] - self.start_indices[i - 1]
                if len_ > max_len:
                    max_len = len_
                for j in xrange(len_):
                    self.sp_indices.append([i - 1, j])
            self.sp_shape = [len(self.start_indices) - 1, max_len]
        return self
class DataSet(object):
    def __init__(self):
        self.labels = []
        self.features = None
        self.num_features = 0

    def num_instances(self):
        return len(self.labels)

    def full_batch(self):
        return self.features.full_batch(), self.labels

    def mini_batch(self, start, end):
        if end < 0:
            end = self.num_instances() + end
        return self.features.mini_batch(start, end), self.labels[start: end]
def load_dense_dataset(lines):
    dataset_x = []
    dataset_y = []
    label_idx = guess_label_index(lines[0])
    nrows = 0
    for i in xrange(len(lines)):
        if nrows % 10000 == 0:
            print nrows
        nrows += 1
        line = lines[i]
        l = line.rstrip().split()
        dataset_y.append([float(l[label_idx])])
        dataset_x.append([float(x) for x in l[label_idx + 1:]])
    dataset_x = np.array(dataset_x)
    dataset_y = np.array(dataset_y)
    dataset = DataSet()
    dataset.labels = dataset_y
    dataset.num_features = dataset_x.shape[1]
    features = Features()
    features.data = dataset_x
    dataset.features = features
    return dataset
def load_sparse_dataset(lines):
    dataset_y = []
    label_idx = guess_label_index(lines[0])
    #the column after the label holds the total feature count
    num_features = int(lines[0].split()[label_idx + 1])
    features = SparseFeatures()
    start_idx = 0
    nrows = 0
    for i in xrange(len(lines)):
        if nrows % 10000 == 0:
            print nrows
        nrows += 1
        line = lines[i]
        l = line.rstrip().split()
        dataset_y.append([float(l[label_idx])])
        start_idx += (len(l) - label_idx - 2)
        features.start_indices.append(start_idx)
        for item in l[label_idx + 2:]:
            id, val = item.split(':')
            features.sp_ids_val.append(int(id))
            features.sp_weights_val.append(float(val))
    dataset_y = np.array(dataset_y)
    dataset = DataSet()
    dataset.labels = dataset_y
    dataset.num_features = num_features
    dataset.features = features
    return dataset
def load_dataset(dataset, has_header=False):
    print '... loading dataset:', dataset
    lines = open(dataset).readlines()
    if has_header:
        return load_dense_dataset(lines[1:])
    is_dense, has_header = guess_file_format(lines[0])
    if is_dense:
        return load_dense_dataset(lines[has_header:])
    return load_sparse_dataset(lines)
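For reference, this is what the two input formats look like as inferred from the parsing code above; the concrete labels, ids, and values are made-up illustrations. A dense file may start with a '#' header line and then carries the label followed by all feature values; a sparse file carries the label, the total feature count, and then id:value pairs for the non-zero features (a line may also begin with an instance name starting with '_', which shifts the label to the second column):

# dense format (header line optional, skipped when present)
#label f0 f1 f2
1 0.5 0.0 1.2
0 0.1 0.3 0.0

# sparse format: label, num_features, then id:value pairs
1 4762348 12:0.5 4051:1.2
0 4762348 7:0.3 12:0.1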
#-----------------------------------------melt for tensorflow
import tensorflow as tf

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

def matmul(X, w):
    #dense input is an ordinary Tensor; sparse input is a (sp_ids, sp_weights) pair
    if type(X) == tf.Tensor:
        return tf.matmul(X, w)
    return tf.nn.embedding_lookup_sparse(w, X[0], X[1], combiner="sum")
class BinaryClassificationTrainer(object):
    def __init__(self, dataset):
        self.labels = dataset.labels
        self.features = dataset.features
        self.num_features = dataset.num_features
        self.X = tf.placeholder("float", [None, self.num_features])
        self.Y = tf.placeholder("float", [None, 1])

    def gen_feed_dict(self, trX, trY):
        return {self.X: trX, self.Y: trY}

class SparseBinaryClassificationTrainer(object):
    def __init__(self, dataset):
        self.labels = dataset.labels
        self.features = dataset.features
        self.num_features = dataset.num_features
        self.sp_indices = tf.placeholder(tf.int64)
        self.sp_shape = tf.placeholder(tf.int64)
        self.sp_ids_val = tf.placeholder(tf.int64)
        self.sp_weights_val = tf.placeholder(tf.float32)
        self.sp_ids = tf.SparseTensor(self.sp_indices, self.sp_ids_val, self.sp_shape)
        self.sp_weights = tf.SparseTensor(self.sp_indices, self.sp_weights_val, self.sp_shape)
        self.X = (self.sp_ids, self.sp_weights)
        self.Y = tf.placeholder("float", [None, 1])

    def gen_feed_dict(self, trX, trY):
        return {self.Y: trY,
                self.sp_indices: trX.sp_indices,
                self.sp_shape: trX.sp_shape,
                self.sp_ids_val: trX.sp_ids_val,
                self.sp_weights_val: trX.sp_weights_val}

def gen_binary_classification_trainer(dataset):
    if type(dataset.features) == Features:
        return BinaryClassificationTrainer(dataset)
    return SparseBinaryClassificationTrainer(dataset)
#------------------------- binary_classification.py
#!/usr/bin/env python
#coding=gbk
# ==============================================================================
# \file binary_classification.py
# \author chenghuige
# \date          16:06:52.693026
# \Description
# ==============================================================================
import sys
import tensorflow as tf
import numpy as np
from sklearn.metrics import roc_auc_score
import melt
flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_float('learning_rate', 0.001, 'Initial learning rate.')
flags.DEFINE_integer('num_epochs', 120, 'Number of epochs to run trainer.')
flags.DEFINE_integer('batch_size', 500, 'Batch size. Must divide evenly into the dataset sizes.')
flags.DEFINE_string('train', './corpus/feature.normed.rand..txt', 'train file')
flags.DEFINE_string('test', './corpus/feature.normed.rand..txt', 'test file')
flags.DEFINE_string('method', 'logistic', 'currently support logistic/mlp')
#----for mlp
flags.DEFINE_integer('hidden_size', 20, 'Hidden unit size')
trainset_file = FLAGS.train
testset_file = FLAGS.test
learning_rate = FLAGS.learning_rate
num_epochs = FLAGS.num_epochs
batch_size = FLAGS.batch_size
method = FLAGS.method
trainset = melt.load_dataset(trainset_file)
print "finish loading train set ",trainset_file
testset = melt.load_dataset(testset_file)
print "finish loading test set ", testset_file
assert(trainset.num_features == testset.num_features)
num_features = trainset.num_features
print 'num_features: ', num_features
print 'trainSet size: ', trainset.num_instances()
print 'testSet size: ', testset.num_instances()
print 'batch_size:', batch_size, ' learning_rate:', learning_rate, ' num_epochs:', num_epochs
trainer = melt.gen_binary_classification_trainer(trainset)
class LogisticRegression:
    def model(self, X, w):
        return melt.matmul(X, w)

    def run(self, trainer):
        w = melt.init_weights([trainer.num_features, 1])
        py_x = self.model(trainer.X, w)
        return py_x

class Mlp:
    def model(self, X, w_h, w_o):
        h = tf.nn.sigmoid(melt.matmul(X, w_h))  # this is a basic mlp, think 2 stacked logistic regressions
        return tf.matmul(h, w_o)  # note that we dont take the sigmoid at the end because our cost fn applies it for us

    def run(self, trainer):
        w_h = melt.init_weights([trainer.num_features, FLAGS.hidden_size])  # create symbolic variables
        w_o = melt.init_weights([FLAGS.hidden_size, 1])
        py_x = self.model(trainer.X, w_h, w_o)
        return py_x

def gen_algo(method):
    if method == 'logistic':
        return LogisticRegression()
    elif method == 'mlp':
        return Mlp()
    print method, 'is not supported right now'
algo = gen_algo(method)
py_x = algo.run(trainer)
Y = trainer.Y
cost = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(py_x, Y))
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)  # construct optimizer
predict_op = tf.nn.sigmoid(py_x)

sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)

teX, teY = testset.full_batch()
num_train_instances = trainset.num_instances()
for i in range(num_epochs):
    predicts, cost_ = sess.run([predict_op, cost], feed_dict=trainer.gen_feed_dict(teX, teY))
    print i, 'auc:', roc_auc_score(teY, predicts), 'cost:', cost_ / len(teY)
    for start, end in zip(range(0, num_train_instances, batch_size), range(batch_size, num_train_instances, batch_size)):
        trX, trY = trainset.mini_batch(start, end)
        sess.run(train_op, feed_dict=trainer.gen_feed_dict(trX, trY))

predicts, cost_ = sess.run([predict_op, cost], feed_dict=trainer.gen_feed_dict(teX, teY))
print 'final', 'auc:', roc_auc_score(teY, predicts), 'cost:', cost_ / len(teY)