Can anyone recommend a good time series database?

OpenTSDB: A Scalable, Distributed Time Series Database: Open Source Convention - O'Reilly OSCON, July 25 - 29, 2011 in
Portland, OR
OpenTSDB: A Scalable, Distributed Time Series Database
(StumbleUpon, Inc.)
In this talk, we’ll take a deep dive into OpenTSDB, the open-source, horizontally scalable, time series database built on top of HBase.
We’ll show how its design allows you to easily monitor large clusters of commodity machines at an unprecedented level of granularity.
We’ll review how its implementation enables you to store billions of data points and to track orders of magnitude more time series from thousands of hosts and applications.
With a resolution of a few seconds, OpenTSDB provides accurate real-time monitoring as well as long term trending, without ever having to delete or destructively downsample any data.
Benoit Sigoure (StumbleUpon, Inc.) is a software engineer with a strong UNIXy/Linux background.
He specializes in designing, writing and running large-scale distributed serving systems that serve millions of users. He has a deep understanding of the entire technology stack (including Google's), from on-wire protocols and low-level implementation details all the way up to high-level designs used in high-availability distributed systems (both in software and in the datacenter).
Benoit designed and implemented OpenTSDB, the open-source, highly scalable, distributed monitoring system.
Processing home monitoring data with the Time Series Database in Bluemix
Product Management Director, Continuent
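As home automation becomes more common, both the number of sensors recording statistics and the amount of information needed to supply that data keep growing. With the Time Series Database in Bluemix, you can easily record time-stamped data, then query and report on it. This tutorial shows how to use the Time Series Database to create, store, and ultimately report on information. We also use the database to correlate data points across multiple sensors and track the efficiency of the heating system in a multi-zone house.

Creating the core application

The core of our application is a Node.js application connected to a back end, the Time Series Database. First, create a new Node.js application in the Bluemix dashboard:
1. Log in to Bluemix.
2. Click Create an App.
3. Click Node.js App Starter and configure your application parameters.
Once the application has been created and named, add a new service: the Time Series Database.

After this step you need access to the code. I prefer to download the package and then work with it through the cf command-line tool. That lets me make quick changes and then run cf push to update the version running on the Bluemix instance. However you choose to work with the code, the core of your application is the app.js file created in the top-level directory of the application package.

The file has several sections. The section at the top contains configuration information and the core integrations that determine how the application connects to other components. The following code shows an example of this header section.
/*jshint node:true*/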
// This file contains the server-side JavaScript code for your application.
// This sample application uses express as the web application framework (/),
// and jade as the template engine (/).
var express = require('express');
// setup middleware
var app = express();
var fs = require('fs');
app.use(app.router);
app.use(express.errorHandler());
app.use(express.static(__dirname + '/public')); //setup static public directory
app.set('view engine', 'jade');
app.set('views', __dirname + '/views'); //optional because express defaults to CWD/views
// render index page
app.get('/', function(req, res){
  res.render('index');
});
// There are many useful environment variables available in process.env.
// VCAP_APPLICATION contains useful information about a deployed application.
var appInfo = JSON.parse(process.env.VCAP_APPLICATION || "{}");
// TODO: Get application information and use it in your app.
// VCAP_SERVICES contains all the credentials of services bound to
// this application. For details of its content, please refer to
// the document or sample of each service.
var services = JSON.parse(process.env.VCAP_SERVICES || "{}");
// TODO: Get service credentials and communicate with bluemix services.
// The IP address of the Cloud Foundry DEA (Droplet Execution Agent) that hosts this application:
var host = (process.env.VCAP_APP_HOST || 'localhost');
// The port on the DEA for communication with the application:
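var port = (process.env.VCAP_APP_PORT || 3000);

There are many lines here, and a lot of them are used only during initialization. The most important line retrieves the information about the current services:
var services = JSON.parse(process.env.VCAP_SERVICES || "{}");
The information captured here includes the access details for every service configured in your application. Specifically, it contains the URL and credentials used to access your Time Series Database.

Another important fragment determines what happens when a specific URL in your application is accessed:
// render index page
app.get('/', function(req, res){
  res.render('index');
});
In this case, whenever myapp.bluemix.net/ (the root, or index) is accessed, the Node.js application either responds directly or calls another function. Here we use an inline JavaScript function that calls render(). This is a method on the res (response) object; in essence, it reads the contents of the template file and returns an index document. The request is also passed to the function, which is useful for every function called this way: the request contains the URL information and any query parameters included in the URL. How to process that information is covered later in this tutorial.

Whatever the response to each element, the important point is that app.get() defines a route that maps an accessed URL to the action, function, or page to be returned. Remember that the application runs entirely within the confines of a JavaScript application, so we need a way to connect an accessed URL to a response. The app.get() method registers a URL and what that URL means to the application.

The last part of the app.js file contains the call that starts the application (and begins listening for incoming requests), plus a line that logs that the application has started:
app.listen(port, host);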
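console.log('App started on port ' + port);
Again, remember that this is a JavaScript application, so there is no stdout or stderr to speak of. You use the console.log() method to record errors and information. For Bluemix applications this information is available through the dashboard, or by running 'cf logs appname' with the cf command-line tool.

Before you start working with data, you need to open a connection to the Time Series Database. The Time Series Database in Bluemix can be reached through a number of different interfaces, depending on the application environment. For a Node.js application there are two options: the REST API or the Mongo API. The REST interface can be useful if you want to expose a direct pass-through interface to the underlying data, for example when pulling charts and other information. We use the Mongo API because it offers a very simple way to access the data from a Node.js application; that is, calls use JSON to describe, slice, and define records and information.

To create the interface you must know the credentials for connecting to the database (a URL that includes the user name and password). This information is provided in the services variable, which was populated automatically earlier in the Node.js script. Because several services may be present, you connect to the first URL of the timeseriesdatabase object class within the main services object, similar to the following:
services.timeseriesdatabase[0].credentials.json_url
To actually access the database, you can wrap the whole process in a connect call:
require('mongodb').connect(services.timeseriesdatabase[0].credentials.json_url, function(err, conn) {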
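});
This uses the extracted credentials to open a connection to the Time Series Database through the MongoDB API, and calls a function that is given the error and the connection object. Although this may look somewhat wasteful, it is the best way to track any problems or errors and recover from them.

Data is accessed through the MongoDB connection. Data in MongoDB is stored in collections, which are roughly equivalent to tables in a typical SQL database.

Information stored in the Time Series Database is designed to reference and link to a specific measurement and context. Think of the Time Series Database as a diary in which you record where you were, the type of measurement, and the value. Because it is a time series database, you can keep looking at the same value and determine whether the measurement is trending up or down, or whether the latest measurement is higher or lower than a historical one.

Because of the way the Time Series Database works, each measurement must satisfy some constraints and supply some parameters:
- It must have a unit of measure (measure_unit). In this application the unit is degrees Celsius, but it could equally be MB, seconds, or mm, as long as that is the kind of data being stored.
- It must have a location or identity (loc_esi_id). This defines the identifier for the entry.
- It must have a direction (direction): whether it is a producer (a creator of the value) or a consumer (a user of it). This is a single-byte value, P or C.
- It must have a value (value). You cannot record an 'empty' or NULL value. The value is recorded as a decimal type with up to 3 decimal places.
- It must have a reading timestamp (tstamp). The timestamp must have the format 00:00:00.00000.

Unlike in a traditional database, timestamps must fall on 15-minute boundaries. So within each hour, a value can be recorded at minute 00, 15, 30, or 45. In addition, only one value can be recorded for each combination of timestamp, direction, unit of measure, and location. For example, if you record the value from a temperature sensor in the lounge at 16:11:00 on a particular day, that is a unique data point. You can, however, record the temperature from the garage at the same time, because it is a different location.

All data is written to a collection called ts_data_v. Because different locations can be recorded in this table, a single table is all we need; entries differ only by location or measurement type. Keep in mind, though, that you cannot record two temperatures (measurement type C) for the same location, such as the lounge, although using 'lounge_front' and 'lounge_back' is perfectly fine. So you can create a record like this:
{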
"direction" : "P",
"tstamp" : " 11:17:15.00000",
"measure_unit" : "C",
"value" : 21,
"loc_esi_id" : "lounge"
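}
To record the value, the application needs to parse a URL so that the location and value can be extracted. You can do this by processing the request object passed in from the route call (from app.get()):
var parsedUrl = require('url').parse(req.url, true);
var queryObject = parsedUrl.query;
var name = (queryObject["name"] || 'lounge');
var temppoint = parseFloat((queryObject["temp"] || 0));
First, the URL from the request object is parsed. This yields a queryObject from which you can extract the URL values, in this case the location and the temperature point. So the following line:
http://mcslptds.mybluemix.net/reading?name=kitchen&temp=23
gives us the location 'kitchen' and the value '23'.

Next we need to construct a suitable timestamp. As you will recall, it must match the format the database expects exactly. That means padding with zeros and assembling the right combination of date and time into a string. Combined with the extracted URL values, we get:
var create_datapoint = function(req, res) {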
  console.log("Recording a datapoint");
  require('mongodb').connect(services.timeseriesdatabase[0].credentials.json_url, function(err, conn) {
    // Stop here if the connection failed; conn is not usable in that case
    if (err) { res.write(err.stack); res.end(); return; }
    console.log("Extracting values");
    var collection = conn.collection('ts_data_v');
    var parsedUrl = require('url').parse(req.url, true);
    var queryObject = parsedUrl.query;
    var name = (queryObject["name"] || 'lounge');
    var temppoint = parseFloat((queryObject["temp"] || 0));
    // Zero-pad a number to two digits
    var pad = function(n) { return (n < 10 ? '0' : '') + n; };
    var tsdate = new Date();
    var datestring = tsdate.getUTCFullYear() + '-';
    datestring = datestring + pad(tsdate.getUTCMonth() + 1) + '-';
    datestring = datestring + pad(tsdate.getUTCDate()) + ' ';
    datestring = datestring + pad(tsdate.getUTCHours()) + ':';
    // Minutes should only be logged if they are a multiple of 15
    var realminutes = Math.floor(tsdate.getUTCMinutes() / 15) * 15;
    datestring = datestring + pad(realminutes) + ':';
    // Seconds and fraction, following the 00:00:00.00000 format described earlier
    datestring = datestring + '00.00000';
    console.log("Date: " + datestring);
    var message = { 'loc_esi_id': name, 'measure_unit' : 'C', 'direction' : 'P', 'value': temppoint, 'tstamp': datestring};
    console.log("Constructed record");
    console.log(JSON.stringify(message));
    collection.insert(message, {safe:true}, function(err){
      if (err) { console.log(err.stack); }
      res.write(JSON.stringify(message));
      res.end();
    });
  });
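};
This creates the create_datapoint() function. The application needs a route that connects to it, so we add an entry alongside the other routes:
app.get("/reading", function (req, res) {
  create_datapoint(req,res);
});
Now that there is a way to get data in, you can test it by deploying the application to Bluemix again and then recording a point with curl:
$ curl 'http://mcslptds.mybluemix.net/reading?name=kitchen&temp=23'
We now have a way to get data in, so next we look at how to get it back out.

Retrieving a list of entries

To get a list of entries, connect to the database through the MongoDB interface and then use the collection.find() function to extract the values stored in the database:
var list_datapoint = function(req, res) {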
  var parsedUrl = require('url').parse(req.url, true);
  var queryObject = parsedUrl.query;
  var name = (queryObject["name"] || 'lounge');
  res.writeHead(200, {'Content-Type': 'text/plain'});
  require('mongodb').connect(services.timeseriesdatabase[0].credentials.json_url, function(err, conn) {
    var collection = conn.collection('ts_data_v');
    res.write("Reading from collection ts_data_v");
    collection.find({"loc_esi_id":name}, {limit:1000, sort:[['loc_esi_id','ascending'],['tstamp','ascending']]}, function(err, cursor) {
      cursor.toArray(function(err, items) {
        // On error there are no items to list, so report and finish
        if (err) { res.write(err.stack); res.end(); return; }
        for (var i = 0; i < items.length; i++) {
          res.write(JSON.stringify(items[i]) + "\n");
        }
        res.end();
      });
    });
  });
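};
First, a valid header (HTTP 200) is returned. Then the function connects to the database and runs a query using the supplied name, or the default 'lounge'. For simplicity, we register this function in the application under the URL /dumplist:
app.get("/dumplist", function (req, res) {
  list_datapoint(req,res);
});
This returns a list of values. Of course, that is not very useful on its own. Returning a graph is a much better approach, because it makes trends and changes in value over time easier to see. This is a time series database, after all.

Creating the graph

Providing a graph of the temperature data actually involves two steps:
1. Build a method that generates a list of values suitable for graphing.
2. Build an HTML page that includes the graph-plotting software and calls that URL to get the data.
Taking the second step first, our approach is to create a static HTML page that is read and returned to the user when a specific URL is called:
var graph_view = function(req, res) {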
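  fs.readFile('./public/graph.html', function(err, data) {
    res.end(data);
  });
};
This returns the contents of /public/graph.html from the application directory. To display the page, we add a route, /graph, to this function:
app.get("/graph", function (req, res) {
  graph_view(req,res);
});
The HTML itself is very simple. It loads jQuery and then the jQuery Flot library to draw the chart:
<html>
<head>
<title>Graph</title>
<link rel="stylesheet" type="text/css" href="stylesheets/style.css" />
<script src="jquery.js"></script>
<script src="jquery.flot.js"></script>
<script src="jquery.flot.time.js"></script>
<script>
$(document).ready(function() {
  var dataseries = [];
  $.getJSON("/graphpoints", function(json) {
    $.plot($("#plot") , [json],{ xaxis: { mode: "time"}});
  });
});
</script>
</head>
<body>
<div id="plot"></div>
<div id="plotdata"></div>
</body>
</html>
Make sure you save this file in the right place (public/graph.html) so that it matches the path read by graph_view(). The key part is the inline function that fetches the data. In this case we call our own application through the URL /graphpoints and load the result through Ajax:
$.getJSON("/graphpoints", function(json) {
  $.plot($("#plot") , [json],{ xaxis: { mode: "time"}});
});
The data must have a specific format: a nested array in which each inner array is a pair of X and Y values. For example:
[[1,21],[2,23],…]
The first element is a sequence number or a date/time value that the Flot library can interpret. To output this data, we create a modified version of the /dumplist function that generates the information in the right format. The whole function looks like this:
var graph_datapoints = function(req, res) {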
  var parsedUrl = require('url').parse(req.url, true);
  var queryObject = parsedUrl.query;
  var name = (queryObject["name"] || 'lounge');
  res.writeHead(200, {'Content-Type': 'application/json'});
  require('mongodb').connect(services.timeseriesdatabase[0].credentials.json_url, function(err, conn) {
    var collection = conn.collection('ts_data_v');
    var dataseries = new Array();
    collection.find({"loc_esi_id": name}, {limit:200, sort:[['tstamp','descending']]}, function(err, cursor) {
      cursor.toArray(function(err, items) {
        // On error there are no items to convert, so report and finish
        if (err) { res.write(err.stack); res.end(); return; }
        for (var i = 0; i < items.length; i++) {
          var timeint = (new Date(items[i].tstamp).getTime())/1000;
          dataseries.push([timeint,items[i].value]);
          console.log(JSON.stringify(dataseries));
        }
        console.log("Final: " + JSON.stringify(dataseries));
        res.write(JSON.stringify(dataseries));
        res.end();
      });
    });
  });
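};
The core of this function is the Mongo API call that accesses the collection. The results are sorted by the timestamp field in descending order. Then, for each returned entry, an array is created that pairs the time (parsed from the timestamp) with the value:
timeint = (new Date(items[i].tstamp).getTime())/1000;
dataseries.push([timeint,items[i].value]);
When the graph page is accessed, the HTML page is returned and the Ajax code loads the /graphpoints URL, which returns the data from the Time Series Database, plots the values, and gives us a graph.

Conclusion

Creating a new application with Bluemix and Node.js is actually very simple. By using the boilerplates and ready-made applications, you can create a basic application with very little effort. For a Node.js application, the key elements are a REST-like API (which you can create to submit and return information) and the ability to serve content as traditional HTML pages. By combining these elements with a back-end database such as the Time Series Database, storing and retrieving information becomes straightforward. With the Time Series Database, the complexity of ordering and sorting the data is handled for you, and what you get back is correctly ordered information.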
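The timestamp rule described in this tutorial (zero-padded fields, minutes rounded down to a 15-minute boundary) can be isolated into a small, testable helper. The following is only a sketch: the function names pad2 and toQuarterHourTimestamp are hypothetical and not part of the original app.js, and the full "YYYY-MM-DD HH:MM:00.00000" layout is an assumption based on the date-building code and the time format given in the text.

```javascript
// Hypothetical helper, not part of the original app.js.
// Zero-pads a number to two digits.
function pad2(n) {
  return String(n).padStart(2, '0');
}

// Formats a Date in UTC as "YYYY-MM-DD HH:MM:00.00000" (assumed layout),
// rounding the minutes down to the nearest multiple of 15.
function toQuarterHourTimestamp(date) {
  const minutes = Math.floor(date.getUTCMinutes() / 15) * 15;
  return date.getUTCFullYear() + '-' +
    pad2(date.getUTCMonth() + 1) + '-' +
    pad2(date.getUTCDate()) + ' ' +
    pad2(date.getUTCHours()) + ':' +
    pad2(minutes) + ':00.00000';
}

// A reading taken at 16:11 UTC is stored in the 16:00 slot.
console.log(toQuarterHourTimestamp(new Date(Date.UTC(2014, 10, 5, 16, 11, 0))));
```

Keeping the formatting in one pure function means the rounding rule can be checked without a database connection, and the route handler only has to call it.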
Human Pol II promoter prediction: time series descriptors and machine learning
Keywords: human, pol, ii
Rajeev Gangal and Pankaj Sharma*
Scinova Technologies Pvt. Ltd, 528/43 Vishwashobha, adjacent to Modi Ganpati, Narayan Peth, Pune 411030, Maharashtra, India
*To whom correspondence should be addressed. Tel: +91 20 4450282; Fax: +91 20 4450282; Email:
Although several in silico promoter prediction methods have been developed to date, they are still limited in predictive performance. The limitations are due to the challenge of selecting appropriate features of promoters that distinguish them from non-promoters and the generalization or predictive ability of the machine-learning algorithms. In this paper we attempt to define a novel approach by using unique descriptors and machine-learning methods for the recognition of eukaryotic polymerase II promoters. In this study, non-linear time series descriptors along with non-linear machine-learning algorithms, such as support vector machine (SVM), are used to discriminate between promoter and non-promoter regions. The basic idea here is to use descriptors that do not depend on the primary DNA sequence and provide a clear distinction between promoter and non-promoter regions. The classification model, built on a set of 1000 promoter and 1500 non-promoter sequences, showed a 10-fold cross-validation accuracy of 87%, and an independent test set had an accuracy >85% in both promoter and non-promoter identification. This approach correctly identified all 20 experimentally verified promoters of human chromosome 22. The high sensitivity and selectivity indicate that n-mer frequencies along with non-linear time series descriptors, such as Lyapunov component stability and Tsallis entropy, and supervised machine-learning methods, such as SVMs, can be useful in the identification of Pol II promoters.
Introduction
One of the challenges in the field of computational biology, and especially in the area of computational DNA sequence analysis, is the automatic detection of promoter sites. Promoter sites typically have a complex structure consisting of multi-functional binding sites for proteins involved in the transcription initiation process. Promoters have been defined as modular DNA structures containing a complex array of cis-acting regulatory elements required for accurate and efficient initiation of transcription and for controlling expression of a gene.
Eukaryotic cells basically contain three different types of RNA polymerases in their nuclei: RNA polymerases I, II and III. RNA polymerase II transcribes all protein-coding sequences in eukaryotic cells, and is the most important of the three polymerases. Promoters in general contain two consensus sequences: (i) a TATA box located 30 bp upstream from the transcriptional start site and (ii) a CCAAT box located somewhere around -75 bp, with a consensus sequence of GGCCAATCT. There are also a number of other consensus sequences that frequently occur in eukaryotic promoters, which serve as binding sites for a wide variety of protein transcription factors, such as the GC box and enhancers. Since eukaryotic promoters are highly diverse, it has been very difficult to find generalized patterns or rules by conventional sequence analysis methods. Promoters contain vital information about gene expression and regulatory networks, including gene targets of individual cascades/signalling pathways (1). The basic aim of computer-assisted promoter recognition is the elucidation of gene transcription and associated genetic regulatory networks. Prediction of the functionality of a promoter would also be welcome for gene therapy approaches to improve the expression of newly created vector constructs.
Several algorithms for the prediction of promoters, transcriptional start points and transcription factor binding sites in eukaryotic DNA sequence now exist (2,3). Although current algorithms perform much better than the earlier attempts, it is probably fair to say that performance is still far from satisfactory.
Prometheus, a machine-learning tool, is designed to address the problem of low prediction accuracy. It specifically deals with the application of non-linear dynamics and statistical thermodynamics descriptors, such as Lyapunov component and Tsallis entropy, along with non-linear machine-learning algorithms. Prometheus is found to perform significantly better than some other promoter finding programs: NNPP 2.2, Promoter Scan version 1.7, Promoter 2.0 Prediction Server (4), Soft Berry (5) and Dragon Promoter Finder (6).
A DNA sequence can be pictured as a dynamical system. It evolves continuously in the course of evolution and is thus subject to perturbation, i.e. losses and gains of single residues or fragments. It can perhaps further be characterized as a chaotic dynamical system, since a slight change in initial conditions can lead to different outcomes in terms of the final function (7).
The aim of the present study is to provide a distinct classification between promoter and non-promoter sequences. In the present study, we have used properties such as 3-mer and 4-mer (n-mer) frequencies (8) and GC% along with non-linear time series descriptors, i.e. Lyapunov exponent and Tsallis entropy (9). Non-linear time series analysis is being increasingly applied in the fields of biology and physiology, where the systems are expected to be non-linear and a simple linear stochastic description often does not account for the highly complex nature of the observed behaviour. The maximum Lyapunov exponent used here is a qualitative measure of the stability of a dynamical system. A quantitative measure of the sensitive dependence on the initial conditions is the Lyapunov exponent. It is the averaged rate of divergence (or convergence) of two neighbouring trajectories. Lyapunov exponents quantify this divergence by measuring the mean rate of exponential divergence of initially neighbouring trajectories (10). A trajectory of a system with a negative Lyapunov exponent is stable and will converge to an attractor exponentially with time. The magnitude of the Lyapunov exponent determines how fast the attractor is approached. A trajectory of a system with a positive Lyapunov exponent is unstable and will not converge to an attractor. The magnitude of the positive Lyapunov exponent determines the rate of exponential divergence of the trajectory.
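To make the sign convention of the Lyapunov exponent concrete, here is a minimal sketch that averages the logarithm of the local stretching factor along a trajectory. It uses the logistic map x -> r x (1 - x) purely as a stand-in one-dimensional system; the map, the function name lyapunovLogistic and all parameter values are illustrative assumptions, not the DNA-derived series or the estimation procedure used in this study.

```javascript
// Illustrative only: estimates the Lyapunov exponent of the logistic map
// x -> r*x*(1-x) as the trajectory average of ln|f'(x)|.
function lyapunovLogistic(r, x0, transient, steps) {
  let x = x0;
  // Discard the transient so the average is taken on the attractor
  for (let i = 0; i < transient; i++) {
    x = r * x * (1 - x);
  }
  let sum = 0;
  for (let i = 0; i < steps; i++) {
    // |f'(x)| = |r * (1 - 2x)| is the local rate of stretching or contraction
    const stretch = Math.abs(r * (1 - 2 * x));
    // Guard against log(0) if the orbit lands exactly on x = 0.5
    sum += Math.log(Math.max(stretch, Number.MIN_VALUE));
    x = r * x * (1 - x);
  }
  return sum / steps;
}

const stable = lyapunovLogistic(2.5, 0.3, 1000, 5000);
const chaotic = lyapunovLogistic(4.0, 0.3, 1000, 5000);
console.log('r = 2.5:', stable.toFixed(3));  // negative: trajectory converges to a fixed point
console.log('r = 4.0:', chaotic.toFixed(3)); // positive: neighbouring trajectories diverge
```

For r = 2.5 the orbit settles on the fixed point 0.6, where the derivative has magnitude 0.5, so the exponent approaches ln 0.5; for r = 4 the map is chaotic and the estimate is positive.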
In recent years, considerable interest has been generated in the question of non-extensivity of entropy and statistics of a number of systems. Tsallis entropy, which gives the usual Shannon-Boltzmann-Gibbs entropy as a special case (11), has enjoyed considerable success in dealing with a number of non-equilibrium phenomena and hence is a prime candidate for application to biological systems. Since biological systems ranging from genes and proteins to cells, organisms and ecosystems are open and far from equilibrium, Tsallis entropy might have an important role to play in chemical and biological dynamics in general (12). Tsallis entropy is given by
\[ S_q = \frac{1 - \int [f(x)]^q \, dx}{q - 1} \]
where x is a dimensionless state variable, f corresponds to the probability distribution and the entropic index q is any real number. This entropy recovers the standard Boltzmann-Gibbs entropy $S = -\int f \ln f \, dx$ in the limit $q \to 1$. $S_q$ is non-extensive such that $S_q(A + B) = S_q(A) + S_q(B) + (1 - q) S_q(A) S_q(B)$, where A and B are two systems independent in the sense that f(A + B) = f(A) f(B). It is clear that q can be seen as measuring the degree of non-extensivity (13). The Tsallis entropic form has been applied to protein folding problems and other biological phenomena. Here, we use it for functional annotation as follows.
The Tsallis index can be estimated by using
\[ \frac{1}{1 - q} = \frac{1}{\alpha_{\min}} - \frac{1}{\alpha_{\max}} \]
where $\alpha_{\min}$ and $\alpha_{\max}$ are the minimum and maximum values, respectively, of $\alpha$ in the multifractal spectrum. The values of q obtained in this way clearly indicate that the thermostatistics is non-extensive and that the Tsallis form is more suitable for analysis of such sequences. Next, we calculate Tsallis entropy for all sequences, and the classifying criterion is the rate of growth of this entropy along the sequence.
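The properties of Tsallis entropy stated above (recovery of the Boltzmann-Gibbs form as q approaches 1, and non-extensivity for independent systems) can be checked numerically. The sketch below uses the discrete analogue S_q = (1 - sum of p_i^q) / (q - 1) rather than the integral form; the function names tsallisEntropy and jointIndependent and the example distributions are assumptions chosen for illustration, not the computation performed on the sequences in this study.

```javascript
// Discrete Tsallis entropy S_q = (1 - sum_i p_i^q) / (q - 1), with k = 1.
// Zero-probability states are skipped, as they contribute nothing.
function tsallisEntropy(probs, q) {
  if (Math.abs(q - 1) < 1e-12) {
    // Limit q -> 1: Shannon / Boltzmann-Gibbs entropy, -sum p ln p
    return -probs.reduce((s, p) => (p > 0 ? s + p * Math.log(p) : s), 0);
  }
  const sumPq = probs.reduce((s, p) => (p > 0 ? s + Math.pow(p, q) : s), 0);
  return (1 - sumPq) / (q - 1);
}

// Joint distribution of two independent systems: p_ij = a_i * b_j
function jointIndependent(a, b) {
  return a.flatMap(ai => b.map(bj => ai * bj));
}

const a = [0.2, 0.8];
const b = [0.5, 0.3, 0.2];
const q = 1.5;
const sA = tsallisEntropy(a, q);
const sB = tsallisEntropy(b, q);
const sAB = tsallisEntropy(jointIndependent(a, b), q);

// Non-extensivity: S_q(A+B) = S_q(A) + S_q(B) + (1 - q) S_q(A) S_q(B)
console.log(sAB, sA + sB + (1 - q) * sA * sB);
```

The two printed numbers agree up to floating-point rounding, whereas the plain sum sA + sB would not, which is what the (1 - q) cross term expresses.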
Materials and methods
In order to accomplish the task of eukaryotic polymerase II promoter prediction, the dataset was taken from the Eukaryotic Promoter Database (EPD) release 76 and release 50. The Eukaryotic Promoter Database is an annotated non-redundant collection of eukaryotic Pol II promoters for which the transcription start site has been determined experimentally (14). The model was trained by using two types of datasets, promoter and non-promoter. The promoter sequences were taken as the positive training set and non-promoter sequences as the negative training set.
A total of 1871 entries of human promoter sequence, with a window size of 250 bp upstream and 50 bp downstream of the transcription start site (TSS), were obtained from EPD. Sequences having regions with 'N' were manually filtered out from both the training and test datasets.
We trained the model using 1000 promoter and 1500 non-promoter sequences, with a window size of 300 bp each. The negative training set of non-promoter sequences comprises 1000 intron sequences and 500 CDS.
For the sequences selected above, the n-mer properties were calculated, followed by transforming the training set (both negative and positive) into time series using chaos game representation (15). Further, the maximum Lyapunov exponent and Tsallis entropy parameters were calculated.
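The two preprocessing steps named above, n-mer frequencies and chaos game representation, can be sketched as follows. This is a rough illustration under stated assumptions: the corner assignment (A, C, G, T on the unit square) and the starting point (0.5, 0.5) follow the common convention for chaos game representation and are not confirmed by this text, the function names nmerFrequencies and chaosGame are hypothetical, and how the resulting coordinates are turned into the time series analysed in this study is not specified here.

```javascript
// Relative frequency of every overlapping n-mer in an upper-case DNA string.
function nmerFrequencies(seq, n) {
  const counts = new Map();
  const total = seq.length - n + 1;
  for (let i = 0; i < total; i++) {
    const mer = seq.slice(i, i + n);
    counts.set(mer, (counts.get(mer) || 0) + 1);
  }
  const freqs = {};
  for (const [mer, c] of counts) {
    freqs[mer] = c / total;
  }
  return freqs;
}

// Assumed corner layout for the chaos game (a common convention).
const CORNERS = { A: [0, 0], C: [0, 1], G: [1, 1], T: [1, 0] };

// Chaos game representation: start at the centre of the unit square and,
// for each base, move halfway towards that base's corner.
function chaosGame(seq) {
  let x = 0.5;
  let y = 0.5;
  const points = [];
  for (const base of seq) {
    const corner = CORNERS[base];
    if (!corner) continue; // skip ambiguous symbols such as N
    x = (x + corner[0]) / 2;
    y = (y + corner[1]) / 2;
    points.push([x, y]);
  }
  return points;
}

console.log(nmerFrequencies('ACGACG', 3));
console.log(chaosGame('AC'));
```

Each point of the chaos game depends on the whole preceding sequence, which is why the sequence of points can be treated as an ordered series rather than a bag of counts.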
The properties calculated were input into a support vector machine (SVM) algorithm to build a classification model. For validation of the machine-learning model, we used 10-fold cross-validation and the independent test data. Of the total training set, 20% of the data were used as the test dataset. The 10-fold cross-validation test was done on the remaining 80% of the training dataset. For the 10-fold cross-validation test, the training data are divided into 10 equal parts. Of these 10 parts, 9 parts are used for training and the tenth is used for testing. This is repeated 10 times so that each part is used once for testing, with the remaining 9 parts used for training. Finally, a consensus over all results is taken into consideration. The independent dataset was not part of the training dataset of the model being tested.
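The 10-fold procedure described above can be sketched as an index-splitting routine. The function name kFoldSplits is hypothetical, and for brevity the sketch assigns samples to folds by index without shuffling or stratifying by class, which a real evaluation would normally add; it illustrates the rotation of test folds only.

```javascript
// Splits sample indices 0..nSamples-1 into k folds. Each fold is used once as
// the test set while the remaining k-1 folds form the training set.
function kFoldSplits(nSamples, k) {
  const splits = [];
  for (let fold = 0; fold < k; fold++) {
    const train = [];
    const test = [];
    for (let i = 0; i < nSamples; i++) {
      // Sample i belongs to fold (i mod k); no shuffling in this sketch
      (i % k === fold ? test : train).push(i);
    }
    splits.push({ train, test });
  }
  return splits;
}

const splits = kFoldSplits(25, 10);
splits.forEach((s, fold) => {
  console.log('fold', fold, 'train size', s.train.length, 'test size', s.test.length);
});
```

Every sample is tested exactly once across the 10 rounds, so the per-fold results can be combined into a single estimate over the whole training set.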
Figure 1 shows a principal component analysis (PCA) plot for promoters and non-promoters separately. The clear separation into two clusters indicates that the descriptors calculated provide an excellent way to characterize promoters and non-promoters.
Figure 1. Principal component analysis (PCA) plot for promoters and non-promoters. The descriptors used to discriminate between promoters and non-promoters are transformed to three orthogonal axes. A clear separation between promoter and non-promoter sequences is shown in the PCA plot.
Support vector machine
We have used SVM, a supervised machine-learning technique, for discriminating between promoter and non-promoter sequences. Vapnik and co-workers (16) originally introduced this technique. SVM classifiers solve classification problems using the structural risk minimization principle. Given a training set in a vector space, SVMs find the best decision hyperplane that separates two classes (17). The quality of a decision hyperplane is determined by the distance (i.e. hard or soft margin) between two hyperplanes defined by the support vectors. The best decision hyperplane is the one that maximizes this margin. By defining the hyperplane in this fashion, SVM is able to generalize to unseen instances quite effectively. SVM extends its applicability to linearly non-separable datasets by mapping the original data vectors onto a higher dimensional space in which the data points are linearly separable. The mapping to higher dimensional spaces is done using appropriate kernels such as the Gaussian kernel and the polynomial kernel (18). In our method we have used the polynomial kernel for this purpose.
Two main motivations suggest the use of SVMs in computational biology. First, many biological problems involve high-dimensional, noisy data, for which SVMs are known to behave well compared with other statistical or machine-learning methods. Second, in contrast to most machine-learning methods, kernel methods such as the SVM can easily handle non-vector inputs, such as variable length sequences or graphs.
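The idea that a kernel stands in for a mapping to a higher dimensional space can be shown with a small polynomial kernel sketch. The parameterization K(x, y) = (gamma * x.y + coef0)^degree is the common textbook form; the degree, gamma and coef0 values used in this study are not stated, so the defaults and the function names polynomialKernel and phi below are assumptions.

```javascript
// Polynomial kernel K(x, y) = (gamma * <x, y> + coef0)^degree.
function polynomialKernel(x, y, degree, gamma = 1, coef0 = 1) {
  if (x.length !== y.length) {
    throw new Error('vectors must have equal length');
  }
  let dot = 0;
  for (let i = 0; i < x.length; i++) {
    dot += x[i] * y[i];
  }
  return Math.pow(gamma * dot + coef0, degree);
}

// For one-dimensional inputs and degree 2, the kernel equals an ordinary dot
// product after the explicit mapping phi(x) = [x^2, sqrt(2) * x, 1].
function phi(x) {
  return [x * x, Math.SQRT2 * x, 1];
}

const viaKernel = polynomialKernel([2], [3], 2);
const p = phi(2);
const r = phi(3);
const viaMapping = p[0] * r[0] + p[1] * r[1] + p[2] * r[2];
console.log(viaKernel, viaMapping); // equal up to floating-point rounding
```

The kernel computes the inner product in the three-dimensional feature space without ever constructing phi, which is what lets an SVM separate data that are not linearly separable in the original space.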
Prediction accuracy
In order to present the significance of non-linear time series descriptors, two different models were built using 1000 promoter and 1500 non-promoter sequences, with a window size of 300 bp each. In model 1, time series properties are calculated, and in model 2, only n-mer frequencies are calculated. Our primary experimental results are summarized in Table 1, in which the percentage of correct values, the correlation coefficient and the value of the kappa statistic are given. Kappa is used as a measure of agreement between two raters (here, the predicted and the actual classes). The value of kappa is always at most 1. A value of 1 implies perfect agreement and values <1 imply less than perfect agreement (19).
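As a brief illustration of the kappa statistic, the sketch below computes Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement), from a confusion matrix. The function name cohenKappa and the example counts are made up for illustration and are not taken from Table 1.

```javascript
// Cohen's kappa from a square confusion matrix m, where m[i][j] is the number
// of samples of actual class i predicted as class j.
function cohenKappa(m) {
  const n = m.flat().reduce((s, v) => s + v, 0);
  let observed = 0; // proportion of samples on the diagonal
  let expected = 0; // agreement expected by chance from the marginals
  for (let i = 0; i < m.length; i++) {
    const rowSum = m[i].reduce((s, v) => s + v, 0);
    const colSum = m.reduce((s, row) => s + row[i], 0);
    observed += m[i][i] / n;
    expected += (rowSum / n) * (colSum / n);
  }
  return (observed - expected) / (1 - expected);
}

// Hypothetical counts: rows are actual promoter / non-promoter,
// columns are predicted promoter / non-promoter.
const example = [
  [20, 5],
  [10, 15]
];
console.log(cohenKappa(example).toFixed(2));
```

With these counts the observed agreement is 0.7 and the chance agreement is 0.5, so kappa is 0.4; a classifier that is always right gives exactly 1.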
Table 1. Results of models built for promoter prediction
In order to test the prediction accuracy of the above model, three different test sets were used:
800 known promoter sequences,
20 experimentally verified promoters of human chromosome 22 and
1000 intron sequences.
The 800 promoter and 1000 intron sequences used for validating our model were retrieved from EPD, whereas the 20 experimentally verified promoter sequences were retrieved from GenBank/EMBL/EPD. The details of these experimentally verified promoters are available in Table 2. The test sets were completely independent from the training set.
Table 2. List of experimentally verified promoters on human chromosome 22
The high percentage of correct values, correlation coefficient and value of the kappa statistic for model 1 clearly indicate that the time series descriptors calculated here are capable of discriminating between promoter and non-promoter regions (Table 3). The promoter and intron sequences for testing the model accuracy were also taken from EPD, but these data were not part of the datasets used for training.
Table 3. Prediction done using the above models
Comparison with existing methods
There are several different tools used for promoter prediction, e.g. Neural Network Promoter Prediction, Soft Berry, Dragon Promoter Finder, Promoter 2.0 Prediction Server and Promoter Scan. For benchmarking of our method, we compared it with some of the promoter prediction tools available on-line. In order to check the prediction accuracy, a sample of 100 sequences was taken from EPD, comprising equal numbers of randomly chosen promoter and intron sequences. The results shown in Table 4 clearly indicate that the prediction accuracy of our software is high in comparison with the other tools.
Table 4. Program accuracy
Conclusion
The successful prediction of promoters with high accuracy using time series descriptors clearly indicates that the novel method holds promise as an approach for eukaryotic promoter prediction. The experience gained from the above example shows that n-mer frequencies and non-linear time series descriptors, used along with non-linear machine-learning algorithms, are quite suitable for classifying promoter and non-promoter regions.
The main aim of this project is to develop an efficient tool that can discriminate between promoter and non-promoter regions in a given sequence with high accuracy. The high accuracy of the program indicates that the novel approach can be further used for the prediction of eukaryotic Pol II promoters in entire chromosomes. We are currently applying this method to estimate the number of promoters in different chromosomes of the human genome. Another challenge being addressed is the localization of promoters, rather than a simple classification similar to the one at present. We hope that the promising results using novel descriptors will improve the performance of biomolecular sequence analysis and promoter prediction in particular.
Supplementary material
Supplementary material is available at NAR Online.
Acknowledgements
We thank Sandeep Saxena and Dr Shefali Bharti for their valuable comments on the work. The Open Access publication charges for this article were waived by Oxford University Press.
References
1. Matthias, S., Andreas, K., Kornelie, F., Kerstin, Q., Ralf, S., Korbinian, G., Matthias, F., Vale'rie, G.D., Alexander, S., Ruth, B.W., Thomas, W. (2001) First pass annotation of promoters on human chromosome 22. Genome Res., 11, 333-340.
2. Bucher, P. (1990) Weight matrix description of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol., 212, 563-578.
3. Fickett, J.W. and Hatzigeorgiou, A.C. (1997) Eukaryotic promoter recognition. Genome Res., 7, 861-878.
4. Knudsen, S. (1999) Promoter 2.0: for recognition of Pol II promoter sequences. Bioinformatics, 15, 356-361.
5. Liu, H.F., Yang, Y.Z., Dai, Z.H., Yu, Z.H. (2003) The largest Lyapunov exponent of chaotic dynamical system in scale space and its application. Chaos, 13, 839-844.
6. Bajic, V.B., Seah, S.H., Chong, A., Zhang, G., Koh, J.L.Y., Brusic, C.V. (2001) Recognition of vertebrate RNA polymerase II promoters. Bioinformatics, 18, 198-199.
7. Hao, B.L. (2000) Fractals from genome: exact solutions of a biology-inspired problem. Physica A, 282, 225-246.
8. Fofanov, Y., Luo, Y., Katili, C., Wang, J., Belosludtsev, Y., Powdrill, T., Belapurkar, C., Fofanov, V., Li, T.B., Chumakov, S., Pettitt, B.M. (2004) How independent are the appearances of n-mers in different genomes. Bioinformatics, 20,
9. Schmitt, A.O. and Herzel, H. (1997) Estimating the entropy of DNA sequences. J. Theor. Biol., 7, 369-377.
10. Sandri, M. (1996) Numerical calculation of Lyapunov exponents. Math. J., 6, 78-84.
11. Roberto, J.V.S. (1997) Generalization of Shannon's theorem for Tsallis entropy. J. Math. Phys., 38,
12. Abe, S. and Suzuki, N. (2003) Itineration of the Internet over nonequilibrium stationary states in Tsallis statistics. Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 67, 016106.
13. Plastino, A. and Plastino, A.R. (1999) Tsallis entropy and Jaynes' information theory formalism. Braz. J. Phys., 29, 50-60.
14. Perier, R.C., Junier, T., Bucher, P. (1998) The eukaryotic promoter database. Nucleic Acids Res., 26, 353-357.
15. Almeida, J.S., Carrico, J.A., Maretzek, A., Nobel, P.A., Fletcher, M. (2001) Analysis of genomic sequences by chaos game representation. Bioinformatics, 17, 429-437.
16. Boser, B.E., Guyon, I.M., Vapnik, V.N. (1992) A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, Pittsburgh, PA. ACM Press, pp. 144-152.
17. Yuan, X., Buckles, B.P., Zhang, J. (2003) A comparison study of decision tree and SVM to classify gene sequence. Electrical Engineering and Computer Science Department, Tulane University.
18. Gordon, C., Chervonenkis, A.Y., Gammerman, A.J., Shahmuradov, I.A., Solovyev, V.V. (2003) Sequence alignment kernel for recognition of promoter regions. Bioinformatics, 15,
19. Feinstein, A.R. and Cicchetti, D.V. (1990) High agreement but low kappa: the problems of two paradoxes. J. Clin. Epidemiol., 43, 543-549.