
R Lecture Notes (Including All Kinds of Regression)
Why R?
- Free (no licence fees and no commercial interests).
- Open source: the code can be read and modified (not a black box and not a miser; transparency is the best guard against "corruption").
- Easy to learn, and programmable, so complex projects can be carried out.
- Extensible: thousands of contributed packages, covering different fields, purposes, and methods, help you reach your goal; you can also contribute your own methods.
- Powerful (graphics, an excellent built-in help system, the support of the R community, constant updates and fixes).
- No commercial package offers so many algorithms, or such new ones.
- Applied statisticians around the world mostly implement their methods in R first and try to put them on the R website.
- In a little over a year the number of packages on the R site roughly tripled, from nearly 1,000 to over 3,000; most come with functions and example data for computation, demonstration, and input/output.
- Unless it receives a huge subsidy (or uses pirated copies forever), a public university has no reason to teach commercial software.
- The great majority of American statistics graduate students know R (Berkeley teaches R to its statistics and applied mathematics undergraduates).
- A large part of my own data-analysis knowledge comes from R.
- If I could learn it, and promote and popularize it everywhere, you will surely do better!

Downloading R: at http://www.r-project.org/ click CRAN to get a list of mirror sites; pick a mirror (Berkeley, say), choose base to download the installer, and download packages from the same page.
Packages (each comes with plenty of data and with functions/programs you can read, run, and modify):
- base: The R Base Package
- boot: Bootstrap R (S-Plus) Functions (Canty)
- class: Functions for Classification
- cluster: Cluster Analysis Extended Rousseeuw et al.
- concord: Concordance and reliability
- datasets: The R Datasets Package
- exactRankTests: Exact Distributions for Rank and Permutation Tests
- foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ...
- graphics: The R Graphics Package
- grDevices: The R Graphics Devices and Support for Colours and Fonts
- grid: The Grid Graphics Package
- KernSmooth: Functions for kernel smoothing for Wand & Jones (1995)
- lattice: Lattice Graphics
- MASS: Main Package of Venables and Ripley's MASS
- methods: Formal Methods and Classes
- mgcv: GAMs with GCV smoothness estimation and GAMMs by REML/PQL
- multtest: Resampling-based multiple hypothesis testing
- nlme: Linear and nonlinear mixed effects models
- nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models
- nortest: Tests for Normality
- outliers: Tests for outliers
- pls: Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR)
- pls.pcr: PLS and PCR functions
- rpart: Recursive Partitioning
- SAGx: Statistical Analysis of the GeneChip
- sma: Statistical Microarray Analysis
- spatial: Functions for Kriging and Point Pattern Analysis
- splines: Regression Spline Functions and Classes
- stats: The R Stats Package
- stats4: Statistical Functions using S4 Classes
- survival: Survival analysis, including penalised likelihood
- tcltk: Tcl/Tk Interface
- tools: Tools for Package Development
- utils: The R Utils Package

Many more packages are available online, and all of them can be downloaded freely. The packages shipped with base R contain the most commonly used functions and data; the other packages contain methods and data developed by statisticians in every direction of the field. I hope you will be the next author to join these packages.

Installing packages:
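A minimal sketch of the usual workflow, using quantreg as an arbitrary example package (any CRAN package name works the same way):

install.packages("quantreg")   # download and install from a CRAN mirror
library(quantreg)              # load the package into the current session
update.packages()              # update all installed packages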
Save the workspace on exit?
If you answer yes, your results (assigned variables, functions, and so on) are saved in a file named .RData and reloaded the next time you start R; delete that file if you do not want them back. In practice, unless you are working on a large project that spans many sessions, you usually do not want to save. Keep the code you used as a script file instead (*.R; note that you must type the ".R" yourself, there is no default extension).

A few useful functions (a function is written name(arguments), e.g. f(x)):
getwd()
setwd(dir = "f:/2010stat") # or setwd("f:/2010stat")
getwd()
x=rnorm(100)
ls()
?rnorm # or help(rnorm)
apropos("norm")
identical(1:10,1:10)
identical(1:10,as.numeric(1:10))
identical(1:10,as.integer(1:10))

Assignment and arithmetic:
z = rnorm(100)
median(z)
# assignment: "=" may be replaced by "<-"; "->" assigns to the right
x <- z -> y -> w
Simple arithmetic operators: +, -, *, /, ^, %*% (matrix product), %% (mod), %/% (integer division), and so on.
Common mathematical functions: abs, sign, log, log2, log10, logb, expm1, log1p, sqrt, exp, sin, cos, tan, acos, asin, atan, cosh, sinh, tanh;
round, floor, ceiling;
gamma, lgamma, digamma and trigamma;
sum, prod, cumsum, cumprod;
max, min, cummax, cummin, pmax, pmin, range;
mean, length, var, duplicated, unique;
union, intersect, setdiff;
comparison and logical operators: <, <=, >, >=, ==, !=, !, &, | (listed roughly from high to low precedence).

Some basic examples:
x=1:100
(x=1:100)
sample(x,20)
set.seed(0);sample(1:10,3) # the random seed!
z=sample(1:100)
z[1:10] # vector subscripts
y=c(1,3,7,3,4,2)
z[y]

z=sample(x,20,rep=T)
z
(z1=unique(z));length(z1)
z=sample(x,100,rep=T)
xz=setdiff(x,z)
sort(union(xz,z))
sort(union(xz,z))==x
setequal(union(xz,z),x)
intersect(1:10,7:50)
sample(1:100,20,prob=1:100)

pi * 10^2 # see ?"*" for the basic arithmetic operators
"*"(pi, "^"(10, 2))
pi * (1:10)^2
x <- pi * 10^2
x
print(x)
(x=pi * 10^2)
pi^(1:5)
print(x, digits = 12)
class(x)
typeof(x)

class(cars)
typeof(cars)
names(cars)
summary(cars)
str(cars)
row.names(cars)
class(dist ~ speed)
plot(dist ~ speed,cars)

head(cars) # cars[1:6,]
tail(cars)
ncol(cars);nrow(cars)
dim(cars)
lm(dist ~ speed, data = cars)
cars$qspeed = cut(cars$speed, breaks = quantile(cars$speed), include.lowest = TRUE)
names(cars)
cars[3]
table(cars[3])
is.factor(cars$qspeed)
plot(dist ~ qspeed, data = cars)
(a=lm(dist ~ qspeed, data = cars))
summary(a)

x <- round(runif(20,0,20), digits=2)
summary(x)
min(x);max(x)
median(x) # median
mean(x) # mean
var(x) # variance
sd(x) # standard deviation
sqrt(var(x))
rank(x) # ranks
order(x)
x[order(x)]
sort(x)
sort(x,decreasing=T) # sort(x,dec=T)
sum(x);length(x)
round(x)

fivenum(x) # five-number summary
quantile(x) # quantiles (a different convention; several definitions exist)
quantile(x, c(0,.33,.66,1))
mad(x) # median absolute deviation (scaled); see ?mad
cummax(x)
cummin(x)
cumprod(x)
cor(x,sin(x/20)) # correlation

# histogram
x <- rnorm(200)
hist(x, col = "light blue")
rug(x)
# stem-and-leaf plot
stem(x)
# scatterplot
N <- 500
x <- rnorm(N)
y <- x + rnorm(N)
plot(y ~ x)
a=lm(y~x)
abline(a,col="red") # or abline(lm(y~x),col="red")
print("Hello World!")
paste("min of x = ", min(x))
#cat("\\end{document}\n", file="RESULT.tex", append=TRUE)
demo(graphics) # graphics demonstration

# complex arithmetic
x=2+3i
(z <- complex(real=rnorm(10), imaginary=rnorm(10)))
complex(re=rnorm(3),im=rnorm(3))
Re(z)
Im(z)
Mod(z)
Arg(z)
choose(3,2);factorial(6)
# solving an equation
f = function(x) x^3-2*x-1
uniroot(f,c(0,2)) # iterative root finding
# if the root is known to be an extremum
f = function(x) x^2+2*x+1
optimize(f,c(-2,2))

Distributions and random numbers:
Normal: pnorm(1.2,2,1); dnorm(1.2,2,1); qnorm(.7,2,1); rnorm(10,0,1) # rnorm(10)
t: pt(1.2,1); dt(1.2,2); qt(.7,1); rt(10,1)
There are also the exponential, F, chi-square, Beta, binomial, Cauchy, Gamma, geometric, hypergeometric, log-normal, logistic, negative binomial, Poisson, uniform, Weibull, and Wilcoxon distributions, among others. The arguments may be vectors!
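As a brief illustration (a hedged sketch; all of these are standard base R functions), the same d/p/q/r naming convention works for every distribution just listed:

dpois(2, lambda = 4)              # P(X = 2) for Poisson with mean 4
pexp(1, rate = 2)                 # P(X <= 1) for Exponential(rate 2)
qchisq(0.95, df = 3)              # 0.95 quantile of chi-square with 3 df
rbinom(5, size = 10, prob = 0.3)  # five draws from Binomial(10, 0.3)
pnorm(c(-1.96, 0, 1.96))          # vector argument: three probabilities at once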
A pitfall you may run into:
a=factor(letters[1:10])
a[3]="w" # fails: "w" is not one of the levels (gives NA with a warning)
a=as.character(a)
a[3]="w"
a=factor(a)
a

Reading and writing data:
x=scan()
1.5 2.6 3.7 2.1 8.9 12 -1.2 -4
# equivalent to x=c(1.5,2.6,3.7,2.1,8.9,12,-1.2,-4)
w=read.table(file.choose(),header=T) # choose the file from a dialog
setwd("f:/2010stat") # or setwd("f:\\2010stat")
(x=rnorm(20))
write(x,"f:/2010stat/test.txt")
y=scan("f:/2010stat/test.txt");y
y=y[1:5];str(y)
write.table(y,"f:/2010stat/test.txt",row.names=F)
w=read.table("f:/2010stat/test.txt",header=T)
str(w)
write.csv(y,"f:/2010stat/test.csv")
v=read.csv("f:/2010stat/test.csv")
str(v)
data=read.table("clipboard")
write.table(data,"clipboard")

Sequences and vectors:
z=seq(-1,10,length=100) # z=seq(-1,10,len=100)
z=seq(10,-1,-1) # z=10:-1
x=rep(1:3,3)
x=rep(3:5,1:3)
x=rep(c(1,10),c(4,5))
w=c(1,3,x,z);w[3]
x=rep(0,10);z=1:3;x+z
x*z
rev(x)
z=c("no cat","has ","nine","tails")
z[1]=="no cat"
z=1:5
z[7]=8;z
z=NULL
z[c(1,3,5)]=1:3; z
rnorm(10)[c(2,5)]
z[-c(1,3)] # drop elements 1 and 3
z=sample(1:100,10);z
which(z==max(z)) # gives the index

Vectors and matrices:
x=sample(1:100,12);x
all(x>0);all(x!=0);any(x>0);(1:10)[x>0]
diff(x)
diff(x,lag=2)
x=matrix(1:20,4,5);x
x=matrix(1:20,4,5,byrow=T);x
t(x)
x=matrix(sample(1:100,20),4,5)
2*x
x+5
y=matrix(sample(1:100,20),5,4)
x+t(y)
(z=x%*%y)
z1=solve(z) # solve(a,b) solves the equation ax=b
z1%*%z
round(z1%*%z,14)

Matrices:
nrow(x); ncol(x);dim(x) # numbers of rows and columns
x=matrix(rnorm(24),4,6)
x[c(2,1),] # rows 2 and 1
x[,c(1,3)] # columns 1 and 3
x[2,1] # element [2,1]
x[x[,1]>0,1] # elements of column 1 that are greater than 0
sum(x[,1]>0) # how many elements of column 1 are greater than 0
sum(x[,1]<=0) # how many elements of column 1 are not greater than 0
x[,-c(1,3)] # x without columns 1 and 3
diag(x)
diag(1:5)
diag(5)
x[-2,-c(1,3)] # x without row 2 and columns 1 and 3
x[x[,1]>0 & x[,3]<=1,1] # column-1 elements where column 1 > 0 AND column 3 <= 1
x[x[,2]>0 | x[,1]<.51,1] # column-1 elements where column 1 < .51 OR column 2 > 0
x[!x[,2]<.51,1] # column-1 elements where column 2 is NOT < .51
apply(x,1,mean);apply(x,2,sum)

Matrices / higher-dimensional arrays:
# upper and lower triangles
x=matrix(rnorm(24),4,6)
diag(x)
diag(1:5)
diag(5)
x[lower.tri(x)]=0 # x[upper.tri(x)]=0; diag(x)=0
x=array(runif(24),c(4,3,2));x
is.matrix(x) # dim(x) gives the dimensions (4,3,2)
is.matrix(x[1,,])
x=array(1:24,c(4,3,2))
x[c(1,3),,]
x=array(1:24,c(4,3,2))
apply(x,1,mean)
apply(x,1:2,sum)
apply(x,c(1,3),prod)

Matrices / arrays / scale:
# operations between a matrix and a vector
x=matrix(1:20,5,4)
sweep(x,1,1:5,"*")
x*1:5
sweep(x,2,1:4,"+")
(x=matrix(sample(1:100,24),6,4));(x1=scale(x))
(x2=scale(x,scale=F)); (x3=scale(x,center=F))
round(apply(x1,2,mean),14)
apply(x1,2,sd)
round(apply(x2,2,mean),14);apply(x2,2,sd)
round(apply(x3,2,mean),14);apply(x3,2,sd)

Data frames:
x=matrix(1:6,2,3)
z=data.frame(x);z
z$X2
attributes(z)
names(z)=c("TOYOTA","GM","HUNDA")
row.names(z)=c("2001","2002")
z
attach(z)
GM
detach(z)
GM # error: GM is no longer visible
sapply(z,is.numeric) # apply(z,2,is.numeric)

Missing values and friends:
airquality
complete.cases(airquality) # which rows have no missing values
which(complete.cases(airquality)==F)
sum(complete.cases(airquality))
na.omit(airquality)
# append, cbind, rbind
x=1:10;x[12]=3
(x1=append(x,77,after=5))
cbind(1:3,4:6);rbind(1:3,4:6)
# remove duplicated rows of a matrix
(x=rbind(1:5,runif(5),runif(5),1:5,7:11))
x[!duplicated(x),]
unique(x)

Lists:
# a list can collect objects of any kind (including lists)
z=list(1:3,Tom=c(1:2, a=list("R",letters[1:5]),w="hi!"))
z[[1]];z[[2]]
z$T # partial matching of the name Tom
z$T$a2
z$T[[3]]
z$T$w
attributes(airquality) # attributes!
airquality$Ozone
attributes(matrix(1:6,2,3))

Categorical data. A survey asks people if they smoke or not. The data are Yes, No, No, Yes, Yes:
x=c("Yes","No","No","Yes","Yes")
table(x);x
factor(x)

Barplot: suppose a group of 25 people is surveyed about their beer-drinking preference. The categories are (1) Domestic can, (2) Domestic bottle, (3) Microbrew and (4) Import.
The raw data are 3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1:
beer = scan()
3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1

barplot(beer) # this isn't correct
barplot(table(beer)) # yes, call with summarized data
barplot(table(beer)/length(beer)) # divide by n for proportions
table(beer)/length(beer)

Tables / categorical data:
library(MASS)
quine
attach(quine)
table(Age)
table(Sex, Age); tab=xtabs(~ Sex + Age, quine); unclass(tab)
tapply(Days, Age, mean)
tapply(Days, list(Sex, Age), mean)
# apply, sapply, tapply, lapply
smokes = c("Y","N","N","Y","N","Y","Y","Y","N","Y")
amount = c(1,2,2,3,3,1,2,1,3,2)
(tmp=table(smokes,amount)) # store the table
options(digits=3) # only print 3 decimal places
prop.table(tmp,1) # the rows sum to 1 now
prop.table(tmp,2) # the columns sum to 1 now
# the two lines above are equivalent to the two below
sweep(tmp, 1, margin.table(tmp, 1), "/")
sweep(tmp, 2, margin.table(tmp, 2), "/")
prop.table(tmp) # all the numbers sum to 1
options(digits=7) # restore the number of digits

array/matrix, table, and data.frame:
## Start with a contingency table.
ftable(Titanic, row.vars = 1:3)
ftable(Titanic, row.vars = 1:2)
w=data.frame(Titanic) # turn the array into a data frame
a=xtabs(Freq~Survived+Sex, w)
biplot(corresp(a, nf=2)) # one application (correspondence analysis from MASS)
## Start with a data frame.
str(mtcars)
x <- ftable(mtcars[c("cyl", "vs", "am", "gear")])
x # an array whose dimensions are ordered ("cyl", "vs", "am", "gear")
ftable(x, row.vars = c(2, 4)) # choose the table's row variables from x (an array)
## Start with expressions, use table()'s "dnn" to change labels
ftable(mtcars$cyl, mtcars$vs, mtcars$am, mtcars$gear, row.vars = c(2, 4),
       dnn = c("Cylinders", "V/S", "Transmission", "Gears"))
ftable(vs~carb,mtcars) # vs gives the columns, carb the rows
# or ftable(mtcars$vs~mtcars$carb)
ftable(carb~vs,mtcars) # vs gives the rows, carb the columns
ftable(mtcars[,c(8,11)]) # equivalent to ftable(carb~vs,mtcars) above
ftable(breaks~wool+tension,warpbreaks)
# as.data.frame
(DF <- as.data.frame(UCBAdmissions)) # equivalent to data.frame(UCBAdmissions)
xtabs(Freq ~ Admit+ Gender + Dept, DF) # back from the data frame to the contingency table
(a=xtabs( Freq~ Admit + Gender, data=DF)) # with no frequencies (weights), the left side is left empty

Writing functions:
ss=function(n=100){z=2;for (i in 2:n)if(any(i%%2:(i-1)==0)==F)z=c(z,i);return(z)} # primes up to n
fix(ss)
ss()
t1=Sys.time()
ss(10000)
Sys.time()-t1
system.time(ss(10000))
# return may be omitted, in which case the last value is the returned value;
# to return several values it is best to use a list

Several plots on one page:
par(mfrow=c(2,4)) # par(mfcol=c(2,4))
layout(matrix(c(1,1,1,2,3,4,2,3,4),nr=3,byrow=T))
hist(rnorm(100),col="Red",10)
hist(rnorm(100),col="Blue",8)
hist(rnorm(100),col="Green")
hist(rnorm(100),col="Brown")
# par(mar = c(bottom, left, top, right)) sets the margins
# default c(5, 4, 4, 2) + 0.1 (in lines of text)

About plotting:
spring= data.frame(compression=c(41,39,43,53,42,48,47,46),
                   distance=c(120,114,132,157,122,144,137,141))
attach(spring) # (Hooke's law: f=.5ks)
par(mfcol=c(2,2))
plot(distance ~ compression)
plot(distance ~ compression,type="l")
plot(compression, distance,type="o")
plot(compression, distance,type="b")

par(mfrow=c(2,2)) # prepare a 2x2 layout of 4 plots
plot(compression, distance,main= "Hooke's Law") # title only
plot(compression, distance,main= "Hooke's Law", xlab= "x",ylab= "y") # title + axis labels
identify(compression,distance) # label points by clicking on them
plot(compression, distance,main="Hooke's Law") # title only
text(46,120, expression(f==frac(1,2)*k*s)) # write text at a given position
plot(compression, distance,main="Hooke's Law") # title only
text(locator(2), c("I am here!","you are there!")) # write text at the two clicked positions
par(mfrow=c(1,1))
plot(1:10,sin(1:10),type="l",lty=2,col=4,
     main=paste(strwrap("The title is too long, and I hate to make it shorter, !@#$%^&*",width=50),collapse="\n"))
legend(1.2,1.0,"Just a sine",lty=2,col=4)

library(MASS);data(Animals);attach(Animals)
par(mfrow=c(2,2))
plot(body, brain)
plot(sqrt(body), sqrt(brain))
plot((body)^0.1, (brain)^0.1)
plot(log(body),log(brain)) # or plot(brain~body,log="xy")
par(mfrow=c(1,1))
par(cex=0.7,mex=0.7) # character (cex) & margin (mex) expansion
plot(log(body),log(brain))
text(x=log(body), y=log(brain),labels=row.names(Animals), adj=1.5) # adj=0 implies left adjusted text
plot(log(body),log(brain))
identify(log(body),log(brain),row.names(Animals))

Plotting (symbols: colour, size, shape):
plot(1,1,xlim=c(1,7.5),ylim=c(0,5),type="n") # do not plot points
points(1:7,rep(4.5,7),cex=seq(1,4,l=7),col=1:7, pch=0:6)
text(1:7,rep(3.5,7),labels=paste(0:6,letters[1:7]),cex=seq(1,4,l=7), col=1:7)
points(1:7,rep(2,7), pch=(0:6)+7) # plot symbols 7 to 13
text((1:7)+0.25, rep(2,7), paste((0:6)+7)) # label with symbol number
points(1:7,rep(1,7), pch=(0:6)+14) # plot symbols 14 to 20
text((1:7)+0.25, rep(1,7), paste((0:6)+14)) # labels with symbol number
# palettes
par(mfrow=c(2,4))
palette(); barplot(rnorm(15,10,3),col=1:15)
palette(rainbow(15));barplot(rnorm(15,10,3),col=1:15)
palette(heat.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(terrain.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(topo.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(cm.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(gray(seq(0,0.9,l=15)));barplot(rnorm(15,10,3),col=1:15)
palette(grey(seq(0,0.5,l=15)));barplot(rnorm(15,10,3),col=1:15)
palette("default")
par(mfrow=c(1,1))

# matplot
sines=outer(1:20,1:4,function(x, y) sin(x/20*pi*y))
matplot(sines, pch = 1:4, type = "o", col = rainbow(ncol(sines)))
# legend
x <- seq(-pi, pi, len = 65)
plot(x, sin(x), type = "l", ylim = c(-1.2, 1.8), col = 3, lty = 2)
points(x, cos(x), pch = 3, col = 4)
lines(x, tan(x), type = "b", lty = 1, pch = 4, col = 6)
title("legend(..., lty = c(2, -1, 1), pch = c(-1,3,4), merge = TRUE)",
      cex.main = 1.1)
legend(-1, 1.9, c("sin", "cos", "tan"), col = c(3,4,6),
       lty = c(2, -1, 1), pch = c(-1, 3, 4), merge = TRUE, bg='gray90')

# barplot and table
par(mfrow=c(2,2))
tN=table(Ni=rpois(100, lambda=5));tN
r=barplot(tN, col='gray')
lines(r, tN, type='h', col='red', lwd=2) # type = "h" plotting *is* 'bar'plot
barplot(tN, space = 1.5, axisnames=FALSE,
        sub = "barplot(..., space=1.5, axisnames = FALSE)") # space=1.5 leaves gaps between bars
barplot(tN, space = 0, axisnames=FALSE,
        sub = "barplot(..., space=0, axisnames = FALSE)")
pie(tN) # pie chart
par(mfrow=c(1,1))
# adding a grid
plot(1:3)
grid(10, 5, lwd = 2)

Plotting (pairs / three dimensions):
# pairs # data(iris)
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21,
      bg = c("red", "green3", "blue")[unclass(iris$Species)])
# iris is 150x5; this is the scatterplot matrix of the 4 numeric variables
# (the last column is the factor iris$Species)
# stars # data(mtcars)
stars(mtcars[, 1:7], key.loc = c(14, 1.5),
      main = "Motor Trend Cars : full stars()", flip.labels=FALSE)
# mtcars is 32x11; only the first 7 numeric variables are used
# persp
x <- seq(-10, 10, length= 30)
y <- x
f <- function(x,y) { r <- sqrt(x^2+y^2); 10 * sin(r)/r }
z <- outer(x, y, f)
z[is.na(z)] <- 1
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue")
data(volcano)
par(mfrow=c(2,2))
z <- 2 * volcano # exaggerate the relief
x <- 10 * (1:nrow(z)) # 10 metre spacing (S to N)
y <- 10 * (1:ncol(z)) # 10 metre spacing (E to W)
## Don't draw the grid lines: border = NA
# par(bg = "slategray")
persp(x, y, z, theta = 135, phi = 30, col = "green3", scale = FALSE,
      ltheta = -120, shade = 0.75, border = NA, box = FALSE)
par(bg= "white")

Plotting (three dimensions):
# contour
rx <- range(x <- 10*1:nrow(volcano))
ry <- range(y <- 10*1:ncol(volcano))
ry <- ry + c(-1,1) * (diff(rx) - diff(ry))/2
tcol <- terrain.colors(12)
opar <- par(pty = "s", bg = "lightcyan")
plot(x = 0, y = 0, type = "n", xlim = rx, ylim = ry, xlab = "", ylab = "")
u <- par("usr")
rect(u[1], u[3],
     u[2], u[4], col = tcol[8], border = "red") # rect draws a rectangle
contour(x, y, volcano, col = tcol[2], lty = "solid", add = TRUE,
        vfont = c("sans serif", "plain"))
title("A Topographic Map of Maunga Whau", font = 4)
abline(h = 200*0:4, v = 200*0:4, col = "lightgray", lty = 2, lwd = 0.1)
par(opar)
# image
x <- 10*(1:nrow(volcano))
y <- 10*(1:ncol(volcano))
image(x, y, volcano, col = terrain.colors(100), axes = FALSE)
contour(x, y, volcano, levels = seq(90, 200, by=5), add = TRUE, col = "peru")
axis(1, at = seq(100, 800, by = 100))
axis(2, at = seq(100, 600, by = 100))
box()
title(main = "Maunga Whau Volcano", font.main = 4)
par(mfrow=c(1,1))

Working with several windows:
x11()
plot(1:10)
x11()
plot(rnorm(10))
dev.set(dev.prev())
abline(0,1) # through the 1:10 points
dev.set(dev.next())
abline(h=0, col="gray") # for the residual plot
dev.set(dev.prev())
dev.off(); dev.off() # close the two X devices
# dev.list()

Miscellaneous plots:
# simulated Brownian motion
n=100;x=cumsum(rnorm(100));y=cumsum(rnorm(100));plot(x,y,type="l")
x=0;y=0;plot(x,y,ylim=c(-15,15),xlim=c(-15,15)) # in slow motion
for(i in 1:200){x1=x+rnorm(1);y1=y+rnorm(1)
  segments(x,y,x1,y1);x=x1;y=y1
  Sys.sleep(.05)}
# point size proportional to the response
x=1:10;y=runif(10)
symbols(x,y,circle=y/2,inches=F,bg=x)
# a normal Q-Q plot for every column of a data frame
table=data.frame(x1=rnorm(100),x2=rnorm(100,1,1))
par(ask=TRUE) # wait for confirmation before changing pages
results=apply(table,2,qqnorm)
par(ask=FALSE)
# add a small plot inside a plot
x=rnorm(100)
hist(x)
op=par(fig=c(.02,.5,.5,.98),new=TRUE)
boxplot(x)
# mathematical annotation
x=1:10;plot(x,type="n")
text(3,2,expression(paste("Temperature(",degree,"C) in 2003")))
text(4,4,expression(bar(x)==sum(frac(x[i],n),i==1,n)))
text(6,6,expression(hat(beta)==(X^t*X)^{-1}*X^t*y))
text(8,8,expression(z[i]==sqrt(x[i]^2+y[i]^2)))

Changing case:
x=c("I","am","A","BIG", "Cat")
tolower(x)
toupper(x)

Lecture notes on statistical models in R

# basics
x=rnorm(20,10)
t.test(x,m=9,alt="greater")
t.test(x[1:10],m=9,alt="greater")$p.value
t.test(x,conf.level=.90)$conf
x=rnorm(30,10);y=rnorm(30,10.1)
t.test(x,y,alt="less")
library(TeachingDemos)
ci.examp()
run.ci.examp()
vis.boxcox()
vis.boxcoxu()

Regression and correlation:
# correlation
x=rnorm(20);y=rnorm(20)
cor(x,y)
cor(x,y,method="kendall"); cor(x,y,method="spearman")
cor.test(x,y); cor.test(x,y,method="kendall"); cor.test(x,y,method="spearman")
cor.test(x,y,method="kendall")$p.value
# correlated or not?
x=rnorm(3);y=rnorm(3);cor(x,y);cor.test(x,y)$p.value
library(TeachingDemos)
put.points.demo()

Basic principles:
set.seed(100)
x1=rnorm(100);x2=rnorm(100);eps=rnorm(100)
y=5+2*x1-3*x2+eps
a=lm(y~x1+x2)
(lm(y~0+x1+x2)) # no intercept: equivalent to (lm(y~-1+x1+x2))
summary(a);anova(a)
names(a)
shapiro.test(a$res)
qqnorm(a$res);qqline(a$res)
# the underlying algebra
x=cbind(1,x1,x2)
dim(x)
b=solve(t(x)%*%x)%*%t(x)%*%y
b
a$coe

Example 1: cross.txt (the slide shows a scatterplot of y against x):
w=read.table("cross.txt",header=T)
head(w)
plot(y~x,w);summary(w)
a=lm(y~x+z,w)
summary(a)
anova(a)
qqnorm(a$res);qqline(a$res)
shapiro.test(a$res)
a1=lm(y~x*z,w)
summary(a1);anova(a1)
qqnorm(a1$res);qqline(a1$res)
shapiro.test(a1$res)
anova(a,a1)
library(party) # a simpler route
wt=mob(y~x|z,data=w)
coef(wt);plot(wt)
plot(y~x,w);abline(coef(wt)[1,],col=2);abline(coef(wt)[2,],col=4)

Poison experiment. The data give the survival times (in 10 hour units) in a 3 x 4 factorial experiment, the factors being (a) three poisons and (b) four treatments. Each combination of the two factors is used for four animals, the allocation to animals being completely randomized. Box, G. E. P., and Cox, D. R. (1964). An analysis of transformations (with Discussion). J. R. Statist. Soc.
B, 26, 211-252. http://www.statsci.org/data/general/poison.html

Example 2: poison.txt — 3 poisons, 4 treatments, 48 observations on animals. (The slide shows the pairs plot of Time, Poison and Treatment.)

Regression:
setwd("f:/2010stat")
w=read.table("poison.txt",head=T)
head(w);tail(w)
str(w);summary(w)
dim(w)
w$Poison=factor(w$Poison)
w$Treatment=factor(w$Treatment)
pairs(w)
# straightforward regression
a=lm(Time~Poison*Treatment,w)
anova(a)
a=lm(Time~.,w)
anova(a)
qqnorm(a$res);qqline(a$res)
shapiro.test(a$res)
# a transformation
a=lm(1/Time~Poison+Treatment,w)
anova(a)
qqnorm(a$res);qqline(a$res)
shapiro.test(a$res)
summary(a)

Stepwise regression (step, or stepAIC from MASS):
library(MASS)
quine.hi <- aov(log(Days + 2.5) ~ .^4, quine)
quine.nxt <- update(quine.hi, . ~ . - Eth:Sex:Age:Lrn)
quine.stp <- stepAIC(quine.nxt,
    scope = list(upper = ~Eth*Sex*Age*Lrn, lower = ~1), trace = FALSE)
quine.stp$anova
cpus1 <- cpus
attach(cpus)
for(v in names(cpus)[2:7])
  cpus1[[v]] <- cut(cpus[[v]], unique(quantile(cpus[[v]])), include.lowest = TRUE)
detach()
cpus0 <- cpus1[, 2:8] # excludes names, authors' predictions
cpus.samp <- sample(1:209, 100)
cpus.lm <- lm(log10(perf) ~ ., data = cpus1[cpus.samp,2:8])
cpus.lm2 <- stepAIC(cpus.lm, trace = FALSE)
cpus.lm2$anova
example(birthwt)
birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)
birthwt.step <- stepAIC(birthwt.glm, trace = FALSE)
birthwt.step$anova
birthwt.step2 <- stepAIC(birthwt.glm, ~ .^2 + I(scale(age)^2) + I(scale(lwt)^2),
                         trace = FALSE)
birthwt.step2$anova
quine.nb <- glm.nb(Days ~ .^4, data = quine)
quine.nb2 <- stepAIC(quine.nb)
quine.nb2$anova

(Slides: transforming and fitting the main effects; interpreting the results.)

When the variance is not constant, consider:
- a Box-Cox transformation;
- weighted least squares (e.g. weights = 1/x^2 in lm);
- handling outliers;
- not necessarily a single model, and so on (do exploratory analysis).

Polynomial regression:
# polynomial regression
y <- cars$dist
x <- cars$speed
o = order(x)
plot( y~x )
do.it <- function (model, col) {
  r <- lm( model ); yp <- predict(r)
  lines( yp[o] ~ x[o], col=col, lwd=3 )}
do.it(y~x, col="red")
do.it(y~x+I(x^2), col="blue")
do.it(y~-1+I(x^2), col="green")
legend(par("usr")[1], par("usr")[4],
       c("affine function", "degree-2 polynomial", "degree 2 monomial"),
       lwd=3, col=c("red", "blue", "green"))
n <- 100
x <- runif(n,min=-4,max=4)
x <- x + sign(x)*.2
y <- 1/x + rnorm(n) # a hyperbola
plot(y~x)
lm( 1/y ~ x )
n <- 100
x <- rlnorm(n)^3.14
# a log-normal distribution is the distribution of a random variable
# whose logarithm is normally distributed
y <- x^-.1 * rlnorm(n)
plot(y~x)
lm(log(y) ~ log(x))

Orthogonal polynomials: p and q are orthogonal on the design points when sum_i p(x_i) q(x_i) = 0.
# orthogonal polynomials
y <- cars$dist
x <- cars$speed
# non-orthogonal: adding one term at a time (the terms interfere;
# previously significant coefficients can become insignificant)
summary( lm(y~x) )
summary( lm(y~x+I(x^2)) )
summary( lm(y~x+I(x^2)+I(x^3)) )
summary( lm(y~x+I(x^2)+I(x^3)+I(x^4)) )
summary( lm(y~x+I(x^2)+I(x^3)+I(x^4)+I(x^5)) )
# orthogonal: coefficients that start out significant stay significant
# poly: Compute Orthogonal Polynomials
summary( lm(y~poly(x,1)) )
summary( lm(y~poly(x,2)) )
summary( lm(y~poly(x,3)) )
summary( lm(y~poly(x,4)) )
summary( lm(y~poly(x,5)) )
# plot the p-values of the coefficients for orthogonal polynomials
n <- 5
p <- matrix( nrow=n, ncol=n+1 )
for (i in 1:n) {
  p[i,1:(i+1)] <- summary(lm( y ~ poly(x,i) ))$coefficients[,4]
}
matplot(p, type='l', lty=1, lwd=3)
legend( par("usr")[1], par("usr")[4], as.character(1:n), lwd=3, lty=1, col=1:n )
title(main="Evolution of the p-values (orthonormal polynomials)")
# the same for non-orthogonal polynomials
p <- matrix( nrow=n, ncol=n+1 )
p[1,1:2] <- summary(lm(y ~ x) )$coefficients[,4]
p[2,1:3] <- summary(lm(y ~ x+I(x^2)) )$coefficients[,4]
p[3,1:4] <- summary(lm(y ~ x+I(x^2)+I(x^3)) )$coefficients[,4]
p[4,1:5] <- summary(lm(y ~ x+I(x^2)+I(x^3)+I(x^4)) )$coefficients[,4]
p[5,1:6] <- summary(lm(y ~ x+I(x^2)+I(x^3)+I(x^4)+I(x^5)) )$coefficients[,4]
matplot(p, type='l', lty=1, lwd=3)
legend( par("usr")[1], par("usr")[4], as.character(1:n), lwd=3, lty=1, col=1:n )
title(main="Evolution of the p-values (non orthonormal polynomials)")
# an example
data(beavers)
y <- beaver1$temp
x <- 1:length(y)
plot(y~x)
for (i in 1:10) {
  r <- lm( y ~ poly(x,i) )
  lines( predict(r), type="l", col=i )
}
summary(r)

Nonparametric regression:
# nonparametric: splines
plot(quakes$long, quakes$lat)
lines( smooth.spline(quakes$long, quakes$lat), col='red', lwd=3)
library(Design) # rcs: Design Special Transformation Functions
# 4-knot spline
r <- lm( quakes$lat ~ rcs(quakes$long) )
plot( quakes$lat ~ quakes$long )
o <- order(quakes$long)
lines( quakes$long[o], predict(r)[o], col='red', lwd=3 )
r <- lm( quakes$lat ~ rcs(quakes$long,10) )
lines( quakes$long[o], predict(r)[o], col='blue', lwd=6, lty=3 )
title(main="Regression with rcs")
legend( par("usr")[1], par("usr")[3], yjust=0,
        c("4 knots", "10 knots"), lwd=c(3,3), lty=c(1,3), col=c("red", "blue") )
# more splines
library(splines)
data(quakes)
x <- quakes[,2]
y <- quakes[,1]
o <- order(x)
x <- x[o]
y <- y[o]
r1 <- lm( y ~ bs(x,df=10) )
r2 <- lm( y ~ ns(x,df=6) )
plot(y~x)
lines(predict(r1)~x, col='red', lwd=3)
lines(predict(r2)~x, col='green', lwd=3)
# kernel smoothing
plot(cars$speed, cars$dist)
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth=2), col='red')
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth=5), col='green')
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth=10), col='blue')
# weighted local least squares:
loess
# various kernels
curve(dnorm(x), xlim=c(-3,3), ylim=c(0,1.1))
x <- seq(-3,3,length=200)
D.Epanechikov <- function (t) { ifelse(abs(t)<1, 3/4*(1-t^2), 0) }
lines(D.Epanechikov(x) ~ x, col='red')
D.tricube <- function (t) { # aka "triweight kernel"
  ifelse(abs(t)<1, (1-abs(t)^3)^3, 0) }
lines(D.tricube(x) ~ x, col='blue')
legend( par("usr")[1], par("usr")[4], yjust=1,
        c("Gaussian kernel", "Epanechnikov kernel", "tricube kernel"),
        lwd=1, lty=1, col=c(par('fg'),'red', 'blue'))
title(main="Different kernels")
# local polynomial regression
library(KernSmooth)
data(quakes)
x <- quakes$long
y <- quakes$lat
plot(y~x)
bw <- dpill(x,y) # about .2
lines( locpoly(x,y,degree=0, bandwidth=bw), col='red' )
lines( locpoly(x,y,degree=1, bandwidth=bw), col='green' )
lines( locpoly(x,y,degree=2, bandwidth=bw), col='blue' )
legend( par("usr")[1], par("usr")[3], yjust=0,
        c("degree = 0", "degree = 1", "degree = 2"),
        lwd=1, lty=1, col=c('red', 'green', 'blue'))
title(main="Local Polynomial Regression")
# a wide bandwidth
plot(y~x);bw <- .5
lines( locpoly(x,y,degree=0, bandwidth=bw), col='red' )
lines( locpoly(x,y,degree=1, bandwidth=bw), col='green' )
lines( locpoly(x,y,degree=2, bandwidth=bw), col='blue' )
legend( par("usr")[1], par("usr")[3], yjust=0,
        c("degree = 0", "degree = 1", "degree = 2"),
        lwd=1, lty=1, col=c('red', 'green', 'blue'))
title(main="Local Polynomial Regression (wider window)")

Nonlinear regression:
# nonlinear regression
library(nls2)
f <- function (x,p) {
  u <- p[1]
  v <- p[2]
  u/(u-v) * (exp(-v*x) - exp(-u*x))
}
n <- 100
x <- runif(n,0,2)
y <- f(x, c(3.14,2.71)) + .1*rnorm(n)
r <- nls( y ~ f(x,c(a,b)), start=c(a=3, b=2.5) )
plot(y~x)
xx <- seq(0,2,length=200)
lines(xx, f(xx,r$m$getAllPars()), col='red', lwd=3)
lines(xx, f(xx,c(3.14,2.71)), lty=2)

Quantile regression. The model is y_i = x_i'beta + eps_i; we seek the parameter (possibly a vector) minimizing the loss sum_i rho(y_i - x_i'beta). In least-squares regression the loss is rho(u) = u^2; in tau-quantile regression it is the check function rho_tau(u) = u (tau - I(u < 0)), and tau = 0.5 gives least-absolute-deviation regression. (The slide showing the shape of rho_tau(u) is omitted.)

# quantile regression
library(quantreg)
data(engel);head(engel);plot(engel)
plot(engel, log = "xy", main = "'engel' data (log - log scale)")
plot(log10(foodexp) ~ log10(income), data = engel,
     main = "'engel' data (log10 - transformed)")
taus <- c(.15, .25, .50, .75, .95, .99)
rqs <- as.list(taus)
for(i in seq(along = taus)) {
  rqs[[i]] <- rq(log10(foodexp) ~ log10(income), tau = taus[i], data = engel)
  lines(log10(engel$income), fitted(rqs[[i]]), col = i+1)}
legend("bottomright", paste("tau = ", taus), inset = .04,
       col = 2:(length(taus)+1), lty=1)
abline(lm(log10(foodexp)~log10(income),engel),lwd=5) # least squares: the thick black line
plot(summary(rq(log10(foodexp)~log10(income),tau = 1:49/50,data=engel))) # coefficient plots
# untransformed data
plot(foodexp~income, data = engel, main = "'engel' data")
for(i in seq(along = taus)) {
  rqs[[i]] <- rq(foodexp ~ income, tau = taus[i], data = engel)
  lines(engel$income, fitted(rqs[[i]]), col = i+1)}
legend("bottomright", paste("tau = ", taus), inset = .04,
       col = 2:(length(taus)+1), lty=1)
abline(lm(foodexp~income,engel),lwd=5) # least squares: the thick black line
plot(summary(rq(foodexp~income,tau = 1:49/50,data=engel))) # coefficient plots
N <- 2000
x <- runif(N)
y <- rnorm(N)
y <- -1 + 2 * x + ifelse(y>0, y+5*x^2, y-x^2)
plot(x,y)
abline(lm(y~x), col="red")
library(quantreg)
plot(y~x)
for (a in seq(.1,.9,by=.1)) {
  abline(rq(y~x, tau=a), col="blue", lwd=3)
}
# locally polynomial quantile regression
plot(y~x)
for (a in seq(.1,.9,by=.1)) {
  r <- lprq(x,y, h=bw.nrd0(x), # see ?density
            tau=a)
  lines(r$xx, r$fv, col="blue", lwd=3)
}

Logistic and probit regression; generalized linear models (GLM). In a GLM all the functions of the family are known, and the parameter theta is tied to the mean. If b(theta) = theta, the family is in canonical form (every distribution in the family can be brought to canonical form by a transformation); if moreover T(y) = y, then theta is called the canonical parameter. If the canonical parameter is used, i.e. theta = b(mu) = X beta, the link is the canonical link. (In general a link function is a function b(mu) = X beta describing the relation between the mean mu and the linear predictor X beta; it need not equal theta.)
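For reference, the exponential-family notation the slide alludes to can be written out explicitly (a hedged reconstruction in the usual textbook form, since the slide's formulas were lost):

$$f(y;\theta,\phi)=\exp\!\left\{\frac{y\,\theta-b(\theta)}{a(\phi)}+c(y,\phi)\right\},\qquad E(Y)=b'(\theta),\qquad \operatorname{Var}(Y)=b''(\theta)\,a(\phi).$$

With the canonical link, $\theta=\eta=X\beta$; for the Bernoulli family this gives the logit link $\log\{\mu/(1-\mu)\}$, which is why logistic regression is the canonical case, while probit regression uses $g(\mu)=\Phi^{-1}(\mu)$ instead.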
Logistic / probit regression example: ModeChoice. In fact mode takes only two values, 0 and 1; the remaining variables are quantitative. Two logistic models are considered, model b and model c.

ANOVA:
library(Ecdat);data(ModeChoice) # binary response
w=ModeChoice
# two logistic models
b=glm(factor(mode)~ttme+invt+gc,data=w,family="binomial")
c=glm(factor(mode)~ttme+invc*invt+gc,data=w,family="binomial")
anova(c,test="Chi");summary(c);anova(b,c,test="Chi")

(Slides: the fitted output of models b and c is omitted.)
Two probit models, bb and cb, mirror models b and c.

ANOVA:
# two probit models
bb=glm(factor(mode)~ttme+invt+gc,data=w,family=binomial(link=probit))
cb=glm(factor(mode)~ttme+invc*invt+gc,data=w,binomial(link=probit))
anova(cb,test="Chi");summary(cb);anova(bb,cb,test="Chi")
anova(b,bb,test="Chi");anova(c,cb,test="Chi")
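A minimal follow-up sketch (assuming the fitted objects b and bb from above): the two link functions usually give nearly identical fitted probabilities, which is easy to check directly.

p.logit  <- fitted(b)   # fitted P(mode=1) under the logit link
p.probit <- fitted(bb)  # fitted P(mode=1) under the probit link
plot(p.logit, p.probit); abline(0, 1, col = "red")
cor(p.logit, p.probit)  # typically very close to 1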
(Slide: the fitted output of the probit models is omitted.)
Comparing logistic and probit; modelling the dispersion:
library(dglm)
library(statmod)
clotting <- data.frame(
  u = c(5,10,15,20,30,40,60,80,100),
  lot1 = c(118,58,42,35,27,25,21,19,18),
  lot2 = c(69,35,26,21,18,16,13,12,12))
a1=glm(lot1 ~ log(u), data=clotting, family=Gamma)
summary(a1)
# The same example as in glm: the dispersion is modelled as constant.
# However, dglm uses ml not reml, so results differ slightly:
out <- dglm(lot1 ~ log(u), ~1, data=clotting, family=Gamma)
summary(out)
# Try a double glm
out2 <- dglm(lot1 ~ log(u), ~u, data=clotting, family=Gamma)
summary(out2)
anova(out2)
# Summarize the mean model as for a glm
summary.glm(out2)
# Summarize the dispersion model as for a glm
summary(out2$dispersion.fit)
# Examine goodness of fit of dispersion model by plotting residuals
plot(fitted(out2$dispersion.fit),residuals(out2$dispersion.fit))

Poisson log-linear model: dispersion. For n independent responses Y_i ~ Poisson(theta_i) (possibly with an offset), the Poisson distribution has Var(Y_i) = E(Y_i) = theta_i, but it may happen that the actual variance exceeds the nominal variance under the assumed probability model. Suppose now that the rate is itself random, theta_i = lambda_i * nu_i with E(nu_i) = 1 and Var(nu_i) = phi. Then E(Y_i) = lambda_i, and it can be shown that Var(Y_i) = lambda_i + phi * lambda_i^2. Hence, for phi > 0 we have overdispersion. It is interesting to note that the same mean and variance arise also if we assume a negative binomial distribution for the response variable.

library(dispmod)
data(salmonellaTA98)
attach(salmonellaTA98)
log.x10 <- log(x+10)
mod <- glm(y ~ log.x10 + x, family=poisson(log))
summary(mod)
mod.disp <- glm.poisson.disp(mod)
summary(mod.disp)
mod.disp$dispersion
# compute predictions on a grid of x-values...
x0 <- seq(min(x), max(x), length=50)
eta0 <- predict(mod, newdata=data.frame(log.x10=log(x0+10), x=x0), se=TRUE)
eta0.disp <- predict(mod.disp, newdata=data.frame(log.x10=log(x0+10), x=x0), se=TRUE)
# ... and plot the mean functions with variability bands
plot(x, y)
lines(x0, exp(eta0$fit))
lines(x0, exp(eta0$fit+2*eta0$se), lty=2)
lines(x0, exp(eta0$fit-2*eta0$se), lty=2)
lines(x0, exp(eta0.disp$fit), col=2)
lines(x0, exp(eta0.disp$fit+2*eta0.disp$se), lty=2, col=2)
lines(x0, exp(eta0.disp$fit-2*eta0.disp$se), lty=2, col=2)
## Holford's data
data(holford)
attach(holford)
mod <- glm(incid ~ offset(log(pop)) + Age + Cohort, family=poisson(log))
summary(mod)
mod.disp <- glm.poisson.disp(mod)
summary(mod.disp)
mod.disp$dispersion
# another approach (Tweedie distributions -- look up the literature yourself)
# tt = glm(incid ~ offset(log(pop)) + Age + Cohort,
#          family=tweedie(var.power=4,link.power=0))

Detecting multicollinearity in regression:
- The coefficients change a lot when a predictor is added or dropped.
- The F test rejects "all coefficients are zero", yet some individual coefficients are insignificant.
- Formal detection: tolerance, or the variance inflation factor (VIF).
- The condition number test.
- The Farrar-Glauber test: if the variables are found to be orthogonal, there is no multicollinearity.
Tolerance and VIF: tolerance_j = 1 - R^2_j and VIF_j = 1/(1 - R^2_j), where R^2_j is the coefficient of determination from regressing the j-th variable on all the others. A tolerance that is too small (< 0.2 or 0.1), equivalently a VIF that is too large (> 5 or 10), signals a multicollinearity problem.

Condition number: it equals 1 for orthogonal predictors; values > 15 indicate a problem, and > 30 a serious one.

Computing the VIF (for linear and generalized linear models) — car calculates variance-inflation and generalized variance-inflation factors:
library(car)
vif(lm(prestige ~ income + education, data=Duncan))
vif(lm(prestige ~ income + education + type, data=Duncan))
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
glm.D93 <- glm(counts ~ outcome + treatment, family=poisson())
anova(glm.D93)
summary(glm.D93)
vif(glm.D93)

Computing (or estimating) the condition number of a matrix:
kappa(x1 <- cbind(1,1:10)) # 15.71
kappa(x1, exact = TRUE) # 13.68
kappa(x2 <- cbind(x1,2:11)) # high! [x2 is singular!]
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }
sv9 <- svd(h9 <- hilbert(9))$d
kappa(h9) # pretty high!
kappa(h9, exact = TRUE) == max(sv9) / min(sv9)
kappa(h9, exact = TRUE) / kappa(h9) # .677 (i.e., rel.error = 32%)

Ridge regression. The estimator solves the penalized least-squares problem

$$\hat\beta^{\text{ridge}}=\arg\min_{\beta}\left\{\sum_{i=1}^{N}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}+\lambda\sum_{j=1}^{p}\beta_j^{2}\right\},$$

equivalently

$$\hat\beta^{\text{ridge}}=\arg\min_{\beta}\sum_{i=1}^{N}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}\quad\text{subject to}\quad\sum_{j=1}^{p}\beta_j^{2}\le s.$$

The consumption data (library(perturb)) is a data frame with 28 observations on 5 variables:
- year: 1947 to 1974
- c: total consumption, 1958 dollars
- r: the interest rate (Moody's Aaa)
- dpi: disposable income, 1958 dollars
- d_dpi: annual change in disposable income

library(perturb);data(consumption)
head(consumption)
library(MASS)
attach(consumption) # so that c, r, dpi, d_dpi are visible
ct1<-c(NA,c[-length(c)]) # lagged consumption
a<-lm.ridge(c~ct1+dpi+r+d_dpi, lambda=seq(0,0.1,length=100), model =TRUE)
names(a) # "coef" "scales" "Inter" "lambda" "ym" "xm" "GCV" "kHKB" "kLW"
a$lambda[which.min(a$GCV)] # the lambda minimizing GCV (about 0.014 here)
a$coef[,which.min(a$GCV)] # the coefficients at that lambda
par(mfrow=c(1,2))
plot(a) # ridge trace; mark the GCV-optimal lambda with a red line
abline(v=a$lambda[which.min(a$GCV)],col="red")
plot(a$lambda,a$GCV,type="l") # lambda against GCV
abline(v=a$lambda[which.min(a$GCV)],col="green")

(Slides: the ridge trace and the GCV curve; for orthogonal (independent) data the traces are flat.)

# another example
longley # not the same as the S-PLUS dataset
names(longley)[1] <- "y"
a0=lm.ridge(y ~ ., longley) # lambda = 0
plot(lm.ridge(y ~ ., longley,lambda = seq(0,0.1,0.001)))
select(lm.ridge(y ~ ., longley,lambda = seq(0,0.1,0.0001)))
a1=lm.ridge(y ~ ., longley,lambda=0.0057)
a1$coe

Partial least squares regression, PLSR (Partial Least Squares and Principal Component Regression). The oliveoil data in pls: sensory and physico-chemical data of olive oils. A data set with scores on 6 attributes from a sensory panel and measurements of 5 physico-chemical quality parameters on 16 olive oil samples. The first five oils are Greek, the next five are Italian and the last six are Spanish. data(oliveoil). Format: a data frame with 16 observations on 2 variables. sensory: a matrix with 6 columns, scores for the attributes 'yellow', 'green', 'brown', 'glossy', 'transp', and 'syrup'. chemical: a matrix with 5 columns, measurements of acidity, peroxide, K232, K270, and DK. Source: Massart, D. L., Vandeginste, B. G. M., Buydens, L. M. C., de Jong, S., Lewi, P. J., Smeyers-Verbeke, J. (1998) Handbook of Chemometrics and Qualimetrics: Part B. Elsevier. Tables 35.1 and 35.4.
The model is sensory ~ chemical.

# partial least squares regression (principal component regression first)
library(pls); data(oliveoil);head(oliveoil);dim(oliveoil)
oliveoil$sensory # a 16x6 matrix
oliveoil$chemical # a 16x5 matrix
# PCR
sens.pcr <- pcr(sensory ~ chemical, ncomp = 4, scale = TRUE, data = oliveoil)
summary(sens.pcr);names(sens.pcr)
# "coefficients" "scores" "loadings" "Yloadings" "projection" "Xmeans"
# "Ymeans" "fitted.values" "residuals" "Xvar" "Xtotvar" "ncomp"
# "method" "scale" "call" "terms" "model"
sens.pcr$loadings
sens.pcr$coefficients
sens.pcr$scores
sens.pcr$Yloadings
sens.pcr$projection
sens.pcr$residuals
# PLSR
sens.pls <- plsr(sensory ~ chemical, ncomp = 4, scale = TRUE, data = oliveoil)
summary(sens.pls);names(sens.pls)
# "coefficients" "scores" "loadings" "loading.weights" "Yscores"
# "Yloadings" "projection" "Xmeans" "Ymeans" "fitted.values"
# "residuals" "Xvar" "Xtotvar" "ncomp" "method"
# "scale" "call" "terms" "model"
sens.pls$loadings
sens.pls$coef
sens.pls$scores
sens.pls$loading.weights
sens.pls$Yscores
sens.pls$Yloadings
sens.pls$Xvar
sens.pls$Xtotvar

The yarn data in pls consists of 21 NIR spectra of PET yarns, measured at 268 wavelengths, and 21 corresponding densities (Erik Swierenga):
library(pls);data(yarn)
names(yarn) # "NIR" "density" "train"
dim(yarn$NIR) # 28 268, the predictors
yarn$density # the response
summary(yarn$train) # logical: FALSE 7, TRUE 21
yarn.pls <- plsr(density ~ NIR, ncomp = 4, scale = TRUE, data = yarn)
summary(yarn.pls);names(yarn.pls)
yarn.pls$loadings
yarn.pls$coef
yarn.pls$scores
yarn.pls$loading.weights
yarn.pls$Yscores
yarn.pls$Yloadings
yarn.pls$Xvar
yarn.pls$Xtotvar

The LASSO method. In a linear model one must choose suitable variables; the commonly used stepwise regression, for instance, keeps the significant variables and discards the insignificant ones. Tibshirani (1996) [1] proposed a new way to handle variable selection: minimize the residual sum of squares subject to the constraint that the sum of the absolute values of the coefficients be smaller than a constant. The method yields interpretable models, as subset selection does, and enjoys the stability of ridge regression. It deletes no variables, but shrinks some regression coefficients toward, and even exactly to, zero. The method is therefore called the lasso (least absolute shrinkage and selection operator [2]):

$$\hat\beta^{\text{lasso}}=\arg\min_{\beta}\sum_{i=1}^{N}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}\quad\text{subject to}\quad\sum_{j=1}^{p}|\beta_j|\le s.$$

[1] Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso", J. R. Statist. Soc. B, 58(1), 267-288.
[2] As an ordinary English word, "lasso" also means a rope for catching horses, which has nothing to do with the acronym here; translating it that way is inappropriate.
library(lars)
data(diabetes);attach(diabetes);names(diabetes);dim(diabetes);head(diabetes)
attributes(diabetes$x) # a list of three components (y, x, x2); x and x2 have sub-variables
# comparing the three types (the default is lasso):
# fits Least Angle Regression, Lasso and Infinitesimal Forward Stagewise models
par(mfrow=c(1,3))
attach(diabetes)
object <- lars(x,y);plot(object)
object2 <- lars(x,y,type="lar");plot(object2)
object3 <- lars(x,y,type="for");plot(object3)
par(mfrow=c(1,1))
a2<-lars(x2,y)
a2
attributes(a2)
a=cv.lars(x2,y,trace=TRUE,max.steps=80)
attach(a)
best<-index[which.min(cv)]
(coef=coef.lars(a2,mode="fraction",s=best))
detach(diabetes)
# write.csv(diabetes,"f:/2010stat/diabetes.csv",quote=F,row.names=F)
# w=read.csv("f:/2010stat/diabetes.csv") # the format differs from the original

Fit a GLM with lasso or elasticnet regularization (glmnet):
# Gaussian
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
fit1=glmnet(x,y)
print(fit1)
coef(fit1,s=0.01) # extract coefficients at a single value of lambda
predict(fit1,newx=x[1:10,],s=c(0.01,0.005)) # make predictions
# binomial
g2=sample(1:2,100,replace=TRUE)
fit2=glmnet(x,g2,family="binomial")
# multinomial
g4=sample(1:4,100,replace=TRUE)
fit3=glmnet(x,g4,family="multinomial")
system.time(fit2n<-glmnet(x,y))
# poisson
N=500; p=20
nzc=5
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
f = x[,seq(nzc)]%*%beta
mu=exp(f)
y=rpois(N,mu)
fit=glmnet(x,y,family="poisson")
plot(fit)
pfit = predict(fit,x,s=0.001,type="response")
plot(pfit,y)
# Cox
set.seed(10101)
N=1000;p=30
nzc=p/3
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
fx=x[,seq(nzc)]%*%beta/3
hx=exp(fx)
ty=rexp(N,hx)
tcens=rbinom(n=N,prob=.3,size=1) # censoring indicator
y=cbind(time=ty,status=1-tcens) # y=Surv(ty,1-tcens) with library(survival)
fit=glmnet(x,y,family="cox")
plot(fit)
# sparse
n=10000;p=200
nzc=trunc(p/10)
x=matrix(rnorm(n*p),n,p)
iz=sample(1:(n*p),size=n*p*.85,replace=FALSE)
x[iz]=0
sx=Matrix(x,sparse=TRUE)
beta=rnorm(nzc)
fx=x[,seq(nzc)]%*%beta
eps=rnorm(n)
y=fx+eps
px=exp(fx)
px=px/(1+px)
ly=rbinom(n=length(px),prob=px,size=1)
system.time(fit1<-glmnet(sx,y))

Generalized additive models (gam). In an additive model E(y) = alpha + sum_i f_i(x_i), the functions f_i(x_i) may be fit using parametric or non-parametric means, thus providing the potential for better fits to data than other methods.
library(gam)
data(kyphosis)
gam(Kyphosis ~ s(Age,4) + Number, family = binomial, data=kyphosis, trace=TRUE)
data(airquality)
gam(Ozone^(1/3) ~ lo(Solar.R) + lo(Wind, Temp), data=airquality, na=na.gam.replace)
gam(Kyphosis ~ poly(Age,2) + s(Start), data=kyphosis, family=binomial, subset=Number>2)
data(gam.data)
gam.object <- gam(y ~ s(x,6) + z,data=gam.data)
summary(gam.object)
plot(gam.object,se=TRUE)
data(gam.newdata)
predict(gam.object,type="terms",newdata=gam.newdata)
# Built-in nonparametric smoothing terms are indicated by s for smoothing
# splines or lo for loess smooth terms.

Multiple Fractional Polynomial Model. Selects the multiple fractional polynomial (MFP) model which best predicts the outcome. The model may be a generalized linear model or a proportional hazards (Cox) model.
(The ac.csv data example follows below.)
library(mfp)
data(GBSG)
f <- mfp(Surv(rfst, cens) ~ fp(age, df = 4, select = 0.05)
         + fp(prm, df = 4, select = 0.05), family = cox, data = GBSG)
print(f)
survfit(f$fit) # use the proposed coxph model fit for survival curve estimation
# see the slides: fractional polynomial model.pdf

library(mfp)
w=read.csv("f:/2010stat/ac.csv")
d=mfp(log(ac)~fp(gawks,select=0.05),data=w,verbose=T)
# the verbose output compares the null, straight-line, best-fitting FP1 and
# best-fitting FP2 models; here the best FP2 model has m=2 with powers p1=-2, p2=0
summary(d)
par(mfrow=c(2,2))
plot(mfp(log(ac)~fp(gawks),w)) # the usual four diagnostic plots
# (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage)
xfit=(w$gawks);yfit=predict(d,data=(w$gawks=(xfit)))
plot(log(w$ac)~w$gawks);lines(yfit~xfit,col="red")

Variable selection: library(MMIX)
library(MMIX)
X1<-c(-0.2,-2.4,-0.7,1.2,0.0,-1.1,-2.1,-0.3,2.0,-1.7,1.4,-1.3,-3.4,0.4,-1.3,-4.8)
X2<-c(-3, 2, 1, -2, -2, -4, 0, 1, 1, -1, -1, -4, 0, 2, 0, -4)
X3<-c(2,1,0,-2,1,-2, 0, -1, -4, 1, -3, -3, -3, -1, 0, 2)
# linear model
Y1<-c(8.7, 6, 9.1, 10.4, 7.6 ,10.4, 7.9, 11.9, 18, 10.5, 16.5, 8.8, 7.7, 13.5, 8.2, 0.8)
data1<-data.frame(Y1,X1,X2,X3)
pairs(data1)
# logistic model
Y2<-c(1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1)
data2<-data.frame(Y2,X1,X2,X3)
pairs(data2)
## varSelec — in the end we obtain two reduced data sets:
## X1 and X3 are selected for data1, X2 for data2
data1bis<-varSelec(data=data1,family=gaussian("identity"),maxVar=2)
data2bis<-varSelec(data=data2,family=binomial("logit"),maxVar=2)
## bootFreq analyses the stability of the stepwise selection and mixing
## methods using a bootstrap procedure
# stepwise selection method
bootStep1<-bootFreq(data=data1,family=gaussian("identity"),nboot=50,method=2,
                    criterion="bic",trace=0)
bootStep1
summary(bootStep1)
plot(bootStep1) # factor weights for X1, X2, X3
bootStep2<-bootFreq(data=data2,family=binomial("logit"),method=2,
                    criterion="bic",nboot=20,trace=0)
bootStep2
summary(bootStep2)
plot(bootStep2) # factor weights

Adaptive Regression by Mixing with model Screening — apply ARMS to linear and logistic models:
## ARMS method
armsResult1<-arms(data=data1,family=gaussian("identity"),nbest=5,nsample=10,
                  criterion="both",weight="aic")
armsResult1
summary(armsResult1)
plot(armsResult1) # factor weights
armsResult2<-arms(data=data2,family=binomial("logit"),nbest=5,nsample=10,
                  criterion="both",weight="aic")
armsResult2
summary(armsResult2)
plot(armsResult2) # factor weights

Model performance indicators PMSE and AUC:
## root mean square error of prediction by cross-validation
# stepwise selection with BIC
pmseStepBic<-pmseCV(data=data1,method=2,np=1,random=FALSE,direction="both",
                    criterion="bic",trace=0)
pmseStepBic
# BMA
pmseBMA<-pmseCV(data=data1,method=3,np=1,random=FALSE)
pmseBMA
## area under the ROC curve by cross-validation
aucStepBic<-aucCV(data=data2,method=2,np=1,random=FALSE,direction="both",
                  criterion="bic",trace=0)
aucStepBic
aucBMA<-aucCV(data=data2,method=3,np=1,random=FALSE)
aucBMA

Conditional logistic regression — see the slides:
logisticRegressionManitoba-Nov2010.pdf: the "prospective" model, the conditional likelihood, and exponential movement kernels (Forester et al. 2009).
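For the simplest matched design (one case per stratum), the conditional likelihood being referred to has the standard form (a hedged reconstruction; the slides' own formulas are not reproduced here):

$$L(\beta)=\prod_{k=1}^{K}\frac{\exp(x_{k1}^{\top}\beta)}{\sum_{j\in\text{stratum }k}\exp(x_{kj}^{\top}\beta)},$$

where $x_{k1}$ denotes the case in stratum k. The stratum-specific intercepts cancel out of this ratio, which is what makes the estimation conditional on the matching.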
library(TwoStepCLogit)
data(bison)
dyad <- bison$Cluster
stratum <- bison$Strata
Y <- bison$Y
x1 <- bison$biomass
x2 <- bison$water
x3 <- bison$biomass2
bison.data <- cbind(dyad,stratum,Y,x1,x2,x3)
# Conditional logistic regression
# There are 10 observations per stratum, with
# 2 cases and 8 controls in each stratum
# Model 1: all three covariates x1, x2, x3
# Random effects in front of x1 and x3
# Main diagonal covariance structure for D
Fit1 <- Ts.estim(bison.data, Proposed.model=c(1,2,3), random=c(1,3),
                 All.m.1=FALSE, D="UN(1)")
Fit1
# Model 2: only covariates x1, x3
# Random effects in front of x1 and x3
# Unstructured covariance structure for D
Fit2 <- Ts.estim(bison.data, Proposed.model=c(1,3), random=c(1,2),
                 All.m.1=FALSE, D="UN")
Fit2

Bayesian methods for tree based models: Bayesian Additive Regression Trees, bart {BayesTree}. BART is a Bayesian "sum-of-trees" model. For numeric response y, we have y = f(x) + e, where e ~ N(0,sigma^2). For a binary response y, P(Y=1 | x) = F(f(x)), where F denotes the standard normal cdf (probit link). In both cases, f is the sum of many tree models. The goal is to have very flexible inference for the unknown function f. In the spirit of "ensemble models", each tree is constrained by a prior to be a weak learner so that it contributes a small amount to the overall fit.

library(BayesTree)
## simulate data (example from Friedman MARS paper)
f = function(x){
  10*sin(pi*x[,1]*x[,2]) + 20*(x[,3]-.5)^2+10*x[,4]+5*x[,5]
}
sigma = 1.0 # y = f(x) + sigma*z, z~N(0,1)
n = 100 # number of observations
set.seed(99)
x=matrix(runif(n*10),n,10) # 10 variables, only first 5 matter
Ey = f(x)
y=Ey+sigma*rnorm(n)
lmFit = lm(y~.,data.frame(x,y)) # compare lm fit to BART later
## run BART
set.seed(99)
bartFit = bart(x,y)
plot(bartFit) # plot bart fit (sigma trace and posterior intervals for E(Y|x))
## compare BART fit to the linear model and to the truth = Ey
fitmat = cbind(y,Ey,lmFit$fitted,bartFit$yhat.train.mean)
colnames(fitmat) = c('y','Ey','lm','bart')
print(cor(fitmat))

Kernel-based Machine Learning Lab (kernlab) — see the PDF that ships with the package for the methods. The object classes include:
- "vm": used as a base for the various vector machine classes in kernlab
- "gausspr": the Gaussian Processes object class
- "kfa": returned by the Kernel Feature Analysis function kfa
- "kqr": Kernel Quantile Regression
- "ksvm": Support Vector Machines
- "lssvm": Least Squares Support Vector Machines
- "onlearn": kernel-based online learning algorithms
- "rvm": Relevance Vector Machine

library(kernlab)
## simple example using the promotergene data set
data(promotergene)

Example: ksvm (support vector machine)
## train a support vector machine
gene <- ksvm(Class~.,data=promotergene,kernel="rbfdot",
             kpar=list(sigma=0.015),C=50,cross=4)
gene
kernelf(gene) # the kernel function
alpha(gene) # the alpha values
coef(gene) # the coefficients
fitted(gene) # the fitted values
cross(gene) # the cross validation error

Prediction (on a test set):
## example using the promotergene data set
data(promotergene)
## create test and training set
ind <- sample(1:dim(promotergene)[1],20)
genetrain <- promotergene[-ind, ]
genetest <- promotergene[ind, ]
## train a support vector machine
gene <- ksvm(Class~.,data=genetrain,kernel="rbfdot",
             kpar=list(sigma=0.015),C=70,cross=4,prob.model=TRUE)
gene
## predict gene type probabilities on the test set
genetype <- predict(gene,genetest,type="probabilities")
genetype

Example: kqr (kernel quantile regression)
# create data
x <- sort(runif(300))
y <- sin(pi*x) + rnorm(300,0,sd=exp(sin(2*pi*x)))
# first calculate the median (median regression)
qrm <- kqr(x, y, tau = 0.5, C=0.15)
# predict and plot
plot(x, y)
ytest <- predict(qrm, x)
lines(x, ytest, col="blue")
# calculate the 0.9 quantile
qrm <- kqr(x, y, tau = 0.9, kernel = "rbfdot", kpar= list(sigma=10), C=0.15)
ytest <- predict(qrm, x)
lines(x, ytest, col="red")

Example: kernel principal component analysis
data(iris)
test <- sample(1:150,20)
kpc <- kha(~.,data=iris[-test,-5],kernel="rbfdot",
           kpar=list(sigma=0.2),features=2)
pcv(kpc) # print the principal component vectors
# plot the data projection on the components
plot(predict(kpc,iris[,-5]),col=as.integer(iris[,5]),
     xlab="1st Principal Component",ylab="2nd Principal Component")

Generalized Boosted Regression Models. See vignette("gbm") for the technical details of the package (also available at ../doc/gbm.pdf if you are using HTML help). This package implements the generalized boosted modeling framework. Boosting is the process of iteratively adding basis functions in a greedy fashion so that each additional basis function further reduces the selected loss function. This implementation closely follows Friedman's Gradient Boosting Machine (Friedman, 2001). In addition to many of the features documented in the Gradient Boosting Machine, gbm offers additional features including the out-of-bag estimator for the optimal number of iterations, the ability to store and manipulate the resulting gbm object, and a variety of other loss functions that had not previously had associated boosting algorithms, including the Cox partial likelihood for censored data, the Poisson likelihood for count outcomes, and a gradient boosting implementation to minimize the AdaBoost exponential loss function. gbm.fit provides the link between R and the C++ gbm engine. gbm is a front end to gbm.fit that uses the familiar R modeling formulas. However, model.frame is very slow if there are many predictor variables. For power users with many variables use gbm.fit. For general practice gbm is preferable.
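The boosting idea behind gbm can be summarized in one update (a hedged summary of Friedman's gradient boosting machine, using the paper's notation rather than anything package-specific):

$$F_m(x)=F_{m-1}(x)+\nu\,h_m(x),\qquad h_m\approx\arg\min_{h}\sum_{i=1}^{n}L\bigl(y_i,\;F_{m-1}(x_i)+h(x_i)\bigr),$$

where each $h_m$ is a small regression tree (a weak learner) fitted to the negative gradient of the loss $L$, and $\nu$ is the learning rate (the shrinkage argument of gbm in the example below).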
gbm: a least squares regression example.
# create some data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model
gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6, # formula
    data=data, # dataset
    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 # 0: no monotone restrictions
    distribution="gaussian", # bernoulli, adaboost, gaussian,
                             # poisson, coxph, and quantile available
    n.trees=3000, # number of trees
    shrinkage=0.005, # shrinkage or learning rate,
                     # 0.001 to 0.1 usually work
    interaction.depth=3, # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5, # fraction of data for training,
                          # first train.fraction*N used for training
    n.minobsinnode = 10, # minimum total weight needed in each node
    cv.folds = 5, # do 5-fold cross-validation
    keep.data=TRUE, # keep a copy of the dataset with the object
    verbose=TRUE) # print out progress

# check performance using an out-of-bag estimator
# OOB underestimates the optimal number of iterations
best.iter <- gbm.perf(gbm1,method="OOB")
print(best.iter)
# check performance using 5-fold cross-validation
best.iter <- gbm.perf(gbm1,method="cv")
print(best.iter)
# plot variable influence
summary(gbm1,n.trees=1) # based on the first tree
summary(gbm1,n.trees=best.iter) # based on the estimated best number of trees
# compactly print the first and last trees for curiosity
print(pretty.gbm.tree(gbm1,1))
print(pretty.gbm.tree(gbm1,gbm1$n.trees))
# make some new data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE))
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)
data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# predict on the new data using the "best" number of trees
# f.predict will generally be on the canonical scale (logit, log, etc.)
f.predict <- predict.gbm(gbm1,data2,best.iter)
# least squares error
print(sum((data2$Y-f.predict)^2))
# create marginal plots
# plot variables X1, X2, X3 after the "best" number of iterations
par(mfrow=c(1,3))
plot.gbm(gbm1,1,best.iter)
plot.gbm(gbm1,2,best.iter)
plot.gbm(gbm1,3,best.iter)
par(mfrow=c(1,1))
# contour plot of variables 1 and 2 after the "best" iterations
plot.gbm(gbm1,1:2,best.iter)
# lattice plot of variables 2 and 3
plot.gbm(gbm1,2:3,best.iter)
# lattice plot of variables 3 and 4
plot.gbm(gbm1,3:4,best.iter)
# 3-way plots
plot.gbm(gbm1,c(1,2,6),best.iter,cont=20)
plot.gbm(gbm1,1:3,best.iter)
plot.gbm(gbm1,2:4,best.iter)
plot.gbm(gbm1,3:5,best.iter)
# do another 100 iterations
gbm2 <- gbm.more(gbm1,100, verbose=FALSE) # stop printing detailed progress

Fit a Cox model by likelihood based boosting (CoxBoost). In contrast to gradient boosting (implemented e.g. in the glmboost routine in the R package mboost, using the CoxPH loss function), CoxBoost is not based on gradients of loss functions, but adapts the offset-based boosting approach of Tutz and Binder (2007) for estimating Cox proportional hazards models. In each boosting step the previous boosting steps are incorporated as an offset in penalized partial likelihood estimation, which is employed to obtain an update for one single parameter, i.e., one covariate, in every boosting step. This results in sparse fits similar to Lasso-like approaches, with many estimated coefficients being zero. The main model complexity parameter, which has to be selected (e.g. by cross-validation using cv.CoxBoost), is the number of boosting steps stepno. The penalty parameter penalty can be chosen rather coarsely, either by hand or using optimCoxBoostPenalty. The advantage of the offset-based approach compared to gradient boosting is that the penalty structure is very flexible. In the present implementation this is used to allow for unpenalized mandatory covariates, which receive a very fast coefficient build-up in the course of the boosting steps, while the other (optional) covariates are subjected to penalization. For example, in a microarray setting the (many) microarray features would be taken to be optional covariates, and the (few) potential clinical covariates would be taken to be mandatory, by including their indices in unpen.index. If a group of correlated covariates has influence on the response, e.g. genes from the same pathway, componentwise boosting will often result in a non-zero estimate for only one member of this group. To avoid this, information on the connection between covariates can be provided in pendistmat. If then, in addition, a penalty updating scheme with stepsize.factor < 1 is chosen, connected covariates are more likely to be chosen in future boosting steps if a directly connected covariate has been chosen in an earlier boosting step (see Binder and Schumacher, 2009b).
library(CoxBoost)
# generate some survival data with 10 informative covariates
n <- 200; p <- 100
beta <- c(rep(1,10),rep(0,p-10))
x <- matrix(rnorm(n*p),n,p)
real.time <- -(log(runif(n)))/(10*exp(drop(x %*% beta)))
cens.time <- rexp(n,rate=1/10)
status <- ifelse(real.time <= cens.time,1,0)
obs.time <- ifelse(real.time <= cens.time,real.time,cens.time)
# fit a Cox proportional hazards model by CoxBoost
cbfit <- CoxBoost(time=obs.time,status=status,x=x,stepno=100,penalty=100)
summary(cbfit)
# ... with covariates 1 and 2 being mandatory
cbfit.mand <- CoxBoost(time=obs.time,status=status,x=x,unpen.index=c(1,2),
                       stepno=100,penalty=100)
summary(cbfit.mand)
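The description above says stepno should be chosen by cross-validation via cv.CoxBoost; a minimal sketch of that step (the argument and component names beyond time/status/x are my assumptions about the CoxBoost interface and should be checked against its help page):

cv.res <- cv.CoxBoost(time=obs.time,status=status,x=x,
                      maxstepno=200,K=10,penalty=100) # 10-fold CV; assumed interface
cbfit.cv <- CoxBoost(time=obs.time,status=status,x=x,
                     stepno=cv.res$optimal.step, # assumed name of the selected step count
                     penalty=100)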
Generalized additive model by likelihood based boosting (GAMBoost). GAMBoost is used to fit a generalized additive model by likelihood based boosting. It is especially suited for models with a large number of predictors with potentially non-linear influence. It provides smooth function estimates of covariate influence functions together with confidence bands and approximate degrees of freedom.

## generate some data
n <- 100; p <- 8; q <- 2
# covariates with non-linear (smooth) effects
x <- matrix(runif(n*p,min=-1,max=1),n,p)
# binary covariates
x.linear <- matrix(round(runif(n*q,min=0,max=1)),n,q)
# the 1st and 3rd smooth covariates and the 1st linear covariate are informative
eta <- -0.5 + 2*x[,1] + 2*x[,3]^2 + x.linear[,1]-.5
y <- rbinom(n,1,binomial()$linkinv(eta))
## fit a model with just smooth components
gb1 <- GAMBoost(x,y,penalty=500,stepno=100,family=binomial(),trace=TRUE)
# inspect the AIC for a minimum
plot(gb1$AIC)
# still falling at boosting step 100, so we need more steps
# or a smaller penalty (use 'optimGAMBoostPenalty' for
# automatic penalty optimization)
## include two binary covariates as mandatory without penalty
## (appropriate for example for 'treatment/control'),
## modelled as 'linear' predictors
gb2 <- GAMBoost(x,y,penalty=200,
                x.linear=x.linear,penalty.linear=0,
                stepno=100,family=binomial(),trace=TRUE)
## include the first binary covariate as mandatory and the second
## as optional (e.g. 'treatment/control' and 'female/male')
gb3 <- GAMBoost(x,y,penalty=200,
                x.linear=x.linear,penalty.linear=c(0,100),
                stepno=100,family=binomial(),trace=TRUE)
# get a summary with fitted covariates and estimates for
# the parametric components
summary(gb3)
# extract the boosted components at the 'optimal' boosting step
selected <- getGAMBoostSelected(gb3,at.step=which.min(gb3$AIC))
# plot all smooth components at the final boosting step
par(mfrow=c(2,4))
plot(gb3)
# plot the smooth components for which the null line is not inside the bands
# at the 'optimal' boosting step (determined by AIC)
par(mfrow=c(1,length(selected$smoothbands)))
plot(gb3,select=selected$smoothbands,at.step=which.min(gb3$AIC))
## fit a generalized linear model for comparison
x.linear <- cbind(x,x.linear)
gb4 <- GAMBoost(x=NULL,y=y,x.linear=x.linear,penalty.linear=100,
                stepno=100,trace=TRUE,family=binomial())
# compare with the generalized additive model fit
plot(0:100,gb3$AIC,type="l",xlab="stepno",ylab="AIC")
lines(0:100,gb4$AIC,lty=2)
## fit a generalized linear model with penalty modification
## after every boosting step, with penalty changes being
## redistributed via a connection matrix
pendistmat <- matrix(0,10,10)
# covariates 1 and 3 are connected
pendistmat[1,3] <- pendistmat[3,1] <- 1
gb5 <- GAMBoost(x=NULL,y=y,x.linear=x.linear,penalty.linear=100,
                stepsize.factor.linear=0.9,pendistmat.linear=pendistmat,
                stepno=100,trace=TRUE,family=binomial())
# or alternatively
gb5 <- GAMBoost(x=NULL,y=y,x.linear=x.linear,penalty.linear=100,
                stepsize.factor.linear=0.9,
                pendistmat.linear=pendistmat[c(1,3),c(1,3)],
                connected.index.linear=c(1,3),
                stepno=100,trace=TRUE,family=binomial())
Functional Data Analysis in R (fda):
library(fda)
## simple smoothing
girlGrowthSm <- with(growth, smooth.basisPar(argvals=age, y=hgtf, lambda=0.1))
plot(girlGrowthSm$fd, xlab="age", ylab="height (cm)",
     main="Girls in Berkeley Growth Study")
plot(deriv(girlGrowthSm$fd), xlab="age", ylab="growth rate (cm / year)",
     main="Girls in Berkeley Growth Study")
plot(deriv(girlGrowthSm$fd, 2), xlab="age",
     ylab="growth acceleration (cm / year^2)",
     main="Girls in Berkeley Growth Study")
## a simple basis
bspl1.2 <- create.bspline.basis(norder=1, breaks=c(0,.5, 1))
plot(bspl1.2)
# 2 bases, order 1 = degree 0 = step functions:
# (1) constant 1 between 0 and 0.5 and 0 otherwise
# (2) constant 1 between 0.5 and 1 and 0 otherwise
fd1.2 <- Data2fd(0:1, basisobj=bspl1.2)
op <- par(mfrow=c(2,1))
plot(bspl1.2, main='bases')
plot(fd1.2, main='fit')
par(op)
# a step function: 0 up to time=0.5, then 1 afterwards

Weighted least squares: the weights option of lm. Generalized linear models: see also the sandwich package.
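A minimal sketch of those two closing pointers (the variables x and y are illustrative; coeftest comes from the lmtest package, which sandwich is usually paired with):

x <- 1:50; y <- 2*x + rnorm(50, sd = x)   # variance growing with x
a.wls <- lm(y ~ x, weights = 1/x^2)       # weighted least squares via the weights option
summary(a.wls)
library(sandwich); library(lmtest)
coeftest(a.wls, vcov = vcovHC(a.wls))     # heteroskedasticity-consistent standard errors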