
R Lecture Notes (Including All Kinds of Regression)
Why R?
- Free (no licence fees and no commercial interests).
- Open source: the code can be read and modified (not a black box and not a miser; transparency is the best guard against "corruption").
- Easy to learn, and programmable, so complex projects can be carried out.
- Extensible: thousands of contributed packages, covering different fields, purposes, and methods, help you reach your goal; you can also contribute your own methods.
- Powerful (graphics, an excellent built-in help system, the support of the R community, constant updates and fixes).
- No commercial package offers so many algorithms, or such new ones.
- Applied statisticians around the world mostly implement their methods in R first and try to put them on the R website.
- In a little over a year the number of packages on the R site roughly tripled, from nearly 1,000 to over 3,000; most come with functions and example data for computation, demonstration, and input/output.
- Unless it receives a huge subsidy (or uses pirated copies forever), a public university has no reason to teach commercial software.
- The great majority of American statistics graduate students know R (Berkeley teaches R to its statistics and applied mathematics undergraduates).
- A large part of my own data-analysis knowledge comes from R.
- If I could learn it, and promote and popularize it everywhere, you will surely do better!

Downloading R: at http://www.r-project.org/ click CRAN to get a list of mirror sites; pick a mirror (Berkeley, say), choose base to download the installer, and download packages from the same page.
Packages (each comes with plenty of data and with functions/programs you can read, run, and modify):
- base: The R Base Package
- boot: Bootstrap R (S-Plus) Functions (Canty)
- class: Functions for Classification
- cluster: Cluster Analysis Extended Rousseeuw et al.
- concord: Concordance and reliability
- datasets: The R Datasets Package
- exactRankTests: Exact Distributions for Rank and Permutation Tests
- foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ...
- graphics: The R Graphics Package
- grDevices: The R Graphics Devices and Support for Colours and Fonts
- grid: The Grid Graphics Package
- KernSmooth: Functions for kernel smoothing for Wand & Jones (1995)
- lattice: Lattice Graphics
- MASS: Main Package of Venables and Ripley's MASS
- methods: Formal Methods and Classes
- mgcv: GAMs with GCV smoothness estimation and GAMMs by REML/PQL
- multtest: Resampling-based multiple hypothesis testing
- nlme: Linear and nonlinear mixed effects models
- nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models
- nortest: Tests for Normality
- outliers: Tests for outliers
- pls: Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR)
- pls.pcr: PLS and PCR functions
- rpart: Recursive Partitioning
- SAGx: Statistical Analysis of the GeneChip
- sma: Statistical Microarray Analysis
- spatial: Functions for Kriging and Point Pattern Analysis
- splines: Regression Spline Functions and Classes
- stats: The R Stats Package
- stats4: Statistical Functions using S4 Classes
- survival: Survival analysis, including penalised likelihood
- tcltk: Tcl/Tk Interface
- tools: Tools for Package Development
- utils: The R Utils Package

Many more packages are available online, and all of them can be downloaded freely. The packages shipped with base R contain the most commonly used functions and data; the other packages contain methods and data developed by statisticians in every direction of the field. I hope you will be the next author to join these packages.

Installing packages:
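A minimal sketch of the usual workflow, using quantreg as an arbitrary example package (any CRAN package name works the same way):

install.packages("quantreg")   # download and install from a CRAN mirror
library(quantreg)              # load the package into the current session
update.packages()              # update all installed packages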
Save the workspace on exit?
If you answer yes, your results (assigned variables, functions, and so on) are saved in a file named .RData and reloaded the next time you start R; delete that file if you do not want them back. In practice, unless you are working on a large project that spans many sessions, you usually do not want to save. Keep the code you used as a script file instead (*.R; note that you must type the ".R" yourself, there is no default extension).

A few useful functions (a function is written name(arguments), e.g. f(x)):
getwd()
setwd(dir = "f:/2010stat") # or setwd("f:/2010stat")
getwd()
x=rnorm(100)
ls()
?rnorm # or help(rnorm)
apropos("norm")
identical(1:10,1:10)
identical(1:10,as.numeric(1:10))
identical(1:10,as.integer(1:10))

Assignment and arithmetic:
z = rnorm(100)
median(z)
# assignment: "=" may be replaced by "<-"; "->" assigns to the right
x <- z -> y -> w
Simple arithmetic operators: +, -, *, /, ^, %*% (matrix product), %% (mod), %/% (integer division), and so on.
Common mathematical functions: abs, sign, log, log2, log10, logb, expm1, log1p, sqrt, exp, sin, cos, tan, acos, asin, atan, cosh, sinh, tanh;
round, floor, ceiling;
gamma, lgamma, digamma and trigamma;
sum, prod, cumsum, cumprod;
max, min, cummax, cummin, pmax, pmin, range;
mean, length, var, duplicated, unique;
union, intersect, setdiff;
comparison and logical operators: <, <=, >, >=, ==, !=, !, &, | (listed roughly from high to low precedence).

Some basic examples:
x=1:100
(x=1:100)
sample(x,20)
set.seed(0);sample(1:10,3) # the random seed!
z=sample(1:100)
z[1:10] # vector subscripts
y=c(1,3,7,3,4,2)
z[y]

z=sample(x,20,rep=T)
z
(z1=unique(z));length(z1)
z=sample(x,100,rep=T)
xz=setdiff(x,z)
sort(union(xz,z))
sort(union(xz,z))==x
setequal(union(xz,z),x)
intersect(1:10,7:50)
sample(1:100,20,prob=1:100)

pi * 10^2 # see ?"*" for the basic arithmetic operators
"*"(pi, "^"(10, 2))
pi * (1:10)^2
x <- pi * 10^2
x
print(x)
(x=pi * 10^2)
pi^(1:5)
print(x, digits = 12)
class(x)
typeof(x)

class(cars)
typeof(cars)
names(cars)
summary(cars)
str(cars)
row.names(cars)
class(dist ~ speed)
plot(dist ~ speed,cars)

head(cars) # cars[1:6,]
tail(cars)
ncol(cars);nrow(cars)
dim(cars)
lm(dist ~ speed, data = cars)
cars$qspeed = cut(cars$speed, breaks = quantile(cars$speed), include.lowest = TRUE)
names(cars)
cars[3]
table(cars[3])
is.factor(cars$qspeed)
plot(dist ~ qspeed, data = cars)
(a=lm(dist ~ qspeed, data = cars))
summary(a)

x <- round(runif(20,0,20), digits=2)
summary(x)
min(x);max(x)
median(x) # median
mean(x) # mean
var(x) # variance
sd(x) # standard deviation
sqrt(var(x))
rank(x) # ranks
order(x)
x[order(x)]
sort(x)
sort(x,decreasing=T) # sort(x,dec=T)
sum(x);length(x)
round(x)

fivenum(x) # five-number summary
quantile(x) # quantiles (a different convention; several definitions exist)
quantile(x, c(0,.33,.66,1))
mad(x) # median absolute deviation (scaled); see ?mad
cummax(x)
cummin(x)
cumprod(x)
cor(x,sin(x/20)) # correlation

# histogram
x <- rnorm(200)
hist(x, col = "light blue")
rug(x)
# stem-and-leaf plot
stem(x)
# scatterplot
N <- 500
x <- rnorm(N)
y <- x + rnorm(N)
plot(y ~ x)
a=lm(y~x)
abline(a,col="red") # or abline(lm(y~x),col="red")
print("Hello World!")
paste("min of x = ", min(x))
#cat("\\end{document}\n", file="RESULT.tex", append=TRUE)
demo(graphics) # graphics demonstration

# complex arithmetic
x=2+3i
(z <- complex(real=rnorm(10), imaginary=rnorm(10)))
complex(re=rnorm(3),im=rnorm(3))
Re(z)
Im(z)
Mod(z)
Arg(z)
choose(3,2);factorial(6)
# solving an equation
f = function(x) x^3-2*x-1
uniroot(f,c(0,2)) # iterative root finding
# if the root is known to be an extremum
f = function(x) x^2+2*x+1
optimize(f,c(-2,2))

Distributions and random numbers:
Normal: pnorm(1.2,2,1); dnorm(1.2,2,1); qnorm(.7,2,1); rnorm(10,0,1) # rnorm(10)
t: pt(1.2,1); dt(1.2,2); qt(.7,1); rt(10,1)
There are also the exponential, F, chi-square, Beta, binomial, Cauchy, Gamma, geometric, hypergeometric, log-normal, logistic, negative binomial, Poisson, uniform, Weibull, and Wilcoxon distributions, among others. The arguments may be vectors!
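As a brief illustration (a hedged sketch; all of these are standard base R functions), the same d/p/q/r naming convention works for every distribution just listed:

dpois(2, lambda = 4)              # P(X = 2) for Poisson with mean 4
pexp(1, rate = 2)                 # P(X <= 1) for Exponential(rate 2)
qchisq(0.95, df = 3)              # 0.95 quantile of chi-square with 3 df
rbinom(5, size = 10, prob = 0.3)  # five draws from Binomial(10, 0.3)
pnorm(c(-1.96, 0, 1.96))          # vector argument: three probabilities at once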
A pitfall you may run into:
a=factor(letters[1:10])
a[3]="w" # fails: "w" is not one of the levels (gives NA with a warning)
a=as.character(a)
a[3]="w"
a=factor(a)
a

Reading and writing data:
x=scan()
1.5 2.6 3.7 2.1 8.9 12 -1.2 -4
# equivalent to x=c(1.5,2.6,3.7,2.1,8.9,12,-1.2,-4)
w=read.table(file.choose(),header=T) # choose the file from a dialog
setwd("f:/2010stat") # or setwd("f:\\2010stat")
(x=rnorm(20))
write(x,"f:/2010stat/test.txt")
y=scan("f:/2010stat/test.txt");y
y=y[1:5];str(y)
write.table(y,"f:/2010stat/test.txt",row.names=F)
w=read.table("f:/2010stat/test.txt",header=T)
str(w)
write.csv(y,"f:/2010stat/test.csv")
v=read.csv("f:/2010stat/test.csv")
str(v)
data=read.table("clipboard")
write.table(data,"clipboard")

Sequences and vectors:
z=seq(-1,10,length=100) # z=seq(-1,10,len=100)
z=seq(10,-1,-1) # z=10:-1
x=rep(1:3,3)
x=rep(3:5,1:3)
x=rep(c(1,10),c(4,5))
w=c(1,3,x,z);w[3]
x=rep(0,10);z=1:3;x+z
x*z
rev(x)
z=c("no cat","has ","nine","tails")
z[1]=="no cat"
z=1:5
z[7]=8;z
z=NULL
z[c(1,3,5)]=1:3; z
rnorm(10)[c(2,5)]
z[-c(1,3)] # drop elements 1 and 3
z=sample(1:100,10);z
which(z==max(z)) # gives the index

Vectors and matrices:
x=sample(1:100,12);x
all(x>0);all(x!=0);any(x>0);(1:10)[x>0]
diff(x)
diff(x,lag=2)
x=matrix(1:20,4,5);x
x=matrix(1:20,4,5,byrow=T);x
t(x)
x=matrix(sample(1:100,20),4,5)
2*x
x+5
y=matrix(sample(1:100,20),5,4)
x+t(y)
(z=x%*%y)
z1=solve(z) # solve(a,b) solves the equation ax=b
z1%*%z
round(z1%*%z,14)

Matrices:
nrow(x); ncol(x);dim(x) # numbers of rows and columns
x=matrix(rnorm(24),4,6)
x[c(2,1),] # rows 2 and 1
x[,c(1,3)] # columns 1 and 3
x[2,1] # element [2,1]
x[x[,1]>0,1] # elements of column 1 that are greater than 0
sum(x[,1]>0) # how many elements of column 1 are greater than 0
sum(x[,1]<=0) # how many elements of column 1 are not greater than 0
x[,-c(1,3)] # x without columns 1 and 3
diag(x)
diag(1:5)
diag(5)
x[-2,-c(1,3)] # x without row 2 and columns 1 and 3
x[x[,1]>0 & x[,3]<=1,1] # column-1 elements where column 1 > 0 AND column 3 <= 1
x[x[,2]>0 | x[,1]<.51,1] # column-1 elements where column 1 < .51 OR column 2 > 0
x[!x[,2]<.51,1] # column-1 elements where column 2 is NOT < .51
apply(x,1,mean);apply(x,2,sum)

Matrices / higher-dimensional arrays:
# upper and lower triangles
x=matrix(rnorm(24),4,6)
diag(x)
diag(1:5)
diag(5)
x[lower.tri(x)]=0 # x[upper.tri(x)]=0; diag(x)=0
x=array(runif(24),c(4,3,2));x
is.matrix(x) # dim(x) gives the dimensions (4,3,2)
is.matrix(x[1,,])
x=array(1:24,c(4,3,2))
x[c(1,3),,]
x=array(1:24,c(4,3,2))
apply(x,1,mean)
apply(x,1:2,sum)
apply(x,c(1,3),prod)

Matrices / arrays / scale:
# operations between a matrix and a vector
x=matrix(1:20,5,4)
sweep(x,1,1:5,"*")
x*1:5
sweep(x,2,1:4,"+")
(x=matrix(sample(1:100,24),6,4));(x1=scale(x))
(x2=scale(x,scale=F)); (x3=scale(x,center=F))
round(apply(x1,2,mean),14)
apply(x1,2,sd)
round(apply(x2,2,mean),14);apply(x2,2,sd)
round(apply(x3,2,mean),14);apply(x3,2,sd)

Data frames:
x=matrix(1:6,2,3)
z=data.frame(x);z
z$X2
attributes(z)
names(z)=c("TOYOTA","GM","HUNDA")
row.names(z)=c("2001","2002")
z
attach(z)
GM
detach(z)
GM # error: GM is no longer visible
sapply(z,is.numeric) # apply(z,2,is.numeric)

Missing values and friends:
airquality
complete.cases(airquality) # which rows have no missing values
which(complete.cases(airquality)==F)
sum(complete.cases(airquality))
na.omit(airquality)
# append, cbind, rbind
x=1:10;x[12]=3
(x1=append(x,77,after=5))
cbind(1:3,4:6);rbind(1:3,4:6)
# remove duplicated rows of a matrix
(x=rbind(1:5,runif(5),runif(5),1:5,7:11))
x[!duplicated(x),]
unique(x)

Lists:
# a list can collect objects of any kind (including lists)
z=list(1:3,Tom=c(1:2, a=list("R",letters[1:5]),w="hi!"))
z[[1]];z[[2]]
z$T # partial matching of the name Tom
z$T$a2
z$T[[3]]
z$T$w
attributes(airquality) # attributes!
airquality$Ozone
attributes(matrix(1:6,2,3))

Categorical data. A survey asks people if they smoke or not. The data are Yes, No, No, Yes, Yes:
x=c("Yes","No","No","Yes","Yes")
table(x);x
factor(x)

Barplot: suppose a group of 25 people is surveyed about their beer-drinking preference. The categories are (1) Domestic can, (2) Domestic bottle, (3) Microbrew and (4) Import.
The raw data are 3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1:
beer = scan()
3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1

barplot(beer) # this isn't correct
barplot(table(beer)) # yes, call with summarized data
barplot(table(beer)/length(beer)) # divide by n for proportions
table(beer)/length(beer)

Tables / categorical data:
library(MASS)
quine
attach(quine)
table(Age)
table(Sex, Age); tab=xtabs(~ Sex + Age, quine); unclass(tab)
tapply(Days, Age, mean)
tapply(Days, list(Sex, Age), mean)
# apply, sapply, tapply, lapply
smokes = c("Y","N","N","Y","N","Y","Y","Y","N","Y")
amount = c(1,2,2,3,3,1,2,1,3,2)
(tmp=table(smokes,amount)) # store the table
options(digits=3) # only print 3 decimal places
prop.table(tmp,1) # the rows sum to 1 now
prop.table(tmp,2) # the columns sum to 1 now
# the two lines above are equivalent to the two below
sweep(tmp, 1, margin.table(tmp, 1), "/")
sweep(tmp, 2, margin.table(tmp, 2), "/")
prop.table(tmp) # all the numbers sum to 1
options(digits=7) # restore the number of digits

array/matrix, table, and data.frame:
## Start with a contingency table.
ftable(Titanic, row.vars = 1:3)
ftable(Titanic, row.vars = 1:2)
w=data.frame(Titanic) # turn the array into a data frame
a=xtabs(Freq~Survived+Sex, w)
biplot(corresp(a, nf=2)) # one application (correspondence analysis from MASS)
## Start with a data frame.
str(mtcars)
x <- ftable(mtcars[c("cyl", "vs", "am", "gear")])
x # an array whose dimensions are ordered ("cyl", "vs", "am", "gear")
ftable(x, row.vars = c(2, 4)) # choose the table's row variables from x (an array)
## Start with expressions, use table()'s "dnn" to change labels
ftable(mtcars$cyl, mtcars$vs, mtcars$am, mtcars$gear, row.vars = c(2, 4),
       dnn = c("Cylinders", "V/S", "Transmission", "Gears"))
ftable(vs~carb,mtcars) # vs gives the columns, carb the rows
# or ftable(mtcars$vs~mtcars$carb)
ftable(carb~vs,mtcars) # vs gives the rows, carb the columns
ftable(mtcars[,c(8,11)]) # equivalent to ftable(carb~vs,mtcars) above
ftable(breaks~wool+tension,warpbreaks)
# as.data.frame
(DF <- as.data.frame(UCBAdmissions)) # equivalent to data.frame(UCBAdmissions)
xtabs(Freq ~ Admit+ Gender + Dept, DF) # back from the data frame to the contingency table
(a=xtabs( Freq~ Admit + Gender, data=DF)) # with no frequencies (weights), the left side is left empty

Writing functions:
ss=function(n=100){z=2;for (i in 2:n)if(any(i%%2:(i-1)==0)==F)z=c(z,i);return(z)} # primes up to n
fix(ss)
ss()
t1=Sys.time()
ss(10000)
Sys.time()-t1
system.time(ss(10000))
# return may be omitted, in which case the last value is the returned value;
# to return several values it is best to use a list

Several plots on one page:
par(mfrow=c(2,4)) # par(mfcol=c(2,4))
layout(matrix(c(1,1,1,2,3,4,2,3,4),nr=3,byrow=T))
hist(rnorm(100),col="Red",10)
hist(rnorm(100),col="Blue",8)
hist(rnorm(100),col="Green")
hist(rnorm(100),col="Brown")
# par(mar = c(bottom, left, top, right)) sets the margins
# default c(5, 4, 4, 2) + 0.1 (in lines of text)

About plotting:
spring= data.frame(compression=c(41,39,43,53,42,48,47,46),
                   distance=c(120,114,132,157,122,144,137,141))
attach(spring) # (Hooke's law: f=.5ks)
par(mfcol=c(2,2))
plot(distance ~ compression)
plot(distance ~ compression,type="l")
plot(compression, distance,type="o")
plot(compression, distance,type="b")

par(mfrow=c(2,2)) # prepare a 2x2 layout of 4 plots
plot(compression, distance,main= "Hooke's Law") # title only
plot(compression, distance,main= "Hooke's Law", xlab= "x",ylab= "y") # title + axis labels
identify(compression,distance) # label points by clicking on them
plot(compression, distance,main="Hooke's Law") # title only
text(46,120, expression(f==frac(1,2)*k*s)) # write text at a given position
plot(compression, distance,main="Hooke's Law") # title only
text(locator(2), c("I am here!","you are there!")) # write text at the two clicked positions
par(mfrow=c(1,1))
plot(1:10,sin(1:10),type="l",lty=2,col=4,
     main=paste(strwrap("The title is too long, and I hate to make it shorter, !@#$%^&*",width=50),collapse="\n"))
legend(1.2,1.0,"Just a sine",lty=2,col=4)

library(MASS);data(Animals);attach(Animals)
par(mfrow=c(2,2))
plot(body, brain)
plot(sqrt(body), sqrt(brain))
plot((body)^0.1, (brain)^0.1)
plot(log(body),log(brain)) # or plot(brain~body,log="xy")
par(mfrow=c(1,1))
par(cex=0.7,mex=0.7) # character (cex) & margin (mex) expansion
plot(log(body),log(brain))
text(x=log(body), y=log(brain),labels=row.names(Animals), adj=1.5) # adj=0 implies left adjusted text
plot(log(body),log(brain))
identify(log(body),log(brain),row.names(Animals))

Plotting (symbols: colour, size, shape):
plot(1,1,xlim=c(1,7.5),ylim=c(0,5),type="n") # do not plot points
points(1:7,rep(4.5,7),cex=seq(1,4,l=7),col=1:7, pch=0:6)
text(1:7,rep(3.5,7),labels=paste(0:6,letters[1:7]),cex=seq(1,4,l=7), col=1:7)
points(1:7,rep(2,7), pch=(0:6)+7) # plot symbols 7 to 13
text((1:7)+0.25, rep(2,7), paste((0:6)+7)) # label with symbol number
points(1:7,rep(1,7), pch=(0:6)+14) # plot symbols 14 to 20
text((1:7)+0.25, rep(1,7), paste((0:6)+14)) # labels with symbol number
# palettes
par(mfrow=c(2,4))
palette(); barplot(rnorm(15,10,3),col=1:15)
palette(rainbow(15));barplot(rnorm(15,10,3),col=1:15)
palette(heat.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(terrain.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(topo.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(cm.colors(15));barplot(rnorm(15,10,3),col=1:15)
palette(gray(seq(0,0.9,l=15)));barplot(rnorm(15,10,3),col=1:15)
palette(grey(seq(0,0.5,l=15)));barplot(rnorm(15,10,3),col=1:15)
palette("default")
par(mfrow=c(1,1))

# matplot
sines=outer(1:20,1:4,function(x, y) sin(x/20*pi*y))
matplot(sines, pch = 1:4, type = "o", col = rainbow(ncol(sines)))
# legend
x <- seq(-pi, pi, len = 65)
plot(x, sin(x), type = "l", ylim = c(-1.2, 1.8), col = 3, lty = 2)
points(x, cos(x), pch = 3, col = 4)
lines(x, tan(x), type = "b", lty = 1, pch = 4, col = 6)
title("legend(..., lty = c(2, -1, 1), pch = c(-1,3,4), merge = TRUE)",
      cex.main = 1.1)
legend(-1, 1.9, c("sin", "cos", "tan"), col = c(3,4,6),
       lty = c(2, -1, 1), pch = c(-1, 3, 4), merge = TRUE, bg='gray90')

# barplot and table
par(mfrow=c(2,2))
tN=table(Ni=rpois(100, lambda=5));tN
r=barplot(tN, col='gray')
lines(r, tN, type='h', col='red', lwd=2) # type = "h" plotting *is* 'bar'plot
barplot(tN, space = 1.5, axisnames=FALSE,
        sub = "barplot(..., space=1.5, axisnames = FALSE)") # space=1.5 leaves gaps between bars
barplot(tN, space = 0, axisnames=FALSE,
        sub = "barplot(..., space=0, axisnames = FALSE)")
pie(tN) # pie chart
par(mfrow=c(1,1))
# adding a grid
plot(1:3)
grid(10, 5, lwd = 2)

Plotting (pairs / three dimensions):
# pairs # data(iris)
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21,
      bg = c("red", "green3", "blue")[unclass(iris$Species)])
# iris is 150x5; this is the scatterplot matrix of the 4 numeric variables
# (the last column is the factor iris$Species)
# stars # data(mtcars)
stars(mtcars[, 1:7], key.loc = c(14, 1.5),
      main = "Motor Trend Cars : full stars()", flip.labels=FALSE)
# mtcars is 32x11; only the first 7 numeric variables are used
# persp
x <- seq(-10, 10, length= 30)
y <- x
f <- function(x,y) { r <- sqrt(x^2+y^2); 10 * sin(r)/r }
z <- outer(x, y, f)
z[is.na(z)] <- 1
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue")
data(volcano)
par(mfrow=c(2,2))
z <- 2 * volcano # exaggerate the relief
x <- 10 * (1:nrow(z)) # 10 metre spacing (S to N)
y <- 10 * (1:ncol(z)) # 10 metre spacing (E to W)
## Don't draw the grid lines: border = NA
# par(bg = "slategray")
persp(x, y, z, theta = 135, phi = 30, col = "green3", scale = FALSE,
      ltheta = -120, shade = 0.75, border = NA, box = FALSE)
par(bg= "white")

Plotting (three dimensions):
# contour
rx <- range(x <- 10*1:nrow(volcano))
ry <- range(y <- 10*1:ncol(volcano))
ry <- ry + c(-1,1) * (diff(rx) - diff(ry))/2
tcol <- terrain.colors(12)
opar <- par(pty = "s", bg = "lightcyan")
plot(x = 0, y = 0, type = "n", xlim = rx, ylim = ry, xlab = "", ylab = "")
u <- par("usr")
rect(u[1], u[3],
     u[2], u[4], col = tcol[8], border = "red") # rect draws a rectangle
contour(x, y, volcano, col = tcol[2], lty = "solid", add = TRUE,
        vfont = c("sans serif", "plain"))
title("A Topographic Map of Maunga Whau", font = 4)
abline(h = 200*0:4, v = 200*0:4, col = "lightgray", lty = 2, lwd = 0.1)
par(opar)
# image
x <- 10*(1:nrow(volcano))
y <- 10*(1:ncol(volcano))
image(x, y, volcano, col = terrain.colors(100), axes = FALSE)
contour(x, y, volcano, levels = seq(90, 200, by=5), add = TRUE, col = "peru")
axis(1, at = seq(100, 800, by = 100))
axis(2, at = seq(100, 600, by = 100))
box()
title(main = "Maunga Whau Volcano", font.main = 4)
par(mfrow=c(1,1))

Working with several windows:
x11()
plot(1:10)
x11()
plot(rnorm(10))
dev.set(dev.prev())
abline(0,1) # through the 1:10 points
dev.set(dev.next())
abline(h=0, col="gray") # for the residual plot
dev.set(dev.prev())
dev.off(); dev.off() # close the two X devices
# dev.list()

Miscellaneous plots:
# simulated Brownian motion
n=100;x=cumsum(rnorm(100));y=cumsum(rnorm(100));plot(x,y,type="l")
x=0;y=0;plot(x,y,ylim=c(-15,15),xlim=c(-15,15)) # in slow motion
for(i in 1:200){x1=x+rnorm(1);y1=y+rnorm(1)
  segments(x,y,x1,y1);x=x1;y=y1
  Sys.sleep(.05)}
# point size proportional to the response
x=1:10;y=runif(10)
symbols(x,y,circle=y/2,inches=F,bg=x)
# a normal Q-Q plot for every column of a data frame
table=data.frame(x1=rnorm(100),x2=rnorm(100,1,1))
par(ask=TRUE) # wait for confirmation before changing pages
results=apply(table,2,qqnorm)
par(ask=FALSE)
# add a small plot inside a plot
x=rnorm(100)
hist(x)
op=par(fig=c(.02,.5,.5,.98),new=TRUE)
boxplot(x)
# mathematical annotation
x=1:10;plot(x,type="n")
text(3,2,expression(paste("Temperature(",degree,"C) in 2003")))
text(4,4,expression(bar(x)==sum(frac(x[i],n),i==1,n)))
text(6,6,expression(hat(beta)==(X^t*X)^{-1}*X^t*y))
text(8,8,expression(z[i]==sqrt(x[i]^2+y[i]^2)))

Changing case:
x=c("I","am","A","BIG", "Cat")
tolower(x)
toupper(x)

Lecture notes on statistical models in R

# basics
x=rnorm(20,10)
t.test(x,m=9,alt="greater")
t.test(x[1:10],m=9,alt="greater")$p.value
t.test(x,conf.level=.90)$conf
x=rnorm(30,10);y=rnorm(30,10.1)
t.test(x,y,alt="less")
library(TeachingDemos)
ci.examp()
run.ci.examp()
vis.boxcox()
vis.boxcoxu()

Regression and correlation:
# correlation
x=rnorm(20);y=rnorm(20)
cor(x,y)
cor(x,y,method="kendall"); cor(x,y,method="spearman")
cor.test(x,y); cor.test(x,y,method="kendall"); cor.test(x,y,method="spearman")
cor.test(x,y,method="kendall")$p.value
# correlated or not?
x=rnorm(3);y=rnorm(3);cor(x,y);cor.test(x,y)$p.value
library(TeachingDemos)
put.points.demo()

Basic principles:
set.seed(100)
x1=rnorm(100);x2=rnorm(100);eps=rnorm(100)
y=5+2*x1-3*x2+eps
a=lm(y~x1+x2)
(lm(y~0+x1+x2)) # no intercept: equivalent to (lm(y~-1+x1+x2))
summary(a);anova(a)
names(a)
shapiro.test(a$res)
qqnorm(a$res);qqline(a$res)
# the underlying algebra
x=cbind(1,x1,x2)
dim(x)
b=solve(t(x)%*%x)%*%t(x)%*%y
b
a$coe

Example 1: cross.txt (the slide shows a scatterplot of y against x):
w=read.table("cross.txt",header=T)
head(w)
plot(y~x,w);summary(w)
a=lm(y~x+z,w)
summary(a)
anova(a)
qqnorm(a$res);qqline(a$res)
shapiro.test(a$res)
a1=lm(y~x*z,w)
summary(a1);anova(a1)
qqnorm(a1$res);qqline(a1$res)
shapiro.test(a1$res)
anova(a,a1)
library(party) # a simpler route
wt=mob(y~x|z,data=w)
coef(wt);plot(wt)
plot(y~x,w);abline(coef(wt)[1,],col=2);abline(coef(wt)[2,],col=4)

Poison experiment. The data give the survival times (in 10 hour units) in a 3 x 4 factorial experiment, the factors being (a) three poisons and (b) four treatments. Each combination of the two factors is used for four animals, the allocation to animals being completely randomized. Box, G. E. P., and Cox, D. R. (1964). An analysis of transformations (with Discussion). J. R. Statist. Soc.
B, 26, 211-252. http://www.statsci.org/data/general/poison.html

Example 2: poison.txt — 3 poisons, 4 treatments, 48 observations on animals. (The slide shows the pairs plot of Time, Poison and Treatment.)

Regression:
setwd("f:/2010stat")
w=read.table("poison.txt",head=T)
head(w);tail(w)
str(w);summary(w)
dim(w)
w$Poison=factor(w$Poison)
w$Treatment=factor(w$Treatment)
pairs(w)
# straightforward regression
a=lm(Time~Poison*Treatment,w)
anova(a)
a=lm(Time~.,w)
anova(a)
qqnorm(a$res);qqline(a$res)
shapiro.test(a$res)
# a transformation
a=lm(1/Time~Poison+Treatment,w)
anova(a)
qqnorm(a$res);qqline(a$res)
shapiro.test(a$res)
summary(a)

Stepwise regression (step, or stepAIC from MASS):
library(MASS)
quine.hi <- aov(log(Days + 2.5) ~ .^4, quine)
quine.nxt <- update(quine.hi, . ~ . - Eth:Sex:Age:Lrn)
quine.stp <- stepAIC(quine.nxt,
    scope = list(upper = ~Eth*Sex*Age*Lrn, lower = ~1), trace = FALSE)
quine.stp$anova
cpus1 <- cpus
attach(cpus)
for(v in names(cpus)[2:7])
  cpus1[[v]] <- cut(cpus[[v]], unique(quantile(cpus[[v]])), include.lowest = TRUE)
detach()
cpus0 <- cpus1[, 2:8] # excludes names, authors' predictions
cpus.samp <- sample(1:209, 100)
cpus.lm <- lm(log10(perf) ~ ., data = cpus1[cpus.samp,2:8])
cpus.lm2 <- stepAIC(cpus.lm, trace = FALSE)
cpus.lm2$anova
example(birthwt)
birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)
birthwt.step <- stepAIC(birthwt.glm, trace = FALSE)
birthwt.step$anova
birthwt.step2 <- stepAIC(birthwt.glm, ~ .^2 + I(scale(age)^2) + I(scale(lwt)^2),
                         trace = FALSE)
birthwt.step2$anova
quine.nb <- glm.nb(Days ~ .^4, data = quine)
quine.nb2 <- stepAIC(quine.nb)
quine.nb2$anova

(Slides: transforming and fitting the main effects; interpreting the results.)

When the variance is not constant, consider:
- a Box-Cox transformation;
- weighted least squares (e.g. weights = 1/x^2 in lm);
- handling outliers;
- not necessarily a single model, and so on (do exploratory analysis).

Polynomial regression:
# polynomial regression
y <- cars$dist
x <- cars$speed
o = order(x)
plot( y~x )
do.it <- function (model, col) {
  r <- lm( model ); yp <- predict(r)
  lines( yp[o] ~ x[o], col=col, lwd=3 )}
do.it(y~x, col="red")
do.it(y~x+I(x^2), col="blue")
do.it(y~-1+I(x^2), col="green")
legend(par("usr")[1], par("usr")[4],
       c("affine function", "degree-2 polynomial", "degree 2 monomial"),
       lwd=3, col=c("red", "blue", "green"))
n <- 100
x <- runif(n,min=-4,max=4)
x <- x + sign(x)*.2
y <- 1/x + rnorm(n) # a hyperbola
plot(y~x)
lm( 1/y ~ x )
n <- 100
x <- rlnorm(n)^3.14
# a log-normal distribution is the distribution of a random variable
# whose logarithm is normally distributed
y <- x^-.1 * rlnorm(n)
plot(y~x)
lm(log(y) ~ log(x))

Orthogonal polynomials: p and q are orthogonal on the design points when sum_i p(x_i) q(x_i) = 0.
# orthogonal polynomials
y <- cars$dist
x <- cars$speed
# non-orthogonal: adding one term at a time (the terms interfere;
# previously significant coefficients can become insignificant)
summary( lm(y~x) )
summary( lm(y~x+I(x^2)) )
summary( lm(y~x+I(x^2)+I(x^3)) )
summary( lm(y~x+I(x^2)+I(x^3)+I(x^4)) )
summary( lm(y~x+I(x^2)+I(x^3)+I(x^4)+I(x^5)) )
# orthogonal: coefficients that start out significant stay significant
# poly: Compute Orthogonal Polynomials
summary( lm(y~poly(x,1)) )
summary( lm(y~poly(x,2)) )
summary( lm(y~poly(x,3)) )
summary( lm(y~poly(x,4)) )
summary( lm(y~poly(x,5)) )
# plot the p-values of the coefficients for orthogonal polynomials
n <- 5
p <- matrix( nrow=n, ncol=n+1 )
for (i in 1:n) {
  p[i,1:(i+1)] <- summary(lm( y ~ poly(x,i) ))$coefficients[,4]
}
matplot(p, type='l', lty=1, lwd=3)
legend( par("usr")[1], par("usr")[4], as.character(1:n), lwd=3, lty=1, col=1:n )
title(main="Evolution of the p-values (orthonormal polynomials)")
# the same for non-orthogonal polynomials
p <- matrix( nrow=n, ncol=n+1 )
p[1,1:2] <- summary(lm(y ~ x) )$coefficients[,4]
p[2,1:3] <- summary(lm(y ~ x+I(x^2)) )$coefficients[,4]
p[3,1:4] <- summary(lm(y ~ x+I(x^2)+I(x^3)) )$coefficients[,4]
p[4,1:5] <- summary(lm(y ~ x+I(x^2)+I(x^3)+I(x^4)) )$coefficients[,4]
p[5,1:6] <- summary(lm(y ~ x+I(x^2)+I(x^3)+I(x^4)+I(x^5)) )$coefficients[,4]
matplot(p, type='l', lty=1, lwd=3)
legend( par("usr")[1], par("usr")[4], as.character(1:n), lwd=3, lty=1, col=1:n )
title(main="Evolution of the p-values (non orthonormal polynomials)")
# an example
data(beavers)
y <- beaver1$temp
x <- 1:length(y)
plot(y~x)
for (i in 1:10) {
  r <- lm( y ~ poly(x,i) )
  lines( predict(r), type="l", col=i )
}
summary(r)

Nonparametric regression:
# nonparametric: splines
plot(quakes$long, quakes$lat)
lines( smooth.spline(quakes$long, quakes$lat), col='red', lwd=3)
library(Design) # rcs: Design Special Transformation Functions
# 4-knot spline
r <- lm( quakes$lat ~ rcs(quakes$long) )
plot( quakes$lat ~ quakes$long )
o <- order(quakes$long)
lines( quakes$long[o], predict(r)[o], col='red', lwd=3 )
r <- lm( quakes$lat ~ rcs(quakes$long,10) )
lines( quakes$long[o], predict(r)[o], col='blue', lwd=6, lty=3 )
title(main="Regression with rcs")
legend( par("usr")[1], par("usr")[3], yjust=0,
        c("4 knots", "10 knots"), lwd=c(3,3), lty=c(1,3), col=c("red", "blue") )
# more splines
library(splines)
data(quakes)
x <- quakes[,2]
y <- quakes[,1]
o <- order(x)
x <- x[o]
y <- y[o]
r1 <- lm( y ~ bs(x,df=10) )
r2 <- lm( y ~ ns(x,df=6) )
plot(y~x)
lines(predict(r1)~x, col='red', lwd=3)
lines(predict(r2)~x, col='green', lwd=3)
# kernel smoothing
plot(cars$speed, cars$dist)
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth=2), col='red')
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth=5), col='green')
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth=10), col='blue')
# weighted local least squares:
loess
# various kernels
curve(dnorm(x), xlim=c(-3,3), ylim=c(0,1.1))
x <- seq(-3,3,length=200)
D.Epanechikov <- function (t) { ifelse(abs(t)<1, 3/4*(1-t^2), 0) }
lines(D.Epanechikov(x) ~ x, col='red')
D.tricube <- function (t) { # aka "triweight kernel"
  ifelse(abs(t)<1, (1-abs(t)^3)^3, 0) }
lines(D.tricube(x) ~ x, col='blue')
legend( par("usr")[1], par("usr")[4], yjust=1,
        c("Gaussian kernel", "Epanechnikov kernel", "tricube kernel"),
        lwd=1, lty=1, col=c(par('fg'),'red', 'blue'))
title(main="Different kernels")
# local polynomial regression
library(KernSmooth)
data(quakes)
x <- quakes$long
y <- quakes$lat
plot(y~x)
bw <- dpill(x,y) # about .2
lines( locpoly(x,y,degree=0, bandwidth=bw), col='red' )
lines( locpoly(x,y,degree=1, bandwidth=bw), col='green' )
lines( locpoly(x,y,degree=2, bandwidth=bw), col='blue' )
legend( par("usr")[1], par("usr")[3], yjust=0,
        c("degree = 0", "degree = 1", "degree = 2"),
        lwd=1, lty=1, col=c('red', 'green', 'blue'))
title(main="Local Polynomial Regression")
# a wide bandwidth
plot(y~x);bw <- .5
lines( locpoly(x,y,degree=0, bandwidth=bw), col='red' )
lines( locpoly(x,y,degree=1, bandwidth=bw), col='green' )
lines( locpoly(x,y,degree=2, bandwidth=bw), col='blue' )
legend( par("usr")[1], par("usr")[3], yjust=0,
        c("degree = 0", "degree = 1", "degree = 2"),
        lwd=1, lty=1, col=c('red', 'green', 'blue'))
title(main="Local Polynomial Regression (wider window)")

Nonlinear regression:
# nonlinear regression
library(nls2)
f <- function (x,p) {
  u <- p[1]
  v <- p[2]
  u/(u-v) * (exp(-v*x) - exp(-u*x))
}
n <- 100
x <- runif(n,0,2)
y <- f(x, c(3.14,2.71)) + .1*rnorm(n)
r <- nls( y ~ f(x,c(a,b)), start=c(a=3, b=2.5) )
plot(y~x)
xx <- seq(0,2,length=200)
lines(xx, f(xx,r$m$getAllPars()), col='red', lwd=3)
lines(xx, f(xx,c(3.14,2.71)), lty=2)

Quantile regression. The model is y_i = x_i'beta + eps_i; we seek the parameter (possibly a vector) minimizing the loss sum_i rho(y_i - x_i'beta). In least-squares regression the loss is rho(u) = u^2; in tau-quantile regression it is the check function rho_tau(u) = u (tau - I(u < 0)), and tau = 0.5 gives least-absolute-deviation regression. (The slide showing the shape of rho_tau(u) is omitted.)

# quantile regression
library(quantreg)
data(engel);head(engel);plot(engel)
plot(engel, log = "xy", main = "'engel' data (log - log scale)")
plot(log10(foodexp) ~ log10(income), data = engel,
     main = "'engel' data (log10 - transformed)")
taus <- c(.15, .25, .50, .75, .95, .99)
rqs <- as.list(taus)
for(i in seq(along = taus)) {
  rqs[[i]] <- rq(log10(foodexp) ~ log10(income), tau = taus[i], data = engel)
  lines(log10(engel$income), fitted(rqs[[i]]), col = i+1)}
legend("bottomright", paste("tau = ", taus), inset = .04,
       col = 2:(length(taus)+1), lty=1)
abline(lm(log10(foodexp)~log10(income),engel),lwd=5) # least squares: the thick black line
plot(summary(rq(log10(foodexp)~log10(income),tau = 1:49/50,data=engel))) # coefficient plots
# untransformed data
plot(foodexp~income, data = engel, main = "'engel' data")
for(i in seq(along = taus)) {
  rqs[[i]] <- rq(foodexp ~ income, tau = taus[i], data = engel)
  lines(engel$income, fitted(rqs[[i]]), col = i+1)}
legend("bottomright", paste("tau = ", taus), inset = .04,
       col = 2:(length(taus)+1), lty=1)
abline(lm(foodexp~income,engel),lwd=5) # least squares: the thick black line
plot(summary(rq(foodexp~income,tau = 1:49/50,data=engel))) # coefficient plots
N <- 2000
x <- runif(N)
y <- rnorm(N)
y <- -1 + 2 * x + ifelse(y>0, y+5*x^2, y-x^2)
plot(x,y)
abline(lm(y~x), col="red")
library(quantreg)
plot(y~x)
for (a in seq(.1,.9,by=.1)) {
  abline(rq(y~x, tau=a), col="blue", lwd=3)
}
# locally polynomial quantile regression
plot(y~x)
for (a in seq(.1,.9,by=.1)) {
  r <- lprq(x,y, h=bw.nrd0(x), # see ?density
            tau=a)
  lines(r$xx, r$fv, col="blue", lwd=3)
}

Logistic and probit regression; generalized linear models (GLM). In a GLM all the functions of the family are known, and the parameter theta is tied to the mean. If b(theta) = theta, the family is in canonical form (every distribution in the family can be brought to canonical form by a transformation); if moreover T(y) = y, then theta is called the canonical parameter. If the canonical parameter is used, i.e. theta = b(mu) = X beta, the link is the canonical link. (In general a link function is a function b(mu) = X beta describing the relation between the mean mu and the linear predictor X beta; it need not equal theta.)
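For reference, the exponential-family notation the slide alludes to can be written out explicitly (a hedged reconstruction in the usual textbook form, since the slide's formulas were lost):

$$f(y;\theta,\phi)=\exp\!\left\{\frac{y\,\theta-b(\theta)}{a(\phi)}+c(y,\phi)\right\},\qquad E(Y)=b'(\theta),\qquad \operatorname{Var}(Y)=b''(\theta)\,a(\phi).$$

With the canonical link, $\theta=\eta=X\beta$; for the Bernoulli family this gives the logit link $\log\{\mu/(1-\mu)\}$, which is why logistic regression is the canonical case, while probit regression uses $g(\mu)=\Phi^{-1}(\mu)$ instead.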
Logistic / probit regression example: ModeChoice. In fact mode takes only two values, 0 and 1; the remaining variables are quantitative. Two logistic models are considered, model b and model c.

ANOVA:
library(Ecdat);data(ModeChoice) # binary response
w=ModeChoice
# two logistic models
b=glm(factor(mode)~ttme+invt+gc,data=w,family="binomial")
c=glm(factor(mode)~ttme+invc*invt+gc,data=w,family="binomial")
anova(c,test="Chi");summary(c);anova(b,c,test="Chi")

(Slides: the fitted output of models b and c is omitted.)
Two probit models, bb and cb, mirror models b and c.

ANOVA:
# two probit models
bb=glm(factor(mode)~ttme+invt+gc,data=w,family=binomial(link=probit))
cb=glm(factor(mode)~ttme+invc*invt+gc,data=w,binomial(link=probit))
anova(cb,test="Chi");summary(cb);anova(bb,cb,test="Chi")
anova(b,bb,test="Chi");anova(c,cb,test="Chi")
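A minimal follow-up sketch (assuming the fitted objects b and bb from above): the two link functions usually give nearly identical fitted probabilities, which is easy to check directly.

p.logit  <- fitted(b)   # fitted P(mode=1) under the logit link
p.probit <- fitted(bb)  # fitted P(mode=1) under the probit link
plot(p.logit, p.probit); abline(0, 1, col = "red")
cor(p.logit, p.probit)  # typically very close to 1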
(Slide: the fitted output of the probit models is omitted.)
Comparing logistic and probit; modelling the dispersion:
library(dglm)
library(statmod)
clotting <- data.frame(
  u = c(5,10,15,20,30,40,60,80,100),
  lot1 = c(118,58,42,35,27,25,21,19,18),
  lot2 = c(69,35,26,21,18,16,13,12,12))
a1=glm(lot1 ~ log(u), data=clotting, family=Gamma)
summary(a1)
# The same example as in glm: the dispersion is modelled as constant.
# However, dglm uses ml not reml, so results differ slightly:
out <- dglm(lot1 ~ log(u), ~1, data=clotting, family=Gamma)
summary(out)
# Try a double glm
out2 <- dglm(lot1 ~ log(u), ~u, data=clotting, family=Gamma)
summary(out2)
anova(out2)
# Summarize the mean model as for a glm
summary.glm(out2)
# Summarize the dispersion model as for a glm
summary(out2$dispersion.fit)
# Examine goodness of fit of dispersion model by plotting residuals
plot(fitted(out2$dispersion.fit),residuals(out2$dispersion.fit))

Poisson log-linear model: dispersion. For n independent responses Y_i ~ Poisson(theta_i) (possibly with an offset), the Poisson distribution has Var(Y_i) = E(Y_i) = theta_i, but it may happen that the actual variance exceeds the nominal variance under the assumed probability model. Suppose now that the rate is itself random, theta_i = lambda_i * nu_i with E(nu_i) = 1 and Var(nu_i) = phi. Then E(Y_i) = lambda_i, and it can be shown that Var(Y_i) = lambda_i + phi * lambda_i^2. Hence, for phi > 0 we have overdispersion. It is interesting to note that the same mean and variance arise also if we assume a negative binomial distribution for the response variable.

library(dispmod)
data(salmonellaTA98)
attach(salmonellaTA98)
log.x10 <- log(x+10)
mod <- glm(y ~ log.x10 + x, family=poisson(log))
summary(mod)
mod.disp <- glm.poisson.disp(mod)
summary(mod.disp)
mod.disp$dispersion
# compute predictions on a grid of x-values...
x0 <- seq(min(x), max(x), length=50)
eta0 <- predict(mod, newdata=data.frame(log.x10=log(x0+10), x=x0), se=TRUE)
eta0.disp <- predict(mod.disp, newdata=data.frame(log.x10=log(x0+10), x=x0), se=TRUE)
# ... and plot the mean functions with variability bands
plot(x, y)
lines(x0, exp(eta0$fit))
lines(x0, exp(eta0$fit+2*eta0$se), lty=2)
lines(x0, exp(eta0$fit-2*eta0$se), lty=2)
lines(x0, exp(eta0.disp$fit), col=2)
lines(x0, exp(eta0.disp$fit+2*eta0.disp$se), lty=2, col=2)
lines(x0, exp(eta0.disp$fit-2*eta0.disp$se), lty=2, col=2)
## Holford's data
data(holford)
attach(holford)
mod <- glm(incid ~ offset(log(pop)) + Age + Cohort, family=poisson(log))
summary(mod)
mod.disp <- glm.poisson.disp(mod)
summary(mod.disp)
mod.disp$dispersion
# another approach (Tweedie distributions -- look up the literature yourself)
# tt = glm(incid ~ offset(log(pop)) + Age + Cohort,
#          family=tweedie(var.power=4,link.power=0))

Detecting multicollinearity in regression:
- The coefficients change a lot when a predictor is added or dropped.
- The F test rejects "all coefficients are zero", yet some individual coefficients are insignificant.
- Formal detection: tolerance, or the variance inflation factor (VIF).
- The condition number test.
- The Farrar-Glauber test: if the variables are found to be orthogonal, there is no multicollinearity.
Tolerance and VIF: tolerance_j = 1 - R^2_j and VIF_j = 1/(1 - R^2_j), where R^2_j is the coefficient of determination from regressing the j-th variable on all the others. A tolerance that is too small (< 0.2 or 0.1), equivalently a VIF that is too large (> 5 or 10), signals a multicollinearity problem.

Condition number: it equals 1 for orthogonal predictors; values > 15 indicate a problem, and > 30 a serious one.

Computing the VIF (for linear and generalized linear models) — car calculates variance-inflation and generalized variance-inflation factors:
library(car)
vif(lm(prestige ~ income + education, data=Duncan))
vif(lm(prestige ~ income + education + type, data=Duncan))
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
glm.D93 <- glm(counts ~ outcome + treatment, family=poisson())
anova(glm.D93)
summary(glm.D93)
vif(glm.D93)

Computing (or estimating) the condition number of a matrix:
kappa(x1 <- cbind(1,1:10)) # 15.71
kappa(x1, exact = TRUE) # 13.68
kappa(x2 <- cbind(x1,2:11)) # high! [x2 is singular!]
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }
sv9 <- svd(h9 <- hilbert(9))$d
kappa(h9) # pretty high!
kappa(h9, exact = TRUE) == max(sv9) / min(sv9)
kappa(h9, exact = TRUE) / kappa(h9) # .677 (i.e., rel.error = 32%)

Ridge regression. The estimator solves the penalized least-squares problem

$$\hat\beta^{\text{ridge}}=\arg\min_{\beta}\left\{\sum_{i=1}^{N}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}+\lambda\sum_{j=1}^{p}\beta_j^{2}\right\},$$

equivalently

$$\hat\beta^{\text{ridge}}=\arg\min_{\beta}\sum_{i=1}^{N}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}\quad\text{subject to}\quad\sum_{j=1}^{p}\beta_j^{2}\le s.$$

The consumption data (library(perturb)) is a data frame with 28 observations on 5 variables:
- year: 1947 to 1974
- c: total consumption, 1958 dollars
- r: the interest rate (Moody's Aaa)
- dpi: disposable income, 1958 dollars
- d_dpi: annual change in disposable income

library(perturb);data(consumption)
head(consumption)
library(MASS)
attach(consumption) # so that c, r, dpi, d_dpi are visible
ct1<-c(NA,c[-length(c)]) # lagged consumption
a<-lm.ridge(c~ct1+dpi+r+d_dpi, lambda=seq(0,0.1,length=100), model =TRUE)
names(a) # "coef" "scales" "Inter" "lambda" "ym" "xm" "GCV" "kHKB" "kLW"
a$lambda[which.min(a$GCV)] # the lambda minimizing GCV (about 0.014 here)
a$coef[,which.min(a$GCV)] # the coefficients at that lambda
par(mfrow=c(1,2))
plot(a) # ridge trace; mark the GCV-optimal lambda with a red line
abline(v=a$lambda[which.min(a$GCV)],col="red")
plot(a$lambda,a$GCV,type="l") # lambda against GCV
abline(v=a$lambda[which.min(a$GCV)],col="green")

(Slides: the ridge trace and the GCV curve; for orthogonal (independent) data the traces are flat.)

# another example
longley # not the same as the S-PLUS dataset
names(longley)[1] <- "y"
a0=lm.ridge(y ~ ., longley) # lambda = 0
plot(lm.ridge(y ~ ., longley,lambda = seq(0,0.1,0.001)))
select(lm.ridge(y ~ ., longley,lambda = seq(0,0.1,0.0001)))
a1=lm.ridge(y ~ ., longley,lambda=0.0057)
a1$coe

Partial least squares regression, PLSR (Partial Least Squares and Principal Component Regression). The oliveoil data in pls: sensory and physico-chemical data of olive oils. A data set with scores on 6 attributes from a sensory panel and measurements of 5 physico-chemical quality parameters on 16 olive oil samples. The first five oils are Greek, the next five are Italian and the last six are Spanish. data(oliveoil). Format: a data frame with 16 observations on 2 variables. sensory: a matrix with 6 columns, scores for the attributes 'yellow', 'green', 'brown', 'glossy', 'transp', and 'syrup'. chemical: a matrix with 5 columns, measurements of acidity, peroxide, K232, K270, and DK. Source: Massart, D. L., Vandeginste, B. G. M., Buydens, L. M. C., de Jong, S., Lewi, P. J., Smeyers-Verbeke, J. (1998) Handbook of Chemometrics and Qualimetrics: Part B. Elsevier. Tables 35.1 and 35.4.
The model is sensory ~ chemical.

# partial least squares regression (principal component regression first)
library(pls); data(oliveoil);head(oliveoil);dim(oliveoil)
oliveoil$sensory # a 16x6 matrix
oliveoil$chemical # a 16x5 matrix
# PCR
sens.pcr <- pcr(sensory ~ chemical, ncomp = 4, scale = TRUE, data = oliveoil)
summary(sens.pcr);names(sens.pcr)
# "coefficients" "scores" "loadings" "Yloadings" "projection" "Xmeans"
# "Ymeans" "fitted.values" "residuals" "Xvar" "Xtotvar" "ncomp"
# "method" "scale" "call" "terms" "model"
sens.pcr$loadings
sens.pcr$coefficients
sens.pcr$scores
sens.pcr$Yloadings
sens.pcr$projection
sens.pcr$residuals
# PLSR
sens.pls <- plsr(sensory ~ chemical, ncomp = 4, scale = TRUE, data = oliveoil)
summary(sens.pls);names(sens.pls)
# "coefficients" "scores" "loadings" "loading.weights" "Yscores"
# "Yloadings" "projection" "Xmeans" "Ymeans" "fitted.values"
# "residuals" "Xvar" "Xtotvar" "ncomp" "method"
# "scale" "call" "terms" "model"
sens.pls$loadings
sens.pls$coef
sens.pls$scores
sens.pls$loading.weights
sens.pls$Yscores
sens.pls$Yloadings
sens.pls$Xvar
sens.pls$Xtotvar

The yarn data in pls consists of 21 NIR spectra of PET yarns, measured at 268 wavelengths, and 21 corresponding densities (Erik Swierenga):
library(pls);data(yarn)
names(yarn) # "NIR" "density" "train"
dim(yarn$NIR) # 28 268, the predictors
yarn$density # the response
summary(yarn$train) # logical: FALSE 7, TRUE 21
yarn.pls <- plsr(density ~ NIR, ncomp = 4, scale = TRUE, data = yarn)
summary(yarn.pls);names(yarn.pls)
yarn.pls$loadings
yarn.pls$coef
yarn.pls$scores
yarn.pls$loading.weights
yarn.pls$Yscores
yarn.pls$Yloadings
yarn.pls$Xvar
yarn.pls$Xtotvar

The LASSO method. In a linear model one must choose suitable variables; the commonly used stepwise regression, for instance, keeps the significant variables and discards the insignificant ones. Tibshirani (1996) [1] proposed a new way to handle variable selection: minimize the residual sum of squares subject to the constraint that the sum of the absolute values of the coefficients be smaller than a constant. The method yields interpretable models, as subset selection does, and enjoys the stability of ridge regression. It deletes no variables, but shrinks some regression coefficients toward, and even exactly to, zero. The method is therefore called the lasso (least absolute shrinkage and selection operator [2]):

$$\hat\beta^{\text{lasso}}=\arg\min_{\beta}\sum_{i=1}^{N}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}\quad\text{subject to}\quad\sum_{j=1}^{p}|\beta_j|\le s.$$

[1] Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso", J. R. Statist. Soc. B, 58(1), 267-288.
[2] As an ordinary English word, "lasso" also means a rope for catching horses, which has nothing to do with the acronym here; translating it that way is inappropriate.
library(lars)
data(diabetes);attach(diabetes);names(diabetes);dim(diabetes);head(diabetes)
attributes(diabetes$x) # a list of three components (y, x, x2); x and x2 have sub-variables
# comparing the three types (the default is lasso):
# fits Least Angle Regression, Lasso and Infinitesimal Forward Stagewise models
par(mfrow=c(1,3))
attach(diabetes)
object <- lars(x,y);plot(object)
object2 <- lars(x,y,type="lar");plot(object2)
object3 <- lars(x,y,type="for");plot(object3)
par(mfrow=c(1,1))
a2<-lars(x2,y)
a2
attributes(a2)
a=cv.lars(x2,y,trace=TRUE,max.steps=80)
attach(a)
best<-index[which.min(cv)]
(coef=coef.lars(a2,mode="fraction",s=best))
detach(diabetes)
# write.csv(diabetes,"f:/2010stat/diabetes.csv",quote=F,row.names=F)
# w=read.csv("f:/2010stat/diabetes.csv") # the format differs from the original

Fit a GLM with lasso or elasticnet regularization (glmnet):
# Gaussian
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
fit1=glmnet(x,y)
print(fit1)
coef(fit1,s=0.01) # extract coefficients at a single value of lambda
predict(fit1,newx=x[1:10,],s=c(0.01,0.005)) # make predictions
# binomial
g2=sample(1:2,100,replace=TRUE)
fit2=glmnet(x,g2,family="binomial")
# multinomial
g4=sample(1:4,100,replace=TRUE)
fit3=glmnet(x,g4,family="multinomial")
system.time(fit2n<-glmnet(x,y))
# poisson
N=500; p=20
nzc=5
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
f = x[,seq(nzc)]%*%beta
mu=exp(f)
y=rpois(N,mu)
fit=glmnet(x,y,family="poisson")
plot(fit)
pfit = predict(fit,x,s=0.001,type="response")
plot(pfit,y)
# Cox
set.seed(10101)
N=1000;p=30
nzc=p/3
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
fx=x[,seq(nzc)]%*%beta/3
hx=exp(fx)
ty=rexp(N,hx)
tcens=rbinom(n=N,prob=.3,size=1) # censoring indicator
y=cbind(time=ty,status=1-tcens) # y=Surv(ty,1-tcens) with library(survival)
fit=glmnet(x,y,family="cox")
plot(fit)
# sparse
n=10000;p=200
nzc=trunc(p/10)
x=matrix(rnorm(n*p),n,p)
iz=sample(1:(n*p),size=n*p*.85,replace=FALSE)
x[iz]=0
sx=Matrix(x,sparse=TRUE)
beta=rnorm(nzc)
fx=x[,seq(nzc)]%*%beta
eps=rnorm(n)
y=fx+eps
px=exp(fx)
px=px/(1+px)
ly=rbinom(n=length(px),prob=px,size=1)
system.time(fit1<-glmnet(sx,y))

Generalized additive models (gam). In an additive model E(y) = alpha + sum_i f_i(x_i), the functions f_i(x_i) may be fit using parametric or non-parametric means, thus providing the potential for better fits to data than other methods.
library(gam)
data(kyphosis)
gam(Kyphosis ~ s(Age,4) + Number, family = binomial, data=kyphosis, trace=TRUE)
data(airquality)
gam(Ozone^(1/3) ~ lo(Solar.R) + lo(Wind, Temp), data=airquality, na=na.gam.replace)
gam(Kyphosis ~ poly(Age,2) + s(Start), data=kyphosis, family=binomial, subset=Number>2)
data(gam.data)
gam.object <- gam(y ~ s(x,6) + z,data=gam.data)
summary(gam.object)
plot(gam.object,se=TRUE)
data(gam.newdata)
predict(gam.object,type="terms",newdata=gam.newdata)
# Built-in nonparametric smoothing terms are indicated by s for smoothing
# splines or lo for loess smooth terms.

Multiple Fractional Polynomial Model. Selects the multiple fractional polynomial (MFP) model which best predicts the outcome. The model may be a generalized linear model or a proportional hazards (Cox) model.
(The ac.csv data example follows below.)
library(mfp)
data(GBSG)
f <- mfp(Surv(rfst, cens) ~ fp(age, df = 4, select = 0.05)
         + fp(prm, df = 4, select = 0.05), family = cox, data = GBSG)
print(f)
survfit(f$fit) # use the proposed coxph model fit for survival curve estimation
# see the slides: fractional polynomial model.pdf

library(mfp)
w=read.csv("f:/2010stat/ac.csv")
d=mfp(log(ac)~fp(gawks,select=0.05),data=w,verbose=T)
# the verbose output compares the null, straight-line, best-fitting FP1 and
# best-fitting FP2 models; here the best FP2 model has m=2 with powers p1=-2, p2=0
summary(d)
par(mfrow=c(2,2))
plot(mfp(log(ac)~fp(gawks),w)) # the usual four diagnostic plots
# (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage)
xfit=(w$gawks);yfit=predict(d,data=(w$gawks=(xfit)))
plot(log(w$ac)~w$gawks);lines(yfit~xfit,col="red")

Variable selection: library(MMIX)
library(MMIX)
X1<-c(-0.2,-2.4,-0.7,1.2,0.0,-1.1,-2.1,-0.3,2.0,-1.7,1.4,-1.3,-3.4,0.4,-1.3,-4.8)
X2<-c(-3, 2, 1, -2, -2, -4, 0, 1, 1, -1, -1, -4, 0, 2, 0, -4)
X3<-c(2,1,0,-2,1,-2, 0, -1, -4, 1, -3, -3, -3, -1, 0, 2)
# linear model
Y1<-c(8.7, 6, 9.1, 10.4, 7.6 ,10.4, 7.9, 11.9, 18, 10.5, 16.5, 8.8, 7.7, 13.5, 8.2, 0.8)
data1<-data.frame(Y1,X1,X2,X3)
pairs(data1)
# logistic model
Y2<-c(1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1)
data2<-data.frame(Y2,X1,X2,X3)
pairs(data2)
## varSelec — in the end we obtain two reduced data sets:
## X1 and X3 are selected for data1, X2 for data2
data1bis<-varSelec(data=data1,family=gaussian("identity"),maxVar=2)
data2bis<-varSelec(data=data2,family=binomial("logit"),maxVar=2)
## bootFreq analyses the stability of the stepwise selection and mixing
## methods using a bootstrap procedure
# stepwise selection method
bootStep1<-bootFreq(data=data1,family=gaussian("identity"),nboot=50,method=2,
                    criterion="bic",trace=0)
bootStep1
summary(bootStep1)
plot(bootStep1) # factor weights for X1, X2, X3
bootStep2<-bootFreq(data=data2,family=binomial("logit"),method=2,
                    criterion="bic",nboot=20,trace=0)
bootStep2
summary(bootStep2)
plot(bootStep2) # factor weights

Adaptive Regression by Mixing with model Screening — apply ARMS to linear and logistic models:
## ARMS method
armsResult1<-arms(data=data1,family=gaussian("identity"),nbest=5,nsample=10,
                  criterion="both",weight="aic")
armsResult1
summary(armsResult1)
plot(armsResult1) # factor weights
armsResult2<-arms(data=data2,family=binomial("logit"),nbest=5,nsample=10,
                  criterion="both",weight="aic")
armsResult2
summary(armsResult2)
plot(armsResult2) # factor weights

Model performance indicators PMSE and AUC:
## root mean square error of prediction by cross-validation
# stepwise selection with BIC
pmseStepBic<-pmseCV(data=data1,method=2,np=1,random=FALSE,direction="both",
                    criterion="bic",trace=0)
pmseStepBic
# BMA
pmseBMA<-pmseCV(data=data1,method=3,np=1,random=FALSE)
pmseBMA
## area under the ROC curve by cross-validation
aucStepBic<-aucCV(data=data2,method=2,np=1,random=FALSE,direction="both",
                  criterion="bic",trace=0)
aucStepBic
aucBMA<-aucCV(data=data2,method=3,np=1,random=FALSE)
aucBMA

Conditional logistic regression — see the slides:
logisticRegressionManitoba-Nov2010.pdf: the "prospective" model, the conditional likelihood, and exponential movement kernels (Forester et al. 2009).
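For the simplest matched design (one case per stratum), the conditional likelihood being referred to has the standard form (a hedged reconstruction; the slides' own formulas are not reproduced here):

$$L(\beta)=\prod_{k=1}^{K}\frac{\exp(x_{k1}^{\top}\beta)}{\sum_{j\in\text{stratum }k}\exp(x_{kj}^{\top}\beta)},$$

where $x_{k1}$ denotes the case in stratum k. The stratum-specific intercepts cancel out of this ratio, which is what makes the estimation conditional on the matching.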
library(TwoStepCLogit)
data(bison)
dyad <- bison$Cluster
stratum <- bison$Strata
Y <- bison$Y
x1 <- bison$biomass
x2 <- bison$water
x3 <- bison$biomass2
bison.data <- cbind(dyad,stratum,Y,x1,x2,x3)
# Conditional logistic regression
# There are 10 observations per stratum, with
# 2 cases and 8 controls in each stratum
# Model 1: all three covariates x1, x2, x3
# Random effects in front of x1 and x3
# Main diagonal covariance structure for D
Fit1 <- Ts.estim(bison.data, Proposed.model=c(1,2,3), random=c(1,3),
                 All.m.1=FALSE, D="UN(1)")
Fit1
# Model 2: only covariates x1, x3
# Random effects in front of x1 and x3
# Unstructured covariance structure for D
Fit2 <- Ts.estim(bison.data, Proposed.model=c(1,3), random=c(1,2),
                 All.m.1=FALSE, D="UN")
Fit2

Bayesian methods for tree based models: Bayesian Additive Regression Trees, bart {BayesTree}. BART is a Bayesian "sum-of-trees" model. For numeric response y, we have y = f(x) + e, where e ~ N(0,sigma^2). For a binary response y, P(Y=1 | x) = F(f(x)), where F denotes the standard normal cdf (probit link). In both cases, f is the sum of many tree models. The goal is to have very flexible inference for the unknown function f. In the spirit of "ensemble models", each tree is constrained by a prior to be a weak learner so that it contributes a small amount to the overall fit.

library(BayesTree)
## simulate data (example from Friedman MARS paper)
f = function(x){
  10*sin(pi*x[,1]*x[,2]) + 20*(x[,3]-.5)^2+10*x[,4]+5*x[,5]
}
sigma = 1.0 # y = f(x) + sigma*z, z~N(0,1)
n = 100 # number of observations
set.seed(99)
x=matrix(runif(n*10),n,10) # 10 variables, only first 5 matter
Ey = f(x)
y=Ey+sigma*rnorm(n)
lmFit = lm(y~.,data.frame(x,y)) # compare lm fit to BART later
## run BART
set.seed(99)
bartFit = bart(x,y)
plot(bartFit) # plot bart fit (sigma trace and posterior intervals for E(Y|x))
## compare BART fit to the linear model and to the truth = Ey
fitmat = cbind(y,Ey,lmFit$fitted,bartFit$yhat.train.mean)
colnames(fitmat) = c('y','Ey','lm','bart')
print(cor(fitmat))

Kernel-based Machine Learning Lab (kernlab) — see the PDF that ships with the package for the methods. The object classes include:
- "vm": used as a base for the various vector machine classes in kernlab
- "gausspr": the Gaussian Processes object class
- "kfa": returned by the Kernel Feature Analysis function kfa
- "kqr": Kernel Quantile Regression
- "ksvm": Support Vector Machines
- "lssvm": Least Squares Support Vector Machines
- "onlearn": kernel-based online learning algorithms
- "rvm": Relevance Vector Machine

library(kernlab)
## simple example using the promotergene data set
data(promotergene)

Example: ksvm (support vector machine)
## train a support vector machine
gene <- ksvm(Class~.,data=promotergene,kernel="rbfdot",
             kpar=list(sigma=0.015),C=50,cross=4)
gene
kernelf(gene) # the kernel function
alpha(gene) # the alpha values
coef(gene) # the coefficients
fitted(gene) # the fitted values
cross(gene) # the cross validation error

Prediction (on a test set):
## example using the promotergene data set
data(promotergene)
## create test and training set
ind <- sample(1:dim(promotergene)[1],20)
genetrain <- promotergene[-ind, ]
genetest <- promotergene[ind, ]
## train a support vector machine
gene <- ksvm(Class~.,data=genetrain,kernel="rbfdot",
             kpar=list(sigma=0.015),C=70,cross=4,prob.model=TRUE)
gene
## predict gene type probabilities on the test set
genetype <- predict(gene,genetest,type="probabilities")
genetype

Example: kqr (kernel quantile regression)
# create data
x <- sort(runif(300))
y <- sin(pi*x) + rnorm(300,0,sd=exp(sin(2*pi*x)))
# first calculate the median (median regression)
qrm <- kqr(x, y, tau = 0.5, C=0.15)
# predict and plot
plot(x, y)
ytest <- predict(qrm, x)
lines(x, ytest, col="blue")
# calculate the 0.9 quantile
qrm <- kqr(x, y, tau = 0.9, kernel = "rbfdot", kpar= list(sigma=10), C=0.15)
ytest <- predict(qrm, x)
lines(x, ytest, col="red")

Example: kernel principal component analysis
data(iris)
test <- sample(1:150,20)
kpc <- kha(~.,data=iris[-test,-5],kernel="rbfdot",
           kpar=list(sigma=0.2),features=2)
pcv(kpc) # print the principal component vectors
# plot the data projection on the components
plot(predict(kpc,iris[,-5]),col=as.integer(iris[,5]),
     xlab="1st Principal Component",ylab="2nd Principal Component")

Generalized Boosted Regression Models. See vignette("gbm") for the technical details of the package (also available at ../doc/gbm.pdf if you are using HTML help). This package implements the generalized boosted modeling framework. Boosting is the process of iteratively adding basis functions in a greedy fashion so that each additional basis function further reduces the selected loss function. This implementation closely follows Friedman's Gradient Boosting Machine (Friedman, 2001). In addition to many of the features documented in the Gradient Boosting Machine, gbm offers additional features including the out-of-bag estimator for the optimal number of iterations, the ability to store and manipulate the resulting gbm object, and a variety of other loss functions that had not previously had associated boosting algorithms, including the Cox partial likelihood for censored data, the Poisson likelihood for count outcomes, and a gradient boosting implementation to minimize the AdaBoost exponential loss function. gbm.fit provides the link between R and the C++ gbm engine. gbm is a front end to gbm.fit that uses the familiar R modeling formulas. However, model.frame is very slow if there are many predictor variables. For power users with many variables use gbm.fit. For general practice gbm is preferable.
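The boosting idea behind gbm can be summarized in one update (a hedged summary of Friedman's gradient boosting machine, using the paper's notation rather than anything package-specific):

$$F_m(x)=F_{m-1}(x)+\nu\,h_m(x),\qquad h_m\approx\arg\min_{h}\sum_{i=1}^{n}L\bigl(y_i,\;F_{m-1}(x_i)+h(x_i)\bigr),$$

where each $h_m$ is a small regression tree (a weak learner) fitted to the negative gradient of the loss $L$, and $\nu$ is the learning rate (the shrinkage argument of gbm in the example below).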
gbm: a least squares regression example.
# create some data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model
gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6, # formula
    data=data, # dataset
    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 # 0: no monotone restrictions
    distribution="gaussian", # bernoulli, adaboost, gaussian,
                             # poisson, coxph, and quantile available
    n.trees=3000, # number of trees
    shrinkage=0.005, # shrinkage or learning rate,
                     # 0.001 to 0.1 usually work
    interaction.depth=3, # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5, # fraction of data for training,
                          # first train.fraction*N used for training
    n.minobsinnode = 10, # minimum total weight needed in each node
    cv.folds = 5, # do 5-fold cross-validation
    keep.data=TRUE, # keep a copy of the dataset with the object
    verbose=TRUE) # print out progress

# check performance using an out-of-bag estimator
# OOB underestimates the optimal number of iterations
best.iter <- gbm.perf(gbm1,method="OOB")
print(best.iter)
# check performance using 5-fold cross-validation
best.iter <- gbm.perf(gbm1,method="cv")
print(best.iter)
# plot variable influence
summary(gbm1,n.trees=1) # based on the first tree
summary(gbm1,n.trees=best.iter) # based on the estimated best number of trees
# compactly print the first and last trees for curiosity
print(pretty.gbm.tree(gbm1,1))
print(pretty.gbm.tree(gbm1,gbm1$n.trees))
# make some new data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE))
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)
data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# predict on the new data using the "best" number of trees
# f.predict will generally be on the canonical scale (logit, log, etc.)
f.predict <- predict.gbm(gbm1,data2,best.iter)
# least squares error
print(sum((data2$Y-f.predict)^2))
# create marginal plots
# plot variables X1, X2, X3 after the "best" number of iterations
par(mfrow=c(1,3))
plot.gbm(gbm1,1,best.iter)
plot.gbm(gbm1,2,best.iter)
plot.gbm(gbm1,3,best.iter)
par(mfrow=c(1,1))
# contour plot of variables 1 and 2 after the "best" iterations
plot.gbm(gbm1,1:2,best.iter)
# lattice plot of variables 2 and 3
plot.gbm(gbm1,2:3,best.iter)
# lattice plot of variables 3 and 4
plot.gbm(gbm1,3:4,best.iter)
# 3-way plots
plot.gbm(gbm1,c(1,2,6),best.iter,cont=20)
plot.gbm(gbm1,1:3,best.iter)
plot.gbm(gbm1,2:4,best.iter)
plot.gbm(gbm1,3:5,best.iter)
# do another 100 iterations
gbm2 <- gbm.more(gbm1,100, verbose=FALSE) # stop printing detailed progress

Fit a Cox model by likelihood based boosting (CoxBoost). In contrast to gradient boosting (implemented e.g. in the glmboost routine in the R package mboost, using the CoxPH loss function), CoxBoost is not based on gradients of loss functions, but adapts the offset-based boosting approach of Tutz and Binder (2007) for estimating Cox proportional hazards models. In each boosting step the previous boosting steps are incorporated as an offset in penalized partial likelihood estimation, which is employed to obtain an update for one single parameter, i.e., one covariate, in every boosting step. This results in sparse fits similar to Lasso-like approaches, with many estimated coefficients being zero. The main model complexity parameter, which has to be selected (e.g. by cross-validation using cv.CoxBoost), is the number of boosting steps stepno. The penalty parameter penalty can be chosen rather coarsely, either by hand or using optimCoxBoostPenalty. The advantage of the offset-based approach compared to gradient boosting is that the penalty structure is very flexible. In the present implementation this is used to allow for unpenalized mandatory covariates, which receive a very fast coefficient build-up in the course of the boosting steps, while the other (optional) covariates are subjected to penalization. For example, in a microarray setting the (many) microarray features would be taken to be optional covariates, and the (few) potential clinical covariates would be taken to be mandatory, by including their indices in unpen.index. If a group of correlated covariates has influence on the response, e.g. genes from the same pathway, componentwise boosting will often result in a non-zero estimate for only one member of this group. To avoid this, information on the connection between covariates can be provided in pendistmat. If then, in addition, a penalty updating scheme with stepsize.factor < 1 is chosen, connected covariates are more likely to be chosen in future boosting steps if a directly connected covariate has been chosen in an earlier boosting step (see Binder and Schumacher, 2009b).
library(CoxBoost)
# generate some survival data with 10 informative covariates
n <- 200; p <- 100
beta <- c(rep(1,10),rep(0,p-10))
x <- matrix(rnorm(n*p),n,p)
real.time <- -(log(runif(n)))/(10*exp(drop(x %*% beta)))
cens.time <- rexp(n,rate=1/10)
status <- ifelse(real.time <= cens.time,1,0)
obs.time <- ifelse(real.time <= cens.time,real.time,cens.time)
# fit a Cox proportional hazards model by CoxBoost
cbfit <- CoxBoost(time=obs.time,status=status,x=x,stepno=100,penalty=100)
summary(cbfit)
# ... with covariates 1 and 2 being mandatory
cbfit.mand <- CoxBoost(time=obs.time,status=status,x=x,unpen.index=c(1,2),
                       stepno=100,penalty=100)
summary(cbfit.mand)
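The description above says stepno should be chosen by cross-validation via cv.CoxBoost; a minimal sketch of that step (the argument and component names beyond time/status/x are my assumptions about the CoxBoost interface and should be checked against its help page):

cv.res <- cv.CoxBoost(time=obs.time,status=status,x=x,
                      maxstepno=200,K=10,penalty=100) # 10-fold CV; assumed interface
cbfit.cv <- CoxBoost(time=obs.time,status=status,x=x,
                     stepno=cv.res$optimal.step, # assumed name of the selected step count
                     penalty=100)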
Generalized additive model by likelihood based boosting (GAMBoost). GAMBoost is used to fit a generalized additive model by likelihood based boosting. It is especially suited for models with a large number of predictors with potentially non-linear influence. It provides smooth function estimates of covariate influence functions together with confidence bands and approximate degrees of freedom.

## generate some data
n <- 100; p <- 8; q <- 2
# covariates with non-linear (smooth) effects
x <- matrix(runif(n*p,min=-1,max=1),n,p)
# binary covariates
x.linear <- matrix(round(runif(n*q,min=0,max=1)),n,q)
# the 1st and 3rd smooth covariates and the 1st linear covariate are informative
eta <- -0.5 + 2*x[,1] + 2*x[,3]^2 + x.linear[,1]-.5
y <- rbinom(n,1,binomial()$linkinv(eta))
## fit a model with just smooth components
gb1 <- GAMBoost(x,y,penalty=500,stepno=100,family=binomial(),trace=TRUE)
# inspect the AIC for a minimum
plot(gb1$AIC)
# still falling at boosting step 100, so we need more steps
# or a smaller penalty (use 'optimGAMBoostPenalty' for
# automatic penalty optimization)
## include two binary covariates as mandatory without penalty
## (appropriate for example for 'treatment/control'),
## modelled as 'linear' predictors
gb2 <- GAMBoost(x,y,penalty=200,
                x.linear=x.linear,penalty.linear=0,
                stepno=100,family=binomial(),trace=TRUE)
## include the first binary covariate as mandatory and the second
## as optional (e.g. 'treatment/control' and 'female/male')
gb3 <- GAMBoost(x,y,penalty=200,
                x.linear=x.linear,penalty.linear=c(0,100),
                stepno=100,family=binomial(),trace=TRUE)
# get a summary with fitted covariates and estimates for
# the parametric components
summary(gb3)
# extract the boosted components at the 'optimal' boosting step
selected <- getGAMBoostSelected(gb3,at.step=which.min(gb3$AIC))
# plot all smooth components at the final boosting step
par(mfrow=c(2,4))
plot(gb3)
# plot the smooth components for which the null line is not inside the bands
# at the 'optimal' boosting step (determined by AIC)
par(mfrow=c(1,length(selected$smoothbands)))
plot(gb3,select=selected$smoothbands,at.step=which.min(gb3$AIC))
## fit a generalized linear model for comparison
x.linear <- cbind(x,x.linear)
gb4 <- GAMBoost(x=NULL,y=y,x.linear=x.linear,penalty.linear=100,
                stepno=100,trace=TRUE,family=binomial())
# compare with the generalized additive model fit
plot(0:100,gb3$AIC,type="l",xlab="stepno",ylab="AIC")
lines(0:100,gb4$AIC,lty=2)
## fit a generalized linear model with penalty modification
## after every boosting step, with penalty changes being
## redistributed via a connection matrix
pendistmat <- matrix(0,10,10)
# covariates 1 and 3 are connected
pendistmat[1,3] <- pendistmat[3,1] <- 1
gb5 <- GAMBoost(x=NULL,y=y,x.linear=x.linear,penalty.linear=100,
                stepsize.factor.linear=0.9,pendistmat.linear=pendistmat,
                stepno=100,trace=TRUE,family=binomial())
# or alternatively
gb5 <- GAMBoost(x=NULL,y=y,x.linear=x.linear,penalty.linear=100,
                stepsize.factor.linear=0.9,
                pendistmat.linear=pendistmat[c(1,3),c(1,3)],
                connected.index.linear=c(1,3),
                stepno=100,trace=TRUE,family=binomial())
Functional Data Analysis in R (fda):
library(fda)
## simple smoothing
girlGrowthSm <- with(growth, smooth.basisPar(argvals=age, y=hgtf, lambda=0.1))
plot(girlGrowthSm$fd, xlab="age", ylab="height (cm)",
     main="Girls in Berkeley Growth Study")
plot(deriv(girlGrowthSm$fd), xlab="age", ylab="growth rate (cm / year)",
     main="Girls in Berkeley Growth Study")
plot(deriv(girlGrowthSm$fd, 2), xlab="age",
     ylab="growth acceleration (cm / year^2)",
     main="Girls in Berkeley Growth Study")
## a simple basis
bspl1.2 <- create.bspline.basis(norder=1, breaks=c(0,.5, 1))
plot(bspl1.2)
# 2 bases, order 1 = degree 0 = step functions:
# (1) constant 1 between 0 and 0.5 and 0 otherwise
# (2) constant 1 between 0.5 and 1 and 0 otherwise
fd1.2 <- Data2fd(0:1, basisobj=bspl1.2)
op <- par(mfrow=c(2,1))
plot(bspl1.2, main='bases')
plot(fd1.2, main='fit')
par(op)
# a step function: 0 up to time=0.5, then 1 afterwards

Weighted least squares: the weights option of lm. Generalized linear models: see also the sandwich package.
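A minimal sketch of those two closing pointers (the variables x and y are illustrative; coeftest comes from the lmtest package, which sandwich is usually paired with):

x <- 1:50; y <- 2*x + rnorm(50, sd = x)   # variance growing with x
a.wls <- lm(y ~ x, weights = 1/x^2)       # weighted least squares via the weights option
summary(a.wls)
library(sandwich); library(lmtest)
coeftest(a.wls, vcov = vcovHC(a.wls))     # heteroskedasticity-consistent standard errors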