逻辑回归极大似然，解最大似然估计为什么用梯度下降

羽毛球技术 | 体育赛事 | 英文歌曲 | 住宅风水 | 用户界面设计师 | 六爻 | 书籍改编电影 | 德国足球甲级联赛 | 欧美明星 | PLC | 中国足球 | aj1 | 国家队 | 拜仁慕尼黑足球俱乐部 | 小说创作 | 配音 | iOS应用 | NBA 2K | 古典音乐 | 面相 | 火影忍者 | 武汉大学 | 土拨鼠 | 营销策划 | 秦时明月之天行九歌 | 设计师 | 巴塞罗那足球俱乐部 | 尤文图斯 | 实况足球（游戏） | 少帅 | 罗玉凤 | 比利时 | 跑鞋 | 冷知识 | 肖战 | 李元胜 | 古琴 | 按键精灵 | 罗兰 | 徐波 | 激光手术 | 角色扮演 | 关晓彤 | 微电影 | safari | 北京国安 | 古汉语 | 曼彻斯特联 | 玄幻小说 | 科幻小说 | 双眼皮手术 | 主题曲 | 年会 | 检测仪 | 徒步 | 互联网公司 | 百度输入法 | 镜头 | 宜昌市 | 自拍 | 金蝶 | 电子烟 | 网站建设 | 广播体操 | 文身 | nba篮球 | 索尼(sony) | 天体物理学 | 痛风 | 象棋 | 牛皮癣 | 皮肤护理 | 周星驰（人物） | 试管婴儿 | 亚足联亚洲杯（AFC Asian Cup） | 健美 | 美术生 | 迅雷（软件） | 战斗机 | 穿越小说 | 张璐 | 姓氏 | 诸葛亮 | 后宫·甄嬛传（书籍） | 虎牙直播 | snh48 | 阿迪达斯 | 投影仪 | 组装机 | 微信群 | 阿迪达斯(adidas) | 网球王子 | 分子生物学 | 耽美 | 武磊 | 婚礼 | 表演 | 中国武术 | 动画电影 | Air Jordan | 张子枫 | 免费软件 | 相声演员 | 摩羯座 | 宿舍 | ansys | 法国足球甲级联赛 | 户外 | 剧场版 | 杨凡 | 科幻电影 | galgame | 融资 | 关节炎 | NBA季后赛 | 神话 | 王力宏（人物） | 建模 | 计算机病毒 | 广州恒大淘宝足球俱乐部 | 北京奥运会 | 电脑电源 | 百度翻译 | 字幕 | 讯飞输入法 | 海关 | 易烊千玺 | 深度学习 | 编辑器 | 澳门特别行政区 | 直播 | 流氓软件 | 事故 | 大片 | 李景亮 | 郭富城 | 日语歌曲 | 卡牌游戏 | 小品 | 东京 | 花卉 | 音乐剧 | 互联网创业 | 占卜 | 羽毛球拍 | 婆媳关系 | 日本动画 | 巴黎 | 拳击比赛 | 东南亚 | 足球经理（FM）（游戏） | youtube | 胡歌（演员） | 地铁跑酷 | 植发 | 张继科 | 三国 | 用户界面 | 演技 | 百度竞价 | 青梅竹马 | 移动硬盘 | 韩晓鹏 | 马龙 | 瘦腿 | 宠物医疗 | 巨蟹座 | 徐峥 | 天蝎座 | 胸肌 | 赵丽颖（演员） | adidas阿迪达斯 | 低音炮 | 星际争霸（游戏） | 豆瓣电影 | 微信开放平台 | 手绘 | 吉他学习 | 江苏卫视 | 模特 | 创意 | 团队管理 | 奢侈品 | 王源 | TANK | 笛子 | 偶像 | 莱斯特城 | 维生素 | 新百伦 | 国际物流 | 前女友 | 李小龙 | 华语流行音乐 | 猎头公司 | crm | 搏击项目 | 网站运营 | 鼻炎 | 篮球游戏 |

你的位置：网站首页 >> 频道首页 >>数学 >>逻辑回归极大似然，解最大似然估计为什么用梯度下降

逻辑回归极大似然，解最大似然估计为什么用梯度下降

来源：蜘蛛抓取(WebSpider) 时间：2018-07-20 07:11 标签：极大似然估计法的步骤

564 条评论分享收藏感谢收起面壁者系列：Logistic回归有问题，上知乎。知乎作为中文互联网最大的知识分享平台，以「知识连接一切」为愿景，致力于构建一个人人都可以便捷接入的知识分享网络，让人们便捷地与世界分享知识、经验和见解，发现更大的世界。什么是逻辑回归？是一种广义的线性回归分析模型，因此与多重线性回归分析有很多相同之处。它们的模型形式基本上相同，都具有
，其中w和b是待求参数，其区别在于他们的因变量不同，多重线性回归直接将
作为因变量，即
，而logistic回归则通过函数L将
对应一个隐状态h，
,然后根据h 与1-h的大小决定因变量的值。如果L是logistic函数，就是logistic回归，如果L是多项式函数就是多项式回归。适用条件：因变量为二分类的分类变量或某事件的发生率，并且是数值型变量。但是需要注意，重复计数现象指标不适用于Logistic回归。残差和因变量都要服从二项分布。二项分布对应的是分类变量，所以不是正态分布，进而不是用最小二乘法，而是最大似然法来解决方程估计和检验问题。自变量和Logistic概率是线性关系各观测对象间相互独立。模型特点：优点：计算代价不高，易于理解和实现缺点：容易欠拟合，分类精度可能不高适用数据类型：数值型和标称行Hypothesis Function 假设方程 w 是参数/系数， x是自变量。
根据自身特性，它的输出范围在0和1之间。Sigmoid Function/Logistic Function公式(1) 就是 Sigmoid 函数。可以理解为给定数据X，参数 theta, y=1的概率是多少Decision Boundary 决定边界定义： the line separates the area where y=0 and where y = 1 一般情况来说：
任何大于 0.5 的数据被分入 1 类，小于 0.5 即被归入 0 类。Cost Function 我们的目标是尽量让我们hypothesis function得出的预测值和真实值接近。那么 Cost Function就是衡量接近多少的一个函数。越近，这个函数就越小。因为这是一个凸函数，那也就意味着我们可以用梯度下降法来解决得到最优解。Gradient Decent 梯度下降梯度下降就不过多介绍了，是一个一阶最优化算法，通常也称为最速下降法。要使用梯度下降法找到一个函数的局部极小值，必须向函数上当前点对应梯度（或者是近似梯度）的反方向的规定步长距离点进行迭代搜索。如果相反地向梯度正方向迭代进行搜索，则会接近函数的局部极大值点；这个过程则被称为梯度上升法。大致意思就是用梯度来得到最优解。算法就是： Multi-class classification one-vs-all逻辑回归也可以用作多元分类/多标签分类。思路如下：假设我们标签A中有a0,a1,a2....an个标签，对于每个标签 ai (ai 是标签A之一)，我们训练一个逻辑回归分类器。即，训练该标签的逻辑回归分类器的时候，将ai看作一类标签，非ai的所有标签看作一类标签。那么相当于整个数据集里面只有两类标签：ai 和其他。剩下步骤就跟我们训练正常的逻辑回归分类器一样了。测试数据的时候，将查询点套用在每个逻辑回归分类器中的Sigmoid 函数，取值最高的对应标签为查询点的标签。避免过拟合的方法减少多余的特征：手动选择那些特征保留下来使用算法来选择特征。Regularization 正则化：就是保留所有特征，但是需要加入在cost function一个正则化项，也就是下面公式加号右边那个求和式子。
叫做正则化参数，太大太小都不太好。太大会 underfitting（欠拟合），太小可能会 overfitting（过拟合）。刚刚好，就是fitting拟合。正则化项中的 j是从1开始的。因为我们不需要对
进行惩罚。m前面 2 是为了求导方便，无其他实际意义。当所有特征都有点用的时候，删除特征未免太过可惜。这个场景下，正则化就比较适用。这时候梯度下降算法就会更改为： Logistic回归和最大熵模型Logistic 回归和最大熵模型都属于对数线性模型（log linear model）。学习它们的模型一般采用极大似然估计或者正则化的极大似然估计。Logistic 回归和最大熵模型学习可以形式化为无约束最优化问题。除了使用梯度下降法来解决，可以使用改进的迭代尺度法和拟牛顿法来解决。赞同分享收藏文章被以下专栏收录逻辑回归与梯度下降
Logistic回归为概率型非线性回归模型，是研究二分类观察结果与一些影响因素之间关系的一种
多变量分析方法。通常的问题是，研究某些因素条件下某个结果是否发生，比如医学中根据病人的一些症状来判断它是
否患有某种病。
在讲解Logistic回归理论之前，我们先从LR分类器说起。LR分类器，即Logistic Regression Classifier。
在分类情形下，经过学习后的LR分类器是一组权值，当测试样本的数据输入时，这组权值与测试数
据按照线性加和得到
这里是每个样本的个特征。之后按照Sigmoid函数（又称为Logistic函数）的形式求出
由于Sigmoid函数的定义域为，值域为，因此最基本的LR分类器适合对两类目标进行分类。
所以Logistic回归最关键的问题就是研究如何求得这组权值。此问题用极大似然估计来做。
下面正式地来讲Logistic回归模型。
考虑具有个独立变量的向量，设条件慨率为根据观测量相对于某事件发生
的概率。那么Logistic回归模型可以表示为
其中，那么在条件下不发生的概率为
所以事件发生与不发生的概率之比为
这个比值称为事件的发生比（the odds of experiencing an event），简记为odds。
可以看出Logistic回归都是围绕一个Logistic函数来展开的。接下来就讲如何用极大似然估计求分类器的参数。
假设有个观测样本，观测值分别为，设为给定条件下得到的概率，
同样地，的概率为，所以得到一个观测值的概率为。
因为各个观测样本之间相互独立，那么它们的联合分布为各边缘分布的乘积。得到似然函数为
然后我们的目标是求出使这一似然函数的值最大的参数估计，最大似然估计就是求出参数，使
得取得最大值，对函数取对数得到
现在求向量，使得最大，其中。
这里介绍一种方法，叫做梯度下降法（求局部极小值），当然相对还有梯度上升法（求局部极大值）。
对上述的似然函数求偏导后得到
由于是求局部极大值，所以根据梯度上升法，有
根据上述公式，只需初始化向量全为零，或者随机值，迭代到指定精度为止。
现在就来用C++编程实现Logistic回归的梯度上升算法。首先要对训练数据进行处理，假设训练数据如下
训练数据：TrainData.txt
1 0 0 1 0 1
0 0 1 2 0 0
1 0 0 1 1 0
0 0 0 0 1 0
0 0 1 0 0 0
0 0 1 0 1 0
0 0 1 2 1 0
1 0 0 0 0 0
0 0 1 0 1 0
1 0 1 0 0 0
0 0 1 0 1 0
0 0 1 0 0 0
0 0 1 0 1 0
1 0 0 1 0 0
1 0 0 0 1 0
2 0 0 0 1 0
1 0 0 2 1 0
2 0 0 0 1 0
2 0 1 0 0 0
0 0 1 0 1 0
0 0 1 2 0 0
0 0 0 0 0 0
0 0 1 0 1 0
1 0 1 0 1 1
0 0 1 2 1 0
1 0 1 0 0 0
0 0 1 0 0 0
0 0 0 2 0 0
1 0 0 0 1 0
2 0 1 0 0 0
2 0 1 1 1 0
1 0 1 1 0 0
1 0 1 2 0 0
1 0 0 1 1 0
0 0 0 0 1 0
1 1 0 0 1 0
1 0 1 2 1 0
0 0 0 0 1 0
0 0 1 0 0 0
1 0 1 1 1 0
1 0 1 0 1 0
2 0 1 2 0 0
0 0 1 2 1 0
0 0 1 0 1 0
2 0 1 0 1 0
0 0 1 0 1 0
1 0 0 0 0 0
1 0 0 0 1 0
0 0 0 0 1 0
0 0 1 2 1 0
0 1 1 0 0 0
0 1 0 0 1 0
2 1 0 0 0 0
2 1 0 0 0 0
1 1 0 2 0 0
1 1 0 0 0 1
0 1 0 0 0 0
2 1 0 0 1 0
0 1 0 0 1 0
2 1 0 2 1 0
2 1 0 2 1 0
1 1 0 2 1 0
0 1 0 0 0 1
2 1 1 0 1 0
2 1 0 1 1 0
1 1 0 0 0 1
2 1 0 0 0 0
1 1 0 0 1 0
1 1 0 0 0 0
2 1 0 1 1 0
1 1 0 0 1 0
1 0 1 1 0 1
2 1 0 1 1 0
0 1 0 0 1 0
1 0 1 0 0 0
0 0 1 0 0 1
1 0 0 0 0 0
0 0 0 2 1 0
1 0 1 2 0 1
1 0 0 1 1 0
2 0 1 2 1 0
2 0 0 0 1 0
1 0 0 1 1 0
1 0 1 0 1 0
0 0 1 0 0 0
1 0 0 2 1 0
2 0 1 1 1 0
0 0 1 0 1 0
0 0 0 0 1 0
2 0 0 1 0 1
0 0 1 0 0 0
0 0 0 0 0 0
1 0 1 1 1 1
2 0 1 0 1 0
0 0 0 0 0 0
1 0 1 0 1 0
0 0 0 0 1 0
0 0 0 2 0 0
0 0 0 0 0 0
0 0 1 2 0 0
0 0 1 0 1 0
0 0 1 0 0 1
0 0 0 2 1 0
1 0 1 1 1 0
1 0 0 1 1 0
0 0 1 0 1 0
1 0 0 0 0 0
1 0 1 0 1 0
2 0 0 0 1 0
1 0 0 0 1 0
2 0 0 1 1 0
0 0 1 2 1 0
1 0 1 2 0 0
0 0 1 2 1 0
1 0 0 0 0 0
0 0 1 0 1 0
0 0 0 1 1 0
1 0 0 0 1 0
2 0 0 1 1 0
1 0 0 1 1 0
1 0 1 0 0 0
1 1 0 1 1 0
2 1 0 0 1 0
0 1 0 0 0 0
1 1 0 1 0 1
1 1 0 2 1 0
0 1 0 0 0 0
1 1 0 2 0 0
0 1 0 0 1 0
1 1 0 0 1 1
1 1 0 2 1 0
1 0 0 2 1 0
2 1 1 1 1 0
0 1 0 0 1 0
0 1 0 0 1 0
2 1 0 0 0 1
1 1 0 2 1 0
1 1 0 0 1 0
1 1 1 0 0 0
2 1 0 2 1 0
2 1 1 1 0 0
0 1 0 0 1 0
1 1 0 2 1 0
0 1 0 0 1 0
1 1 0 1 1 0
0 1 0 0 1 0
0 1 0 0 0 0
1 1 0 0 0 0
1 1 0 2 1 0
1 1 0 0 0 0
0 1 1 2 0 0
2 1 0 0 1 0
2 0 1 0 0 1
0 0 1 0 1 0
1 0 1 0 0 0
0 0 1 2 1 0
0 0 1 0 0 0
1 0 1 0 1 0
0 0 1 0 1 0
0 0 1 0 1 0
1 0 1 0 1 0
0 0 0 0 0 1
0 0 1 2 1 0
0 0 1 0 1 0
0 0 1 0 1 0
0 0 1 0 0 0
0 0 1 0 0 1
0 0 1 2 1 0
2 0 1 2 1 0
0 0 1 0 1 0
0 0 1 0 1 0
0 0 1 0 1 0
1 0 0 0 0 0
2 0 1 1 1 0
0 0 1 0 0 1
1 0 1 0 0 0
1 0 1 1 1 0
1 0 1 1 0 0
0 0 1 0 0 0
1 0 1 1 1 0
1 0 1 2 0 0
2 0 0 0 1 0
0 0 1 0 0 1
0 0 1 0 1 0
0 0 1 0 1 0
1 0 1 0 0 0
0 0 1 0 0 0
2 0 1 1 0 0
0 0 1 2 0 0
1 0 0 1 1 1
0 0 0 0 1 0
0 0 0 0 0 1
0 0 1 0 1 0
2 0 1 2 1 0
1 0 0 1 0 0
0 0 1 0 0 0
2 0 0 1 1 1
0 0 1 0 0 0
0 0 1 0 1 0
2 0 1 0 1 0
0 0 1 0 1 0
2 0 0 0 1 0
1 0 1 0 1 0
1 0 0 0 1 0
0 0 1 0 0 1
2 0 0 0 0 0
2 0 0 1 1 0
0 0 1 0 1 0
0 0 0 0 1 0
2 0 1 0 0 0
1 0 1 0 1 0
0 0 0 0 1 0
1 0 1 0 1 0
0 0 1 0 0 0
1 0 1 0 1 0
1 0 1 0 1 0
1 0 1 0 1 0
0 0 1 2 0 0
2 0 1 0 1 1
0 0 1 0 1 0
0 0 1 2 1 0
0 0 0 0 0 0
0 0 1 0 1 0
1 0 1 0 1 0
0 0 1 0 1 0
1 0 1 0 0 0
0 0 1 0 1 0
0 0 1 0 0 0
1 0 1 0 0 0
0 0 1 0 1 0
0 0 1 0 1 0
1 0 0 0 1 0
0 0 1 0 0 0
0 0 0 0 1 0
1 0 1 1 1 0
0 0 0 2 0 0
0 0 1 0 1 0
0 0 1 0 1 0
0 0 1 0 1 0
1 0 0 1 1 0
2 0 0 0 1 0
1 0 0 0 0 0
2 0 0 2 1 0
0 0 1 2 1 0
1 0 1 0 0 1
0 0 1 2 1 0
0 0 1 2 1 0
0 0 1 0 1 0
1 0 1 2 1 0
0 0 0 2 0 0
1 0 0 0 0 0
0 0 0 2 1 0
0 0 1 0 1 0
2 0 0 0 1 0
1 0 0 0 0 0
1 0 0 1 1 0
1 0 1 1 1 0
1 0 1 0 1 1
0 0 1 0 1 0
1 1 0 2 1 0
1 1 0 1 0 0
2 1 0 2 1 0
1 1 1 0 0 0
0 1 1 0 0 0
0 1 1 0 0 1
0 1 0 0 1 0
1 1 1 0 0 0
1 1 1 0 1 0
0 1 0 0 1 0
0 1 1 0 0 1
1 1 1 1 1 0
1 1 0 2 1 0
0 1 0 2 0 0
1 1 0 2 1 0
0 0 1 2 1 0
2 1 1 1 1 0
0 1 0 0 1 0
0 0 1 0 1 0
2 1 0 1 1 0
0 1 0 0 1 0
1 1 0 0 0 0
1 1 0 0 1 0
0 1 0 0 0 0
0 1 1 0 0 0
2 1 0 0 1 0
2 1 0 0 0 0
1 1 0 0 1 0
2 1 0 1 1 0
上面训练数据中，每一行代表一组训练数据，每组有7个数组，第1个数字代表ID，可以忽略之，2~6代表这组训
练数据的特征输入，第7个数字代表输出，为0或者1。每个数据之间用一个空格隔开。
首先我们来研究如何一行一行读取文本，在C++中，读取文本的一行用getline()函数。
getline()函数表示读取文本的一行，返回的是读取的字节数，如果读取失败则返回-1。用法如下：
#include &iostream&
#include &string.h&
#include &fstream&
#include &string&
#include &stdio.h&
int main()
string filename = "data.in";
ifstream file(filename.c_str());
char s[1024];
if(file.is_open())
while(file.getline(s,1024))
int x,y,z;
sscanf(s,"%d %d %d",&x,&y,&z);
cout&&x&&" "&&y&&" "&&z&&
拿到每一行后，可以把它们提取出来，进行系统输入。 Logistic回归的梯度上升算法实现如下
#include &iostream&
#include &string.h&
#include &fstream&
#include &stdio.h&
#include &math.h&
#include &vector&
#define Type double
#define Vector vector
struct Data
Vector&Type&
void PreProcessData(Vector&Data&& data, string path)
string filename =
ifstream file(filename.c_str());
char s[1024];
if(file.is_open())
while(file.getline(s, 1024))
Type x1, x2, x3, x4, x5, x6, x7;
sscanf(s,"%lf %lf %lf %lf %lf %lf %lf", &x1, &x2, &x3, &x4, &x5, &x6, &x7);
tmp.x.push_back(1);
tmp.x.push_back(x2);
tmp.x.push_back(x3);
tmp.x.push_back(x4);
tmp.x.push_back(x5);
tmp.x.push_back(x6);
tmp.y = x7;
data.push_back(tmp);
void Init(Vector&Data& &data, Vector&Type& &w)
w.clear();
data.clear();
PreProcessData(data, "TrainData.txt");
for(int i = 0; i & data[0].x.size(); i++)
w.push_back(0);
Type WX(const Data& data, const Vector&Type&& w)
Type ans = 0;
for(int i = 0; i & w.size(); i++)
ans += w[i] * data.x[i];
Type Sigmoid(const Data& data, const Vector&Type&& w)
Type x = WX(data, w);
Type ans = exp(x) / (1 + exp(x));
Type Lw(const Vector&Data&& data, Vector&Type& w)
Type ans = 0;
for(int i = 0; i & data.size(); i++)
Type x = WX(data[i], w);
ans += data[i].y * x - log(1 + exp(x));
void Gradient(const Vector&Data&& data, Vector&Type& &w, Type alpha)
for(int i = 0; i & w.size(); i++)
Type tmp = 0;
for(int j = 0; j & data.size(); j++)
tmp += alpha * data[j].x[i] * (data[j].y - Sigmoid(data[j], w));
void Display(int cnt, Type objLw, Type newLw, Vector&Type& w)
cout&&"第"&&cnt&&"次迭代:
ojLw = "&&objLw&&"
两次迭代的目标差为: "&&(newLw - objLw)&&
cout&&"参数w为: ";
for(int i = 0; i & w.size(); i++)
cout&&w[i]&&" ";
void Logistic(const Vector&Data&& data, Vector&Type& &w)
int cnt = 0;
Type alpha = 0.1;
Type delta = 0.00001;
Type objLw = Lw(data, w);
Gradient(data, w, alpha);
Type newLw = Lw(data, w);
while(fabs(newLw - objLw) & delta)
objLw = newLw;
Gradient(data, w, alpha);
newLw = Lw(data, w);
Display(cnt,objLw,newLw, w);
void Separator(Vector&Type& w)
Vector&Data&
PreProcessData(data, "TestData.txt");
cout&&"预测分类结果："&&
for(int i = 0; i & data.size(); i++)
Type p0 = 0;
Type p1 = 0;
Type x = WX(data[i], w);
p1 = exp(x) / (1 + exp(x));
p0 = 1 - p1;
cout&&"实例: ";
for(int j = 0; j & data[i].x.size(); j++)
cout&&data[i].x[j]&&" ";
cout&&"所属类别为：";
if(p1 &= p0) cout&&1&&
else cout&&0&&
int main()
Vector&Type&
Vector&Data&
Init(data, w);
Logistic(data, w);
Separator(w);
测试数据：TestData.txt
没有更多推荐了，毛燥的孩子
逻辑回归梯度下降法详解
逻辑回归常用于预测疾病发生的概率，例如因变量是是否恶性肿瘤，自变量是肿瘤的大小、位置、硬度、患者性别、年龄、职业等等（很多文章里举了这个例子，但现代医学发达，可以通过病理检查，即获取标本放到显微镜下观察是否恶变来判断）；广告界中也常用于预测点击率或者转化率(cvr/ctr)，例如因变量是是否点击，自变量是物料的长、宽、广告的位置、类型、用户的性别、爱好等等。
本章主要介绍逻辑回归算法推导、梯度下降法求最优值的推导及spark的源码实现。
一般回归问题的步骤是：
1. 寻找预测函数（h函数，hypothesis）
2. 构造损失函数（J函数）
3. 使损失函数最小，获得回归系数θ
而第三步中常见的算法有：
1. 梯度下降
2. 牛顿迭代算法
3. 拟牛顿迭代算法（BFGS算法和L-BFGS算法）
其中随机梯度下降和L-BFGS在spark mllib中已经实现，梯度下降是最简单和容易理解的。
二元逻辑回归
构造预测函数
hθ(x)=g(θTx)=11+e-θTx
θTx=∑i=1nθixi=θ0+θ1x1+θ2x2+...+θnxnθ=?????θ0θ1...θn?????,x=?????x0x1...xn?????
为何LR模型偏偏选择sigmoid 函数呢？逻辑回归不是回归问题，而是二分类问题，因变量不是0就是1，那么我们很自然的认为概率函数服从伯努利分布，而伯努利分布的指数形式就是个sigmoid 函数。
函数hθ(x)表示结果取1的概率，那么对于分类1和0的概率分别为：
P(y=1|x;θ)=hθ(x)P(y=0|x;θ)=1-hθ(x)
概率一般式为：
P(y|x;θ)=(hθ(x))y((1-hθ(x)))1-y
最大似然估计的思想
当从模型总体随机抽取m组样本观测值后，我们的目标是寻求最合理的参数估计θ′使得从模型中抽取该m组样本观测值的概率最大。最大似然估计就是解决此类问题的方法。求最大似然函数的步骤是：
写出似然函数
对似然函数取对数
对对数似然函数的参数求偏导并令其为0，得到方程组
求方程组的参数
为什么第三步要取对数呢，因为取对数后，乘法就变成加法了，且单调性一致，不会改变极值的位置，后边就更好的求偏导。
构造损失函数
线性回归中的损失函数是：
J(θ)=12m∑i=1m(yi-hθ(xi))2
线性回归损失函数有很明显的实际意义，就是平方损失。而逻辑回归却不是，它的预测函数hθ(x)明显是非线性的，如果类比的使用线性回归的损失函数于逻辑回归，那J(θ)很有可能就是非凸函数，即存在很多局部最优解，但不一定是全局最优解。我们希望构造一个凸函数，也就是一个碗型函数做为逻辑回归的损失函数。
按照求最大似然函数的方法，逻辑回归似然函数：
L(θ)=∏i=1mP(yi|xi;θ)=∏i=1m(hθ(xi))yi((1-hθ(xi)))1-yi
其中m表示样本数量，取对数：
l(θ)=logL(θ)=∑i=1m(yiloghθ(xi)+(1-yi)log(1-hθ(xi)))
我们的目标是求最大l(θ)时的θ，如上函数是一个上凸函数，可以使用梯度上升来求得最大似然函数值(最大值)。或者上式乘以-1，变成下凸函数，就可以使用梯度下降来求得最小负似然函数值（最小值）：
J(θ)=-1ml(θ)
同样是取极小值，思想与损失函数一致，即我们把如上的J(θ)作为逻辑回归的损失函数。Andrew Ng的课程中，上式乘了一个系数1/m，我怀疑就是为了和线性回归的损失函数保持一致吧。
求最小值时的参数
我们求最大似然函数参数的第三步时，令对参数θ偏导=0，然后求解方程组。考虑到参数数量的不确定，即参数数量很大，此时直接求解方程组的解变的很困难，或者根本就求不出精确的参数。于是，我们用随机梯度下降法，求解方程组的值。
当然也可以使用牛顿法、拟牛顿法。梯度下降法是最容易理解和推导的，如下是推导过程：
梯度下降θ的更新过程，走梯度方向的反方向：
θj:=θj-αδδθjJ(θ)
δδθjJ(θ)=-1m∑i=1m(yi1hθ(xi)δδθjhθ(xi)-(1-yi)11-hθ(xi)δδθjhθ(xi))=-1m∑i=1m(yi1g(θTxi)-(1-yi)11-g(θTxi))δδθjg(θTxi)=-1m∑i=1m(yi1g(θTxi)-(1-yi)11-g(θTxi))g(θTxi)(1-g(θTxi))δδθjθTxi=-1m∑i=1m(yi(1-g(θTxi))-(1-yi)g(θTxi))xji=-1m∑i=1m(yi-g(θTxi))xji=1m∑i=1m(hθ(xi)-yi))xji
第二步推导请注意：
(f(x)g(x))′=g(x)f′(x)-f(x)g′(x)g2(x)(ex)′=ex
那么可以推导：
δδθjg(θTxi)=-e-θTxi(1+e-θTxi)2δδθj(-1)θTxi=g(θTxi)(1-g(θTxi))δθjθTxi
因此更新过程可以写成：
θj:=θj-α1m∑i=1m(hθ(xi)-yi))xji
那迭代多少次停止呢，spark是指定迭代次数和比较两次梯度变化或者cost变化小于一定值时停止。
过拟合问题
过拟合问题，即我们求得的回归系数在实验集中效果很好，但之外的数据效果很差。机器学习中的特征基本上是靠人的经验选择的，有可能某一些特征或者特征组合与因变量没有任何关系，即某些θi≈0。所以我们需要把不必要的特征剔除，一般我们使用正则化来保留所有特征，并让它相应的系数≈0，L1范数正则化后θ的更新：
θj:=θj-α1m∑i=1m(hθ(xi)-yi))xji-λmθj
λ越大，对模型的复杂度惩罚越大，有可能出现欠拟合现象。λ越小，惩罚越小，可能新出现过拟合现象。spark逻辑回归的随机梯度下降法中，使用的是L2范数正则化。
多元逻辑回归
推广到K元逻辑回归，即因变量为0、1、2、…、k-1。在二元逻辑回归中有这样的性质：
logP(y=1|x,θ)P(y=0|x,θ)=θTx
推广至K元逻辑回归：
logP(y=1|x,θ)P(y=0|x,θ)=θT1xlogP(y=2|x,θ)P(y=0|x,θ)=θT2x...logP(y=K-1|x,θ)P(y=0|x,θ)=θTK-1x
其中，θ=(θ1,θ2,...,θK-1)T，是个（k-1）*（n+1）的矩阵，n为特征的个数，加1是增加截距项。去除对数则得到概率分布：
P(y=0|x,θ)=11+∑K-1i=1eθTixP(y=1|x,θ)=eθT1x1+∑K-1i=1eθTix...P(y=K-1|x,θ)=eθTK-1x1+∑K-1i=1eθTix
K元逻辑回归似然函数：
L(θ)=∏i=1mP(y|x,θ)
α(yi)=1ifyi=0α(yi)=0ifyi≠0
l(θ,x)=∑i=1mlogP(yi|xi,θ)=∑i=1mα(yi)logP(y=0|xi,θ)+(1-α(yi))logP(yi|xi,θ)=∑i=1mα(yi)log11+∑K-1k=1eθTkx+(1-α(yi))logeθTyix1+∑K-1k=1eθTkx=∑i=1m(1-α(yi))θTyix-log(1+∑k=1K-1eθTkx)
同样的我们得到损失函数：
J(θ,x)=-1ml(θ,x)
θ更新过程：
θj:=θj-αδδθjJ(θ,x)
对θ求偏导得到梯度：
Gkj(θ,x)=-1mδl(θ,x)δθkj=-1m(∑i=1m(1-α(yi))xijδk,yi-eθTxi1+eθTxixij)
其中k表示因变量，j表示特征数量，i表示实验数。
spark源码注释中，稍稍不一样，l(w,x)乘以了-1，其实与我们上边推导的-1m意义一样。我们来看看spark的推导过程。
P(y=0|x,w)=1/(1+∑iK-1exp(xwi))P(y=1|x,w)=exp(xw1)/(1+∑iK-1exp(xwi))...P(y=K-1|x,w)=exp(xwK-1)/(1+∑iK-1exp(xwi)
l(w,x)=-logP(y|x,w)=-α(y)logP(y=0|x,w)-(1-α(y))logP(y|x,w)=log(1+∑iK-1exp(xwi))-(1-α(y))xwy-1=log(1+∑iK-1exp(marginsi))-(1-α(y))marginsy-1
α(i)=1ifi!=0,α(i)=0ifi==0,marginsi=xwi.
?l(w,x)?wij=(exp(xwi)/(1+∑kK-1exp(xwk))-(1-α(y)δy,i+1))*xj=multiplieri*xj
δi,j=1ifi==j,δi,j=0ifi!=j,multiplier=exp(marginsi)/(1+∑kK-1exp(marginsi))-(1-α(y)δy,i+1)
为了不让数值溢出,xw项减了maxMargin，l(w,x)改写为：
l(w,x)=log(1+∑iK-1exp(marginsi))-(1-α(y))marginsy-1=log(exp(-maxMargin)+∑iK-1exp(marginsi-maxMargin))+maxMargin-(1-α(y))marginsy-1=log(1+sum)+maxMargin-(1-α(y))marginsy-1
sum=exp(-maxMargin)+∑iK-1exp(marginsi-maxMargin)-1
而multiplier可以表示为：
multiplier=exp(marginsi)/(1+∑kK-1exp(marginsi))-(1-α(y)δy,i+1)=exp(marginsi-maxMargin)/(1+sum)-(1-α(y)δy,i+1)
先看实例代码:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
val model = new LogisticRegressionWithLBFGS()
.setNumClasses(10)
.run(training)
val predictionAndLabels = test.map { case LabeledPoint(label, features) =&
val prediction = model.predict(features)
(prediction, label)
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision)
model.save(sc, "myModelPath")
val sameModel = LogisticRegressionModel.load(sc, "myModelPath")
随机梯度下降调用:
* Train a logistic regression model given an RDD of (label, features) pairs. We run a fixed
* number of iterations of gradient descent using the specified step size. Each iteration uses
* `miniBatchFraction` fraction of the data to calculate the gradient. The weights used in
* gradient descent are initialized using the initial weights provided.
* NOTE: Labels used in Logistic Regression should be {0, 1}
input RDD of (label, array of features) pairs.
numIterations Number of iterations of gradient descent to run.迭代次数
stepSize Step size to be used for each iteration of gradient descent.步长
miniBatchFraction Fraction of data to be used per iteration.用于模型预估数据的比例
initialWeights Initial set of weights to be used. Array should be equal in size to the number of features in the data.初始化权重
@Since("1.0.0")
def train(
input: RDD[LabeledPoint],
numIterations: Int,
stepSize: Double,
miniBatchFraction: Double,
initialWeights: Vector): LogisticRegressionModel = {
new LogisticRegressionWithSGD(stepSize, numIterat2 Aions, 0.0, miniBatchFraction)
.run(input, initialWeights)
LogisticRegressionWithLBFGS和LogisticRegressionWithSGD都继承于GeneralizedLinearModel,它的run方法:
def run(input: RDD[LabeledPoint], initialWeights: Vector): M = {
if (numFeatures & 0) {
numFeatures = input.map(_.features.size).first()
if (input.getStorageLevel == StorageLevel.NONE) {
logWarning("The input data is not directly cached, which may hurt performance if its"
+ " parent RDDs are also uncached.")
if (validateData && !validators.forall(func =& func(input))) {
throw new SparkException("Input validation failed.")
* Scaling columns to unit variance as a heuristic to reduce the condition number:
* During the optimization process, the convergence (rate) depends on the condition number of
* the training dataset. Scaling the variables often reduces this condition number
* heuristically, thus improving the convergence rate. Without reducing the condition number,
* some training datasets mixing the columns with different scales may not be able to converge.
* GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
* the weights in the original scale.
* See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
* Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
* the variance of each column (without subtracting the mean), and train the model in the
* scaled space. Then we transform the coefficients from the scaled space to the original scale
* as GLMNET and LIBSVM do.
*通过每一列除以这一列的标准差,将数据标准化.LBFGS算法中可以启用.
* Currently, it's only enabled in LogisticRegressionWithLBFGS
val scaler = if (useFeatureScaling) {
new StandardScaler(withStd = true, withMean = false).fit(input.map(_.features))
val data =
if (addIntercept) {
if (useFeatureScaling) {
input.map(lp =& (lp.label, appendBias(scaler.transform(lp.features)))).cache()
input.map(lp =& (lp.label, appendBias(lp.features))).cache()
if (useFeatureScaling) {
input.map(lp =& (lp.label, scaler.transform(lp.features))).cache()
input.map(lp =& (lp.label, lp.features))
* TODO: For better convergence, in logistic regression, the intercepts should be computed
* from the prior probability distribu for linear regression,
* the intercept should be set as the average of response.
val initialWeightsWithIntercept = if (addIntercept && numOfLinearPredictor == 1) {
appendBias(initialWeights)
/** If `numOfLinearPredictor & 1`, initialWeights already contains intercepts. */
initialWeights
val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)
createModel(weights, intercept)
梯度下降SGD实现:
def runMiniBatchSGD(
data: RDD[(Double, Vector)],
gradient: Gradient,
updater: Updater,
stepSize: Double,
numIterations: Int,
regParam: Double,
miniBatchFraction: Double,
initialWeights: Vector,
convergenceTol: Double): (Vector, Array[Double]) = {
val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
var weights = Vectors.dense(initialWeights.toArray)
val n = weights.size
* For the first iteration, the regVal will be initialized as sum of weight squares
* if it's L2 for L1 updater, the same logic is followed.
var regVal = updater.compute(
weights, Vectors.zeros(weights.size), 0, 1, regParam)._2
var converged = false
while (!converged && i &= numIterations) {
val bcWeights = data.context.broadcast(weights)
val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
.treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
seqOp = (c, v) =& {
val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
(c._1, c._2 + l, c._3 + 1)
combOp = (c1, c2) =& {
(c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
if (miniBatchSize & 0) {
* lossSum is computed using the weights from the previous iteration
* and regVal is the regularization value computed in the previous iteration as well.
stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
val update = updater.compute(
weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
stepSize, i, regParam)
weights = update._1
regVal = update._2
previousWeights = currentWeights
currentWeights = Some(weights)
if (previousWeights != None && currentWeights != None) {
converged = isConverged(previousWeights.get,
currentWeights.get, convergenceTol)
logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format(
stochasticLossHistory.takeRight(10).mkString(", ")))
(weights, stochasticLossHistory.toArray)
combOp是θ的更新过程中的∑过程.在二元逻辑回归情况下:
* For Binary Logistic Regression.
* Although the loss and gradient calculation for multinomial one is more generalized,
* and multinomial one can also be used in binary case, we still implement a specialized
* binary version for performance reason.
val margin = -1.0 * dot(data, weights)
val multiplier = (1.0 / (1.0 + math.exp(margin))) - label
axpy(multiplier, data, cumGradient)
if (label & 0) {
MLUtils.log1pExp(margin)
MLUtils.log1pExp(margin) - margin
margin 就是(-θTx),而multiplier就是hθ(xi)-yi.axpy方法就是(hθ(xi)-yi))xi.
(数学公式编辑器)
没有更多推荐了，

逻辑回归极大似然，解最大似然估计为什么用梯度下降

我要回帖

更多关于极大似然估计法的步骤的文章

随机推荐

逻辑回归极大似然，解最大似然估计为什么用梯度下降

我要回帖

更多关于 极大似然估计法的步骤 的文章

随机推荐

更多关于极大似然估计法的步骤的文章