How Do Machine Learning Experts and Statisticians Differ in Their Views?

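Published by 数盟 (DataScientistUnion). The following is an answer by 飞绝眷岭 on Zhihu.

This is a very good question, and many leading figures in both machine learning and statistics have thought about it to some degree. Ryan Adams, for example, discussed the difference between statistics and machine learning on the recently popular podcast Talking Machines. The main conclusion first: statisticians care more about a model's interpretability, while machine learning experts care more about its predictive power.

What follows combines a paraphrased summary of the Talking Machines episode (the quoted passages) with my own commentary. An introduction to the show and the English original are attached at the end.

Where exactly is the boundary between statistics and machine learning? Machine learning has made astonishing breakthroughs over the past decade or so. Can we credit these successes to advances in the statistical methods behind them, or is there another reason?

Ryan Adams: "I think the most essential difference between statistics and machine learning is that their fundamental goals differ. Statisticians care more about a model's interpretability, while machine learning experts care more about its predictive power."

* What is statistics trying to do?

"What follows may be a bit cartoonish, a simplification of the question, but in a sense statistics is a tool in the service of science. Statistical testing and hypothesis testing are the instruments of choice when scientists want to understand some property of the world or answer a question about the key factors in some process. For example, does a given drug actually work against a disease? Statisticians provide excellent tools for questions of this kind, and with them scientists can measure and estimate the effects of variables that can be understood. By 'variables that can be understood' I mean that the variables in a scientist's model usually have units an ordinary person can grasp (mass or velocity in a physical model, say) and are tied directly to the phenomenon or effect under observation (for instance, when studying how much a genetic modification affects phenotype, the amount of the gene and the amount of protein it expresses). In other words, the parameters in a statistical model have real-world meaning, so when a hypothesis passes a test, we learn how certain real-world variables interact. From this angle, statistics often sees itself as the gatekeeper of quantitative analysis. Its goal is to use rigorous measurement, estimation, and hypothesis testing to select hypotheses that can be trusted, and thereby to understand causal relationships in living organisms or particular processes and effects in society. Of course, statistics also includes many other important ideas. For example, it provides standards and a constructive framework for experimental design in psychology, medicine, and sociology."

In short, statistics is largely a series of tests about the nature of the world, and its goal is to build a model of the world that can be understood.

* What is the goal of machine learning?

"Unlike statistics, machine learning cares less about a model's interpretability than about its predictive power. The goal of machine learning is to build efficient, reliable systems that can keep making predictions and keep working stably. For example, a computer vision system has to correctly predict whether the animal in a picture is a cat or a dog, or whether two face photos show the same person, and an indoor robot has to correctly recognize its surroundings. But the parameters inside these systems are usually enormous in number and cannot be understood directly, let alone carry units that correspond to real-world properties. Even if nobody knows what these parameters represent, what counts is that the model works and keeps giving highly accurate predictions. Sometimes a sharp jump in accuracy on some class of machine learning problems comes from people working out the theoretical difficulty behind the optimization problem. More often, though, whether an algorithm succeeds is decided entirely by its predictive results, even when people still understand little about why it works."

Deep learning was relatively quiet in the decade from 2000 to 2010, largely because people could not understand what was happening during learning or what the features it learned layer by layer actually were. Its performance also did not stand out, so it attracted little attention among models that were easier to understand. But after Geoff Hinton's group left its competitors far behind with a deep convolutional neural network in the 2012 ImageNet contest, deep nets won everyone over almost overnight. In Geoff's own words: "Data always wins. When you can halve the error rate, people will take you seriously."

* Who advanced whom?

To some extent, many ideas from statistical learning theory did inspire machine learning. But the main reasons machine learning has developed so quickly and become so popular in recent years are the sharp growth in available training data (the spread of the internet, the maturing of human computation platforms, the digitization of all kinds of offline data, and so on) and the rapid progress in computing performance, not necessarily a fundamental breakthrough in statistical theory itself. (P.S. GPU computing has advanced at an almost absurd pace, and NVIDIA spares no expense in funding research institutions, handing out graphics cards at a remarkable rate.)

"As a fast-growing field, machine learning has undoubtedly caught the attention of statisticians. More and more first-rate statisticians are drawing fresh ideas from machine learning, whether at the level of algorithms, models, or statistical inference. You could say that machine learning has rediscovered and redefined ideas that statistical learning proposed long ago. Although the success of machine learning involves a range of statistical learning methods, that does not mean statistics itself is the root cause or the main driver of that success."

* Common ground despite the differences

Although statistics and machine learning aim at different goals, in some cases both care about the same question: why does a model work? As noted above, practical results are the ultimate test of whether a model is effective. Still, people hope eventually to understand what happens during step-by-step optimization: which tricks, and which conditions, make the prediction error converge quickly. As the saying goes, the devil is in the details. Even today, with deep learning sweeping the whole field, people only half understand the tricks inside a neural network. How can the structure of existing networks and their update rules be improved so that they converge faster and generalize more accurately? As machine learning researchers probe these inner workings step by step, they also keep pushing model accuracy to new highs.

Of course, because machine learning leans more toward practical application, it faces more than theoretical upper and lower bounds. Real problems also impose limits on running time and storage, which people working on statistical theory often care less about.

At the end of his answer, Ryan said: "As a researcher active in both fields, I sometimes wonder which side I belong to. Machine learning experts generally don't miss the top statistics conferences. But deep down, I still feel that machine learning is my real home :)"

A few musings from the author of this answer: machine learning is a truly fascinating field. It is interesting because it lets you discover many intriguing patterns in real-world data, and also because of the wisdom contained in the algorithms that lead you to those patterns. I feel very lucky to have entered this field, and I hope to become a serious player soon.

Finally, a small plug for Talking Machines. Launched early this year, it is a machine learning interview show that is both entertaining and high in quality, with a lineup full of star researchers. Each episode usually invites leading figures to talk about the field's development and latest trends. In one episode, for example, they managed to bring together Geoff Hinton, Yoshua Bengio and Yann LeCun on the same show! I hardly need to say more about the caliber of this program. Anyone interested in machine learning should check it out.

English original of this episode (4:50-10:50), transcribed roughly by ear, so please go easy on any inaccuracies:

The History of Machine Learning from the Inside Out

Q: One may be concerned with statistical efficiency and one is concerned with computational efficiency. But in practice, they play with a lot of the same problems and some advances. Are some of the advances we’ve seen in the last 10 years in image processing, speech recognition and translation, are they the results of advances in machine learning or are they the results of advances or uses in statistics?

A: I think there are really some cool differences between statistics and machine learning. It really has a lot to do with what the objectives are. This is gonna be a little bit cartoonish, but at some level, statistics as related to service to science.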
Which scientists broadly defined, means answer questions, coherently about properties of the world. So statistical testing, like hypothesis testing, is a really important thing, where there is some effect you would like to understand. Rather than whether it exists or not, whether this drug works or not. And statistics has an amazing toolkit for answering questions like that of this flavor. Then there is also kind of estimating interpretable properties of the world.

So you build a model and it contains variables, and they should be understandable in terms of phenomena, and have units we understand. You want to know what effect this genetic modification has on phenotype. So I think statistics, sort of views itself as in many ways, being about performing those kinds of estimation. And getting answers that are trustable, trustable by society broadly defined. And as a result in some ways, the field of statistics is kind of a gate-keeper for a lot of quantitative ideas of estimation of the data, in which it requires some kind of rigor for understanding a lot of biological, sociological processes. I should say that statistics, of course, includes a lot of other important ideas, like experimental design, gene-statistics, and a lot of other things.

Machine learning, has on the other hand, been largely about prediction. And about building systems that are not necessarily interpretable, that don’t necessarily with parameter estimation that something makes sense, like a unit, and so on. But it is entirely about making a great prediction about something like, oh, it’s an image, a cat or a dog, or is this person the same, what environment this robot is navigating in, and so on.

And there is kind of a philosophy that statistics, about testing, about recovering that truth, whereas machine learning people have been happier to just make great predictions. Some successes have been due to theoretical understanding that empirical success, as measured by actually doing well, on different problems, is kind of sufficient.

As I have said, it’s kind of a cartoon impression of things. But I think it holds true in forms of a lot of different sorts of problems. And this is different though, that have the success of machine learning, in the last 8 or 15 years. Has those been due to statistical ideas… And you know at some level, it is certainly true that statistical ideas inform machine learning and there is a lot of language, but I think, part of the reason that machine learning has become popular is that because it is so aggressively raised in new computational capabilities, algorithmic. And statistical methodology particularly, has this sort of conservatism. That caused it to embrace pure algorithmic ideas.

What we are saying, I think, are the sort of, is actually a real merging of these fields, in which many good statisticians are starting to pay a lot more attention to machine learning community for interesting algorithmic ideas, and sort of new modeling insights, and inference insights.

And I think machine learning is really coming around to push there is a long history of very important ideas in statistics that don’t need to just be reinvented over and over again.

I guess at the end of the day I would say that just because a lot of these new successes have involved a lot of statistical methodologies doesn’t mean that statistics are sort of responsible for them.

And in some ways, and some of the very best people around and had a hard time identifying I belong to one, or the other. I kind of think myself as being like this, like I like to go to statistics conferences, and talk to statisticians, but I also really care about computation, I care about the AI version of this problem. And consider this machine learning community as been my home.