What Makes a Good Big Data Scientist

Big Data Uncovered: What Does A Data Scientist Really Do?
The world of Big Data and data science can often seem complex or even arcane from the outside looking in. In business, a lot of people by now probably understand the basics of what Big Data analysis involves: collecting the ever-growing amount of data we are generating, and using it to come up with meaningful insights. But what does this actually involve on a day-to-day level for the professionals who get their hands dirty with the nuts and bolts?
To have a look under the hood of a job that some describe as the 'Sexiest Job Of The 21st Century', I spoke to leading data scientist Dr Steve Hanks to get an overview of what the work of a data scientist actually involves, and what sort of person is likely to be successful in the field.
Dr Hanks gained a PhD in computer science at Yale University, spent 15 years as a professor of computer science, and has worked at companies including Amazon, Yahoo and Microsoft. Today he is chief data scientist at Whitepages, where he is responsible for overseeing the Contact Graph, a database containing contact information for over 200 million people. The database is searched around two billion times every month and is the company's primary business asset.
This database has driven Whitepages' business since it was launched in 1997, and more recently the company has diversified into app development. Caller ID, its replacement mobile user interface, queries the main Whitepages database to give more complete information on who is calling, and to help cut nuisance and spam calls. Whitepages also generates another revenue stream by providing its data to other companies to use in fraud prevention.
Key capabilities of a data scientist
The term "data scientist" can cover many roles across many industries and organizations, from academia to finance or government. Hanks leads a team of 12 to 15 members responsible for all of the analytics at Whitepages, and their skillsets and duties vary. However, he tells me, there are three key capabilities which every data scientist has to understand.
You have to understand that data has meaning
Hanks makes the point that we often overlook the fact that data means something and that it is important to understand that meaning. We have to look beyond the numbers and understand what they stand for if we are to gain any valid insights from them. Hanks points out: "It doesn't have anything to do with algorithms or engineering or anything like that. Understanding data is really an art, and it's really important."
You have to understand the problem that you need to solve, and how the data relates to that
Here is where you open your tool-kit to find the right analytics approaches and algorithms to work with your data. Hanks talks about machine learning, which is very popular right now, but makes the point that there are hundreds of techniques for using data to solve problems (operations research, decision theory, game theory, control theory) which have all been around for a very long time. Hanks says: "Once you understand the data and you understand the problem you're trying to solve, that's when you can match the algorithm and get a meaningful solution."
You have to understand the engineering
The third capability is about understanding and delivering the infrastructure required to perform any analysis. In Hanks's words: "It doesn't do any good to solve the problem if you don't have the infrastructure in place to deliver the solution effectively, accurately and at the right time and place."
Being a good data scientist is really about paying attention to all three of those capabilities. You have to pay attention to the data and what it means, understand the problems and know about matching algorithms to those problems, and you have to understand the engineering to come up with solutions.
At the same time, that doesn't mean there's no room for specialization. Hanks makes the point that it is virtually impossible to be an expert in all three of those areas, not to mention all the sub-divisions of each of them. It is okay to specialize in one of these areas as long as you have an appreciation of all of them. Hanks tells me: "Even if you're primarily an algorithm person or primarily an engineer, if you don't understand the problem you're solving and what your data is, you're going to make bad decisions."
Key qualities of a data scientist
In terms of personal qualities, a curiosity about data is essential, as well as communication skills, says Hanks. "People on my team spend a lot of time talking to customers to figure out what problems they need to solve, or talking to data vendors to find out what they can provide. So you become a middle man and communication is very important."
Lots of different types of people go into data science, and Hanks explained to me that he was probably not a very typical example. However, in my experience there is no such thing as a typical data scientist. The key capabilities Hanks mentioned cover a broad range of skills, and people of different personality types and mindsets are attracted to the profession.
"I just really loved the interplay," Hanks says. "From the beginning I was just totally fascinated. My first exposure to data science was probably in operations research, and I just loved the idea that you could take big data sets and use them to learn things, and improve things, and I found out that you really could use them to make a difference. I've found that fascinating for over 30 years now."
Even after all that time in the business, though, problems still come up which have him scratching his head, and these serve as a great example of the sort of challenges data scientists find themselves struggling with on a day-to-day basis.
"Just this morning I was working on something and one of the algorithms just wasn't doing what it was supposed to do. Basically it was showing us a link between a particular person and a particular phone number which we just knew was incorrect. These problems can be very intermittent and very hard to diagnose.
"We have very specific algorithms that are supposed to do very specific things, and when they don't we just have to take them apart and find out why not. The problem is these days they are very complex and have a lot of working pieces! I can be completely mystified, like I am right now … but we will get there – we always do! That's really the sort of challenge we face day to day: systems which just don't behave the way they are supposed to according to our schematics."
In the time that Hanks has been working with data he has seen huge changes in the field, from working on structured databases on mainframes, to distributed Hadoop networks, to the cloud-based, real-time data processing world of today. So where does he see the future taking analytics and Big Data?
The future of data science
Hanks sees a future of increased data streaming and real-time data processing, as opposed to huge batch processing of data. He believes that in this new world Hadoop MapReduce is less appropriate, and in his work he is starting to use other tools such as Scala and Akka.
One of the biggest challenges Hanks sees is keeping up with the fast development of new technologies and new algorithms. He believes that in order to be an effective data scientist you have to be holistic. It is relatively easy to become a specialist in MapReduce or a particular machine learning algorithm, he says, but the challenge is keeping up with the general speed of development in data science. "It's a field that is just stunningly big and complex, and has incredible breadth and depth," Hanks tells me. "You have to understand all of the pieces but the field is getting so vast – that's going to be the challenge facing data scientists going into the future."
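  On what a data scientist is, Xiaoke once explained in an earlier article:
  Data scientists are the alchemists of the twenty-first century: they see into raw data and transform it. Data scientists use statistics, machine learning and analytical methods to solve key business problems, helping companies turn big data into valuable, actionable insights.
  In fact, becoming an excellent data scientist is the dream of many data practitioners. So, if you aspire to become one, which good habits do you think a data person cannot do without? Xiaoke has excerpted a highly upvoted answer from Zhihu for reference; you are also welcome to share your views in the comments at the end of the article.
  Answered by: 曾耀辉
  Original answer link: /question/
  Most of the existing answers talk about fairly abstract, high-level things, such as understanding the business, reading the humanities, and cultivating curiosity. Let me talk about concrete data analysis habits instead.
  1. Before analyzing data, visualize it as much as you possibly can! Visualize! Visualize! Do exploratory data analysis!
  (Said three times!!!)
  Almost every instructor in the applied statistics courses I have taken stressed this point. For data scientists and statisticians it is probably the single most practical habit. In real data analysis, visualization can reveal many insights: which model to choose, which features to use in modeling, how to analyze the results, how to interpret them, and so on.
  A famous example is Anscombe's quartet: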
  https://en.wikipedia.org/wiki/Anscombe%27s_quartet
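  The example consists of four data sets, each with 11 (x, y) sample points. Across the four sets the mean and variance of x are exactly equal, the mean and variance of y are nearly equal, and the correlation between x and y is also very close. As a result, linear regression gives essentially the same fitted line for all four. Yet the data sets themselves are very different, as a four-panel plot of the data shows.
  Without visualization, simply running a linear regression gives us the same regression line every time. Once the data are plotted, the differences are obvious at a glance: the top-left panel is a textbook linear regression; the top-right panel needs a higher-order nonlinear term; in the bottom-left panel x and y are linearly related but there is an outlier; and in the bottom-right panel x and y have no linear relationship, and there is also an outlier.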
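The claim about identical summary statistics can be checked directly. The sketch below is only an illustration, not part of the original answer: it uses the quartet's values as they are commonly tabulated, and the names `QUARTET`, `summarize` and `SUMMARIES` are made up for this example. It computes the means, sample variances, Pearson correlation and least-squares line for each set using only the Python standard library.

```python
from statistics import fmean, variance

# Anscombe's quartet as commonly tabulated; sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return summary statistics and the least-squares line y = intercept + slope * x."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": variance(xs),   # sample variance (n - 1 denominator)
        "mean_y": mean_y,
        "var_y": variance(ys),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four printed rows agree to about two decimal places, which is exactly why the numbers alone cannot distinguish the sets; only a scatter plot of each set reveals the curvature and the outliers.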
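  Every data scientist should be familiar with how to draw all kinds of plots and, more importantly, with how different plots convey different information and which plot best reveals the information in the data for a given data type.
  To that end, I strongly recommend a book on R's ggplot2 package: ggplot2 - Elegant Graphics for Data Analysis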
  /?target=http%3A///us/book/6
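  On the other hand, of course, if the data set is too large or too high-dimensional, visualization becomes difficult. That is when some experience and a few tricks are needed.
  2. When the program has finished and you have the model results, remind yourself: the job is only 50% done. Analyzing, validating and interpreting the results is what really matters!
  We often think that once the code is written and run, we are finished. Getting that far only makes you a competent data analyst; it is still a long way from being a data scientist or statistician. Analyzing, validating and interpreting the results is what really matters! This process demands data sense, domain knowledge, and statistical expertise.
  When you get the results, keep asking yourself why. Are the model assumptions satisfied? Do the results make sense? Do they answer the research question? Especially when the results do not match expectations, there is either a new discovery or an error! If there is an error, where is it? If the model assumptions do not hold, how can this be corrected? Are there outliers, and how should they be handled? If there are missing values, what is the missingness mechanism (missing at random, missing completely at random, or missing NOT at random)? Is there multicollinearity? Is the data collection biased (e.g. selection bias)? Does the model ignore confounding factors (Simpson's paradox)?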
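To make the last question concrete, here is a minimal sketch of Simpson's paradox. It is an illustration added here, not something from the original answer: the counts are the often-cited kidney-stone treatment example, and the names `COUNTS`, `stratum_rate` and `pooled_rate` are hypothetical. Treatment A has the higher success rate within each stone-size group, yet treatment B looks better once the groups are pooled, because stone size (the confounder) influenced which treatment patients received.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size. Illustrative counts from
# the often-cited kidney-stone example; they are not from the original answer.
COUNTS = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def stratum_rate(treatment, stratum):
    """Success rate of one treatment within one stone-size group."""
    successes, patients = COUNTS[treatment][stratum]
    return Fraction(successes, patients)


def pooled_rate(treatment):
    """Success rate of one treatment when the stone-size groups are merged."""
    successes = sum(ok for ok, _ in COUNTS[treatment].values())
    patients = sum(total for _, total in COUNTS[treatment].values())
    return Fraction(successes, patients)


if __name__ == "__main__":
    for stratum in ("small", "large"):
        a, b = stratum_rate("A", stratum), stratum_rate("B", stratum)
        print(f"{stratum:>6}: A={float(a):.3f}  B={float(b):.3f}")
    print(f"pooled: A={float(pooled_rate('A')):.3f}  B={float(pooled_rate('B')):.3f}")
```

A model fitted to the pooled data without stone size as a covariate would favor B, which is the kind of result that should prompt the "why" questions listed above.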
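  3. Get into the habit of story-telling!
  Explain your analysis results to your boss or collaborators, and make sure they understand! This takes real skill, especially when your collaborator is a layperson.
  If you cannot explain your work, you will only suffer for it, however good the analysis is! Over.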
Editors: 汪梦梦, 王飞翔