
A lot of great pieces have been written about the relatively recent surge in interest in big data and data science, but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, “What’s different here? What’s special about these outliers and what do they tell us about our models and assumptions?”

The reason that big data proponents are so excited about the burgeoning data revolution isn’t just because of the math. Don’t get me wrong, the math is fun, but we’re excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.

That’s big data.

Of course, data are just a collection of facts; bits of information that are only given context — assigned meaning and importance — by human minds. It’s not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.

And therein lies the rub.

Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?

(Semi)Automated science

In 2009, Cornell researchers Michael Schmidt and Hod Lipson published a groundbreaking paper in “Science” titled, “Distilling Free-Form Natural Laws from Experimental Data”. The premise was simple, and it essentially boiled down to the question, “can we algorithmically extract models to fit our data?”

So they hooked up a double pendulum — a seemingly chaotic system whose movements are governed by classical mechanics — and trained a machine learning algorithm on the motion data.

Their results were astounding.

In a matter of minutes the algorithm converged on Newton’s second law of motion: f = ma. What took humanity tens of thousands of years to accomplish was completed on 32 cores in essentially no time at all.
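To make that concrete, here is a minimal sketch in Python of what “searching for a law that fits the data” can look like. This is not Schmidt and Lipson’s algorithm (theirs evolves free-form symbolic expressions); it simply scores a handful of hand-picked candidate expressions against simulated force, mass, and acceleration data and keeps the one that explains the observations best.

```python
# A toy sketch of equation discovery: score candidate expressions against
# simulated data and keep the best-fitting one. Not Schmidt & Lipson's method.
import numpy as np

rng = np.random.default_rng(0)
m = rng.uniform(0.5, 5.0, 200)          # masses (synthetic)
a = rng.uniform(-10.0, 10.0, 200)       # accelerations (synthetic)
f = m * a + rng.normal(0, 0.05, 200)    # "observed" force with measurement noise

# Candidate "laws" the search is allowed to propose
candidates = {
    "m + a":  m + a,
    "m - a":  m - a,
    "m * a":  m * a,
    "m / a":  m / a,
    "a ** 2": a ** 2,
}

# Score each candidate by mean squared error against the observed force
errors = {expr: np.mean((f - pred) ** 2) for expr, pred in candidates.items()}
best = min(errors, key=errors.get)
print(f"best-fitting law: f = {best}  (MSE = {errors[best]:.4f})")
```

Run on the synthetic data above, the search settles on f = m * a, which is the point of the exercise: given enough data, the form of the law can fall out of the search itself.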

In 2011, some neuroscience colleagues of mine, led by Tal Yarkoni, published a paper in “Nature Methods” titled “Large-scale automated synthesis of human functional neuroimaging data”. In this paper the authors sought to extract patterns from the overwhelming flood of brain imaging research.

To do this they algorithmically extracted the 3D coordinates of significant brain activations from thousands of neuroimaging studies, along with words that frequently appeared in each study. Using these two pieces of data along with some simple (but clever) mathematical tools, they were able to create probabilistic maps of brain activation for any given term.

In other words, you type in a word such as “learning” on their website search and visualization tool, NeuroSynth, and they give you back a pattern of brain activity that you should expect to see during a learning task.

But that’s not all. Given a pattern of brain activation, the system can perform a reverse inference, asking, “given the data that I’m observing, what is the most probable behavioral state that this brain is in?”
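Under the hood this is just conditional probability. Here is a toy sketch (hypothetical numbers, not the NeuroSynth code or data) of forward inference, P(activation | term), and reverse inference, P(term | activation), via Bayes’ rule:

```python
# Toy forward/reverse inference over a tiny study-by-voxel activation table.
import numpy as np

# Rows = studies, columns = voxels; 1 means the study reported activation there.
activation = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])
# Whether each study's abstract mentions the term "learning" (hypothetical).
mentions_learning = np.array([1, 1, 0, 1], dtype=bool)

# Forward inference: probability of activation at each voxel given the term.
p_act_given_term = activation[mentions_learning].mean(axis=0)

# Reverse inference via Bayes' rule for a single voxel of interest.
voxel = 2
p_term = mentions_learning.mean()
p_act = activation[:, voxel].mean()
p_term_given_act = p_act_given_term[voxel] * p_term / p_act

print("P(activation | 'learning') per voxel:", p_act_given_term)
print(f"P('learning' | activation at voxel {voxel}): {p_term_given_act:.2f}")
```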

Similarly, in late 2010, my wife (Jessica Voytek) and I undertook a project to algorithmically discover associations between concepts in the peer-reviewed neuroscience literature. As a neuroscientist, the goal of my research is to understand relationships between the human brain, behavior, physiology, and disease. Unfortunately, the facts that tie all that information together are locked away in more than 21 million static peer-reviewed scientific publications.

How many undergrads would I need to hire to read through that many papers? Any volunteers?

Even more mind-boggling, each year more than 30,000 neuroscientists attend the annual Society for Neuroscience conference. If we assume that only two-thirds of those people actually do research, and if we assume that they only work a meager (for the sciences) 40 hours a week, that’s around 40 million person-hours dedicated to but one branch of the sciences.

Annually.

This means that in the 10 years I’ve been attending that conference, more than 400 million person-hours have gone toward the pursuit of understanding the brain. Humanity built the pyramids in 30 years. The Apollo Project got us to the moon in about eight.

So my wife and I said to ourselves, “there has to be a better way”.

Which led us to create brainSCANr, a simple (simplistic?) tool (currently itself under peer review) that makes the assumption that the more often that two concepts appear together in the titles or abstracts of published papers, the more likely they are to be associated with one another.

For example, if 10,000 papers mention “Alzheimer’s disease” that also mention “dementia,” then Alzheimer’s disease is probably related to dementia. In fact, there are 17,087 papers that mention Alzheimer’s and dementia, whereas there are only 14 papers that mention Alzheimer’s and, for example, creativity.
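The core computation is nothing fancier than counting co-mentions. A minimal sketch, with a three-abstract toy corpus standing in for PubMed:

```python
# Count how often two terms co-occur across abstracts (toy corpus, not PubMed).
abstracts = [
    "alzheimer's disease is a leading cause of dementia in the elderly",
    "dementia progression in alzheimer's disease patients",
    "creativity and divergent thinking in healthy adults",
]

def cooccurrence(term_a: str, term_b: str, docs: list[str]) -> int:
    """Number of documents mentioning both terms."""
    return sum(term_a in doc and term_b in doc for doc in docs)

print(cooccurrence("alzheimer", "dementia", abstracts))    # 2
print(cooccurrence("alzheimer", "creativity", abstracts))  # 0
```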

From this, we built what we’re calling the “cognome”, a mapping between brain structure, function, and disease.

Big data, data mining, and machine learning are becoming critical tools in the modern scientific arsenal. Examples abound: text mining recipes to find cultural food taste preferences, analyzing cultural trends via word use in books (“culturomics”), identifying seasonality of mood from tweets, and so on.

But so what?

Deep data

What those three studies show us is that it’s possible to automate, or at least semi-automate, critical aspects of the scientific method itself. Schmidt and Lipson show that it is possible to extract equations that perfectly model even seemingly chaotic systems. Yarkoni and colleagues show that it is possible to infer a complex behavioral state given input brain data.

My wife and I wanted to show that brainSCANr could be put to work for something more useful than just quantifying relationships between terms. So we created a simple algorithm to perform what we’re calling “semi-automated hypothesis generation,” which is predicated on a basic “the friend of a friend should be a friend” concept.

For example, the neurotransmitter “serotonin” has thousands of shared publications with “migraine,” as well as with the brain region “striatum.” However, migraine and striatum only share 16 publications.

That’s very odd. Because in medicine there is a serotonin hypothesis for the root cause of migraines. And we (neuroscientists) know that serotonin is released in the striatum to modulate brain activity in that region. Given that those two things are true, why is there so little research regarding the role of the striatum in migraines?

Perhaps there’s a missing connection?
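Here is a rough sketch of the “friend of a friend” search itself. The co-publication counts below are illustrative stand-ins, not the actual brainSCANr numbers, but the logic is the same: flag pairs of terms that are each strongly tied to a common neighbor yet barely tied to each other.

```python
# Toy "friend of a friend" hypothesis generation over co-publication counts.
copubs = {
    ("serotonin", "migraine"): 4000,   # illustrative counts, not real data
    ("serotonin", "striatum"): 3500,
    ("migraine", "striatum"): 16,
}

def count(a: str, b: str) -> int:
    return copubs.get((a, b)) or copubs.get((b, a)) or 0

def missing_links(terms, strong=1000, weak=100):
    """Yield (a, c, via) where a-via and c-via are strong but a-c is weak."""
    for via in terms:
        others = [t for t in terms if t != via]
        for i, a in enumerate(others):
            for c in others[i + 1:]:
                if (count(a, via) >= strong and count(c, via) >= strong
                        and count(a, c) < weak):
                    yield a, c, via

for a, c, via in missing_links(["serotonin", "migraine", "striatum"]):
    print(f"understudied link: {a} <-> {c} (both strongly tied to {via})")
```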

Such missing links and other outliers in our models are the essence of deep data analytics. Sure, any data scientist worth their salt can take a mountain of data and reduce it down to a few simple plots. And such plots are important because they tell a story. But those aren’t the only stories that our data can tell us.

For example, in my geoanalytics work as the data evangelist for Uber, I put some of my (definitely rudimentary) neuroscience network analytic skills to work to figure out how people move from neighborhood to neighborhood in San Francisco.

At one point, I checked to see if men and women moved around the city differently. A very simple regression model showed that the number of men who go to any given neighborhood significantly predicts the number of women who go to that same neighborhood.

No big deal.

But what was cool was seeing where the outliers were. When I looked at the model’s residuals, that’s where I found the far more interesting story. While it’s good to have a model that fits your data, knowing where the model breaks down is not only important for internal metrics, but it also makes for a more interesting story:

What’s happening in the Marina district that so many more women want to go there? And why are there so many more men in SoMa?
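The analysis itself is about as simple as it sounds. Here is a minimal sketch with made-up neighborhood counts (not Uber’s data): fit an ordinary least-squares line predicting women’s trips from men’s, then rank neighborhoods by how badly the line misses them.

```python
# Fit women ≈ slope * men + intercept, then inspect residuals per neighborhood.
import numpy as np

neighborhoods = ["Marina", "SoMa", "Mission", "Sunset", "Richmond"]
men   = np.array([1200.0, 5000.0, 3000.0, 800.0, 900.0])   # hypothetical counts
women = np.array([2100.0, 3200.0, 2900.0, 750.0, 880.0])

slope, intercept = np.polyfit(men, women, deg=1)  # ordinary least squares
predicted = slope * men + intercept
residuals = women - predicted

# The largest residuals are the neighborhoods where the model breaks down.
for name, r in sorted(zip(neighborhoods, residuals), key=lambda t: -abs(t[1])):
    print(f"{name:10s} residual = {r:+.0f}")
```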

The paradox of information

The interpretation of big data analytics can be a messy game. Maybe there are more men in SoMa because that’s where AT&T Park is. But maybe there are just five guys who live in SoMa who happen to take Uber 100 times more often than average.

While data-driven posts make for fun reading (and writing), in the sciences we need to be more careful that we don’t fall prey to ad hoc, just-so stories that sound perfectly reasonable and plausible, but which we cannot conclusively prove.

In 2008, psychologists David McCabe and Alan Castel published a paper in the journal “Cognition,” titled, “Seeing is believing: The effect of brain images on judgments of scientific reasoning”. In that paper, they showed that summaries of cognitive neuroscience findings accompanied by an image of a brain scan were rated as more credible by readers.

This should cause any data scientist serious concern. In fact, I’ve formulated three laws of statistical analyses:

  1. The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
  2. The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
  3. Any sufficiently advanced statistics can trick people into believing the results reflect truth.

The first law is closely related to the “bike shed effect” (also known as Parkinson’s Law of Triviality) which states that, “the time spent on any item of the agenda will be in inverse proportion to the sum involved.”

In other words, if you try to build a simple thing such as a public bike shed, there will be endless town hall discussions wherein people argue over trivial details such as the color of the door. But if you want to build a nuclear power plant — a project so vast and complicated that most people can’t understand it — people will defer to expert opinion.

Such is the case with statistics.

If you make the mistake of going into the comments section of any news piece discussing a scientific finding, invariably someone will leave the comment, “correlation does not equal causation.”

We’ll go ahead and call that truism Voytek’s fourth law.

But people rarely have the capacity to argue against the methods and models used by, say, neuroscientists or cosmologists.

But sometimes we get perfect models without any understanding of the underlying processes. What do we learn from that?

The always fantastic Radiolab did a follow-up story on the Schmidt and Lipson “automated science” research in an episode titled “Limits of Science”. It turns out, a biologist contacted Schmidt and Lipson and gave them data to run their algorithm on. They wanted to figure out the principles governing the dynamics of a single-celled bacterium. Their result?

Well, sometimes the stories we tell with data … they just don’t make sense to us.

They found, “two equations that describe the data.”

But they didn’t know what the equations meant. They had no context. Their variables had no meaning. Or, as Radiolab co-host Jad Abumrad put it, “the more we turn to computers with these big questions, the more they’ll give us answers that we just don’t understand.”

So while big data projects are creating ridiculously exciting new vistas for scientific exploration and collaboration, we have to take care to avoid the Paradox of Information wherein we can know too many things without knowing what those “things” are.

Because at some point, we’ll have so much data that we’ll stop being able to discern the map from the territory. Our goal as (data) scientists should be to distill the essence of the data into something that tells as true a story as possible while being as simple as possible to understand. Or, to operationalize that sentence better, we should aim to find balance between minimizing the residuals of our models and maximizing our ability to make sense of those models.

Recently, Stephen Wolfram released the results of a 20-year-long experiment in personal data collection, including every keystroke he’s typed and every email he’s sent. In response, Robert Krulwich, the other co-host of Radiolab, concluded by saying, “I’m looking at your data [Dr. Wolfram], and you know what’s amazing to me? How much of you is missing.”

Personally, I disagree; I believe that there’s a humanity in those numbers and that Mr. Krulwich is falling prey to the idea that science somehow ruins the magic of the universe. Quoth Dr. Sagan:

“It is sometimes said that scientists are unromantic, that their passion to figure out robs the world of beauty and mystery. But is it not stirring to understand how the world actually works — that white light is made of colors, that color is the way we perceive the wavelengths of light, that transparent air reflects light, that in so doing it discriminates among the waves, and that the sky is blue for the same reason that the sunset is red? It does no harm to the romance of the sunset to know a little bit about it.”

So go forth and create beautiful stories, my statistical friends. See you after peer-review.

Related:

  • Why Uber’s data fascinates a neuroscientist
Reposted from: https://www.cnblogs.com/aquester/p/9891665.html
