Learning Data Analysis From Scratch

by Mina Demian

A Beginner’s Guide to Data Journalism: Let’s Build a Story From Scratch

This is an introductory guide on how to produce the beginnings of a piece of data journalism. We’re going to walk through it together, as I outline the key things to consider before starting. We’ll cover:

how to structure your work
a basic process to follow
a real-world case study to show how this process works

Data journalism is still about stories

The glitz and glamour of data journalism (the animations, the striking maps, and those great infographics) are all over the Internet. It’s easy to think, then, that it’s about the data and how cool you can make it look, sing, or dance. My wise friends at OpenUp, Raymond and Adi, keep reminding me (and the salivating Internet-at-large) that the emphasis belongs on the journalism in data journalism, not on the data.

Data journalism is no different from the journalism we know and consume every day. Where traditional journalism relies on human sources (e.g., insiders, experts, scholars, and scientists), data journalism treats data sources (e.g., spreadsheets, websites, and databases) with the same rigor and scrutiny that journalists apply to human sources.

The animations and snazzy work help to communicate the final product — the story — but they will never replace the actual story.

The grand start

A data journalism story can start from an important event or it can simply be a question. You may see a breaking headline and wonder how much x did it take for y to happen? Or, you are thinking about food and wonder how much the average shopper spends on dog food. Both questions are valid and great starting points when evaluating a piece for data journalism.

What I’ve learned so far in my work is that there is little difference between doing the work of basic science and that of data journalism. You make an observation, come up with a question (hypothesis for purists), and then you go about trying to answer that question. Your work will show either that your initial hypothesis was indeed incorrect or that, yes, it was indeed correct.

So, as I mentioned earlier, it’s not about the fancy graphics or how much data you trawled through. It’s about knowing what your question is and whether you answered it.

Don’t believe the hype.

“Who’s your data and what does he do?”

This guide is based on data from South Africa’s statistics agency. (The results of the quarterly survey released in the summer of 2015 show the official unemployment rate at a grim 25%.) The agency was kind enough to release the data in an Excel spreadsheet. I will write posts that address how to manage sources of data that are not released in an easy-to-use format.

The dataset is here, and there are enough sheets to warrant exploration. This exploration is important because without knowing what it’s about, what it covers, and so on, you may look at the wrong data. It may lead you to answer the wrong question or — the nightmare of every data journalist — waste hours while achieving very little.

So, before we talk about the process, let’s look at the data and see what it tells us. We don’t usually work with all the data (unless our initial idea or question requires this). It’s better to first look at all the data and then focus on a particular section that catches your attention.

The spreadsheets from the statistics agency look at different characteristics of the workforce (broken down by province, age, gender, and demographic). Even if this is your first time, quickly look over each sheet. Doing so helps develop a methodical work ethic that is invaluable in data journalism.

An important side note: you only need a basic working knowledge of Excel. I don’t wield magic on the worksheets, so anyone can follow them. For the sake of brevity (and so you don’t go into a catatonic stupor), I will leave you to figure out how to do the basic manipulations in Excel.

Now the journey begins

We’ve talked about what it means to produce a work of data journalism, how to evaluate if an idea will lead to a piece, and how to look at a dataset. Finally, we get to the process, the good stuff. How does it work?

Step 1: Take a bite out of the data

For this guide, I want to know the size of the workforce of each province in South Africa, and how it fared between 2013 and the 2nd quarter of 2015. This data is displayed in the first worksheet. (You’re welcome to look through the other spreadsheets to see what interesting insights you can mine from them.)

So we went from working with the original spreadsheet, which has more than 20 worksheets, as shown below:

To working with just one worksheet, titled “Table 1: Population of working age (15–64 years),” as shown below.

Now, let’s copy the data at the bottom of the sheet, since it’s the data we want, and paste it into a new worksheet. To move towards a clean dataset, delete the row with the Thousands heading and delete the cell labelled South Africa. Also, delete the Totals row so that it doesn’t confuse us later. (In a moment, I will convert all the values from thousands to full figures.)

It now should look like this:

Now, let’s convert all the cells from thousands to full figures, which run into the millions. Create a new column beside each existing column, and multiply each value by 1000. It now looks like this:

Remove all borders and decimal places, and make the thousand separator a comma. This helps to make our chart easier to read and more accessible. At this point, this table should be ready to be analysed.

Not quite yet. Although it is cleaner, the data structure we need is not there. Why does this matter? Because the data needs to be organized in a way that lets us aggregate or group it. The old sages of data journalism say that if your data is not summarized (or aggregated), it is not ready to be analyzed.
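
For readers who would rather script the cleaning than click through it, here is a minimal Python sketch of the same idea: drop the non-data rows, multiply the thousands by 1000, and format with a comma separator. The row layout, the province names, and every number are placeholders invented for illustration, not the agency’s figures, and the helper names are hypothetical.

```python
def clean_rows(raw_rows, drop_labels=("Thousands", "South Africa", "Total")):
    """Drop non-data rows and convert values reported in thousands to full figures."""
    cleaned = []
    for label, *values in raw_rows:
        if label.strip() in drop_labels:
            continue  # skip the units row, the national row, and the totals row
        cleaned.append([label.strip()] + [round(float(v) * 1000) for v in values])
    return cleaned


def format_count(n):
    """Format a whole number with a comma as the thousands separator, no decimals."""
    return f"{n:,}"


# Placeholder values in thousands (assumed layout: label, then one value per quarter).
raw_rows = [
    ["Thousands", "", ""],
    ["Province A", "1234.5", "1250.0"],
    ["Province B", "800.0", "790.25"],
    ["Total", "2034.5", "2040.25"],
]

for row in clean_rows(raw_rows):
    print(row[0], [format_count(v) for v in row[1:]])
```

Keeping the dropped labels in a parameter makes it easy to adjust if your copy of the sheet uses different wording.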

Step 2: Transform the data into an analysis-/visualization-ready structure

What factors are we looking to expose from this data set? They are province, year, and the total number of workers. But before that, we’re going to create this new data structure with the following columns:

If you studied database design, you would fail your database design test for presenting this dataset design. Or if you are a programmer, your boss would chide you for proposing this dataset design. Your lecturer or boss would be in the right to do so. It’s not a normalized (computer science speak for optimized) dataset. However, this is data analysis for a piece of data journalism, so you may scorn those rules! We need to have duplicate rows in order to aggregate the data later (remember?).
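
To make the “duplicate rows” idea concrete, here is a rough Python sketch of reshaping a wide table (one row per province) into a long one (one row per province, year, and quarter). The column names province, year, quarter, and total are my assumption about the structure described above, and the numbers are placeholders, not real figures.

```python
def wide_to_long(header, rows):
    """Turn one row per province into one row per province, year, and quarter."""
    long_rows = []
    for province, *values in rows:
        # Pair each value with the (year, quarter) label of its column.
        for (year, quarter), total in zip(header, values):
            long_rows.append(
                {"province": province, "year": year, "quarter": quarter, "total": total}
            )
    return long_rows


# Column labels as (year, quarter) pairs; values are placeholders.
header = [(2013, 1), (2013, 2)]
wide = [
    ["Province A", 1000, 1010],
    ["Province B", 500, 495],
]

long_rows = wide_to_long(header, wide)
for row in long_rows:
    print(row)
```

Notice that each province name now repeats once per quarter. That repetition is exactly what a normalized design would avoid, and exactly what makes grouping easy later.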

Step 3: Produce the final dataset

In the screenshot above, I entered the relevant years into the structure. Next, paste the totals for 2013, 2014 and 2015. The dataset now looks like this. (Medium doesn’t allow iframe embeds, so instead I have provided a link to the dataset.) It should have 91 rows, and only quarter 1 and quarter 2 are indicated for 2015.
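
A quick sanity check on that row count can be sketched in a few lines of Python. It assumes one row per province per quarter, four quarters each for 2013 and 2014, and that the figure of 91 includes a single header row; those are my assumptions about the sheet, so treat the result as one plausible way the number arises.

```python
NUM_PROVINCES = 9  # South Africa has nine provinces
QUARTERS_BY_YEAR = {2013: 4, 2014: 4, 2015: 2}  # assumed quarters per year


def expected_rows(num_provinces, quarters_by_year, header_rows=1):
    """Rows expected in the long dataset: one per province per quarter, plus headers."""
    data_rows = num_provinces * sum(quarters_by_year.values())
    return data_rows + header_rows


print(expected_rows(NUM_PROVINCES, QUARTERS_BY_YEAR))  # 9 * 10 + 1 = 91
```

If your own sheet has a different count, a missing or duplicated quarter is the first thing to look for.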

We’re almost there!

The last step is to aggregate the data. So, take a deep breath and create a PivotTable in a new sheet. Our summarized data looks like this:

Clean up the table. Enter the thousands separator, remove the decimal places, and delete the cell labelled Sum of Total Row Labels. The table now looks like this:
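
Outside Excel, the same aggregation can be approximated with a short Python sketch that sums the total for each province and year, mirroring a PivotTable set to “Sum of Total.” The field names and the numbers are placeholders I made up for illustration.

```python
from collections import defaultdict


def pivot_sum(long_rows):
    """Sum `total` for each (province, year) pair, like a PivotTable's Sum of Total."""
    table = defaultdict(lambda: defaultdict(int))
    for row in long_rows:
        table[row["province"]][row["year"]] += row["total"]
    # Convert nested defaultdicts to plain dicts for easier printing and comparison.
    return {province: dict(years) for province, years in table.items()}


# Placeholder long-format rows (one per province, year, and quarter).
sample_rows = [
    {"province": "Province A", "year": 2013, "quarter": 1, "total": 1000},
    {"province": "Province A", "year": 2013, "quarter": 2, "total": 1010},
    {"province": "Province A", "year": 2014, "quarter": 1, "total": 1020},
    {"province": "Province B", "year": 2013, "quarter": 1, "total": 500},
    {"province": "Province B", "year": 2013, "quarter": 2, "total": 495},
]

summary = pivot_sum(sample_rows)
for province, years in summary.items():
    print(province, years)
```

One thing to keep in mind: since only two quarters are available for 2015, a yearly sum for 2015 is not directly comparable with 2013 and 2014. Swapping the sum for an average is one option you may want to consider.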

Step 4: Produce the visualization

Congratulations! You have a dataset that is ready to be visualized.

We’re going to use Infogr.am to produce an infographic. This guide doesn’t cover how to sign up with and use Infogr.am, so (as with Excel) you’ll need to get acquainted with the tool on your own. I assure you that it’s straightforward and intuitive! You’ll be using it like a professional in no time! You’ll see.

To create a new infographic, choose any template you like. The blank work area looks like this:

Give the infographic a title like “Total workforce in provinces, 2013–2015” or something similar, as you see fit. Then add a grouped bar chart from the popup wizard. You should see the new chart in the work area. (Delete the existing chart that comes with the template, which now appears below the one you just created.)

Double-click on the new chart, and an interface similar to Excel will appear on the screen. Delete the data from this screen, and copy the data from the Excel worksheet used in the PivotTable, and paste it into the Infogr.am spreadsheet interface. It should look like this:

When you paste the data, the graphic automatically updates itself.

It’s starting to look great!

Have a look at the infographic. Everything is in there, but you may not understand the infographic immediately. You have to scroll down to read the legend to see which color refers to which province. So, instead of re-formatting the data, click on the two-directional arrows icon on the top right-hand corner of the spreadsheet. This nifty feature will switch the rows and columns so that the provinces now appear in the rows and the years in the columns.
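
What that switch does is a plain transpose. If you ever need to do it yourself before pasting the data, a minimal Python sketch looks like this (the table contents are placeholders, not real figures):

```python
def transpose(table):
    """Swap rows and columns of a rectangular table given as a list of rows."""
    return [list(column) for column in zip(*table)]


# Years in the rows, provinces in the columns (placeholder values).
by_year = [
    ["", "Province A", "Province B"],
    [2013, 2010, 995],
    [2014, 1020, 500],
]

# Provinces in the rows, years in the columns.
by_province = transpose(by_year)
for row in by_province:
    print(row)
```

Transposing twice returns the original table, so nothing is lost by trying both layouts.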

Always aim to show the values on the chart, where appropriate. So click the Show values switch and the totals will reflect on the chart. Also, click on the Settings button and scroll down to add “total (in millions)” in the X-axis textbox. This helps the reader (and you) to better understand the chart.

Click the Publish button to give your graphic a title. Then decide whether you want the graphic to be interactive or a static image. This is what the final image looks like:

And you have produced your first visualization. Pat yourself on the back, have a coffee or beer, and get ready because you’ve just started the process.

Before we look at the rest of the work, let’s review what we’ve done:

We looked at a data source and extracted the data we want. In this case, we asked, “What was the size of the workforce in South Africa between 2013 and 2015?”
We followed a basic process of cleaning, formatting, transforming, and summarizing the data until we produced a table that shows the data we need.
We then inserted the data into our visualization tool and produced an infographic, as shown above.

At this point, you’re so excited that you jump on Twitter or your email, and send your work to everyone you know.

Hold on! Not yet.

What do your findings really mean?

Yes, you analyzed the data and you answered your question. Gauteng province had the largest workforce within the period we selected, but it’s been decreasing since 2013. The workforce for the Northern Cape has been consistently below 5 million since 2013. But why?

That’s why the first paragraph of this piece carried a naughty qualification, “the beginnings of a piece of data journalism”: this is where the journalism you know, or were trained to do, begins. At this point, do the following:

contact analysts, experts, or academics to interpret and comment on the data.
look at other datasets or ask experts to explain the context of the findings, depending on the scope of the story or your editor’s instructions.
analyze/visualize other datasets to test and refine your findings.
do anything else required to make sure that the piece is balanced and fair.

After you complete one or more of these steps, write the final article. Include the infographic produced above, and submit it for publication. If you run your own blog or website, publish it live.

There’s no place like the end!

And the end it is. I hope that you’ve come this far and that your appetite has been whetted to do more (and more sophisticated) work in data journalism.

If something hasn’t worked for you, or you’d like some help with a section, follow me on Twitter and we can figure it out together.

Translated from: https://www.freecodecamp.org/news/data-journalism-isnt-for-the-select-let-s-work-out-a-story-together-from-scratch-dd85b3017f4a/