
In this article, I attempt to demonstrate that with a little coding experience, a bit of data background, and a pinch of study and effort, even an inexperienced individual can discover insightful information from data.


Just a little background which may help provide a preface:


I’m not an analyst.


I’m not actually technically a “data” anything…


My official position is a Planner. It just so happens that what I “plan” is data and data-related activities (collection, compilation, QA/QC, management, and reporting).

Sometimes, it feels like I’m swimming (or drowning) in data and yet I don’t really get the opportunity to fully unravel the data to create a meaningful data product, to truly tell a story with the data I work with.


With that sentiment, I unknowingly embarked on a journey to get more information from the data I worked with: information that could tell a story. The journey continues, but I’d like to share what I’ve experienced thus far.



How does someone like me, with no experience or background as an analyst, analyze data in a way that can tell a story?

Photo by Alejandro Gonzalez on Unsplash, Lettering by Author

I imagine there is no single way to analyze data to extract meaningful information. This notebook is a little snapshot of my organic and yet cosmopolitan methodology, which is as follows:

  1. EDA (Exploratory Data Analysis)


  2. Data Wrangling/Cleaning


  3. Visualization/Reporting


There are only 3 required packages and 2 optional packages (and one unlisted optional one, whoops)


Required Packages:

  1. Pandas


  2. Numpy


  3. Plotly Express


Optional Packages:

  1. Pandas Profiling


  2. Datapane


This is purposeful because, unless you are a full-time data professional, you probably don’t have excess time to learn the myriad coding techniques and packages that get touted on your favorite Medium channels and StackOverflow threads. So simplicity and directness are key.

I’ll try to keep this brief.


Alright then, shall we?


Part 1: EDA (Exploratory Data Analysis)

Photo by fabio on Unsplash

I’m not sure how to even really begin talking about EDA, which, coincidentally enough, is probably similar to how most non-analysts (and maybe even some analysts) feel about doing EDA.


It can be hard to know where to begin -


but it’s not impossible…


In fact, sometimes you just gotta start walking to get there.


import pandas as pd
import numpy as np
df = pd.read_csv('your_data')

And you’re off, as it were.


Now, granted it may feel overwhelming, but for me, this is the adventure of it all. It’s time to sleuth, explore, dissect, and imagine. I love patterns and puzzle-solving so I find this part riveting.


The example above shows actual data taken from my notebook. You don’t have to be extremely familiar with this data, transportation data, or really much of anything to understand, to some degree, what is happening.

We have some sort of identifier or ID field (tmc_code), a near-datetime field (measurement_tstamp) which tells us when the recording took place, several speed fields which are all quite similar, a unique column dealing with travel duration (travel_time_seconds), and a bit of an oddball field at the end which presumably is a coded value for the amount of data (data_density).


Don’t second-guess yourself, you probably already figured out how this data works.

The tmc_code is a sort of station (in fact, raw probe) that records a number of vehicles (data_density) during different timeframes (measurement_tstamp) at certain speeds (speed, average_speed, reference_speed). The amount of time it takes to travel the segment that the tmc_code is recording is travel_time_seconds.

So data like this will provoke questions that inspire stories, questions like:


how long does it take to travel on a certain road and how often is it like that?


But we’re getting ahead of ourselves here; we can still get to know the data a lot more.


# statistical description
df.describe()
# dataframe information
df.info()
# quick look at null values in the dataset
df.isnull().sum()

Those are fairly easy lines of code to memorize, and you can already tell a lot about this data and specifically this dataset.


In fact, you probably noticed that:


  • It’s a fairly big dataset (nearly 14 million rows)
  • It has some missing values (but nothing crazy)
  • Something odd is happening with the travel_time_seconds field (hint: look at the max value compared to the 75%)
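
The hint in the last bullet can be turned into a quick check. The following is only a minimal sketch on a small made-up sample: the values, the `flag_suspicious_max` helper, and the factor of 10 are all assumptions chosen for illustration, not something taken from the notebook.

```python
import pandas as pd

# Hypothetical sample: most trips take a minute or two, one value is absurdly large
sample = pd.DataFrame({
    "tmc_code": ["133+04911"] * 7 + ["133-04911"],
    "travel_time_seconds": [60.0, 75.5, 80.2, 90.0, 95.3, 110.0, 120.0, 274385.88],
})


def flag_suspicious_max(series, factor=10):
    """Return True when the max is more than `factor` times the 75th percentile."""
    q75 = series.quantile(0.75)
    return bool(series.max() > factor * q75)


# describe() reports the same "75%" and "max" rows referred to in the hint
summary = sample["travel_time_seconds"].describe()
print(summary[["75%", "max"]])
print("suspicious max:", flag_suspicious_max(sample["travel_time_seconds"]))
```

A large gap between the upper quartile and the maximum does not prove the data is wrong; it only points at where to look next.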


If you didn’t notice that right away, don’t fret. I’ve found that it’s one of those sort of things that once you’ve noticed it, you start to notice it everywhere.


This is probably also a good time to mention a handy tool called pandas-profiling:


import pandas_profiling
# call pandas_profiling to build a profile report of the dataframe
pandas_profiling.ProfileReport(df)

If you were second-guessing yourself (don’t worry, I live there), Pandas-Profiling is an awesome tool to verify or clarify the nature or characteristics of your dataset.


In fact, remember how we noticed the travel_time_seconds field and its strangely large maximum value?


So did Pandas-Profiling:


SKEWED warning for column

This is my data and I’ll be the first to admit that I don’t fully understand it — but I’m gonna bet that not many roads take 274385.88 seconds (or ~76.22 hours).


Actually, in Alaska, the longest road is the Dalton Highway, which is 414 miles and takes about 12 hours to drive (give or take).


The longest tmc_code (which happens to be on the Dalton) is ~282 miles (lengths in miles are present in other data).


So something is amiss.
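
The arithmetic behind that conclusion can be sketched in a few lines. The two input numbers come from the text above; the helper names `seconds_to_hours` and `implied_speed_mph` are hypothetical and exist only for this illustration.

```python
def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600.0


def implied_speed_mph(miles, travel_time_seconds):
    """Average speed a vehicle would need to cover `miles` in the given time."""
    return miles / seconds_to_hours(travel_time_seconds)


max_travel_time = 274385.88    # maximum travel_time_seconds reported above
longest_segment_miles = 282    # approximate length of the longest tmc_code

hours = seconds_to_hours(max_travel_time)
speed = implied_speed_mph(longest_segment_miles, max_travel_time)
print(f"{hours:.2f} hours, implied average speed {speed:.1f} mph")
```

Even on the longest segment, that travel time would imply an average speed under 4 mph, which is why the maximum looks like bad data rather than a slow drive.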


This is the part where the EDA will have its value revealed. This is the part where decisions are made.


Part 2: Data Wrangling/Cleaning

Photo by Scott Graham on Unsplash

Data wrangling/cleaning can be pretty intense and some pretty smart guys out there do stuff that I don’t fully understand. But I have learned enough to make things come together in a dataset.


Like how this dataset is unnecessarily big, or specifically, inefficient.


# values over 50,000 seconds are about 14 hours, which is longer than any driving time in Alaska even with reasonable delays
# However, in other situations, it would make sense to leave those values in, so use with caution
ttr_df = df.loc[df['travel_time_seconds'] < 50000]
# drop nulls as they are likely not consequential and will help to reduce dimensionality/size
print('dropping nulls')
ttr_df = ttr_df.dropna()
print('dropped shape: ' + str(ttr_df.shape))
print('original shape: ' + str(df.shape))
# output:
# dropped shape: (13784065, 7)
# original shape: (13936498, 7)
# replace string values for integers (assign the result back instead of using inplace on a column selection)
ttr_df['data_density'] = ttr_df['data_density'].replace({"A": '1', "B": '2', "C": '3'})
# create dictionary for converting values
convert_dict = {'speed': int, 'average_speed': int, 'reference_speed': int, 'data_density': int}
print('converting types')
ttr_df_2 = ttr_df.astype(convert_dict)
print('conversion complete')
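
To see why type conversion makes a dataset "more efficient", it can help to measure memory before and after. This is a rough sketch on synthetic data, not the notebook's code: the column values are randomly generated, and the narrower `int8`/`int16` types are an assumption that goes a step further than the plain `int` used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic stand-in for two of the columns: float speeds and letter-coded density
demo = pd.DataFrame({
    "speed": rng.integers(10, 65, size=n).astype("float64"),
    "data_density": rng.choice(["A", "B", "C"], size=n),
})

before = demo.memory_usage(deep=True).sum()

slim = demo.copy()
# Map letter codes to small integers, then store them in a single byte each
slim["data_density"] = slim["data_density"].map({"A": 1, "B": 2, "C": 3}).astype("int8")
# Whole-number speeds in this range fit comfortably in 16 bits
slim["speed"] = slim["speed"].astype("int16")

after = slim.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

Narrow types only make sense when the values are known to fit; check the column's min and max first.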

That’s better.


But it’d be even better if we had fields to easily filter things like time and location.


# transform values from measurement_tstamp to datetime
print('datetime transform')
ttr_df_2['measurement_tstamp'] = pd.to_datetime(ttr_df_2['measurement_tstamp'])
ttr_dfc = ttr_df_2
print('indexing datetime')
ttr_dfc.index = ttr_dfc['measurement_tstamp']
# some of the visualizations are greatly enhanced by the datetime values parsed out
print('adding month/day columns for slider')
ttr_dfc['month'] = ttr_dfc['measurement_tstamp'].dt.month
ttr_dfc['day_of_year'] = ttr_dfc['measurement_tstamp'].dt.dayofyear
ttr_dfc['hour'] = ttr_dfc['measurement_tstamp'].dt.hour
convert_dict2 = {'month': int, 'day_of_year': int, 'hour': int}
# convert the field types
print('converting types')
ttr_dfc = ttr_dfc.astype(convert_dict2)
ttr_dfc.info()
# data itself does not have geographic coordinates, but the tmc's are physical locations
print('Getting Spatial data')
# add in additional sheet for getting coordinates to tmc's
tmc_df = pd.read_csv(r'TMC_Identification.csv')
# rename column for easier merging
print('renaming columns for merge')
tmc_df = tmc_df.rename(columns={'tmc': 'tmc_code'})
# there are duplicates for start/end active dates, unnecessary for the scope of appending spatial values
print(tmc_df.tmc_code.count())
# merge for future use involving mapping or geospatial components
print('Merge started')
ttf_df_xy = ttr_dfc.merge(tmc_df[['tmc_code', 'county', 'zip', 'miles', 'start_latitude', 'start_longitude']], on='tmc_code', how='left')
print('Merge complete')
convert_dict3 = {'start_latitude': 'float32', 'start_longitude': 'float32', 'zip': 'int32', 'miles': 'float32'}
print('converting types')
ttf_df_xy = ttf_df_xy.astype(convert_dict3)
ttf_df_xy.info()
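
One of the comments in the code above notes that the lookup sheet has duplicate rows per tmc_code. In a left merge, each duplicate would multiply the matching rows, so it is worth guarding against. The sketch below uses made-up frames; the `active_start_date` column name and the choice to keep the last duplicate are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical readings: several rows per tmc_code
readings = pd.DataFrame({
    "tmc_code": ["133+04911", "133+04911", "133-04910"],
    "travel_time_seconds": [95.0, 102.5, 60.0],
})

# Hypothetical lookup sheet with a duplicate tmc_code (two active-date ranges)
lookup = pd.DataFrame({
    "tmc_code": ["133+04911", "133+04911", "133-04910"],
    "miles": [0.5, 0.5, 0.3],
    "active_start_date": ["2019-01-01", "2020-01-01", "2019-01-01"],
})

# Keep one row per tmc_code so the merge cannot multiply the readings
lookup_unique = lookup.drop_duplicates(subset="tmc_code", keep="last")

merged = readings.merge(
    lookup_unique[["tmc_code", "miles"]],
    on="tmc_code",
    how="left",
    validate="many_to_one",  # raises MergeError if the lookup still has duplicate keys
)
print(len(readings), len(merged))
```

Comparing the row count before and after a merge is a cheap habit that catches this kind of silent duplication.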



It’s definitely a bigger dataset now, but it’s all useful and more efficient.


The data wrangling here is not very sophisticated, and there is a lot more that we could do or do differently, but for brevity’s sake, let’s get to the fun stuff.


Part 3: Visualization/Reporting

Visualization is one of my favorite parts of doing anything, as I tend to really get into aesthetics and UX concepts.


As for visualizing large datasets (like millions of row records across thousands of features), I’m still looking for an insightful and efficient way to display such a dataset. Otherwise, you can:


Reduce


ArcGIS Pro Map (by author)
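
The article does not spell out how the reduction was done. One common reading of "reduce" is to aggregate the raw rows before plotting, so the following is only a sketch under that assumption, using made-up values and a median per tmc_code and hour.

```python
import pandas as pd

# Hypothetical raw readings: two observations per tmc_code per hour
raw = pd.DataFrame({
    "tmc_code": ["133+04911"] * 4 + ["133-04910"] * 4,
    "hour": [7, 7, 8, 8, 7, 7, 8, 8],
    "travel_time_seconds": [90.0, 110.0, 95.0, 105.0, 60.0, 62.0, 58.0, 64.0],
})

# Collapse to one row per (tmc_code, hour) using the median travel time
reduced = (
    raw.groupby(["tmc_code", "hour"], as_index=False)["travel_time_seconds"]
       .median()
)
print(reduced)
```

The median is less sensitive to extreme values than the mean, which matters for a field with outliers like travel_time_seconds.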

Constrain

ttr_df_juneau_std = ttr_df_juneau.groupby('tmc_code')['travel_time_seconds'].std().sort_values(ascending=False)

Ultimately, for simplicity’s sake, we’re going to constrain our scope.


This is actually fitting because, at least in my business unit, people generally want to know about certain subsections, whether by area, time, or facet.

We’ll focus in on the tmc_codes with the widest or most irregular spread of values in Juneau, Alaska (my hometown).
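
As a rough sketch of how that constraint might look in pandas: filter to one county, rank the tmc_codes by the standard deviation of their travel times, and keep the most irregular ones. The sample values, the `"JUNEAU"` county label, and the variable names here are assumptions for illustration; the real county coding may differ.

```python
import pandas as pd

# Hypothetical merged data covering two counties
ttf_demo = pd.DataFrame({
    "tmc_code": ["133+04911"] * 3 + ["133-04910"] * 3 + ["999+00001"] * 3,
    "county": ["JUNEAU"] * 6 + ["ANCHORAGE"] * 3,
    "travel_time_seconds": [90.0, 400.0, 95.0, 60.0, 62.0, 61.0, 30.0, 31.0, 29.0],
})

# Constrain by area: keep only the rows for one county
juneau = ttf_demo.loc[ttf_demo["county"] == "JUNEAU"]

# Rank tmc_codes by how widely their travel times are spread
spread = (
    juneau.groupby("tmc_code")["travel_time_seconds"]
          .std()
          .sort_values(ascending=False)
)

# Keep only the rows belonging to the most irregular tmc_code(s)
top_codes = spread.head(1).index.tolist()
juneau_delay = juneau.loc[juneau["tmc_code"].isin(top_codes)]
print(spread)
```

Raising the number passed to `head()` widens the selection to more tmc_codes.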


So… How can we visualize these potentially unreliable tmc_codes?

Plotly Express makes visualization very simple and direct, which is perfect for when you’re just assessing things and feeling out the edges:

  • with animations
import plotly.express as px
print('creating line graph')
fig = px.line(ttr_df_juneau, x="hour", y="travel_time_seconds", color='tmc_code', log_y=True, title='TTR', animation_frame='day_of_year', line_group="county", hover_name="tmc_code", hover_data=['tmc_code', 'miles', 'travel_time_seconds'], line_shape="spline", render_mode="svg")
  • interactive visualizations of different plots
ttr_df_juneau_delay = ttr_df_juneau.loc[ttr_df_juneau['tmc_code'].isin(['133-04910', '133+04911', '133-04911', '133+04912'])]
# log_y expects a boolean; True puts the count axis on a log scale
fig = px.histogram(ttr_df_juneau_delay, x="travel_time_seconds", log_y=True)
  • combining plots
# try double-clicking on 133+04911, again under the "tmc_code" and verify what hours may be the offenders
fig_margin = px.scatter(ttr_df_juneau_delay, x="hour", y="travel_time_seconds", color="tmc_code", size='speed', hover_data=['tmc_code', 'travel_time_seconds', 'hour', 'day_of_year', 'month', 'speed', 'data_density'], marginal_x='histogram', marginal_y='violin')
  • or parsing facets
# try double-clicking on 133+04911 under "tmc_code" and assess what hours may be the offenders
fig = px.scatter(ttr_df_juneau_delay, x="speed", y="travel_time_seconds", log_y=True, color="tmc_code", size='speed', hover_data=['tmc_code', 'travel_time_seconds', 'hour', 'day_of_year', 'month', 'speed', 'data_density'], facet_col='hour', facet_col_wrap=4, category_orders={'hour': list(range(24))})

And of course, being able to map it on the fly without using a GIS platform/framework is a big advantage as well:


# density mapbox plot
fig = px.density_mapbox(ttr_df_juneau_delay, lat='start_latitude', lon='start_longitude', z='travel_time_seconds', radius=10,center=dict(lat=58.373940, lon=-134.618457), zoom=12,mapbox_style="stamen-terrain",animation_frame='day_of_year')
# had to read documentation to sort out this part
fig.update_layout(mapbox_style="white-bg",mapbox_layers=[{"below": 'traces',"sourcetype": "raster","source": ["https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"]}])

Hopefully, you can see that fairly insightful and attractive visualizations can be accomplished with just a little study on the arguments. No need to become a webapp developer just to communicate something meaningful with stakeholders or your team.


Speaking of which, what is a good way to share this visualization with someone?


Reporting (or sharing) is a key part of visualization. Running a notebook may work for a live or recorded presentation, but it is not easy for someone else to run your notebook and view your data approach, methods, and results.

There are other ways to share and deploy solutions/visualizations, but I’ve really enjoyed the simplicity of Datapane.


import datapane as dp
r = dp.Report(dp.Table(ttr_df_juneau_delay), dp.Plot(fig))
r.publish(name='TTR', open=True)

With a very minimal amount of code (and a login if you are uploading to it), you can share your results with anyone, or only with those who have the private link.


This way, they can inspect your data and see if it is truly the sort of meaningful insight they are looking for, then download it themselves.


Conclusion:

Data analysis, as I understand it, is any process to collect, inspect, clean, and transform data into meaningful insight and communication.


Though I do aspire to gain more technical skills in this area, I don’t personally see why creating stories from data must wait until one has mastered certain techniques and technological/programmable applications.


Would I bank the direction of my section and our objectives off my simple analysis? Maybe not so much so soon.


But it is a powerful way to substantiate an observation and to tell a story, which I think anyone, including the non-analyst, can do.


In case you missed it, here’s the link to my notebook!


Thank you!


Translated from: https://medium.com/@ejmdpg/data-analysis-for-the-non-analyst-b959bfdb9d66



