by Elena Nisioti

A personal journey through the languages of data science

One does not simply walk into TensorFlow.

A PhD is a good opportunity for introspection. In fact, it is important to create opportunities for introspection no matter how busy or insignificant the present feels.

We should not regard our past as an immature period, but as an unfolding story. A story of discoveries, mistakes, skills, and projects that are now part of our professional consciousness.

No matter how good a tutorial, detailed an article, or well-designed a library, it all comes down to personal assimilation when you learn a new tool. And what is this personal quality that shapes our receptive filters, so that one person’s favorite language is another person’s nightmare? (My personal nightmare is actually doing my everyday programming in C.)

The past. This is what this article is going to be about. An account of my attempts to use MATLAB, Weka, R, C++, and Python in my data science career.

Data science is a wide field, employing people from a huge variety of backgrounds, like economics, biology, and linguistics. Although data science emerged from a purely statistical background, it soon hijacked the field of computer science and is today a tool as versatile and essential as a calculator.

As a result, the palette of programming languages for data science kind of feels like the universe: a lifetime is not enough to explore it, and it is constantly expanding.

We know that there are trade-offs involved with the generality, power, and complexity of a language. Therefore, the popularity of a language should serve only as an indication of current trends, not a factor for determining your own choice. Ultimately, it’s a matter of application, experience, and taste.

MATLAB

I was introduced to the world of machine learning through an online course taught by Andrew Ng. To this day, I recommend it to people looking for a smooth introduction to the admittedly scarily vast world of machine learning.

Although Python and R were much more popular at that time, Andrew chose MATLAB for the course’s assignments. Little did this bother me at that time, but it sure feels odd these days. Data science courses focus more on how to use a language (or a library) to do data analysis than how to do data analysis using a language.

In retrospect, I see that Andrew opted for a general-purpose language. One that his audience, consisting mostly of undergraduate computer scientists and engineers, was probably already familiar with. As the focus of the course was on implementing learning algorithms without the use of libraries, MATLAB was as good as any specialised language would be.

Although I am a fan of automated tools and handy libraries, I can’t emphasize enough the importance of a do-it-yourself attitude towards data science algorithms at the beginning of your path.
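
To make the do-it-yourself idea concrete, here is a minimal sketch of the kind of exercise such a course might set: fitting a straight line by batch gradient descent without any library. It is written in Python rather than MATLAB, and the function name, learning rate, and toy data are illustrative assumptions, not material from the course.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residuals of the current model on every training point.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

Writing the update rule by hand once makes it much easier to reason later about what a library's training call is doing on your behalf.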

I learned very early the difference between knowing the name of something and knowing something. — Richard Feynman

MATLAB does not lack the libraries to perform a wide selection of data analysis and machine learning tasks. I’m sure it is the preferred framework for people that swear by it, like signal processing and control engineers.

But it is not hard to trace why it did not conquer the field of data analysis, or me. It is a very expensive tool. Its free alternative, Octave, is far from being its equal. It could also be that I was never into languages that don’t start counting from zero.

Weka

My experience with Weka was short-lived. We were introduced to it as an optional tool for an assignment for the Pattern Recognition course at my university.

Without any intention to underestimate the skills I acquired through this course, the most valuable lesson was this: the effect a GUI has on the data scientist is profound. Weka boasts about its ease of use and comprehensibility, providing the ability to train a machine learning model by loading a dataset and simply pressing a button. It is not hard to see the benefits of this approach. There is a global market desperate for prediction models and not enough experts to satisfy those needs.

Finding automated tools and using them to derive off-the-shelf solutions is a current research area, known as AutoML, but it took us some years, and some failures, to realise that we need a human in the loop.

The illusion that we can produce good models for real problems without first having a good understanding of the data collapsed in the late 90s, with failures such as MarketSwitch and KXEN. Automated tools can ease our work, discovering good parameterizations of the algorithms, useful pre-processing steps, and efficient testing pipelines. But they can’t substitute for the human expert, at least with our current level of expertise.

All in all, you should take responsibility for the models that you create.

“Man,” I cried, “how ignorant art thou in thy pride of wisdom!” — Mary Wollstonecraft Shelley, Frankenstein

R

I delved into the mysteries and wonders of R during my diploma thesis. You’ve probably heard that R is a special child in the family of data analysis languages. But “a steep learning curve” is an understatement for the feelings of self-doubt and utter disorientation I experienced at the beginning of the project.

Our goal was to create a software tool for the automated execution of machine learning experiments. R was more of an end than a means, as we wanted to conduct extensive research on machine learning techniques by using the rich repository of R libraries.

Having to set up a whole framework, I wanted to make use of the wonders of object-oriented programming in my design. So, the first question I had to address was: does R support object-orientation? It does! In four different ways, actually. None of which directly matches the object-oriented programming I’d experienced in C++, Java, or Python.

The different ways came up gradually while the needs of the R community were still being discovered and methods to easily define and group functionalities were necessary. With no clear plan for the desired class qualities, it is not surprising that you now have the freedom (or should I say burden) to choose between S3, S4, reference, and R6 classes. There are quite a few resources nowadays on this subject but, it suffices to say, if your project needs object-orientation, then R is probably not the language to go for.

After I settled on reference classes, I began putting flesh on my software skeleton. I soon realised that R was apparently developed with what I call the principle of most astonishment. Specialising in data analysis, R has lots of handy tools to offer, such as the fancy data structures called data.frames, which elegantly capture the characteristics and needs of a dataset.

However, I remember some subtle technicalities in R that gave me nightmares back at the time. Five different assignment operators. All variables are weakly typed, unless they aren’t. RStudio, a free UI for R, throws a runtime error if a plot does not fit in its plane. Did someone mention namespaces?

People decide to name their SVM package “e1071” instead of something more intuitive, and that’s how you should load it. You want to perform the same operation, for example training, and different packages use different names for it. It’s a drag to have to read the manual of different packages to perform the same action. It also leads to lots of bugs, if you ignore the manuals because you assume consistency.

Up to this point, I’ve probably given the impression that I dislike R. But this is not the case. Although I would never attempt to build a framework from scratch in R again, the abundance of packages provided by the open source, heterogeneous R community can help you make visualisations and state-of-the-art pre-processing. This is cool for standalone experiments.

When it comes to machine learning, there is a remedy for the lack of compatibility between different packages. It is called caret, and it is an attempt to provide common interfaces for pre-processing, training, and making predictions across many useful packages and models, such as nnet for neural networks and svmRadial for support vector machines. Our AutoML tool would have been (much more of) a mess, had we not exploited the usefulness of caret.
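
The idea of one common interface over packages that name things differently can be sketched in a few lines of Python. This is not how caret is implemented; the two toy "packages" and the registry below are hypothetical, invented only to show the adapter pattern.

```python
class MeanModel:
    """Toy 'package' A: trains with fit(), predicts with predict()."""

    def fit(self, xs, ys):
        self.mean = sum(ys) / len(ys)

    def predict(self, xs):
        return [self.mean for _ in xs]


class NearestNeighbour:
    """Toy 'package' B: same job, different verbs: learn() and classify()."""

    def learn(self, xs, ys):
        self.points = list(zip(xs, ys))

    def classify(self, xs):
        # Return the label of the closest training point for each query.
        return [min(self.points, key=lambda p: abs(p[0] - x))[1] for x in xs]


# One entry per method: (constructor, training method name, prediction method name).
METHODS = {
    "mean": (MeanModel, "fit", "predict"),
    "nn": (NearestNeighbour, "learn", "classify"),
}


class Trained:
    """Wraps any fitted model behind a single predict() call."""

    def __init__(self, model, predict_name):
        self._model = model
        self._predict_name = predict_name

    def predict(self, xs):
        return getattr(self._model, self._predict_name)(xs)


def train(method, xs, ys):
    """Train the named method through one uniform entry point."""
    factory, train_name, predict_name = METHODS[method]
    model = factory()
    getattr(model, train_name)(xs, ys)
    return Trained(model, predict_name)


xs, ys = [1, 2, 3], [10, 20, 30]
for name in METHODS:
    print(name, train(name, xs, ys).predict([2.9]))
```

The caller only ever sees `train` and `predict`; the per-package naming quirks live in one table, which is where a mismatch is easiest to spot and fix.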

C++

Now, why would you do data analysis in C++? Why would anyone do it?

Since a summer internship is my only experience in a non-academic workplace, I am no guru on the psychology of a big company choosing tools for its employees. I suspect it was a combination of tradition and the need for commercial code that is efficient in execution time.

Nevertheless, I decided to perform my experiments in R and, as the end of the internship approached, transfer my models and functions to C++. What could possibly go wrong?

I soon found out that it is not hard to impress people that do data analysis in C++ with fancy diagrams and impressive pre-processing techniques using R packages. Some of my coworkers even got interested in R and started experimenting with it, which kind of made me proud as I am generally bad at persuading people.

After I acquired satisfactory results, using simple R packages for PCA and support vector machines, I ventured into incorporating my models into the existing (and impressively bulky) C++ framework. The libsvm package seemed to be appropriate for my case, offering operations related to support vector machines.

Now, there are quite a few options when you want to transfer machine learning models across languages, acting on different levels of the problem. Moving from simple to sophisticated, one can transfer the mathematical model (that is, the parameterization of the algorithm), translate the model file across libraries, or use a package to interface across languages.

I found out the hard way that simply using the same parameterization is not enough. Although the family of algorithms remains the same (in my case, SVMs with a Gaussian kernel), different implementations may adopt different mathematical models, thus requiring different sets of parameters. Even if the models remain the same, implementation-specific factors can affect the performance of the model so drastically that a different parameterization is required.
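
One common way this happens with Gaussian kernels is that the same kernel is written with different parameters. libsvm defines its radial basis kernel as exp(-gamma * |u - v|^2), while many textbooks use a bandwidth sigma in exp(-|u - v|^2 / (2 * sigma^2)). The sketch below shows what goes wrong when a tuned number is copied across without conversion. Which parameters actually differed in the internship project is not stated in the text, so treat this as an illustrative case with made-up values.

```python
import math


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def rbf_gamma(x, y, gamma):
    """Gaussian kernel in the gamma form: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * squared_distance(x, y))


def rbf_sigma(x, y, sigma):
    """Gaussian kernel in the bandwidth form: exp(-||x - y||^2 / (2 * sigma^2))."""
    return math.exp(-squared_distance(x, y) / (2.0 * sigma ** 2))


def sigma_to_gamma(sigma):
    """Convert a bandwidth sigma to the equivalent gamma."""
    return 1.0 / (2.0 * sigma ** 2)


# Illustrative points and a 'tuned' bandwidth (assumed values).
x, y = [1.0, 2.0], [2.0, 4.0]
sigma = 0.5

reference = rbf_sigma(x, y, sigma)                         # model as tuned
naive = rbf_gamma(x, y, gamma=sigma)                       # number copied as-is
converted = rbf_gamma(x, y, gamma=sigma_to_gamma(sigma))   # number converted

print(reference, naive, converted)
```

The copied value gives a very different similarity than the tuned model, while the converted value reproduces it. Even after such a conversion, differences in scaling, solver tolerances, or defaults can still make retraining necessary.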

The most appropriate way seems to be Rcpp, a package that gracefully interfaces between existing C++ frameworks and R scripts. Some packages also offer compatibility between libraries of the two languages, but this is rarely the case. Sometimes retraining is the easiest and most trustworthy solution.

After this experience with the data scientific aspect of C++, I reconsidered my severe judgment on R’s slack attitude.

Python

One of the first discussions with my current supervisor was:

-So, what language will you use for your future experiments?

-I think I’ll go for Python.

-So, you are experienced with Python?

-No, I’ve just been through a lot and I have a very good hunch about it.

Happy that my horrendously insufficient arguments persuaded him, I now enjoy the benefits of doing data analysis in Python. The ease of setting up experiments, adding functionality, and benefiting from rich libraries has really moved my work forward. Although I largely write my own code, I have so far used OpenAI Gym to define my own environment for reinforcement learning experiments, and TensorForce, a library that extends TensorFlow with a great selection of reinforcement learning algorithms.
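
For readers who have not defined an environment before, the sketch below shows the general shape of one. It mimics the classic Gym convention of `reset()` returning an observation and `step(action)` returning observation, reward, done flag, and an info dictionary, but it deliberately does not import Gym, so it runs anywhere. A real environment would typically subclass the library's environment class and declare its action and observation spaces, and newer library versions use a slightly different return signature. The corridor task, rewards, and names are assumptions for illustration, not the environment used in the research described here.

```python
class CorridorEnv:
    """A 1-D corridor: start in cell 0, the episode ends at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (0 = left, 1 = right) and return the transition."""
        move = 1 if action == 1 else -1
        # Walls at both ends clamp the position.
        self.position = max(0, min(self.length - 1, self.position + move))
        done = self.position == self.length - 1
        # Small penalty per step, bonus on reaching the goal.
        reward = 1.0 if done else -0.1
        return self.position, reward, done, {}


def always_right(observation):
    """A trivial policy that ignores the observation and moves right."""
    return 1


def run_episode(env, policy, max_steps=100):
    """Roll out one episode and return the total reward collected."""
    observation = env.reset()
    total = 0.0
    for _ in range(max_steps):
        observation, reward, done, _ = env.step(policy(observation))
        total += reward
        if done:
            break
    return total


total = run_episode(CorridorEnv(length=5), always_right)
print(total)
```

Keeping the environment behind this small interface is what lets an off-the-shelf agent be swapped in for the hand-written policy without touching the task itself.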

Nevertheless, I won’t argue in favour of an unquestionable superiority of Python, as that would defeat my purpose. Programmers tend to solidify their beliefs into strong statements about languages. Probably forgetting that there can’t be one language to rule them all. If there were, it would have to be so general that it couldn’t be that effective.

So, next time you are in front of a new dataset, don’t be afraid to add another software arrow into your data science quiver. If all else fails, you will at least have something to complain about.

Life can only be understood backwards; but it must be lived forwards — Søren Kierkegaard

Translated from: https://www.freecodecamp.org/news/a-personal-journey-through-the-languages-of-data-science-48f516cbb81c/
