Python Data Science Libraries

What is Data Science?

We live in an information age, where the challenge is to extract meaningful information from large volumes of data.
Data Science is the process of extracting knowledge and useful insights from data.
Data Science uses scientific methods, algorithms, and processes to extract these insights.
Fields such as Analytics, Data Mining, and Data Science are devoted to the study of data.

In this article, we give an overview of Data Science and go through the commonly used Python libraries that belong in a Data Scientist’s toolbox.

Why Python for Data Science?

Python is a versatile and flexible language that is widely preferred by Data Scientists. The reasons are as follows:

Python is simple, yet can handle complex mathematical processing and algorithms.
Its simple syntax shortens development time.
It has a large ecosystem of ready-to-use libraries that serve as Data Science tools.
It is cross-platform and has huge community support.
Code written in other languages such as C or Java can be used from Python with the help of binding packages.
It manages memory automatically. Combined with libraries whose core routines are written in compiled languages, this lets Python code perform competitively with other Data Science languages such as MATLAB and R.

Python Data Science Libraries

Python provides a huge number of libraries for scientific analysis, computing, and visualization. This is where the tremendous potential of Python is unleashed.

We will go through some of the most popular Python libraries in the field of Data Science. The libraries are categorized according to their functionality.

Core Libraries

Users import the core libraries to make use of their functionality. They are not part of the Python standard library; they are third-party packages that are installed separately, for example with pip or conda.

1. NumPy

NumPy is a core Python package for performing mathematical and logical operations on arrays. It supports linear algebra operations and random number generation. NumPy stands for “Numerical Python”.

NumPy has built-in functions to perform linear algebra operations.
It performs logical and mathematical operations on arrays.
NumPy supports multi-dimensional arrays to perform complex mathematical operations.
It provides shape manipulation and Fourier transforms.
It interoperates with code written in languages such as C and Fortran.
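
The NumPy features listed above can be sketched in a few lines. This is a minimal illustration, not a complete tour; the array values and variable names are made up for the example.

```python
import numpy as np

# A small 2-D array (matrix) and a 1-D array (vector); the values are arbitrary.
a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Element-wise mathematical and logical operations on arrays.
doubled = a * 2
mask = a > 1.5

# Linear algebra: solve the system a @ x = b.
x = np.linalg.solve(a, b)

# Shape manipulation: view the 2x2 matrix as a flat array of 4 elements.
flat = a.reshape(4)

# Fourier transform of a short signal.
signal = np.array([1.0, 0.0, -1.0, 0.0])
spectrum = np.fft.fft(signal)

# Random number generation with a seeded generator for reproducibility.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=5)

print(x, flat, spectrum.real, samples.shape)
```

Each operation works on whole arrays at once, which is why NumPy code rarely needs explicit Python loops.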

2. SciPy

SciPy is a Python library built upon NumPy, and it operates on NumPy arrays. SciPy is widely used for advanced operations such as regression, integration, and probability. It contains efficient modules for statistics, linear algebra, numerical routines, and optimization.

The SciPy library supports integration, gradient-based optimization, ordinary differential equation solvers, and much more.
An interactive session with SciPy is a data-processing and system-prototyping environment similar to MATLAB, Octave, Scilab or R-lab.
SciPy provides high-level commands and classes for Data Science. This significantly increases the power of an interactive Python session.
Because SciPy is based on Python, everything from classes to parallel programming tools is available alongside its mathematical algorithms. This makes it easier for programmers to develop sophisticated and specialized applications.
SciPy is an open-source project with good community support.
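
As a rough sketch of the operations mentioned above (integration, optimization, ODE solving, and regression), the following example uses toy functions chosen only because their answers are easy to check by hand.

```python
import numpy as np
from scipy import integrate, optimize, stats

# Numerical integration: the integral of x^2 from 0 to 3 is 9.
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 3.0)

# Optimization: minimize (p - 2)^2 + 1, whose minimum is at p = 2.
result = optimize.minimize(lambda p: (p[0] - 2.0) ** 2 + 1.0, x0=np.array([0.0]))

# Ordinary differential equation: dy/dt = -y with y(0) = 1, so y(1) = exp(-1).
solution = integrate.solve_ivp(lambda t, y: -y, (0.0, 1.0), [1.0])
y_at_one = solution.y[0, -1]

# Linear regression on points that lie exactly on y = 2x + 1.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs + 1.0
fit = stats.linregress(xs, ys)

print(area, result.x[0], y_at_one, fit.slope, fit.intercept)
```

All inputs and outputs are NumPy arrays or plain floats, which is what makes SciPy combine easily with the other core libraries.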

3. Pandas

Pandas stands for Python Data Analysis Library. It is a Python library used for high-performance Data Science and analysis.

Pandas provides built-in data structures such as Series and DataFrame (and, in older versions, Panel). These data structures enable high-speed analysis of data.
It provides tools to load data from various file formats into in-memory data objects.
It provides integrated handling of missing data.
Label-based slicing and indexing make it easy to reshape large data sets.
The tabular format of DataFrames allows database-like addition and deletion of columns.
It groups data for aggregation.
It offers functionality for different kinds of data, such as tabular data and ordered and unordered time series.
It merges and joins data sets with high performance.
The Panel data structure held three-dimensional data, but it was removed in pandas 0.25; a DataFrame with a MultiIndex is the usual replacement.
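
The following sketch walks through several of the Pandas features above: loading data, handling missing values, label-based slicing, adding and dropping columns, grouping, and merging. The CSV content, column names, and manager names are invented for the example.

```python
import io

import pandas as pd

# Load tabular data from CSV text; a file path would work the same way.
csv_text = """id,region,units,price
a,north,10,2.0
b,south,,3.0
c,north,7,2.0
d,south,5,3.0
"""
sales = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Integrated handling of missing data: count the gaps, then fill them.
missing_before = int(sales["units"].isna().sum())
sales["units"] = sales["units"].fillna(0)

# Database-like column addition.
sales["revenue"] = sales["units"] * sales["price"]

# Label-based slicing (both end labels are included).
subset = sales.loc["b":"c", ["region", "revenue"]]

# Group rows by region and aggregate.
totals = sales.groupby("region")["revenue"].sum()

# Merge with a second table, similar to a SQL left join.
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ana", "Raj"]})
merged = sales.merge(managers, on="region", how="left")

# Database-like column deletion (returns a new DataFrame).
sales_without_price = sales.drop(columns=["price"])

print(totals)
```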

Plotting Libraries

A key part of Data Science is presenting the outcome of complex operations on data in an understandable format.

Visualization plays an important role when we explore and try to understand data.

Python supports numerous libraries that can be used for data visualization and plotting. Let’s look at some of the commonly used libraries in this field.

1. Matplotlib

Matplotlib is a Python library for data visualization.
It creates 2D plots and graphs using Python scripts.
Matplotlib has features to control line styles, axes, etc.
It also supports a wide range of graphs and plots such as histograms, bar charts, error charts, contour plots, etc.
In addition, when used along with NumPy, Matplotlib provides an effective alternative environment to MATLAB.
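
A minimal Matplotlib sketch of the points above: a line plot with a custom line style and axis labels next to a histogram. The non-interactive `Agg` backend and the in-memory PNG buffer are choices made here so that the example runs without a display; in a notebook or desktop session `plt.show()` would be typical instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

# One figure with two side-by-side axes.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot with explicit control over line style, color, and axis labels.
ax_line.plot(x, np.sin(x), linestyle="--", color="tab:blue", label="sin(x)")
ax_line.set_xlabel("x")
ax_line.set_ylabel("sin(x)")
ax_line.legend()

# Histogram of normally distributed sample data.
rng = np.random.default_rng(0)
ax_hist.hist(rng.normal(size=200), bins=20)
ax_hist.set_title("Histogram")

# Render the figure to PNG bytes in memory instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```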

2. Seaborn

Seaborn is a statistical plotting library for Python that is used along with Matplotlib.
It provides a high-level interface for drawing statistical graphics.
The library is built on top of Matplotlib and supports NumPy and Pandas data structures. It also uses statistical routines from SciPy.
Because it is built on top of Matplotlib, we will often invoke Matplotlib functions directly for simple plots.
The high-level interface of Seaborn, combined with the variety of back-ends for Matplotlib, makes it easy to generate publication-quality figures.

3. Plotly

Plotly is a Python library used for interactive plotting, including 3D plots.
It can be integrated with web applications.
Its easy-to-use API can be imported in Python, and Plotly also offers libraries for other languages.
Plotly can be used to represent real-time data. Users can configure the graphics on both the client side and the server side and exchange data between them.
Plotly interoperates with the Matplotlib data format.

Plotly Features

Plotly is interactive by default.
Charts are not saved as images. They are serialized as JSON, so they can be read easily from R, MATLAB, Julia, etc.
It exports vector graphics for print and publication.
Charts are easy to manipulate and embed on the web.

Natural Language Processing (NLP) Libraries

Natural Language Processing (NLP) is booming, driven in part by applications such as speech recognition. Python supports NLP through a huge number of packages. Some of the commonly used libraries are as follows:

1. NLTK

NLTK stands for Natural Language Toolkit. As the name implies, this Python package is used for common Natural Language Processing (NLP) tasks.

Features of NLTK

Text tagging, classification, and tokenizing.
It facilitates research in NLP and its related fields such as Cognitive Science, Artificial Intelligence, semantic analysis, and Machine Learning.
Semantic reasoning.
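
As a small illustration of tokenizing with NLTK, the sketch below uses `wordpunct_tokenize`, a regular-expression tokenizer that needs no extra downloads, together with `FreqDist` for counting. Tagging and classification usually require additional NLTK data packages and are not shown; the sample sentence is invented.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

text = "Python makes NLP simple. NLP is fun!"

# Split the text into word and punctuation tokens.
tokens = wordpunct_tokenize(text)

# Count how often each alphabetic token occurs.
words = [token for token in tokens if token.isalpha()]
frequencies = FreqDist(words)

print(tokens)
print(frequencies.most_common(1))
```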

2. SpaCy

SpaCy is an open-source library focused on commercial, production use.
SpaCy comprises neural network models for popular languages such as English, German, and Dutch, and offers language support for many more, including Sanskrit.
SpaCy is popular because it processes whole documents as structured objects rather than treating text as plain strings.
SpaCy also provides useful APIs for machine learning and deep learning.
Quora uses SpaCy as a part of its platform.

3. Gensim

Gensim is a platform-independent Python package that uses the NumPy and SciPy packages.
Gensim stands for GENerate SIMilar and can efficiently process large text collections without loading them fully into memory. Hence, it is widely used in the healthcare and financial domains.
Gensim features data streaming, handling of large text collections, and efficient incremental algorithms.
Gensim is designed to extract semantic topics from documents. This extraction is done automatically and efficiently.
This differentiates it from other libraries, most of which target only in-memory, batch processing.
Gensim examines statistical co-occurrence patterns of words within a corpus of training documents. This is done to discover the semantic structure of the documents.

Scraping Libraries

As the web grows tremendously every day, web scraping has gained popularity. Web scraping addresses problems related to crawling and indexing data. Python supports many libraries for web scraping.

1. Scrapy

Scrapy is an open-source framework used to crawl websites, parse web pages, and store the extracted data in a structured format. Scrapy processes requests asynchronously: it can send further requests without waiting for an earlier request to finish.

It keeps processing other requests even when some requests fail or an error occurs while handling them. This allows Scrapy to perform very fast crawls.

2. Beautiful Soup 4

Beautiful Soup, called BS4 for short, is an easy-to-use parser. It is not part of Python’s standard library; it is a third-party package, published on PyPI as beautifulsoup4.

BS4 is a parsing library which can be used to extract data from HTML and XML documents.

BS4 builds a parse tree to help us navigate a parsed document and easily find what we need.

BS4 can automatically detect encoding and handle HTML documents with special characters.
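
A short sketch of navigating the parse tree with BS4. The HTML snippet is invented for the example, and the parser used here is `html.parser` from the standard library so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document with an HTML entity in it.
html = """
<html>
  <head><title>Libraries</title></head>
  <body>
    <ul>
      <li class="lib"><a href="https://numpy.org">NumPy</a></li>
      <li class="lib"><a href="https://scipy.org">SciPy</a></li>
    </ul>
    <p>Caf&eacute;</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree by tag name.
title = soup.title.string

# Find all matching elements and pull out their text and attributes.
items = soup.find_all("li", class_="lib")
names = [item.get_text() for item in items]
links = [item.a["href"] for item in items]

# Special characters are decoded for us.
paragraph = soup.find("p").get_text()

print(title, names, links, paragraph)
```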

3. Urllib

We can use Python’s urllib package to fetch website content from a Python program.

We can also use this library to call REST web services by making GET and POST HTTP requests.

The module supports both HTTP and HTTPS requests. We can send request headers and also read the response headers.
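
The sketch below shows how GET and POST requests with custom headers might be built with urllib. The `https://example.com/api/items` URL, the query parameters, and the `fetch` helper are placeholders for illustration; the requests are only constructed here, and `fetch` is not called, so the example runs without network access.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://example.com/api/items"  # placeholder endpoint for illustration

# GET request: parameters are URL-encoded into the query string.
query = urllib.parse.urlencode({"q": "numpy", "page": 1})
get_request = urllib.request.Request(
    f"{BASE_URL}?{query}",
    headers={"Accept": "application/json"},
)

# POST request: supplying a body (bytes) makes urllib use the POST method.
payload = json.dumps({"name": "pandas"}).encode("utf-8")
post_request = urllib.request.Request(
    BASE_URL,
    data=payload,
    headers={"Content-Type": "application/json"},
)


def fetch(request, timeout=10):
    """Send the request and return (status code, response headers, body text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
        return response.status, dict(response.headers), body


print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```

Calling `fetch(get_request)` against a real endpoint would perform the HTTPS request and return the status code, the response headers, and the body.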

Conclusion

In this article, we have categorized the commonly used Python libraries for Data Science. We hope this tutorial helps Data Scientists dive deep into this vast field and make the most of these Python libraries.

Translated from: https://www.journaldev.com/31010/python-data-science-libraries
