Introduction to Big Data

About Big Data

What is Big Data?

The modern world faces many large-scale problems, and Big Data is one of them. Data collection is now very important; it is often the key to a company's success. But as the number of users grows day by day, the volume of data grows larger and larger.

Some of the companies that acquire enormous amounts of data on a daily basis are Google, Facebook, Twitter, and Instagram. People all around the world post images and other content every day. For example, Facebook generates about 4 petabytes of data per day. See the statistics below.

Stats: Per-Minute Figures

Here are some per-minute figures for various social networks and online services:

Snapchat: over 527,760 photos shared by users
LinkedIn: over 120 professionals join the network
YouTube: 4,146,600 videos watched
Twitter: 456,000 tweets sent
Instagram: 46,740 photos uploaded
Netflix: 69,444 hours of video watched
Giphy: 694,444 GIFs served
Tumblr: 74,220 posts published
Skype: 154,200 calls made by users

So how do these companies manage all this data? The answer is by using a combination of massively parallel systems.

The concept used to handle Big Data is the distributed system. To understand distributed systems, we first need to understand another concept called IOPS.

What is IOPS?

IOPS stands for input/output operations per second. It is a unit for measuring the performance of storage devices: it represents how many read and write operations a given storage device or medium can complete each second. When writing data to a disk, we do not write individual bytes; rather, we write in blocks. Block sizes differ from system to system. For example, SQL Server allocates space in 64 KB extents, whereas Windows (NTFS) uses 4 KB clusters by default. To get a better feel for IOPS, compare SSDs and HDDs. We know that SSDs are faster than HDDs: an SSD typically delivers roughly 3,000 to 40,000 IOPS, whereas an HDD delivers roughly 55 to 80.
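To make the relationship between IOPS and block size concrete, here is a minimal sketch that estimates throughput as IOPS multiplied by block size. The function name `throughput_mb_per_s` is hypothetical, the IOPS figures are the rough ranges quoted above, and real devices will deviate from this idealized product.

```python
def throughput_mb_per_s(iops: int, block_size_kb: int) -> float:
    """Idealized throughput in MB/s: operations per second times block size."""
    return iops * block_size_kb / 1024


# Rough IOPS ranges quoted in the text (assumed, not measured here).
devices = {
    "HDD (low)": 55,
    "HDD (high)": 80,
    "SSD (low)": 3_000,
    "SSD (high)": 40_000,
}

for name, iops in devices.items():
    for block_kb in (4, 64):
        mb_s = throughput_mb_per_s(iops, block_kb)
        print(f"{name:10s} with {block_kb:2d} KB blocks: {mb_s:9.2f} MB/s")
```

The same device moves far more data per second with larger blocks, which is why an IOPS figure is only meaningful together with the block size it was measured at.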

What is a Distributed System?

A distributed system, also known as distributed computing, is a system whose components are located on different machines and communicate and coordinate their actions so that they appear to the end user as a single coherent system.

Consider a storage appliance. Suppose we have 40 TB of data and that writing all 40 TB to a single disk takes 40 minutes. If we split the data into 10 TB chunks and write them to 4 disks in parallel, the whole write takes 10 minutes. If we split it into 5 TB chunks and write to 8 disks in parallel, it takes 5 minutes. In other words, spreading smaller pieces of data across many disks is faster than writing everything to one large disk. This concept is called parallelization, and distributed systems rely on it. It applies not only to storage but also to compute power and many other services.
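The arithmetic in this example can be sketched in a few lines. This is an idealized model, assuming every disk writes at the 1 TB per minute implied by the text and that splitting and coordination cost nothing; the function name `write_time_minutes` is hypothetical.

```python
def write_time_minutes(total_tb: float, disks: int,
                       tb_per_minute_per_disk: float = 1.0) -> float:
    """Time to write total_tb when it is split evenly across disks in parallel."""
    if disks < 1:
        raise ValueError("at least one disk is required")
    chunk_tb = total_tb / disks              # each disk receives an equal chunk
    return chunk_tb / tb_per_minute_per_disk  # all disks write at the same time


for disks in (1, 4, 8):
    minutes = write_time_minutes(40, disks)
    print(f"{disks} disk(s): {40 / disks:g} TB each, {minutes:g} minutes")
```

In practice the speedup is less than linear, because the data has to be split, sent over a network, and tracked, but the basic trend holds.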

Main Benefits of Distributed Systems

Horizontal scalability: Since computing happens independently on each machine, it is easy and generally inexpensive to add machines and functionality as needed.
Reliability: Most distributed systems are fault-tolerant, because they can be made up of hundreds of machines working together. The system generally experiences no disruption if a single machine fails.
Performance: Distributed systems can be very efficient, because workloads can be broken up and sent to multiple machines.
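The idea of breaking a workload up and sending the pieces to multiple machines can be illustrated with a small split-and-combine sketch. Here threads from Python's standard `concurrent.futures` module stand in for separate machines, and the helper names `split`, `process_chunk`, and `distributed_sum_of_squares` are hypothetical. It shows only the pattern; threads in CPython will not actually speed up this kind of CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split items into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """The work one 'machine' does on its own piece of the data."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(items, workers=4):
    """Fan the chunks out to workers, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, split(items, workers)))
    return sum(partials)


data = list(range(1_000))
print(distributed_sum_of_squares(data))
```

Each worker sees only its own chunk, so adding workers (horizontal scaling) does not require changing `process_chunk` at all.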

There are many Big Data technologies, such as Apache Hadoop, Microsoft HDInsight, NoSQL databases, Hive, and Sqoop. Of these, Hadoop is one of the most widely used.

The Three Vs of Big Data

Volume: The amount of data. Big Data is, first of all, about volume, and data volumes can reach unprecedented heights. It is estimated that 2.5 quintillion bytes (2.5 exabytes) of data are created each day.

Velocity: The rate at which data is received and (perhaps) acted on. For example, Facebook users upload more than 900 million photos a day. Facebook has to handle this tsunami of photographs every day: it has to ingest them all, process them, file them, and somehow be able to retrieve them later.

Variety: The many types of data that are available, for example photographs, sensor data, tweets, and encrypted packets. Each of these is very different from the others.

What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually limitless number of concurrent tasks or jobs.

Hadoop uses a master-slave architecture. In its basic form, a cluster consists of a single NameNode (the master node) and a number of DataNodes (the slave nodes). All of the data is stored on the DataNodes, not on the master node; the NameNode keeps the metadata and is used to manage the DataNodes.
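The division of labor between the two kinds of node can be sketched with a toy in-memory model. This is not Hadoop's API: the `NameNode` and `DataNode` classes below are hypothetical, the round-robin placement and the tiny block size are assumptions made for illustration, and replication is left out. In a real cluster the client sends block data to the DataNodes itself; here the master forwards it only to keep the sketch short.

```python
class DataNode:
    """Slave node: stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master node: keeps only metadata about where each block lives."""

    def __init__(self, datanodes, block_size=4):
        self.datanodes = datanodes
        self.block_size = block_size  # tiny on purpose, for illustration
        self.file_table = {}          # filename -> list of (block_id, DataNode)

    def write(self, filename, data: bytes):
        placements = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{filename}#{index}"
            # Assumed placement policy: simple round-robin over the DataNodes.
            node = self.datanodes[index % len(self.datanodes)]
            node.store(block_id, data[offset:offset + self.block_size])
            placements.append((block_id, node))
        self.file_table[filename] = placements

    def read(self, filename) -> bytes:
        return b"".join(node.read(block_id)
                        for block_id, node in self.file_table[filename])


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
master = NameNode(nodes)
master.write("notes.txt", b"hello big data")
print(master.read("notes.txt"))
print({node.name: sorted(node.blocks) for node in nodes})
```

Note that the `NameNode` object never holds file contents after a write, only the table that maps each file to its blocks and their locations.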

Hadoop makes it easier to use all of the storage and processing capacity of the servers in a cluster, and to run distributed processes against huge amounts of data. It provides the building blocks on which other services and applications can be built.

Hadoop is just one piece of software. Many other tools exist for handling Big Data, and each technology has its own benefits.

Importance of Big Data

There is a famous saying: "Data is the new oil." It means that data is very valuable in the present day.

Data improves quality of life.
Data allows organizations to determine the cause of problems more effectively, and to visualize relationships between what is happening in different locations, departments, and systems.
Data analytics provides solutions for many of the problems we face today.
Data helps you understand performance, and it helps you understand your customers.

Translated from: https://medium.com/@kvs.vishnu23/introduction-to-big-data-8f28a4daa73f
