Some Issues When Building an AWS Data Lake Using Spark and How to Deal with These Issues

TECHNICAL TIPS

Introduction

At first, writing and running a Spark application seemed quite easy. If you are experienced with data frame manipulation using pandas, NumPy and other Python packages, and/or with SQL, creating an ETL pipeline for your data using Spark is quite similar, even easier than I thought. And compared to other databases (such as Postgres, Cassandra, or an AWS data warehouse on Redshift), creating a Data Lake database using Spark appears to be a carefree project.

But then, when you deployed the Spark application on the AWS cloud with the full dataset, it started to slow down and fail. Your application ran forever; you didn't even know whether it was running when watching the AWS EMR console. You might not know where it failed: it was difficult to debug. The Spark application behaved differently between local mode and standalone mode, and between the test set (a small portion of the dataset) and the full dataset. The list of problems went on and on. You felt frustrated. Really, you realized that you knew nothing about Spark. Well, optimistically, it was indeed a very good opportunity to learn more about Spark. Running into issues is normal in programming anyway. But how do you solve problems quickly? Where do you start?

After struggling with creating a Data Lake database using Spark, I feel the urge to share what I encountered and how I solved these issues. I hope it is helpful for some of you. And please correct me if I am wrong; I am still a newbie in Spark anyway. Now, let's dive in!

Cautions

1. This article assumes that you already have some working knowledge of Spark, especially PySpark, the command line environment, Jupyter notebooks and AWS. For more about Spark, please read the reference here.

2. You are responsible for monitoring the usage charges on the AWS account you use. Remember to terminate the cluster and other related resources each time you finish working. The EMR cluster is costly.

3. This is one of the assessment projects for the Data Engineering Nanodegree on Udacity. So, to respect the Udacity Honor Code, I will not include the full notebook with the workflow to explore and build the ETL pipeline for the project. Part of the Jupyter notebook version of this tutorial, together with other tutorials on Spark and many more data science tutorials, can be found on my github.

Project Introduction

Project Goal

Sparkify is a startup company working on a music streaming app. Through the app, Sparkify has collected information about user activity and songs, which is stored as a directory of JSON logs (log-data, user activity) and a directory of JSON metadata files (song_data, song information). The data resides in a public S3 bucket on AWS.

In order to improve business growth, Sparkify wants to move its processes and data onto a data lake in the cloud.

This project is a workflow to explore and build an ETL (Extract, Transform, Load) pipeline that:

  • Extracts data from S3
  • Processes the data into analytics tables using Spark on an AWS cluster
  • Loads the data back into S3 as a set of dimensional and fact tables for the Sparkify analytics team to continue finding insights into what songs their users are listening to.

Below are samples from a JSON log file and a JSON song file:

Sample of the log_data json file
Sample of the song_data json file

The dimension and fact tables for this database were designed as follows (fields in bold are partition keys):

(ERD diagram was made using https://dbdiagram.io/)

Project Workflow

This is my workflow for the project. An experienced data engineer might skip many of these steps, but for me, I would rather go slowly and learn more:

  • Build the ETL process step by step using a Jupyter notebook on sample data in a local directory; write the output to the local directory.
  • Validate the ETL process using the sub-dataset on AWS S3; write the output to AWS S3.
  • Put all the code together to build the script etl.py and run it in Spark local mode, testing both the local data and a subset of the data on s3://udacity-den. The output of this task can be tested using a Jupyter notebook, test_data_lake.ipynb.

  • Build and launch an EMR cluster. As far as I know, you can submit the project on Udacity without using EMR, but I highly recommend running it in Spark standalone mode on AWS to see how it works. You will definitely learn a lot more.
  • Submit a Spark job for etl.py on the EMR cluster, using a subset of the data on s3://udacity-den.

  • Finally, submit a Spark job for etl.py on the EMR cluster, using the full dataset on s3://udacity-den.

  • Try to optimize Spark performance using various options.
  • Provide example queries and results for song play analysis. This part is described in another Jupyter notebook called sparkifydb_data_lake_demo.ipynb.

The validation and demo parts can be found on my github. The script file etl.py and my detailed sparkifydb_data_lake_etl.ipynb are not available, in respect of the Udacity Honor Code.

Some Tips and Issues in the Project

Tip 1 - Build the ETL process incrementally in a Jupyter notebook before building the ETL pipeline to process the whole dataset with scripts.

  • Jupyter notebook is a great environment for exploratory data analysis (EDA), testing things out and promptly validating the results. Since debugging and optimizing a Spark application is quite challenging, it is highly recommended to build the ETL process step by step before putting all the code together. You will see how much of an advantage this is when we come to the other tips.

  • Another important reason for using a notebook: it is impractical to create the etl.py script and then try to debug it, since you would have to create a Spark session each time you run the etl.py file. With a notebook, the Spark session is always available.

Tip 2 - Carefully explore the dataset. If the dataset is "big", start the project with a small subset.

In order to work on the project, we first need an overview of the dataset, such as the number of files, the number of lines in each file, the total size of the dataset, the structure of the files, etc. This is especially crucial if we work on the cloud, where requests can cost a lot of time and money.

To do that, we can use boto3, the Amazon Web Services (AWS) SDK for Python. boto3 allows us to access AWS via an IAM user. The details of how to create an IAM user can be found here, Step 2: Create an IAM user.

Below is the way to set up the client for S3 in a Jupyter notebook:

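Since the original snippet is shown as an image in the source post, here is a minimal sketch of what that setup might look like, assuming the key pair is stored in a local credentials.cfg file with an [AWS] section (the option names and variable names here are my own choice):

import configparser
import boto3

# Read the IAM user's key pair from a local config file (see below).
config = configparser.ConfigParser()
config.read('credentials.cfg')

# Create an S3 client in the region where the dataset lives (us-west-2).
s3_client = boto3.client(
    's3',
    region_name='us-west-2',
    aws_access_key_id=config['AWS']['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=config['AWS']['AWS_SECRET_ACCESS_KEY'],
)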

The access key and secret key obtained from the IAM user can be saved to the file credentials.cfg in the local directory, as below. Note that you may run into a "configure file parsing error" if you put your key and secret key inside " " or ' ', or if the file does not have a header such as [AWS].

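A rough sketch of what that file might contain (placeholder values, and the option names are my own choice; the [AWS] header and the absence of quotes are the important parts):

[AWS]
AWS_ACCESS_KEY_ID = your_access_key_id
AWS_SECRET_ACCESS_KEY = your_secret_access_key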

With this client for S3 created by boto3, we can access the dataset for the project and look at the file structures of log-data and song_data:

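The exploration code itself is not reproduced here; a rough sketch of the idea (the bucket name is written as in this post, and the summary logic is mine) could look like:

# Walk all objects under a prefix and summarize the file count and total size.
def explore_prefix(bucket, prefix):
    paginator = s3_client.get_paginator('list_objects_v2')
    n_files, total_bytes = 0, 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            n_files += 1
            total_bytes += obj['Size']
    print(f'{prefix}: {n_files} files, {total_bytes / 1e6:.1f} MB')

explore_prefix('udacity-den', 'log-data/')
explore_prefix('udacity-den', 'song_data/')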

The outputs of the exploration process show the following:

The dataset is not big, ~3.6MB. However, the song_data has ~15,000 files. It is better to use a subset of song_data, such as ‘song_data/A/A/A/’ or ‘song_data/A/’ for exploring/creating/debugging the ETL pipeline first.

Tip 3 - Include a defined schema when reading files into a data frame in Spark

My ETL pipeline worked very well on the subset of the data. However, when I ran it on the whole dataset, the Spark application kept freezing without any error notice. I had to reduce or increase the sub-dataset to actually see the error and fix the problem, for example changing from 'song_data/A/A/A' to 'song_data/A/' and vice versa. So what is the problem here?

  • It turned out that on this specific data, my Spark application could automatically figure out the schema on the small dataset. But it could not on the bigger dataset, perhaps due to inconsistency among the files and/or incompatible data types.
  • Moreover, the loading takes less time with a defined schema.

How to design a correct schema:

  • You can manually create the schema by looking at the structure of the log_data json files and the song_data json files. For simple visualization, I generated the view using a pandas data frame as below.
Sample of the log_data json file
Sample of the song_data json file
  • For me, the trick is letting Spark read and figure out the schema on its own, by reading a small subset of files into a data frame, and then using it to create the right schema. With that, we don't need to guess any data types, whether string, double, long, etc. The demonstration of this trick is as follows:

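A minimal sketch of the idea (paths and variable names are illustrative): read a small subset with schema inference, then reuse the inferred schema for the full read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkify_etl').getOrCreate()

# 1. Let Spark infer the schema from a small, cheap-to-read subset of the song files.
sample_df = spark.read.json('s3a://udacity-den/song_data/A/A/A/*.json')
sample_df.printSchema()            # inspect what Spark inferred

# 2. Reuse that schema when reading the whole dataset, so Spark does not have to
#    sample and guess types across ~15,000 files.
song_schema = sample_df.schema     # a StructType we can pass back in
song_df = spark.read.json('s3a://udacity-den/song_data/*/*/*/*.json', schema=song_schema)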

Tip 4 - Print out the task and record the time of each task

Although it is a best practice in programming, we sometimes forget to do it. For a big dataset, observing the time of each task is very important for debugging and optimizing the Spark application.

With the time recorded, we know that it takes around 9 minutes to read all the .json files from song_data on S3 into the Spark data frame using Spark in local mode.

Unless you turn off INFO logging in Spark, it is very difficult, if not impossible, to follow the progress of the Spark application on the terminal, which is overwhelmed with INFO logging. By printing out the task name and recording the time, everything is better:

Printing out the task name and recording the time helps us keep track of the progress of the application.
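
The log screenshots are not reproduced here; the pattern is just a print before each task plus a timer around it, roughly like this (the helper name is mine):

import time
from datetime import datetime

def log_task(name, func, *args, **kwargs):
    """Print the task name, run it, and report how long it took."""
    print(f'[{datetime.now():%H:%M:%S}] Starting: {name}')
    start = time.time()
    result = func(*args, **kwargs)
    print(f'[{datetime.now():%H:%M:%S}] Finished: {name} in {time.time() - start:.1f} s')
    return result

# Example usage:
# song_df = log_task('Read song_data from S3', spark.read.json,
#                    's3a://udacity-den/song_data/*/*/*/*.json', schema=song_schema)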

Tip 5 - What is the best way to import and use a function from the pyspark package?

There are at least 2 ways to import and use a function, for example:

  • from pyspark.sql.functions import max

  • or import pyspark.sql.functions as F and then use F.max

Either is fine. I prefer the second approach, since I don't need to list all the functions at the top of my script etl.py.

Notice that the max function is an exception, since there is also a built-in max function in Python. To use the max function from the pyspark.sql.functions module, you have to use F.max or an alias, such as from pyspark.sql.functions import max as max_.

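For example (a minimal illustration with a throwaway data frame; it assumes the spark session from the earlier sketch):

import pyspark.sql.functions as F

# A tiny made-up frame just to demonstrate the call.
df = spark.createDataFrame([(1, 200.0), (2, 350.5)], ['song_id', 'duration'])

# F.max refers unambiguously to the Spark SQL aggregate, not Python's built-in max().
df.agg(F.max('duration').alias('max_duration')).show()

# The alias form works as well:
# from pyspark.sql.functions import max as max_
# df.agg(max_('duration').alias('max_duration')).show()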

Tip 6 - What is wrong when my Spark application freezes?

There could be many problems. I ran into a few myself:

  1. Difference in AWS region: Please make sure to use us-west-2 when setting up boto3, the EMR cluster, the S3 output bucket, etc., since the available dataset is in that AWS region.

  2. Not including a defined schema when reading files into a data frame: fix using Tip 3.

  3. It takes a very long time to run the ETL pipeline on the whole dataset: The project is quite impractical because reading from and writing to S3 from EMR/Spark is extremely slow. When running the ETL pipeline on a small sub-dataset, you can see the same pattern of INFO logging repeat again and again on the terminal, such as the one below:

It was on the "INFO ContextCleaner: Cleaned accumulator xxx" messages that my Spark application appeared to be freezing again and again. It is expected to be a long-running job; it took me ~115 min to write only the songs table into the S3 bucket. So if you are sure that your end-to-end process works perfectly, be patient for 2 hours to see how it works. The process can be sped up; check out Tip 9 below.

4. Checking the running time on the AWS EMR console: You can see how long your Spark application ran by choosing the Application user interfaces tab of your cluster on the EMR console. The list of applications can be found at the end of the page:

My ETL pipeline on the whole dataset took ~2.1 hours to finish on the EMR cluster (1 Master node and 2 Core nodes of type m5.xlarge).

Tip 7 - Auto-increment for songplays_id using Spark - it is not a trivial issue.

This issue is trivial in other databases: in Postgres, we can use SERIAL to auto-increment a column, such as songplays_id SERIAL PRIMARY KEY. In AWS Redshift, we can use IDENTITY(seed, step).

It is not trivial to perform auto-increment for a table using Spark, at least when you try to understand it deeply and take Spark performance into consideration. Here is one of the good references for understanding auto-increment in Spark.

There are 3 methods for this task:

  • Using the row_number() function with Spark SQL
  • Using RDDs to create indexes and then converting back to a data frame with the rdd.zipWithIndex() function
  • Using monotonically_increasing_id()

I prefer the rdd.zipWithIndex() function:

Step 1: From the songplays_table data frame, use the rdd interface to create indexes with zipWithIndex(). The result is a list of rows; each row contains 2 elements: (i) all the columns from the old data frame zipped into a "row", and (ii) the auto-increment index:

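A sketch of this step (the table and column names are illustrative):

# Step 1: zip every row of the data frame with an increasing index.
# Each element of the resulting RDD is a tuple: (Row(<original columns>), index).
indexed_rdd = songplays_table.rdd.zipWithIndex()
indexed_rdd.take(2)   # e.g. [(Row(start_time=..., user_id=..., ...), 0), (Row(...), 1)]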

Step 2: Convert it back to a data frame. We need to write a lambda function for it.

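And a sketch of this step, flattening each (row, index) pair back into a single row with songplays_id appended (again, only a sketch of the approach):

from pyspark.sql import Row

# Step 2: flatten (Row, index) back into one Row that includes the new songplays_id.
songplays_table = indexed_rdd.map(
    lambda pair: Row(songplays_id=pair[1], **pair[0].asDict())
).toDF()

songplays_table.select('songplays_id').show(5)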

Tip 8 - Talking about time, how long does it take to load and write each table?

Below is the time for running the Spark application on AWS EMR cluster, reading from and writing to S3:

My EMR cluster had 1 Master node and 2 Core nodes of type m5.xlarge, as shown below:

aws emr create-cluster --name test-emr-cluster --use-default-roles --release-label emr-5.28.0 --instance-count 3 --instance-type m5.xlarge --applications Name=JupyterHub Name=Spark Name=Hadoop --ec2-attributes KeyName=emr-cluster  --log-uri s3://s3-for-emr-cluster/

Tip 9 - How to speed up the ETL pipeline?

We definitely want to optimize the Spark application, since reading from and writing to S3 take a long time. Here are some optimizations that I have tried:

Set spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version to 2

You can read about it in detail here. It can be done simply by adding spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2") to the Spark session.

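For reference, a sketch of the two equivalent ways to set it, either at session build time (with the spark.hadoop. prefix) or afterwards on the running session:

from pyspark.sql import SparkSession

# Option 1: set it when building the session (note the spark.hadoop. prefix).
spark = (
    SparkSession.builder
    .appName('sparkify_etl')
    .config('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')
    .getOrCreate()
)

# Option 2: set it on an existing session, as mentioned above.
spark.conf.set('mapreduce.fileoutputcommitter.algorithm.version', '2')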

With this optimization, the total ETL time dropped dramatically, from ~2.1 hours to only 30 minutes.

Use HDFS to speed up the process

- “On a per node basis, HDFS can yield 6X higher read throughput than S3”. So we can save the analytics tables to HDFS, then copy from HDFS to S3. We could use s3-dist-cp to copy from HDFS to S3.

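A sketch of that idea: write the analytics tables to HDFS on the cluster first, then copy the finished output to S3 (the copy itself runs outside PySpark, for example with the s3-dist-cp tool that ships with EMR; table and path names here are illustrative):

# Write the songs table to HDFS on the EMR cluster instead of directly to S3.
songs_table.write.parquet('hdfs:///sparkify/songs_table.parquet', mode='overwrite')

# Then, from the master node, copy the finished output from HDFS to S3, e.g.:
#   s3-dist-cp --src hdfs:///sparkify/songs_table.parquet --dest s3://<output-bucket>/songs_table.parquet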

Tip 10 - How is the output? How do the analytics tables turn out on S3?

This ETL pipeline is a long-running job, in which the task writing the songs table took most of the time. The songs table was partitioned by "year" and "artist", which can produce skewed data, since some early years (1961 to 199x) do not contain many songs compared to the years 200x.

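For context, the write that dominates the runtime looks roughly like this sketch (I use artist_id as the artist partition key and an output_path variable; adjust to your own schema and bucket):

# Writing the songs table, partitioned by year and artist, is the slowest task:
# early years hold few songs, so partition sizes are heavily skewed.
songs_table.write.partitionBy('year', 'artist_id').parquet(
    output_path + 'songs_table.parquet', mode='overwrite'
)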

The data quality check results, which verify that the ETL pipeline successfully added all the records to the tables, together with some example queries and results for song play analysis, can be found in my notebook on github.

Tip 11 - Don't let the AWS Billing Dashboard confuse you

Although I have used AWS "quite a lot" and already reached the Free Tier usage limit with this account, whenever I came to the Billing Dashboard, the total amount due was 0.

Don't let the AWS Billing Dashboard confuse you. What it shows is the total balance, not your AWS expense. It is the balance, which, according to Wikipedia, "is the difference between the sum of debit entries and the sum of credit entries entered into an account during a financial period."

I thought that when looking at the AWS Billing Dashboard, I would see the amount I had spent so far, my AWS expense. But no. Even when clicking on the Bill Details, everything was 0. And so I thought that I hadn't used AWS that much. My promo credit was still safe.

Only when, one day, I clicked on the Expand All button did I realize, to my big surprise, that my promo credit was almost gone! So again, what you see on the Dashboard is the balance, not the expense. Be careful when using your EMR and EC2 clusters. It may cost you more money than you thought. (Well, although I admit that gaining AWS experience is worth it.)

Thank you so much for reading this lengthy post. I am aware that people easily get discouraged by long posts, but I wanted to give you a consolidated report. Good luck with your project, and I am more than happy to have any discussion.

The Jupyter notebook version of this post, together with other tutorials on Spark and many more data science tutorials, can be found on my github.

Translated from: https://towardsdatascience.com/some-issues-when-building-an-aws-data-lake-using-spark-and-how-to-deal-with-these-issues-529ce246ba59
