BigQuery Tutorial

This Medium article is a detailed walkthrough of the steps I took to solve the challenge lab of the Insights from Data with BigQuery Skill Badge on the Google Cloud Platform (Qwiklabs). I got access to this lab through the Google Cloud Ready Facilitator Program. Thanks to Google!

So far, I have completed over 100 labs and 23 quests on Qwiklabs. My profile is referenced below.

This lab is recommended only for students who have completed the labs in the Insights from Data with BigQuery Quest. Knowledge of SQL and BigQuery is also needed to solve this challenge lab. Are you up for the challenge? Let's go!

Dataset Used

The dataset used in this challenge lab is bigquery-public-data.covid19_open_data.covid19_open_data. It contains COVID-19 data broken down by country worldwide, and every query in this tutorial reads from it.

A BigQuery tutorial can be found at the reference below:

Challenge Scenario

There are 10 small tasks in this challenge lab, and all of them must be completed to score 100/100: nine SQL queries and one Data Studio report. This tutorial lists the steps I took to solve all ten tasks. The tasks are as follows:

  1. Building a SQL query that outputs the total number of confirmed cases.
  2. Building a SQL query that outputs the worst affected areas.
  3. Building a SQL query that identifies the hotspots in the USA.
  4. Building a SQL query that outputs the fatality ratio.
  5. Building a SQL query that identifies a specific day according to the constraints.
  6. Building a SQL query that outputs the number of days with zero net new cases.
  7. Building a SQL query that outputs the doubling rate.
  8. Building a SQL query that outputs the recovery rate.
  9. Building a SQL query that outputs the CDGR (Cumulative Daily Growth Rate).
  10. Creating a Data Studio report.

Important Note

Before starting this lab, make sure you do only what the lab requires. Allocating extra resources or doing anything the lab does not ask for may get your account blocked by the Qwiklabs admins. Don't worry if that happens: I ran into this problem myself, and the account can be unblocked quickly by contacting Qwiklabs support.

Loading the Dataset

  1. In the Cloud Console, once you are fully logged in, go to Menu > BigQuery.
  2. Click + Add Data, then click Explore Public Datasets in the left pane.
  3. Search for covid19_open_data and select “Covid-19 Open Data”. Click View Dataset to explore further.
  4. Use the filter to locate the table covid19_open_data under the covid19_open_data dataset.

Image by Wynn Pointaux on Pixabay

Detailed Tutorial of Task 1

Task 1 requires a query that outputs the total count of confirmed cases on Apr 15, 2020. The output should be a single row containing the sum of confirmed cases across all the countries in the dataset, in a column named total_cases_worldwide.

Copy the query below into the query editor and click RUN.

SELECT
  SUM(cumulative_confirmed) AS total_cases_worldwide
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  date = "2020-04-15"

Detailed Tutorial of Task 2

Task 2 requires a query that answers: “How many states in the US had more than 100 deaths on Apr 10, 2020?” The output field should be named count_of_states.

Hint: Do not include NULL values. (Important)

Copy the query below into the query editor and click RUN.

SELECT
  COUNT(*) AS count_of_states
FROM (
  SELECT
    subregion1_name AS state,
    SUM(cumulative_deceased) AS death_count
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "United States of America"
    AND date = '2020-04-10'
    AND subregion1_name IS NOT NULL
  GROUP BY
    subregion1_name
)
WHERE
  death_count > 100
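
To make the logic of this query concrete, here is a minimal Python sketch of the same steps on a handful of made-up rows: skip rows without a state name, sum the deaths per state, then count the states above the threshold. The function name count_states_over and the sample rows are hypothetical and only for illustration; they are not taken from the dataset.

```python
from collections import defaultdict


def count_states_over(rows, threshold):
    """Count states whose summed deaths exceed threshold.

    rows: iterable of (state_name_or_None, deaths) pairs (hypothetical shape).
    """
    deaths_by_state = defaultdict(int)
    for state, deaths in rows:
        if state is None:
            # Mirrors: subregion1_name IS NOT NULL
            continue
        # Mirrors: SUM(cumulative_deceased) ... GROUP BY subregion1_name
        deaths_by_state[state] += deaths
    # Mirrors: WHERE death_count > threshold, then COUNT(*)
    return sum(1 for total in deaths_by_state.values() if total > threshold)


# Made-up sample rows, not real figures.
sample = [("A", 80), ("A", 30), ("B", 100), (None, 5000), ("C", 250)]
print(count_states_over(sample, 100))  # 2
```

State B sits exactly at 100 and is not counted, because the comparison is strictly greater than, just as in the query.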

Detailed Tutorial of Task 3

Task 3 requires a query that answers: “List all the states in the United States of America that had more than 1000 confirmed cases on Apr 10, 2020.” The output should have two columns, state and total_confirmed_cases, holding the state name and its confirmed cases, sorted by confirmed cases in descending order.

Copy the query below into the query editor and click RUN.

SELECT
  subregion1_name AS state,
  SUM(cumulative_confirmed) AS total_confirmed_cases
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  country_name = "United States of America"
  AND date = "2020-04-10"
GROUP BY
  subregion1_name
HAVING
  total_confirmed_cases > 1000
ORDER BY
  total_confirmed_cases DESC

Detailed Tutorial of Task 4

Task 4 requires a query that answers the following question: “What was the case-fatality ratio in Italy for the month of April 2020?”

The case-fatality ratio is defined as (total deaths / total confirmed cases) * 100. The output should have three columns named total_confirmed_cases, total_deaths and case_fatality_ratio.

Copy the query below into the query editor and click RUN.

SELECT
  SUM(cumulative_confirmed) AS total_confirmed_cases,
  SUM(cumulative_deceased) AS total_deaths,
  (SUM(cumulative_deceased) / SUM(cumulative_confirmed)) * 100 AS case_fatality_ratio
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  country_name = "Italy"
  AND date BETWEEN "2020-04-01" AND "2020-04-30"
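
As a quick check of the arithmetic, the case-fatality ratio can be worked by hand on small numbers. The sketch below is a plain Python version of the formula from the task text; the function name case_fatality_ratio and the input totals are made up for illustration and are not results from the dataset.

```python
def case_fatality_ratio(total_deaths, total_confirmed):
    """Return (total deaths / total confirmed cases) * 100."""
    if total_confirmed == 0:
        raise ValueError("total_confirmed must be non-zero")
    return total_deaths / total_confirmed * 100


# Made-up totals for illustration only: 25 deaths out of 200 confirmed cases.
print(case_fatality_ratio(25, 200))  # 12.5
```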

Detailed Tutorial of Task 5

Task 5 requires a query that answers the following question: “On what day did the total number of deaths cross 10000 in Italy?”

The query should output the date in a column named “date”, in the format “yyyy-mm-dd”.

Copy the query below into the query editor and click RUN.

SELECT
  date
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  country_name = 'Italy'
  AND cumulative_deceased > 10000
ORDER BY
  date
LIMIT 1
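
The pattern here is "filter on a threshold, sort by date, keep the first row". A minimal Python sketch of that pattern follows; the function name first_date_over and the sample rows are hypothetical and do not come from the dataset.

```python
def first_date_over(rows, threshold):
    """Return the earliest date whose cumulative count exceeds threshold.

    rows: iterable of (iso_date_string, cumulative_count) pairs.
    ISO 'yyyy-mm-dd' strings sort chronologically, so min() finds the
    first matching day. Returns None when no day crosses the threshold.
    """
    matching = [day for day, count in rows if count > threshold]
    return min(matching) if matching else None


# Made-up rows, deliberately out of order.
sample = [("2020-01-03", 12000), ("2020-01-01", 9000), ("2020-01-02", 10500)]
print(first_date_over(sample, 10000))  # 2020-01-02
```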

Detailed Tutorial of Task 6

The query given in the lab should be updated to output the correct number of days in India between 21 Feb 2020 and 15 March 2020 on which the number of confirmed cases did not increase.

Copy the query below into the query editor and click RUN.

WITH india_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "India"
    AND date BETWEEN '2020-02-21' AND '2020-03-15'
  GROUP BY
    date
  ORDER BY
    date ASC
),
india_previous_day_comparison AS (
  SELECT
    date,
    cases,
    LAG(cases) OVER (ORDER BY date) AS previous_day,
    cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases
  FROM
    india_cases_by_date
)
SELECT
  COUNT(date)
FROM
  india_previous_day_comparison
WHERE
  net_new_cases = 0
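
The key idea in this query is LAG, which pairs each day with the previous day's total so the two can be subtracted. The Python sketch below imitates that pairing with zip on a made-up list of cumulative totals; the function name count_zero_growth_days and the numbers are hypothetical.

```python
def count_zero_growth_days(daily_totals):
    """Count days whose cumulative total equals the previous day's total.

    daily_totals: list of cumulative case counts, already sorted by date.
    The first day has no previous day (LAG returns NULL for it in SQL),
    so it is never counted.
    """
    return sum(
        1
        for previous, current in zip(daily_totals, daily_totals[1:])
        if current - previous == 0
    )


# Made-up cumulative totals for six consecutive days.
print(count_zero_growth_days([3, 3, 3, 5, 5, 8]))  # 3
```

Days 2, 3 and 5 repeat the previous total, so the count is 3.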

Detailed Tutorial of Task 7

Using the query from Task 6 as a template, build a query that finds the dates between March 22, 2020 and April 20, 2020 on which confirmed cases in the US increased by more than 10% compared to the previous day.

The output should have four columns named Date, Confirmed_Cases_On_Day, Confirmed_Cases_Previous_Day and Percentage_Increase_In_Cases.

Copy the query below into the query editor and click RUN.

WITH us_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "United States of America"
    AND date BETWEEN '2020-03-22' AND '2020-04-20'
  GROUP BY
    date
  ORDER BY
    date ASC
),
us_previous_day_comparison AS (
  SELECT
    date,
    cases,
    LAG(cases) OVER (ORDER BY date) AS previous_day,
    cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases,
    (cases - LAG(cases) OVER (ORDER BY date)) * 100 / LAG(cases) OVER (ORDER BY date) AS percentage_increase
  FROM
    us_cases_by_date
)
SELECT
  Date,
  cases AS Confirmed_Cases_On_Day,
  previous_day AS Confirmed_Cases_Previous_Day,
  percentage_increase AS Percentage_Increase_In_Cases
FROM
  us_previous_day_comparison
WHERE
  percentage_increase > 10
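
The percentage increase is (cases - previous_day) * 100 / previous_day, computed for each day against the day before it. Here is a small Python sketch of that calculation on made-up rows; the function name days_with_jump and the sample data are hypothetical. One difference to be aware of: the sketch skips a zero previous-day value, whereas the query above makes no such provision.

```python
def days_with_jump(rows, min_percent):
    """Return (date, cases, previous_cases, percent_increase) tuples.

    rows: list of (iso_date_string, cumulative_cases), sorted by date.
    Only days whose increase over the previous day exceeds min_percent
    are kept; the first day has no previous day and is skipped.
    """
    result = []
    for (_, previous), (day, cases) in zip(rows, rows[1:]):
        if previous == 0:
            continue  # percentage is undefined against a zero baseline
        percent = (cases - previous) * 100 / previous
        if percent > min_percent:
            result.append((day, cases, previous, percent))
    return result


# Made-up rows: a 20% jump on the second day, a 5% rise on the third.
sample = [("2020-01-01", 100), ("2020-01-02", 120), ("2020-01-03", 126)]
print(days_with_jump(sample, 10))  # [('2020-01-02', 120, 100, 20.0)]
```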

Detailed Tutorial of Task 8

Task 8 requires a query that lists the recovery rates of countries on May 10, 2020, restricted to countries with more than 50K confirmed cases, sorted by recovery rate in descending order and limited to 10 rows. To score full marks, the output columns must be named country, recovered_cases, confirmed_cases and recovery_rate.

Copy the query below into the query editor and click RUN.

WITH cases_by_country AS (
  SELECT
    country_name AS country,
    SUM(cumulative_confirmed) AS cases,
    SUM(cumulative_recovered) AS recovered_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    date = "2020-05-10"
  GROUP BY
    country_name
),
recovered_rate AS (
  SELECT
    country,
    cases,
    recovered_cases,
    (recovered_cases * 100) / cases AS recovery_rate
  FROM
    cases_by_country
)
SELECT
  country,
  cases AS confirmed_cases,
  recovered_cases,
  recovery_rate
FROM
  recovered_rate
WHERE
  cases > 50000
ORDER BY
  recovery_rate DESC
LIMIT 10
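
The query computes recovered_cases * 100 / cases per country, filters on the confirmed-case threshold, sorts and truncates. A rough Python equivalent on made-up rows is sketched below; the function name top_recovery_rates, the country labels and the counts are all hypothetical.

```python
def top_recovery_rates(rows, min_cases, limit):
    """Rank countries by recovery rate (recovered * 100 / confirmed).

    rows: iterable of (country, confirmed_cases, recovered_cases).
    Countries with min_cases or fewer confirmed cases are dropped.
    Returns (country, recovered_cases, confirmed_cases, recovery_rate)
    tuples, highest rate first, at most `limit` of them.
    """
    rated = [
        (country, recovered, confirmed, recovered * 100 / confirmed)
        for country, confirmed, recovered in rows
        if confirmed > min_cases
    ]
    rated.sort(key=lambda row: row[3], reverse=True)
    return rated[:limit]


# Made-up rows: Z has a high rate but too few confirmed cases to qualify.
sample = [("X", 60000, 30000), ("Y", 80000, 60000), ("Z", 40000, 39000)]
for row in top_recovery_rates(sample, 50000, 10):
    print(row)
# ('Y', 60000, 80000, 75.0)
# ('X', 30000, 60000, 50.0)
```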

Detailed Tutorial of Task 9

Task 9 requires a query that outputs the correct CDGR in the correct format. The CDGR, or Cumulative Daily Growth Rate, is calculated as:

CDGR = ((last_day_cases / first_day_cases) ^ (1 / days_diff)) - 1

where last_day_cases, first_day_cases and days_diff are given as:

  • last_day_cases is the number of confirmed cases on May 10, 2020
  • first_day_cases is the number of confirmed cases on Feb 02, 2020
  • days_diff is the number of days between Feb 02 and May 10, 2020

Copy the query below into the query editor and click RUN.

-- Note: the task text above gives Feb 02, 2020 as the first day, while this
-- query filters on 2020-01-24. Use the dates shown in your own lab instance.
WITH france_cases AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS total_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "France"
    AND date IN ('2020-01-24', '2020-05-10')
  GROUP BY
    date
  ORDER BY
    date
),
summary AS (
  SELECT
    total_cases AS first_day_cases,
    LEAD(total_cases) OVER (ORDER BY date) AS last_day_cases,
    DATE_DIFF(LEAD(date) OVER (ORDER BY date), date, day) AS days_diff
  FROM
    france_cases
  -- Sort explicitly so that LIMIT 1 keeps the first-day row.
  ORDER BY
    date
  LIMIT 1
)
SELECT
  first_day_cases,
  last_day_cases,
  days_diff,
  POWER(last_day_cases / first_day_cases, 1 / days_diff) - 1 AS cdgr
FROM
  summary
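
To see what the CDGR formula does, it helps to work it on numbers where the answer is obvious. The Python sketch below applies the same formula; the function name cdgr and the case counts are made up for illustration. The last line only checks the day count for the date range named in the task text.

```python
from datetime import date


def cdgr(first_day_cases, last_day_cases, first_day, last_day):
    """Cumulative daily growth rate:
    (last_day_cases / first_day_cases) ** (1 / days_diff) - 1
    """
    days_diff = (last_day - first_day).days
    return (last_day_cases / first_day_cases) ** (1 / days_diff) - 1


# Made-up counts: a 16x rise over 4 days means cases doubled every day,
# which is a daily growth rate of 1.0 (that is, 100%).
rate = cdgr(100, 1600, date(2020, 1, 1), date(2020, 1, 5))
print(round(rate, 6))  # 1.0

# Days between Feb 02 and May 10, 2020 (2020 is a leap year).
print((date(2020, 5, 10) - date(2020, 2, 2)).days)  # 98
```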

Detailed Tutorial of Task 10

Creating the Data Studio report takes several steps.

1. First, copy the query below into the query editor and click RUN.

SELECT
  date,
  SUM(cumulative_confirmed) AS country_cases,
  SUM(cumulative_deceased) AS country_deaths
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  date BETWEEN '2020-03-15' AND '2020-04-30'
  AND country_name = 'United States of America'
GROUP BY
  date

2. Click EXPLORE DATA > Explore with Data Studio.

3. Give access to Data Studio and authorize it to control BigQuery.

If the report fails to open the first time you log in to Data Studio, click the + Blank Report option and accept the Terms of Service. Then go back to the BigQuery page and click Explore with Data Studio again.

4. Create a new time series chart in the new Data Studio report by selecting Add a chart > Time series chart.

5. Add country_cases and country_deaths to the Metric field.

6. Click Save to commit the change.

Congratulations!!

This is the skill badge I got after completing this challenge lab :P

Google Cloud — Skill Badge (Image by author)

With this, we have come to the end of this challenge lab. Thanks for reading and following along. I hope you enjoyed it!

My Portfolio and LinkedIn :)

Translated from: https://medium.com/swlh/insights-from-data-with-bigquery-challenge-lab-tutorial-f868992ef9dc
