How to do text similarity search and document clustering in BigQuery

BigQuery offers the ability to load a TensorFlow SavedModel and carry out predictions. This capability is a great way to add text-based similarity and clustering on top of your data warehouse.

Follow along by copy-pasting queries from my notebook in GitHub. You can try out the queries in the BigQuery console or in an AI Platform Jupyter notebook.

Storm reports data

As an example, I’ll use a dataset consisting of wind reports phoned into National Weather Service offices by “storm spotters”. This is a public dataset in BigQuery and it can be queried as follows:

SELECT
  EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
  ST_GeogPoint(longitude, latitude) AS location,
  comments
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
LIMIT 10

The result looks like this:

Let’s say that we want to build a SQL query to search for comments that look like “power line down on a home”.

Steps:

Load a machine learning model that creates an embedding (essentially a compact numerical representation) of some text.
Use the model to generate the embedding of our search term.
Use the model to generate the embedding of every comment in the wind reports table.
Look for rows where the two embeddings are close to each other.
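
The four steps above can be sketched outside BigQuery in a few lines of Python. This is only an illustration under loud assumptions: toy_embed is a hypothetical bag-of-words stand-in for the embedding model, VOCAB and the sample comments are invented, and none of it is the article's actual method, which runs the Swivel model inside BigQuery.

```python
import math
from collections import Counter

# Hypothetical stand-in for the embedding model: a bag-of-words count vector
# over a tiny fixed vocabulary. The article uses the Swivel model instead.
VOCAB = ["power", "line", "down", "home", "house", "tree", "roof", "road"]


def toy_embed(text):
    """Step 1: map text to a fixed-length vector of word counts."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(term, comments, top_k=3):
    """Rank comments by the distance between their embedding and the term's."""
    term_vec = toy_embed(term)                                  # step 2
    scored = [(euclidean(term_vec, toy_embed(c)), c)            # step 3
              for c in comments]
    return sorted(scored)[:top_k]                               # step 4


# Invented sample comments, not rows from the public dataset.
comments = [
    "Tree down on road",
    "Power line down on a house",
    "Roof blown off barn",
    "Power line down",
]
results = search("power line down on a home", comments, top_k=2)
for dist, comment in results:
    print(f"{dist:.3f}  {comment}")
```

Note that this toy embedding treats "home" and "house" as unrelated words, so the shorter comment ranks first. A learned embedding is what lets the real search treat the two as similar.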

Loading a text embedding model into BigQuery

TensorFlow Hub has a number of text embedding models. For best results, you should use a model that has been trained on data that is similar to your dataset and which has a sufficient number of dimensions so as to capture the nuances of your text.

For this demonstration, I’ll use the Swivel embedding which was trained on Google News and has 20 dimensions (i.e., it is pretty coarse). This is sufficient for what we need to do.

The Swivel embedding layer is already available in TensorFlow SavedModel format, so we simply need to download it, extract it from the tarred, gzipped file, and upload it to Google Cloud Storage:

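FILE=swivel.tar.gz
# Create the working directory before downloading into it
mkdir -p tmp
# Quote the URL so the shell does not interpret the "?" in it
wget --quiet -O tmp/${FILE} "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1?tf-hub-format=compressed"
cd tmp
tar xvfz ${FILE}
cd ..
mv tmp swivel
# Upload the extracted SavedModel to Cloud Storage
gsutil -m cp -R swivel gs://${BUCKET}/swivel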

Once the model files are on GCS, we can load the model into BigQuery as an ML model:

CREATE OR REPLACE MODEL advdata.swivel_text_embed
OPTIONS(model_type='tensorflow', model_path='gs://BUCKET/swivel/*')

Try out the embedding model in BigQuery

To try out the model in BigQuery, we need to know its input and output schema. These would be the names of the Keras layers when it was exported. We can get them by going to the BigQuery console and viewing the “Schema” tab of the model:

Let’s try this model out by getting the embedding for a famous August speech. We pass the input text in a column named sentences, and we know from the schema that we will get back an output column named output_0:

SELECT output_0 FROM
ML.PREDICT(MODEL advdata.swivel_text_embed, (
  SELECT "Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially." AS sentences))

The result has 20 numbers as expected, the first few of which are shown below:

Document similarity search

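Define a function to compute the Euclidean distance between a pair of embeddings. A helper, td, returns the squared difference at one index, and term_distance takes the square root of the sum over all 20 dimensions: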

CREATE TEMPORARY FUNCTION td(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>, idx INT64) AS (
  (a[OFFSET(idx)] - b[OFFSET(idx)]) * (a[OFFSET(idx)] - b[OFFSET(idx)])
);

CREATE TEMPORARY FUNCTION term_distance(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>) AS ((
  SELECT SQRT(SUM(td(a, b, idx))) FROM UNNEST(GENERATE_ARRAY(0, 19)) idx
));
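
As a cross-check on what the two SQL functions compute, here is a rough Python equivalent. It is a sketch, not part of the original notebook: the function names mirror the SQL ones, the sample vectors are invented, and the loop uses the length of the input rather than the hard-coded 0 to 19 range.

```python
import math


def td(a, b, idx):
    """Squared difference of the idx-th components (mirrors the SQL td)."""
    return (a[idx] - b[idx]) ** 2


def term_distance(a, b):
    """Square root of the summed squared differences: the Euclidean distance."""
    return math.sqrt(sum(td(a, b, idx) for idx in range(len(a))))


# Invented 3-dimensional vectors; the Swivel embeddings have 20 dimensions.
a = [0.0, 3.0, 1.0]
b = [4.0, 0.0, 1.0]
distance = term_distance(a, b)
print(distance)  # 5.0
```

Because the square root is monotonic, ranking by this distance gives the same order as ranking by the squared distance, so dropping SQRT would not change which rows come out on top.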

Then, compute the embedding for our search term:

WITH search_term AS (
  SELECT output_0 AS term_embedding
  FROM ML.PREDICT(MODEL advdata.swivel_text_embed, (
    SELECT "power line down on a home" AS sentences))
)

and compute the distance between each comment’s embedding and the term_embedding of the search term (above):

SELECT
  term_distance(term_embedding, output_0) AS termdist,
  comments
FROM ML.PREDICT(MODEL advdata.swivel_text_embed, (
  SELECT comments, LOWER(comments) AS sentences
  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
  WHERE EXTRACT(YEAR from timestamp) = 2019
)), search_term
ORDER BY termdist ASC
LIMIT 10

The result is:

Remember that we searched for “power line down on a home”. Note that the top two results are “power line down on house”: the text embedding has been helpful in recognizing that home and house are similar in this context. The next set of top matches are all about power lines, the most distinctive pair of words in our search term.

Document clustering

Document clustering involves using the embeddings as an input to a clustering algorithm such as K-Means. We can do this in BigQuery itself, and to make things a bit more interesting, we’ll use the location and day-of-year as additional inputs to the clustering algorithm.

CREATE OR REPLACE MODEL advdata.storm_reports_clustering
OPTIONS(model_type='kmeans', NUM_CLUSTERS=10) AS
SELECT
  arr_to_input_20(output_0) AS comments_embed,
  EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
  longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed, (
  SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
  WHERE EXTRACT(YEAR from timestamp) = 2019
))
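
To make the idea of clustering on an embedding plus extra columns concrete, here is a minimal Python sketch. It is plain Lloyd's algorithm with a naive initialization, not BigQuery ML's implementation, and the rows are hypothetical: a 2-dimensional stand-in for the embedding and a day-of-year value assumed to be already rescaled to the range 0 to 1.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical rows: the field names echo the query's columns, the values
# are invented for illustration.
rows = [
    {"comments_embed": [0.9, 0.1], "julian_day": 0.2},
    {"comments_embed": [0.1, 0.8], "julian_day": 0.9},
    {"comments_embed": [1.0, 0.2], "julian_day": 0.1},
    {"comments_embed": [0.2, 0.9], "julian_day": 1.0},
]
# Each point is the embedding concatenated with the additional feature.
points = [row["comments_embed"] + [row["julian_day"]] for row in rows]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

The sketch assumes all features are on comparable scales. With raw values, a column such as day-of-year (1 to 365) would dominate the distance unless the features are rescaled first.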

The embedding (output_0) is an array, but BigQuery ML currently wants named inputs. The workaround is to convert the array to a struct:

CREATE TEMPORARY FUNCTION arr_to_input_20(arr ARRAY<FLOAT64>)
RETURNS STRUCT<p1 FLOAT64, p2 FLOAT64, p3 FLOAT64, p4 FLOAT64,
               p5 FLOAT64, p6 FLOAT64, p7 FLOAT64, p8 FLOAT64,
               p9 FLOAT64, p10 FLOAT64, p11 FLOAT64, p12 FLOAT64,
               p13 FLOAT64, p14 FLOAT64, p15 FLOAT64, p16 FLOAT64,
               p17 FLOAT64, p18 FLOAT64, p19 FLOAT64, p20 FLOAT64>
AS (STRUCT(
    arr[OFFSET(0)]
    , arr[OFFSET(1)]
    , arr[OFFSET(2)]
    , arr[OFFSET(3)]
    , arr[OFFSET(4)]
    , arr[OFFSET(5)]
    , arr[OFFSET(6)]
    , arr[OFFSET(7)]
    , arr[OFFSET(8)]
    , arr[OFFSET(9)]
    , arr[OFFSET(10)]
    , arr[OFFSET(11)]
    , arr[OFFSET(12)]
    , arr[OFFSET(13)]
    , arr[OFFSET(14)]
    , arr[OFFSET(15)]
    , arr[OFFSET(16)]
    , arr[OFFSET(17)]
    , arr[OFFSET(18)]
    , arr[OFFSET(19)]
));
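
Writing out 20 fields by hand gets tedious if you switch to an embedding with more dimensions. One option is to generate the function text from a notebook cell. The helper below is a hypothetical convenience, not something from the original notebook; it only builds the SQL string and does not talk to BigQuery.

```python
def arr_to_input_sql(num_dims, func_name=None):
    """Build the text of a BigQuery temporary function that turns an
    ARRAY<FLOAT64> of length num_dims into a STRUCT with fields p1..pN."""
    if func_name is None:
        func_name = f"arr_to_input_{num_dims}"
    # Field declarations are 1-based (p1..pN); array offsets are 0-based.
    fields = ", ".join(f"p{i} FLOAT64" for i in range(1, num_dims + 1))
    values = ",\n  ".join(f"arr[OFFSET({i})]" for i in range(num_dims))
    return (
        f"CREATE TEMPORARY FUNCTION {func_name}(arr ARRAY<FLOAT64>)\n"
        f"RETURNS STRUCT<{fields}>\n"
        f"AS (STRUCT(\n  {values}\n));"
    )


sql = arr_to_input_sql(20)
print(sql)
```

Calling arr_to_input_sql(20) reproduces the shape of the function above. Passing a different num_dims would only make sense for a model whose output array really has that many elements.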

The resulting ten clusters can be visualized in the BigQuery console:

What do the comments in cluster #1 look like? The query is:

SELECT sentences
FROM ML.PREDICT(MODEL `ai-analytics-solutions.advdata.storm_reports_clustering`, (
  SELECT
    sentences,
    arr_to_input_20(output_0) AS comments_embed,
    EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
    longitude, latitude
  FROM ML.PREDICT(MODEL advdata.swivel_text_embed, (
    SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
    FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
    WHERE EXTRACT(YEAR from timestamp) = 2019
  ))
))
WHERE centroid_id = 1

The result shows that these are mostly short, uninformative comments:

How about cluster #3? Most of these reports seem to have something to do with verification by radar!

Enjoy!

Links

TensorFlow Hub has several text embedding models. You don’t have to use Swivel, although Swivel is a good general-purpose choice.

Full queries are in my notebook on GitHub. You can try out the queries in the BigQuery console or in an AI Platform Jupyter notebook.

Translated from: https://towardsdatascience.com/how-to-do-text-similarity-search-and-document-clustering-in-bigquery-75eb8f45ab65
