Finding Boundaries in Data

One of the more difficult challenges we face when analyzing data is to effectively identify and report on boundaries. Data sets can contain any number of significant starting and stopping points that may indicate significant events, such as missing data, important business events, or actionable changes in usage. Regardless of the use case, knowing how to quickly locate and manage data boundaries is extremely useful. Knowing how to design solutions that can effectively avoid these scenarios can also be helpful in the long run.

Sales Performance

There are many applications in which we want to return summary data that indicates the overall continuity of the data within a table. Whether we are interested in the exceptions where rows are missing, or in the size and shape of data groups, the solutions to this problem can be a challenge without a framework to divide these problems into smaller, simpler ones. In addition, many demos of these problems offer very simplistic or trivial use cases. Our goal here is to recreate a few realistic scenarios in which determining specific boundary data for a result set can be a worthwhile task.

Let’s create an example using sales data from AdventureWorks. Say we are in charge of looking at sales trends over time and have determined that any day on which we sell less than $10,000 is significant and should be investigated further. If our company were performing optimally, there would never be a date on which our sales fall below this amount. The following query returns a row per day, including sales totals, for the period of time tracked in these tables:


SELECT OrderDate, SUM(SubTotal) AS Daily_Order_Total
FROM Sales.SalesOrderHeader
WHERE OrderDate BETWEEN '2005-07-01' AND '2008-07-31'
GROUP BY OrderDate
ORDER BY OrderDate ASC;

Any real data set has variations in it, and here we will find days where we sell ten times the limit defined above, and others where we cannot make even half of it. We can chart this data to show the ups and downs, but there is also use in understanding how sales group together. With that data, we can answer questions such as, “How many days in a row, on average, do we exceed our goal?” “When we fall below the goal, how many days do we do so for?” and “What is the average time that passes from one sales failure to the next?”

To construct a solution that could answer these and many other questions, we must build a summary data set based on the parameters we defined above. One way to parse this data would be to create a list of all days in which sales did not reach the goal:


WITH CTE_SALES AS (
    SELECT OrderDate, SUM(SubTotal) AS Daily_Order_Total
    FROM Sales.SalesOrderHeader
    WHERE OrderDate BETWEEN '2005-07-01' AND '2008-07-31'
    GROUP BY OrderDate)
SELECT OrderDate
FROM CTE_SALES
WHERE Daily_Order_Total < 10000
ORDER BY OrderDate ASC;

This returns a list of dates, in order, for which daily sales were under $10k:

A total of 137 rows were returned, and while this is useful data, it doesn’t answer very many questions without further analysis. From this point, we can put that data into a temporary table and analyze further, or we can find a set-based solution that provides what we want in a single step. For this problem, we will do both!
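
The temporary-table route can be sketched as follows. This is a minimal sketch rather than the original script: it assumes the same AdventureWorks Sales.SalesOrderHeader table queried above, and it writes the under-goal dates to a temporary table named #order_dates.

```sql
-- Sketch: store the dates that missed the $10,000 goal in a temp table.
-- Assumes the AdventureWorks table Sales.SalesOrderHeader used above.
IF OBJECT_ID('tempdb..#order_dates') IS NOT NULL
    DROP TABLE #order_dates;

WITH CTE_SALES AS (
    SELECT OrderDate, SUM(SubTotal) AS Daily_Order_Total
    FROM Sales.SalesOrderHeader
    WHERE OrderDate BETWEEN '2005-07-01' AND '2008-07-31'
    GROUP BY OrderDate)
SELECT OrderDate
INTO #order_dates          -- SELECT ... INTO creates the table on the fly
FROM CTE_SALES
WHERE Daily_Order_Total < 10000;
```

SELECT ... INTO derives the temporary table from the shape of the result set, so no CREATE TABLE is needed; the OBJECT_ID check only makes the script safe to rerun in the same session.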

Let’s say that we placed the above results into a temporary table called #order_dates. To start, we want to summarize the data into a set of date ranges indicating the periods of time in which the sales goal was not met. In order to do this effectively, we need to be able to order these dates and determine how far from the starting date each one is. By collecting this data, we can determine which dates are adjacent and comprise a period of poor sales, which dates stand alone, and which periods have no corresponding rows. To allow for cleaner TSQL, we’ll put the start and end dates for our data set into scalar variables, which could later be parameterized to ensure good performance:


DECLARE @start_date DATE = '2005-07-01';
DECLARE @end_date DATE = '2008-07-31';
SELECT OrderDate, DATEDIFF(DAY, @start_date, OrderDate) AS DaysFromStart
FROM #order_dates
ORDER BY OrderDate;

The result set provides a relationship between the start date for our data and any given row:

Using this information, we can compare this measurement of days from the start of our data with its relative position within the set. The result will be a numeric identifier that is identical for any adjacent rows that occur for consecutive days:


WITH CTE_DAYSFROMSTART AS (
    SELECT OrderDate, DATEDIFF(DAY, @start_date, OrderDate) AS DaysFromStart
    FROM #order_dates)
SELECT
    OrderDate,
    DaysFromStart,
    ROW_NUMBER() OVER (ORDER BY DaysFromStart) AS rownum,
    DaysFromStart - ROW_NUMBER() OVER (ORDER BY DaysFromStart) AS OffsetDays
FROM CTE_DAYSFROMSTART;

The results show the DaysFromStart value that we created earlier, as well as a row number, and the difference between these values:

Note that whenever there are two consecutive days in the list, OffsetDays is equal for those rows. This is the basis for tying together these queries into a set of common table expressions that can solve our original question:


WITH CTE_DAYSFROMSTART AS (
    SELECT OrderDate, DATEDIFF(DAY, @start_date, OrderDate) AS DaysFromStart
    FROM #order_dates),
CTE_DATEDIFF AS (
    SELECT
        OrderDate,
        DaysFromStart,
        DaysFromStart - ROW_NUMBER() OVER (ORDER BY DaysFromStart) AS OffsetDays
    FROM CTE_DAYSFROMSTART)
SELECT
    DATEADD(DAY, MIN(DaysFromStart), @start_date) AS StartDate,
    DATEADD(DAY, MAX(DaysFromStart), @start_date) AS EndDate
FROM CTE_DATEDIFF
GROUP BY OffsetDays;

Since OffsetDays is in chronological order, grouping by it allows us to analyze each individual set of rows that share a value. For each subset, the minimum and maximum values will provide a start and end date for each block of dates:

The end result is a set of date ranges, each of which identifies a day or a run of consecutive days on which sales were below our threshold of $10k. If we’d like to remove a CTE and shorten the above TSQL a bit, we can roll the definition of DaysFromStart into the remainder of the query:


WITH CTE_DATEDIFF AS (
    SELECT
        OrderDate,
        DATEDIFF(DAY, @start_date, OrderDate) AS DaysFromStart,
        DATEDIFF(DAY, @start_date, OrderDate)
            - ROW_NUMBER() OVER (ORDER BY DATEDIFF(DAY, @start_date, OrderDate)) AS OffsetDays
    FROM #order_dates)
SELECT
    DATEADD(DAY, MIN(DaysFromStart), @start_date) AS StartDate,
    DATEADD(DAY, MAX(DaysFromStart), @start_date) AS EndDate
FROM CTE_DATEDIFF
GROUP BY OffsetDays;

This query is logically equivalent to the previous one, and also performs similarly. It is also possible to solve this problem using the original data, without storing the results in a temporary table:


WITH CTE_SALES AS (
    SELECT OrderDate, SUM(SubTotal) AS Daily_Order_Total
    FROM Sales.SalesOrderHeader
    WHERE OrderDate BETWEEN @start_date AND @end_date
    GROUP BY OrderDate),
CTE_DATEDIFF AS (
    SELECT
        OrderDate,
        DATEDIFF(DAY, @start_date, OrderDate) AS DaysFromStart,
        DATEDIFF(DAY, @start_date, OrderDate)
            - ROW_NUMBER() OVER (ORDER BY DATEDIFF(DAY, @start_date, OrderDate)) AS OffsetDays
    FROM CTE_SALES
    WHERE Daily_Order_Total < 10000)
SELECT
    DATEADD(DAY, MIN(DaysFromStart), @start_date) AS StartDate,
    DATEADD(DAY, MAX(DaysFromStart), @start_date) AS EndDate
FROM CTE_DATEDIFF
GROUP BY OffsetDays;

As before, the results are equivalent, and performance of each of these options is also similar. Which one to use would depend on readability, as well as if we’d like to reuse the temp table for any other operations.
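
To see the offset technique at work without AdventureWorks, the following minimal sketch runs it against six made-up dates supplied inline. The CTE_SAMPLE name, the dates, and the DaysInIsland column are assumptions for illustration only.

```sql
-- Hypothetical sample: six under-goal dates forming three islands.
DECLARE @start_date DATE = '2005-07-01';

WITH CTE_SAMPLE AS (
    SELECT CAST(d AS DATE) AS OrderDate
    FROM (VALUES ('2005-07-03'), ('2005-07-04'), ('2005-07-05'),
                 ('2005-07-09'),
                 ('2005-07-12'), ('2005-07-13')) AS v(d)),
CTE_DATEDIFF AS (
    SELECT
        OrderDate,
        -- DaysFromStart is 2, 3, 4, 8, 11, 12; the row number is 1 through 6,
        -- so OffsetDays comes out as 1, 1, 1, 4, 6, 6.
        DATEDIFF(DAY, @start_date, OrderDate)
            - ROW_NUMBER() OVER (ORDER BY OrderDate) AS OffsetDays
    FROM CTE_SAMPLE)
SELECT
    MIN(OrderDate) AS StartDate,
    MAX(OrderDate) AS EndDate,
    COUNT(*) AS DaysInIsland
FROM CTE_DATEDIFF
GROUP BY OffsetDays
ORDER BY StartDate;
```

Consecutive dates advance by one day while the row number also advances by one, so their difference stays constant inside a run and jumps whenever a day is skipped. Grouping on that difference therefore yields one row per run: July 3 to 5, July 9 alone, and July 12 to 13.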

There are other ways to solve this problem. For example, if we wanted to get a list of all date ranges in which we did meet our sales goal, we could create start points and end points as we did above, but in separate steps:


DECLARE @start_date DATE = '2005-07-01';
DECLARE @end_date DATE = '2008-07-31';
SELECT OrderDate
FROM Sales.SalesOrderHeader
WHERE OrderDate BETWEEN @start_date AND @end_date
GROUP BY OrderDate
HAVING SUM(SubTotal) >= 10000
ORDER BY OrderDate;
WITH CTE_SALES AS (
    SELECT OrderDate
    FROM Sales.SalesOrderHeader
    WHERE OrderDate BETWEEN @start_date AND @end_date
    GROUP BY OrderDate
    HAVING SUM(SubTotal) >= 10000),
CTE_StartDates AS (
    SELECT OrderDate, ROW_NUMBER() OVER (ORDER BY OrderDate) AS rownum
    FROM CTE_SALES AS SalesBaseDay
    WHERE NOT EXISTS (
        SELECT *
        FROM CTE_SALES AS SalesAdjacentDay
        WHERE DATEDIFF(DAY, SalesAdjacentDay.OrderDate, SalesBaseDay.OrderDate) = 1))
SELECT *
FROM CTE_StartDates;

CTE_StartDates selects from the same base data set that we worked with earlier, but performs an existence check to determine whether a row exists for the day immediately before the current row. Dates with no such predecessor mark the start of a run. The original data is selected above this to serve as a reference:

Note the breaks in the OrderDate list that signify days in which the sales goal was not met and the corresponding rows below that indicate the start of each continuous set of days in which the goal was met. We can perform an almost identical task in order to retrieve the end dates for each group of days by changing the DATEDIFF to check for -1 rather than 1:

 WHERE DATEDIFF(DAY, SalesAdjacentDay.OrderDate, SalesBaseDay.OrderDate) = -1))

The results are as follows:

Note that each row in the second result set corresponds to the end of a set of good sales days in the first result set. With starting and ending points defined, the task of combining them is the same as it was in the previous analysis:


DECLARE @start_date DATE = '2005-07-01';
DECLARE @end_date DATE = '2008-07-31';
WITH CTE_SALES AS (
    SELECT OrderDate
    FROM Sales.SalesOrderHeader
    WHERE OrderDate BETWEEN @start_date AND @end_date
    GROUP BY OrderDate
    HAVING SUM(SubTotal) >= 10000),
CTE_StartDates AS (
    SELECT OrderDate, ROW_NUMBER() OVER (ORDER BY OrderDate) AS rownum
    FROM CTE_SALES AS SalesBaseDay
    WHERE NOT EXISTS (
        SELECT *
        FROM CTE_SALES AS SalesAdjacentDay
        WHERE DATEDIFF(DAY, SalesAdjacentDay.OrderDate, SalesBaseDay.OrderDate) = 1)),
CTE_EndDates AS (
    SELECT OrderDate, ROW_NUMBER() OVER (ORDER BY OrderDate) AS rownum
    FROM CTE_SALES AS SalesBaseDay
    WHERE NOT EXISTS (
        SELECT *
        FROM CTE_SALES AS SalesAdjacentDay
        WHERE DATEDIFF(DAY, SalesAdjacentDay.OrderDate, SalesBaseDay.OrderDate) = -1))
SELECT
    CTE_StartDates.OrderDate AS StartDate,
    CTE_EndDates.OrderDate AS EndDate
FROM CTE_StartDates
INNER JOIN CTE_EndDates
ON CTE_StartDates.rownum = CTE_EndDates.rownum;

By joining each set of dates based on the row number, we end up with the inverse of the StartDate/EndDate list from earlier. This time we report on the dates in which sales met the goal, rather than those dates where it wasn’t achieved:

Given this data, we can determine a huge variety of metrics that could be useful for further analysis. Each of these metrics can be used in place of the final SELECT statement above:

Average length in days in a successful sales streak:


SELECT AVG(CAST(DATEDIFF(DAY, CTE_StartDates.OrderDate, CTE_EndDates.OrderDate) + 1 AS DECIMAL)) AS AverageSuccessDays
FROM CTE_StartDates
INNER JOIN CTE_EndDates
ON CTE_StartDates.rownum = CTE_EndDates.rownum

Average length in days of unsuccessful sales streaks:


-- Append to the CTE list above, after CTE_EndDates (hence the leading comma).
, CTE_ISLANDS AS (
    SELECT
        CTE_StartDates.OrderDate AS StartDate,
        CTE_EndDates.OrderDate AS EndDate,
        ROW_NUMBER() OVER (ORDER BY CTE_StartDates.OrderDate ASC) AS rownum
    FROM CTE_StartDates
    INNER JOIN CTE_EndDates
    ON CTE_StartDates.rownum = CTE_EndDates.rownum)
SELECT AVG(CAST(DATEDIFF(DAY, LeadingRow.EndDate, LaggingRow.StartDate) - 1 AS DECIMAL)) AS AverageFailureDays
FROM CTE_ISLANDS LeadingRow
INNER JOIN CTE_ISLANDS LaggingRow
ON LaggingRow.rownum = LeadingRow.rownum + 1;

Number of successful sales streaks started in each month for all data:


SELECT DATEPART(MONTH, CTE_StartDates.OrderDate) AS StartMonth, COUNT(DATEPART(MONTH, CTE_StartDates.OrderDate)) AS NumberOfWinningStreaksStarted
FROM CTE_StartDates
INNER JOIN CTE_EndDates
ON CTE_StartDates.rownum = CTE_EndDates.rownum
GROUP BY DATEPART(MONTH, CTE_StartDates.OrderDate)

Note that the number of winning streaks started per month is also indicative of the number of poor sales days per month. As a result, a metric such as this may provide multiple roads for analysis.

Any of these metrics (and others) could be calculated by inserting the initial results into a temp table, in order to remove potential performance problems when nesting so many CTEs together.
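
A minimal sketch of that temp-table approach follows. It assumes the same AdventureWorks table as above, uses a hypothetical table name (#sales_islands), and builds the ranges with the row-number offset technique from earlier rather than the start/end join, purely to keep the sketch short.

```sql
DECLARE @start_date DATE = '2005-07-01';
DECLARE @end_date DATE = '2008-07-31';

-- Build the successful-sales date ranges once and store them.
WITH CTE_SALES AS (
    SELECT OrderDate
    FROM Sales.SalesOrderHeader
    WHERE OrderDate BETWEEN @start_date AND @end_date
    GROUP BY OrderDate
    HAVING SUM(SubTotal) >= 10000),
CTE_DATEDIFF AS (
    SELECT
        OrderDate,
        DATEDIFF(DAY, @start_date, OrderDate)
            - ROW_NUMBER() OVER (ORDER BY OrderDate) AS OffsetDays
    FROM CTE_SALES)
SELECT MIN(OrderDate) AS StartDate, MAX(OrderDate) AS EndDate
INTO #sales_islands        -- hypothetical temp table name
FROM CTE_DATEDIFF
GROUP BY OffsetDays;

-- Each metric now reads the small temp table instead of re-running the CTEs.
SELECT AVG(CAST(DATEDIFF(DAY, StartDate, EndDate) + 1 AS DECIMAL(10, 2))) AS AverageSuccessDays
FROM #sales_islands;

SELECT DATEPART(MONTH, StartDate) AS StartMonth, COUNT(*) AS NumberOfWinningStreaksStarted
FROM #sales_islands
GROUP BY DATEPART(MONTH, StartDate)
ORDER BY StartMonth;

DROP TABLE #sales_islands;
```

The trade-off is an extra write to tempdb in exchange for computing the ranges only once, which tends to matter more as the number of metrics grows.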

Transaction History

Another scenario in which boundaries may be useful to analyze is the order history for a specific product. When are the largest gaps in sales, and what events lead up to them? How do we translate specific data requests into a query whose results we can analyze as we did previously?

Let’s consider the situation where we are interested in mountain tire sales. Specifically, we want to understand the quantity sold per day and understand those periods in which we sell the least per day. To do this, we will create a TSQL query that filters the transaction history data specifically for mountain tires where the transaction type is a Sale, and where the number of distinct transactions is less than 15:


SELECT TransactionHistory.TransactionDate, COUNT(*) AS TransactionsCount
FROM Production.TransactionHistory
INNER JOIN Production.Product
ON TransactionHistory.ProductID = Product.ProductID
WHERE Product.Name LIKE '%Mountain Tire%'
AND TransactionHistory.TransactionType = 'S'
GROUP BY TransactionHistory.TransactionDate
HAVING COUNT(TransactionID) < 15
ORDER BY TransactionHistory.TransactionDate;

This returns one row per date, with the number of transactions for that particular day:

With this data set defined, we can turn this data into date ranges in which we provide some aggregate transaction details:


DECLARE @start_date DATE = '2007-09-01';
DECLARE @end_date DATE = '2008-07-31';
WITH CTE_TRANSACTIONS AS (
    SELECT TransactionHistory.TransactionDate, COUNT(*) AS TransactionsCount
    FROM Production.TransactionHistory
    INNER JOIN Production.Product
    ON TransactionHistory.ProductID = Product.ProductID
    WHERE Product.Name LIKE '%Mountain Tire%'
    AND TransactionHistory.TransactionType = 'S'
    GROUP BY TransactionHistory.TransactionDate
    HAVING COUNT(TransactionID) < 15),
CTE_DATEDIFF AS (
    SELECT
        TransactionDate,
        DATEDIFF(DAY, @start_date, TransactionDate) AS DaysFromStart,
        DATEDIFF(DAY, @start_date, TransactionDate)
            - ROW_NUMBER() OVER (ORDER BY DATEDIFF(DAY, @start_date, TransactionDate)) AS OffsetDays,
        TransactionsCount
    FROM CTE_TRANSACTIONS)
SELECT
    DATEADD(DAY, MIN(DaysFromStart), @start_date) AS StartDate,
    DATEADD(DAY, MAX(DaysFromStart), @start_date) AS EndDate,
    SUM(TransactionsCount) AS TransactionsCount
FROM CTE_DATEDIFF
GROUP BY OffsetDays;

This query returns the groups of days in which the transaction count was under 15, as well as the number of transactions for that given period:

The ability to carry aggregate metrics such as count, sum, min, or max through these data transformations can save time and resources later on, in the event that we need transaction counts, revenue, or other data in addition to dates. Since Production.TransactionHistory is a large table, I will place the above results in a temporary table, #Transaction_Data, and calculate our last few metrics using this small data set instead.
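
As a small illustration of carrying several aggregates through the grouping at once, the sketch below uses made-up daily counts supplied inline. The CTE_SAMPLE name, the values, and the FewestInOneDay and MostInOneDay columns are assumptions, not part of the AdventureWorks schema.

```sql
-- Hypothetical sample: daily transaction counts with two missing stretches.
WITH CTE_SAMPLE AS (
    SELECT CAST(d AS DATE) AS TransactionDate, n AS TransactionsCount
    FROM (VALUES ('2007-09-01', 4), ('2007-09-02', 9), ('2007-09-03', 2),
                 ('2007-09-06', 11),
                 ('2007-09-08', 7), ('2007-09-09', 5)) AS v(d, n)),
CTE_DATEDIFF AS (
    SELECT
        TransactionDate,
        TransactionsCount,
        DATEDIFF(DAY, '2007-09-01', TransactionDate)
            - ROW_NUMBER() OVER (ORDER BY TransactionDate) AS OffsetDays
    FROM CTE_SAMPLE)
SELECT
    MIN(TransactionDate) AS StartDate,
    MAX(TransactionDate) AS EndDate,
    SUM(TransactionsCount) AS TransactionsCount,
    MIN(TransactionsCount) AS FewestInOneDay,
    MAX(TransactionsCount) AS MostInOneDay
FROM CTE_DATEDIFF
GROUP BY OffsetDays
ORDER BY StartDate;
```

Because every aggregate is computed in the same GROUP BY, each date range arrives with its totals attached: here the first range (September 1 to 3) carries a total of 15, a low of 2, and a high of 9.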

Top losing streaks by fewest transactions per day:


WITH CTE_AVERAGES AS (
    SELECT StartDate, EndDate, DATEDIFF(DAY, StartDate, EndDate) + 1 AS TimeSpan, TransactionsCount
    FROM #Transaction_Data)
SELECT TOP 10
    CAST(CAST(TransactionsCount AS DECIMAL) / CAST(TimeSpan AS DECIMAL) AS DECIMAL(10,2)) AS AveragePerTimeSpan,
    StartDate,
    EndDate,
    TransactionsCount
FROM CTE_AVERAGES
ORDER BY CAST(TransactionsCount AS DECIMAL) / CAST(TimeSpan AS DECIMAL) ASC;

This returns an ordered list of the time spans with the fewest transactions:

Months with the most transactions accounted for via low activity periods that started in that month:


WITH CTE_MONTH_DATA AS (
    SELECT
        DATEPART(MONTH, StartDate) AS TransactionMonth,
        TransactionsCount,
        DATEDIFF(DAY, StartDate, EndDate) + 1 AS DaysIncluded
    FROM #Transaction_Data)
SELECT
    TransactionMonth,
    SUM(TransactionsCount) AS AggregateTransactionCount,
    COUNT(*) AS NumberOfDateRanges,
    SUM(DaysIncluded) AS DaysIncludedInAggregate
FROM CTE_MONTH_DATA
GROUP BY TransactionMonth;

Looking at this data, it would appear that July is the most volatile month, with the largest percentage of the data appearing in low-activity periods, whereas December has the fewest periods of low activity. Note that August is excluded from this data because the underlying results include no entries from that month.

Baseball Statistics

There are few areas more suitable for statistical analysis than sports, and of the many sports played around the world, few rival baseball for the volume, accuracy, and complexity of available data. Before diving into a few additional examples, let’s summarize a basic method for effectively analyzing data in order to work with data boundaries:

  1. Determine the criteria needed to enumerate a desired boundary.
  2. Use those criteria to order the data based on whatever dates, times, or sequences are to be analyzed.
  3. If necessary, convert that order into an integer sequence, which will allow for much easier analysis.
  4. Calculate start and end points for gaps or islands formed by those boundaries.
  5. Use the resulting data ranges to calculate whatever metrics you are looking for.
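
The steps above can be sketched for the gap case, where the interest is in what is missing rather than what is present. This is a minimal illustration on a made-up integer sequence; the names CTE_SEQUENCE, GapStart, GapEnd, and MissingValues are assumptions, and LEAD requires SQL Server 2012 or later.

```sql
-- Hypothetical integer sequence with missing values 4-6 and 9-11.
WITH CTE_SEQUENCE AS (
    SELECT n
    FROM (VALUES (1), (2), (3), (7), (8), (12)) AS v(n)),
CTE_NEXT AS (
    -- Pair each value with the next value in order.
    SELECT n, LEAD(n) OVER (ORDER BY n) AS next_n
    FROM CTE_SEQUENCE)
SELECT
    n + 1 AS GapStart,
    next_n - 1 AS GapEnd,
    next_n - n - 1 AS MissingValues
FROM CTE_NEXT
WHERE next_n - n > 1      -- the last row has a NULL next_n and drops out here
ORDER BY GapStart;
```

Dates can be fed through the same query once they are converted to an integer sequence, for example with DATEDIFF against a fixed start date as in the sales examples.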

Using data from Retrosheet.org, a non-profit organization that collects and makes publicly available baseball statistics, I’ve created a new database called BaseballStats and loaded game log data into a new table called dbo.GameLog. This table contains a row per game played, and includes data on each team, who played, and a plethora of statistics about how the game was played. This includes everything from the name of the third base umpire to how many triples were hit by each team. The data used here covers all games played from 1871 through 2015: 212,435 games in total.

A sample of data from this table looks like this:

While this table contains 163 columns, we’ll only focus on a small subset of them, in an effort to keep this simple.

Some of the most commonly requested statistics in sports involve winning streaks. Not only are we interested in knowing how many games a team currently has won in a row, but we may also have more involved questions, such as:

  1. What is the longest winning streak a team has had in the past 5 years?
  2. What is the longest losing streak a team has had versus a specific other team?
  3. How many games has a specific pitcher won in a row?
  4. How much longer are losing streaks by a team when at least one error was made?
  5. What is the longest winning streak by a team, at home, on Sundays, at night?

While the last one is a bit of a joke, statistical analysis of sports can get this in-depth. There are players that legitimately pitch better at night, hit more home runs after a day of rest, or hit significantly worse against lefties. Managers that are looking for an edge in a close matchup are more than happy to make minor changes for small gains overall.

For our first metric, let’s determine when the longest regular season winning streak in the history of the Yankees was, and how many games they won before their streak came to an end. In order to do this, we need to create an ordered list of games in which they played:


SELECT
    *,
    CASE
        WHEN HomeScore > VisitingScore AND HomeTeamName = 'NYA' THEN 'W'
        WHEN HomeScore > VisitingScore AND VisitingTeamName = 'NYA' THEN 'L'
        WHEN VisitingScore > HomeScore AND VisitingTeamName = 'NYA' THEN 'W'
        WHEN VisitingScore > HomeScore AND HomeTeamName = 'NYA' THEN 'L'
    END AS Result
FROM dbo.GameLog
WHERE (HomeTeamName = 'NYA' OR VisitingTeamName = 'NYA')
AND GameType = 'REG'
ORDER BY GameDate ASC, GameNumber ASC;

This returns all 17658 games they have played in since 1903:

The Result column we create will greatly simplify calculations later on, as we’ll have an easy way to identify winning vs. losing games. With this information, we’ll build a sequence where wins signify a bounded set of interest, and losses are gaps in that data:


WITH CTE_BASEBALL_GAMES AS (
    SELECT
        *,
        CASE
            WHEN HomeScore > VisitingScore AND HomeTeamName = 'NYA' THEN 'W'
            WHEN HomeScore > VisitingScore AND VisitingTeamName = 'NYA' THEN 'L'
            WHEN VisitingScore > HomeScore AND VisitingTeamName = 'NYA' THEN 'W'
            WHEN VisitingScore > HomeScore AND HomeTeamName = 'NYA' THEN 'L'
        END AS Result,
        ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS rownum
    FROM dbo.GameLog
    WHERE (HomeTeamName = 'NYA' OR VisitingTeamName = 'NYA')
    AND GameType = 'REG'
    AND (HomeScore > VisitingScore OR VisitingScore > HomeScore)),
CTE_StartofWinningStreak AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS StartNum
    FROM CTE_BASEBALL_GAMES AS GameDay
    WHERE Result = 'W'
    AND EXISTS (
        SELECT *
        FROM CTE_BASEBALL_GAMES AS PreviousGame
        WHERE GameDay.rownum = PreviousGame.rownum + 1
        AND PreviousGame.Result = 'L')),
CTE_EndofWinningStreak AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS EndNum
    FROM CTE_BASEBALL_GAMES AS GameDay
    WHERE Result = 'W'
    AND EXISTS (
        SELECT *
        FROM CTE_BASEBALL_GAMES AS NextGame
        WHERE GameDay.rownum = NextGame.rownum - 1
        AND NextGame.Result = 'L'))
SELECT
    CTE_StartofWinningStreak.GameDate AS StartDate,
    CTE_EndofWinningStreak.GameDate AS EndDate,
    *
FROM CTE_StartofWinningStreak
INNER JOIN CTE_EndofWinningStreak
ON CTE_StartofWinningStreak.StartNum = CTE_EndofWinningStreak.EndNum
ORDER BY CTE_StartofWinningStreak.GameDate, CTE_StartofWinningStreak.GameNumber;

Our base data set (CTE_BASEBALL_GAMES) checks for only regular season games, only those where the result was a win or loss—ties are excluded. The second and third CTEs each determine a start and end boundary for each winning streak. Since multiple games can be played in a single day, we consistently order data by GameDate and GameNumber. The results of this query are the start and end dates of each winning streak, as well as all of the data associated with the starting game in the streak:

From this data set, we can calculate the length of their longest winning streak as follows:


WITH CTE_BASEBALL_GAMES AS (
    SELECT
        *,
        CASE
            WHEN HomeScore > VisitingScore AND HomeTeamName = 'NYA' THEN 'W'
            WHEN HomeScore > VisitingScore AND VisitingTeamName = 'NYA' THEN 'L'
            WHEN VisitingScore > HomeScore AND VisitingTeamName = 'NYA' THEN 'W'
            WHEN VisitingScore > HomeScore AND HomeTeamName = 'NYA' THEN 'L'
        END AS Result,
        ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS rownum
    FROM dbo.GameLog
    WHERE (HomeTeamName = 'NYA' OR VisitingTeamName = 'NYA')
    AND GameType = 'REG'
    AND (HomeScore > VisitingScore OR VisitingScore > HomeScore)),
CTE_StartofWinningStreak AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS StartNum
    FROM CTE_BASEBALL_GAMES AS GameDay
    WHERE Result = 'W'
    AND EXISTS (
        SELECT *
        FROM CTE_BASEBALL_GAMES AS PreviousGame
        WHERE GameDay.rownum = PreviousGame.rownum + 1
        AND PreviousGame.Result = 'L')),
CTE_EndofWinningStreak AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS EndNum
    FROM CTE_BASEBALL_GAMES AS GameDay
    WHERE Result = 'W'
    AND EXISTS (
        SELECT *
        FROM CTE_BASEBALL_GAMES AS NextGame
        WHERE GameDay.rownum = NextGame.rownum - 1
        AND NextGame.Result = 'L'))
SELECT
    CTE_EndofWinningStreak.rownum - CTE_StartofWinningStreak.rownum + 1 AS WinningStreak,
    CTE_StartofWinningStreak.GameDate AS StartGameDate,
    CTE_StartofWinningStreak.GameNumber AS StartGameNumber,
    CTE_EndofWinningStreak.GameDate AS EndGameDate,
    CTE_EndofWinningStreak.GameNumber AS EndGameNumber,
    *
FROM CTE_StartofWinningStreak
INNER JOIN CTE_EndofWinningStreak
ON CTE_StartofWinningStreak.StartNum = CTE_EndofWinningStreak.EndNum
ORDER BY CTE_EndofWinningStreak.rownum - CTE_StartofWinningStreak.rownum DESC;

This returns all winning streaks sorted by their length, with the top winning streak appearing first in the list:

From this data, we can conclude that the longest winning streak in the history of the New York Yankees was 19 games, starting in the second game of a double header on 6/29/1947 versus the Washington Senators and ending after a double header on 7/17/1947.

Using this method, we could calculate winning streaks under any conditions by modifying the initial SELECT statement in CTE_BASEBALL_GAMES, allowing us to customize the date span, filter on the presence of specific players, teams, or anything else we can dream up.
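
One caveat is worth checking before reusing this pattern with other filters: the EXISTS checks only recognize a streak that has a loss on both sides, so a streak that opens or closes the filtered data set gets no start or end row, and the row-number join can then pair the wrong boundaries. A minimal sketch of an alternative that does not depend on neighboring rows is shown below. It uses made-up results (GameSeq and the W/L values are hypothetical, not Retrosheet data) and the same row-number-difference idea as the OffsetDays queries earlier.

```sql
-- Hypothetical results: the sample both starts and ends on a win.
WITH CTE_GAMES AS (
    SELECT GameSeq, Result
    FROM (VALUES (1, 'W'), (2, 'W'), (3, 'L'), (4, 'W'), (5, 'W'),
                 (6, 'W'), (7, 'L'), (8, 'L'), (9, 'W')) AS v(GameSeq, Result)),
CTE_GROUPED AS (
    SELECT
        GameSeq,
        Result,
        -- Position among all games minus position among games with the same
        -- result: constant within a streak, larger after each interruption.
        ROW_NUMBER() OVER (ORDER BY GameSeq)
            - ROW_NUMBER() OVER (PARTITION BY Result ORDER BY GameSeq) AS StreakGroup
    FROM CTE_GAMES)
SELECT
    MIN(GameSeq) AS FirstGame,
    MAX(GameSeq) AS LastGame,
    COUNT(*) AS WinningStreak
FROM CTE_GROUPED
WHERE Result = 'W'
GROUP BY StreakGroup
ORDER BY WinningStreak DESC, FirstGame;
```

On this sample the query reports three streaks: games 4 to 6 (three wins), games 1 to 2 (two wins), and game 9 alone, including the two streaks at the edges of the data.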

For one final example, let’s analyze pitching performance and determine the distribution of winning streaks by the legendary Nolan Ryan. In order to do this, we’ll need to start with a data set of all games he pitched, flagged with whether he won or lost. The data from Retrosheet includes the names of all players on each team, as well as the names of the winning and losing pitchers. Using this, we can flag each game he pitched as a Win (W), Loss (L), or No decision (N). A winning streak will only be counted if every game in it was won and Nolan Ryan started each game (games where he was brought in as a reliever won’t be included):


DECLARE @PitcherName VARCHAR(100) = 'Nolan Ryan';
SELECT
    CASE
        WHEN WinningPitcherName = @PitcherName THEN 'W'
        WHEN LosingPitcherName = @PitcherName THEN 'L'
        ELSE 'N'
    END AS Result,
    ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS rownum,
    *
FROM dbo.GameLog
WHERE (VisitingStartingPitcherName = @PitcherName OR HomeStartingPitcherName = @PitcherName)
ORDER BY GameDate ASC, GameNumber ASC;

The results are similar to earlier and provide win/loss details for every game that he started:

We can take this query and determine winning and losing streaks almost identically to our previous calculations for the Yankees:


DECLARE @PitcherName VARCHAR(100) = 'Nolan Ryan';
WITH CTE_PITCHING AS (
    SELECT
        CASE
            WHEN WinningPitcherName = @PitcherName THEN 'W'
            WHEN LosingPitcherName = @PitcherName THEN 'L'
            ELSE 'N'
        END AS Result,
        ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS rownum,
        *
    FROM dbo.GameLog
    WHERE (VisitingStartingPitcherName = @PitcherName OR HomeStartingPitcherName = @PitcherName)),
CTE_StartofWinningStreak AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS StartNum
    FROM CTE_PITCHING AS GameDay
    WHERE Result = 'W'
    AND EXISTS (
        SELECT *
        FROM CTE_PITCHING AS PreviousGame
        WHERE GameDay.rownum = PreviousGame.rownum + 1
        AND PreviousGame.Result <> 'W')),
CTE_EndofWinningStreak AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS EndNum
    FROM CTE_PITCHING AS GameDay
    WHERE Result = 'W'
    AND EXISTS (
        SELECT *
        FROM CTE_PITCHING AS NextGame
        WHERE GameDay.rownum = NextGame.rownum - 1
        AND NextGame.Result <> 'W'))
SELECT
    CTE_EndofWinningStreak.rownum - CTE_StartofWinningStreak.rownum + 1 AS WinningStreak,
    CTE_StartofWinningStreak.GameDate AS StartGameDate,
    CTE_StartofWinningStreak.GameNumber AS StartGameNumber,
    CTE_EndofWinningStreak.GameDate AS EndGameDate,
    CTE_EndofWinningStreak.GameNumber AS EndGameNumber,
    *
FROM CTE_StartofWinningStreak
INNER JOIN CTE_EndofWinningStreak
ON CTE_StartofWinningStreak.StartNum = CTE_EndofWinningStreak.EndNum
ORDER BY CTE_EndofWinningStreak.rownum - CTE_StartofWinningStreak.rownum DESC;

The strategy is the same: start with the core data we wish to analyze, determine the start and end points of each winning streak, then join them together to find the number of games in each streak. Once that data is calculated, grouping it to get the distribution of winning streaks is a piece of cake:


DECLARE @PitcherName VARCHAR(100) = 'Nolan Ryan';

WITH CTE_PITCHING AS (
    SELECT
        CASE
            WHEN WinningPitcherName = @PitcherName THEN 'W'
            WHEN LosingPitcherName = @PitcherName THEN 'L'
            ELSE 'N'
        END AS Result,
        ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS rownum,
        *
    FROM dbo.GameLog
    WHERE (VisitingStartingPitcherName = @PitcherName OR HomeStartingPitcherName = @PitcherName)
),
CTE_StartofWinningStreak AS (
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS StartNum
    FROM CTE_PITCHING AS GameDay
    WHERE Result = 'W'
    AND EXISTS (
        SELECT *
        FROM CTE_PITCHING AS PreviousGame
        WHERE GameDay.rownum = PreviousGame.rownum + 1
        AND PreviousGame.Result <> 'W')
),
CTE_EndofWinningStreak AS (
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY GameDate ASC, GameNumber ASC) AS EndNum
    FROM CTE_PITCHING AS GameDay
    WHERE Result = 'W'
    AND EXISTS (
        SELECT *
        FROM CTE_PITCHING AS NextGame
        WHERE GameDay.rownum = NextGame.rownum - 1
        AND NextGame.Result <> 'W')
)
SELECT
    CTE_EndofWinningStreak.rownum - CTE_StartofWinningStreak.rownum + 1 AS WinningStreak,
    CTE_StartofWinningStreak.GameDate AS StartGameDate,
    CTE_StartofWinningStreak.GameNumber AS StartGameNumber,
    CTE_EndofWinningStreak.GameDate AS EndGameDate,
    CTE_EndofWinningStreak.GameNumber AS EndGameNumber
INTO #NolanRyanWinningStreaks
FROM CTE_StartofWinningStreak
INNER JOIN CTE_EndofWinningStreak
ON CTE_StartofWinningStreak.StartNum = CTE_EndofWinningStreak.EndNum
ORDER BY CTE_EndofWinningStreak.rownum - CTE_StartofWinningStreak.rownum DESC;

SELECT
    WinningStreak AS NumberOfGames,
    COUNT(*) AS NumberOfWinningStreaks
FROM #NolanRyanWinningStreaks
GROUP BY WinningStreak
ORDER BY COUNT(*) DESC;

DROP TABLE #NolanRyanWinningStreaks;

The result set shows exactly what we were looking for.
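
The grouping step can also be tried on its own. The sketch below assumes a hypothetical `winning_streaks` temporary table holding made-up streak lengths (a stand-in for `#NolanRyanWinningStreaks`, not its real contents) and uses Python's standard-library sqlite3 module instead of SQL Server. It adds a tie-breaker to the `ORDER BY`, which the original query does not have, so that the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# TEMP table standing in for the #temp table; the streak lengths are made up.
conn.execute("CREATE TEMP TABLE winning_streaks (winning_streak INTEGER)")
conn.executemany(
    "INSERT INTO winning_streaks (winning_streak) VALUES (?)",
    [(1,), (1,), (2,), (1,), (3,), (2,), (1,)],
)

# Count how many streaks there are of each length, most common first.
distribution = conn.execute(
    """
    SELECT winning_streak AS number_of_games,
           COUNT(*) AS number_of_winning_streaks
    FROM winning_streaks
    GROUP BY winning_streak
    ORDER BY COUNT(*) DESC, winning_streak ASC
    """
).fetchall()

print(distribution)  # [(1, 4), (2, 2), (3, 1)]
```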

The beauty of data analysis such as this is that many of the questions we ask cannot be looked up on Google, nor will specific statistics sources have the answers readily available. To dig into some of the wackier, more in-depth questions we can ask, the only solution is to load the raw data into a database and use our SQL knowledge to find what we are looking for.

Conclusion

Finding sequences of events and analyzing them can be an extremely effective way to answer difficult questions with ease. Whether it be sales performance or wins by the Mets on Wednesday nights in July, there is always a way to determine boundary conditions and use them to quantify what lies between them.

These tactics allow us to do much more than just find breaks in sequences of integers. We can order and analyze data by date or time, even when the data set has no uniquely identifying identity column. The strategies employed above can be extended to any ordered data to gain insight into groupings of events over time, as well as how that data is bounded.
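
As a rough illustration of working with date-ordered data that has no identity column, the sketch below generates the ordering with `ROW_NUMBER()` and then groups consecutive wins by the difference between two row numbers. This is a different, commonly used gaps-and-islands technique, not the start/end join used in the queries above. The `game_log` table, its columns, and every row in it are hypothetical, and the code uses Python's standard-library sqlite3 module (assuming SQLite 3.25 or later) rather than SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game_log (game_date TEXT, game_number INTEGER, result TEXT)"
)

# Made-up games with no identity column; one date holds a doubleheader.
conn.executemany(
    "INSERT INTO game_log (game_date, game_number, result) VALUES (?, ?, ?)",
    [
        ("2001-07-04", 0, "W"),
        ("2001-07-11", 0, "L"),
        ("2001-07-18", 1, "W"),
        ("2001-07-18", 2, "W"),
        ("2001-07-25", 0, "W"),
        ("2001-08-01", 0, "L"),
        ("2001-08-08", 0, "W"),
    ],
)

query = """
WITH ordered AS (
    SELECT game_date, game_number, result,
           -- Position of the game among all games.
           ROW_NUMBER() OVER (ORDER BY game_date, game_number) AS rownum,
           -- Position of the game among games with the same result.
           ROW_NUMBER() OVER (PARTITION BY result
                              ORDER BY game_date, game_number) AS result_num
    FROM game_log
)
-- Consecutive wins share the same (rownum - result_num) value.
SELECT MIN(game_date) AS start_date,
       MAX(game_date) AS end_date,
       COUNT(*) AS streak_length
FROM ordered
WHERE result = 'W'
GROUP BY rownum - result_num
ORDER BY start_date
"""

islands = conn.execute(query).fetchall()
for island in islands:
    print(island)
```

Because each streak is identified by a grouping key instead of by neighboring rows, this variant also picks up a streak that begins with the first row or ends with the last one.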

Incidentally, the Mets have played 119 games on Wednesday nights in July, of which they won 63. Their longest winning streak in those games was 6, starting on Wednesday, July 28, 1993, and ending after their win on July 3, 1996.



Translated from: https://www.sqlshack.com/data-boundaries-finding-gaps-islands-and-more/
