Reporting and Alerting on Job Failures in SQL Server

SQL Server Agent can be used to run a wide variety of tasks within SQL Server. The built-in monitoring tools, though, are not well-suited for environments with many servers or many databases.

Removing reliance on default notifications and building our own processes can allow for greater flexibility, less alerting noise, and the ability to track failure conditions that are not typically tracked by SQL Server!

Introduction

At the heart of the SQL Server Agent service is the ability to create, schedule, and customize jobs. These jobs can be given schedules that determine at what times of day a task should execute. Jobs can also be given triggers, such as a server restart or alert to respond to. Jobs can also be called via TSQL from anywhere that has the appropriate access and permissions to SQL Server Agent.

The built-in notification system allows you to define operators and contact them when a job fails. While convenient, this does not scale well when large numbers of servers are involved, large numbers of jobs exist, or when more complex jobs are devised.

Customizing a job failure notification process is far simpler than it may seem, and in this article, we will walk through how to gather the necessary SQL Server Agent history data and use it to alert operators in far more meaningful ways than the default tools allow.

How Does SQL Server Agent Track Job Status?

SQL Server Agent maintains job, schedule, and execution details in tables within the MSDB database. The following tables provide what we will need to capture to track and alert on failures:

MSDB.dbo.sysjobs

This table contains a row per SQL Server Agent job defined on a given SQL Server instance:

The job_id is a UNIQUEIDENTIFIER that ensures a unique primary key for each job. This table also provides the name, description, create/modify dates, and a variety of other useful information about the job. This table can be queried to determine how many jobs exist on a server or to search based on a specific string in job names or descriptions. We could also search based on owner_sid to determine if any jobs are owned by the wrong login (such as a departing employee or job creator).

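As a rough illustration of the lookups described above, the following sketch counts the jobs on an instance and then lists the jobs whose name or description contains a search string, along with the login that owns each one. The search term 'backup' is only a placeholder, and using SUSER_SNAME to resolve owner_sid to a login name is one possible approach rather than the only one.

```sql
-- How many SQL Server Agent jobs are defined on this instance?
SELECT
    COUNT(*) AS job_count
FROM msdb.dbo.sysjobs;

-- Search job names and descriptions for a string ('backup' is a placeholder)
-- and show the owning login. SUSER_SNAME returns NULL when the SID cannot be
-- resolved, which is itself a hint that the owner may no longer exist.
SELECT
    sysjobs.job_id,
    sysjobs.name,
    sysjobs.enabled,
    SUSER_SNAME(sysjobs.owner_sid) AS owner_login,
    sysjobs.date_created,
    sysjobs.date_modified
FROM msdb.dbo.sysjobs
WHERE sysjobs.name LIKE '%backup%'
OR sysjobs.description LIKE '%backup%'
ORDER BY sysjobs.name;
```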

MSDB.dbo.syscategories

Within SQL Server Agent, you may select a category for each job. This allows for classification of reports, data collection processes, alerts, etc. When acting on jobs, we can take that category into account in order to increase the priority of some jobs or decrease the priority of others. For example, we could set some jobs as unmonitored and ignore them. Alternatively, we could set some jobs as critical and have them hit up the on-call operator’s cell phone the moment they fail.

This table is simple enough, and is mostly useful for pulling the category name. The category class and type are hard-coded literals that determine where a category can be used:

Class: 1 = job, 2 = alert, 3 = operator.
Type: 1 = local, 2 = multiserver, 3 = none.

MSDB.dbo.sysjobhistory

This table contains a row per SQL Server Agent job or job step. Both step status and overall job status are contained in this table:

In addition to the corresponding job_id, step, and error information (if applicable), this table provides details on job runtime, run status, run date and time, and who was notified (if anyone).

Note that a row with step_id = 0 corresponds to the overall job status, and not that of any one step. It is possible for a job with no steps to fail, or for a job to fail prior to executing any job steps, so we can end up with an overall job status that has no corresponding step success/failure details.

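To see how job outcome rows and step rows sit side by side, a query along these lines can help. It is only a sketch: the TOP (50) limit and the row_type label are arbitrary choices made for readability.

```sql
-- Most recent history rows, labelled by whether they describe the overall
-- job outcome (step_id = 0) or an individual step (step_id > 0).
SELECT TOP (50)
    sysjobs.name AS job_name,
    CASE WHEN sysjobhistory.step_id = 0 THEN 'Job outcome' ELSE 'Job step' END AS row_type,
    sysjobhistory.step_id,
    sysjobhistory.step_name,
    sysjobhistory.run_status,   -- 0 = failed, 1 = succeeded, 2 = retry, 3 = canceled
    sysjobhistory.run_date,
    sysjobhistory.run_time,
    sysjobhistory.run_duration,
    sysjobhistory.message
FROM msdb.dbo.sysjobhistory
INNER JOIN msdb.dbo.sysjobs
ON sysjobs.job_id = sysjobhistory.job_id
ORDER BY sysjobhistory.instance_id DESC;
```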

Sysjobhistory is only populated when a job or job step completes. In the unlikely event that a server restarts abnormally or SQL Server Agent crashes, then it is possible for a job to not be completely logged in this table (or logged at all).

Because this data exists and is readily accessible to us, we can collect, analyze, and use it for notification purposes.

Limitations of Built-In Notifications

By default, you can add notification steps to any SQL Server Agent jobs, either via the GUI or TSQL:

Whether a job fails, succeeds, or completes, you may add an email/page/logging to the end of it. While useful, this only addresses the state of a job when it completes, and not of any specific steps. In reality, we may care about the status of individual steps and wish to act on them on a more granular basis.

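For reference, configuring the built-in notification via TSQL might look roughly like the following. The operator name, email address, and job name are hypothetical placeholders; the stored procedures themselves (sp_add_operator and sp_update_job) ship with MSDB.

```sql
-- Create an operator to notify (name and address are placeholders).
EXEC msdb.dbo.sp_add_operator
    @name = N'DBA Team',
    @enabled = 1,
    @email_address = N'dba-team@example.com';

-- Email that operator whenever the job fails.
-- @notify_level_email: 0 = never, 1 = on success, 2 = on failure, 3 = always.
EXEC msdb.dbo.sp_update_job
    @job_name = N'Test Job',
    @notify_level_email = 2,
    @notify_email_operator_name = N'DBA Team';
```

Note that this setting applies to the job as a whole, which is exactly the limitation discussed here: there is no equivalent switch for an individual step.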

A significant (and somewhat confusing) caveat to how SQL Server Agent handles job step failures is that you may configure a job to continue executing, even if a step fails. In the event that a job step fails, the job continues, and future steps succeed, the job will report success, despite one or more steps failing. This is not easy to alert on in SQL Server and common solutions involve either breaking a job into numerous smaller jobs or adding customized alerting steps as needed.

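One way to find out which jobs are exposed to this behavior is to look at how each step is configured to react to failure. The sketch below reads msdb.dbo.sysjobsteps and lists every step that does anything other than quit the job reporting failure; the on_fail_behavior column alias is just a label chosen for this example.

```sql
-- Steps whose failure will NOT cause the job to stop and report failure.
-- on_fail_action: 1 = quit reporting success, 2 = quit reporting failure,
--                 3 = go to next step, 4 = go to step on_fail_step_id.
SELECT
    sysjobs.name AS job_name,
    sysjobsteps.step_id,
    sysjobsteps.step_name,
    CASE sysjobsteps.on_fail_action
        WHEN 1 THEN 'Quit reporting success'
        WHEN 3 THEN 'Go to next step'
        WHEN 4 THEN 'Go to step ' + CAST(sysjobsteps.on_fail_step_id AS VARCHAR(10))
    END AS on_fail_behavior
FROM msdb.dbo.sysjobsteps
INNER JOIN msdb.dbo.sysjobs
ON sysjobs.job_id = sysjobsteps.job_id
WHERE sysjobsteps.on_fail_action IN (1, 3, 4)
ORDER BY sysjobs.name, sysjobsteps.step_id;
```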

For our purposes, we’d like to build a failure notification job that is simple and all-encompassing. The following are all possible failure conditions that we may wish to report on:

A job fails due to one or more steps failing.
A job succeeds, but one or more steps fail and we wish to report on those failures.
All steps in a job succeed, but the job itself fails. This is often the result of a job configuration issue.
A job fails that has no steps defined for it.
A job fails prior to any steps executing. This is often the result of a job configuration issue.

To alert on all of these effectively, we will need to write some of our own code to detect, log, and report on them.

Another limitation of built-in notifications is the content of the alert that you receive. Selecting a notification as shown above will result in a single pre-fabricated notification whenever it is triggered. This is good for letting you know that a job failed, but the included information is often not enough to troubleshoot without going back to the job and reading more details of the failure. The following is an example of what the subject and body of a default SQL Server Agent notification would look like:

SQL Server Job System: ‘Test Job’ completed on EdSQLServer
JOB RUN: ‘Test Job’ was run on 3/5/2018 at 07:00:00 AM
DURATION: 0 hours, 0 minutes, 17 seconds
STATUS: Failed
MESSAGES: The job failed. The Job was invoked by Schedule 3 (Daily at 7am). The last step to run was step 1 (Run the test script!).

This is useful, but could be far more useful. For starters, the error message is only the job failure message and does not include the detailed failure info from any failed job steps. Ideally, enough details would be provided in the alert to ensure that you could respond immediately, and not need to dig further for error messages every single time this happens. Customizing an alert process allows us to include as much (or as little) detail as we wish in order to make the notifications we receive as actionable as possible!

Building A Better Notification System

To build as simple a job failure alert system as possible, we’ll follow a handful of steps to plan out and execute this project:

Create tables to store job and job failure details.
Create a stored procedure that logs recent job and failure details to these tables.
Create a job that regularly calls this stored procedure.

A goal in this process is to keep things as basic as possible. There are many opportunities available here for overengineering a perpetual motion machine, but alerting on important failures is ideally simple so as to be as reliable as possible.

We’ll start by building a table that will store a list of SQL Server Agent jobs. Why build a table when MSDB already includes the sysjobs table? If a job is deleted, we want to retain the old job record for posterity. This allows us to report on failures for jobs that may have been recently deleted. It also allows us to retain information about past failures for jobs that no longer exist. Similarly, if we ever were to migrate this database to a new server or install a new version of SQL Server, then having the old job data will ensure that all of our job failure details will remain useful and not be associated with orphaned/unavailable job data in MSDB.

CREATE TABLE dbo.sql_server_agent_job
(
    sql_server_agent_job_id INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_sql_server_agent_job PRIMARY KEY CLUSTERED,
    sql_server_agent_job_id_guid UNIQUEIDENTIFIER NOT NULL,
    sql_server_agent_job_name NVARCHAR(128) NOT NULL,
    job_create_datetime_utc DATETIME NOT NULL,
    job_last_modified_datetime_utc DATETIME NOT NULL,
    is_enabled BIT NOT NULL,
    is_deleted BIT NOT NULL,
    job_category_name VARCHAR(100) NOT NULL
);

This table, in addition to keys, includes the job name, create/modify times, its category, and flags to indicate whether the job has been disabled or deleted. Feel free to add additional columns for any job metadata that is useful to you, but might not be listed here. We create a surrogate integer clustered primary key to avoid the need to have to index, join, and filter on the larger UNIQUEIDENTIFIER data type. If you are working with servers that have a very short and stable SQL Server Agent job list, then you could easily use a SMALLINT or even a TINYINT for the primary key ID column.

With a SQL Server Agent job table available, we can now build a table to store job failure metrics:

CREATE TABLE dbo.sql_server_agent_job_failure
(
    sql_server_agent_job_failure_id INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_sql_server_agent_job_failure PRIMARY KEY CLUSTERED,
    sql_server_agent_job_id INT NOT NULL CONSTRAINT FK_sql_server_agent_job_failure_sql_server_agent_job FOREIGN KEY REFERENCES dbo.sql_server_agent_job (sql_server_agent_job_id),
    sql_server_agent_instance_id INT NOT NULL,
    job_start_time_utc DATETIME NOT NULL,
    job_failure_time_utc DATETIME NOT NULL,
    job_failure_step_number SMALLINT NOT NULL,
    job_failure_step_name VARCHAR(250) NOT NULL,
    job_failure_message VARCHAR(MAX) NOT NULL,
    job_step_failure_message VARCHAR(MAX) NOT NULL,
    job_step_severity INT NOT NULL,
    job_step_message_id INT NOT NULL,
    retries_attempted INT NOT NULL,
    has_email_been_sent_to_operator BIT NOT NULL
);
CREATE NONCLUSTERED INDEX NCI_sql_server_agent_job_failure_sql_server_agent_job_id ON dbo.sql_server_agent_job_failure (sql_server_agent_job_id);
CREATE NONCLUSTERED INDEX NCI_sql_server_agent_job_failure_sql_server_agent_instance_id ON dbo.sql_server_agent_job_failure (sql_server_agent_instance_id);

This table contains a foreign key back to our newly created job table above. It also contains details of the failed job, including the start/fail time of the job and details of the error. The last column is a flag that will be used to signify when a failed job has been alerted on successfully, so that we do not repeatedly spam an operator with alerts on it. Note that we will include both the job failure and the step failure messages, allowing for easier troubleshooting directly from the alert, without the need to dig back into SQL Server Agent prior to resolving the problem.

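Once these two tables hold data, they can also be queried directly for reporting. As a hedged example, the following summarizes failures per job over the past seven days; the seven-day window and the column aliases are arbitrary choices for illustration.

```sql
-- Failure summary per job for the last 7 days, including how many of those
-- failures have not yet been emailed to an operator.
SELECT
    sql_server_agent_job.sql_server_agent_job_name,
    COUNT(*) AS failure_count,
    MAX(sql_server_agent_job_failure.job_failure_time_utc) AS most_recent_failure_utc,
    SUM(CASE WHEN sql_server_agent_job_failure.has_email_been_sent_to_operator = 0 THEN 1 ELSE 0 END) AS failures_not_yet_alerted
FROM dbo.sql_server_agent_job_failure
INNER JOIN dbo.sql_server_agent_job
ON sql_server_agent_job.sql_server_agent_job_id = sql_server_agent_job_failure.sql_server_agent_job_id
WHERE sql_server_agent_job_failure.job_failure_time_utc > DATEADD(DAY, -7, GETUTCDATE())
GROUP BY sql_server_agent_job.sql_server_agent_job_name
ORDER BY failure_count DESC;
```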

The next step is to create a stored procedure that will check for job failures and place data into sql_server_agent_job_failure accordingly. This will be composed of a handful of steps:

Update sql_server_agent_job with any new, deleted, or changed jobs.
Collect data on new job failures.
Collect data on new job step failures.
Email an operator the details of these failures. Email can be replaced with another communication medium, if desired.

CREATE PROCEDURE dbo.monitor_job_failures
    @minutes_to_monitor SMALLINT = 1440
AS
BEGIN
SET NOCOUNT ON;
-- Determine UTC offset so that all times can easily be converted to UTC.
DECLARE @utc_offset INT;
SELECT
    @utc_offset = -1 * DATEDIFF(HOUR, GETUTCDATE(), GETDATE());

Here is the stored proc declaration. @minutes_to_monitor tells it how far back to check for job failures. This will be set depending on how often you plan on running the monitoring job that calls this proc. My preference is to pull one day’s worth of data. This ensures that in the event of server maintenance, an outage, or some other interruption, we won’t miss any job failures. We’ll filter out already-alerted-on failures as we go, so that won’t be a problem.

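To make the relationship between the schedule and @minutes_to_monitor concrete, here is a sketch of a SQL Server Agent job that calls the procedure every 15 minutes while still looking back a full day. The job name, schedule name, 15-minute interval, and the database name DBA are all assumptions made for this example; substitute the database in which you created the procedure.

```sql
EXEC msdb.dbo.sp_add_job
    @job_name = N'Monitor Job Failures';

EXEC msdb.dbo.sp_add_jobstep
    @job_name = N'Monitor Job Failures',
    @step_name = N'Run dbo.monitor_job_failures',
    @subsystem = N'TSQL',
    @database_name = N'DBA', -- hypothetical database holding the procedure
    @command = N'EXEC dbo.monitor_job_failures @minutes_to_monitor = 1440;';

EXEC msdb.dbo.sp_add_jobschedule
    @job_name = N'Monitor Job Failures',
    @name = N'Every 15 minutes',
    @freq_type = 4,             -- daily
    @freq_interval = 1,         -- every 1 day
    @freq_subday_type = 4,      -- sub-day unit is minutes
    @freq_subday_interval = 15; -- run every 15 minutes

-- Target the local server so that the job will actually run.
EXEC msdb.dbo.sp_add_jobserver
    @job_name = N'Monitor Job Failures';
```

Because the look-back window is much longer than the interval between runs, each execution re-reads failures it has seen before; the duplicate filtering described above is what keeps that from producing repeat alerts.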

We also calculate the UTC offset and will store all date and time data in UTC. This requires a bit more math up front, but provides more consistency for anyone who views this data. A UTC value can be converted back to local time by applying the offset in the opposite direction, that is, by subtracting @utc_offset hours from it.

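The round trip can be checked with a short standalone snippet. This is a sketch that reuses the same offset calculation as the procedure; the variable names @local_time and @utc_time are introduced here only for illustration.

```sql
-- Same calculation as in the stored procedure: hours to ADD to local time to reach UTC.
DECLARE @utc_offset INT = -1 * DATEDIFF(HOUR, GETUTCDATE(), GETDATE());
DECLARE @local_time DATETIME = GETDATE();
DECLARE @utc_time DATETIME = DATEADD(HOUR, @utc_offset, @local_time);

SELECT
    @utc_offset AS utc_offset_hours,
    @local_time AS local_time,
    @utc_time AS converted_to_utc,
    DATEADD(HOUR, -1 * @utc_offset, @utc_time) AS converted_back_to_local;
```

Two caveats follow from working in whole hours with the current offset: time zones that are offset by a fraction of an hour will not convert exactly, and historical rows recorded under a different daylight saving setting will be shifted by the offset in effect today rather than the one in effect when they were written.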

MERGE INTO dbo.sql_server_agent_job AS TARGET
USING (
    SELECT
        sysjobs.job_id AS sql_server_agent_job_id_guid,
        sysjobs.name AS sql_server_agent_job_name,
        sysjobs.date_created AS job_create_datetime_utc,
        sysjobs.date_modified AS job_last_modified_datetime_utc,
        sysjobs.enabled AS is_enabled,
        0 AS is_deleted,
        ISNULL(syscategories.name, '') AS job_category_name
    FROM msdb.dbo.sysjobs
    LEFT JOIN msdb.dbo.syscategories
    ON syscategories.category_id = sysjobs.category_id) AS SOURCE
ON (SOURCE.sql_server_agent_job_id_guid = TARGET.sql_server_agent_job_id_guid)
WHEN NOT MATCHED BY TARGET
    THEN INSERT
        (sql_server_agent_job_id_guid, sql_server_agent_job_name, job_create_datetime_utc, job_last_modified_datetime_utc,
         is_enabled, is_deleted, job_category_name)
    VALUES (
        SOURCE.sql_server_agent_job_id_guid,
        SOURCE.sql_server_agent_job_name,
        SOURCE.job_create_datetime_utc,
        SOURCE.job_last_modified_datetime_utc,
        SOURCE.is_enabled,
        SOURCE.is_deleted,
        SOURCE.job_category_name)
WHEN MATCHED AND SOURCE.job_last_modified_datetime_utc > TARGET.job_last_modified_datetime_utc
    THEN UPDATE
        SET sql_server_agent_job_name = SOURCE.sql_server_agent_job_name,
            job_create_datetime_utc = SOURCE.job_create_datetime_utc,
            job_last_modified_datetime_utc = SOURCE.job_last_modified_datetime_utc,
            is_enabled = SOURCE.is_enabled,
            is_deleted = SOURCE.is_deleted,
            job_category_name = SOURCE.job_category_name;

This TSQL performs a MERGE into our SQL Server Agent Job table. It matches on job_id, which is unique on each SQL Server and a reliable key for this purpose.

UPDATE sql_server_agent_job
    SET is_enabled = 0,
        is_deleted = 1
FROM dbo.sql_server_agent_job
LEFT JOIN msdb.dbo.sysjobs
ON sysjobs.Job_Id = sql_server_agent_job.sql_server_agent_job_id_guid
WHERE sysjobs.Job_Id IS NULL;

This additional query will check to see if any jobs no longer exist. That is, they are in sql_server_agent_job, but no longer in MSDB. We’ll flag any deleted jobs as disabled and deleted so that anyone reading this data knows that it is stored there for posterity and no longer references an active job.

A warning before we proceed: Dates, times, and statuses within many of the MSDB tables are stored using some outdated (read: scary) conventions. They are not stored as dates, times, datetimes, or even strings. Instead, they are stored as integers. For example, a run time of 8:41:53am is stored as 84153. Run durations are also stored as integers in HHMMSS form: a job that ran for 00:12:53 (twelve minutes and fifty-three seconds) is stored as 1253. Lastly, dates are stored as integers in the format YYYYMMDD, so March 5, 2018 is stored as 20180305. Converting these into more useful data types is critical to being able to meaningfully report on them.

As a result, we’re going to need some math and string manipulation to clean up dates and times that are stored as integer literals:

WITH CTE_NORMALIZE_DATETIME_DATA AS (
    SELECT
        sysjobhistory.job_id AS sql_server_agent_job_id_guid,
        CAST(sysjobhistory.run_date AS VARCHAR(MAX)) AS run_date_string,
        REPLICATE('0', 6 - LEN(CAST(sysjobhistory.run_time AS VARCHAR(MAX)))) + CAST(sysjobhistory.run_time AS VARCHAR(MAX)) AS run_time_string,
        REPLICATE('0', 6 - LEN(CAST(sysjobhistory.run_duration AS VARCHAR(MAX)))) + CAST(sysjobhistory.run_duration AS VARCHAR(MAX)) AS run_duration_string,
        sysjobhistory.run_status,
        sysjobhistory.message,
        sysjobhistory.instance_id
    FROM msdb.dbo.sysjobhistory WITH (NOLOCK)
    WHERE sysjobhistory.run_status = 0
    AND sysjobhistory.step_id = 0),
CTE_GENERATE_DATETIME_DATA AS (
    SELECT
        CTE_NORMALIZE_DATETIME_DATA.sql_server_agent_job_id_guid,
        CAST(SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_date_string, 5, 2) + '/' + SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_date_string, 7, 2) + '/' + SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_date_string, 1, 4) AS DATETIME) +
        CAST(STUFF(STUFF(CTE_NORMALIZE_DATETIME_DATA.run_time_string, 5, 0, ':'), 3, 0, ':') AS DATETIME) AS job_start_datetime,
        CAST(SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_duration_string, 1, 2) AS INT) * 3600 +
        CAST(SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_duration_string, 3, 2) AS INT) * 60 +
        CAST(SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_duration_string, 5, 2) AS INT) AS job_duration_seconds,
        CASE CTE_NORMALIZE_DATETIME_DATA.run_status
            WHEN 0 THEN 'Failure'
            WHEN 1 THEN 'Success'
            WHEN 2 THEN 'Retry'
            WHEN 3 THEN 'Canceled'
            ELSE 'Unknown'
        END AS job_status,
        CTE_NORMALIZE_DATETIME_DATA.message,
        CTE_NORMALIZE_DATETIME_DATA.instance_id
    FROM CTE_NORMALIZE_DATETIME_DATA)
SELECT
    CTE_GENERATE_DATETIME_DATA.sql_server_agent_job_id_guid,
    DATEADD(HOUR, @utc_offset, CTE_GENERATE_DATETIME_DATA.job_start_datetime) AS job_start_time_utc,
    DATEADD(HOUR, @utc_offset, DATEADD(SECOND, ISNULL(CTE_GENERATE_DATETIME_DATA.job_duration_seconds, 0), CTE_GENERATE_DATETIME_DATA.job_start_datetime)) AS job_failure_time_utc,
    ISNULL(CTE_GENERATE_DATETIME_DATA.message, '') AS job_failure_message,
    CTE_GENERATE_DATETIME_DATA.instance_id
INTO #job_failure
FROM CTE_GENERATE_DATETIME_DATA
WHERE DATEADD(HOUR, @utc_offset, CTE_GENERATE_DATETIME_DATA.job_start_datetime) > DATEADD(MINUTE, -1 * @minutes_to_monitor, GETUTCDATE());

Much of this TSQL is devoted to converting numeric representations of dates and times into an actual DATETIME that we can compare against. The first CTE cleans up those integers so that the run time and run duration string has leading zeroes to ensure it is 6 characters long. The second CTE converts these now uniform values into DATETIMEs using some very ugly string manipulation. The final SELECT converts those DATETIME values into UTC and places the results into a temporary table for use later in this stored proc.

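For comparison, the same conversion can be sketched with integer arithmetic rather than string manipulation. This is not what the stored procedure does; it is an alternative shown on three sample values, and it assumes SQL Server 2012 or later, where DATETIMEFROMPARTS is available.

```sql
-- Sample values in the MSDB integer formats: YYYYMMDD, HHMMSS, HHMMSS.
DECLARE @run_date INT = 20180305;
DECLARE @run_time INT = 84153;
DECLARE @run_duration INT = 1253;

SELECT
    DATETIMEFROMPARTS(
        @run_date / 10000,          -- year
        (@run_date / 100) % 100,    -- month
        @run_date % 100,            -- day
        @run_time / 10000,          -- hour
        (@run_time / 100) % 100,    -- minute
        @run_time % 100,            -- second
        0) AS run_start_datetime,
    (@run_duration / 10000) * 3600
        + ((@run_duration / 100) % 100) * 60
        + (@run_duration % 100) AS run_duration_seconds;
-- For these sample values: run_start_datetime = 2018-03-05 08:41:53.000
-- and run_duration_seconds = 773 (12 minutes and 53 seconds).
```

One practical difference is that the arithmetic does not depend on the value being at most six digits long, so a duration with a three-digit hour count is still converted rather than turning into NULL.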

Our next step is to build a very similar query that will return data on job step failures. We intentionally do this work in a separate query as we will need to join these data sets together, and having them in separate temporary tables will make this significantly easier:

WITH CTE_NORMALIZE_DATETIME_DATA AS (
    SELECT
        sysjobhistory.job_id AS sql_server_agent_job_id_guid,
        CAST(sysjobhistory.run_date AS VARCHAR(MAX)) AS run_date_string,
        REPLICATE('0', 6 - LEN(CAST(sysjobhistory.run_time AS VARCHAR(MAX)))) + CAST(sysjobhistory.run_time AS VARCHAR(MAX)) AS run_time_string,
        REPLICATE('0', 6 - LEN(CAST(sysjobhistory.run_duration AS VARCHAR(MAX)))) + CAST(sysjobhistory.run_duration AS VARCHAR(MAX)) AS run_duration_string,
        sysjobhistory.run_status,
        sysjobhistory.step_id,
        sysjobhistory.step_name,
        sysjobhistory.message,
        sysjobhistory.retries_attempted,
        sysjobhistory.sql_severity,
        sysjobhistory.sql_message_id,
        sysjobhistory.instance_id
    FROM msdb.dbo.sysjobhistory WITH (NOLOCK)
    WHERE sysjobhistory.run_status = 0
    AND sysjobhistory.step_id > 0),
CTE_GENERATE_DATETIME_DATA AS (
    SELECT
        CTE_NORMALIZE_DATETIME_DATA.sql_server_agent_job_id_guid,
        CAST(SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_date_string, 5, 2) + '/' + SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_date_string, 7, 2) + '/' + SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_date_string, 1, 4) AS DATETIME) +
        CAST(STUFF(STUFF(CTE_NORMALIZE_DATETIME_DATA.run_time_string, 5, 0, ':'), 3, 0, ':') AS DATETIME) AS job_start_datetime,
        CAST(SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_duration_string, 1, 2) AS INT) * 3600 +
        CAST(SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_duration_string, 3, 2) AS INT) * 60 +
        CAST(SUBSTRING(CTE_NORMALIZE_DATETIME_DATA.run_duration_string, 5, 2) AS INT) AS job_duration_seconds,
        CASE CTE_NORMALIZE_DATETIME_DATA.run_status
            WHEN 0 THEN 'Failure'
            WHEN 1 THEN 'Success'
            WHEN 2 THEN 'Retry'
            WHEN 3 THEN 'Canceled'
            ELSE 'Unknown'
        END AS job_status,
        CTE_NORMALIZE_DATETIME_DATA.step_id,
        CTE_NORMALIZE_DATETIME_DATA.step_name,
        CTE_NORMALIZE_DATETIME_DATA.message,
        CTE_NORMALIZE_DATETIME_DATA.retries_attempted,
        CTE_NORMALIZE_DATETIME_DATA.sql_severity,
        CTE_NORMALIZE_DATETIME_DATA.sql_message_id,
        CTE_NORMALIZE_DATETIME_DATA.instance_id
    FROM CTE_NORMALIZE_DATETIME_DATA)
SELECT
    CTE_GENERATE_DATETIME_DATA.sql_server_agent_job_id_guid,
    DATEADD(HOUR, @utc_offset, CTE_GENERATE_DATETIME_DATA.job_start_datetime) AS job_start_time_utc,
    DATEADD(HOUR, @utc_offset, DATEADD(SECOND, ISNULL(CTE_GENERATE_DATETIME_DATA.job_duration_seconds, 0), CTE_GENERATE_DATETIME_DATA.job_start_datetime)) AS job_failure_time_utc,
    CTE_GENERATE_DATETIME_DATA.step_id AS job_failure_step_number,
    ISNULL(CTE_GENERATE_DATETIME_DATA.message, '') AS job_step_failure_message,
    CTE_GENERATE_DATETIME_DATA.sql_severity AS job_step_severity,
    CTE_GENERATE_DATETIME_DATA.retries_attempted,
    CTE_GENERATE_DATETIME_DATA.step_name,
    CTE_GENERATE_DATETIME_DATA.sql_message_id,
    CTE_GENERATE_DATETIME_DATA.instance_id
INTO #job_step_failure
FROM CTE_GENERATE_DATETIME_DATA
WHERE DATEADD(HOUR, @utc_offset, CTE_GENERATE_DATETIME_DATA.job_start_datetime) > DATEADD(MINUTE, -1 * @minutes_to_monitor, GETUTCDATE());

Note that the only significant difference between these queries is that we check for step_id = 0 when looking for overall job notification data and step_id > 0 for job steps that are associated with these jobs. Since sysjobhistory stores both job failures and job step failures, we need to separate them from each other, and checking the value of step_id is a quick and easy way to do so.

Now that we have some data on job and step failures, we can begin generating failure data to insert into sql_server_agent_job_failure for the various failure scenarios that we identified earlier:

Jobs that Fail Due to Failed Steps

WITH CTE_FAILURE_STEP AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY job_step_failure.sql_server_agent_job_id_guid, job_step_failure.job_failure_time_utc ORDER BY job_step_failure.job_failure_step_number DESC) AS recent_step_rank
    FROM #job_step_failure job_step_failure)
INSERT INTO dbo.sql_server_agent_job_failure
    (sql_server_agent_job_id, sql_server_agent_instance_id, job_start_time_utc, job_failure_time_utc, job_failure_step_number, job_failure_step_name,
     job_failure_message, job_step_failure_message, job_step_severity, job_step_message_id, retries_attempted, has_email_been_sent_to_operator)
SELECT
    sql_server_agent_job.sql_server_agent_job_id,
    CTE_FAILURE_STEP.instance_id,
    job_failure.job_start_time_utc,
    CTE_FAILURE_STEP.job_failure_time_utc,
    CTE_FAILURE_STEP.job_failure_step_number,
    CTE_FAILURE_STEP.step_name AS job_failure_step_name,
    job_failure.job_failure_message,
    CTE_FAILURE_STEP.job_step_failure_message,
    CTE_FAILURE_STEP.job_step_severity,
    CTE_FAILURE_STEP.sql_message_id AS job_step_message_id,
    CTE_FAILURE_STEP.retries_attempted,
    0 AS has_email_been_sent_to_operator
FROM #job_failure job_failure
INNER JOIN dbo.sql_server_agent_job
ON job_failure.sql_server_agent_job_id_guid = sql_server_agent_job.sql_server_agent_job_id_guid
INNER JOIN CTE_FAILURE_STEP
ON job_failure.sql_server_agent_job_id_guid = CTE_FAILURE_STEP.sql_server_agent_job_id_guid
AND job_failure.job_failure_time_utc = CTE_FAILURE_STEP.job_failure_time_utc
WHERE CTE_FAILURE_STEP.recent_step_rank = 1
AND CTE_FAILURE_STEP.instance_id NOT IN (SELECT sql_server_agent_job_failure.sql_server_agent_instance_id FROM dbo.sql_server_agent_job_failure)
AND sql_server_agent_job.job_category_name <> 'Unmonitored';

The CTE above will group job steps by job and job execution time, placing the last failure first. This allows us to determine which step failure was the direct cause of the job itself failing.

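The ranking technique is easier to see in isolation. The following self-contained sketch applies the same ROW_NUMBER pattern to three made-up rows; the job names, times, and messages are invented purely for the demonstration.

```sql
WITH sample_step_failure AS (
    SELECT *
    FROM (VALUES
        ('Job A', CAST('2018-03-05 07:00:17' AS DATETIME), 1, 'Step 1 failed'),
        ('Job A', CAST('2018-03-05 07:00:17' AS DATETIME), 3, 'Step 3 failed'),
        ('Job B', CAST('2018-03-05 08:30:00' AS DATETIME), 2, 'Step 2 failed')
    ) AS v (job_name, job_failure_time_utc, job_failure_step_number, job_step_failure_message)
),
ranked_step_failure AS (
    SELECT
        *,
        -- Rank 1 goes to the highest-numbered failed step within each job execution.
        ROW_NUMBER() OVER (PARTITION BY job_name, job_failure_time_utc ORDER BY job_failure_step_number DESC) AS recent_step_rank
    FROM sample_step_failure
)
SELECT
    job_name,
    job_failure_step_number,
    job_step_failure_message
FROM ranked_step_failure
WHERE recent_step_rank = 1;
-- Returns step 3 for Job A and step 2 for Job B.
```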

Jobs that Failed Without Any Failed Steps

INSERT INTO dbo.sql_server_agent_job_failure
    (sql_server_agent_job_id, sql_server_agent_instance_id, job_start_time_utc, job_failure_time_utc, job_failure_step_number, job_failure_step_name,
     job_failure_message, job_step_failure_message, job_step_severity, job_step_message_id, retries_attempted, has_email_been_sent_to_operator)
SELECT
    sql_server_agent_job.sql_server_agent_job_id,
    job_failure.instance_id,
    job_failure.job_start_time_utc,
    job_failure.job_failure_time_utc,
    0 AS job_failure_step_number,
    '' AS job_failure_step_name,
    job_failure.job_failure_message,
    '' AS job_step_failure_message,
    -1 AS job_step_severity,
    -1 AS job_step_message_id,
    0 AS retries_attempted,
    0 AS has_email_been_sent_to_operator
FROM #job_failure job_failure
INNER JOIN dbo.sql_server_agent_job
ON job_failure.sql_server_agent_job_id_guid = sql_server_agent_job.sql_server_agent_job_id_guid
WHERE job_failure.instance_id NOT IN (SELECT sql_server_agent_job_failure.sql_server_agent_instance_id FROM dbo.sql_server_agent_job_failure)
AND NOT EXISTS (
    SELECT *
    FROM #job_step_failure job_step_failure
    WHERE job_failure.sql_server_agent_job_id_guid = job_step_failure.sql_server_agent_job_id_guid
    AND job_failure.job_failure_time_utc = job_step_failure.job_failure_time_utc);

It is possible for a job to fail without any steps executing. This can be caused by a configuration error, a permissions problem, or some other high-level job or SQL Server Agent issue. We definitely want to know about these, so we check for all job failures that have no corresponding job step failures and insert them into sql_server_agent_job_failure.

Job Steps that Fail, but for Jobs that Succeed

Depending on the logic built into job steps, we may allow the job to continue even when a step fails. If this is the case, then SQL Server will not report failure, assuming the remainder of the job succeeds or follows similar rules.

WITH CTE_FAILURE_STEP AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY job_step_failure.sql_server_agent_job_id_guid, job_step_failure.job_failure_time_utc ORDER BY job_step_failure.job_failure_step_number DESC) AS recent_step_rank
    FROM #job_step_failure job_step_failure)
INSERT INTO dbo.sql_server_agent_job_failure
    (sql_server_agent_job_id, sql_server_agent_instance_id, job_start_time_utc, job_failure_time_utc, job_failure_step_number, job_failure_step_name,
     job_failure_message, job_step_failure_message, job_step_severity, job_step_message_id, retries_attempted, has_email_been_sent_to_operator)
SELECT
    sql_server_agent_job.sql_server_agent_job_id,
    CTE_FAILURE_STEP.instance_id,
    CTE_FAILURE_STEP.job_start_time_utc,
    CTE_FAILURE_STEP.job_failure_time_utc,
    CTE_FAILURE_STEP.job_failure_step_number,
    CTE_FAILURE_STEP.step_name AS job_failure_step_name,
    '' AS job_failure_message,
    CTE_FAILURE_STEP.job_step_failure_message,
    CTE_FAILURE_STEP.job_step_severity,
    CTE_FAILURE_STEP.sql_message_id AS job_step_message_id,
    CTE_FAILURE_STEP.retries_attempted,
    0 AS has_email_been_sent_to_operator
FROM CTE_FAILURE_STEP
INNER JOIN dbo.sql_server_agent_job
ON CTE_FAILURE_STEP.sql_server_agent_job_id_guid = sql_server_agent_job.sql_server_agent_job_id_guid
LEFT JOIN #job_failure job_failure
ON job_failure.sql_server_agent_job_id_guid = CTE_FAILURE_STEP.sql_server_agent_job_id_guid
AND job_failure.job_failure_time_utc = CTE_FAILURE_STEP.job_failure_time_utc
WHERE CTE_FAILURE_STEP.recent_step_rank = 1
AND job_failure.sql_server_agent_job_id_guid IS NULL
AND CTE_FAILURE_STEP.instance_id NOT IN (SELECT sql_server_agent_job_failure.sql_server_agent_instance_id FROM dbo.sql_server_agent_job_failure);

The TSQL above will check for all failed job steps that do not have a corresponding failed job and report on them as well.
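
To make the matching logic easier to see in isolation, here is a minimal sketch of the same pattern (rank the failed steps, then keep only those with no matching failed job). The table variables, column names, and sample rows are hypothetical stand-ins for the temp tables used by the process, not part of it.

```sql
-- Hypothetical sample data standing in for #job_step_failure and #job_failure.
DECLARE @step_failure TABLE
    (job_id_guid UNIQUEIDENTIFIER, job_failure_time_utc DATETIME, step_number INT, step_name NVARCHAR(128));
DECLARE @job_failure TABLE
    (job_id_guid UNIQUEIDENTIFIER, job_failure_time_utc DATETIME);

DECLARE @job_a UNIQUEIDENTIFIER = NEWID(); -- Steps fail, but the job succeeds.
DECLARE @job_b UNIQUEIDENTIFIER = NEWID(); -- A step fails and the job fails.

INSERT INTO @step_failure VALUES
    (@job_a, '20180208 10:00:00', 1, N'Load staging'),
    (@job_a, '20180208 10:00:00', 2, N'Purge staging'),
    (@job_b, '20180208 11:00:00', 1, N'Rebuild indexes');
INSERT INTO @job_failure VALUES
    (@job_b, '20180208 11:00:00');

WITH ranked_step AS (
    SELECT
        step_failure.*,
        ROW_NUMBER() OVER (PARTITION BY step_failure.job_id_guid, step_failure.job_failure_time_utc
                           ORDER BY step_failure.step_number DESC) AS recent_step_rank
    FROM @step_failure step_failure)
SELECT
    ranked_step.job_id_guid,
    ranked_step.step_number,
    ranked_step.step_name
FROM ranked_step
LEFT JOIN @job_failure job_failure
ON job_failure.job_id_guid = ranked_step.job_id_guid
AND job_failure.job_failure_time_utc = ranked_step.job_failure_time_utc
WHERE ranked_step.recent_step_rank = 1          -- Only the last failed step per run.
AND job_failure.job_id_guid IS NULL;            -- Only runs with no failed-job row.
```

With this sample data, only the last failed step of the first job is returned. The second job is skipped because it already has a row in the failed-job set and would be reported through that path instead.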

Notification

For this demo, we’ll use sp_send_dbmail to email a notification to an operator. If this is not your preferred method of alerting, feel free to substitute this with something else.

DECLARE @profile_name VARCHAR(MAX) = 'Default Public Profile';
DECLARE @email_to_address VARCHAR(MAX) = 'ed@myemailaddress.com';
DECLARE @email_subject VARCHAR(MAX);
DECLARE @email_body VARCHAR(MAX);
DECLARE @job_failure_count INT;

SELECT
    @job_failure_count = COUNT(*)
FROM dbo.sql_server_agent_job_failure
WHERE sql_server_agent_job_failure.has_email_been_sent_to_operator = 0;

-- Send an email to an operator if any new errors are found.
IF EXISTS (SELECT * FROM dbo.sql_server_agent_job_failure WHERE sql_server_agent_job_failure.has_email_been_sent_to_operator = 0)
BEGIN
    SELECT @email_subject = 'Failed Job Alert: ' + ISNULL(@@SERVERNAME, CAST(SERVERPROPERTY('ServerName') AS VARCHAR(MAX)));

    -- Email header: failure count, then the column headings for the failure table.
    SELECT @email_body = 'At least one failure has occurred on ' + ISNULL(@@SERVERNAME, CAST(SERVERPROPERTY('ServerName') AS VARCHAR(MAX))) + ':
<html><body><table border=1>
<tr><th colspan="6" bgcolor="#F29C89" align="left">Total Failed Jobs: ' + CAST(@job_failure_count AS VARCHAR(MAX)) + '</th>
</tr>
<tr><th bgcolor="#F29C89">Job Name</th><th bgcolor="#F29C89">Server Job Start Time</th><th bgcolor="#F29C89">Server Job Failure Time</th><th bgcolor="#F29C89">Failure Step Name</th><th bgcolor="#F29C89">Job Failure Message</th><th bgcolor="#F29C89">Job Step Failure Message</th>
</tr>';

    -- One <tr> per unsent failure, with each column emitted as a <td>.
    -- @utc_offset is not declared in this snippet; it is assumed to be declared earlier in the stored procedure.
    SELECT @email_body = @email_body + CAST((
        SELECT
            CAST(sql_server_agent_job.sql_server_agent_job_name AS VARCHAR(MAX)) AS 'td', '',
            CAST(DATEADD(HOUR, -1 * @utc_offset, sql_server_agent_job_failure.job_start_time_utc) AS VARCHAR(MAX)) AS 'td', '',
            CAST(DATEADD(HOUR, -1 * @utc_offset, sql_server_agent_job_failure.job_failure_time_utc) AS VARCHAR(MAX)) AS 'td', '',
            sql_server_agent_job_failure.job_failure_step_name AS 'td', '',
            sql_server_agent_job_failure.job_failure_message AS 'td', '',
            sql_server_agent_job_failure.job_step_failure_message AS 'td'
        FROM dbo.sql_server_agent_job_failure
        INNER JOIN dbo.sql_server_agent_job
        ON sql_server_agent_job.sql_server_agent_job_id = sql_server_agent_job_failure.sql_server_agent_job_id
        WHERE sql_server_agent_job_failure.has_email_been_sent_to_operator = 0
        ORDER BY sql_server_agent_job_failure.job_failure_time_utc ASC
        FOR XML PATH('tr'), ELEMENTS) AS VARCHAR(MAX));

    SELECT @email_body = @email_body + '</table></body></html>';
    SELECT @email_body = REPLACE(@email_body, '<td>', '<td valign="top">');

    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = @profile_name,
        @recipients = @email_to_address,
        @subject = @email_subject,
        @body_format = 'html',
        @body = @email_body;

    -- Flag the reported failures so they are not sent again on the next run.
    UPDATE sql_server_agent_job_failure
        SET has_email_been_sent_to_operator = 1
    FROM dbo.sql_server_agent_job_failure
    WHERE sql_server_agent_job_failure.has_email_been_sent_to_operator = 0;
END

Some effort is taken here to structure the job failure details into a formatted HTML email with the failure count at the top and a table of failure data below. This makes the emails relatively easy to read. The following is an example of a job failure email that was sent using this process:

These failures were staged: the first is a scenario where a step fails but the job succeeds, and the second is a job that fails because of a failed step. This email format consolidates all failures into a single table that includes details on what failed and why. While only 6 columns are included in this table, you can add more for any additional data that could be useful, such as job duration, number of retries, or further details about the job itself.
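
As a rough illustration of how an extra column could be added, the sketch below isolates the FOR XML PATH pattern that builds the table rows and includes a retry count as a third cell. The table variable and its sample rows are hypothetical; in the real procedure the data would come from dbo.sql_server_agent_job_failure, which already stores retries_attempted.

```sql
-- Hypothetical sample rows standing in for the failure table.
DECLARE @failure TABLE
    (job_name NVARCHAR(128), step_name NVARCHAR(128), retries_attempted INT);
INSERT INTO @failure VALUES
    (N'Nightly ETL', N'Load staging', 2),
    (N'Index Maintenance', N'Rebuild indexes', 0);

DECLARE @html_rows VARCHAR(MAX);

-- Each source row becomes a <tr>; each column aliased as 'td' becomes a cell.
-- The empty-string columns keep adjacent 'td' columns from being merged into one element.
SELECT @html_rows = CAST((
    SELECT
        failure.job_name AS 'td', '',
        failure.step_name AS 'td', '',
        CAST(failure.retries_attempted AS VARCHAR(10)) AS 'td' -- The newly added column.
    FROM @failure failure
    ORDER BY failure.job_name ASC
    FOR XML PATH('tr'), ELEMENTS) AS VARCHAR(MAX));

SELECT @html_rows AS html_rows;
```

Any column added this way also needs a matching header cell in the email body, and the colspan on the "Total Failed Jobs" header would need to grow by one to keep the table aligned.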

The goal of an alert such as this is to reduce the amount of homework that you need to do whenever something breaks. Instead of having to return to SQL Server Agent and read all of the information that could be presented here, you can begin troubleshooting right away.

SQL Server Agent Job

The stored procedure that was created above can be called from anywhere (a scheduled task, PowerShell, SQLCMD, etc.), but for simplicity, I’ll use a SQL Server Agent job:

USE msdb;
GO

DECLARE @jobId BINARY(16);

-- Create the job. The built-in email notification (@notify_level_email = 2) fires if this job itself fails.
EXEC msdb.dbo.sp_add_job
    @job_name = N'Job Failure Collection and Notification',
    @enabled = 1,
    @notify_level_eventlog = 0,
    @notify_level_email = 2,
    @notify_level_netsend = 0,
    @notify_level_page = 0,
    @delete_level = 0,
    @description = N'No description available.',
    @category_name = N'[Uncategorized (Local)]',
    @owner_login_name = N'sa',
    @notify_email_operator_name = N'Ed',
    @job_id = @jobId OUTPUT;

-- Single step: run the monitoring stored procedure.
EXEC msdb.dbo.sp_add_jobstep
    @job_id = @jobId,
    @step_name = N'monitor_job_failures',
    @step_id = 1,
    @cmdexec_success_code = 0,
    @on_success_action = 1,
    @on_success_step_id = 0,
    @on_fail_action = 2,
    @on_fail_step_id = 0,
    @retry_attempts = 0,
    @retry_interval = 0,
    @os_run_priority = 0,
    @subsystem = N'TSQL',
    @command = N'EXEC dbo.monitor_job_failures',
    @database_name = N'AdventureWorks2016CTP3',
    @flags = 0;

EXEC msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1;

-- Schedule: daily, repeating every 5 minutes.
EXEC msdb.dbo.sp_add_jobschedule
    @job_id = @jobId,
    @name = N'Every 5 Minutes',
    @enabled = 1,
    @freq_type = 4,
    @freq_interval = 1,
    @freq_subday_type = 4,
    @freq_subday_interval = 5,
    @freq_relative_interval = 0,
    @freq_recurrence_factor = 0,
    @active_start_date = 20180208,
    @active_end_date = 99991231,
    @active_start_time = 100,
    @active_end_time = 59;

EXEC msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)';
GO

This job executes every 5 minutes and sticks to the default of 1440 minutes (one day) for the monitoring interval. Any failures in sql_server_agent_job_failure that have not been flagged as sent to the operator will be sent out as part of any given run. Feel free to adjust the job run frequency to whatever meets your needs.
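
If a different frequency is needed after the job has been created, the schedule can be changed in place rather than recreated. The following is a minimal sketch using msdb.dbo.sp_update_schedule; it assumes the schedule created above is the only one named 'Every 5 Minutes' on the instance, and the new name is just a suggestion.

```sql
USE msdb;
GO

-- Assumption: exactly one schedule is named 'Every 5 Minutes'.
-- If several schedules share that name, pass @schedule_id instead of @name.
EXEC dbo.sp_update_schedule
    @name = N'Every 5 Minutes',
    @new_name = N'Every 1 Minute',   -- Hypothetical new name, kept in sync with the interval.
    @freq_subday_type = 4,           -- 4 = the sub-day interval is measured in minutes.
    @freq_subday_interval = 1;       -- Run every minute instead of every 5 minutes.
GO
```

Running more often shortens the delay before an alert, at the cost of more frequent reads against the job history tables.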

If a job is deemed mission critical and waiting up to 5 minutes to learn of its failure is too long, then you may wish to either configure a separate notification on that job or run this process more frequently. Most processes are flexible and can allow for a short waiting period before we respond, though that priority is determined by you. For example, if a SQL Server instance restarts, we’d probably want to know right now, but if an overnight reporting process fails, waiting a few minutes is probably fine.
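
For a job that needs an immediate alert, one option is to turn on the built-in failure email for that job alone and leave everything else to the consolidated process. This is only a sketch: the job name below is hypothetical, and the operator is assumed to be the same 'Ed' operator referenced in the job script above.

```sql
-- Enable the built-in "email on failure" notification for a single critical job.
EXEC msdb.dbo.sp_update_job
    @job_name = N'Critical Nightly Backup',   -- Hypothetical job name; substitute your own.
    @notify_level_email = 2,                  -- 2 = notify when the job fails.
    @notify_email_operator_name = N'Ed';      -- Assumes this operator already exists.
```

A job configured this way will also appear in the consolidated email, so the immediate alert acts as an early warning rather than a replacement.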

Note that I have included a built-in notification on this job that will email me if it fails. This is intentional and answers the question, “What notifies us of a failure if the job failure notification process breaks?” The only larger issue would be SQL Server Agent itself becoming unavailable. Alerting on this is beyond the scope of this article, but could be accomplished by a service monitor pointed at SQL Server Agent.
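
As a starting point only, an external monitor could run a query such as the one below against the instance to check the state of the Agent service. This is a sketch, not a full monitoring solution: it assumes the caller has VIEW SERVER STATE permission, and it must be scheduled outside of SQL Server Agent, since an Agent job cannot report that Agent is down.

```sql
-- Report the current state of the SQL Server Agent service for this instance.
SELECT
    dm_server_services.servicename,
    dm_server_services.status_desc,        -- For example, 'Running' or 'Stopped'.
    dm_server_services.last_startup_time
FROM sys.dm_server_services
WHERE dm_server_services.servicename LIKE N'SQL Server Agent%';
```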

Cleanup

Once in place and tested, this alerting system can take the place of any existing alerts. The benefits of this process are:

Failures are stored in tables that can be queried and reported on later, if need be.
Failures are grouped into a single notification per run. This prevents a flood of alerts if a job fails frequently.
Additional details are provided that built-in notifications do not report on.
All failure conditions can be reported on, including unusual ones that SQL Server Agent may not catch.
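
Because the failures are kept in ordinary tables, later reporting can be a simple query. The sketch below counts failures per job using the tables and columns populated by the process; the 30-day window is an arbitrary assumption.

```sql
-- Failure counts per job over the last 30 days, most frequent first.
SELECT
    sql_server_agent_job.sql_server_agent_job_name,
    COUNT(*) AS failure_count,
    MAX(sql_server_agent_job_failure.job_failure_time_utc) AS most_recent_failure_utc
FROM dbo.sql_server_agent_job_failure
INNER JOIN dbo.sql_server_agent_job
ON sql_server_agent_job.sql_server_agent_job_id = sql_server_agent_job_failure.sql_server_agent_job_id
WHERE sql_server_agent_job_failure.job_failure_time_utc >= DATEADD(DAY, -30, GETUTCDATE())
GROUP BY sql_server_agent_job.sql_server_agent_job_name
ORDER BY failure_count DESC;
```

A report like this can help identify jobs that fail repeatedly and may need a fix rather than another alert.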

Conclusion

Ultimately, reporting and alerting on failures are done on a case-by-case basis. This process was written to be as simple as possible, and therefore can be customized until it fully meets your needs. As an added bonus, we can create a well-designed schema that relies on easy-to-understand data types for dates, times, statuses, and duration.

Creating your own job failure alerting process allows you to take charge of alerting and make the resulting notifications as meaningful and useful as possible. This is important as a major goal of alerting is to make the messages we get as actionable and informative as possible, without producing noise or distractions. We also do not want to miss potentially important failures that result from a misconfigured job.

Given the limitations of the built-in alerting options in SQL Server, this also provides us with functionality that is not possible otherwise. We can adjust notifications to react to failure states such as failed steps or misconfigured jobs. We can also customize the notifications we receive to include additional information that allows us to jump straight into troubleshooting, without the need to revisit SQL Server Agent and collect more troubleshooting data.

Effective alerting improves our productivity and decreases distractions, but most importantly, it reduces late-night wake-up calls, which is something we can all get behind!

Downloads

Reporting and alerting on job failure

Translated from: https://www.sqlshack.com/reporting-and-alerting-on-job-failure-in-sql-server/
