
How to use Datadog alerts and Thresholds to fail your load test


Background

Datadog is a monitoring and analytics platform that gives you visibility into the performance of your applications. Here at LoadImpact, we use Datadog to monitor several services of our platform. Datadog alerts let you know when critical changes are occurring in your system. Triggered alerts appear in Datadog’s Event Stream, where your team can collaborate on active issues in your applications or infrastructure.

One potential performance issue is that a system under test (SUT) consumes a lot of CPU when under stress. This tutorial shows how to fail your load test when that condition occurs, using the Datadog API together with thresholds in LoadImpact.


Requirements

  • A site or system to test. In this example, we test a site that is already running as an Amazon ECS service and is available at https://httpbin.test.loadimpact.com.
  • A configured Datadog integration with the platform your site runs on. In our case this is the Datadog AWS integration; see the official Datadog AWS Integration Guide for details.
  • A LoadImpact account with an appropriate subscription.
  • A Datadog account with permission to create monitors.

Create a Monitor in Datadog

First, create a monitor in Datadog that triggers an alert when the CPU utilization of the ECS service reaches 100 percent or more. You may want to monitor something else, so adjust the metric and threshold to meet your needs. When creating the monitor:

  • Choose Threshold Alert as the detection method
  • In the “Define the metric” step, choose the aws.ecs.service.cpuutilization metric from servicename:<your_service_name>
  • Set “Alert threshold” to 100
  • Fill in the message and notification steps, then save the monitor
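If you prefer to keep monitor definitions in code, the same monitor can also be created through Datadog’s monitor API. The sketch below only builds the request body; it assumes the v1 create-monitor endpoint (POST https://api.datadoghq.com/api/v1/monitor), a 5-minute averaging window, and the service name demosites-httpbin used in the test script later in this article. The helper name buildCpuMonitor is hypothetical.

```javascript
// Hypothetical helper: request body for Datadog's v1 create-monitor endpoint
// (assumed: POST https://api.datadoghq.com/api/v1/monitor), mirroring the UI steps.
function buildCpuMonitor(serviceName, threshold) {
  // Assumed 5-minute window: alert when average ECS service CPU is >= threshold.
  const query =
    `avg(last_5m):avg:aws.ecs.service.cpuutilization{servicename:${serviceName}} >= ${threshold}`;
  return {
    name: `High CPU on ECS service ${serviceName}`,
    type: "metric alert",
    query: query,
    message: `CPU utilization on ${serviceName} reached ${threshold} or more.`,
    // The critical threshold must match the value used in the query.
    options: { thresholds: { critical: threshold } },
    // Tag the monitor with the service it watches.
    tags: [`servicename:${serviceName}`],
  };
}

const monitor = buildCpuMonitor("demosites-httpbin", 100);
console.log(JSON.stringify(monitor, null, 2));
```

The request itself would be sent with your API and application keys in the request headers; the UI steps above achieve the same result.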


Now, whenever the metric reaches the threshold, the monitor’s alert appears in the Datadog Event Stream. This event is what the test script looks for when LoadImpact evaluates its thresholds later.


Create a test in LoadImpact

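Next, we need a test script. The script below puts load on the httpbin site and, in teardown(), queries the Datadog events API for events tagged with our service that were posted during the test run. Each matching event increments the custom Httpbin_CPU_Alert counter, and the “count < 1” threshold on that counter fails the test if any such event was found. Here is the script used in this tutorial: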

import http from "k6/http";
import { Counter } from "k6/metrics";
import { check, sleep } from "k6";

// Custom metric: incremented once for every matching Datadog event found in teardown()
const datadogHttpbinCpu = new Counter("Httpbin_CPU_Alert");

export let options = {
    stages: [
        { duration: "30s", target: 50 },  // ramp up to 50 VUs
        { duration: "600s", target: 50 }  // hold 50 VUs for 10 minutes
    ],
    thresholds: {
        "http_req_duration": ["p(95)<200"],  // 95% of requests must complete in under 200 ms
        "Httpbin_CPU_Alert": ["count < 1"]   // fail the test if any CPU alert event was found
    }
};

const datadogApi = "https://api.datadoghq.com/api/v1/";  // Datadog API endpoint
const datadogApiKey = "<YOUR_DATADOG_API_KEY>";          // Datadog API key, see below for how to get it
const datadogAppKey = "<YOUR_DATADOG_APPLICATION_KEY>";  // Datadog application key, see below for how to get it

// Request parameters for Datadog API calls: keys go in headers (not the URL),
// and the name tag groups these requests in the test results
const getDatadogParams = tagName => {
    return {
        headers: {
            "DD-API-KEY": datadogApiKey,
            "DD-APPLICATION-KEY": datadogAppKey
        },
        tags: { name: tagName }
    };
};

// setup() runs once before the load test; its return value is passed to teardown()
export function setup() {
    return Date.now();  // test start time in milliseconds
}

export default function () {
    let res = http.get("https://httpbin.test.loadimpact.com/");
    check(res, {
        "is status 200": (r) => r.status === 200
    });
    sleep(1);
}

// teardown() runs once after the load test: it queries the Datadog event stream
// for events within the test's time window and counts those tagged with our service
export function teardown(startTimeMs) {
    // The Datadog events API expects POSIX timestamps in seconds
    let startTime = Math.floor(startTimeMs / 1000);
    let endTime = Math.floor(Date.now() / 1000);
    let monitorTags = [
        "servicename:demosites-httpbin"
    ];

    monitorTags.forEach(tag => {
        let url = datadogApi + "events" +
            "?start=" + startTime +
            "&end=" + endTime +
            "&tags=" + encodeURIComponent(tag);
        let response = http.get(url, getDatadogParams("Datadog Event Stream"));
        if (response.status !== 200) {
            console.error("Datadog API returned status " + response.status);
            return;
        }
        let body = response.json();
        (body.events || []).forEach(event => {
            // Count every returned event that carries the tag we queried for
            if ((event.tags || []).includes(tag)) {
                datadogHttpbinCpu.add(1);
            }
        });
    });
}

See Datadog’s documentation on API and application keys for how to create and manage both keys.
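You can check the filtering done in teardown() outside k6 before spending a test run on it. The following plain-JavaScript sketch (runnable with Node.js) reproduces its two steps: converting millisecond timestamps to the seconds the Datadog events API expects, and counting returned events that carry the service tag. The helper names and the sample response are hypothetical; real Datadog responses contain more fields.

```javascript
// Hypothetical helpers mirroring the teardown() logic; plain JavaScript, no k6 APIs.

// Date.now() returns milliseconds; the Datadog events API takes seconds.
function toEpochSeconds(ms) {
  return Math.floor(ms / 1000);
}

// Build the events query URL for one tag and time window.
function buildEventsUrl(baseUrl, startMs, endMs, tag) {
  return baseUrl + "events" +
    "?start=" + toEpochSeconds(startMs) +
    "&end=" + toEpochSeconds(endMs) +
    "&tags=" + encodeURIComponent(tag);
}

// Count events in a parsed response body that carry the given tag.
function countTaggedEvents(body, tag) {
  const events = (body && body.events) || [];
  return events.filter((event) => (event.tags || []).includes(tag)).length;
}

// Made-up response body, for illustration only.
const sampleBody = {
  events: [
    { title: "[Triggered] High CPU", tags: ["servicename:demosites-httpbin"] },
    { title: "Unrelated event", tags: ["env:staging"] },
    { title: "Event without tags" },
  ],
};

console.log(buildEventsUrl("https://api.datadoghq.com/api/v1/",
  1700000000123, 1700000600999, "servicename:demosites-httpbin"));
// https://api.datadoghq.com/api/v1/events?start=1700000000&end=1700000600&tags=servicename%3Ademosites-httpbin
console.log(countTaggedEvents(sampleBody, "servicename:demosites-httpbin")); // 1
```

The helpers deliberately use only standard ECMAScript (encodeURIComponent rather than Node’s URLSearchParams), so the same functions could be moved into the k6 script unchanged.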

In LoadImpact, click “CREATE NEW TEST”, then choose “SCRIPTING” in the “WEBSITE/APP TESTING” section.

Paste in your script, give the test a descriptive name, and save it.


Run a test in LoadImpact

Run your test. Because our script runs only 50 VUs, that load is not enough to trigger the CPU alert, so the first run passes.

Next, update the script to put more load on the system under test. In this example, we increase the number of VUs from 50 to 150:

export let options = {
    stages: [
        { duration: "30s", target: 150 },  // ramp up to 150 VUs (previously 50)
        { duration: "600s", target: 150 }  // hold 150 VUs for 10 minutes (previously 50)
    ],
    thresholds: {
        "http_req_duration": ["p(95)<200"],
        "Httpbin_CPU_Alert": ["count < 1"]
    }
};
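Because only the VU target changes between the two runs, the stages could also be generated from a single number. The sketch below is plain JavaScript so it runs with Node.js; rampAndHold is a hypothetical helper, and in the k6 script the resulting object would be declared as export let options, as above.

```javascript
// Hypothetical helper: build the ramp-up-then-hold profile used in this article.
function rampAndHold(targetVUs, rampSeconds = 30, holdSeconds = 600) {
  return [
    { duration: rampSeconds + "s", target: targetVUs },  // ramp up to targetVUs
    { duration: holdSeconds + "s", target: targetVUs },  // hold targetVUs steady
  ];
}

// Same options as above, with the VU count as the only input.
const options = {
  stages: rampAndHold(150),
  thresholds: {
    http_req_duration: ["p(95)<200"],
    Httpbin_CPU_Alert: ["count < 1"],
  },
};

console.log(JSON.stringify(options.stages));
// [{"duration":"30s","target":150},{"duration":"600s","target":150}]
```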

After we increase the load, the test fails because the defined threshold value is exceeded.


See also