CI/CD Lesson 15 – Automated Testing | Dataplexa
Section II · Lesson 15

Automated Testing

In this lesson

Testing in CI/CD · The Test Pyramid · Test Speed & Reliability · Coverage & Thresholds · Tests in the Pipeline

Automated testing is the practice of running a suite of programmatic checks against an application — without human involvement — to verify that it behaves correctly after every change. In a CI/CD pipeline, automated tests are the mechanism that converts a green build into a deployable release. A build that compiles successfully but fails its tests is not ready for production. A build that passes every test is a verified candidate. The test suite is what gives the pipeline its confidence, and that confidence is what makes frequent, low-risk deployment possible.

Testing as a Pipeline Gate, Not an Afterthought

In teams without CI/CD, testing is often a phase — something that happens after development is done, usually performed by a dedicated QA team, usually close to a release deadline. The problem with this model is timing. A bug found the day after it was written takes minutes to fix. The same bug found three weeks later, after the developer has moved on to different work and the codebase has changed significantly, can take hours or days — and may require changes across multiple files to untangle.

CI/CD moves testing to the point of change. Every commit triggers the test suite. Every pull request must pass before it can be merged. The feedback loop collapses from weeks to minutes, and the cost of each bug collapses with it. The test stage is not a quality checkpoint at the end of the pipeline — it is a continuous quality signal running at every step of it.
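Wiring the suite to the point of change is a one-time configuration step. As a minimal sketch in GitHub Actions (the tool used later in this lesson), the workflow trigger below runs on every push to main and on every pull request:

```yaml
# Minimal trigger sketch: run the test workflow on every push to main
# and on every pull request, so no change merges without passing the suite.
on:
  push:
    branches: [main]
  pull_request:
```

Combined with a branch protection rule requiring the test job to pass, this makes the suite a hard gate rather than an optional check.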

The Aircraft Pre-Flight Checklist Analogy

A pilot does not perform a safety check once a month and then fly every day trusting that nothing has changed. The pre-flight checklist runs before every single flight — regardless of how many times the plane has flown before, regardless of how experienced the pilot is. Automated tests are the pre-flight checklist for software. They run before every deployment, on every change, not because the team distrusts their developers, but because the cost of finding a problem in the air is categorically higher than the cost of finding it on the ground.

The Test Pyramid — Balancing Speed, Coverage, and Cost

Not all tests are equal in cost or speed. The test pyramid is the mental model that guides how a healthy test suite is structured. It describes three layers of tests, each with a different scope, speed, and quantity. The pyramid shape is intentional — many fast, cheap tests at the base; fewer slow, expensive tests at the top.

The Test Pyramid — Three Layers

🔬
Unit Tests — the base
Test a single function, method, or class in complete isolation. No database, no network, no external services. Run in milliseconds. A well-maintained codebase has hundreds or thousands of them. They are the cheapest form of verification and the first to run in any pipeline.
🔗
Integration Tests — the middle
Test how multiple components work together — a service calling a database, an API handler calling a downstream service. Slower than unit tests because they require real or simulated dependencies. Run in seconds to minutes. Catch problems that unit tests cannot: incorrect SQL, mismatched API contracts, configuration errors.
🌐
End-to-End Tests — the tip
Test the entire application from the user's perspective — a real browser driving real UI flows against a running instance. Slowest and most expensive. Catch problems that only appear when all components are running together. Keep these to critical paths only — login, checkout, core user journeys. Covered further in Lesson 16.

Test Speed and Reliability in the Pipeline

A slow test suite produces a pipeline that developers learn to ignore or work around. If every PR triggers a 45-minute test run, developers stop waiting for results, start merging speculatively, and lose the tight feedback loop that CI/CD is supposed to provide. Test suite speed is not a vanity metric — it is a direct measure of how useful the pipeline is in practice.

Flaky tests — tests that pass or fail intermittently without any change to the code — are the other major reliability threat. A flaky test erodes trust in the entire suite. When developers see a red pipeline, their first instinct should be "something broke" — not "probably just a flaky test, let me re-run it." Every flaky test that is tolerated trains the team to distrust the pipeline, and a distrusted pipeline is an ineffective one. Flaky tests must be fixed or quarantined immediately.

Techniques for Keeping Tests Fast

Run unit tests first
Unit tests are the fastest and catch the most common errors. Let the pipeline fail fast on them before spending time on slower integration or E2E suites.
Parallelise test jobs
Split the test suite across multiple pipeline jobs running concurrently. GitHub Actions supports job-level parallelism natively — a 10-minute suite can become 2 minutes across 5 parallel runners.
Cache test dependencies
Restore the dependency cache (as covered in Lesson 12) so the install step does not eat into the test budget on every run.
Reserve E2E for merge to main
Run expensive end-to-end tests only on merge to the main branch, not on every PR push. PRs get fast feedback from unit and integration tests; the full suite runs as a final gate.
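As one concrete sketch of parallelisation, a GitHub Actions matrix can shard the suite across runners. The `--shard` flag below is Jest's (available since Jest 28); other runners expose their own splitting mechanisms.

```yaml
# Hypothetical sharding sketch: split the suite across 5 parallel runners.
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4, 5]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      # Each job runs one fifth of the suite; all five run concurrently.
      - run: npx jest --shard=${{ matrix.shard }}/5
```

In the best case this divides wall-clock time by the shard count, minus per-job setup overhead — which is why caching dependencies matters more, not less, once jobs are parallelised.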

Code Coverage and Thresholds

Code coverage measures the percentage of the codebase that is exercised by the test suite — which lines, branches, and functions are touched during a test run. It is a useful signal, but it is not the same as quality. A codebase can have 100% coverage and still have terrible tests if the assertions are weak. Coverage tells you what the tests touch; it says nothing about whether those tests would catch a real bug.

Coverage thresholds in the pipeline — failing the build if coverage drops below a minimum — are a reasonable guard against regressions. A threshold of 80% does not mean the code is 80% correct; it means the team has agreed not to let untested code accumulate unchecked. Setting the threshold too high (99%+) incentivises writing tests for the sake of coverage rather than for correctness. Most teams find 70–85% a practical range for enforced pipeline thresholds.
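A threshold can also live in the project's test configuration rather than on the command line, so it is enforced on every run — local or CI. The sketch below assumes Jest; other tools (nyc, pytest-cov) have analogous settings.

```javascript
// jest.config.js — threshold enforced on every test run.
module.exports = {
  collectCoverage: true,
  coverageThreshold: {
    global: {
      lines: 80,     // fail the run if line coverage drops below 80%
      branches: 70,  // branch coverage typically sits a little lower
    },
  },
};
```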

Automated Test Stage — GitHub Actions

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - run: npm ci                           # Clean install from lock file

      - run: npm test -- --coverage --coverageThreshold='{"global":{"lines":80}}'
        env:
          CI: true                            # Many test runners behave differently in CI mode
        # Threshold is passed to the same run so the suite executes only once;
        # the pipeline fails if line coverage drops below 80%

      - uses: actions/upload-artifact@v4      # Store coverage report for review
        with:
          name: coverage-report
          path: coverage/

What just happened?

The pipeline ran the full test suite, collected coverage data, and enforced an 80% line coverage threshold — failing the build if any PR reduces coverage below that floor. The coverage report was uploaded as an artifact so developers can review exactly which lines are untested without leaving the pipeline UI.

Warning: Tolerating Flaky Tests Destroys Pipeline Trust

The moment a team accepts "just re-run it" as the response to a failing test, the pipeline stops being a reliable quality signal. Flaky tests teach developers to distrust red builds — and a distrusted pipeline is one that gets bypassed. Every flaky test must be treated as a production incident: investigated, fixed, or explicitly quarantined with a tracking ticket. The standard for pipeline reliability must be that a red build always means something is wrong, without exception.

Key Takeaways from This Lesson

Automated tests are the confidence mechanism behind frequent deployment — they turn a green build into a deployable release by verifying behaviour after every single change, not just before major releases.
The test pyramid guides suite composition — many fast unit tests at the base, fewer integration tests in the middle, a small number of critical end-to-end tests at the top. Inverting this pyramid produces a slow, brittle suite.
Flaky tests must be fixed or quarantined immediately — tolerating intermittent failures trains developers to distrust the pipeline, and a distrusted pipeline gets bypassed. A red build must always mean something is wrong.
Code coverage is a signal, not a guarantee — 100% coverage does not mean 100% correct. Coverage thresholds in the pipeline guard against untested code accumulating, not against weak assertions.
Test suite speed is a pipeline health metric — parallelising jobs, running unit tests first, and reserving E2E tests for merge-to-main keep feedback loops tight and developer trust high.

Teacher's Note

Track your pipeline's test stage duration over time. If it crosses 10 minutes on a PR run, it is time to parallelise — developers waiting more than 10 minutes for feedback start multitasking, and multitasking is where context is lost and bugs slip through.

Practice Questions

Answer in your own words — then check against the expected answer.

1. What is the term for tests that pass or fail intermittently without any change to the code — the category of test that most damages developer trust in the pipeline and must be fixed or quarantined immediately?



2. What is the name of the mental model that guides how a healthy test suite is structured — describing three layers of tests from fast, isolated unit tests at the base to slow, broad end-to-end tests at the tip?



3. What metric measures the percentage of lines, branches, and functions in a codebase that are exercised by the test suite — useful as a pipeline threshold guard but not a direct measure of test quality?



Lesson Quiz

1. A developer wants to add a test that verifies a single utility function returns the correct output for a given input, with no database or network involved. What type of test is this and what makes it valuable in a pipeline?


2. A test fails intermittently on the CI pipeline but always passes when run locally. The team decides to just re-run the pipeline when it happens. What is the long-term consequence of this decision?


3. A codebase has 95% line coverage but still ships a critical bug. A manager asks how this is possible if the tests cover almost everything. What is the correct explanation?


Up Next · Lesson 16

Test Types in Pipelines

Beyond the pyramid — smoke tests, contract tests, performance tests, and security tests each have a specific role in the pipeline and a specific point at which they deliver the most value.