Designing CI/CD pipelines in 2023

24 Dec 2022 - Giulio Vian - ~17 Minutes

Here’s a long article just in time for the Christmas holidays.

Today I want to give you an overview of what to expect in a modern Continuous Integration / Continuous Delivery (CI/CD) pipeline. As usual, I won’t focus on specific technologies but on the process, hinting at products and technical solutions to exemplify concepts.

In my humble opinion, the best approach to designing a process, or even an architecture, is to move backwards: from the final goal and its fundamental requirements all the way back to product evolution planning. In practice, it is difficult to build a path without clear objectives; in all likelihood, what gets built will satisfy someone’s ego while leaving the product’s users, or those who design and implement it, dissatisfied.

I’ve broken down the building of a software-based product or service into nine general steps. In what follows, “pipeline” is a logical object that may not correspond to a single physical one: it may be a concatenation of physical pipelines, for example split between CI and CD. So let’s start from the last phase, moving backward, and see what is needed…

9. After releasing in production

Five elements to consider for the system in operation are:

operational monitoring metrics (business metrics) such as the number of unique active users and of anonymous ones, how many users complete a certain transaction, etc.;
technical monitoring (technical operational metrics), with common metrics like resource consumption (CPU, memory, storage, network) and more complex ones like running costs (server rental, licenses, etc.), without forgetting finer-grained metrics such as the number of concurrent connections and threads, transaction duration, etc.;
log analysis to identify anomalies, in particular attacks, and trends in the behavior of both the system and the users;
the user reporting system and technical support;
user documentation.

The combination of business requirements and technical architecture constrains the possible implementations of monitoring. For example, a desktop application will rely on telemetry, while a website can be monitored directly. The user documentation for a library or SDK will be a website with a filter for the library version; for a mobile application it could be animations and videos embedded in the app itself.

Recommendation: regardless of the kind of system or application, all five elements above will always need to be covered, with tools and data that can be correlated with one another.

Examples: Prometheus, Grafana, OpenTelemetry, IT service management (ITSM) tools, Jira, ServiceNow, PagerDuty.
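To make the first two elements concrete, here is a minimal sketch that renders one business metric and one technical metric in the Prometheus text exposition format using only the Python standard library. The metric names, labels and values are hypothetical; in practice a client library would normally produce this output.

```python
# Minimal sketch: render metric families in the Prometheus text exposition format.
# Metric names, labels and values are hypothetical; label escaping is omitted.

def render_metric(name, help_text, metric_type, samples):
    """Return exposition text for one metric family; samples = [(labels, value)]."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Business metric: completed checkouts, split by kind of user.
business = render_metric(
    "shop_checkouts_completed_total",
    "Checkout transactions completed.",
    "counter",
    [({"user": "registered"}, 120), ({"user": "anonymous"}, 45)],
)
# Technical metric: currently open connections.
technical = render_metric(
    "app_open_connections", "Concurrent open connections.", "gauge", [({}, 17)]
)
print(business + technical)
```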

Let’s not forget other operational elements that the pipeline might configure or deploy, for example the space for backup storage and the installation of backup agents via infrastructure code.

8. Releasing in production

The main objectives specific to this phase (tactical objectives) are:

minimize the inconvenience for the user (interruption of service, duration of the update, etc.),
be sure that what has been released has no more serious problems than the version it replaces (smoke tests),
inform interested parties (stakeholders) of the release.

Recommendation: adopt architectures that allow deploying releases with minimal disruption and that limit the damage in the event of a failure (blast radius). Analyse which architectural decisions made earlier (upstream) may be hindering the release and work to improve the situation.

The more persistent a component is, the smaller its changes should be, and the more checks and reviews they should undergo. At the top of the persistence scale we find databases, whether relational, NoSQL or other: modifications can be destructive and completely block the service. Changes to the infrastructure are a notch below, assuming that storage and databases are protected elements in infrastructure definitions. At the bottom of the scale we find running processes, which are mostly volatile, whether they are containers, virtual machines or batch processes. Batch processes, however, may have a certain degree of persistence to consider when designing a release.

It is essential to have both pre- and post-release tests. The tests run before deploying should verify the state of the target system, for example the version of the operating system, the version of Kubernetes or the schema version of a database. After any change to the target system there must be a test assessing that the configuration matches the post-conditions of the deployment procedure; lastly, a general test verifies the health of the target system and that the most important features work correctly.
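The three kinds of checks can be sketched as follows. This is a minimal illustration, assuming a plain dictionary stands in for probes of the real system (database schema version, health endpoint, key features); in a real pipeline the change itself is performed by external scripts, which is why the post-condition check is not redundant.

```python
# Minimal sketch of pre-deployment, post-change and overall health checks.
# The "target" dict is a hypothetical stand-in for probes of the real system.

def check_preconditions(target, expected_schema):
    # Before deploying: is the target in the state the procedure assumes?
    return target["schema_version"] == expected_schema

def check_postconditions(target, new_schema):
    # After the change: does the configuration match the post-conditions?
    return target["schema_version"] == new_schema

def smoke_test(target, critical_features):
    # Overall health: the system is up and the most important features respond.
    return target["healthy"] and all(
        target["features"].get(f, False) for f in critical_features
    )

def deploy(target, from_schema, to_schema, critical_features):
    if not check_preconditions(target, from_schema):
        raise RuntimeError("pre-condition failed: unexpected schema version")
    target["schema_version"] = to_schema  # the real change would happen here
    if not check_postconditions(target, to_schema):
        raise RuntimeError("post-condition failed")
    if not smoke_test(target, critical_features):
        raise RuntimeError("smoke test failed")
    return "deployed"

system = {"schema_version": 41, "healthy": True,
          "features": {"login": True, "checkout": True}}
print(deploy(system, 41, 42, ["login", "checkout"]))
```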

Each deployment step must have a counter-action to apply in case of failure in order to restore operations.
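A minimal sketch of this idea, assuming each step is a pair of hypothetical callables (action, counter-action): when a step fails, the steps already completed are undone in reverse order. The failing step itself is not undone, since it never completed; a real procedure must decide case by case whether a partial step needs cleanup too.

```python
# Minimal sketch: every deployment step carries a counter-action.
# On failure, completed steps are undone in reverse order.
# Step names and actions are hypothetical placeholders.

def run_with_rollback(steps, log):
    done = []
    try:
        for name, action, undo in steps:
            action()
            log.append(f"done: {name}")
            done.append((name, undo))
    except Exception as exc:
        log.append(f"failed: {exc}")
        for name, undo in reversed(done):
            undo()
            log.append(f"undone: {name}")
        return False
    return True

def switch_traffic():
    raise RuntimeError("load balancer unreachable")

log = []
steps = [
    ("stop old workers", lambda: None, lambda: None),     # undo: restart them
    ("deploy new binaries", lambda: None, lambda: None),  # undo: restore previous ones
    ("switch traffic", switch_traffic, lambda: None),
]
ok = run_with_rollback(steps, log)
print(ok, log)
```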

Both initiation and completion of the release must post data into the Service Management and Operations systems. Through ITSM or directly from the pipeline, users are informed and receive documentation updates including release notes. This is all the more essential if your organisation is subject to regulations.

Finally, the passage into production must be traced in detail and may require approval steps. In addition to the ITSM tool, monitoring tools must also be notified of important events such as release dates, start of deployment, warm-up phases, completion of deployment, etc.
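The events the pipeline publishes can be modelled as simple records. The sketch below only builds and serializes them; the field names and event names are hypothetical, and sending them to a real ITSM or monitoring tool would have to follow that tool’s own API.

```python
# Minimal sketch: release event records a pipeline could post to an ITSM tool
# and to monitoring (e.g. as dashboard annotations). Field and event names are
# hypothetical; real tools define their own APIs.
import json
from datetime import datetime, timezone

def release_event(stage, release_id, environment, when=None):
    when = when or datetime.now(timezone.utc)
    return {
        "event": stage,
        "release": release_id,
        "environment": environment,
        "timestamp": when.isoformat(),
    }

def to_payload(event):
    # Deterministic serialization: the same event always yields the same body.
    return json.dumps(event, sort_keys=True)

fixed = datetime(2022, 12, 24, 10, 0, tzinfo=timezone.utc)
events = [release_event(s, "1.4.0+build.213", "production", fixed)
          for s in ("deployment-started", "warm-up", "deployment-completed")]
for e in events:
    print(to_payload(e))  # a real pipeline would POST this to each tool
```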

Architectures: Blue-green, canary deployments, API gateways, reverse proxies.

Examples: GitHub Actions, Azure DevOps Pipelines, GitLab, Octopus Deploy, Atlassian Bitbucket/Bamboo, ArgoCD, Flux.
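As an illustration of the canary idea, here is a minimal sketch of the routing decision alone, assuming users are identified by a string ID. Hashing the ID gives each user a stable bucket, so a fixed percentage of users sees the new version and each user always lands on the same side. In practice this logic usually lives in a load balancer, service mesh or API gateway; the choice of SHA-256 here is arbitrary.

```python
# Minimal sketch of canary routing by stable hashing of the user id.
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"

users = [f"user-{i}" for i in range(1000)]
share = sum(route(u, 10) == "canary" for u in users) / len(users)
print(f"canary share: {share:.1%}")
```

Raising canary_percent step by step widens the audience of the new version without moving users who were already on it.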

Explore: micro-service architectures should not be adopted lightly. While they reduce the impact of a release, they also complicate troubleshooting, as all distributed systems do.

7. Pre-production Releases

The purpose of pre-production releases is always (and only) to validate and verify: they have no intrinsic value, and their only purpose is to prepare us to go into production, because only production releases give value to users.

For the releasing team, each pre-production release is an opportunity to validate the automated process and understand its timing. For teams that depend on our releases, it is the time to re-run their integration tests and verify that our APIs are still compatible.

Recommendation: the release process should be automated as much as possible, preferably with a declarative approach, leaving almost nothing to be done manually. In particular:

the infrastructural resources necessary for the system (Infrastructure-as-Code),
the configurations required for the resources (Configuration-as-Code),
storage and database modifications such as SQL scripts,
changes to application components such as binaries and their launch.

The release pipeline must include many categories of automated tests such as:

integration testing;
API-level testing;
validation of the User Interface;
a variety of functional (e.g. generative) tests;
non-functional tests including:
- Dynamic application security testing (DAST),
- Interactive application security testing (IAST),
- Performance, scalability and reliability testing,
- chaos testing,
- Internationalization (I18N),
- accessibility (A11Y);
probes and statistics to optimize future test executions (test impact analysis) and minimize release times.
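Test impact analysis can be sketched in a few lines, assuming per-test coverage data (which files each test exercised) was collected in earlier runs. Only the tests touching the changed files are selected, plus tests without coverage data and any tests the team marks as always required. Test names and file paths are hypothetical.

```python
# Minimal sketch of test impact analysis: select only the tests whose recorded
# coverage touches the files changed in this commit.

def select_tests(coverage_map, changed_files, always_run=()):
    changed = set(changed_files)
    selected = {t for t, files in coverage_map.items() if changed & set(files)}
    # Tests with no coverage data yet must run, or they would never be exercised.
    selected |= {t for t, files in coverage_map.items() if not files}
    selected |= set(always_run)
    return sorted(selected)

coverage_map = {
    "test_checkout": ["cart.py", "payment.py"],
    "test_login": ["auth.py"],
    "test_search": ["search.py", "index.py"],
    "test_new_feature": [],  # no coverage data yet
}
print(select_tests(coverage_map, ["payment.py"], always_run=["test_login"]))
```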

Obviously the presence of automated tests does not exclude manual testing.

The release pipeline and related scripts must not change depending on the target environment, i.e. the scripts must be identical for all environments except for parameter values.
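A minimal sketch of the principle, assuming hypothetical environment names and parameters: there is a single deployment routine, and the environment only selects a set of parameter values. In a real pipeline those values would typically live outside the script, in pipeline variables or per-environment variable files.

```python
# Minimal sketch: one deployment routine for every environment; only the
# parameter values differ. Environment names and values are hypothetical.

PARAMETERS = {
    "test":       {"replicas": 1, "db_host": "db.test.example.internal"},
    "staging":    {"replicas": 2, "db_host": "db.staging.example.internal"},
    "production": {"replicas": 6, "db_host": "db.prod.example.internal"},
}

def deploy(package: str, environment: str) -> list[str]:
    params = PARAMETERS[environment]  # the only environment-dependent input
    # The same steps, in the same order, for every environment.
    return [
        f"install {package}",
        f"configure db_host={params['db_host']}",
        f"scale replicas={params['replicas']}",
    ]

for env in PARAMETERS:
    print(env, deploy("shop-1.4.0.tgz", env))
```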

Essential objectives of pre-production releases are:

validating the automated release process,
measuring the timing of the process itself to determine outage windows.
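Measuring the timing can be as simple as wrapping each step with a timer during pre-production runs. In the sketch below, the steps, which of them interrupt service, and the 1.5 safety factor (to allow for production being slower) are all hypothetical assumptions.

```python
# Minimal sketch: time each release step and estimate the outage window as the
# padded total of the steps that interrupt service.
import time

def timed_run(steps):
    timings = []
    for name, action, interrupts_service in steps:
        start = time.perf_counter()
        action()
        timings.append((name, time.perf_counter() - start, interrupts_service))
    return timings

def outage_window(timings, safety_factor=1.5):
    # Only service-interrupting steps count; pad because production may be slower.
    return safety_factor * sum(d for _, d, interrupts in timings if interrupts)

steps = [
    ("copy binaries", lambda: time.sleep(0.02), False),
    ("migrate schema", lambda: time.sleep(0.05), True),
    ("restart service", lambda: time.sleep(0.01), True),
]
timings = timed_run(steps)
print(f"estimated outage window: {outage_window(timings):.3f} s")
```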

Examples: Terraform, Ansible, Docker, scripting languages (bash, PowerShell, Python), database change management tools (Flyway, Liquibase, Redgate, DBmaestro).

6. From code to release package

The purpose of this phase is to aggregate in a single place everything required for the release regardless of the environment in which it will be released.

The activities of this phase are:

collecting all sources, scripts, tools and documentation to be included in the release package,
generating, compiling and transpiling into the formats required by deployment, and analysing all source code that, directly or indirectly, will be used in the deployment,
executing tests (see previous section) and collecting all metadata associated with this release (build logs, analysis results, tests, etc.),
publishing the release package.

More often than not, the aggregation point is a single file (tar, zip, tgz, etc.) dropped in a specific location. In some cases, the package is a collection of files: a single logical object decomposed for convenience. A couple of examples may clarify the latter. Take Node.js, which offers a different installation package for each platform (x64, x86, ARMv7, ARM64, etc.): it is the same release in different builds, so the release is not a single file but many, one per platform. Another example is a static site that is updated using rsync or robocopy.
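For the common single-file case, a minimal sketch using the Python standard library: the release directory is archived into one .tar.gz, and its SHA-256 digest is written alongside, so later stages can verify they received exactly these bytes. Directory and file names are hypothetical.

```python
# Minimal sketch: aggregate a release directory into one .tar.gz and record
# its SHA-256 digest next to it. Names are hypothetical.
import hashlib
import tarfile
import tempfile
from pathlib import Path

def build_package(source_dir: Path, output: Path) -> str:
    with tarfile.open(str(output), "w:gz") as tar:
        # Add files in sorted order so the archive layout is predictable.
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                tar.add(str(path), arcname=path.relative_to(source_dir).as_posix())
    digest = hashlib.sha256(output.read_bytes()).hexdigest()
    output.with_name(output.name + ".sha256").write_text(f"{digest}  {output.name}\n")
    return digest

# Usage with a throwaway directory standing in for the real build output.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "release")
    (src / "db").mkdir(parents=True)
    (src / "app.bin").write_bytes(b"compiled application")
    (src / "db" / "001_create_tables.sql").write_text("CREATE TABLE orders (id INT);")
    print(build_package(src, Path(tmp, "shop-1.4.0.tar.gz")))
```

Note that database scripts travel inside the same archive as the binaries.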

Now let’s talk about the package content. It must be the same for all environments, following the Only Build Your Binaries Once principle popularized by the book Continuous Delivery (p. 113 et seq.).
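One way to enforce “build once” is to identify the package by its digest and refuse promotion unless production receives byte-for-byte the package that passed the earlier stages. A minimal sketch, where the environment names and the history record are hypothetical:

```python
# Minimal sketch of "build once, promote many times": promotion is refused
# unless the package digest matches what earlier stages validated.
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def promote(package: bytes, history: dict, target: str, required=("test", "staging")):
    d = digest(package)
    missing = [env for env in required if history.get(env) != d]
    if missing:
        raise ValueError(f"package {d[:12]} not validated in: {', '.join(missing)}")
    history[target] = d
    return d

history = {}
pkg = b"contents of shop-1.4.0.tar.gz"
history["test"] = history["staging"] = digest(pkg)
promote(pkg, history, "production")  # accepted: same bytes as tested
```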

Is there a case where the package may be different for production? Yes, but it is also the only case I know of: when the package must be cryptographically signed or contains encrypted elements. The most common scenario is a mobile application to be published in a store. Understandably, the signing key for production differs from the non-production one, and the production key is kept in a safe place. Unsigned data, i.e. the binaries before signing, must always be the same. The simplest technique is to use two stages: in the first, CI produces an unsigned package; in the second, the package is signed and archived before being deployed to the target environment.
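The two-stage technique can be sketched as follows. HMAC-SHA256 is used here only as a stand-in for real code signing, which relies on asymmetric keys and platform-specific tools; the key values are hypothetical. What matters is that both environments sign the very same unsigned payload.

```python
# Minimal sketch of two-stage signing. Stage 1 (CI) builds one unsigned package;
# stage 2 signs that same package with an environment-specific key.
# HMAC-SHA256 stands in for real code signing; keys are hypothetical.
import hashlib
import hmac

def stage1_build() -> bytes:
    return b"unsigned application binaries"  # built once, for all environments

def stage2_sign(package: bytes, key: bytes) -> dict:
    signature = hmac.new(key, package, hashlib.sha256).hexdigest()
    return {"payload": package, "signature": signature}

def verify(signed: dict, key: bytes) -> bool:
    expected = hmac.new(key, signed["payload"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])

package = stage1_build()
test_build = stage2_sign(package, key=b"non-production-key")
prod_build = stage2_sign(package, key=b"production-key-from-vault")
```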

In no other case should packages differ between production and non-production.

Don’ts: retrieving release files from multiple places; leaving release scripts, including database updates, out of the package.

What to do with technologies that assume different packages for production and for development? For example, React offers react.production.min.js and react.development.js. The general rule is that development (debug) builds must remain confined to development: the release package contains only production builds, compiled in release mode with the related optimizations. Development versions must stay on developers’ machines and never be used in any pre-production environment.

But what if I need to debug in a test environment? The correct approach is to include debug information (symbols) in the release package. Installing the symbols in the target environment must be optional (you don’t want an attacker to leverage your symbols), and they must be generated at the same time as the rest of the package.

Recommendation: the package construction must include the maximum amount of static checks possible with predefined rules on the allowed tolerance, e.g.:

the compiler must not generate warnings,
parsers or linters, such as Sonar, do not generate warnings,
quality metrics do not exceed alarm thresholds,
no security issues detected by the Static application security testing (SAST) tool,
3rd-party libraries have no security or licensing issues according to the Software Composition Analysis (SCA) tool,
all automated tests that can run without installing the system (unit tests and the applicable part of integration tests) pass,
other test metrics are within tolerance (test coverage rate, re-test failures, test duration, etc.).
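Rules like these are often implemented as a quality gate: metrics collected during the build are compared with predefined tolerances, and any violation fails the build. A minimal sketch, where metric names and thresholds are hypothetical and missing data counts as a failure:

```python
# Minimal sketch of a quality gate with predefined tolerances.
# Metric names and thresholds are hypothetical.

RULES = {
    "compiler_warnings": ("max", 0),
    "linter_warnings": ("max", 0),
    "sast_issues": ("max", 0),
    "sca_license_issues": ("max", 0),
    "failed_tests": ("max", 0),
    "line_coverage_pct": ("min", 80.0),
}

def quality_gate(metrics: dict) -> list[str]:
    violations = []
    for name, (kind, limit) in RULES.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: missing")  # absent data is a failure
        elif kind == "max" and value > limit:
            violations.append(f"{name}: {value} > {limit}")
        elif kind == "min" and value < limit:
            violations.append(f"{name}: {value} < {limit}")
    return violations

build_metrics = {"compiler_warnings": 0, "linter_warnings": 2, "sast_issues": 0,
                 "sca_license_issues": 0, "failed_tests": 0, "line_coverage_pct": 76.5}
problems = quality_gate(build_metrics)
print("PASS" if not problems else "FAIL: " + "; ".join(problems))
```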

The list is indicative, and we can reasonably foresee that tools will integrate more of these capabilities, so the distinction between the various types of analysis will become more theoretical than practical. Avoid analysing code that is not intended for release (e.g. unit test projects): it wastes resources and only generates noise.

The package must have a clear identification and intrinsic traceability (source version, build identifier), where intrinsic means part of the package itself. Many formats, such as Maven JARs, NuGet and npm packages, allow you to store this information, but leave the responsibility for filling it in to the developer.
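For a generic zip-based package, intrinsic traceability can be as simple as writing a small metadata file inside the archive. In this minimal sketch the file name build-info.json, its fields and the identifier values are hypothetical; Maven, NuGet and npm each have their own metadata conventions.

```python
# Minimal sketch: write build identification into the package itself, so it
# stays traceable wherever the package is copied. Names and values are hypothetical.
import io
import json
import zipfile

def add_build_info(zip_file, source_commit, build_id, version):
    info = {"version": version, "source_commit": source_commit, "build_id": build_id}
    with zipfile.ZipFile(zip_file, "a") as zf:
        zf.writestr("build-info.json", json.dumps(info, indent=2))

def read_build_info(zip_file):
    with zipfile.ZipFile(zip_file) as zf:
        return json.loads(zf.read("build-info.json"))

buffer = io.BytesIO()  # stands in for the package file on disk
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("app/main.py", "print('hello')")
add_build_info(buffer, "3f2c9e1", "ci-2022.12.24.7", "1.4.0")
print(read_build_info(buffer))
```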

Are there other things besides the release package produced in this stage? Sure thing. I personally recommend:

debugging symbols,
release notes,
test and code quality data,
Software Bill of Materials (SBOM).
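As an example of the last item, here is a minimal sketch emitting a small CycloneDX-style SBOM in JSON for a package’s third-party dependencies. Only a handful of fields are shown, the dependency list is hypothetical, and a real SBOM is usually produced by a dedicated tool rather than by hand.

```python
# Minimal sketch: a small CycloneDX-style SBOM (JSON) with a few fields only.
# The npm dependency list is hypothetical.
import json

def make_sbom(dependencies):
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.4",
        "version": 1,
        "components": [
            {"type": "library", "name": name, "version": version,
             "purl": f"pkg:npm/{name}@{version}"}  # package URL identifier
            for name, version in dependencies
        ],
    }

sbom = make_sbom([("react", "18.2.0"), ("lodash", "4.17.21")])
print(json.dumps(sbom, indent=2))
```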

5. Code integration

Now let’s look at how to include code changes into the pipeline.

In my experience, the branching strategy (trunk-based, release branches, git-flow) matters little for achieving Continuous Delivery. The choice depends on many factors (team size, update frequency, shared programming practices like pair programming, etc.); what is crucial is that the branching model is defined in detail and agreed upon by all teams that collaborate on the codebase.

To support this phase, modern development platforms offer valuable features under the Pull request (PR) umbrella. Looking closer, a PR is typically made up of different elements:

a communication aspect because it informs the team of a code change;
approval by reviewers, usually excluding the author of the change;
a moment of formal verification for audit and regulatory purposes;
automation of quality controls through PR-triggered builds and notifications to other tools such as Sonar;
standardization of commit message formats, association with work items/tickets/issues, the allowed merge types and more.

These elements are typically modular: by turning them on and off, the PR can be adapted to the process the team is using. Some Agile enthusiasts stress the downsides of PRs, which they see as an anti-pattern, but they often overlook the value of PRs in managing quality aspects. In this respect, modern tools are very powerful and flexible. The more primitive mechanism of the pre-commit hook allows you to implement part of the above, but far less easily.
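To show what the hook-based approach looks like, here is a minimal sketch of a Git commit-msg hook, a sibling of the pre-commit hook: Git passes the path of the file holding the message as the first argument, and a non-zero exit status aborts the commit. The Jira-style ticket key (e.g. SHOP-123) and the 72-character subject limit are hypothetical team rules.

```python
#!/usr/bin/env python3
# Minimal sketch of a commit-msg hook (save as .git/hooks/commit-msg, executable).
# Ticket-key pattern and subject length limit are hypothetical team rules.
import re
import sys

TICKET = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def check_message(message: str) -> list[str]:
    errors = []
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    if not subject.strip():
        errors.append("empty subject line")
    elif len(subject) > 72:
        errors.append("subject longer than 72 characters")
    if not TICKET.search(message):
        errors.append("no ticket reference such as SHOP-123")
    return errors

def main(argv):
    with open(argv[1], encoding="utf-8") as f:
        # Drop the comment lines Git adds to the message template.
        message = "\n".join(
            line for line in f.read().splitlines() if not line.startswith("#"))
    errors = check_message(message)
    for e in errors:
        print(f"commit-msg: {e}", file=sys.stderr)
    return 1 if errors else 0

if __name__ == "__main__" and len(sys.argv) > 1:  # Git supplies the message file
    sys.exit(main(sys.argv))
```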

Recommendation: modern platforms like Azure DevOps, Bitbucket, GitHub and GitLab offer many useful features under the Pull Request umbrella. Study how to take advantage of them without adding burden to your process.

4. The quality of the code

What are the fundamental characteristics that our code must possess to fit in the release process? In my opinion these three:

ease of release and deployment,
the possibility of simple and safe configuration distinct from the distribution package,
finally, it must be implemented in such a way that it is observable once in operation.

A discussion of architectures and their relation to release simplicity is beyond the scope of this article; I hope to work on it in the future. Here, I simply assert that it is possible (and easy) to make any architecture unnecessarily complex, and thus difficult to deploy. What do I mean by “simplicity”? The design of an application or system is simple to release (release friendly) when it is possible to change a single module without releasing or recompiling the entire application or system. In addition, it is simple only if it has a limited number of deployment processes (at most four or five, usually one per node role), i.e. scripts that install the respective portion of the system.

It’s obvious that an application is parametric and configurable, isn’t it? Unfortunately I continue to see sources with hardcoded configuration values: connection strings to databases, URLs, certificate identifiers, up to passwords and API keys. Part of the blame lies with bad examples in languages or frameworks documentation, or with basic programming courses that don’t use production-friendly approaches.

Which data goes into configuration? Basically two categories: external object identifiers (like machine names or URLs) and behavior parameters (e.g. polling frequency or connection timeout).
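A minimal sketch of avoiding hardcoded values: external identifiers and secrets are read from environment variables injected by the deployment, behaviour parameters get a documented default, and the application fails fast if something essential is missing. The variable names (SHOP_DB_URL, SHOP_API_KEY, SHOP_POLL_SECONDS) are hypothetical.

```python
# Minimal sketch: configuration from the environment instead of hardcoded values.
# Variable names are hypothetical.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    db_url: str          # external object identifier
    api_key: str         # secret: injected at deployment, never committed
    poll_seconds: float  # behaviour parameter, unit stated in the name

def load_settings(env=os.environ) -> Settings:
    missing = [k for k in ("SHOP_DB_URL", "SHOP_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing configuration: {', '.join(missing)}")
    return Settings(
        db_url=env["SHOP_DB_URL"],
        api_key=env["SHOP_API_KEY"],
        poll_seconds=float(env.get("SHOP_POLL_SECONDS", "30")),
    )
```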

The configuration format should be as simple, intuitive, and self-descriptive as possible. Unfortunately many of the formats in common use fall short. The configuration must also be easy to modify automatically, with a scripting language.
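As a small illustration, the sketch below modifies a JSON configuration from a script, refusing unknown keys so that a typo is not silently ignored. Keys carry their units in the name, partly because JSON, a widespread format, does not allow comments, one of the ways common formats fall short. Keys and values are hypothetical.

```python
# Minimal sketch: a self-descriptive JSON configuration modified by a script.
# Keys and values are hypothetical.
import json

config_text = """{
  "payment_service_url": "https://payments.example.com/api",
  "connection_timeout_seconds": 10,
  "polling_interval_seconds": 30
}"""

def set_value(text: str, key: str, value) -> str:
    config = json.loads(text)
    if key not in config:
        raise KeyError(key)  # refuse unknown keys instead of adding them
    config[key] = value
    return json.dumps(config, indent=2)

updated = set_value(config_text, "connection_timeout_seconds", 20)
print(updated)
```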

The degree of observability of a running system is engraved in its code. As much as modern platforms strive to offer rich base metrics (think JMX and OpenTelemetry), there is always a need to expose application-specific metrics and logs, paying attention to the quality of the exposed information (see above). When exposing internal system information, sensitive data must be filtered or masked, either because it is secret or because it represents personally identifiable information (PII).
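Masking can be enforced centrally rather than left to each log statement. Here is a minimal sketch using a Python logging filter that rewrites each record before it is emitted; the two patterns are examples only and are far from a complete secret or PII policy.

```python
# Minimal sketch: a logging filter that masks secrets and e-mail addresses.
# The patterns are examples, not a complete PII policy.
import logging
import re

PATTERNS = [
    (re.compile(r"(password|api_key|token)=\S+", re.IGNORECASE), r"\1=***"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<email>"),
]

class MaskingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge args into the message first
        for pattern, replacement in PATTERNS:
            message = pattern.sub(replacement, message)
        record.msg, record.args = message, None
        return True  # keep the record, now masked

logger = logging.getLogger("shop")
handler = logging.StreamHandler()
handler.addFilter(MaskingFilter())
logger.addHandler(handler)
logger.warning("login failed for %s with password=%s", "ann@example.com", "hunter2")
# emitted as: login failed for <email> with password=***
```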

Recommendation: configuration must be designed for the people who will use it: other developers, a production administrator, the author of a deployment script. Take care to state the measurement units of parameters, and insert adequate comments so that it is easy to modify a value. Observability is similar: log messages must be understandable to people other than the original programmer. The more you neglect these aspects, the more tiring releases become and the harder it is to resolve the problems that will inevitably emerge in production.

3. The development environment

But where does the code come from?

It comes from the mind of a programmer, today and in the future too. Writing code does not require a dedicated machine per se, while the testing and debugging phases require specific resources: debuggers, databases, hardware, connections to external systems; the cases are limitless.

The close link between development, build and test environments is evident. The current market offers only a partial solution to the problem of defining an environment for developers, for example Dev Containers, which, however, cannot be used in a pipeline to launch an environment and run tests, much less to set up a production environment.

As I write, there are no simple and clean solutions, except for pure web applications. Setting up an environment for a new team member still requires a poorly reusable jumble of scripts and configuration files. I think it is worth the effort anyway, as explained in “Automate? Always!”.

Recommendation: Automate developers’ environment setup, do not stop at release environments.
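A first, modest step in that direction is a script that checks a workstation for the required tools and reports what is missing; a complete setup script would go on to install them. The tool list is hypothetical.

```python
# Minimal sketch: check a developer workstation for required tools on PATH.
# The tool list is hypothetical; a full setup script would also install them.
import shutil

REQUIRED_TOOLS = ["git", "docker", "python3"]

def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    return [tool for tool in tools if which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    print("environment ready" if not missing else f"missing: {', '.join(missing)}")
```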

2. Before the code

We have reached the crucial junction between the programmer and the rest of the world: analysts, product managers, project managers, customers, managers.

Teams use an amazing variety of work management techniques (backlog), from post-its to Jira. In this context, we want to understand which information we want (must?) capture in these tools and carry along the pipeline. The information must be organized so that it is easy to know which high-level features are present in a given release package (e.g. through automatic generation of release notes). A modern development platform allows us to link a user story to a branch or commit and thus trace the evolution of the story through the pipeline. Another way platforms create these links is a special annotation in a commit comment; this works even in the post-it scenario, with no backlog management tool. A further step is to enrich this information using, for example, specially formatted commit messages such as Conventional Commits.
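Once commit messages follow a convention, release notes can be generated mechanically. This minimal sketch parses Conventional Commits subjects, keeps only features and fixes, flags breaking changes marked with “!”, and carries along work-item references found in the message body (here a “Refs: #123” footer). The section headings and the choice of which types to include are assumptions.

```python
# Minimal sketch: release notes from Conventional Commits messages,
# keeping work-item references. Headings and included types are assumptions.
import re
from collections import defaultdict

SUBJECT = re.compile(
    r"^(?P<type>\w+)(?:\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<desc>.+)$")
WORK_ITEM = re.compile(r"#(\d+)")
HEADINGS = {"feat": "Features", "fix": "Bug fixes"}

def release_notes(commit_messages):
    sections = defaultdict(list)
    for message in commit_messages:
        subject, _, body = message.partition("\n")
        m = SUBJECT.match(subject)
        if not m or m["type"] not in HEADINGS:
            continue  # chores, docs and malformed subjects are left out
        items = WORK_ITEM.findall(body)
        suffix = f" ({', '.join('#' + i for i in items)})" if items else ""
        flag = " [BREAKING]" if m["breaking"] else ""
        sections[HEADINGS[m["type"]]].append(f"- {m['desc']}{flag}{suffix}")
    lines = []
    for heading in HEADINGS.values():
        if sections[heading]:
            lines += [heading, *sections[heading], ""]
    return "\n".join(lines).rstrip()

commits = [
    "feat(cart): add discount codes\n\nRefs: #123",
    "fix: round totals correctly\n\nRefs: #130, #131",
    "chore: bump build image",
    "feat!: drop legacy checkout API",
]
print(release_notes(commits))
```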

If we don’t create these associations at the beginning, even before writing a line of code, it will be extremely difficult to rebuild them afterwards. The keyword here is traceability.

The amount of information we can put in a post-it, real or virtual, is limited, and it must be! Where do we write down everything else, whether it is information internal to the teams or external, usable by end users? We need a system that allows us to manage documentation that is hierarchically organized, catalogued, and capable of tracking changes. A wiki is a great starting point, and in fact modern platforms include one (GitHub, Azure DevOps, GitLab). More complex needs can be met with a publishing pipeline (e.g. GitHub Pages): CI/CD automation regenerates a website, ready for immediate use, whenever a file changes in a repository.

However, all this technical marvel cannot compensate for the absence of a librarian or technical writer who takes care of the organization, the format and, above all, the coherence of the contents.

Recommendation: do not delay defining standards for documentation and traceability: the tools to do this are plentiful, cheap, and easy to use.

1. Start from a distance

Bearing in mind that DevOps is the result of applying Lean principles to the technology value stream (The DevOps Handbook, p. 3), the product (or service) must be the starting point of the pipeline.

At the beginning of the pipeline we find tools used by Product Owners (POs) to build the Product Backlog. The PO thinks in terms of quarters, years, and beyond, while a team thinks in terms of weeks or months. While the PO describes features with a low degree of accuracy, the team needs precision and clarity. The product backlog must take into account factors such as seasonality, market changes, budget and so on. It is not a tool designed to organize work, but to define business priorities.

In many projects we have to consider content that comes from other teams such as designers or hardware engineers. If you don’t want to fall back into a waterfall process, you have to define shared standards and tools between groups so that all artifacts are version-stamped, the information is published in a well-known and easily accessible place, and, finally, there is full traceability among the many artifacts.

Recommendation: avoid the pitfall of using the same tool for POs and developers. Instead, some kind of automatic data synchronisation is desirable, so that a subset of records and properties in the POs’ tool is projected into the developer backlog with two-way updates. Barring trade secrets, the POs’ tool must be accessible to all downstream groups, from developers, data administrators and system administrators all the way down to user support, allowing a collective view of important deadlines and upcoming changes.
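The projection can be sketched as two operations: selecting which records and fields reach the developer backlog, and propagating a change made on either side. Record shapes, field names, the ready_for_dev flag and the “most recent change wins” rule are hypothetical simplifications; real integrations rely on each tool’s API and a more careful conflict policy.

```python
# Minimal sketch of projecting the POs' backlog into the developer backlog
# with two-way updates. All names and the conflict rule are hypothetical.

PROJECTED_FIELDS = ("title", "target_quarter", "status")

def project(po_items):
    # Only items flagged for development, and only the shared fields.
    return {
        item_id: {f: item[f] for f in PROJECTED_FIELDS}
        for item_id, item in po_items.items()
        if item.get("ready_for_dev")
    }

def sync_field(po_item, dev_item, field, po_changed_at, dev_changed_at):
    # Two-way update: the most recent change wins on both sides.
    winner = po_item if po_changed_at >= dev_changed_at else dev_item
    po_item[field] = dev_item[field] = winner[field]

po_items = {
    "EPIC-1": {"title": "Gift cards", "target_quarter": "2023-Q2", "status": "planned",
               "ready_for_dev": True, "budget": 50000},
    "EPIC-2": {"title": "Loyalty program", "target_quarter": "2023-Q4", "status": "idea",
               "ready_for_dev": False},
}
dev_backlog = project(po_items)
dev_backlog["EPIC-1"]["status"] = "in progress"  # developers update the item
sync_field(po_items["EPIC-1"], dev_backlog["EPIC-1"], "status", 1, 2)
```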

Examples: Aha!, Miro, Asana, Productboard.

Conclusions

At the end of this long overview I want to summarize the guiding principles to use in designing or improving CI/CD pipelines or the process of our software life cycle. In my opinion the three pillars are:

introspection of the process,
traceability,
automation.

What do you think?
