
7. Standardize log output

· One min read
Mithril Team

Status​

Accepted

Context​

  • ADR 2 is no longer fully relevant: we recently migrated the client logs to stderr, so that only the result of the command execution is written to stdout. This makes it possible to exploit the result, see our blog post.
  • Mithril aggregator logs are always redirected to stdout, but the aggregator mixes 2 types of CLI commands, some of which would benefit from sending their logs to stderr.
  • The Mithril aggregator and Mithril client CLIs do not have a consistent logging strategy, which is why we need to standardize them.

Decision​

  • For commands that provide a result or execute an action, logs are sent to stderr. Only the result of the command is sent to stdout.
  • For commands that launch a program without an expected result (server), logs are sent to stdout.
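
The rule above can be sketched in a few lines of Rust. This is a minimal, std-only illustration and not the actual Mithril implementation: `CommandKind`, `LogTarget`, `log_target` and `run_one_shot` are hypothetical names introduced for this example, and the JSON result is a placeholder.

```rust
use std::io::{self, Write};

/// Hypothetical command categories, named only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CommandKind {
    /// Provides a result or executes an action.
    OneShot,
    /// Launches a long-running program without an expected result.
    Server,
}

/// Standard stream that receives the logs.
#[derive(Debug, PartialEq)]
enum LogTarget {
    Stdout,
    Stderr,
}

/// Applies the decision: one-shot commands log to stderr, servers log to stdout.
fn log_target(kind: CommandKind) -> LogTarget {
    match kind {
        CommandKind::OneShot => LogTarget::Stderr,
        CommandKind::Server => LogTarget::Stdout,
    }
}

/// Runs a one-shot command: logs go to `logs`, only the result goes to `out`.
fn run_one_shot(out: &mut impl Write, logs: &mut impl Write) -> io::Result<()> {
    writeln!(logs, "starting command")?;
    writeln!(out, "{{\"result\": 42}}")?; // placeholder result
    writeln!(logs, "command finished")?;
    Ok(())
}

fn main() -> io::Result<()> {
    println!("server logs go to {:?}", log_target(CommandKind::Server));
    println!("one-shot logs go to {:?}", log_target(CommandKind::OneShot));
    // The result lands on stdout and can be piped; the logs land on stderr.
    run_one_shot(&mut io::stdout(), &mut io::stderr())
}
```

With this split, redirecting stdout of a one-shot command to a file or a pipe captures only the result, while the logs stay visible on the terminal.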

Consequences​

  • End users who rely on logs from stdout will face a breaking change: they will also have to retrieve the logs that are now sent to stderr.
  • Commands genesis, era and tools from Mithril aggregator now send their logs to stderr.

6. Errors implementation Standard

· 6 min read
Mithril Team

Status​

Accepted

Context​

Error handling is difficult with Rust:

  • Many ways of implementing them with different crates (thiserror, anyhow, ...)
  • No exception-like handling of errors
  • No stack trace or context available by default
  • A backtrace is available only when a panic occurs and if the RUST_BACKTRACE environment variable is set to 1 or full

We think that error handling should be done in a consistent way across the project. We have therefore worked on standardizing its implementation and tried to apply it to the whole repository. This has given us a clear view of the dos and don'ts, which we intend to summarize in this ADR.

Decision​

Therefore

  • We have decided to use thiserror and anyhow crates to implement the errors:
    • thiserror is used to create module or domain errors that come from our developments and can be easily identified (as they are strongly typed).
    • anyhow is used to add a context to an error triggered by a sub-system. The context is a convenient way to get 'stack trace' like debug information.

Here is a Rust playground that summarizes the usage of thiserror:

#[allow(unused_imports)]
use anyhow::{anyhow, Context, Result}; // 1.0.71
use thiserror::Error; // 1.0.43

#[derive(Error, Debug)]
#[error("Codec error: {msg}")]
pub struct CodecError {
    msg: String,
    #[source] // optional if field name is `source`
    source: anyhow::Error,
}

#[derive(Error, Debug)]
pub enum DomainError {
    #[error("Error with codec: {0:?}")]
    CodecWithOnlyDebug(CodecError),

    #[error("Error with codec")]
    CodecWithSource(#[source] CodecError),

    #[error("Error with codec: {0}")]
    CodecWithoutAnything(CodecError),

    #[error("Anyhow error: {0:?}")]
    AnyhowWrapWithOnlyDebug(anyhow::Error),

    #[error("Anyhow error")]
    AnyhowWrapWithSource(#[source] anyhow::Error),

    #[error("Anyhow error: {0}")]
    AnyhowWrapWithoutAnything(anyhow::Error),
}

fn anyhow_result() -> Result<()> {
    "invalid_number"
        .parse::<u64>()
        .map(|_| ())
        .with_context(|| "Reading database failure")
}

fn thiserror_struct() -> Result<(), CodecError> {
    Err(CodecError {
        msg: "My message".to_string(),
        source: anyhow!("Could not decode config"),
    })?;
    Ok(())
}

fn print_error(title: &str, error: anyhow::Error) {
    println!("{title:-^80}");
    println!("{error:?}\n",);
}

fn main() {
    println!("1 - Printing errors from enum variant that contains a error struct\n");
    // Debug the inner error struct: "normal" debug without the anyhow touch
    print_error(
        "DomainError::CodecWithOnlyDebug",
        anyhow!(DomainError::CodecWithOnlyDebug(
            thiserror_struct().unwrap_err()
        )),
    );
    // marking the inner error struct as source: anyhow will be able to make a
    // stacktrace out of this error. Nice !
    print_error(
        "DomainError::CodecWithSource",
        anyhow!(DomainError::CodecWithSource(
            thiserror_struct().unwrap_err()
        )),
    );
    // without debugging the inner error: only show the error text
    print_error(
        "DomainError::CodecWithoutAnything",
        anyhow!(DomainError::CodecWithoutAnything(
            thiserror_struct().unwrap_err()
        )),
    );

    println!("\n2 - Printing errors from enum variant that contains a anyhow error\n");
    // using only debug: the first two errors of the stack will be merged
    print_error(
        "DomainError::AnyhowWrapWithOnlyDebug",
        anyhow!(DomainError::AnyhowWrapWithOnlyDebug(
            anyhow_result().with_context(|| "context").unwrap_err()
        )),
    );
    // using #[source] attribute: each error of the stack will have a line
    print_error(
        "DomainError::AnyhowWrapWithSource",
        anyhow!(DomainError::AnyhowWrapWithSource(
            anyhow_result().with_context(|| "context").unwrap_err()
        )),
    );
    // without debug nor source: only the uppermost error is print
    print_error(
        "DomainError::AnyhowWrapWithoutAnything",
        anyhow!(DomainError::AnyhowWrapWithoutAnything(
            anyhow_result().with_context(|| "context").unwrap_err()
        )),
    );
}

Which will output errors this way:

1 - Printing errors from enum variant that contains a error struct

------------------------DomainError::CodecWithOnlyDebug-------------------------
Error with codec: CodecError { msg: "My message", source: Could not decode config }

--------------------------DomainError::CodecWithSource--------------------------
Error with codec

Caused by:
    0: Codec error: My message
    1: Could not decode config

-----------------------DomainError::CodecWithoutAnything------------------------
Error with codec: Codec error: My message


2 - Printing errors from enum variant that contains a anyhow error

----------------------DomainError::AnyhowWrapWithOnlyDebug----------------------
Anyhow error: context

Caused by:
    0: Reading database failure
    1: invalid digit found in string

-----------------------DomainError::AnyhowWrapWithSource------------------------
Anyhow error

Caused by:
    0: context
    1: Reading database failure
    2: invalid digit found in string

---------------------DomainError::AnyhowWrapWithoutAnything---------------------
Anyhow error: context

Here is a Rust playground that summarizes the usage of the context feature from anyhow:

#[allow(unused_imports)]
use anyhow::{anyhow, Context, Result}; // 1.0.71

fn read_db() -> Result<()> {
    "invalid_number"
        .parse::<u64>()
        .map(|_| ())
        .with_context(|| "Reading database failure")
}

fn do_work() -> Result<()> {
    read_db().with_context(|| "Important work failed while reading database")
}

fn do_service_work() -> Result<()> {
    do_work().with_context(|| "Service could not do the important work")
}

fn main() {
    let error = do_service_work().unwrap_err();

    println!("Error string:\n {error}\n\n");
    println!("Error debug:\n {error:?}\n\n");
    println!("Error pretty:\n {error:#?}\n\n");
}

Which will output errors this way:

Error string:
Service could not do the important work


Error debug:
Service could not do the important work

Caused by:
    0: Important work failed while reading database
    1: Reading database failure
    2: invalid digit found in string


Error pretty:
Error {
    context: "Service could not do the important work",
    source: Error {
        context: "Important work failed while reading database",
        source: Error {
            context: "Reading database failure",
            source: ParseIntError {
                kind: InvalidDigit,
            },
        },
    },
}

Consequences​

  • We have defined the following aliases that should be used by default:
    • StdResult: the default result that should be returned by a function (unless a more specific type is required).
    • StdError: the default error that should be used (unless a more specific type is required).
/* Code extracted from mithril-common::lib.rs */
/// Generic error type
pub type StdError = anyhow::Error;

/// Generic result type
pub type StdResult<T> = anyhow::Result<T, StdError>;
  • Functions that return an error from a sub-system should systematically add a context to the error with the with_context method, in order to provide clear stack traces and ease debugging.

  • When printing an StdError we should use the debug format without the pretty modifier, ie:

println!("Error debug:\n {error:?}\n\n");
  • When wrapping an error in a thiserror enum variant we should use the source attribute that will provide a clearer stack trace:
/// Correct usage with `source` attribute
#[derive(Error, Debug)]
pub enum DomainError {
    #[error("Anyhow error")]
    AnyhowWrapWithSource(#[source] StdError),
}
/// Incorrect usage without `source` attribute
#[derive(Error, Debug)]
pub enum DomainError {
    #[error("Anyhow error: {0}")]
    AnyhowWrapWithoutAnything(StdError),
}
  • Here are some tips on how to discriminate between creating a new error using thiserror or using an StdResult:
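    • If you raise an anyhow error which only contains a string, this means that you are creating a new error that doesn't come from a sub-system. In that case you should create a type using thiserror instead, i.e.:
// Avoid
return Err(anyhow!("my new error"));

// Prefer
#[derive(Debug, Error)]
pub enum MyError {
    // The `#[error]` attribute provides the `Display` implementation that `Error` requires
    #[error("my new error")]
    MyNewError,
}
// `.into()` converts `MyError` into a `StdError` when the function returns a `StdResult`
return Err(MyError::MyNewError.into());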
  • (Still undecided) You should avoid wrapping a StdError in a thiserror type. This breaks the stack trace and makes it really difficult to retrieve the innermost errors using downcast_ref. When the thiserror type is itself wrapped in a StdError afterward, you would have to downcast_ref twice: first to get the thiserror type and then to get the innermost error. This should be restricted to the topmost errors of our system (ie the state machine errors).

5. Use rfc3339 for date formatting

· 2 min read
Mithril Team

Status​

Accepted

Context​

Previously, on the Mithril project we did not have a preferred format for the dates in our applications, leading to multiple formats being used.

For example, when querying a certificate from an aggregator, the initiated_at field did not specify the timezone, whereas the sealed_at field did:

{
  "initiated_at": "2023-05-26T00:02:23",
  "sealed_at": "2023-05-26T00:03:23.998753492Z"
}

The same problem existed in our databases, where a date could be stored without timezone and fractional seconds (ie: 2023-06-13 16:35:28) in one table column and with them in another (ie: 2023-06-13T16:35:28.143292875Z).

RFC 3339 is a widely used, easily readable, mostly numeric format (no translation is needed to parse the day or the month). It also always includes the timezone, meaning that our clients can convert such dates to their local time if needed.

Decision​

Therefore

  • We commit to using RFC 3339 compatible dates and times whenever we need to store or show a date and time.

Consequences​

  • All dates and times must use a dedicated type in the application, ie: the DateTime<Utc> type from the chrono crate.
    • This means that dates must never be stored in our types as Strings.
  • Internally, we will always use the UTC timezone, to avoid useless conversions between timezones.
  • Users or scripts querying dates from our applications or from our databases will be able to parse all of them using the same format.
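
As an illustration of the target format, the sketch below renders a Unix timestamp as an RFC 3339 string in UTC. It deliberately uses only the standard library so that it is self-contained; in the project this job belongs to the DateTime<Utc> type from chrono. The helper names `civil_from_days` and `to_rfc3339_utc` are hypothetical, and the day-to-date conversion is the well-known "days from civil" arithmetic, not Mithril code.

```rust
/// Converts a number of days since 1970-01-01 into a (year, month, day)
/// date of the proleptic Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, i64, i64) {
    let z = days + 719_468; // shift the origin to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, in [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Formats seconds since the Unix epoch as an RFC 3339 date and time in UTC.
fn to_rfc3339_utc(unix_seconds: i64) -> String {
    let days = unix_seconds.div_euclid(86_400);
    let secs = unix_seconds.rem_euclid(86_400);
    let (year, month, day) = civil_from_days(days);
    format!(
        "{:04}-{:02}-{:02}T{:02}:{:02}:{:02}Z",
        year,
        month,
        day,
        secs / 3_600,
        (secs % 3_600) / 60,
        secs % 60
    )
}

fn main() {
    // Same instant as the `initiated_at` example, now with an explicit timezone.
    println!("{}", to_rfc3339_utc(1_685_059_343)); // prints "2023-05-26T00:02:23Z"
}
```

The trailing `Z` is what was missing from the `initiated_at` value shown in the context: it states that the time is expressed in UTC.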

4. Mithril Network Upgrade Strategy

· 4 min read
Mithril Team

Status​

Accepted

Context​

When Mithril runs on mainnet, there will be thousands of signers running. Upgrading the version of the nodes has an impact, as different versions of the API, messages, or signatures may lead to the loss of a significant part of the signer population over one epoch or more. In any case, we must prevent a gap in the certificate chain while upgrading critical parts.

We need to keep enough signer nodes and the aggregator able to work together in order to produce at least one certificate per epoch.

Examples of such changes:

  • change in the message structure
  • change in the cryptographic algorithm
  • change in communication channels

Decision​

In order to synchronize the behavior transition of all nodes, the Release Team will define Eras that start at a given Cardano Epoch and last until the next Era begins. When nodes detect an Era change, they switch from the old to the new behavior, hence all transitioning at almost the same time.

Consequences​

Release Team​

The Release Team is the team responsible for releasing new versions of the Mithril software. It will be responsible for setting the Epoch at which Eras change, using an Era Activation Marker. In order to determine when the new Era will begin, the Release Team has to know what share of the total Mithril stake can run the new behavior. Signer node software versions therefore have to be monitored.

Version monitoring​

The Release Team must be aware of the software version run by the Signer nodes and their associated stake. The version is going to be added to all HTTP headers in inter-node communication. As a first step, the Aggregator nodes will record this information, and provide the mapping of stakes to Signer nodes.

This configuration works in the case where there is a centralized Aggregator node (as it is today). In the future, there may be several Aggregator nodes working in a decentralized manner. This would mean having a separate monitoring service, and also monitoring the versions of the Aggregator nodes.

Era Activation Marker​

An Era Activation Marker is a piece of information shared among all the nodes. For every upgrade, there are two phases:

  • a first marker is set on the blockchain that just indicates that a new Era will start soon and that the software shall be updated.
  • a second marker is set that specifies the Epoch when they must switch from old to new behavior.

Every Era Activation Marker will be a transaction in the Cardano blockchain. This implies the nodes must be able to read transactions of the blockchain. Both Era Activation Markers can be of the same type: the first marker does not hold any Epoch information whereas the second does.

Nodes will check the blockchain for Markers at startup and at every new Epoch. When a node detects a Marker announcing an incoming Era that it does not support, it will warn the user that the node must be upgraded. If the node detects that it does not support the current Era, it will stop working with an explicit error message. To ease that operation, Era Activation Markers will be made sortable.
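
The checks described in this section can be sketched as follows. This is only an illustration under assumptions: the `EraMarker` struct, its fields, the era names and the `EraCheck` outcomes are hypothetical, and reading the markers from Cardano transactions is left out entirely.

```rust
/// Hypothetical in-memory form of an Era Activation Marker.
#[derive(Debug, Clone, PartialEq)]
struct EraMarker {
    era_name: String,
    /// `None` for the first (announcement) marker, `Some(epoch)` once the switch Epoch is set.
    activation_epoch: Option<u64>,
}

/// Outcome of the check a node runs at startup and at every new Epoch.
#[derive(Debug, PartialEq)]
enum EraCheck {
    Ok,
    /// An incoming Era is not supported: warn the user to upgrade.
    UpgradeSoon(String),
    /// The current Era is not supported: stop with an explicit error.
    Unsupported(String),
}

/// Sorts markers by activation Epoch; announcement-only markers come last.
fn sort_markers(markers: &mut [EraMarker]) {
    markers.sort_by_key(|m| m.activation_epoch.unwrap_or(u64::MAX));
}

/// Name of the Era active at `epoch`, i.e. the latest one already activated.
fn current_era(markers: &[EraMarker], epoch: u64) -> Option<&str> {
    markers
        .iter()
        .filter_map(|m| m.activation_epoch.map(|e| (e, m.era_name.as_str())))
        .filter(|(e, _)| *e <= epoch)
        .max_by_key(|(e, _)| *e)
        .map(|(_, name)| name)
}

/// Names of the Eras that are announced or scheduled after `epoch`.
fn upcoming_eras(markers: &[EraMarker], epoch: u64) -> Vec<&str> {
    markers
        .iter()
        .filter(|m| m.activation_epoch.map_or(true, |e| e > epoch))
        .map(|m| m.era_name.as_str())
        .collect()
}

/// Compares the markers with the Eras this node version supports.
fn check_support(markers: &[EraMarker], epoch: u64, supported: &[&str]) -> EraCheck {
    if let Some(era) = current_era(markers, epoch) {
        if !supported.iter().any(|s| *s == era) {
            return EraCheck::Unsupported(era.to_string());
        }
    }
    for era in upcoming_eras(markers, epoch) {
        if !supported.iter().any(|s| *s == era) {
            return EraCheck::UpgradeSoon(era.to_string());
        }
    }
    EraCheck::Ok
}

fn main() {
    let mut markers = vec![
        EraMarker { era_name: "era-b".to_string(), activation_epoch: None },
        EraMarker { era_name: "era-a".to_string(), activation_epoch: Some(1) },
    ];
    sort_markers(&mut markers);
    println!("{:?}", check_support(&markers, 10, &["era-a"]));
}
```

Sorting the markers by activation Epoch is one possible reading of "sortable"; it makes the current and incoming Eras easy to pick out of the list.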

Behavior Switch​

The nodes must be able to switch from one behavior to another when the Era Epoch is reached. This means the software must embed both behaviors. The switch is developed as a one time operation, there is no rollback mechanism available. Once the Epoch is transitioned and the switch has occurred, a new software release can remove the old behavior from the codebase.
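
A minimal sketch of such a switch, assuming two hypothetical Eras and a made-up message format, could look like this. Both behaviors are embedded in the same binary and the Epoch alone selects which one runs.

```rust
/// Hypothetical Eras embedded in the same binary.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Era {
    Legacy,
    Next,
}

/// Selects the Era from the current Epoch. Since Epochs only increase,
/// the switch happens once and is never rolled back.
fn era_at(epoch: u64, switch_epoch: u64) -> Era {
    if epoch >= switch_epoch {
        Era::Next
    } else {
        Era::Legacy
    }
}

/// Example of a behavior that differs between Eras (made-up message formats).
fn encode_message(era: Era, payload: &str) -> String {
    match era {
        Era::Legacy => format!("v1|{payload}"),
        Era::Next => format!("v2|{}|{payload}", payload.len()),
    }
}

fn main() {
    let switch_epoch = 100;
    for epoch in [99, 100, 101] {
        let era = era_at(epoch, switch_epoch);
        println!("epoch {epoch}: {}", encode_message(era, "hello"));
    }
}
```

Once the switch Epoch has passed, the `Era::Legacy` arm is dead code and can be deleted in a later release, as described above.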

3. Release process and versioning

· 3 min read
Mithril Team

Status​

Accepted

Context​

In order to regularly deliver the software to our users, we should implement a release process based on a predictable versioning scheme.

Versioning​

A Release Version identifies a distribution made of specific versions of the nodes and underlying libraries.

  • Our software must be able to interact seamlessly with other Mithril software.
  • Our software must be able to be hosted on crates.io.
  • Our software must clearly indicate compatibility with other Mithril components to end users.

Release process​

A Release is a software package that is built once and then promoted from the testing environment to the production environment. It can be signed.

  • Keep it simple.
  • Automated as much as possible: all points not requiring human decision shall be automated.
  • Minimize the mean time to release.

Decision​

There are 3 versioned layers in the Mithril stack:

  • HTTP API protocol to ensure compatibility in the communication between nodes (use Semver).
  • Crate version: each node & library has its own version (use Semver). The commit digest is automatically added to the version by the CI pipeline.
  • Release Version: the distribution version (use version scheme YYWW.patch | YYWW.patch-name). The VERSION file is computed by the pipeline from the tag release.

The documentation is tied to a Release Version.
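
As an illustration of the distribution version scheme, here is a small parser sketch. It assumes that YYWW stands for a two-digit year followed by a two-digit week number, which the text above does not spell out; the `ReleaseVersion` struct and `parse_release_version` function are hypothetical names.

```rust
/// Parsed form of a `YYWW.patch` or `YYWW.patch-name` distribution version.
#[derive(Debug, PartialEq)]
struct ReleaseVersion {
    year: u8,  // assumed meaning of `YY`
    week: u8,  // assumed meaning of `WW`
    patch: u32,
    name: Option<String>,
}

/// Returns `None` when the input does not follow the scheme.
fn parse_release_version(input: &str) -> Option<ReleaseVersion> {
    let (yyww, rest) = input.split_once('.')?;
    if yyww.len() != 4 || !yyww.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let year: u8 = yyww[..2].parse().ok()?;
    let week: u8 = yyww[2..].parse().ok()?;

    // Split the optional `-name` suffix from the patch number.
    let (patch, name) = match rest.split_once('-') {
        Some((patch, name)) if !name.is_empty() => (patch, Some(name.to_string())),
        Some(_) => return None, // trailing '-' without a name
        None => (rest, None),
    };
    if patch.is_empty() || !patch.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some(ReleaseVersion { year, week, patch: patch.parse().ok()?, name })
}

fn main() {
    println!("{:?}", parse_release_version("2329.0"));
    println!("{:?}", parse_release_version("2329.0-prerelease"));
    println!("{:?}", parse_release_version("v2329.0"));
}
```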

Release Process​

Starting just after a new release has been made:

  1. Develop on a dedicated development branch.
  2. When merging a PR on main: update the Cargo.toml files with the versions of the updated nodes.
  3. Once merged, the CI creates an unstable tag & release which is deployed on the testing environment.
  4. Push a tag using the distribution version format on this commit with a -prerelease suffix.
  5. The CI gets the built artifacts associated with this commit and generates a named pre-release which is deployed on pre-release for testing.
  6. Push a tag using the distribution version format on this commit without the -prerelease suffix.
  7. The CI gets the built artifacts associated with this commit and generates a named release which is deployed on pre-release for testing.
  8. In the release GitHub interface, edit the newly generated release, uncheck the This is a pre-release checkbox.
  9. The CI gets the built artifacts associated with this commit and generates a named release which is deployed on release.
  10. Create a commit:
    1. to promote the documentation website from future to current.
    2. to update the SQL schema with alterations from the previous release.

(Figure: Release Process)

Hotfix Release​

In case of a blocking issue (following a distribution release) on the release environment that requires an immediate fix:

  1. Create a branch on the last release tag with the following scheme: hotfix/{last_distribution-version}.{last_patch_number + 1}.
  2. Development of the fix is done on this branch.
  3. After each commit on this branch, the CI creates an unstable tag & release which is not deployed on the testing environment (testing must be done on an ad hoc environment manually created).
  4. Push a tag on the last commit of the branch using the branch distribution version with a -hotfix suffix.
  5. The CI gets the built artifacts associated with this commit and generates a named pre-release which is deployed on pre-release for testing.
  6. In the release GitHub interface, edit the newly generated release, uncheck the This is a pre-release checkbox.
  7. The CI gets the built artifacts associated with this commit and generates a named release which is deployed on release.
  8. Merge the hotfix branch into the main branch (and adapt the changes if they are not compatible with the current main branch).
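
The branch naming scheme from step 1 can be sketched as a small helper. This is one interpretation of the scheme, assuming the last release tag looks like `YYWW.patch` with an optional `-name` suffix; `hotfix_branch` is a hypothetical function, not part of the Mithril tooling.

```rust
/// Builds the hotfix branch name from the last release tag:
/// `hotfix/{last_distribution-version}.{last_patch_number + 1}`.
fn hotfix_branch(last_release: &str) -> Option<String> {
    let (distribution, patch) = last_release.split_once('.')?;
    // Ignore any `-name` suffix such as `-hotfix` on the previous tag.
    let patch = patch.split('-').next()?;
    let next_patch = patch.parse::<u32>().ok()?.checked_add(1)?;
    Some(format!("hotfix/{distribution}.{next_patch}"))
}

fn main() {
    // prints Some("hotfix/2329.1")
    println!("{:?}", hotfix_branch("2329.0"));
}
```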

2. Use simple structured logging

· One min read
Mithril Team

Status​

Superseded by ADR 7

Context​

  • Logs are a critical tool for operating any software system, enabling observability of the system.
  • Following 12 Factor Apps principles, providing the needed components and tools to be able to configure logging and monitoring should not be the responsibility of the software components.

Decision​

Therefore

  • Each component of the system uses structured logging, with a documented and standardised JSON format for its logs.
  • Logs are always emitted to stdout of the process the component is part of.

Consequences​

  • The schema of the logged items should be properly documented in a JSON schema
  • It is the responsibility of the node operator to consume the logs and process them
  • We use existing libraries to provide needed log infrastructure, like slog for Rust
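
To show what one structured log entry per line could look like, here is a std-only sketch. The field names `level` and `msg` and the helpers `escape_json` and `log_line` are assumptions made for this example; the actual schema and the slog-based setup of the project are not reproduced here.

```rust
/// Escapes a string so that it can be embedded in a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds one JSON object per log entry (hypothetical field names).
fn log_line(level: &str, msg: &str, fields: &[(&str, &str)]) -> String {
    let mut line = format!(
        "{{\"level\":\"{}\",\"msg\":\"{}\"",
        escape_json(level),
        escape_json(msg)
    );
    for (key, value) in fields {
        line.push_str(&format!(",\"{}\":\"{}\"", escape_json(key), escape_json(value)));
    }
    line.push('}');
    line
}

fn main() {
    // One entry per line on stdout, as stated in the decision above.
    println!("{}", log_line("INFO", "aggregator started", &[("epoch", "42")]));
}
```

Emitting one JSON object per line keeps the output easy to consume with standard tools, which matches the idea that processing the logs is the operator's responsibility.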

1. Record Architecture Decisions

· One min read
Mithril Team

Status​

Accepted

Context​

We are looking for a way to describe our technical architecture.

We are a small team working in a very lean and agile way (XP), so we naturally prefer lightweight documentation methods that also accommodate change easily.

Decision​

  • We will use Architecture Decision Records, as described by Michael Nygard in this article.
  • We will follow the convention of storing those ADRs as Markdown formatted documents stored under docs/adr directory, as exemplified in Nat Pryce's adr-tools. This does not imply we will be using adr-tools itself.

Consequences​

See Michael Nygard's article, linked above.