The Fast Finality (F3) protocol upgrade for the Filecoin network introduces a new layer of consensus designed to improve transaction finality times and enhance the network’s responsiveness. While F3 has the potential to significantly advance Filecoin's capabilities, its activation introduces several potential risks and challenges that could impact network stability, data integrity, storage provider (SP) operations, and overall protocol efficiency. To mitigate these risks and ensure a smooth transition, this document outlines potential failure scenarios, organised by related categories, and a structured disaster recovery plan for each.

These categories are:

  1. Network Bootstrapping and Consensus Initialisation — Issues related to the initial bootstrapping and consensus establishment of the F3 protocol.
  2. Finality Processing and Performance — Failures associated with finality-related processes, including potential congestion and delays in achieving finality.
  3. Chain Quality and Data Integrity — Risks that may impact the overall quality of the chain, including issues in data retrieval and the potential for data corruption.
  4. Security and Protocol Stability — Security vulnerabilities, protocol incompatibilities, and forking risks that could compromise network stability.
  5. Storage Provider (SP) Operations and Rewards — Failures affecting SPs' ability to produce blocks, earn rewards, and maintain operational efficiency.
  6. Operational Efficiency and Resource Utilisation — Resource-related challenges, such as increased power consumption and resource contention, affecting the efficiency of F3 processes.

Each category addresses specific risks and failure modes, accompanied by recovery steps and mitigation strategies. A severity ranking is provided at the end to prioritise the most critical failures, ensuring that high-risk scenarios are addressed promptly.

Failure Scenarios

| Category | Scenario ID | Threat | Impact | Mitigation (reduce the probability of occurrence) | Indicator | Recovery Steps | Severity | Likelihood |
|---|---|---|---|---|---|---|---|---|
| 1. Network Bootstrapping and Consensus Initialisation | 1.1 | Insufficient SP participation (<66%) prevents F3 from bootstrapping. | Network fails to bootstrap; F3 activation fails. | Pre-activation communication with SPs; monitor participation rates. | No F3 progress. F3 Observer. | None. Bootstrap should occur automatically once sufficient power participates. | 🟠 Moderate | 🟠 Moderate |
| | 1.2 | Network halts post-bootstrapping, preventing further progression, and eventually loses OhShitStore state. | Network stagnation, slowing finalisation of transactions and SP operations. In the worst case we fall back on the existing 900-epoch finality. | | No new blocks. | Put out snapshots. Presumably we will need to manually intervene to switch back to F3 consensus, which could involve code changes without a network upgrade, or performing a network upgrade with new F3 code/parameters. | 🔴 High | 🟢 Low |
| | 1.3 | F3 consensus halt (solvable or unsolvable within 900 epochs). | Delay in finalisation of tipsets. In the worst case we fall back on the existing 900-epoch finality. | Passive testing of edge cases; active testing of edge cases on non-mainnet networks. | No F3 progress. F3 Observer. | Presumably we will need to manually intervene to switch back to F3 consensus, which could involve code changes without a network upgrade, or performing a network upgrade with new F3 code/parameters. | 🔴 High | 🟠 Moderate |
| 2. Finality Processing and Performance | 2.1 | Additional F3 message processing causes network congestion. | Increased latency in block production. Loss of WindowPoSt. Loss of miner rewards. | F3 message/validation caching. | High libp2p bandwidth consumption over pubsub. High CPU/memory usage. | Notify community about disabling F3. Make a new release with congestion improvements. | 🟠 Moderate | 🟠 Moderate |
| | 2.2 | Bug in core F3 protocol results in extended delay and eventual halt of F3. | Delay in finalisation of tipsets (EC protocol). | Simulation testing; passive testing; active testing on non-mainnet. | High rate of equivocations. No F3 progress. | Notify community about disabling F3. Fall back to EC. Re-activate F3 at a later date via network upgrade. | 🔴 High | 🟢 Low |
| | 2.3 | Finality processing times regularly exceed the expected 4-epoch average. | Delays in finalisation, affecting usability. | Monitor finality times. | Number of tipsets finalised per F3 instance. | Adjust F3 parameters such as lookback, Delta, backoffs. Rolling out this change will involve a network upgrade. | 🟠 Moderate | 🟠 Moderate |
| | 2.4 | By the time finality processing ends, a heavier chain exists that is selected by non-participating SPs. | Loss of miner rewards. Delay in finalised tipsets. Network instability. | Monitor chain quality metrics. Increase lookback when selecting the candidate chain for F3 finalisation. | Chain quality metrics. F3 Observer. | Roll back to a recent snapshot; diagnose root cause before resuming. | 🔴 High | 🟢 Low |
| 3. Chain Quality and Data Integrity | 3.1 | Protocol inconsistencies or increased load degrade chain quality. | Reduced reliability of transactions. Loss of miner rewards. Extended null blocks. | Chain quality metrics and monitoring. | Existing chain quality monitoring (average blocks in past 900 epochs). | 1. Assess chain quality. 2. Adjust F3 parameters. | 🟠 Moderate | 🟠 Moderate |
| | 3.2 | Finality state corruption in Lotus. | Extended SP downtime. Loss of SP trust. | Lotus built-in state protection. SP backups. | Loss of WindowPoSt. Chain quality metrics. | Roll back to a recent snapshot; diagnose root cause before resuming. | 🔴 High | 🟢 Low |
| 4. Security and Protocol Stability | 4.1 | New vulnerabilities introduced by F3 increase the risk of consensus-critical issues. | Network manipulation risks; potential loss of trust. | Security audit (internal/external). Fuzz testing. | Chain quality metrics. Reported security incidents. Excessive slashing. | Rapid deployment of patches; notify SPs and users of security measures. | 🔴 High | 🟢 Low |
| | 4.2 | Conflicting forks result in diverging chain views ("forever" forks). | Inconsistent finality; network instability. Irreversible damage to the chain; risk of lost SP trust. | Simulation testing; passive testing; active testing on non-mainnet. | Chain quality metrics. Inconsistent F3 instance progress across nodes. F3 Observer. | Disable F3. Roll back to a single fork; coordinate with SPs to restore a unified chain state. | 🔴 High | 🟢 Low |
| 5. Storage Provider (SP) Operations and Rewards | 5.1 | Resource competition from finality processing disrupts SP block production. | Reduced SP participation and network functionality. SP revenue loss. | Adjust/fence resource allocation to support SP block production. | Chain quality metrics. | Disable F3. Deploy performance patches; update SP configurations. | 🟠 Moderate | 🟢 Low |
| | 5.2 | Changes to finality processes inadvertently reduce SP rewards. Note: this is a potential outcome of all the other chain quality items discussed previously. | Lowered incentives may reduce SP participation. | Review F3's impact on rewards pre-launch. | Chain quality metrics. | Adjust reward parameters to maintain SP engagement. | 🔴 High | 🟠 Moderate |
| | 5.3 | F3 affects Lotus node syncing, causing data inconsistencies for SPs. Scenario: we snapshot the wrong thing. | Operational disruptions for SPs. | Debug syncing issues pre-launch; monitor post-launch syncing. | | Deploy syncing patches; provide SP support. | 🟢 Low | 🟢 Low |
| 6. Operational Efficiency and Resource Utilisation | 6.1 | F3's computational demands increase CPU consumption for SPs. | Higher operational costs and sustainability concerns. | Optimise finality processes to minimise CPU use. | F3 monitoring metrics. Self-hosted node monitoring. | Provide SPs with power optimisation guidelines; explore alternative algorithms if needed. | 🟢 Low | 🟠 Moderate |
| | 6.2 | Resource contention between finality and block production leads to inefficiencies. | Delays in block production and reduced efficiency. | Implement load balancing; adjust protocol parameters as needed. | | Apply resource management fixes to maintain SP productivity. | 🟢 Low | 🟢 Low |
| Chain Exchange | | Lack of progress in F3. | | | | Halt recovery involves nothing special. | | |
| | | Resource consumption. | | | | We have parameters and configuration options that can help here. | | |
| | | Spam vector exploitable by DDoS. | | | | Optimise the compute (reduce potency). | | |

Graceful Failure Strategies

Graceful failure strategies reduce the impact radius of potential failures: they ensure that the F3 protocol can handle disruptions with minimal impact on network stability, SP operations, and data integrity. These strategies let the system manage errors in a controlled way, allowing operations to continue and preventing cascading failures. Below are the key graceful failure strategies tailored to Filecoin's F3 upgrade that have been implemented (or at least sufficiently implemented for activation).

Fallback to 900 epoch finality

Status: Sufficiently implemented for activation per https://github.com/filecoin-project/go-f3/issues/718#issuecomment-2612995271

When F3 takes longer than X epochs to reach finality, where X is in the vicinity of 900 epochs, the system should:

  1. finalise on base chain
  2. await consensus stabilisation
  3. start a new F3 instance with a fresh candidate
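The three steps above can be sketched as a simple decision rule. The function and action names below are illustrative assumptions, as is using exactly 900 epochs as the trigger; the actual implementation lives in go-f3 (see the linked issue).

```go
package main

import "fmt"

// FallbackThreshold approximates the existing EC finality window: if F3
// has not finalised anything for this many epochs, fall back.
// Illustrative; the real trigger is a protocol parameter.
const FallbackThreshold = 900

// NextAction decides what a node should do given the current epoch and
// the epoch of the last F3-finalised tipset. The action names mirror
// the three steps above and are not the go-f3 API.
func NextAction(currentEpoch, lastFinalizedEpoch int64, consensusStable bool) string {
	if currentEpoch-lastFinalizedEpoch <= FallbackThreshold {
		return "continue-f3" // F3 is keeping up; nothing to do
	}
	if !consensusStable {
		// Steps 1-2: finalise on the base (EC) chain and wait for
		// consensus to stabilise before restarting F3.
		return "finalise-base-chain-and-wait"
	}
	// Step 3: start a fresh F3 instance with a new candidate chain.
	return "start-new-f3-instance"
}

func main() {
	fmt.Println(NextAction(1000, 500, false)) // within threshold → continue-f3
	fmt.Println(NextAction(2000, 500, false)) // stalled, unstable → finalise-base-chain-and-wait
	fmt.Println(NextAction(2000, 500, true))  // stalled, stable → start-new-f3-instance
}
```

The key property of this fallback is that it never weakens the network's existing guarantees: in the worst case, finality degrades to the pre-F3 900-epoch EC behaviour rather than halting outright.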