# Analysis of link failures in an IP backbone

## Introduction

The Internet is made up of thousands of network operators, such as Internet
Service Providers. These operators give their customers guarantees, for
example on delivery delay or on the losses that may occur in their network.
These guarantees are called "Service Level Agreements" (SLAs), and they were
already good by the time the paper was published. The problem was that these
metrics do not cover every use a customer can make of the network. Voice over
IP is an example of a use that can be disturbed even when the SLAs are
respected. The reason is simple: real-time communication needs a stable
network.
Link failures and downtime due to maintenance are common in a network, but
they can introduce delay in the delivery of messages, which is annoying for
users.

The first step is to analyze a real network: how long failure events last and
how long the network takes to converge after a breakdown.
This helps to build a realistic simulation model.
The paper focuses on measurements taken in the infrastructure of a large
network operator in America.

## Main part

## Method

The measurements were taken with probes in several Points of Presence (POPs)
across America over a four-month period. The probes listened for control
messages in the inter-POP network. The protocol used to learn which links were
up or down was IS-IS (a standardized link-state routing protocol for large
dynamic networks).

From these messages the authors detected which links were failing and built a
distribution of the failure durations.
With the collected data they also checked whether a link failure correlates
with a failure event in the routing, and quantified this.
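
As an illustration of how failure durations could be derived from link state changes, here is a minimal Python sketch. The event log, the link names and the `(timestamp, link, state)` format are hypothetical and are not taken from the paper; a real analysis would first have to extract such events from the captured IS-IS messages.

```python
from collections import defaultdict

# Hypothetical event log: (timestamp in seconds, link id, new state).
events = [
    (0, "A-B", "down"),
    (30, "A-B", "up"),
    (100, "C-D", "down"),
    (1500, "C-D", "up"),
    (2000, "A-B", "down"),
    (2045, "A-B", "up"),
]

def failure_durations(events):
    """Pair each 'down' with the next 'up' seen on the same link."""
    down_since = {}
    durations = defaultdict(list)
    for ts, link, state in sorted(events):
        if state == "down":
            # Keep the first 'down' if it is repeated before an 'up'.
            down_since.setdefault(link, ts)
        elif state == "up" and link in down_since:
            durations[link].append(ts - down_since.pop(link))
    return dict(durations)

def fraction_at_most(values, threshold):
    """Empirical CDF evaluated at one threshold."""
    return sum(v <= threshold for v in values) / len(values)

durations = failure_durations(events)
all_durations = [d for per_link in durations.values() for d in per_link]
# Share of failures that last one minute or less.
short_share = fraction_at_most(all_durations, 60)
```

Evaluating `fraction_at_most` at several thresholds gives the points of a cumulative distribution of failure durations.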

Having several probes helps to determine the time that it takes for a message to
reach every part of the network (important for path recalculation).
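
A rough sketch of that idea, assuming the probes have synchronized clocks and that the same control message can be recognized at each probe. The probe names, message identifiers and timestamps below are made up for the example.

```python
# Hypothetical observations: message id -> {probe name: arrival time in seconds}.
observations = {
    "lsp-1": {"probe-east": 10.000, "probe-west": 10.040, "probe-south": 10.025},
    "lsp-2": {"probe-east": 55.010, "probe-west": 55.000},
}

def propagation_delay(arrivals):
    """Time between the first and the last probe seeing the same message."""
    times = arrivals.values()
    return max(times) - min(times)

delays = {msg: propagation_delay(arrivals) for msg, arrivals in observations.items()}
```

The result is only as accurate as the clock synchronization between the probes, which is why that assumption matters here.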

## Impact of a failure event on a router

When a router is overloaded or reboots, or when a link is unplugged without
precautions, the routers in the network need to recalculate their routes.

The paper discusses the time a router needs to re-establish its routes after a
failure, such as a link that is unplugged, destroyed, or flapping.

## Conclusion 

The article points out that only 10% of failure events last longer than 20
minutes and 50% last less than a minute. Almost half of the failures are due
to maintenance operations (between 10 PM and 6 AM). These operations are
planned, which means that a routing failure can be avoided by using the
routing protocol: announce on the network that a link should no longer be
used, and the network will automatically and gracefully find another path
(without introducing loops or temporarily unreachable routes).
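
One way to picture this is to raise the metric of the link before maintenance, so that shortest-path computation moves traffic away while the link is still up. The sketch below is only an illustration under that assumption: the topology, the metrics and the `set_metric` helper are hypothetical, and it uses a plain Dijkstra search rather than any real router implementation.

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra search; graph maps node -> {neighbor: metric}."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, metric in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + metric, neighbor, path + [neighbor]))
    return None

def set_metric(graph, a, b, metric):
    """Hypothetical helper: change the metric of a link in both directions."""
    graph[a][b] = metric
    graph[b][a] = metric

# Made-up four-router topology.
graph = {
    "A": {"B": 10, "C": 10},
    "B": {"A": 10, "D": 10},
    "C": {"A": 10, "D": 30},
    "D": {"B": 10, "C": 30},
}

before = shortest_path(graph, "A", "D")  # goes through B
set_metric(graph, "B", "D", 10_000)      # announce B-D as very expensive
after = shortest_path(graph, "A", "D")   # now goes through C
```

Because the link is still up when the new metric is announced, traffic has an alternative path before the link is actually taken down.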

The cumulative distribution of the time between failures shows that failures
occur close to each other. The mean time between successive failures of each
link shows that links differ in their failure characteristics: some fail
frequently, while others fail rarely or not at all.
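
The per-link comparison can be sketched as follows. The failure start times are invented for the example, and the mean is taken over the gaps between successive failure start times, which is an assumption about how the metric is defined.

```python
from statistics import mean

# Hypothetical failure start times per link, in seconds.
failure_starts = {
    "A-B": [0, 60, 120, 300],  # fails often
    "C-D": [0, 86400],         # fails about once a day
    "E-F": [500],              # a single failure, so no gap to measure
}

def mean_time_between_failures(starts):
    """Mean gap between successive failures, or None with fewer than two failures."""
    starts = sorted(starts)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return mean(gaps) if gaps else None

mtbf = {link: mean_time_between_failures(s) for link, s in failure_starts.items()}
```

Sorting the links by this value separates the frequently failing links from the stable ones.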

This is a first step toward deeper research on network configuration
optimization at the scale of a national or international operator.
For instance, some default configurations wait a certain time before
recalculating a route, which is not necessary. The paper gives a hint as to
which configurations these are.
The article is simple, but it explains many notions that are still relevant.

## Further work

This paper is still used as a reference, especially for micro-loop research at
the university, because it describes very well how routers behave when a
routing problem occurs in the network.


[1]: Analysis of link failures in an IP backbone, _Gianluca Iannaccone, Chen-nee Chuah, Richard Mortier, Supratik Bhattacharyya, Christophe Diot_
[2]: https://en.wikipedia.org/wiki/IS-IS
[3]: https://tools.ietf.org/html/rfc1142
