## Switch Fabrics

## Switching Technology $\mathbf{3} 38.3165$

http://www.netlab.hut.fi/opetus/s383165

## Switch fabrics

- Multi-point switching
- Self-routing networks
- Sorting networks
- Fabric implementation technologies
- Fault tolerance and reliability


## Sorting networks

- Types of blocking
- Internal blocking
- Output blocking
- Head of line blocking
- Sorting to remove internal blocking
- Resolving output conflicts
- Easing of HOL blocking


## Internal blocking

- Internal blocking occurs at the internal links of a switch fabric
- In a switch fabric, which implements synchronous slot timing, internal blocking implies that some input (i) to output (j) connection cannot be established (even if both are idle ones)
- Internally non-blocking switch makes all requested connections $\left(i, j_{i}\right)$, provided that there are no multiple request to the same output ( $j_{i} \neq j_{i}$, if $i \neq i, 1 \leq i, j \leq M$ )


Connection pattern $=\{(2,1),(3,4),(4,3)\}$

## Output blocking

- Internally non-blocking switch can block at an output of a switch fabric due to conflicting requests, i.e., $j_{i}=j_{i}$ for some $i \neq i$ '
- When an output conflict occurs, switch should connect one of the conflicting inputs to requested output => output conflict resolution
- Major distinction between a circuit and packet switching node
- a packet switching node must solve output conflicts per time-slot (timeslots are not assigned beforehand)
- a circuit switching node solves possible output conflicts and assigns a time-slot for entire duration of a connection beforehand



## Head of line (HOL) blocking

- Packets not forwarded due to output conflict are buffered => more delay experienced
- Buffered packets normally served in a FCFS (First Come First Served) manner => HOL blocking introduced at the input queues
- Packet facing HOL blocking may prevent the next packet in the queue to be delivered to a non-contended output => throughput of a switch reduced



## Sorting to remove internal blocking

- If connection requests at the inputs of a banyan network are compact and in strictly increasing order
=> input-output paths are link-disjoint
=> banyan internally non-blocking
- A method for building an internally non-blocking network is to apply a sorting network in front of a banyan network to generate a strict increasing order of destination addresses for the banyan network
- A sorting network connects an input $i$, which has a connection request to output $\boldsymbol{j}_{\text {; }}$, to an output of a sorting network according to the position of $\boldsymbol{j}_{i}$ in the sorted list of destination requests (see figure)
- Sorting networks can be formed by interconnecting nodes of smaller sorting networks (such as 2x2)
- Self-routing should be applied in the sorting network


## Internally non-blocking and self-routing switch



## Sorting to remove internal blocking

- A permuted list $\left(a_{1}, a_{2}, \ldots, a_{N}\right)$ can be restored to its original order by sorting
- A switching network for a maximal connection pattern can be obtained from a sorting network by treating $2 \times 2$ sorting elements as $2 \times 2$ switching elements
- Asymptotic lower bound for $2 \times 2$ sorting elements to build a NxN sorting network is $\mathrm{Nlog}_{2} \mathbf{N}$ (as for a respective switching network) - however, no sorting network found so far to obtain this bound
- Sequential merge-sorting process can be used to obtain $\mathrm{Nlog}_{2} \mathrm{~N}$ bound for the number of binary sorts


## Merge-sorting algorithm

## Merge-sorting algorithm

- Input : unsorted list $A_{N}=\left(a_{1}, a_{2}, \ldots, a_{N}\right)$
- Sort procedure:

Sort $\left(A_{N}\right)=$ Merge $\left\{\operatorname{Sort}\left(a_{1}, \ldots, a_{1 / 2 N}\right)\right.$, Sort $\left.\left(a_{1 / 2 N+1}, \ldots, a_{N}\right)\right\}$

- Merge procedure:

$$
\text { Merge } \begin{aligned}
& \left\{\left(a_{1}, \ldots, a_{m}\right),\left(a_{1}, \ldots, a_{m}^{\prime}\right)\right\} \\
& =\left\{a_{1},\left(\operatorname{Merge}\left(\left(a_{2}, \ldots, a_{m}\right),\left(a_{1}, \ldots, a_{m}^{\prime},\right)\right)\right\} \text { if } a_{1} \leq a_{1}^{\prime}\right. \\
& =\left\{a_{1}^{\prime},\left(\operatorname{Merge}\left(\left(a_{1}, \ldots, a_{m}\right),\left(a_{2}^{\prime}, \ldots, a_{m}^{\prime}\right)\right)\right\} \text { if } a_{1}>a_{1}^{\prime}\right.
\end{aligned}
$$

- Procedure Merge, called by procedure Sort, takes two sorted lists and merges them by comparing the smallest elements in each of the two sorted lists


## Merge-sorting algorithm (cont.)

- Merging of two sorted lists ( $\mathrm{N} / 2$ numbers in each) requires $N$ binary sorts
- Total complexity of sorting $N$ numbers, which are in any order, is $C(N)=2 C(N / 2)=N+2(N / 2+2 C(N / 4))=\ldots=N \log _{2} N$
- Due to sequential nature of procedure Merge the sorting takes at least $O(N)$ time



## Merge-sorting example

Sort the set $\{8,3,1,4,6,2,9,5\}$

Step 1: 4 times 2 sorts

Step 2: 2 times 4 sorts

Step 3: 1 time 8 sorts


## Odd-even merging

Recursive construction of an odd-even merger

- number of sorting stages is $\log _{2} N$
- number of sorting elements is $0.5 N\left[\log _{2} N-1\right]+1$



## Bitonic list

- Bitonic list $\boldsymbol{A}_{N}=\left(a_{1}, a_{2}, \ldots, a_{N}\right)$ is a list for which it holds that $a_{1} \leq a_{2} \leq \ldots \leq a_{k-1} \leq a_{k}$ and $a_{k} \geq a_{k+1} \geq \ldots \geq a_{N-1} \geq a_{N} \quad(1 \leq k \leq N)$
- Unique cross-over property - when comparing a monotonically increasing list with a monotonically decreasing list, there is at most one position where the two lists cross-over in their values (see figures)



## Bitonic merging

Recursive construction of a bitonic merger

- number of sorting stages is $\log _{2} N$
- number of sorting elements is $0.5 N \log _{2} N$



## Sorting by merging

Recursive construction of a sorting by merging network

- number of sorting stages is $0.5 \mathrm{Nog}_{2} N\left(\log _{2} N+1\right)$



## Odd-even sorting network example

- Number of sorting stages is $0.5 \log _{2} N\left(\log _{2} N+1\right)$
- Number of sorting elements is $0.25 N\left[\log _{2} N\left(\log _{2} N-1\right)+4\right]-1$

$\uparrow$ 2x2 UP SORTER



## Bitonic sorting network example

- Number of sorting stages is $0.5 \log _{2} N\left(\log _{2} N+1\right)$
- Number of sorting elements is $0.25 \mathrm{Nog}_{2} N\left(\log _{2} N+1\right)$



## Batcher-Banyan self-routing network



## Resolving output blocking

- Packet switches do not maintain a scheduler for dedicating time-slots for packets (at the inputs) => output conflicts possible
=> output conflict resolution needed on slot by slot basis
- Output conflicts solved by
- polling (e.g. round robin, token circulation)
- do not scale for large numbers of inputs
- outputs just served have an unfair advantage in getting a new time-slot
- sorting networks (making a banyan network internally non-blocking)
- An example of sorting networks is sort-purge-concentrate network
- when sorting self-routing addresses, duplicated output requests appear adjacent to each other in the sorted order (see figure)
- either one has to be purged (deleted)
- successful delivery is acknowledged and purged packets are re-sent


## Sort-purge-concatenate network

- A sorting network can easily handle packet priority by
- adding a priority field in the self-routing address
- higher priority packets are placed in a favorable position before purging
- support of priority is an essential feature when integrating circuit and packet switching in a sort-banyan network



## Resolving HOL blocking

- HOL blocking solved by
- allowing packets behind a HOL packet to contend for outputs
- allow multiple delivery of conflicting HOL packets to an output buffer
- multiple rounds of arbitration for sort-banyan network
- multiple planes of sort-banyan networks
- a good solution is to implement multiple input buffers (one for each output if possible) and if the packet in turn cannot be transmitted due to HOL, transmit an other packet from another buffer


## Construction of a multipoint packet switch

- Multipoint switch can be constructed by cascading a copy network and a point-to-point (routing) network
- copy network is a cascade of a compact super-concentrator (e.g. reverse banyan network) and copy distribution network (e.g. multicast banyan network)



## Multipoint packet switch

- In a self-routing multipoint switch
- incoming packets may be destined to multiple outputs
- if packets carry all needed destination addresses, headers would be variable length (and long)
- header problem can be avoided by labeling the packet (to be copied) by a virtual address (Broadcast Channel Number, BCN) and number of copies - each copy (with the same BCN) is given a distinct destination address
- Compact super-concentrator connects active inputs so that their destination addresses form a compact and sorted set at the outputs
- Copy distribution network establishes a multicast tree for each multicast connection
- Copy network becomes a self-routing one by providing
- self-routing address for compact super-concentrator
- self-routing interval address for copy distribution network


## Multipoint packet switch (cont.)

- Running-sum-adder is user for calculating self-routing address and self-routing interval address



## Example of a multicast banyan network

## Multipoint connections:

$X_{1}=7 \Rightarrow y_{1}=\{1,3\}$
$X_{2}=8 \Rightarrow y_{2}=\{4,5,6\}$
$X_{3}=9 \Rightarrow y_{3}=\{7,8,10,13,14\}$


## Batcher-Banyan example


$\uparrow$ 2x2 UP SORTER
$\downarrow$ 2x2 DOWN SORTER

## Switch fabrics

- Multipoint switching
- Self-routing networks
- Sorting networks
- Fabric implementation technologies
- Fault tolerance and reliability


## Fabric implementation technologies

- Time division fabrics
- Shared media
- Shared memory
- Space division fabrics
- Crossbar
- Multi-stage constructions
- Buffering techniques


## Time division fabrics

- Shared media
- Bus architectures
- Ring architectures
- Shared memory


## Shared bus

## Bus architecture

- Switching in time domain, but time and space switching implementations enabled
- Easy to implement and low cost index (= $M$ )
- One time-slot carried through the bus at a time
=> limited throughput (multi-casting possible)
=> low number of line interfaces
=> limited scalability



## Shared bus (cont.)

## Bus architecture

- Internally non-blocking implementations require high capacity switching bus => throughput $\geq$ aggregate capacity of line interfaces
- Inherently a single stage switch, but TST-switching possible if linecards support time division multiplexing (TDM)
- Multiple-bus structures can be used to improve reliability and increase throughput



## Ring architectures

## Ring architecture

- Rings coarsely divided into source and destination release rings
- in source release (SR) rings only one switching operation in progress at a time
=> limited throughput (like a shared bus)
- destination release (DR) rings allow spatial reuse, i.e., multiple time-slots can be carried through the ring simultaneously
=> improved throughput
- Switching in time domain, but time and space switching implementations enabled
- Usually easy to implement and low cost index (= M)
- Scales better than a shared bus



## Ring architectures (cont.)

## Ring architecture

- Internally non-blocking implementations require that throughput of a ring bus $\geq$ aggregate capacity of line interfaces
- Throughput can be improved by implementing parallel ring buses - control usually distributed => MAC implementations may be difficult
- Multi-casting relatively easy to implement
- Inherently a single stage switch, but TSTswitching possible if line-cards support TDM
- Multiple rings can be used to implement switching networks



## Ring architectures (cont.)

## Dual ring architecture

- Multiple rings used to improve throughput, decrease internal blocking, improve scalability and increase reliability



## Shared memory

## Shared memory architecture

- Switching in time domain, but time and space switching implementations enabled
- Inherently a single stage switch, but allows TST-switching if linecards support TDM
- Easy to implement and low cost index (= $M$ )



## Shared memory (cont.)

## Shared memory architecture

- Every time-slot carried twice through the bus
=> low throughput
=> low number of line interfaces
=> limited scalability
- Internally non-blocking if throughput of a switching bus and speed of shared memory $\geq$ aggregate capacity of line interfaces
- Performance can be improved by dual bus architecture or replacing the bus with a space switch (such as crossbar)


## Shared memory (cont.)

## Shared memory architecture

- Dual-bus architecture improves throughput, decreases internal blocking, improves scalability and increases reliability
- Memory speed requirement equal to that of single bus solutions



## Dimensioning example

A shared memory architecture, which uses a shared bus to connect line interfaces to the memory, is used to implement a switching equipment. The bus is 32 bits wide and bus clock is 150 MHz . Three clock cycles are needed to transfer a 32 bit word through the bus and $20 \%$ of the bus capacity is used for other than switching purposes. How many E1 interfaces can be supported by the switch? What is the required memory speed?

## Dimensioning example (cont.)

## Solution:

The bus transfers 32-bit wide data words at the speed of $(150 / 3) \times 10^{6}$ transfers $/ \mathrm{s}=50 \times 10^{6}$ transfers/s.
If the bus carries an eight bit time-slot (of a 64 kbit/s PDH channel) across the bus at a time, a single bus solution can transfer $0.8 \times(150 / 3)$ Mbytes/s = $40 \mathrm{Mbytes} / \mathrm{s}$
In a single bus solution, half of the bus capacity ( $20 \mathrm{Mbytes} / \mathrm{s}$ ) is used for storing time-slots to the memory and another half for reading timeslots from the memory
=> during a $125 \mu \mathrm{~s}$ period (= duration of an E1 frame) the bus witches $\left(125 \times 10^{-6}\right) \times\left(20 \times 10^{6}\right)$ bytes $=2500$ incoming (and outgoing) time-slots and thus the number of supported E1 links is 2500/32 $\approx 78$

## Dimensioning example (cont.)

## Solution (cont.):

If the memory interface logic accesses the buffer memory at the speed of the data bus, then the memory speed requirement is $1 /\left(50 \times 10^{6}\right) \mathrm{s}=20 \mathrm{~ns}$.
If the memory interface logic allows lower access rate than the data bus transfer rate, then the memory speed requirement is $1 /\left(40 \times 10^{6}\right) \mathrm{s}=25 \mathrm{~ns}$.

Throughput of the switching system could be increased by adding a 32 bit receiver-register to the shared switch memory block, which enables to transfer 4 time-slots (in parallel) through the bus at a time. By doing so, the throughput of the bus gets four fold and the number of supported E1 links increases to 312. However, the time-slots are still written one by one to the switch memory, and the corresponding memory speed requirement is (depending on the memory interface logic) either 5.0 ns or 6.25 ns .

## Space division fabrics

- Crossbar
- Multi-stage constructions


## Crossbar

## Crossbar architecture

- Inherently a space division switch
- Allows to build TST-switches if interfaces implement TDM functionality
- Hard to implement large switches due to complicated control schemes
$=>$ high cost index $\left(=N^{2}\right)$
- Commercial high-speed $N x N$ crossbar components enable modular and relatively inexpensive fabric constructions, but still control of the switch is a problem


## Crossbar (cont.)

## Crossbar architecture

- Inherently a strict-sense non-blocking fabric architecture
- Possible to carry $N$ time-slots through the switch at a time
=> high throughput
=> possible to implement a large number of line interfaces
$\Rightarrow>$ scales well within the limits of the available modular components
=> scaling up means increase of cross-point count from $N \times N$ to $(N+k) x(N+k)$
- Multi-casting easy to implement



## Crossbar (cont.)

Example implementation of a crossbar


## Crossbar (cont.)

An $8 \times 8$ switch constructed of four $4 \times 4$ crossbar blocks
Notice that doubling of input/output count increases the number of crossbar components from one to four.


## Multi-stage building blocks

- Multi-stage switches normally constructed of $2 \times 2$ switching blocks
- Implemented usually in FPGAs (Field Programmable Gate Arrays) and/or ASICs (Application Specific Integrated Circuit)
- FPGA for experimental use and low volume production
- ASICs for high volume production
- Batcher-banyan network most popular
- Used to implement space division switching
- Allows to build TST-switches if interfaces support TDM functionality



## Multi-stage building blocks (cont.)

- Hard to implement large circuit switches due to complicated control schemes (especially rearrangeable fabrics)
=> high cost index ( $\sim \mathrm{CNOg}_{2} \mathrm{~N}$ )
- Suitable for packet switching when self-routing functionality included
- Fixed length time-slot implementations favored to obtain strict-sense non-blocking fabrics
- Possible to carry $N$ time-slots through the switch at a time
=> relatively high throughput
=> scalable only if larger networks can be factored using smaller $N \times N$ components
=> scaling up means increase of cross-point count from $\sim \mathrm{CNlog}_{2} N$ to $\sim C(N+K) \log _{2}(N+K)$



## Problems with multi-stages

- Path search required
- Fast connection establishment implies need for fast control system => part of switching capacity is lost if control system is not fast enough
- Multi-cast is not self evident, because multi-cast complicates path search and control scheme and increases blocking probability
- Multi-slot connections (i.e. several slots used for a particular connection) complicate matters
- especially if path delay is not constant, e.g., slots belonging to the same connection may arrive to outputs in different order than they arrived at the inputs
- blocking increases


## Trends in fabric technologies

- Memory technology getting faster and faster
- Current SRAM (Static Random Access Memory) technology allows easy implementations of large PDH switches, e.g., full matrix for 8000 E1 (2M) PDH circuits - bigger fabrics hardly needed in narrow band networks
=> in narrow band networks the trend over the last 10 years has been to build full matrix fabrics based on shared memory
- However, when striving for broadband communications, memory based switch fabrics do not scale to bandwidth needs => multi-stage and crossbar switches have their change


## Trends in fabric technologies (cont.)

- Multistage fabrics were "reinvented" at the advent of ATM
- ATM suits perfectly for fixed length time-slot switching
- self-routing and sorting applies for ATM cell routing
- blocking and buffering causes headache
=> in spite of huge research effort, there have been very few commercial multi-stage fabrics available (mostly proprietary ASICs)
- Development of IC technologies, increased packing density (number of gates/chip) and increased speed have enabled crossbar fabrics suitable for high-speed switching applications ( $\mathrm{N}=2 \ldots 64$ and line rate 2.5 ... $40 \mathrm{Gbit} / \mathrm{s}$ )
- examples: Cx27399/Mindspeed, ETT1/Sierra, CE200/Internet Machines and Pl140xx/Agere
- Packet switching and advent of optical networking favors multistages and crossbars
=> packet switching introduces a new problem - buffering


## Technological tradeoffs in switch fabric design

- When trying to simplify path search and to speed up connection establishment
=> bus speed increases (inside fabric)
=> faster memory required => power consumption increases
=> integration level of a cross-point product needs to be increased
=> faster memory required, etc.
- If fast memory not available, use => crossbar fabrics (for small switches)
=> multistage fabrics (for large switches)
- real switching capacity may be less than theoretical
- minimization of cross-point count often pointless



## Electronic design problems

- Signal skew - caused by long signal lines with varying capacitive load inside switch fabric and/or on circuit boards
- Mismatching line termination - caused by long signal lines combined with varying (high) bit rates
- Varying delay on bus lines - caused by differently routed bus lines (nonuniform capacitive load)
- Crosstalk - caused by electro-magnetic coupling of signals from adjacent signal lines
- Power feeding and voltage-swing - incorrectly dimensioned power source/lines cause non-uniform voltage and lack of adequate filtering causes fluctuation of voltage
- Mismatching timing signals - different line lengths from a centralized timing source cause phase shift and distributed timing may suffer from lack of adequate synchronization


## Some design limitations

- Speed of available components vs. required wire speed and slot time interval
- Component packing density and power consumption vs. heating problem
- Maximum practical fan-out vs. required size of fabric
- Required bus length inside switch fabric
- long buses decrease internal speed of fabric
- diagnostics get difficult
- IPR policy
- whether company wants to use special components or more general all-purpose components


## Design optimization example

- An $N \times N$ switch fabric is to be designed and there are three alternative crossbar components $\boldsymbol{a}, \boldsymbol{b}$ and $\boldsymbol{c}$ available
- $\mathbf{a}$ is an $N_{a} \times N_{a}$ fabric component
- $\boldsymbol{b}$ is an $N_{b} \times N_{b}$ fabric component
- $\boldsymbol{c}$ is an $N_{c} \times N_{c}$ fabric component
and $N_{a}<N_{b}<N_{c} \leq N$
- Component $\boldsymbol{a}$ has entered the market at time $t_{a}, \boldsymbol{b}$ at time $t_{b}$ and $\boldsymbol{c}$ at time $t_{c}$
- Product development starts at $t_{p d}$ and the switch product should come in the market at $t_{m}$. Components are expected to be available when the product development starts $=>t_{a}<t_{b}<t_{c} \leq t_{p d}<t_{m}$
- Price of a component develops with time and is generally given by $\mathrm{P}(\mathrm{t})=C f(t)+D$, where $C f(t)$ is a time dependent and $D$ a constant part of component's price
- Question: Which one of the three components to choose for constructing an $N \times N$ switch fabric?


## Design optimization example (cont.)

- As an example, let's assume that price of each component is a function of time and is given by $\mathrm{P}(\mathrm{t})=\mathrm{Ce}^{-t / T_{+}} D$,
where $C, D$ and $T$ are component specific constants
$=>P_{a}(\mathrm{t})=C_{a} e^{-t / T_{a+}} D_{a}, \mathrm{P}_{b}(\mathrm{t})=C_{b} e^{-t / T_{b}} D_{b}$ and $\mathrm{P}_{c}(\mathrm{t})=C_{c} e^{-t / T_{c+}} D_{c}$
- Number of alternative crossbar components needed to build an $N \times N$ switch $\Rightarrow K_{a}=\operatorname{ceil}\left[N / N_{a}\right]^{2}, K_{b}=\operatorname{ceil}\left[N / N_{b}\right]^{2}, K_{c}=\operatorname{ceil}\left[N / N_{c}\right]^{2}$
- Individual component and total component costs as a function of time $t$
$\Rightarrow \mathrm{P}_{a}(t)=C_{a} e^{-\left(t-t_{a}\right) / T_{a}} C_{a}$ and $\mathrm{T}_{a}(t)=K_{a} \mathrm{P}_{a}(t) \quad, t \geq t_{a}$


- These functions can be used to draw price development curves to make comparisons


## Design optimization example (cont.)

## Numerical example:

- Let $N=64, N_{a}=16, N_{b}=32, N_{c}=64, T_{a}=T_{b}=T_{c}=3$ time units (years), $C_{a}=20, C_{b}=50, C_{c}=100$ and $D_{a}=10, D_{b}=20, D_{c}=40$ price units (euros)
- Product development period is assumed to be 1 time unit (year) and $t_{b}=t_{a}+1.5, t_{c}=t_{a}+3, t_{m}=t_{a}+4 \Rightarrow t_{p d}=t_{a}+3$
- Choosing that $t \geq t_{o}=t_{p d}=0=>t_{a}=-3, t_{b}=-1.5, t_{c}=0$ and $t_{m}=+1$
- Number of components needed $K_{a}=16, K_{b}=4, K_{c}=1$
- Switch fabric component cost functions

$$
\begin{array}{lll}
\Rightarrow P_{a}(t)=\left[20 e^{-(t+3) / 3}+10\right] & \text { and } \quad \mathrm{T}_{a}(t)=16 \times \mathrm{P}_{a}(t) \\
\Rightarrow \mathrm{P}_{b}(t)=\left[50 e^{-(t+1.5) / 3}+20\right] & \text { and } \quad \mathrm{T}_{b}(t)=4 \times \mathrm{P}_{b}(t) \\
\Rightarrow \mathrm{P}_{c}(t)=100 e^{-(t) / 3}+40 & \text { and } \quad \mathrm{T}_{c}(t)=1 \times \mathrm{P}_{c}(t)
\end{array}
$$

## Design optimization example (cont.)

## Numerical example (cont.) :




- Although the price of component $\boldsymbol{c}$ is manifold compared to the price of component a or b, cturns out to be the cheapest alternative
- Another reason to choose $\boldsymbol{c}$ is that it probably stays longest in the market giving more time for the switch product

