

# **Reliability Challenges for High Performance Electronics in the Internet of Things Era**

Prof. Cecilia Metra DEI - ARCES – Univ. of Bologna cecilia.metra@unibo.it

ISFCT2018, Waseda University, July 24th, 2018



**Today's electronics, and technological development till now.**

**Reliability Challenges for today's electronics:** 

≻↑ Vulnerability to transient faults (TFs) → soft errors (SEs)

≻↑ Likelihood of Aging Phenomena (NBTI)

**Design Approaches for Reliable electronics.** 


# **Today's Electronics**

□ Continuous miniaturization of microelectronic technology
 → massive diffusion/presence of electronic devices, possibly connected to each other through the Internet (IoT).




Sources: 1. Cisco Virtual Networking Index, 2015. 2. IDC, IoT Market Forecast: Worldwide IoT Predictions, 2015.


# **IoT, Big Data and Reliability**

□ Huge number of electronic devices connected through the Internet (IoT) → huge amount of data to be stored (*Data Center/Cloud/Fog*), processed, and distributed again.



R. Mariani, "Making the Autonomous Dream Work", Intel Fellow, University of Bologna presentation, May 2018

Life's decisions are driven by such data (autonomous driving, factory, transport, home, etc.). But can we rely on these data? Is the electronics storing/processing them reliable?

# **Today's Electronic Technology**



M. Bohr, "Continuing Moore's Law", Technology and Manufacturing Day, 19 September 2017




### **How Small Is 14 nm?**



M. Bohr, "14nm Process Technology: Opening New Horizons", Intel Developer Forum, 2014


# **Development of Electronic Technology**

□ Moore's law (1965) has driven the evolution of microelectronic technology and is driving its future developments.





Courtesy of Intel Corporation Intel Techn. Journal, 2007



https://www.elektormagazine.com/articles/moores-law

# **How Has It Been Possible to Follow Moore's Law?**

□ Architectural Changes: multicore/many-core systems (since 2000)

□ Material Changes: high-k gate insulator (since 2007)

□ Device Changes: Tri-gate transistors (since 2011)


# **How Has It Been Possible to Follow Moore's Law?**

□ Architectural Changes: multicore/many-core systems (since 2000)
 ≻ A trend that will continue

June 15, 2010: experimental microprocessor with 48 cores



http://www.intel.com/pressroom/innovation, June 15, 2010



IEEE Computer Society 2022 Report, 2014

# **How Has It Been Possible to Follow Moore's Law? (cnt'd)**

□ Material Changes: high-k gate insulator (since 2007)

Intel 45nm dual-core, Hafnium-based High-k Metal Gate process.





Intel Press Kit, November, 2007

# **How Has It Been Possible to Follow Moore's Law? (cnt'd)**

Hafnium-based High-k Metal Gate process advantages:



Intel's High-k/Metal Gate Announcement, November 4th, 2003

# **How Has It Been Possible to Follow Moore's Law? (cnt'd)**

**Device Changes: Tri-gate transistors (since 2011):** 

≻Tri-Gate Transistors → higher speed & lower I<sub>OFF</sub> (→ low power consumption) [2002].

 ✓ Tri-Gate Transistors used in 22nm SRAM demonstrated in 2009

 ✓ Tri-Gate Transistors used in 22nm microprocessor demonstrated in April 2009




R. S. Chau, Technology @ Intel Magazine, August 2006

Bohr, Mistry, "22nm Details\_Presentation", May 2011

# **How Has It Been Possible to Follow Moore's Law? (cnt'd)**

□ Higher Speed

□ Reduced Leakage (I<sub>OFF</sub>)



Bohr, Mistry, "22nm Details\_Presentation", May 2011

# **How Has It Been Possible to Follow Moore's Law? (cnt'd)**

- ☐ Intel® Core<sup>™</sup> M Processor (announced on September 5<sup>th</sup>, 2014):
  - > 14 nm, 2<sup>nd</sup> generation tri-gate transistor technology
  - > 1.3 billion transistors
  - > Compared to previous Intel Core processors:
    - ♦ ↑ 50% performance
    - ♦ ↑ 40% graphics processing speed
    - ♦ ↑ 20% battery life



Intel Developer Forum San Francisco 2014

# **How Has It Been Possible to Follow Moore's Law? (cnt'd)**

2<sup>nd</sup> generation tri-gate transistors



# **How Is It Possible to Follow Moore's Law?**

10 nm process using the 3<sup>rd</sup> generation of tri-gate transistors:
 10 nm fins are approx. 25% taller and approx. 25% more closely spaced than 14 nm fins



M. Bohr, "Technology Leadership", Technology and Manufacturing Day, 19 September 2017

# **How Is It Possible to Follow Moore's Law? (cnt'd)**

10 nm process: compared to 14 nm, higher transistor density (2.7x), higher performance (25%), and lower power (45%)



https://newsroom.intel.com/newsroom/wp-content/uploads/sites/11/2017/09/10-nm-icf-fact-sheet.pdf

# **How Is It Possible to Follow Moore's Law? (cnt'd)**

Intel Optane – announced on March 19<sup>th</sup>, 2017, available since April 24<sup>th</sup>, 2017 (16GB, 32GB)

 Intermediate solution between DRAM and Flash memories:
 DRAM (volatile, faster than Flash, less dense than Flash)
 Flash – used in current SSDs (non-volatile, denser than DRAM, slower than DRAM)



https://newsroom.intel.com/news/intel-introduces-worlds-most-responsive-data-center-solid-state-drive/

Non-volatile, denser (10X) than DRAM, and faster (1000X) than Flash

**Technology** *"ideal for ...devices, applications, services...requiring fast access to large sets of data"* 

(http://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html)



http://wccftech.com/intel-storage-roadmap-2017-optane-nand/
 ✓ Vertical stack (3D) of structures composed of columns (cell, selector) → ↑ density
 ✓ Each cell can be written/read by changing only the voltage sent to the selector → ↑ speed

# **Reliability Challenges in the IoT Era**

Following Moore's law has enabled ↑ integration density, ↑ complexity, and ↑ performance, leading to today's IoT, but also:

- > In the Field:
  - ♦ ↑ Vulnerability to Transient Faults (TFs) → Soft Errors (SEs)
  - ♦ ↑ Likelihood of ageing phenomena (mainly Negative Bias Temperature Instability – NBTI)

**Reliability Challenges** 








# **Reliability Challenges due to TFs and SEs**

□ TFs and consequent SEs may compromise electronics' correct operation in the field.



Example: unexpected and violent descent of Qantas Flight 72 (Airbus A330-303), caused by particles hitting the flight control computer (October 2008)



"In-flight upset, 154 km west of Learmonth, WA, 7 Oct. 2008, VH-QPA Airbus A330-303," ATSB Transp. Safety Report -Aviation Occurrence Invest., AO-2008-070, pp. 1 – 313, Dec. 2011.

**Transient Faults and Soft Errors** 

- □ A TF is an undesired fast voltage transition (*spike* or *glitch*) on a circuit node/line.
- □ TFs are generally caused by:
  - Alpha particles: He nuclei (He atoms that have lost their electrons), possibly generated by the radioactive decay of unstable isotopes (e.g., <sup>232</sup>Th) present within the packages of electronic circuits
  - Neutrons and protons originating from the collision of Galactic Cosmic Rays (GCRs) with atmospheric atoms (mainly Nitrogen and Oxygen)

R. Baumann, «Boron Compounds as a Dominant Source of Alpha Particles in Semiconductor Devices», in Proc. of IEEE Conf. on Reliability Physics Symposium, 1995.

J. F. Ziegler, "Terrestrial Cosmic Ray Intensities," IBM J. Res. Develop., Vol. 42(1), p. 125, Jan. 1998.




**Transient Faults and Soft Errors (cnt'd)**

- □ If the TF affects a combinational circuit and propagates to the input of a sampling element → possible output SE → *Reliability Risks*
- □ This happens if the TF:
  - > Is not electrically filtered out by the gates between j and the FF input
  - > Is not logically filtered out (m=1) by the gates between j and the FF input
  - > Arrives at the FF input as a spike satisfying the FF's set-up and hold time conditions wrt the FF sampling instant
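The three conditions for a TF in combinational logic to become an SE can be sketched as a simple check. This is only an illustrative model: the class `TransientFault`, the function `causes_soft_error`, and all parameter names and values are hypothetical, and the latching condition is simplified to "the spike covers the whole set-up/hold window".

```python
from dataclasses import dataclass


@dataclass
class TransientFault:
    arrival_time: float  # instant at which the spike reaches the FF input (ns)
    width: float         # spike width at the FF input (ns)


def causes_soft_error(
    tf: TransientFault,
    min_propagating_width: float,
    path_sensitized: bool,
    sampling_instant: float,
    setup_time: float,
    hold_time: float,
) -> bool:
    """Return True if the TF is sampled as a soft error in this simplified model."""
    # 1) Electrical filtering: spikes that are too narrow are attenuated by the gates.
    if tf.width < min_propagating_width:
        return False
    # 2) Logical filtering: the side inputs must let the spike through (e.g. m=1).
    if not path_sensitized:
        return False
    # 3) Latching window: the spike must cover [t_s - setup, t_s + hold].
    window_start = sampling_instant - setup_time
    window_end = sampling_instant + hold_time
    spike_end = tf.arrival_time + tf.width
    return tf.arrival_time <= window_start and spike_end >= window_end


# Assumed example values (ns): the spike overlaps the whole sampling window.
tf = TransientFault(arrival_time=9.90, width=0.20)
print(causes_soft_error(tf, 0.05, True, 10.0, 0.05, 0.03))
```

A TF is harmless in this model as soon as any one of the three conditions fails, which is why only a fraction of TFs in combinational logic result in SEs.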

# **Transient Faults and Soft Errors (cnt'd)**

- □ If the TF affects a memory element/cell → likely output SE → *Reliability Risks*
- □ For instance, if the TF hits the internal node B of a standard latch while CK=0 (TG1 OFF, and TG2 ON):
  - > the incorrect voltage value induced by the TF on node B is confirmed by the latch positive feedback loop → logic value of Q changed → SE.

TFs can give rise to output SEs during half of the CK period → SEs are more likely than for TFs affecting the latch input




# **Aging Phenomena - NBTI**

□ Negative-Bias Temperature-Instability (NBTI) is the most likely aging effect for current, scaled-down Integrated Circuits (ICs)

□ NBTI causes an increase in the absolute value of the V<sub>th</sub> of pMOS transistors → IC's performance degradation (> 20% in 10 years)

□ Signals on time-critical data-paths may violate setup/hold times of output flip-flops → generation of incorrect outputs → *Reliability Risks*
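The link between a |V<sub>th</sub>| increase and a setup-time violation can be illustrated with a rough numerical sketch. It assumes the alpha-power-law delay model (delay proportional to V<sub>dd</sub>/(V<sub>dd</sub> − |V<sub>th</sub>|)<sup>α</sup>), which is not taken from the slides. All function names and numbers are hypothetical and are not measured NBTI data.

```python
def gate_delay(vdd: float, vth_abs: float, alpha: float = 1.3, k: float = 1.0) -> float:
    """Alpha-power-law delay estimate: k * Vdd / (Vdd - |Vth|)**alpha."""
    if vdd <= vth_abs:
        raise ValueError("Vdd must exceed |Vth|")
    return k * vdd / (vdd - vth_abs) ** alpha


def relative_slowdown(vdd: float, vth_fresh: float, delta_vth: float, alpha: float = 1.3) -> float:
    """Fractional delay increase caused by an NBTI-induced |Vth| shift."""
    fresh = gate_delay(vdd, vth_fresh, alpha)
    aged = gate_delay(vdd, vth_fresh + delta_vth, alpha)
    return aged / fresh - 1.0


def violates_setup(path_delay_fresh: float, slowdown: float,
                   clock_period: float, setup_time: float) -> bool:
    """True if the aged data-path delay no longer meets the FF setup time."""
    aged_delay = path_delay_fresh * (1.0 + slowdown)
    return aged_delay > clock_period - setup_time


# Assumed values: Vdd = 1.0 V, |Vth| = 0.3 V, NBTI shift = 50 mV, delays in ns.
s = relative_slowdown(1.0, 0.3, 0.05)
print(f"slowdown = {100 * s:.1f}%")
print(violates_setup(0.90, s, clock_period=1.0, setup_time=0.05))
```

With these assumed numbers, a path that meets timing when fresh fails once aged, which is the situation the aging monitors discussed later are meant to detect early.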




# **Design Approaches for Reliable Electronics**

□ Hardware Fault Tolerance (HFT) is successfully adopted to guarantee the system's correct operation despite the occurrence of TFs and SEs during in-field operation.

□ Traditional HFT approaches:

- Modular Redundancy
- On-Line Testing & Recovery
- Error Correcting Codes (ECCs)

□ Proper aging monitors can be connected to the inputs of FFs at the output of time-critical data-paths → early monitoring of delay effects due to NBTI → possible activation of in-field compensation strategies → system's correct operation.
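As an illustration of modular redundancy, one of the traditional HFT approaches, a minimal sketch of Triple Modular Redundancy (TMR) with a bitwise majority voter follows. The helper names `majority_vote` and `tmr`, and the XOR `fault_masks` used to emulate SEs, are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x: int, fault_masks=(0, 0, 0)) -> int:
    """Run three replicas of `module` and vote on their outputs.

    Each replica output is XORed with a mask to emulate soft errors.
    """
    outputs = [module(x) ^ mask for mask in fault_masks]
    return majority_vote(*outputs)


def increment(x: int) -> int:
    """Toy 8-bit module used as the replicated function."""
    return (x + 1) & 0xFF


print(tmr(increment, 41))                  # fault-free
print(tmr(increment, 41, (0b100, 0, 0)))   # one replica hit: error masked
```

The voter masks an error in any single replica. If two replicas are corrupted in the same bit, the wrong value wins the vote.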

# **Example of Aging Monitors for NBTI**

□ Aging monitors connected to the inputs of the output FFs of time-critical data-paths ([1, 2]).

**Each aging monitor:** 



- > Checks the output of the datapath C<sub>i</sub> (S<sub>i</sub>) during a proper time guardband (T<sub>M</sub>)
- > Is enabled during T<sub>M</sub> only, by a proper control signal (TWC), which is = 1 only during T<sub>M</sub>
- > Gives an output alarm message in case of late transitions of S<sub>i</sub> during T<sub>M</sub>
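The monitor's behavior (not its circuit) can be sketched as follows. The sketch assumes that the guardband T<sub>M</sub> is the last part of the clock period, ending at the sampling edge. The function names and the timing values are hypothetical.

```python
def twc(t: float, clock_period: float, guardband: float) -> int:
    """Control signal TWC: 1 only during the guardband T_M (assumed to end at the sampling edge)."""
    return 1 if clock_period - guardband <= t < clock_period else 0


def late_transition_alarm(transition_times, clock_period: float, guardband: float) -> bool:
    """Alarm if S_i toggles at any instant (relative to cycle start) while TWC = 1."""
    return any(twc(t, clock_period, guardband) == 1 for t in transition_times)


# Assumed values: 1.0 ns clock period, T_M = 0.1 ns.
print(late_transition_alarm([0.20, 0.70], 1.0, 0.1))  # fresh circuit: S_i settles early
print(late_transition_alarm([0.20, 0.93], 1.0, 0.1))  # aged circuit: S_i toggles inside T_M
```

Because the alarm fires while the transition is still inside the guardband, the degradation is flagged before it produces an actual setup violation.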

[1] C. Metra, et al., "Self-Checking Monitor for NBTI Due Degradation", in Proc. of IEEE Int. Mixed-Signals, Sensors and Systems Test Workshop (IMS3TW), 2010

[2] C. Metra, et al., "Low Cost NBTI Degradation Detection & Masking Approaches", IEEE Transactions on Computers

# **Example of Aging Monitors for NBTI (cnt'd)**

□ Case of no late transition of S<sub>i</sub> while TWC=1: $(O_1, O_2) = (0,1)$ or $(1,0)$ → no alarm message

□ Case of late transitions of S<sub>i</sub> while TWC=1: $(O_1, O_2) = (1,1)$ or $(0,0)$ → alarm message

[2] C. Metra, et al., "Low Cost NBTI Degradation Detection & Masking Approaches", IEEE Transactions on Computers

# **Example of Aging Monitors for NBTI (cnt'd)**

Costs (area & power) of the monitor in [2] wrt those in [3, 4]:

|                 | Area (Sq) | ΔA    | Power (µW) | ΔP    |
|-----------------|-----------|-------|------------|-------|
| Our Monitor [2] | 60        | -     | 12         | -     |
| Monitor in [3]  | 78        | -23%  | 12.2       | -1.6% |
| Monitor in [4]  | 62        | -3.2% | 15         | -20%  |

$$\Delta A(\%) = 100 \cdot \frac{A_{our} - A_{[3,4]}}{A_{[3,4]}}, \qquad \Delta P(\%) = 100 \cdot \frac{P_{our} - P_{[3,4]}}{P_{[3,4]}}$$
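The relative variations in the table can be recomputed from the area and power figures with the two formulas above. This is a small check: the helper name `relative_cost` and the dictionary layout are illustrative only.

```python
def relative_cost(ours: float, other: float) -> float:
    """Percentage variation of our cost with respect to another monitor's cost."""
    return 100.0 * (ours - other) / other


OUR_MONITOR = {"area": 60, "power_uw": 12}
OTHER_MONITORS = {
    "[3]": {"area": 78, "power_uw": 12.2},
    "[4]": {"area": 62, "power_uw": 15},
}

for ref, cost in OTHER_MONITORS.items():
    d_area = relative_cost(OUR_MONITOR["area"], cost["area"])
    d_power = relative_cost(OUR_MONITOR["power_uw"], cost["power_uw"])
    print(f"Monitor in {ref}: dA = {d_area:.1f}%, dP = {d_power:.1f}%")

# Prints:
# Monitor in [3]: dA = -23.1%, dP = -1.6%
# Monitor in [4]: dA = -3.2%, dP = -20.0%
```

Negative values mean that the monitor in [2] costs less than the one it is compared to.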

[2] C. Metra, et al., "Low Cost NBTI Degradation Detection & Masking Approaches", IEEE Transactions on Computers

[3] M. Agarwal et al., "Optimized Circuit Failure Prediction for Aging: Practicality and Promise", in Proc. of IEEE Int. Test Conf., pp. 1-10, 2008.

[4] A. C. Cabe et al., "Small Embeddable NBTI Sensors (SENS) for Tracking On-Chip Performance Decay", in Proc. of Symp. on Quality Electronic Design, pp. 1-6, 2009.

# **New Approaches for Reliable Electronics implemented by Emergent Technologies?**

We have analyzed (by means of SPICE simulations) the effects of the most likely faults (i.e., *shorts* and *opens* [2]) affecting the selectors of a *ReRAM* (of size 128x128).



[1] Y. Deng, et al., IEEE Trans. Electron Devices, Feb. 2013.

As for opens, our analyses showed that they can alter only the logic value stored in the faulty *ReRAM* cell

→ Single error → correction by the conventional ECCs
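To illustrate why a single error is within reach of conventional ECCs, here is a minimal sketch of a single-error-correcting Hamming(7,4) code. It is a textbook example, not the specific ECC used for the ReRAM in the analysis. The function names are illustrative.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word):
    """Correct at most one flipped bit and return the 4 data bits."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        w[syndrome - 1] ^= 1
    return [w[2], w[4], w[5], w[6]]


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # emulate an open fault altering one cell
print(hamming74_decode(stored))  # [1, 0, 1, 1]
```

The syndrome directly gives the position of the single altered cell, so the stored value is recovered.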

[2] G. Burr, et al., Journal of Vacuum Science & Tech. B, Jul./Aug. 2014.

# **New Approaches for Reliable Electronics implemented by Emergent Technologies? (cnt'd)**

As for shorts, our analyses showed that they can alter (due to the huge current through the faulty cell) the logic value stored in:

1. The faulty *ReRAM* cell, and
2. Many other cells sharing the same *word line* as the faulty *ReRAM* cell

The number of cells in 2 depends mainly on the position of the faulty cell within the crossbar array, and it can be > 10.

→ High number of errors → need for alternative solutions to traditional ECCs

[1] Y. Deng, et al., IEEE Trans. Electron Devices, Feb. 2013.
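A rough sketch of why shorts defeat a single-error-correcting ECC: count the altered cells that fall into each codeword along one word line and compare the count with the correction capability. The mapping of codewords to contiguous columns, the 64-bit word width, and the function names are all assumptions made for illustration.

```python
def errors_per_codeword(corrupted_columns, word_width: int) -> dict:
    """Count altered cells per ECC codeword along one word line.

    Assumes codeword k occupies columns [k * word_width, (k + 1) * word_width).
    """
    counts = {}
    for col in corrupted_columns:
        word_index = col // word_width
        counts[word_index] = counts.get(word_index, 0) + 1
    return counts


def uncorrectable_words(corrupted_columns, word_width: int, correctable: int = 1):
    """Indices of codewords holding more errors than the ECC can correct."""
    counts = errors_per_codeword(corrupted_columns, word_width)
    return sorted(w for w, n in counts.items() if n > correctable)


# Open fault: only the faulty cell (hypothetical column 37) is altered.
print(uncorrectable_words([37], word_width=64))
# Short fault: the faulty cell plus 11 neighbors on the same word line (hypothetical).
print(uncorrectable_words(list(range(30, 42)), word_width=64))
```

In the first case no codeword exceeds one error. In the second, a single codeword holds 12 errors, far beyond single-error correction.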


