Reconfigurable and Adaptive Systems Research
Tools and Techniques for Analyzing Reliability Measures of Fault-Tolerant Reconfigurable Nano-Architectures
Manufacturing and transient faults may be abundant in high-density reconfigurable design fabrics built with nanoscale technologies (silicon or other emerging technologies). Designing reliable digital logic and architectures on such defective fabrics will require adequate redundancy. However, redundancy is not always a solution to the reliability problem: too much or too little redundancy can degrade reliability. The key challenge is determining the granularity at which fault tolerance is designed, and the level of redundancy needed to achieve a specific level of reliability. In-depth analysis of redundancy/reliability trade-offs for such designs will therefore be required for micro-architects to explore the design space. The goal of this project is to develop tools and techniques that evaluate the reliability measures of reconfigurable nano-architectures and analyze different resource-redundancy-based fault-tolerance techniques. In particular, we have extended an automated computational scheme based on Markov Random Fields (MRFs) and belief propagation (incorporated in a tool named NANOLAB) to compute trade-off points for different reconfigurable Boolean networks in the face of thermal perturbations and interconnect noise. Previously this tool was used only for combinational design exploration, but we have implemented a loopy belief propagation algorithm that provides the capability to model fault-tolerant programmable sequential logic designs. The effectiveness of this automation is illustrated by analyzing reconfigurable Boolean networks formed from different industry-standard configurable logic blocks (CLBs) in the presence of thermal and signal noise. We have also developed reconfigurable core logic libraries in a probabilistic model checking based tool called SMART.
This tool applies probabilistic model checking and state space exploration techniques to calculate the likelihood of transient and permanent faults occurring in the devices and interconnections of large-scale reconfigurable nano-architectures. Another tool, PRISM, has been used to develop a generic von Neumann multiplexing library based on discrete-time Markov chains (DTMCs), so as to perform comparative studies of different multiplexing-based redundancy techniques. These tools and techniques have already been used to illustrate certain counter-intuitive anomalies that could otherwise be observed only through complicated and cumbersome analytical methodologies. We believe that such methodologies will help further research and pedagogical interests in this area, expedite the reliability analysis process, and enhance the accuracy of establishing reliability-redundancy trade-off points.
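To make the MRF-based idea concrete, the following is a minimal sketch, not NANOLAB itself. It assumes a Gibbs distribution in which a gate state consistent with the gate's truth table has lower energy than an inconsistent one, with the thermal energy `kT` (in arbitrary units) controlling how often invalid states occur. The function names, the unit energy gap, and the tree-shaped NAND network are illustrative assumptions. On a tree with independent inputs, belief propagation reduces to the forward pass shown here.

```python
import math
from itertools import product


def nand(a, b):
    """Ideal (noise-free) NAND truth table."""
    return 1 - (a & b)


def nand_marginal(p_a, p_b, kT):
    """P(output = 1) of a NAND gate under an assumed Gibbs distribution.

    p_a, p_b: probabilities that each (independent) input is 1.
    kT: thermal energy in arbitrary units; larger values mean more noise.
    """
    weight = [0.0, 0.0]  # unnormalized belief for output = 0 and output = 1
    for a, b, z in product((0, 1), repeat=3):
        p_inputs = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
        # Assumed compatibility: 1 for a valid gate state, 0 otherwise.
        compat = 1.0 if z == nand(a, b) else 0.0
        weight[z] += p_inputs * math.exp(compat / kT)
    return weight[1] / (weight[0] + weight[1])


def two_level_nand(p_inputs, kT):
    """Forward-propagate marginals through NAND(NAND(a, b), NAND(c, d))."""
    p_a, p_b, p_c, p_d = p_inputs
    left = nand_marginal(p_a, p_b, kT)
    right = nand_marginal(p_c, p_d, kT)
    return nand_marginal(left, right, kT)


if __name__ == "__main__":
    # All inputs at logic 1: the ideal output is NAND(0, 0) = 1.
    for kT in (0.1, 0.25, 0.5, 1.0):
        p_one = two_level_nand((1.0, 1.0, 1.0, 1.0), kT)
        print(f"kT = {kT:4.2f}  P(output = 1) = {p_one:.4f}")
```

Under these assumptions, each gate behaves as if it had an effective error probability of 1 / (1 + exp(1 / kT)), so the probability of the correct output drifts toward 0.5 as `kT` grows. Networks with reconvergent fan-out or feedback break the independence assumption, which is where a loopy variant of belief propagation becomes necessary.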
Manufacturing Methodologies
Some of the current manufacturing techniques are casting, milling, lithography etc. The salient characteristics of these are:
Some of the emerging methodologies are:
Defects at Nano-scale
The probability of devices and interconnects being faulty in computation fabrics based on nanoscale devices will be non-negligible; in fact, faults may be common. There are different fault and defect categories at the nanoscale, and the ones discussed below are an illustrative subset.
Why Defect-Tolerant Architectures?
Figure 1. Nano-Architectures of the future
Figure 1 (IEEE Nanotechnology Conference 2003) shows the emerging architectures suitable for nano-scale implementations, their advantages and drawbacks, and their level of developmental maturity. It indicates that defect- and fault-tolerant architectures will play a major role in the development of nano-scale digital systems in the presence of permanent and transient faults. Due to the small feature size, there will be a large number of nano-devices at a designer's disposal. This will lead to resource (hardware) redundancy based defect- and fault-tolerant architectures, so conventional techniques such as Triple Modular Redundancy (TMR), Cascaded Triple Modular Redundancy (CTMR), multiplexing, and multistage iterations of these may be implemented to obtain high reliability. This motivates our comparative study of such techniques for different Boolean networks, and our analysis of different reliability-redundancy-granularity and area-delay-cost trade-off points. Some of the defect-tolerance techniques are:
Key Challenges in Redundancy Based Defect-Tolerance
Different Abstraction Levels at which Redundancy can be applied
Physical device level: specific defect-tolerant features of nano-scale devices
Architecture level: assembling collections of nano-devices; resource redundancy based fault-tolerance
Application level: features of the computing applications; correct operation on a defect-prone computing system
Defect-Tolerance Approaches
Detection of faults followed by reconfiguration (Heath et al., Goldstein et al., Durbeck et al.)
Probabilistic approach (our approach): estimate the probability distribution of errors and design around possible faults
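As a small illustration of estimating error probabilities, the sketch below enumerates every fault pattern of a four-NAND XOR circuit and sums the probability of the patterns that corrupt the output. This is an assumed toy model, not the method used by any of the tools named on this page: each gate is taken to invert its output independently with probability `eps`, inputs are taken to be uniformly distributed, and all function names are hypothetical.

```python
from itertools import product


def nand(a, b):
    """Ideal (noise-free) NAND truth table."""
    return 1 - (a & b)


def xor_from_nands(a, b, faults=(0, 0, 0, 0)):
    """XOR built from four NAND gates; faults[i] = 1 inverts gate i's output."""
    g1 = nand(a, b) ^ faults[0]
    g2 = nand(a, g1) ^ faults[1]
    g3 = nand(b, g1) ^ faults[2]
    return nand(g2, g3) ^ faults[3]


def output_error_probability(eps):
    """Exact P(wrong output), averaged over uniformly distributed inputs."""
    total = 0.0
    for a, b in product((0, 1), repeat=2):
        expected = a ^ b
        for faults in product((0, 1), repeat=4):
            n_faulty = sum(faults)
            # Probability of this exact fault pattern under independence.
            p_pattern = eps ** n_faulty * (1 - eps) ** (4 - n_faulty)
            if xor_from_nands(a, b, faults) != expected:
                total += p_pattern
    return total / 4  # four equally likely input combinations


if __name__ == "__main__":
    for eps in (0.001, 0.01, 0.05, 0.1):
        print(f"eps = {eps:5.3f}  P(error) = {output_error_probability(eps):.5f}")
```

Exhaustive enumeration scales as 2^(gates + inputs), which is why it is only practical for very small networks and why automated state space exploration or approximate inference is needed for realistic designs.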
Design Flow of Reconfigurable Systems
Figure 2. The design flow of a reconfigurable digital system
Figure 2 shows the design flow of a reconfigurable digital system. It indicates the different stages of system design, from the specification (higher level of abstraction) to net-list generation (detailed implementation, lower level of abstraction). Other back-end processes performed after logic synthesis from the RTL design, such as logic optimization, physical design, and layout, finally lead to the fabricated system; these have been omitted here for simplicity. The front-end design consists of translating the system specifications into an architectural design, which is then refined into a micro-architectural design, a detailed architectural description of the system. Such a design methodology for digital and information processing systems has to guarantee acceptable reliability levels. Therefore, quick and easy techniques are required to measure the reliability of such micro-architectural designs. If the desired reliability levels cannot be achieved with a given architectural configuration, the design has to be made more robust. This may involve adding redundancy at different granularity levels (such as gate level, logic block level, or logic function level). Existing literature and our previous work indicate that specific Boolean networks have different reliability-redundancy trade-off points, and that incorporating an arbitrary number of redundant devices in the architecture may even degrade the reliability of computation. Several analytical probabilistic models have been proposed for evaluating such trade-off points, but these analytical approaches are combinatorially extremely challenging and error-prone, especially for complex Boolean networks.
Also, analytical probabilistic analyses of large fault-tolerant architectural configurations are often non-composable, in the sense that if the analyzed configuration is used as part of a larger configuration, the combinatorial analysis becomes much more difficult. Interdependencies between the gates and the interconnects also add to the complexity of such analysis. These limitations necessitate the automation of such methodologies. Figure 1 shows the scope of the probabilistic tools and techniques being developed in this project, which we believe will expedite and ease the evaluation of reliability measures, redundancy/granularity levels, and different performance-area-cost related parameters for specific micro-architectural descriptions.
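The observation that extra redundancy can lower reliability can be reproduced with a textbook closed-form model of TMR and cascaded TMR. The sketch below is a simplification under stated assumptions: unit failures are independent, each TMR stage has a single majority voter with its own reliability `r_voter`, and the function names are hypothetical. It is not the analysis performed by the tools described above.

```python
def majority_of_three(r):
    """P(at least two of three independent copies are correct)."""
    return 3 * r**2 - 2 * r**3


def ctmr_reliability(r_unit, r_voter, order):
    """Reliability of cascaded TMR of a given order (0 = no redundancy).

    Assumes independent failures and a single imperfect voter per stage.
    """
    r = r_unit
    for _ in range(order):
        # Each stage triplicates the previous stage and votes on the result.
        r = r_voter * majority_of_three(r)
    return r


def unit_count(order):
    """Number of replicated units at a given order; voters are not counted."""
    return 3 ** order


if __name__ == "__main__":
    for r_unit, r_voter in ((0.90, 0.99), (0.95, 0.90)):
        print(f"unit reliability {r_unit}, voter reliability {r_voter}")
        for order in range(4):
            r = ctmr_reliability(r_unit, r_voter, order)
            print(f"  order {order}: {unit_count(order):3d} units, R = {r:.5f}")
```

With a good voter, each added stage raises reliability but with diminishing returns, since the result can never exceed the voter's own reliability. With a voter less reliable than the unit it protects, the first TMR stage already makes the system worse than the bare unit while tripling the device count.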
Operated by the University of California for the National Nuclear Security Administration of the US Department of Energy. Copyright © 2004 UC