Los Alamos National Laboratory
Reconfigurable and Adaptive Systems Research (RASR)
    

Contact Info:
Team Leader:
Maya Gokhale
Email: maya@lanl.gov
Phone: 505-665-9095



Page Info:
Last modified:
27 Jun 2008
 

Tools and Techniques for Analyzing Reliability Measures of Fault-Tolerant Reconfigurable Nano-Architectures

                                  

                                                                    

Manufacturing defects and transient faults may be abundant in high-density reconfigurable design fabrics built with nanoscale technologies (silicon or other emerging technologies). Designing reliable digital logic and architectures on such defective fabrics will require adequate redundancy. However, redundancy is not always a solution to the reliability problem: too much or too little redundancy can degrade reliability. The key challenge is to determine the granularity at which fault tolerance is designed and the level of redundancy needed to reach a specific reliability target. Micro-architects exploring the design space will therefore need in-depth analysis of the redundancy/reliability trade-offs of such designs.

The goal of this project is to develop tools and techniques that evaluate the reliability measures of reconfigurable nano-architectures and analyze fault-tolerance techniques based on resource redundancy. In particular, we have extended an automated computational scheme based on Markov Random Fields (MRFs) and belief propagation, incorporated in a tool named NANOLAB, to compute trade-off points for different reconfigurable Boolean networks in the face of thermal perturbations and interconnect noise. The tool was previously limited to combinational design exploration; we have implemented a loopy belief propagation algorithm that extends it to model fault-tolerant programmable sequential logic. We illustrate the effectiveness of this automation by analyzing reconfigurable Boolean networks built from different industry-standard configurable logic blocks (CLBs) in the presence of thermal and signal noise.
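To make the MRF/belief-propagation idea concrete, here is a minimal Python sketch; it is not NANOLAB's code. Everything in it is a hypothetical illustration: a two-gate network x4 = NAND(NAND(x0, x1), x3) with uniformly random primary inputs, and a per-gate Gibbs weight exp(-U/kT) in which a logically valid input/output state has energy U = -1 and an invalid one U = 0. Because this network is a tree, sum-product message passing yields exact marginals, which the sketch cross-checks by brute-force enumeration. The loopy belief propagation mentioned above iterates the same kind of messages on graphs with cycles, which this sketch does not attempt.

```python
import itertools
import math

def nand(a: int, b: int) -> int:
    """Fault-free NAND logic."""
    return 1 - (a & b)

def gate_weight(a: int, b: int, out: int, kT: float) -> float:
    """Gibbs weight exp(-U/kT) for one NAND clique.

    Assumed (hypothetical) energy: U = -1 for a logically valid (a, b, out)
    state and U = 0 otherwise, so valid states dominate as kT falls.
    """
    valid = 1.0 if out == nand(a, b) else 0.0
    return math.exp(valid / kT)

def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def output_marginal_bp(kT: float, prior=(0.5, 0.5)):
    """Sum-product message passing on the tree x4 = NAND(NAND(x0, x1), x3)."""
    # Message from gate 1 to its output x2 (sums out the primary inputs x0, x1).
    m_x2 = [sum(prior[a] * prior[b] * gate_weight(a, b, c, kT)
                for a in (0, 1) for b in (0, 1))
            for c in (0, 1)]
    # Message from gate 2 to x4 (sums out x2 via m_x2 and the primary input x3).
    m_x4 = [sum(m_x2[c] * prior[d] * gate_weight(c, d, e, kT)
                for c in (0, 1) for d in (0, 1))
            for e in (0, 1)]
    return normalize(m_x4)

def output_marginal_brute_force(kT: float, prior=(0.5, 0.5)):
    """Marginal of x4 by enumerating all 2**5 joint states (cross-check only)."""
    weights = [0.0, 0.0]
    for x0, x1, x2, x3, x4 in itertools.product((0, 1), repeat=5):
        weights[x4] += (prior[x0] * prior[x1] * prior[x3]
                        * gate_weight(x0, x1, x2, kT)
                        * gate_weight(x2, x3, x4, kT))
    return normalize(weights)

for kT in (0.05, 0.25, 1.0, 10.0):
    bp = output_marginal_bp(kT)
    bf = output_marginal_brute_force(kT)
    print(f"kT={kT:5}: P(x4=1) BP={bp[1]:.4f}  brute force={bf[1]:.4f}")
```

As kT falls, P(x4 = 1) approaches 0.625, the fault-free value for uniform inputs; as kT grows, the output drifts toward an uninformative 0.5.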

We have also developed reconfigurable core logic libraries in SMART, a tool based on probabilistic model checking. SMART applies probabilistic model checking and state-space exploration to calculate the likelihood of transient and permanent faults occurring in the devices and interconnections of large-scale reconfigurable nano-architectures. We have also used another tool, PRISM, to develop a generic von Neumann multiplexing library based on discrete-time Markov chains (DTMCs), in order to compare different multiplexing-based redundancy techniques.
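The DTMC view of von Neumann NAND multiplexing can be sketched in a few lines of Python. This is not the PRISM library described above, only an illustration under stated assumptions: the DTMC state is the number of stimulated lines in a bundle of n wires, each NAND flips its output with probability eps, input lines are sampled independently after random permutation (a binomial approximation), and the function names and the threshold delta are hypothetical.

```python
from math import comb

def nand_line_prob(x: float, y: float, eps: float) -> float:
    """P(one NAND output line is stimulated), given stimulated fractions x, y of its inputs.

    Assumes independently sampled input lines and an output flip with probability eps.
    """
    ideal = 1.0 - x * y
    return (1.0 - eps) * ideal + eps * (1.0 - ideal)

def binomial_row(n: int, z: float):
    """P(j of n output lines stimulated) when each is stimulated independently with prob z."""
    return [comb(n, j) * z**j * (1.0 - z) ** (n - j) for j in range(n + 1)]

def step(dist, n: int, eps: float):
    """One DTMC transition: a restorative NAND stage fed twice by the same bundle."""
    out = [0.0] * (n + 1)
    for k, pk in enumerate(dist):
        if pk == 0.0:
            continue
        for j, pj in enumerate(binomial_row(n, nand_line_prob(k / n, k / n, eps))):
            out[j] += pk * pj
    return out

def multiplexing_unit(n: int, eps: float, restorative_units: int, kx: int, ky: int):
    """Distribution of stimulated output lines after the executive stage and restorative units."""
    dist = binomial_row(n, nand_line_prob(kx / n, ky / n, eps))  # executive NAND stage
    for _ in range(restorative_units):
        dist = step(step(dist, n, eps), n, eps)  # two NAND stages restore the logic polarity
    return dist

def prob_reads_zero(dist, n: int, delta: float) -> float:
    """P(at most delta * n lines stimulated), i.e. the bundle is interpreted as logic 0."""
    return sum(p for k, p in enumerate(dist) if k <= delta * n)

# Both input bundles fully stimulated (logic 1), so the fault-free NAND output is logic 0.
n = 20
for units in range(4):
    dist = multiplexing_unit(n, eps=0.02, restorative_units=units, kx=n, ky=n)
    print(units, round(prob_reads_zero(dist, n, delta=0.1), 6))
```

Varying n, eps and the number of restorative units in such a model is one way to compare multiplexing configurations; a model checker performs the same kind of computation over an explicitly specified DTMC.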

These tools and techniques have already been used to reveal certain counter-intuitive anomalies that could otherwise be observed only through complicated and cumbersome analytical methods. We believe that such methodologies will help further research and pedagogical interests in this area, expedite the reliability analysis process, and improve the accuracy with which reliability-redundancy trade-off points are established.

 

Manufacturing Methodologies

 

Current manufacturing techniques include casting, milling, and lithography. Their salient characteristics are:

• Statistical and random

• Number of atoms in a transistor varies statistically

• Location of individual dopant atoms is probabilistic

Some of the emerging methodologies are:

• Self-assembly

• Positional control

• Self-replication

• Imprint lithography

• Electron beam lithography

 

Defects at Nano-scale

 

The probability of devices and interconnects being faulty in computation fabrics based on nanoscale devices will be non-negligible; in fact, faults may be common. Faults and defects at the nanoscale fall into several categories; those listed below are an illustrative subset.

• Manufacturing defects due to technology imperfections

• Reduced noise margins resulting in transient faults

      – Reduced voltage and current levels

• Faults due to ageing

      – Molecular and other new technologies are unlike silicon

• Quantum physical effects

 

 

Why Defect-Tolerant Architectures?

     

                            

Figure 1. Nano-Architectures of the future

Figure 1 (IEEE Nanotechnology Conference 2003) shows emerging architectures suited to nanoscale implementation, together with their advantages, drawbacks, and level of developmental maturity. It indicates that defect- and fault-tolerant architectures will play a major role in the development of nanoscale digital systems, which must operate in the presence of permanent and transient faults. Because of the small feature size, designers will have a large number of nano-devices at their disposal. This favors defect- and fault-tolerant architectures based on resource (hardware) redundancy, in which conventional techniques such as Triple Modular Redundancy (TMR), Cascaded Triple Modular Redundancy (CTMR), multiplexing, and multistage iterations of these can be implemented to obtain high reliability. This motivates our comparative study of such techniques for different Boolean networks and our analysis of the reliability-redundancy-granularity and area-delay-cost trade-off points.
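For reference, the textbook reliability expressions behind TMR and its cascaded form are given below. They assume independent module failures, a module that is correct with probability p, and majority voting; r_v denotes the reliability of a voter placed in series with the TMR arrangement.

```latex
\begin{aligned}
R_{\mathrm{TMR}}(p) &= p^{3} + 3p^{2}(1-p) = 3p^{2} - 2p^{3},\\
R_{\mathrm{TMR}}(p) - p &= p\,(1-p)\,(2p-1),\\
R_{k+1} &= 3R_{k}^{2} - 2R_{k}^{3}, \qquad R_{0} = p \quad \text{(cascading, perfect voters)},\\
R_{\mathrm{TMR},v}(p) &= r_{v}\,\bigl(3p^{2} - 2p^{3}\bigr) \quad \text{(voter of reliability } r_{v} \text{ in series)}.
\end{aligned}
```

TMR therefore improves on a single module only when p > 1/2. With perfect voters, the cascade pushes R_k toward 1 for p > 1/2 and toward 0 for p < 1/2. With an imperfect voter, R_TMR,v can never exceed r_v, so when p > r_v adding TMR actually lowers reliability.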

Some of the defect-tolerance techniques are:

• Resource redundancy based techniques

• Information coding techniques

• Large-scale reconfiguration

• Nano-device-specific techniques

 

Key Challenges in Redundancy Based Defect-Tolerance

 

• Adding arbitrary levels of hardware redundancy may decrease reliability

• For a specific architecture and a given device failure distribution, there exists an optimal redundancy level

      – Any increase or decrease in redundancy from that level may lead to less reliable computation

• Redundancy may be injected at different levels of granularity

      – Gate level, logic block level, functional unit level, etc.

• Determining the correct granularity level is crucial
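A small numerical sketch shows how an optimal redundancy level can arise. It assumes N-modular redundancy (N odd) with independent modules, each correct with probability p, and a hypothetical voter model in which each of the N voter inputs independently works with probability r_in, so a wider voter is less reliable. Neither the voter model nor the parameter values come from this project; they are illustrative only.

```python
from math import comb

def majority_reliability(n: int, p: float) -> float:
    """P(a strict majority of n independent modules is correct), n odd."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

def nmr_reliability(n: int, p: float, r_in: float) -> float:
    """Reliability of n-modular redundancy with an imperfect voter.

    Hypothetical voter model: each of the n voter inputs independently works with
    probability r_in, so the voter works with probability r_in**n. n = 1 means no voter.
    """
    if n == 1:
        return p
    return (r_in ** n) * majority_reliability(n, p)

def best_redundancy(p: float, r_in: float, max_n: int = 21):
    """Scan odd redundancy levels 1..max_n and return (best n, its reliability)."""
    candidates = [(n, nmr_reliability(n, p, r_in)) for n in range(1, max_n + 1, 2)]
    return max(candidates, key=lambda t: t[1])

for n in range(1, 10, 2):
    print(n, round(nmr_reliability(n, p=0.9, r_in=0.99), 5))
print("best:", best_redundancy(p=0.9, r_in=0.99))
```

With p = 0.9 and r_in = 0.99 the scan peaks at N = 3: dropping to N = 1 loses the masking benefit, while larger N loses more to voter failures than majority voting gains.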

 

Different Abstraction Levels at Which Redundancy Can Be Applied

 

• Physical device level

      – Specific defect-tolerant features of nano-scale devices

• Architecture level

      – Assembling collections of nano-devices

      – Resource redundancy based fault tolerance

• Application level

      – Features of the computing applications

      – Correct operation on defect-prone computing systems

 

Defect-Tolerance Approaches

 

                                                           

• Detection of faults followed by reconfiguration

      – Heath et al., Goldstein et al., Durbeck et al.

• Probabilistic approach

      – Estimate the probability distribution of errors

      – Design around possible faults

      – Our approach

 

 

Design Flow of Reconfigurable Systems

 

     

 

Figure 2. The design flow of a Reconfigurable digital system

Figure 2 shows the design flow of a reconfigurable digital system, from the specification (higher level of abstraction) to net-list generation (detailed implementation, lower level of abstraction). Other back-end processes performed after logic synthesis from the RTL design, such as logic optimization, physical design, and layout, finally lead to the fabricated system; these have been omitted here for simplicity. The front-end design translates the system specifications into an architectural design, which is then refined into a micro-architectural design, a detailed architectural description of the system. Such a design methodology for digital and information processing systems has to guarantee acceptable reliability levels, so fast and simple techniques are required to measure the reliability of micro-architectural designs. If the desired reliability levels cannot be achieved with a given architectural configuration, the design has to be made more robust, for example by adding redundancy at different granularity levels (gate level, logic block level, logic function level, etc.).
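The effect of granularity can be illustrated with a back-of-the-envelope model; it is an assumption of this sketch, not a result of the project. A block of m gates, each correct with probability p, is protected either by TMR around every gate (m voters) or by a single TMR around the whole block (one voter). Each voter has reliability r_v and is in series with what it protects, and all failures are independent. The function names are hypothetical.

```python
def tmr(r: float) -> float:
    """Majority-of-three reliability for independent copies of reliability r (perfect vote)."""
    return 3 * r**2 - 2 * r**3

def gate_level_tmr(p: float, m: int, r_v: float) -> float:
    """Triplicate each of m gates and vote after every gate: m voters in series."""
    return (r_v * tmr(p)) ** m

def block_level_tmr(p: float, m: int, r_v: float) -> float:
    """Triplicate the whole m-gate block and vote once at its output."""
    return r_v * tmr(p ** m)

for r_v in (1.0, 0.999):
    g = gate_level_tmr(p=0.999, m=10, r_v=r_v)
    b = block_level_tmr(p=0.999, m=10, r_v=r_v)
    better = "gate level" if g > b else "logic block level"
    print(f"r_v={r_v}: gate={g:.6f} block={b:.6f} -> {better}")
```

Under these assumptions fine-grained gate-level TMR wins when voters are perfect, but with r_v = 0.999 the single block-level voter comes out ahead, because the ten voters in series dominate the failure probability.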

Existing literature and our previous work indicate that specific Boolean networks have different reliability-redundancy trade-off points, and that incorporating an arbitrary number of redundant devices in the architecture may even degrade the reliability of computation. Several analytical probabilistic models have been proposed for evaluating such trade-off points, but these analytical approaches are combinatorially very challenging and error-prone, especially for complex Boolean networks. Analytical probabilistic analysis of large fault-tolerant architectural configurations is also often non-composable: if the analyzed configuration is used as part of a larger configuration, the combinatorial analysis becomes much more difficult. Interdependencies between gates and interconnects add further to the complexity of such analysis. These limitations call for automating the analysis. Figure 1 shows the scope of the probabilistic tools and techniques being developed in this project, which we believe will expedite and ease the evaluation of reliability measures, redundancy/granularity levels, and various performance-area-cost parameters for specific micro-architectural descriptions.
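The difficulty caused by interdependencies can be seen in a tiny hypothetical circuit in which input a fans out to both a NOT gate and a NAND gate that recombines it with the NOT output, so out = NAND(a, NOT a) is always 1 in a fault-free circuit. Each gate is assumed to invert its output independently with probability eps. Propagating signal probabilities gate by gate as if all signals were independent gives the wrong answer here; exact enumeration over the input value and the gate-fault patterns does not.

```python
import itertools

def noisy(value: int, faulty: int) -> int:
    """A faulty gate inverts its fault-free output (assumed fault model)."""
    return value ^ faulty

def exact_p_out(p_a: float, eps: float) -> float:
    """Exact P(out = 1) by enumerating the input and both gates' fault states."""
    total = 0.0
    for a, f1, f2 in itertools.product((0, 1), repeat=3):
        weight = ((p_a if a else 1 - p_a)
                  * (eps if f1 else 1 - eps)
                  * (eps if f2 else 1 - eps))
        b = noisy(1 - a, f1)            # NOT gate
        out = noisy(1 - (a & b), f2)    # NAND gate: a reconverges with NOT(a)
        total += weight * out
    return total

def naive_p_out(p_a: float, eps: float) -> float:
    """Gate-by-gate signal probabilities, wrongly treating a and b as independent."""
    p_b = (1 - eps) * (1 - p_a) + eps * p_a
    p_and = p_a * p_b                   # ignores the correlation between a and NOT(a)
    return (1 - eps) * (1 - p_and) + eps * p_and

for eps in (0.0, 0.01, 0.1):
    print(f"eps={eps}: exact={exact_p_out(0.5, eps):.4f} naive={naive_p_out(0.5, eps):.4f}")
```

With eps = 0 the naive composition predicts 0.75 instead of the correct 1.0, because it ignores the correlation that reconvergent fan-out introduces between a and NOT a.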

                                                              

Los Alamos National Laboratory, operated by the University of California for the National Nuclear Security Administration of the US Department of Energy. Copyright © 2004 UC
