(cont. from Part 1)

5. RELIABILITY PREDICTION

Customers specify a product's reliability requirements. The marketing and product development groups want an accurate, quantitative ability to trade off reliability for performance and density; they may also require application-specific qualifications to meet the needs of different market segments. The designers want design-for-reliability requirements that will not impede their time to market. Manufacturing wants stable, qualified processes and the ability to prevent reliability problems. And there is continuous pressure to reduce the cost of operations.

Reliability modeling assists in calculating system-level reliability from subsystem data and depicts the interrelationship of the components used. Using reliability models, a designer can develop a system that will meet the reliability and system-level requirements and can perform tradeoff studies to optimize performance, cost, or other specific parameters.

Reliability prediction is performed to determine whether the product design will meet its goals. If not, a set of quality initiatives or process improvements is identified and defined such that the goals will be met. Reliability process improvements are justified by relating them directly to improved field reliability predictions. Reliability prediction is nothing more than a tool for getting a gross baseline understanding of a product's potential reliability (failure rate). The number derived from the calculations is not an end-all answer to the reliability issue. Rather, it is the beginning: a call to understand what constitutes reliability for that product and what factors detract from achieving higher reliability. This results in an action plan.

Initial reliability predictions are usually based on component failure rate models using either MIL-HDBK-217 or Bellcore Procedure TR-332. Typically, one analyzes the product's bill of materials (BOM) for the part types used and plugs the appropriate numbers into a computer program that crunches them, giving a first-cut prediction. However, the failure rates predicted are usually much higher than those observed in the field and are considered worst-case scenarios.

One criticism of the probabilistic approach to reliability (such as that of MIL-HDBK-217) is that it does not account for interactions among components, materials, and processes. The failure rate for a given component is considered the same regardless of the process used to assemble it into the final product; even when the same process is used for two different assemblies, differences in how it is implemented can cause differences in reliability. Furthermore, since reliability goals are based on competitive analysis and customer experience with field usage, handbook-based reliability predictions are unlikely to meet the product goals. In addition, these predictions do not take into account design or manufacturing process improvements resulting from the use of highly accelerated life testing (HALT) or environmental stress screening (ESS), respectively.
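To make the "crunching the numbers" step concrete, the sketch below shows the general structure of a handbook parts-count prediction. The factor notation (lambda_G, pi_Q, pi_S, pi_T) follows the Bellcore TR-332 convention of scaling a generic device failure rate by quality, electrical-stress, and temperature factors, but all numeric values here are illustrative placeholders, not values from the handbook tables.

```python
# General structure of a handbook parts-count prediction (e.g., Bellcore
# TR-332 Method I): each device's failure rate is its generic rate scaled
# by quality, electrical-stress, and temperature factors, and the unit
# rate is the sum over the bill of materials. All numbers below are
# placeholders for illustration, not values from the TR-332 tables.
from dataclasses import dataclass

@dataclass
class BomLine:
    part_type: str
    qty: int
    lambda_g: float   # generic failure rate, FITs (failures per 1e9 h)
    pi_q: float       # quality factor (device quality level)
    pi_s: float       # electrical stress factor
    pi_t: float       # temperature factor

def unit_failure_rate_fits(bom: list[BomLine]) -> float:
    """Sum of per-device rates across the bill of materials, in FITs."""
    return sum(b.qty * b.lambda_g * b.pi_q * b.pi_s * b.pi_t for b in bom)

bom = [
    BomLine("IC, modem controller", 1, 250.0, 1.0, 1.0, 1.1),
    BomLine("Opto-coupler",          2,  80.0, 0.9, 1.0, 1.2),
    BomLine("Ceramic capacitor",    40,   1.5, 1.0, 0.8, 1.0),
]
fits = unit_failure_rate_fits(bom)
print(f"Predicted unit failure rate: {fits:.0f} FITs "
      f"(MTBF ~ {1e9 / fits:,.0f} h)")
```

A MIL-HDBK-217 parts-stress calculation has the same multiplicative shape, with handbook-specific base rates and factors for each part class.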
Table 3 presents some of the limitations of reliability prediction.

====
TABLE 3 Limitations of Reliability Prediction

Simple techniques omit a great deal of distinguishing detail, and the prediction suffers inaccuracy.
Detailed prediction techniques can become bogged down in detail and become very costly; the prediction can also lag far behind and may hinder timely hardware development.
Considerable effort is required to generate sufficient data on a part class/level to report statistically valid reliability figures for that class/level.
Component reliability in fielded equipment is very difficult to obtain due to the lack of suitable and useful data acquisition.
Other variables that can affect the stated failure rate of a given system include usage, operator procedures, maintenance and rework practices, measurement techniques or definitions of failure, operating environments, and excess handling differing from those addressed by the modeling techniques.
====

Thus, reliability prediction is an iterative process that is performed throughout the design cycle. It is not a "once done, forever done" task. The initial reliability prediction is continually refined throughout the design cycle as the bill of materials solidifies, by factoring in test data, failure analysis results, and the degree to which planned reliability improvement activities are completed. Subsequent predictions take into account usage history with the component technology, suppliers, and the specific component type (part number), as well as field data from previous products and the planned design and manufacturing activities. Field data at the Tandem Division of Compaq Computer Corporation has validated that these reliability projections are more accurate than handbook failure rate predictions.

====
TABLE 4 Device Quality Level Descriptions from Bellcore TR-332

The device failure rates contained in this document reflect the expected field reliability performance of generic device types. The actual reliability of a specific device will vary as a function of the degree of effort and attention paid by an equipment manufacturer to factors such as device selection/application, supplier selection/control, electrical/mechanical design margins, equipment manufacturing process controls, and quality program requirements.

Quality Level 0: Commercial-grade, reengineered, remanufactured, reworked, salvaged, or gray-market components that are procured and used without device qualification, lot-to-lot controls, or an effective feedback and corrective action program by the equipment manufacturer.

Quality Level I: Commercial-grade components that are procured and used without thorough device qualification or lot-to-lot controls by the equipment manufacturer.

Quality Level II: Components that meet the requirements of Quality Level I, plus purchase specifications that explicitly identify important characteristics (electrical, mechanical, thermal, and environmental), lot control, and devices qualified and listed on approved parts/manufacturer's lists.

Quality Level III: Components that meet the requirements of Quality Levels I and II, plus periodic device qualification and early-life reliability control through 100% screening. An ongoing continuous reliability improvement program must also be implemented.
====

====
TABLE 5 56K Modem Analysis Assumptions

An ambient air temperature of 40°C around the components (measured 0.5 in. above the component) is assumed.
Component Quality Level I is used in the prediction procedure. This assumes standard commercial, nonhermetic devices without special screening or preconditioning. The exception is the opto-couplers, which per Bellcore recommendation are assumed to be Level III.
Electrical stresses are assumed to be 50% of device ratings for all components.
Mechanical stress environment is assumed to be ground benign (GB).
Duty cycle is 100% (continuous operation).
A mature manufacturing and test process is assumed in the predicted failure rate (i.e., all processes under control).
The predicted failure rate assumes that there are no systemic design defects in the product.
====

5.1 Example of Bellcore Reliability Prediction

A reliability prediction for a 56K modem printed wiring assembly (PWA) was calculated using the Bellcore Reliability Prediction Procedure for Electronic Equipment, TR-332, Issue 5, December 1995, the latest revision of Bellcore TR-332. (In this issue the device quality levels have been expanded to four: 0, I, II, and III, with 0 being the new level. Table 4 describes these four levels.) Inherent in this calculation are the assumptions listed in Table 5.

Assuming component Quality Level I, the calculated failure rate for the PWA is 3295 FITs (failures per 10^9 hr), which is equivalent to an MTBF of 303,481 hr. This failure rate corresponds to an annualized failure rate of 0.029 per unit, or 2.9 failures per hundred units per year. The assumption is made that there are no manufacturing, test, or design problems that significantly affect field reliability. The results fall well within the normal range for similar hardware items used in similar applications.

If Quality Level II components are used, the MTBF improves by a factor of about 2.8. One has to ask: is the improved failure rate worth the added component cost? Only through a risk analysis and an understanding of customer requirements can this question be answered.

====
                    Failure rate (FITs)    MTBF (hr)    Annualized failure rate
Quality Level I            3295             303,481             0.029
Quality Level II           1170             854,433             0.010
====

The detailed bill-of-material failure rates for Quality Levels I and II are presented in Tables 6 and 7, respectively.

TABLE 6 Reliability Calculation Assuming Quality Level I
TABLE 7 Reliability Calculation Assuming Quality Level II
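These figures are related by simple unit conversions, sketched below assuming 8760 operating hours per year (per the 100% duty cycle assumption in Table 5). The small differences from the table's MTBF values reflect rounding of the FIT totals in the source.

```python
# FIT / MTBF / annualized-failure-rate conversions for the 56K modem
# example above. 1 FIT = 1 failure per 1e9 device-hours.

HOURS_PER_YEAR = 8760  # 24 h x 365 days, 100% duty cycle assumed

def mtbf_from_fits(fits: float) -> float:
    """MTBF in hours from a failure rate in FITs."""
    return 1e9 / fits

def annualized_failure_rate(fits: float) -> float:
    """Expected failures per unit per year of continuous operation."""
    return fits * 1e-9 * HOURS_PER_YEAR

for level, fits in [("I", 3295), ("II", 1170)]:
    print(f"Quality Level {level}: MTBF = {mtbf_from_fits(fits):,.0f} h, "
          f"AFR = {annualized_failure_rate(fits):.3f}")
# Level I:  MTBF ~ 303,490 h, AFR ~ 0.029 (2.9 per hundred units per year)
# Level II: MTBF ~ 854,701 h, AFR ~ 0.010
```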
6. RELIABILITY RISK

It is important that the person or group taking action based on a reliability prediction understand risk. Reliability predictions vary. Some of the sources of risk include the choice of statistical distribution, statistical error (confidence limits), and uncertainty in models and parameters.

Reliability metrics revolve around minimizing costs and risks. Four cost elements to incorporate in metrics are:

1. The cost of a failure in the field
2. The cost of lost business due to unacceptable field failures
3. Loss of revenue due to reliability qualification delaying time to market
4. The cost of the lost opportunity to trade off "excess" reliability safety margins for increased performance/density

Note that the first two items represent a cost associated with failures that occur in a small subpopulation of the devices produced. In contrast, the last two represent an opportunity to increase the profits on every part produced. Economic pressures will force increased attention on reliability's role in improving time to market and enhancing performance.

There are two ways to increase the value of reliability predictions. First, rather than a point prediction, the capability is needed to develop curves of reliability levels versus design, manufacturing, and end-use variables (Fig. 11). This allows optimization of reliability given the economics of a particular marketplace. Second, risk needs to be quantified so that it can be factored into technology decisions.

Let's use the bathtub curve to explore how this risk can be quantified. As mentioned before, the bathtub curve depicts a product's reliability (i.e., failure rate) throughout its life. Figure 12 shows the bathtub curve with a vertical line placed at the product's design life requirement. If a high margin exists between the lifetime requirement and the wearout time, a high cost is incurred for this design margin (overdesign relative to customer requirements), but the reliability risk is low. If the wearout portion of the curve is moved closer to the lifetime requirement (less design margin), a lower cost is incurred but a greater reliability risk presents itself. Thus, moving the onset of wearout closer to the lifetime expected by the customer increases the ability to enhance the performance of all products, but it is riskier and strongly dependent on the accuracy of reliability wearout models. One must therefore balance the high design margin against cost. Two prerequisite questions are: (1) why do we need this design margin, and (2) if I didn't need to design my product with a larger margin, could I get my product to market faster?

This raises the question of what level of reliability the customer for a given product really needs. It is important to understand that customers will ask for very high levels of reliability. They do this for two reasons: (1) they don't know what they need, and (2) as a safety net, so that if the predictions fall short they will still be okay. This requires that the designer/manufacturer work with the customer to find out the true need. Then the question must be asked: is the customer willing to pay for this high level of reliability? Even though the customer's goal is overall system reliability, more value is often placed on performance, cost, and time to market. For integrated circuits, for example, it is often more important to customers to get enhanced performance, and suppliers may not need to fix or improve reliability.
In such cases it is acceptable to hold reliability levels constant while aggressively scaling and making other changes.
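One way to make the margin-versus-risk picture quantitative is to attach a wearout distribution to the tail of the bathtub curve. The sketch below assumes Weibull wearout; the characteristic life (eta) and shape (beta) values are hypothetical, chosen only to show how survival probability at the required lifetime falls as the onset of wearout is moved closer to it.

```python
# Illustrative only: quantifying the design margin between a product's
# lifetime requirement and the onset of wearout, assuming wearout
# follows a Weibull distribution. eta (characteristic life, hours) and
# beta (shape) are hypothetical values, not data from this article.
import math

def weibull_reliability(t: float, eta: float, beta: float) -> float:
    """Probability of surviving to time t under Weibull wearout."""
    return math.exp(-((t / eta) ** beta))

design_life = 5 * 8760          # 5-year lifetime requirement, in hours
for eta_years in (15, 8, 6):    # progressively smaller wearout margin
    eta = eta_years * 8760
    r = weibull_reliability(design_life, eta, beta=3.0)
    print(f"wearout eta = {eta_years:2d} yr -> "
          f"P(survive {design_life // 8760} yr) = {r:.4f}")
# As eta moves toward the lifetime requirement, survival probability
# drops: less overdesign cost, but greater reliability risk.
```

Run against a validated wearout model rather than these placeholder values, the same calculation turns the margin-versus-risk argument into numbers a product team can actually trade off.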
7. RELIABILITY GROWTH

Reliability growth is a term used to describe the increase in equipment mean time to failure that comes about through improvements in design and manufacturing during the development, preproduction, and early production phases. The model originally proposed by Duane (3) is probably the best known for forecasting reliability growth (a brief numerical sketch appears below, after Section 8's list). Since the burn-in process also in effect enhances reliability, there has been some confusion between growth due to corrective actions in design and production and growth due to burn-in. Figures 13a and 13b illustrate the separate effects of burn-in and MTTF growth. Burn-in removes the weak components and in this way brings the equipment into its useful life period with a (supposedly) constant hazard rate λ (see Fig. 13a). Reliability growth through design and manufacturing improvements, on the other hand, steadily reduces the inherent hazard rate in the useful life period of the product, i.e., it increases the MTTF. The corrective actions we speak of when discussing burn-in are primarily directed toward reducing the number of infant mortality failures. Some of these improvements may also enhance the MTTF in the useful life period, providing an added bonus, and the efforts expended in improving the MTTF may very well reflect back on early failures as well. Nonetheless, the two reliability enhancement techniques are independent.

8. RELIABILITY DEGRADATION

Degradation can be defined as the wearing down of equipment through unwanted actions occurring in items within the equipment; component degradation is one example. Degradation over time slowly erodes or diminishes an item's effectiveness until an eventual failure occurs. The cause of the failure is called the failure mechanism. A graphic example of degradation is the wearing of land by the unwanted action of water, wind, or ice, i.e., soil erosion.

Product or equipment reliability degradation can occur due to process-induced manufacturing defects and assembly errors, the variable efficiency of conventional manufacturing and quality control inspection processes, and latent defects attributable to purchased components and materials. The last has historically caused irreparable problems in the electronics industry and requires that strict process control techniques be used in component manufacturing. The problem here is the unknown number of latent defects in marginal or weakened components, which can fail under the proper conditions of stress, usually during field operation.

Some of the things that can be done to prevent reliability degradation are the following:

1. "Walk the talk" as regards quality. This requires a dedication to quality as a way of life, from the company president down to the line worker.
2. Implement an effective quality control program at the component, PWA, module, subsystem, and system levels.
3. Design for manufacturing, testability, and reliability.
4. Use effective statistical quality control techniques to remove variability.
5. Implement manufacturing stress screens.
6. Improve manufacturing and test equipment preventive maintenance actions and eliminate poorly executed maintenance.
7. Train the work and maintenance forces at all levels and provide essential job performance skills.
8. Include built-in test equipment and use fault-tolerant circuitry.
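To make Section 7's growth model concrete, here is a minimal sketch assuming the common log-log form of the Duane model, in which cumulative MTBF grows as a power law of cumulative test time. The starting point and growth exponent alpha below are illustrative values, not data from the text.

```python
# Duane reliability-growth model in its common log-log form:
#   MTBF_cum(T) = MTBF_cum(T0) * (T / T0) ** alpha
# with instantaneous MTBF = MTBF_cum / (1 - alpha).
# T0, MTBF0, and ALPHA are hypothetical values for illustration.

def duane_cum_mtbf(t: float, t0: float, mtbf0: float, alpha: float) -> float:
    """Cumulative MTBF after t hours of test, given MTBF mtbf0 at t0."""
    return mtbf0 * (t / t0) ** alpha

T0, MTBF0, ALPHA = 100.0, 50.0, 0.4   # hypothetical growth program
for t in (100, 1_000, 10_000):
    cum = duane_cum_mtbf(t, T0, MTBF0, ALPHA)
    inst = cum / (1 - ALPHA)          # instantaneous (current) MTBF
    print(f"T = {t:6d} h: cumulative MTBF = {cum:7.1f} h, "
          f"instantaneous MTBF = {inst:7.1f} h")
```

Plotted on log-log axes, cumulative MTBF versus test time falls on a straight line of slope alpha, which is how growth programs are typically tracked.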
8.1 Component Degradation

Component degradation is typically a change that occurs with time and causes the component's operational characteristics to shift such that the component may no longer perform within its specified parameters. Operational degradation will occur through the accumulation of thousands of hours of component operation, and the component may eventually fail due to wearout. If a component such as a semiconductor device is used within its design constraints and properly manufactured, it will provide decades of trouble-free operation.

Component Degradation Mechanisms

Typical IC degradation mechanisms include:

1. Electrical overstress
2. Operation outside of a component's design parameters
3. Environmental overstress
4. Operational voltage transients
5. Test equipment overstress (exceeding the component's parameter ratings during test)
6. Excessive shock (e.g., from dropping a component on a hard surface)
7. Excessive lead bending
8. Leaking hermetically sealed packages
9. High internal moisture entrapment (hermetic and plastic packages)
10. Microcracks in the substrate
11. Chemical contamination and redistribution internal to the device
12. Poor wire bonds
13. Poor substrate and chip bonding
14. Poor wafer processing
15. Lead corrosion due to improperly coated leads
16. Improper component handling in manufacturing and testing
17. Use of excessive heat during soldering operations
18. Use of poor rework or repair procedures
19. Cracked packages due to shock or vibration
20. Component inappropriate for design requirements

Looking through this list of degradation mechanisms, it is clear that they can be eliminated as potential failure mechanisms, resulting in high cost savings. These mechanisms can be eliminated through proper component design, manufacturing, and derating processes and by ensuring that the correct component is used in the application. It is difficult to detect component degradation in a product until the product ceases functioning as intended; degradation is very subtle in that it is typically a slowly worsening condition.

9. RELIABILITY CHALLENGES

Electronics is in a constant state of evolution and innovation, especially for complex products. This results in some level of uncertainty as regards reliability and thus poses challenges. Two of these are as follows:

1. The ratio of new to tried-and-true portions of electronic systems is relatively high; therefore, reliability information may be largely unknown.
2. There is basically no statistically valid database for new technology that is in a constant state of evolution. Predictions cannot be validated until an accepted database is available.

10. RELIABILITY TRENDS

The integration of technology into every dimension of our lives has allowed customers to choose among many options and possibilities to meet their needs. This has led to a raised expectation of customized products. We have moved from a one-size-fits-all product and reliability mindset to one of customized products and the reliability of manufacturing batches/lots of a single unit. This has increased the number of possible solutions and product offerings. Product designers and manufacturers have been driven to cut development times; product lifetimes have decreased due to the pace of innovation, and shorter times to market and times to revenue have resulted. Concurrently, the time between the early adoption and mass adoption phases for new electronic and digital products has been compressed.
In the past, the physical life environment was important in reliability prediction. Today's shortened product lives have caused one to question the underlying methodology of the past. What are the concerns for a product with an expected life of 18 months or less, a product that could be considered disposable? Do we simply toss out the methods that worked in the past, or do we step back and decide which of our tools are most appropriate and applicable to today's product development and life cycles and customer expectations? In this guide, various tools and methods are presented that do work for high-end electronic products. It is up to readers to decide which of the methods presented make sense and should be used under the changing conditions facing them in their own company and chosen marketplace.