Standard Planning And Reporting For Optimized Statistical Testing
By: Robert A Ostrowski
The boss just asked you to write an explanation of how to plan a statistical test, choose a test quantity, and report the results after the test is completed. This is the way I, as a mechanical engineer and statistics wannabe, would do it. Working for the feds for 22 years has caused some to suggest that money is always plentiful. It's not. In fact, the best way to optimize funds, in my opinion, is to test "just enough" to achieve 80% confidence, with the new "power" number set at 74% and allowing for 1 failure during testing. Why such an odd power? Well, the old binomial calculation dealt only with confidence, but the new "proportion" calculation that all the new software uses adds power into the equation. When I ran all probabilities through the new software with one failure, the power was 74% across the board whenever the confidence exceeded 80%. The lower bar for power makes sense: a mistake related to power (beta error) just means "back to the drawing board" for a while, but a mistake related to confidence (alpha error) means a bad product out in the field that performs more poorly than expected.
The biggest bang for the buck comes from replacing these probability requirements with a more robust "continuous" variable that measures performance (minutes, feet, dollars), so that small increments of improvement can be detected. You will find that as long as a reasonable "Es" (standard deviations of improvement) of 0.75 or greater is achieved, the quantities required in repeated tests are around 3 or 4. This compares favorably to the double-digit quantities usually required to prove 80% confidence with a probability requirement.
Either way, the following is a good method to explain the procedure, whether your requirement is a probability or a continuous variable:
1) Statistics planning, quantity selection, and results reporting
a) Probability or Proportion requirement (with example)
i) Quantities and performance - The quantities chosen for a 0.75 "probability of success (Ps)" requirement, for example, are based on a standard of 80% confidence and 70% power as the minimum combination for quantity selection. This combination was selected because the past binomial (pass/fail) calculation, using one assumed failure, results in a consistent 74% power when run through the single proportion calculation in STATISTICA (a statistics tool). The quantity selected using this "probability" requirement method will provide only a very gross generalization of performance. The determination that the system works XX% of the time is made by lumping together all the sources of variability, such as "type of item", temperature, speed, and interference factors (or not). Varying these test factors breaks the basic law of statistics, which requires the same test to be repeated. It's wrong to assume that performance during an easy test under ideal conditions will be the same as during a strenuous test under harsher conditions. Probability tests are adequate only for tests that can be repeated (12 times in this example for a 0.75 Ps) under the exact (or nearly exact) same conditions. High fidelity, accredited models are sometimes used to simulate large quantity tests under various sets of conditions (with Monte Carlo inputs, for instance) to get an improved estimate of "probability of success". A more appropriate requirement for actual testing is a continuous (time/distance/cost/etc.) variable like "improved time to complete" in place of a "probability of success". Continuous requirements can need as few as 3 to 5 tests (vice 12 in this probability example). The use of continuous thresholds is discussed later.
ii) Results reporting - For probabilities, a graph of "power" vs. "number of failures" shows that statistical power decreases as the number of failures increases. Note the inability to determine whether the test "statistically passes" when more than 1 failure occurs, especially when the sample of data produces exactly the same proportion (successes / total tests) as the threshold requirement. Three failures equate to a 0.75 result (9 of 12 passed), and the power is ZERO since it is unclear whether the true population that this sample represents will perform below or above the threshold (you're sitting on the fence). Of course, the graph of power vs. number of failures in the test plan will not have any failure count circled before the test, but the circle can be added post-test and the same curve used for the test report. This is a clear way to show how close the system is to "statistically" passing (or failing) the test. The confidence interval for the estimated range of performance in the field can also be calculated. In this case, if 1 failure did occur, the "gross overall performance" (if it has any meaning at all) can be expected to fall between 0.52 and 0.94 probability for the real system. If 3 failures out of 12 tests happened, a re-calculated graph would show that we could expect only a much worse 0.5 probability of success in the field with 81% confidence and 84% power. The point of this probability discussion, however, is to emphasize that these results can be totally misleading if the 12 tests were run under greatly varying conditions. Results are believable only when essentially the same (or nearly so) conditions were repeated 12 times. This is not always achievable.
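As a rough cross-check on the probability example above, the older binomial (confidence only) calculation can be sketched in a few lines of Python. This is a minimal sketch and not the STATISTICA single proportion procedure, so it says nothing about power and is not expected to reproduce that tool's output exactly. The function names `binomial_confidence` and `failure_table` are made up for illustration.

```python
from math import comb


def binomial_confidence(n_tests, failures, ps_threshold):
    """Classical binomial confidence that the true Ps exceeds ps_threshold.

    Computed as 1 - P(at most `failures` failures in n_tests trials),
    evaluated for a system whose true Ps sits exactly at the threshold.
    """
    q = 1.0 - ps_threshold  # per-test failure probability at the threshold
    tail = sum(
        comb(n_tests, k) * q**k * ps_threshold ** (n_tests - k)
        for k in range(failures + 1)
    )
    return 1.0 - tail


def failure_table(n_tests, ps_threshold, max_failures):
    """Rows of (failures, observed proportion, confidence) for a test report."""
    return [
        (f, (n_tests - f) / n_tests, binomial_confidence(n_tests, f, ps_threshold))
        for f in range(max_failures + 1)
    ]


if __name__ == "__main__":
    # Planning view: confidence for several quantities, one failure allowed.
    for n in range(10, 14):
        print(n, round(binomial_confidence(n, 1, 0.75), 3))

    # Reporting view: 12 tests against a 0.75 threshold, 0 to 3 failures.
    for f, observed, conf in failure_table(12, 0.75, 3):
        print(f, round(observed, 3), round(conf, 3))
```

With 12 tests and 1 failure this gives roughly 84% confidence against the 0.75 threshold. With 3 failures the observed proportion is exactly 0.75 and the confidence drops well below 50%, which is the "sitting on the fence" case.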
b) Continuous requirement (with example)
i) Quantities and performance - A continuous variable (time/distance) is the preferred type of requirement because it allows evaluation of relatively small incremental improvements with high confidence and power and low test quantities. The test director determines how the new design is progressing by asking for the mean (of miss distance, for example) from early calculations, legacy data, lab test results, model runs, or early development tests. A mean and standard deviation can be calculated from this early data and a prediction made. The estimated new mean minus the threshold (or the old mean, if comparing to an older system) is divided by this estimated standard deviation. The result is called the "Standardized effect (Es)", which is merely the number of standard deviations of improvement compared to the threshold or old mean. A graph of test quantities versus Es shows that for an Es of 0.70 "standard deviations of improvement" or larger, test quantities are very small (3 to 5) and provide real savings compared to the "double digit" quantities typically required for probability requirements. For example, the 0.75 probability described earlier requires 12 tests to prove the performance under a single (harsh or easy) set of conditions. A continuous measure (time, distance, dollars) that is estimated to be 1.0 "standard deviation" better than the desired threshold requires only 3 repeated tests. Consequently, 4 different test types (easy, harsh, medium, special conditions) can be evaluated with the same 12 tests (3 x 4 = 12). This provides much more value for the money: confidence in the performance against 4 distinct sets of conditions using a continuous variable requirement, vice only 1 set of conditions for the probability requirement (12 nearly identical tests!). Continuous gives much more bang for the same dollar amount.
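The Es arithmetic described above can be sketched as follows. This is a minimal illustration: the function names and the example numbers are hypothetical, and the quantity estimate uses a textbook normal approximation rather than the t-based calculation in statistics software. The approximation tends to understate very small quantities (it returns 2 at Es = 1.0, where the figure quoted above is 3), so treat it as a rough starting point and not a replacement for the software's curve.

```python
from math import ceil
from statistics import NormalDist


def standardized_effect(estimated_mean, threshold, std_dev):
    """Es: (estimated mean - threshold) / estimated standard deviation.

    The sign is negative when "smaller is better" (miss distance, time);
    only the magnitude matters for the quantity estimate below.
    """
    return (estimated_mean - threshold) / std_dev


def approx_quantity(es, confidence=0.80, power=0.70):
    """Normal-approximation test quantity for a one-sided, one-sample test.

    Assumption: n = ((z_confidence + z_power) / |Es|)^2, rounded up.
    """
    z_conf = NormalDist().inv_cdf(confidence)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_conf + z_power) / abs(es)) ** 2)


if __name__ == "__main__":
    # Hypothetical early data: estimated mean 12.0, threshold 10.0, sd 2.0.
    es = standardized_effect(12.0, 10.0, 2.0)
    print("Es =", es)
    for candidate in (0.5, 0.75, 1.0, 1.5, 2.0):
        print(candidate, approx_quantity(candidate))
```

Running the loop over a range of Es values gives the same general shape as the planning curve: the required quantity climbs quickly as Es shrinks.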
ii) Results reporting - On the graph of test quantities versus Es, one arrow points at the estimated performance (pre-test) and another at the real results (post-test), showing the actual performance improvement (with true 80% confidence and 70% power if the same test is repeated only 3 times). This universal curve could be used for each of the product "types" tested. The benefit over the probability requirement method is readily seen: that method provides only a gross overall approximation of "probability of success" for all the product types combined, which is rather meaningless. It is not statistically valid if you are not testing the same product repeatedly, because you are comparing apples to oranges. For continuous variable requirements, by contrast, each incremental improvement is readily seen on such a graph, with less testing. Of course, if the result falls to the left of the estimated Es, the test director should question this reduced performance. In the worst case, if performance is much worse, the test quantities chosen can turn out to be inadequate to prove 80% confidence and 70% power (required test quantities increase as the resulting Es decreases). One or two spares would be useful to prepare for this possibility. Better than expected performance (to the right), on the other hand, means the test could feasibly be ended early, letting the test director declare victory and save one or two test assets. This is rarely possible since these tests are usually necessary for other purposes. The predicted population performance (mean time improvement, for instance) can be provided with a basic one-sided "t" calculation to show the 80% confidence "interval". This is where the population (in the field) performance would be expected to fall (say, between 5 seconds and 6.5 seconds, with high confidence and power).
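The one-sided "t" calculation mentioned above can be sketched like this. It is a minimal sketch under stated assumptions: the data are hypothetical, the function name `lower_bound_80` is made up, only a lower bound is computed (a two-ended range like the "5 to 6.5 seconds" example would need an upper bound as well), and the critical values are the usual rounded t-table entries for an 80% one-sided level.

```python
from math import sqrt
from statistics import mean, stdev

# Rounded t critical values for 80% one-sided confidence, keyed by
# degrees of freedom (n - 1). Covers quantities of 2 through 6 only.
T_80_ONE_SIDED = {1: 1.376, 2: 1.061, 3: 0.978, 4: 0.941, 5: 0.920}


def lower_bound_80(samples):
    """80% one-sided lower confidence bound on the population mean."""
    n = len(samples)
    if n - 1 not in T_80_ONE_SIDED:
        raise ValueError("table covers only 2 to 6 test results")
    t_crit = T_80_ONE_SIDED[n - 1]
    return mean(samples) - t_crit * stdev(samples) / sqrt(n)


if __name__ == "__main__":
    # Hypothetical time improvements (seconds) from 3 repeated tests.
    improvements = [5.0, 6.0, 7.0]
    print(round(lower_bound_80(improvements), 2))
```

For these made-up results the bound comes out a little under 5.4 seconds, meaning there is about 80% confidence that the field mean improvement is at least that large.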
2) Proposed Planning and Reporting Formats:
a) For Probability requirements and all tests nearly "the same":
i) Test Plan - A graph which shows (without the circled result) Confidence and Power vs. the number of failures out of Y tests against a 0.XX threshold. The null hypothesis states that the result will be < 0.XX. The test director hopes to prove, with greater than 80% confidence and 70% power, that the alternative hypothesis is true: that the performance is greater than or equal to 0.XX.
ii) Test Report
(1) If test is statistically passed or failed - The graph of confidence and power vs. number of failures would show (with circled result) the test results. N performance failures provide XX% confidence and YY% power that the system will perform (above or below) threshold. The test provides statistically significant results that the system will perform (above spec or below spec).
(2) If test results are in the "inconclusive" range (less than 70% power) - The graph would show (with circled result) the test results. X performance failures provide statistically inconclusive results because confidence and/or power fail to meet the desired 80% confidence / 70% power limit.
(3) The graph would show the real, statistically significant, estimated performance threshold of 0.XX due to the resulting X failures. For instance, if 9 of 12 passed (3 failures) the proportion passed would be 0.75, exactly the threshold value. The calculated power (and the power shown on the graph) would be ZERO.
(4) The field performance can be expected to be, with XX% confidence, between 0.XX and 0.YY (calculated confidence interval).
b) For Continuous (time/distance) requirements:
i) Test Plan - A graph would show (without actual results arrow) calculated test quantities for 80% confidence and 70% power vs. "standard deviations of improvement (Es)" beyond the expected mean threshold of acceptable performance (or beyond the old mean performance, different curve). The Es on the x-axis would start at 0 and go up to, say, 2 standard deviations of improvement. The estimated mean could be calculated using legacy data, preliminary calculations, models, lab test results, and/or development test results. The number of test data points (X) required for each unique test (same or very similar conditions; state the unique groups tested (easy, harsh, medium, special), with a separate curve for each) is also shown.
ii) Test Report - The graph would show (with actual results arrow next to estimated performance arrow) that the resulting test data is better or worse than estimated.
(1) If results are better than estimated - Results show that performance will be better than (threshold or old data mean (different curve)) and also better than the pre-test estimate. The confidence and power of this test result is XX% confidence and YY% power. There is 80% confidence that the field equipment (state the threshold requirement, say distance achieved) will perform between X and Y (confidence interval).
(2) If results are worse than estimated - Results show that performance will be worse than estimated. However, the test quantities are still adequate (or: the performance was so much worse than estimated that the quantities tested were inadequate for statistical significance). This test results in XX% confidence and YY% power that the real population will perform better than threshold (or the old system mean performance).
(3) The 80% confidence interval predicts the field performance (say, time to perform the task) would be between 0.XX and 0.YY.