A Sampling Methodology for INSURV Material Inspections

Abstract: This technical report summarizes research into sampling methods that the U.S. Navy Board of Inspection and Survey (INSURV) could employ during Material Inspections (MIs) of ships and submarines. The goal is to improve the Board's efficiency in conducting MIs without compromising either Board effectiveness or inspection rigor. The idea of sampling is that, rather than inspecting every item of a specific type (for example, portable CO2 bottles) onboard a ship or submarine, INSURV will only inspect a sample of those items. From the sample, INSURV would then characterize the ship's or submarine's entire complement of that item. This report outlines a sampling methodology that is statistically rigorous and therefore quantitatively defensible, and it is implementable. It is based on well-known sampling methods, such as those described in Cochran (1977) and Lohr (1999). The method described herein allows INSURV to specify the desired margin of error of the results on each item. It is expected that this decision will be based on the mission essentiality and/or safety criticality of each item, where items that are mission essential or safety critical will be given very small margins of error. Similarly, items that are not mission essential or safety critical will be given appropriately larger margins of error.

The idea of sampling is that, rather than inspecting every item of a specific type (for example, portable CO2 bottles) onboard a ship or submarine, INSURV will only inspect a sample of those items. From the sample, INSURV would then characterize the ship's or submarine's entire complement of that item.
A good sampling methodology must: (1) ensure that the sample is representative of the population, and (2) ensure that the sample size is sufficiently large so that the "margin of error" of the results is small. Furthermore, a successful sampling methodology must be:
• Defensible: The sampling methodology must be analytically/statistically rigorous, and it needs to pass the "common sense" test (so the Fleet will accept it).
• Implementable: It must be easy for inspection teams to apply during the inspection process. That is, the purpose of sampling is to make the inspection teams more efficient, not to add a burden to the inspection process.
• Justifiable: It must result in an improvement in efficiency sufficient to more than make up for the extra complexity it will introduce.
• Flexible: Inspectors must have the flexibility to adjust as shipboard conditions require. In particular, the methodology cannot be so rigid that it inhibits inspectors from pursuing things they observe aboard the platform.
• Transparent: Inspectees will need to know that INSURV will not necessarily inspect all items, though they need to be prepared for all, and they will need to be prepared for, and be willing to accept, results based on sampling.
This report outlines a sampling methodology that meets the requirements of the first two bullets: it is statistically rigorous and therefore quantitatively defensible, and it is implementable. It is based on well-known sampling methods, such as those described in Cochran (1977) and Lohr (1999).
The method described herein allows INSURV to specify the desired margin of error of the results on each item. It is expected that this decision will be based on the mission essentiality and/or safety criticality of each item, where items that are mission essential or safety critical will be given very small margins of error. Similarly, items that are not mission essential or safety critical will be given appropriately larger margins of error.
Based on this choice, as well as other information about each item, such as how many are aboard a given platform, the method gives a required sample size. For mission-essential or safety-critical items, particularly when there are few of each item aboard the ship or submarine, the method often results in a requirement for 100% inspection, which is what INSURV is currently doing.
However, for non-mission-essential and non-safety-critical items, particularly when there are many of the items, the sample size can be considerably smaller than the total number of items.
Whether the sampling method described in this report is justifiable, in the sense that the effort of implementing it is worth the benefits of sampling, is a determination that only PRESINSURV and Fleet leadership can make.

As an annotated briefing, this report is intended to present in written form how the briefing would have been given verbally. The text under each slide (and sometimes continuing on to the next page) documents how each slide would have been described and presented. In so doing, the text sometimes reiterates verbiage from a slide and often amplifies and expands on the information contained in the slide.
This briefing introduces the idea of sampling, including discussing the arguments in favor of using sampling during MIs. It then describes a sampling methodology for MIs that allows INSURV to correctly and defensibly determine how many items to inspect (the sample size) as well as which items to inspect.
As we will discuss, the methodology is designed to ensure samples are representative of the population and sample sizes are set so that the sampling "margin of error" is appropriately small.
That said, this report only presents an overarching sampling methodology. The idea of sampling is that, rather than inspecting every item of a specific type (such as CO2 bottles) onboard a ship or submarine, INSURV will only inspect a sample of those items. Inherent in this idea is that there is sufficient information in the sample from which INSURV can make an accurate determination about the material condition of the entire platform with respect to the item.
Of course, the measurement resulting from the examination of a sample (e.g., the CO2 bottle EOC score) is unlikely to exactly match the measurement that would have occurred had the whole population been inspected. Thus, the goal of a good sampling methodology is to ensure that the EOC score estimated from the sample is "close" to the EOC score that would have been determined if the entire population had been inspected.
As we will discuss in this report, there are two critical sampling criteria: (1) the sample must be representative of the population, and (2) the sample size must be sufficiently large so that the sampling "margin of error" is appropriately small.

By ensuring these two criteria are achieved, it is possible to have confidence that the result from inspecting the sample of items does, indeed, reflect the entire shipboard complement of items. We will return to these ideas, first defining them more rigorously, and then discussing how a sampling methodology that meets the criteria can be implemented.

The "margin of error" quantifies the uncertainty inherent in sampling.
When used in the context of polling results, the margin of error is typically taken to mean that the result for the population is highly likely to be within the poll's observed result, plus or minus the margin of error. While not technically correct, this interpretation does capture the idea that the margin of error is a measure of the uncertainty inherent in the sample.
Technically, the margin of error is the half-width of a 95% confidence interval around the statistic of interest. So, adding and subtracting the margin of error from the EOC score of the sample, say for the CO2 bottles from the previous slide, gives an interval within which the population EOC score lies with high confidence.
Perhaps most relevant to this discussion, note that the margin of error quantifies the uncertainty in the result. The larger the margin of error, the more likely the result observed from the sample can deviate significantly from what would have been observed if the entire population of items had been inspected.
The good news is that, via the sample size, INSURV can control the margin of error of its results.
Consider an example to illustrate the margin of error idea. In a Rasmussen Reports survey, 1,000 likely voters were asked to rate President Obama, and 42% of them said he is a good or excellent leader. The margin of error for this poll is 3%. Thus, a 95% confidence interval for the percentage of voters in the entire population who think the President is a good or excellent leader is 39% to 45%. That is, we can be highly confident that between 39% and 45% of likely U.S. voters would rate the President a good or excellent leader.

The takeaway: it is often not necessary to inspect every item to make an adequate determination of material condition/readiness.
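The poll's 3% figure can be checked with the standard large-sample formula for the margin of error of a proportion. The z = 1.96 multiplier (95% confidence) is an assumption here, since the poll does not state its method. A minimal sketch in Python:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a sampled proportion p based on n responses.

    Uses the standard large-sample formula z * sqrt(p * (1 - p) / n);
    z = 1.96 for approximately 95% confidence is an assumption.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Rasmussen poll: n = 1,000 likely voters, 42% favorable.
p, n = 0.42, 1000
moe = margin_of_error(p, n)        # approximately 0.031, i.e., about 3%
interval = (p - moe, p + moe)      # approximately (0.39, 0.45)
```

Rounding the resulting interval to whole percentages reproduces the 39% to 45% range quoted above.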

There are two ways to think about how INSURV might apply sampling.
The first we'll call "within platform" sampling, which means that the sampling will be applied to items from one ship or submarine during its MI. This is the type of sampling we've been discussing thus far with the CO2 bottle example.
However, "between platform" sampling may also be of interest. Here the idea is that certain items are only inspected on some ships or submarines, not all of them. Obviously, this type of sampling is not relevant to MIs, but it could be useful if INSURV were tasked with assessing Fleet readiness for some item or program that is not part of an MI. Under these conditions, it may not be necessary to inspect every item on every platform, but rather only a sample of platforms.
In any case, since this report is concerned with how sampling might apply to MIs, we will only concern ourselves with within platform sampling.
Before continuing, it is important to lay out the principles that a sampling scheme should meet if it is to be applied by INSURV to MIs. Such a sampling scheme must be:

• Defensible: The sampling methodology must be analytically/statistically rigorous, and it needs to pass the "common sense" test (so the Fleet will accept it).
• Implementable: It must be easy for inspection teams to apply during the inspection process. That is, one purpose of sampling is to make the inspection teams more efficient, not to add a burden to the inspection process.
• Flexible: Inspectors must have the flexibility to adjust as conditions on the ground require. In particular, the methodology cannot be so rigid that it inhibits inspectors from pursuing things they observe during an MI.
• Justifiable: It must result in an improvement in efficiency sufficient to more than make up for the extra complexity it will introduce. Furthermore, from an INSURV organizational standpoint, sampling is only a benefit if it reduces the resources required for material inspections.
• Transparent: Inspectees need to know that INSURV will not necessarily inspect all items, though they need to be prepared for all. They also will need to be prepared for, and be willing to accept, results based on sampling.
As previously mentioned, there are two critical criteria for valid sampling:
• First, the sample must be representative of the population. That is, it cannot be focused on any particular subset, say by location or division. Representativeness is usually achieved by randomly sampling from among all the items aboard a platform (i.e., the "population") according to specific guidelines that we will discuss.
• Second, the sample must also be of sufficiently large size to ensure an appropriately small "margin of error." The margin of error itself should be a function of the criticality of the item being inspected. Thus, items with higher criticality should have larger sample sizes, while items with lower criticality can have smaller sample sizes.
Much of the rest of this report will be about how to determine the necessary sample size and then how to select a representative sample.

The correct sample size for a particular inspection evolution is dependent on three quantities. First, it is dependent on the total number of items that could be inspected, which we call the size of the population. Second, it is dependent on the measurement variation in the population, meaning how much the items themselves vary in terms of the measurements being taken or observed. And, third, the sample size depends on the desired margin of error.
Of these, only the margin of error is within INSURV's control (indeed, it should be specified by INSURV). For the other two, the population size (usually denoted by a capital letter N) is simply the number of items that are installed or aboard the ship or submarine being inspected. The variation in the population, measured in terms of the standard deviation (usually denoted by σ) of the EOC score for that item, is a function of how that particular item is operated, maintained, and probably a host of other factors.
The key point is that N and σ are simply a function of what exists onboard the platform being inspected. The population size N, at least for most items, should be known precisely before the MI, while the EOC standard deviation will have to be estimated, probably from historical data.

In particular, the population standard deviation σ is usually estimated by the sample standard deviation, denoted by s, which is calculated in the usual way.
That is, if $x_1, x_2, \ldots, x_n$ are the observed EOC scores on a sample of $n$ items, then

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2},$$

where $\bar{x}$ is the sample average.
An important point to note is that, for a fixed population size N, populations that are more variable (i.e., have greater EOC standard deviations) will require larger sample sizes to achieve a desired margin of error, compared to an equivalent population with a smaller EOC standard deviation.

Now, while s is continuous and, at least theoretically, can take on any nonnegative value, for the purposes of the calculations in this briefing we will discretize s into four levels and base the sample size calculations on these levels. As shown in the slide, we call these levels "low," "moderate," "high," and "very high" population variations. For the purposes of this report, they are defined as EOC standard deviations of 0.1, 0.2, 0.3, and 0.4, respectively.
These EOC standard deviation levels were derived from a brief analysis of some MI data for the CG-47 class.
Alternatively, INSURV can use the precise values of s for each item. This will require using the equation on Slide 19 to determine the required sample size, rather than the tabulated values we will discuss shortly.
Should INSURV decide to implement a sampling scheme, a more rigorous analysis of these levels should be undertaken to ensure they are the most relevant. Then, during implementation, they should be used conservatively, meaning that sample size determinations should be based on rounding up to the largest reasonable value of s.
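The calculation of s and the conservative rounding up to the nearest discrete level can be sketched as follows. The example scores in the usage note are notional illustrations, not actual MI data:

```python
import math

# EOC standard deviation levels from the report: "low", "moderate",
# "high", and "very high" correspond to 0.1, 0.2, 0.3, and 0.4.
LEVELS = [(0.1, "low"), (0.2, "moderate"), (0.3, "high"), (0.4, "very high")]

def sample_std(scores):
    """Sample standard deviation s, using the usual n-1 denominator."""
    n = len(scores)
    xbar = sum(scores) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in scores) / (n - 1))

def variation_level(s):
    """Conservatively round s up to the smallest discrete level covering it."""
    for sigma, name in LEVELS:
        if s <= sigma:
            return sigma, name
    return LEVELS[-1]  # cap at "very high" if s exceeds 0.4
```

For instance, notional historical scores [0.85, 0.90, 0.70, 0.95, 0.80] give s of about 0.096, which rounds up conservatively to the 0.1 ("low") level.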

As we discussed, the sample size is a function of the margin of error, which INSURV would set. All other things being equal, a smaller margin of error will result in a larger sample size. At the extreme, a zero margin of error results in 100% inspection (i.e., all the items in the population must be inspected). This is where INSURV currently operates.
However, a zero margin of error may not be necessary for all inspection items and, in fact, the margin of error should be driven by the criticality of the item being inspected. That is, for mission-critical items, where it is vital that INSURV measures the population with a high degree of accuracy, there should be a small margin of error (resulting in a large sample size, which could be the entire population). On the other hand, for non-mission-critical items, a larger margin of error may be acceptable, which could result in not all the items being inspected.
To simplify the sample size calculations, we use four margin of error levels, which we characterize as "very low," "low," "moderate," and "high." These correspond to margins of error of 1%, 3%, 5%, and 10%, respectively.

However, it is important to note that the deficiency-only MI data mentioned earlier is not the right data to use for estimating population variation, since it is comprised only of deficiencies. The correct calculation should use information on the entire population, which includes both deficient and nondeficient items.
Thus, these examples are likely underestimating the total population variation.
Nonetheless, they are included here for illustrative purposes.
The margins of error are also illustrative, but they show how mission-critical items, such as mission-essential and safety equipment, should be assigned very small margins of error.

For example, consider an item with an EOC variation of s=0.4 ("very high"), a population size of N=100, and a desired margin of error of m=0.1 (10% or "high"). For this item the required sample size is 40. That is, under these conditions, only 40% of the items need to be inspected. In contrast, for an item with an EOC variation of s=0.1 ("low"), a population size of N=10, and a margin of error of m=0.01 (1% or "very low"), the required sample size is 10. That is, under these conditions, a 100% inspection is required.

Similarly, as the population size N increases (all else staying constant), a larger sample size n is required to achieve the same margin of error.


To achieve a smaller margin of error m (again, holding all else constant), a larger sample size n is required.
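The equation on Slide 19 is not reproduced in this text, but the standard finite-population sample size formula (as in Cochran 1977) reproduces the report's worked examples if we assume a multiplier of z = 2 (roughly 95% confidence). A sketch under that assumption:

```python
import math

def required_sample_size(N, s, m, z=2.0):
    """Sample size needed to estimate a mean EOC score to within margin m.

    Standard finite-population formula (Cochran 1977):
        n0 = (z * s / m)**2        # infinite-population sample size
        n  = n0 / (1 + n0 / N)     # finite-population correction
    The z = 2 default is an assumption; the report's Slide 19 equation
    is not reproduced in this text.
    """
    n0 = (z * s / m) ** 2
    n = n0 / (1 + n0 / N)
    return min(N, math.ceil(n))
```

With the report's numbers, required_sample_size(100, 0.4, 0.1) returns 40 and required_sample_size(10, 0.1, 0.01) returns 10, matching the two examples above.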
As we discussed earlier, there are two important considerations in implementing a rigorous sampling scheme. One is the determination of the sample size. The second is the methodology for selecting a representative sample. Intuitively, a representative sample must give each item the chance to be in the sample. That is, it cannot arbitrarily eliminate specific items from being sampled.
In addition, the methodology should ensure that the sample comes from throughout the ship or platform.
A critical consideration is that the sampling methodology must eliminate human judgment from the sample selection process. There is an important reason for this, namely that if the sample selection is left up to human judgment, then it is possible for conscious or subconscious biases to creep into the sample.
For example, an inspector may tend to focus on items that are likely to be defective, thereby lowering the estimated EOC. Or perhaps another inspector decides to only sample easily available items, which also just happened to have better maintenance for the same reason, thereby overestimating the EOC.
These statistical issues aside, it's also the only way to defend against allegations that the inspector chose a biased sample, biased either for or against the platform. That is, a formal methodology for selecting the sample that is outside the inspector's control also gives the inspectors a defense against allegations of either favoritism or antagonism.

A sampling strategy that meets the criteria on the previous slide and is also easy to implement is called systematic sampling. A simple example will make the idea clear. Imagine a population of N=100 items with a required sample size of n=10. A systematic sample results if the inspector chooses a random item to start with and then inspects every tenth item after it.
The calculation gets a little more complicated if the population size divided by the sample size (N/n) is not an integer, or if n is more than 50% of N. In the former case, the solution, as shown in the slide above, is to round N/n down to the next lowest integer. For example, if N/n = 10.2, simply round it down to 10.
In the latter case, the calculation is shown in the last bullet of the slide above, where the idea is to determine which items to skip and not inspect. As a simple example, imagine the case where N=9 and n=6. Then N/(N-n)=3, and every third item would be skipped and not inspected. For example, if the items are numbered 1, 2, …, 9 and the inspector picked item 5 to start, then he or she would inspect items 5, 6, 8, 9, 2, and 3, skipping over items 7, 1, and 4.

Now, there are some real-world complications that need to be addressed in the implementation of systematic sampling. The first is that the items have to have some type of natural ordering in order to be able to inspect every k-th item.
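The selection mechanics just described can be sketched as follows. The function is an illustration of the scheme, not INSURV-issued code, and it assumes the items have already been placed in a natural route order:

```python
import random

def systematic_sample(N, n, start=None):
    """Select n of N items (numbered 1..N), walking the route in order.

    For n <= N/2, inspect every k-th item (k = N // n) from the starting
    item, wrapping around the route; for n > N/2, walk every item and
    skip every k-th one (k = N // (N - n)) until N - n items are skipped.
    """
    if n >= N:
        return list(range(1, N + 1))
    if start is None:
        start = random.randint(1, N)   # random start removes human judgment
    if n <= N // 2:
        k = N // n                     # inspect every k-th item
        return [(start - 1 + j * k) % N + 1 for j in range(n)]
    k = N // (N - n)                   # skip every k-th item instead
    selected, skipped = [], 0
    for j in range(N):
        item = (start - 1 + j) % N + 1
        if (j + 1) % k == 0 and skipped < N - n:
            skipped += 1               # skip (do not inspect) this item
        else:
            selected.append(item)
    return selected
```

With N=9, n=6, and a start of item 5, the function returns [5, 6, 8, 9, 2, 3], matching the worked example above.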
For example, what does it mean to an inspector on the deckplates to inspect "every other" watertight door? To implement such a sampling scheme, it would probably be necessary to devise some type of route around the ship or submarine over which the inspector would stop and inspect every other watertight door as he or she comes to it on the route.
The second complication is that the frequency of items to inspect (or skip) must not match up with some systematic feature of the items being inspected. For example, in a berthing rack inspection, racks often come in tiers of three, so inspecting "every third" rack could result in only top or bottom racks being inspected.
In this case, a solution might be that the inspector has to randomly choose one rack in each tier to inspect, though a system would need to be put in place to remove subjectivity from the choice. Alternatively, the inspector might simply systematically subsample the racks in each tier, perhaps starting with the top rack in the first tier inspected, proceeding to the middle rack in the second tier, to the bottom rack in the third tier, and then repeating this pattern for the fourth, fifth, and sixth tiers, etc.

A third complication arises if the resulting interval (the EOC score of the sample plus or minus the margin of error) straddles two categories, say degraded and unsat (e.g., the interval turns out to be 77%-83%). It is not clear, then, from the results of the sample whether the item should be given a degraded or unsat rating.
There are two possible solutions for this. The first is to let the inspection team make a subjective call as to how the item is to be rated. The second is to collect more data in order to decrease the margin of error, with the hope that the new interval will fall entirely in one rating category or the other.
The former is consistent with current practice and is probably the appropriate solution. However, it does open the inspection team up to second guessing and allegations of bias. On the other hand, the latter is likely to be unworkable in the field, as it would require the inspectors and ship's force to revisit an inspection already seemingly completed. Furthermore, it's still possible that the new interval will straddle the two rating categories (unless the second inspection results in a 100% inspection).
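A simple straddle check can at least make the detection of this situation mechanical. The 0.8 cutoff below is hypothetical, implied by the 77%-83% example; the actual INSURV rating thresholds are not given in this text:

```python
def interval_straddles(score, margin, cutoffs=(0.8,)):
    """True if [score - margin, score + margin] crosses any rating cutoff.

    The cutoffs are hypothetical: the report's 77%-83% example implies a
    boundary near 80%, but actual INSURV category thresholds are not
    stated in this text.
    """
    lo, hi = score - margin, score + margin
    return any(lo < t < hi for t in cutoffs)
```

For instance, a sample EOC score of 80% with a 3% margin of error straddles the assumed 80% boundary, while a score of 90% with the same margin does not.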

A fourth issue that would need to be addressed is the question of how much of the sampling calculations should be done by the inspectors during the inspection versus how much should be done for the inspectors ahead of time.
This report has presented a sampling methodology that is statistically rigorous and quantitatively defensible. The methodology allows INSURV to explicitly trade off the desired inspection precision for a system against the required inspection level of effort. The current inspection methodology, 100% inspection, also implicitly makes such a trade-off, requiring maximum precision for all systems regardless of whether such a level of precision is necessary or appropriate.
Of course, just because the method is quantitatively defensible does not mean it is justifiable to the Fleet, or that the Fleet will find it acceptable. This is not something that can be addressed via statistics; it is a substantive issue that only PRESINSURV and Fleet leadership can determine. And, while the methodology is theoretically feasible, additional work is required to understand whether there are implementation issues that will need to be addressed. We briefly address some of these issues in the next three slides.

First, it is important to emphasize that this report only presents an overarching sampling methodology. It does not discuss or explain how to implement the sampling methodology within INSURV's MIs. Such implementation will require decisions about which platforms, and then which systems within platforms, sampling should be applied to. Perhaps more importantly, successful incorporation of sampling into MIs will require a well-considered implementation plan. That is, it will be critical to get all deckplate implementation and execution details exactly right before a full roll-out to the Fleet.

Given all the foregoing, should PRESINSURV determine that the sampling methodology is still worth pursuing, the next logical step is a limited beta test on a small number of non-mission-critical, non-safety systems. Some of the habitability inspections seem like good potential candidates. The issue is to assess how sampling would work under actual shipboard conditions and to uncover any real-world impediments to the use of sampling. For example, as described in Slides 25 and 26 ("Some Important Issues to be Resolved"), what are all the issues with and barriers to systematic sampling when applied to shipboard items? How should such systematic sampling plans be codified for the inspectors? Will they need to be tailored for each system? How much flexibility do inspectors need when executing a sampling strategy? How much support do they need?
Of course, a successful beta test is only the first step in establishing sampling as a routine part of MIs. That is, not only might the methodology need to be revised to deal with any issues or complications that are found during the beta test, but a set of guidelines and/or instructions will need to be written and approved that formalize the program within INSURV. In addition to the formal guidelines, internal operating procedures will need to be established and implemented, including defining and establishing any additional analytical support capabilities that will be required. For example, INSURV will need to determine who will specify the margins of error and then who will be responsible for taking those margins of error and calculating the sample sizes and specifying sampling parameters.
Similarly, once the organizational details have been determined, resourced, and implemented, a formal training program should be conducted to educate inspectors about how to execute sampling in the field, including how to explain and justify it to ship's force (as necessary/desired). Also, INSURV will need to communicate the new policies and procedures to the Fleet and Navy leadership. All of this will require careful planning and execution to ensure the changes are understood and positively received.