Thursday, 14 May 2020

Probability and statistics in the monitoring of epidemics


My friend and colleague Calle Berglöf recently challenged us, his reactor physicist colleagues, to consider whether the knowledge we use when determining reactor safety parameters by stochastic methods could be applied to controlling, or at least monitoring, the spread of the Covid-19 pandemic. This reminded me of an interesting application of probability theory in disease control, which I thought I would share here. It is not applicable to the Covid-19 pandemic, but it shows how probability theory can make a very significant contribution to detecting diseases.
I found this pearl in the outstanding book “An Introduction to Probability Theory and Its Applications”, Vol. I, by William Feller. The method was invented by a certain R. Dorfman during World War II, when a large number of blood tests had to be taken to detect diseases (such as malaria) among troops fighting in tropical countries. In a situation where very few persons are infected, most of the tests are negative, hence many tests are done in vain. Dorfman's solution was to pool the blood samples from a group of k persons and analyse the pooled sample with a single test. If the test was negative, all k persons were declared healthy with one single test. If the test of the pool was positive, then all k persons had to be tested separately, so that in this latter case k+1 tests had to be performed. However, on average, the gain from the groups whose pooled sample was negative was much larger than the loss of the one extra test for each group that proved positive. According to Feller, Dorfman achieved savings of up to 80%.
We can give an illustration here. Assume that, on average, only one person in 100 is infected, i.e. the probability that a single person is infected is 1/100, and that 1000 persons need to be tested. Then, on average, the total number of infected persons in this population is 10. If one divides the 1000 persons into 100 groups of 10 persons each, then roughly 9 out of 10 groups will prove negative, and only one in 10 positive. For the 100 groups one first has to take 100 tests (one per group). About 90 groups (900 persons) will be negative, and about 10 groups (100 persons) positive. The latter 100 persons have to be tested individually, which means another 100 tests. The total number of tests is thus about 200 instead of 1000, hence the savings is 80%.
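The arithmetic of this example can be checked with a few lines of Python. This is just the standard expected-value calculation for one round of pooling, using the numbers above (1000 persons, infection probability 1/100, groups of 10):

```python
# Expected number of tests with one round of pooling, for the example above.
N, p, k = 1000, 0.01, 10
n_groups = N // k

# A pool of k persons tests negative only if all k members are healthy.
p_pool_negative = (1 - p) ** k

# One test per pool, plus k individual re-tests for each positive pool.
expected_tests = n_groups * (1 + k * (1 - p_pool_negative))

print(f"P(pool negative)     = {p_pool_negative:.3f}")    # about 0.90
print(f"Expected total tests = {expected_tests:.1f}")      # about 196, vs 1000 individual tests
print(f"Expected savings     = {1 - expected_tests / N:.0%}")
```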
In reality, the situation is even better. There is a chance that some groups contain 2 or more infected persons, which means that the number of groups proving negative will be even higher than estimated above. One can also refine the method: the groups that prove positive can be further sub-divided into smaller groups, applying the same strategy as above.
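To get a feeling for this refinement, here is a small Monte Carlo sketch. It assumes a perfectly reliable test and uses one specific splitting rule of my own choosing (by no means the only possibility): a positive pool is simply cut in half and both halves are re-tested, recursively. The numbers N, p and k are those of the example above.

```python
import random

def tests_needed(group):
    """Number of tests needed to clear a group when positive pools are
    split in two and re-tested recursively (one possible refinement)."""
    n_tests = 1                        # test the pooled sample of this group
    if any(group) and len(group) > 1:  # pool positive and still divisible
        mid = len(group) // 2
        n_tests += tests_needed(group[:mid]) + tests_needed(group[mid:])
    return n_tests

def average_tests(N=1000, p=0.01, k=10, trials=2000):
    """Monte Carlo average of the total number of tests for N persons."""
    total = 0
    for _ in range(trials):
        population = [random.random() < p for _ in range(N)]
        total += sum(tests_needed(population[i:i + k])
                     for i in range(0, N, k))
    return total / trials

print(f"Average number of tests with recursive splitting: {average_tests():.1f}")
```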
Of course, to optimise the method, one has to have an estimate of the probability that a person is infected. This can be obtained by first testing, say, 100 persons individually and counting the number of positive tests. With a good estimate of this probability, one can then determine from probability theory, for any population size to be tested, the optimum strategy (number and size of groups) which gives, on average, the smallest number of tests.
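For a single round of pooling this optimisation can in fact be written down explicitly: with infection probability p and group size k, the expected number of tests per person is 1/k + 1 - (1-p)^k, and one simply picks the k that minimises this expression. A small sketch, as an illustration of the idea rather than a prescription:

```python
def expected_tests_per_person(p, k):
    """Expected tests per person with one round of pooling:
    1/k for the pool test, plus one more test per person
    whenever the pool turns out positive."""
    return 1 / k + 1 - (1 - p) ** k

def optimal_group_size(p, k_max=100):
    """Group size minimising the expected number of tests per person."""
    return min(range(2, k_max + 1), key=lambda k: expected_tests_per_person(p, k))

for p in (0.001, 0.01, 0.05, 0.1):
    k = optimal_group_size(p)
    cost = expected_tests_per_person(p, k)
    print(f"p = {p:5.3f}:  best group size k = {k:3d}, "
          f"expected tests per person = {cost:.3f}")
```

The higher the infection probability, the smaller the optimal group, and above a certain prevalence pooling no longer pays off at all.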
In principle, such a method could be effective for identifying the individuals carrying the coronavirus in a large population, and it could reduce the number of necessary tests drastically, given that the probability of being infected is low. However, the corona test is not based on blood samples, and unlike in the army, the people to be tested are not concentrated in a small area (this would be very unwise), so in practice this method cannot be applied to fight Covid-19.
After publishing the above text on Facebook, I received an interesting comment from Yves Barmaz pointing out that the simple example above does not take into account specificity and sensitivity, and, worse, how the performance of a single test would be affected by batching.
It is true that the simple example I gave is based on the assumption that each and every test gives the correct result. This is an idealisation, but in many cases it should be true or nearly true, otherwise we could not trust medicine...
If a test has a certain probability of giving an erroneous result, then both the traditional method of testing individually and the batching method will produce a number of misclassifications with some probability. This does not immediately tell us which method is better. If the test reliability is close to unity, the batching method should still be better. With a high probability of test error, it might well be that the batching method, while requiring fewer tests, will miss (much) more positive cases. Since the coronavirus tests, or at least some of them, are not 100% reliable, this is another reason why the batching method should not be used to monitor the development of the Covid-19 pandemic.
But even for tests that are not 100% reliable, it is still possible to devise an optimum strategy if one assumes a probability of failure for a single test. It might well turn out that under some circumstances individual testing is best. This question can be developed into a goldmine of statistical problems (and I am sure mathematicians have dealt with it) by defining the probabilities of both the "missed alarm" (not detecting an infection) and the "false alarm" (declaring a person infected when he/she is not), and calculating, for the testing of a population with various strategies, the probabilities, expectations and moments of the numbers of missed and false alarms. However, one then also has to define what "optimum" means in order to find the optimal strategy.
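To make this concrete in the very simplest setting, suppose a test has sensitivity s (the probability of a positive result on an infected sample) and specificity t (the probability of a negative result on a healthy sample), and assume, rather crudely, that pooling does not dilute the sample and that test outcomes are independent. Then with two-stage pooling an infected person must be detected twice, so the effective sensitivity drops from s to roughly s squared, whereas a healthy person is falsely flagged only if both the pool test and the individual re-test go wrong. The sketch below, with made-up numbers chosen purely for illustration, compares the expected numbers of missed and false alarms for individual and pooled testing under these assumptions.

```python
def expected_errors(N, p, k, sens, spec):
    """Expected missed and false alarms, individual vs. pooled testing.
    Crude assumptions: test outcomes are independent, and a pooled sample
    containing at least one infected person is detected with the same
    sensitivity as an individual positive sample (no dilution effect)."""
    q = 1 - p

    # Individual testing.
    missed_ind = N * p * (1 - sens)          # infected, but the test is negative
    false_ind  = N * q * (1 - spec)          # healthy, but the test is positive

    # Two-stage pooling (pool test, then individual re-tests).
    # An infected person is found only if both the pool test and the
    # subsequent individual test come out positive.
    missed_pool = N * p * (1 - sens ** 2)

    # A healthy person is falsely flagged only if their pool tests positive
    # (either it contains an infected member and is detected, or it is an
    # all-healthy pool giving a false positive) and the re-test also fails.
    p_pool_pos_given_healthy = (1 - q ** (k - 1)) * sens + q ** (k - 1) * (1 - spec)
    false_pool = N * q * p_pool_pos_given_healthy * (1 - spec)

    return missed_ind, false_ind, missed_pool, false_pool

# Made-up numbers, purely for illustration.
m_i, f_i, m_p, f_p = expected_errors(N=1000, p=0.01, k=10, sens=0.95, spec=0.99)
print(f"Individual: missed ~ {m_i:.1f}, false alarms ~ {f_i:.1f}")
print(f"Pooled:     missed ~ {m_p:.1f}, false alarms ~ {f_p:.1f}")
```

With these particular numbers, pooling roughly doubles the expected number of missed cases while greatly reducing the number of false alarms, which illustrates the trade-off described above; whether that is acceptable depends entirely on what one decides to call "optimum".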
My purpose was only to illustrate the principle in an idealised case, and the principle is still ingenious. At any rate, in such cases my motto is a statement from the first edition of Athanasios Papoulis's book "Probability, Random Variables and Stochastic Processes". It disappeared from the later editions, but it is reproduced in the preface of the second edition:
"Scientific theories deal with concepts, not with reality. All theoretical results are derived from certain axioms by deductive logic. In physical sciences the theories are so formulated as to correspond in some useful sense to the real world, whatever that may mean. However, this correspondence is approximate, and the physical justification of all theoretical conclusions is based on some form of inductive reasoning".