SOME APPLICATIONS OF THE CRAMÉR-RAO INEQUALITY

J. L. HODGES, JR. AND E. L. LEHMANN
UNIVERSITY OF CALIFORNIA
1. Summary and introduction

In 1945 and 1946, Cramér [1] and Rao [2] independently investigated the problem of obtaining a simple lower bound to the variance of point estimates. In 1947 Wolfowitz [3] simplified the conditions under which Cramér had obtained this bound and extended the result to sequential estimates. In the present paper, use is made of the Cramér-Rao result, in Wolfowitz's form, to investigate some problems of the minimax theory of estimation.

The Bayes method for obtaining minimax estimates developed by Wald since 1939 [4], [5] is completely satisfactory whenever the minimax estimate is the Bayes solution for some a priori distribution of the parameter. However, frequently minimax estimates are not Bayes solutions, but only limits of Bayes solutions. When this occurs, the possibility is left open that the minimax estimate is not admissible; that is, that there exists some other minimax estimate whose risk is never greater and is for some parameter value less than that of the given estimate.

In section 2 we consider certain estimation problems in which the loss is proportional to the square of the error of estimate, and use the Cramér-Rao bound to establish directly that certain estimates, which can be shown to be minimax by the Bayes method, are in addition admissible. In section 3 we consider several problems of sequential estimation, for some of which previously no minimax estimates have been known. In all of these cases it turns out that there are minimax estimates based on samples of fixed size.

Problems similar to those treated in the present paper were considered simultaneously by Girshick and Savage [6], the scope of whose work is much larger than ours. Portions of both papers were presented at the joint colloquium of the Stanford and California statistical groups, resulting in a fruitful exchange of ideas. The method introduced here has been employed in [6] to obtain extensions of some of our results.

2. Estimates based on samples of fixed size

Let $X$ be a random variable with distribution $P_\theta$, $\theta \in D$, so that the probability that $X$ falls in a set $A$ is given by
(2.1)  $P_\theta(A) = \int_A p_\theta(x)\, d\mu(x)$.
Let $f(X)$ be any estimate of $\theta$ and let $b_f(\theta) = E_\theta[f(X)] - \theta$ be its bias. Then the Cramér-Rao inequality states that the variance $\sigma_f^2(\theta)$ of $f(X)$ satisfies

(2.2)  $\sigma_f^2(\theta) \ge \dfrac{[1 + b_f'(\theta)]^2}{E_\theta\left[\left(\frac{\partial}{\partial\theta}\log p_\theta(X)\right)^2\right]}$.

We shall now prove a theorem which will essentially reduce the problem of proving that certain estimates are admissible and minimax, to proving that there is a unique solution to a differential inequality related to (2.2). It will be convenient to associate with each bias function $b(\theta)$ the function $C_b(\theta)$ defined by
(2.3)  $C_b(\theta) = b^2(\theta) + \dfrac{[1 + b'(\theta)]^2}{E_\theta\left[\left(\frac{\partial}{\partial\theta}\log p_\theta(X)\right)^2\right]}$.
If the loss is defined to be the square of the error of estimation, (2.3) has the significance of a lower bound on the risk of an estimate whose bias function is $b(\theta)$. Suppose now that $g(X)$ is an estimate for which the risk everywhere attains this lower bound. We may then substitute, in a proof of the admissibility of $g(X)$, the bound (2.3) for the actual risk.

THEOREM 1. If the loss is squared error, if $g(X)$ is an estimate for which (2.2) becomes an equality, if the inequality (2.2) is satisfied for all estimates, and if, for every bias function $b(\theta)$,

(2.4)  $C_b(\theta) \le C_{b_g}(\theta)$ for all $\theta \in D$ implies $b(\theta) = b_g(\theta)$,

then $g(X)$ is admissible.

PROOF. Since loss is squared error, $R_f(\theta) = b_f^2(\theta) + \sigma_f^2(\theta) \ge C_{b_f}(\theta)$. Suppose for some estimate $f(X)$, $R_g(\theta) \ge R_f(\theta)$ for all $\theta \in D$. Since by assumption $C_{b_g}(\theta) = R_g(\theta)$, we have $C_{b_g}(\theta) \ge C_{b_f}(\theta)$ and from (2.4) conclude $b_g(\theta) = b_f(\theta)$. From this follows $b_g'(\theta) = b_f'(\theta)$, $C_{b_g}(\theta) = C_{b_f}(\theta)$, $R_f(\theta) \ge C_{b_f}(\theta) = C_{b_g}(\theta) = R_g(\theta)$, and hence $R_f(\theta) = R_g(\theta)$.
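As a concrete illustration of the quantity $C_b(\theta)$ on which the theorem turns, the following sketch (our own, not part of the paper; the sample size, the bias functions, and the helper name `C_b` are illustrative choices) evaluates (2.3) for the normal-mean situation of problem 1 below and checks by simulation that the sample mean attains the bound, while a hypothetical shrinkage bias pushes the bound above $1/n$ for large $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

def C_b(b, b_prime, theta, fisher_info):
    """Lower bound (2.3) on the squared-error risk of any estimate
    whose bias function is b, evaluated at the parameter value theta."""
    return b(theta) ** 2 + (1.0 + b_prime(theta)) ** 2 / fisher_info

n = 10                      # sample size (arbitrary choice)
fisher = n * 1.0            # Fisher information of a N(theta, 1) sample of size n

# Unbiased case b(theta) = 0: the bound is 1/n for every theta.
zero = lambda t: 0.0
print(C_b(zero, zero, theta=2.0, fisher_info=fisher))   # 0.1

# Monte Carlo risk of the sample mean, which attains the bound.
theta = 2.0
xbar = rng.normal(theta, 1.0, size=(100_000, n)).mean(axis=1)
print(np.mean((xbar - theta) ** 2))                      # approximately 0.1

# A shrinkage estimate a*xbar has bias b(theta) = (a-1)*theta; its bound
# C_b(theta) exceeds 1/n once |theta| is large, which is the mechanism
# behind the uniqueness argument of problem 1 below.
a = 0.9
b = lambda t: (a - 1.0) * t
bp = lambda t: (a - 1.0)
print(C_b(b, bp, theta=5.0, fisher_info=fisher))         # 0.331 > 0.1
```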
COROLLARY 1. Theorem 1 remains valid if the loss function is squared error divided by a function $q(\theta)$ which is everywhere positive and finite.

PROOF. Admissibility is not affected when the risk function is divided by such a function.

COROLLARY 2. If in addition to the assumptions of theorem 1 we assume that $g(X)$ is a constant risk estimate, then $g(X)$ is an admissible minimax estimate.

PROOF. Any constant risk admissible estimate is minimax.

Remarks. (i) A statistical problem is not completely specified until the loss function has been stated. Squared error is the classical loss function for estimation, primarily for reasons of convenience [7, p. 516]. An alternative loss function is obtained if we divide squared error by the variance of $X$, thus measuring the seriousness of errors in terms of the difficulty of estimation as reflected by $\sigma_X$ as a function of $\theta$. This alternative approach is particularly desirable in those problems for which, when the loss function is squared error itself, the minimax risk is infinite. For, when this happens, every estimate is minimax and the minimax principle provides no basis for choice.
Those loss functions obtained by dividing squared error by a function of $\theta$ have been termed "quadratic loss functions" by Girshick and Savage [6].

(ii) In all of the problems considered below, the family of distributions is complete in the sense of [8], and hence every estimate is uniquely determined by its bias function. Since in the proof of theorem 1 we have established the uniqueness of the admissible minimax bias function, it follows that the estimates shown below to be admissible minimax estimates are in fact the unique minimax estimates.

(iii) In the statistical applications which we shall make of this theorem, we sometimes replace a sample $X_1, X_2, \ldots, X_n$ by a single sufficient statistic, say $X$. It is known that nothing is lost by this simplification, since from the risk point of view one may restrict oneself to estimates which are nonrandomized functions of a sufficient statistic [9]. Actually, it is not necessary to work with the single sufficient statistic, since the Cramér-Rao inequality may be applied directly to a sample, but the regularity conditions are easier to check when dealing with a single variable.

As an application of theorem 1 we shall now consider five specific problems, showing in each case that a given estimate is admissible and minimax. To apply theorem 1 we must check the validity of (2.2) for all estimates. By the method of proof given by Wolfowitz [3], (2.2) can be shown to be valid under the following
assumptions: (i) the parameter $\theta$ lies in an open interval $D$ of the real line, which may be infinite or semi-infinite; (ii) for almost all $x$, $\partial p_\theta(x)/\partial\theta$ exists for all $\theta \in D$; (iii) the expression $\int p_\theta(x)\, d\mu(x)$ may be differentiated under the integral sign; (iv) for every $\theta \in D$, $E_\theta\left[\left(\frac{\partial}{\partial\theta}\log p_\theta(X)\right)^2\right] > 0$; (v) the expression $\int f(x)\, p_\theta(x)\, d\mu(x)$ may be differentiated under the integral sign.

The problems we treat concern the binomial, Poisson, normal, and chi-square distributions, and we now check the validity of (2.2) for these distributions.
LEMMA 1. If $p_\theta(x)$ is any of the following:

(a)  $\dbinom{n}{x}\,\theta^x (1-\theta)^{n-x}$,  $x = 0, 1, \ldots, n$;  $0 < \theta < 1$;  $\mu$ = counting measure;

(b)  $\dfrac{\theta^x}{x!}\, e^{-\theta}$,  $x = 0, 1, \ldots$;  $0 < \theta < \infty$;  $\mu$ = counting measure;

(c)  $\dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\theta)^2}$,  $-\infty < x < \infty$;  $-\infty < \theta < \infty$;  $\mu$ = Lebesgue measure;

(d)  $\dfrac{x^{n/2-1}\, e^{-x/2\theta}}{(2\theta)^{n/2}\,\Gamma(n/2)}$,  $0 < x < \infty$;  $0 < \theta < \infty$;  $\mu$ = Lebesgue measure;
then (2.2) is satisfied. PROOF. Conditions (i)-(iv), none of which involve the estimate f(X), are obviously satisfied. In checking condition (v), there is no loss of generality in assuming that f(X) has finite variance, since otherwise (2.2) certainly holds.
For distribution (a), condition (v) is obvious, since we are dealing with a finite sum. In cases (c) and (d) the result follows immediately from well known properties of the bilateral and unilateral Laplace transform, respectively. In case (b) our assumption guarantees the absolute convergence of the power series $\sum_{x=0}^{\infty} f(x)\,\theta^x / x!$ in the open interval $0 < \theta < \infty$, and hence the series may be
differentiated term by term in that interval.

In the examples below we need in each case only check (2.4), the remaining conditions of theorem 1 obviously being satisfied.

Problem 1. Let $X_1, X_2, \ldots, X_n$ be a sample from the normal distribution with unknown expectation $\theta$ and known variance which we may without loss of generality take to be 1. Let the loss be squared error. It has been known for some time that $\bar{X} = \sum X_i / n$ is a minimax estimate for $\theta$. This result was obtained by Stein and Wald [10] for a different loss function for the much harder sequential problem, and was proved explicitly for the loss function here employed by Wolfowitz [11]. We shall now use theorem 1 to prove both the admissibility and minimaxity of $\bar{X}$. Since $\bar{X}$ is sufficient we need only consider estimates of the form $f(\bar{X})$, and since $\bar{X}$ is normally distributed we may apply (c) of lemma 1. We need only check (2.4), which now becomes

(2.5)  $b^2(\theta) + \dfrac{1}{n}\,[1 + b'(\theta)]^2 \le \dfrac{1}{n}$ for every $\theta \in D$ implies $b(\theta) = 0$.

Since neither term on the left side of (2.5) can be negative, $|b(\theta)|$ is bounded and $b'(\theta)$ is never positive. Consequently there exists a sequence $\{\theta_i\}$ for which $b'(\theta_i)$ approaches 0 as $|\theta_i|$ approaches $\infty$, and hence by the hypothesis of (2.5), $b(\theta_i)$ does likewise. But since $b(\theta)$ is monotone, it must always be 0.

It is interesting to observe that if we assume certain additional information about $\theta$, the estimate $\bar{X}$ may continue to be minimax without any longer being admissible. This is the case, for example, if we assume it known that $\theta > \theta_0$. For, $b^2(\theta) + [1 + b'(\theta)]^2/n$ is still a lower bound for the risk, and by an argument analogous to the one just given it is easily seen that $\sup_{\theta > \theta_0} \left\{b^2(\theta) + [1 + b'(\theta)]^2/n\right\} \ge 1/n$. Hence the minimax risk is still $1/n$ and $\bar{X}$ is minimax; its inadmissibility follows from the fact that $P(\bar{X} < \theta_0) > 0$ [9].
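The inadmissibility remark can be checked numerically. The following sketch is our illustration, not from the paper; the sample size and restriction point are arbitrary, and the truncated estimate $\max(\bar{X}, \theta_0)$ is simply the obvious competitor suggested by $P(\bar{X} < \theta_0) > 0$.

```python
import numpy as np

# Monte Carlo comparison under the restriction theta >= theta0: the truncated
# estimate max(xbar, theta0) never does worse than xbar and does strictly
# better near theta0, so xbar, while still minimax, is no longer admissible.
rng = np.random.default_rng(1)
n, theta0, reps = 10, 0.0, 100_000

for theta in [0.0, 0.2, 1.0, 3.0]:
    xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    risk_xbar = np.mean((xbar - theta) ** 2)
    risk_trunc = np.mean((np.maximum(xbar, theta0) - theta) ** 2)
    print(f"theta={theta:4.1f}  risk(xbar)={risk_xbar:.4f}  "
          f"risk(max(xbar, theta0))={risk_trunc:.4f}")
```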
Problem 2. Let $X_1, \ldots, X_n$ be a sample from a Poisson distribution with unknown mean $\theta$. For the loss we take squared error divided by $\theta$; see remark (i) above. Since $X = \sum_{i=1}^{n} X_i$ is sufficient for $\theta$ we may restrict consideration to estimates $f(X)$. Taking $g(X) = X/n$, we shall prove admissibility of $X/n$ by checking (2.4), which now becomes the condition

(2.6)  $\dfrac{n\, b^2(\theta)}{\theta} + [1 + b'(\theta)]^2 \le 1$ for all $\theta \in D$ implies $b(\theta) = 0$.
Since neither term on the left can be negative, $b(\theta)$ is bounded in absolute value by $\sqrt{\theta/n}$ and $b'(\theta)$ satisfies the inequality

$b'(\theta) \le \sqrt{1 - \dfrac{n\, b^2(\theta)}{\theta}} - 1 \le -\dfrac{n\, b^2(\theta)}{2\theta}$.

Thus $\lim_{\theta \to 0} b(\theta) = 0$, $b'(\theta) \le 0$, and hence $b(\theta) \le 0$ for all $\theta$. But if for some $\theta_0$, $b(\theta_0)$ were negative, it would thereafter always be less than or equal to the function $c(\theta)$ for which $c'(\theta) = -n\, c^2(\theta)/(2\theta)$ and $c(\theta_0) = b(\theta_0)$. This latter function may be obtained explicitly by solving the differential equation, and is easily seen not to be absolutely bounded by $\sqrt{\theta/n}$.

We justified the choice of loss function for the present problem in part by the remark that there exists no estimate with bounded risk function when the loss is squared error. That this is so is easily seen. For let $f(X)$ be such an estimate. Then

$C_{b_f}(\theta) = b_f^2(\theta) + \dfrac{\theta}{n}\,[1 + b_f'(\theta)]^2$

is bounded. But boundedness of the second term of $C_{b_f}(\theta)$ implies that $b_f'(\theta) \le -1 + \epsilon$ for all sufficiently large $\theta$, and hence the unboundedness of the first term. An analogous remark applies to the chi-square problem which we treat next and, in general, whenever the range of $\theta$ and the reciprocal of $E_\theta\left[\left(\frac{\partial}{\partial\theta}\log p_\theta(X)\right)^2\right]$ are both unbounded and (2.2) holds for all estimates.
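A quick numerical check of problem 2 (our sketch, not from the paper; the sample size is an arbitrary choice): under the loss $(d-\theta)^2/\theta$, the estimate $X/n$ based on $X = X_1 + \cdots + X_n$ has constant risk $1/n$.

```python
import numpy as np

# With X ~ Poisson(n * theta), the scaled risk of X/n is Var(X/n)/theta = 1/n
# for every theta; the simulation below confirms this at a few parameter values.
rng = np.random.default_rng(2)
n, reps = 5, 500_000

for theta in [0.1, 1.0, 4.0, 25.0]:
    x = rng.poisson(n * theta, size=reps)
    est = x / n
    scaled_risk = np.mean((est - theta) ** 2 / theta)
    print(f"theta={theta:6.2f}  scaled risk = {scaled_risk:.4f}  (1/n = {1/n:.4f})")
```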
Problem 3. We next consider the estimation of the parameter in the chi-square distribution. This problem arises, for example, if we have a sample $X_1, X_2, \ldots, X_n$ from a normal distribution of known expectation but unknown variance. Then the statistic $\sum_{i=1}^{n} [X_i - E(X_i)]^2$ is sufficient for the variance and has a chi-square distribution. Suppose now that $X$ has the distribution (d) of lemma 1, so that $X/\theta$ has a chi-square distribution with $n$ degrees of freedom and $E_\theta(X) = n\theta$, and take the loss to be squared error divided by $\theta^2$. This loss function is chosen according to the principle discussed in remark (i) above. We shall now show, by means of theorem 1, that the estimate $g(X) = X/(n+2)$ is the unique admissible minimax estimate for $\theta$. It is interesting that this estimate is biased, while the minimum variance unbiased estimate has constant risk but is neither minimax nor admissible.

Condition (2.4) of theorem 1 now becomes the condition

(2.7)  $\dfrac{b^2(\theta)}{\theta^2} + \dfrac{2}{n}\,[1 + b'(\theta)]^2 \le \dfrac{2}{n+2}$ for every $\theta \in D$ implies $b(\theta) = -2\theta/(n+2)$.

Since neither term on the left of the hypothesis of (2.7) can be negative, we have $b'(\theta) \le 0$ and $|b(\theta)| < \theta$. It follows that $b(0+) = 0$. If for any $\theta$, $b'(\theta) = b(\theta)/\theta$, their common value must be $-2/(n+2)$, since the expression $r^2 + \frac{2}{n}[1 + r]^2$ has a minimum of $2/(n+2)$ when $r = -2/(n+2)$.

We next observe that $b'(\theta) \le b(\theta)/\theta$. For, suppose that for some $\theta$, $b'(\theta) > b(\theta)/\theta$. Then we should have

$\left[\dfrac{b(\theta)}{\theta}\right]^2 + \dfrac{2}{n}\,[1 + b'(\theta)]^2 > \left[\dfrac{b(\theta)}{\theta}\right]^2 + \dfrac{2}{n}\left[1 + \dfrac{b(\theta)}{\theta}\right]^2$
which, by the previous paragraph, is not less than $2/(n+2)$. But this contradicts (2.7). Observing $\theta^2 [b(\theta)/\theta]' = \theta\, b'(\theta) - b(\theta)$, we conclude that $b(\theta)/\theta$ is a nonincreasing function of $\theta$.

We shall now prove that $b'(\theta)$ is not bounded away from $b(\theta)/\theta$ for large $\theta$. For suppose $b'(\theta) \le b(\theta)/\theta - \epsilon$ for all $\theta \ge \theta_0$. Then for $\theta \ge \theta_0$, $b(\theta)$ will lie below that function $c(\theta)$ for which $c(\theta_0) = b(\theta_0)$ and $c'(\theta) = c(\theta)/\theta - \epsilon$. But $c(\theta) = -\epsilon\,\theta\log\theta + k\theta$, which is incompatible with $-\theta < b(\theta) \le c(\theta)$. Analogously, $b'(\theta)$ cannot be bounded away from $b(\theta)/\theta$ as $\theta \to 0$. For otherwise $b(\theta)/\theta \ge c(\theta)/\theta > 0$ for $\theta$ sufficiently small, while we know that $b(\theta) \le 0$.

We next see that if for some sequence $\{\theta_i\}$, $b'(\theta_i) - b(\theta_i)/\theta_i \to 0$, then $b(\theta_i)/\theta_i \to -2/(n+2)$. For, the hypothesis of (2.7) may be written

(2.8)  $\left\{\left[\dfrac{b(\theta)}{\theta}\right]^2 + \dfrac{2}{n}\left[1 + \dfrac{b(\theta)}{\theta}\right]^2\right\} + \dfrac{2}{n}\left[b'(\theta) - \dfrac{b(\theta)}{\theta}\right]\left[2 + b'(\theta) + \dfrac{b(\theta)}{\theta}\right] \le \dfrac{2}{n+2}$,

and our hypothesis implies that the second term on the left side of (2.8) approaches 0; consequently the first term must approach its minimum, which implies our statement. Combining the results of the two preceding paragraphs, and using the monotoneness of $b(\theta)/\theta$, we see that $b(\theta)/\theta \to -2/(n+2)$ as $\theta \to 0$ or $\infty$, whence our result (2.7) follows.
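An informal numerical check of problem 3 (our sketch; the degrees of freedom are an arbitrary choice): with $X = \theta\,\chi^2_n$ and loss $(d-\theta)^2/\theta^2$, the biased estimate $X/(n+2)$ has constant risk $2/(n+2)$, beating the unbiased $X/n$, whose constant risk is $2/n$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 8, 500_000

for theta in [0.5, 2.0, 10.0]:
    x = theta * rng.chisquare(n, size=reps)           # X has distribution (d) of lemma 1
    risk_minimax = np.mean((x / (n + 2) - theta) ** 2 / theta ** 2)
    risk_unbiased = np.mean((x / n - theta) ** 2 / theta ** 2)
    print(f"theta={theta:5.1f}  risk of X/(n+2) = {risk_minimax:.4f} "
          f"(2/(n+2) = {2/(n+2):.4f})  risk of X/n = {risk_unbiased:.4f} "
          f"(2/n = {2/n:.4f})")
```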
Problem 4. Suppose that $X$ has the binomial distribution (a) of lemma 1, and that the loss is squared error divided by $\theta(1-\theta)$. Condition (2.4) now becomes the condition

(2.9)  $b^2(\theta) + \dfrac{\theta(1-\theta)}{n}\,[1 + b'(\theta)]^2 \le \dfrac{\theta(1-\theta)}{n}$ for all $\theta \in D$ implies $b(\theta) \equiv 0$.

Letting $\theta$ tend to 0 and 1 yields $b(0) = b(1) = 0$, while $b'(\theta) \le 0$ since $b^2(\theta)$ is nonnegative; since $b(\theta)$ is thus nonincreasing and vanishes at both endpoints, it vanishes identically.
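The binomial computations can be checked exactly. The sketch below is our own illustration (the sample size is arbitrary); it covers problem 4 and also the estimate $(X + \sqrt{n}/2)/(n + \sqrt{n})$ treated as problem 5 below: under the loss $(d-\theta)^2/[\theta(1-\theta)]$, $X/n$ has constant risk $1/n$, while under plain squared error the latter estimate has constant risk $1/[4(\sqrt{n}+1)^2]$ and the risk of $X/n$ varies with $\theta$.

```python
import numpy as np

n = 16
theta = np.linspace(0.05, 0.95, 7)

risk_xn_sq = theta * (1 - theta) / n                 # squared-error risk of X/n
risk_xn_scaled = risk_xn_sq / (theta * (1 - theta))  # scaled risk of X/n: identically 1/n

bias_mm = (1 - 2 * theta) / (2 * (np.sqrt(n) + 1))   # bias of (X + sqrt(n)/2)/(n + sqrt(n))
var_mm = theta * (1 - theta) / (np.sqrt(n) + 1) ** 2
risk_mm = bias_mm ** 2 + var_mm                      # identically 1/(4 (sqrt(n)+1)^2)

print(risk_xn_scaled)   # every entry equals 1/n = 0.0625
print(risk_mm)          # every entry equals 1/(4*(sqrt(n)+1)^2) = 0.01
print(risk_xn_sq)       # varies with theta and exceeds 0.01 near theta = 1/2
```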
Problem 5. As a final example, we consider the previous problem, using however the classical squared error loss function. It is already known [9] that in this case $\dfrac{\sqrt{n}}{\sqrt{n}+1}\left(\dfrac{X}{n} + \dfrac{1}{2\sqrt{n}}\right)$ is an admissible minimax estimate for $\theta$, but we shall now establish this fact as a consequence of theorem 1. The verification of (2.4) involved will in any case be needed when considering the sequential problem in the next section. Condition (2.4) now becomes

(2.10)  $b^2(\theta) + \dfrac{\theta(1-\theta)}{n}\,[1 + b'(\theta)]^2 \le \dfrac{1}{4(\sqrt{n}+1)^2}$ for every $\theta \in D$ implies $b(\theta) = \dfrac{1-2\theta}{2(\sqrt{n}+1)}$.

PROOF. Since the second term on the left side of the hypothesis of (2.10) cannot be negative, we have $|b(\theta)| \le \dfrac{1}{2(\sqrt{n}+1)}$. We next observe that $b(\theta) \ge \dfrac{1-2\theta}{2(\sqrt{n}+1)}$ for $\tfrac{1}{2} \le \theta \le 1$. For if on the contrary we had, for some $\tfrac{1}{2} < \theta_0 < 1$,

$b(\theta_0) < \dfrac{1-2\theta_0}{2(\sqrt{n}+1)}$,

then, since $b(1) \ge -\dfrac{1}{2(\sqrt{n}+1)}$, we should be able to find a point $\theta_0 < \theta_1 < 1$ at which $b'(\theta_1) > -\dfrac{1}{\sqrt{n}+1}$ and at which $b(\theta_1) < \dfrac{1-2\theta_1}{2(\sqrt{n}+1)}$. It is clear from the identical satisfaction of (2.10) by $b(\theta) = \dfrac{1-2\theta}{2(\sqrt{n}+1)}$ that this would imply the violation of (2.10) at $\theta_1$.

By the symmetrical argument in the interval $0 < \theta \le \tfrac{1}{2}$, we find that in that interval $b(\theta) \le \dfrac{1-2\theta}{2(\sqrt{n}+1)}$, and hence that $b(\tfrac{1}{2}) = 0$. It also follows that $b'(\tfrac{1}{2}) \ge -\dfrac{1}{\sqrt{n}+1}$; but on substituting $\tfrac{1}{2}$ for $\theta$ in the hypothesis of (2.10) and using $b(\tfrac{1}{2}) = 0$, we find $b'(\tfrac{1}{2}) = -\dfrac{1}{\sqrt{n}+1}$.
We can now conclude that (2.10) holds. For suppose $b(\theta)$ satisfies the hypothesis of (2.10). By symmetry we need only consider the interval $\tfrac{1}{2} \le \theta \le 1$ and need only show that $b(\theta_0) > \dfrac{1-2\theta_0}{2(\sqrt{n}+1)}$ for some $\tfrac{1}{2} < \theta_0 < 1$ leads to a contradiction. Consider the function $c(\theta) = b(\theta) - \dfrac{1-2\theta}{2(\sqrt{n}+1)}$. The function $c(\theta)$ is continuous and has a continuous derivative, is nonnegative for $\tfrac{1}{2} \le \theta \le 1$, and $c(\tfrac{1}{2}) = c'(\tfrac{1}{2}) = 0$. Hence for every $\epsilon > 0$ and every $k > 0$ we can find $\tfrac{1}{2} < \theta_1 < \theta_0$ for which $|c'(\theta_1)| < \epsilon$ and $c'(\theta_1) > k\, c(\theta_1)$. This is easily seen by considering $[\log c(\theta)]' = c'(\theta)/c(\theta)$, and using the continuity of $c'(\theta)$. Since $b(\theta)$ is assumed to satisfy the hypothesis of (2.10), and since $b(\theta) = \dfrac{1-2\theta}{2(\sqrt{n}+1)}$ satisfies it identically, we can subtract to obtain

$c(\theta)\left[b(\theta) + \dfrac{1-2\theta}{2(\sqrt{n}+1)}\right] + \dfrac{\theta(1-\theta)}{n}\, c'(\theta)\left[2 + b'(\theta) - \dfrac{1}{\sqrt{n}+1}\right] \le 0$.

Take now $\epsilon < 2 - \dfrac{2}{\sqrt{n}+1}$ and $k$ sufficiently large to obtain a contradiction at $\theta_1$.

3. Estimates based on sequential procedures

In the previous section we have considered only the class of estimates based on a sample of fixed size, and have shown that certain estimates are optimum within this class. However, as is well known [12], the efficiency of statistical procedures can often be improved by taking the observations sequentially.

Various definitions of optimum sequential procedures are possible within the minimax theory. For example, one may try to minimize the maximum expectation of a linear combination of loss and cost, measuring the latter by the number of observations. Alternatively, one may place a bound on the expected number of observations and try to minimize the maximum expected loss. Both of these formulations have been considered in the literature [5], [10], [11]. We have applied the method of the present paper to obtain optimum sequential procedures only under the second definition; it seems doubtful that our method would give easy results under the first definition. Although the first definition of an optimum estimate has theoretical advantages, in practical applications the second is sometimes more reasonable. This may happen, for example, when cost and loss cannot be measured on a common scale of value, or when budgetary considerations compel one to place a separate bound on the average cost of experimentation.
It is interesting that in a number of problems it turns out that a procedure of fixed sample size $n$ is optimum among all sequential procedures for which the expected value of the number of observations $N$ never exceeds $n$. For example, this is the case in problems treated by Stein and Wald and by Wolfowitz. We shall now show that the same holds true for the five problems treated in section 2.

The basis of our results in the present section is the extension by Wolfowitz [3] of the Cramér-Rao inequality to the sequential case. Wolfowitz proved under certain regularity conditions that
(3.1)  $\sigma_f^2(\theta) \ge \dfrac{[1 + b_f'(\theta)]^2}{E_\theta(N)\; E_\theta\left[\left(\frac{\partial}{\partial\theta}\log p_\theta(X)\right)^2\right]}$,
where $p_\theta(x)$ is the density of an individual observation and $N$ is the (random) number of observations taken. It is clear that theorem 1, with the obvious modifications, remains valid for the class of all sequential estimates, if we replace inequality (2.2) by inequality (3.1). Further, if we consider only those sequential procedures for which, for some integer $n$,

(3.2)  $E_\theta(N) \le n$ for all $\theta \in D$,

theorem 1 will be valid if in (3.1) we replace $E_\theta(N)$ by $n$.

To extend the results of section 2 to the sequential case, we must verify the satisfaction of (2.4) and of the regularity conditions under which Wolfowitz proved (3.1). We carry out these checks not for all sequential procedures satisfying (3.2), but only for bounded procedures; that is, for procedures such that

(3.3)  $P_\theta(N \le m) = 1$ for all $\theta \in D$

for some finite number $m$. Since our results will be independent of the value of the bound $m$, provided only that it is sufficiently large, the restriction (3.3) is not serious from a practical point of view: any actual experiment does have a bound on the number of observations. However, the restriction is theoretically undesirable. We shall show below that the solutions obtained retain their minimax character when the restriction is removed. On the other hand, our argument does not establish the admissibility of the estimate within the unrestricted class of sequential procedures.

The Wolfowitz regularity conditions are contained in section 3 of his paper [3]. If the sequential procedure is bounded, and if the density is one of those considered in our lemma 1, all of these conditions are trivial, except for his (3.4). An examination of the proof shows that this condition is used only to permit a certain differentiation under the integral sign. We shall assure the applicability of the inequality by checking this differentiability directly.

LEMMA 3. If the sequential procedure is bounded, and if the density is one of those considered in lemma 1, then (3.1) holds.

PROOF. Let $R_j$ be the set of points $(x_1, x_2, \ldots, x_j)$ for which $N = j$. Then
(3.4)  $E_\theta[f(X_1, X_2, \ldots, X_N)] = \sum_{j=1}^{m} \int_{R_j} f(x_1, x_2, \ldots, x_j)\; p_\theta(x_1)\, p_\theta(x_2) \cdots p_\theta(x_j)\; d\mu(x_1)\, d\mu(x_2) \cdots d\mu(x_j)$.
In view of the remarks just made we need only check that the right side of (3.4) may be differentiated under the integral sign. For density (a) of lemma 1 there is no difficulty, since we have simply a polynomial in $\theta$. With (b) we have a convergent multiple series of nonnegative terms. This can be rearranged as a convergent series of powers of $\theta$, which may be differentiated termwise.

The normal case is somewhat more involved. We may assume $E_\theta|f(X_1, X_2, \ldots, X_N)| < \infty$ for all $\theta \in D$, and hence the finiteness of each integral on the right of (3.4). Let $\varphi_j(x_1, x_2, \ldots, x_j)$ be the characteristic function of $R_j$; we must show the differentiability under the integral of
(3.5)  $\int \cdots \int \varphi_j(x_1, x_2, \ldots, x_j)\, f(x_1, x_2, \ldots, x_j)\; p_\theta(x_1)\, p_\theta(x_2) \cdots p_\theta(x_j)\; d\mu(x_1)\, d\mu(x_2) \cdots d\mu(x_j)$.

Recalling $p_\theta(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\theta)^2}$, collecting the exponents, and making an orthogonal transformation with $y_1 = (x_1 + x_2 + \cdots + x_j)/\sqrt{j}$, we see that (3.5) may be rewritten as
(3.6)  $(2\pi)^{-j/2}\, e^{-\frac{j}{2}\theta^2} \int \cdots \int K(y_1, y_2, \ldots, y_j)\; e^{\sqrt{j}\,\theta y_1}\; dy_1\, dy_2 \cdots dy_j$.
Using the Fubini theorem, we see that the integral in (3.6) is a convergent Laplace transform and may therefore be differentiated. A similar argument applies to the chi-square situation, using a unilateral instead of a bilateral Laplace transform.

We can now conclude that the estimates found to be admissible minimax estimates in problems 1-5 of the preceding section continue to have this property in the class of all estimates based on sequential procedures satisfying (3.2) and (3.3). We shall not have to recheck the differential inequalities which result from (2.4), since they are in each case unchanged.

Finally, we observe that condition (3.3) may be removed in all of these problems without affecting the conclusion that all estimates considered are minimax. For, if there were a sequential estimate $\delta$ not satisfying (3.3) having a maximum risk $r$ less by $\epsilon > 0$ than the minimax risk for bounded sequential procedures, then we could construct a bounded sequential estimate with maximum risk $< r + \epsilon/2$. To see this, notice that in each of the cases treated there exists an estimate $\delta_0$ of $\theta$, based on a single observation, whose risk is bounded, say by a constant $k$. Since $E_\theta(N_\delta) \le n$, $P_\theta(N_\delta > m) \to 0$ uniformly in $\theta$. Let the estimate $\delta'$ be defined as follows. If $N_\delta \le m$, let $\delta'$ agree with $\delta$. If $N_\delta > m$, take an $(m+1)$-st observation and let $\delta'$ agree with $\delta_0(x_{m+1})$. It is clear that $P(N_{\delta'} \le m+1) = 1$, that $E_\theta(N_{\delta'}) \le n$, and that $\sup_{\theta \in D} R_\theta(\delta')$ can be made less than $r + \epsilon/2$ by taking $m$ sufficiently large.
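The truncation device just described can be made concrete with a toy simulation. The sketch below is our own construction under arbitrary assumptions: the stopping rule, the sample size bound, and all constants are illustrative, and $\delta_0$ is simply a single observation.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, m, reps = 1.0, 50, 5_000

def delta(theta):
    """A toy sequential rule: keep sampling N(theta, 1) until the running mean
    changes by less than 0.05, then return (estimate, number of observations)."""
    total, k, prev = 0.0, 0, np.inf
    while True:
        total += rng.normal(theta, 1.0)
        k += 1
        mean = total / k
        if abs(mean - prev) < 0.05:
            return mean, k
        prev = mean

def delta_prime(theta):
    """The truncated version: follow delta for at most m observations; if it has
    not stopped by then, draw one further observation and use it as the estimate."""
    total, k, prev = 0.0, 0, np.inf
    while k < m:
        total += rng.normal(theta, 1.0)
        k += 1
        mean = total / k
        if abs(mean - prev) < 0.05:
            return mean, k
        prev = mean
    return rng.normal(theta, 1.0), m + 1   # delta0 applied to the (m+1)-st observation

for rule in (delta, delta_prime):
    results = [rule(theta) for _ in range(reps)]
    est = np.array([e for e, _ in results])
    nobs = np.array([k for _, k in results])
    print(rule.__name__, "risk =", round(np.mean((est - theta) ** 2), 4),
          " E(N) =", round(nobs.mean(), 2))
```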
REFERENCES

[1] H. CRAMÉR, "A contribution to the theory of statistical estimation," Skandinavisk Aktuarietidskrift, Vol. 29 (1946), pp. 85-94.
[2] C. R. RAO, "Information and the accuracy attainable in the estimation of statistical parameters," Bull. Calcutta Math. Soc., Vol. 37, No. 3 (1945), pp. 81-91.
[3] J. WOLFOWITZ, "The efficiency of sequential estimates and Wald's equation for sequential processes," Annals of Math. Stat., Vol. 18 (1947), pp. 215-230.
[4] A. WALD, "Contributions to the theory of statistical estimation and testing hypotheses," Annals of Math. Stat., Vol. 10 (1939), pp. 299-326.
[5] A. WALD, Statistical Decision Functions, Wiley, New York, 1950.
[6] M. A. GIRSHICK and L. J. SAVAGE, "Bayes and minimax estimates arising from quadratic risk functions," Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 1951, pp. 53-73.
[7] C. F. GAUSS, Abhandlungen zur Methode der Kleinsten Quadrate, Berlin, 1887.
[8] E. L. LEHMANN and H. SCHEFFÉ, "Completeness, similar regions, and unbiased estimation - Part I," Sankhyā, Vol. 10 (1950), pp. 305-340.
[9] J. L. HODGES, JR. and E. L. LEHMANN, "Some problems in minimax point estimation," Annals of Math. Stat., Vol. 21 (1950), pp. 182-197.
[10] C. STEIN and A. WALD, "Sequential confidence intervals for the mean of a normal distribution with known variance," Annals of Math. Stat., Vol. 18 (1947), pp. 427-433.
[11] J. WOLFOWITZ, "Minimax estimates of the mean of a normal distribution with known variance," Annals of Math. Stat., Vol. 21 (1950), pp. 218-230.
[12] A. WALD, Sequential Analysis, Wiley, New York, 1947.