High-dimensional instrumental variables regression and confidence sets
Eric Gautier, Alexandre Tsybakov


To cite this version: Eric Gautier, Alexandre Tsybakov. High-dimensional instrumental variables regression and confidence sets. 2011.

HAL Id: hal-00591732 https://hal.archives-ouvertes.fr/hal-00591732v3 Submitted on 8 Oct 2011 (v3), last revised 6 Sep 2014 (v4)

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


HIGH-DIMENSIONAL INSTRUMENTAL VARIABLES REGRESSION AND CONFIDENCE SETS ERIC GAUTIER AND ALEXANDRE TSYBAKOV

CREST (ENSAE), 3 avenue Pierre Larousse, 92 245 Malakoff Cedex, France.

Abstract. We propose an instrumental variables method for estimation in linear models with endogenous regressors in the high-dimensional setting where the sample size $n$ can be smaller than the number of possible regressors $K$, and with $L \ge K$ instruments. We allow for heteroscedasticity and we do not need prior knowledge of the variances of the errors. We suggest a new procedure called the STIV (Self Tuning Instrumental Variables) estimator, which is realized as a solution of a conic optimization program. The main results of the paper are upper bounds on the estimation error of the vector of coefficients in $\ell_p$-norms for $1 \le p \le \infty$ that hold with probability close to 1, as well as the corresponding confidence intervals. All results are non-asymptotic. These bounds are meaningful under the assumption that the true structural model is sparse, i.e., the vector of coefficients has few non-zero coordinates (fewer than the sample size $n$) or many coefficients are too small to matter. In our IV regression setting, the standard tools from the literature on sparsity, such as the restricted eigenvalue assumption, are inapplicable. Therefore, for our analysis we develop a new approach based on data-driven sensitivity characteristics. We show that, under appropriate assumptions, a thresholded STIV estimator correctly selects the non-zero coefficients with probability close to 1. The price to pay for not knowing which coefficients are non-zero and which instruments to use is of the order $\sqrt{\log(L)}$ in the rate of convergence. We extend the procedure to deal with high-dimensional problems where some instruments can be non-valid. We obtain confidence intervals for non-validity indicators and we suggest a procedure which correctly detects the non-valid instruments with probability close to 1.

Date: First version December 2009, this version: September 2011.

Keywords: Instrumental variables, sparsity, STIV estimator, endogeneity, high-dimensional regression, conic programming, optimal instruments, heteroscedasticity, confidence intervals, non-Gaussian errors, variable selection, unknown variance, sign consistency.

We thank seminar participants at Brown, Bocconi, CREST, Compiègne, Harvard-MIT, Institut Henri Poincaré, Paris 6 and 7, Princeton, Toulouse 3, Yale as well as participants of SPA and Saint-Flour 2011 for helpful comments.


1. Introduction

Endogeneity is one of the most important issues in empirical economic research. Consider the linear model

$$y_i = x_i^T \beta^* + u_i, \qquad i = 1, \dots, n, \tag{1.1}$$

where $x_i$ are vectors of explanatory variables of dimension $K \times 1$, $u_i$ is a zero-mean random error possibly correlated with $x_i$, and $\beta^*$ is an unknown parameter. We denote by $x_{ki}$, $k = 1, \dots, K$, the components of $x_i$. The regressors $x_{ki}$ are called endogenous if they are correlated with $u_i$ and they are called exogenous otherwise. Without loss of generality, we assume that the endogenous variables are $x_{1i}, \dots, x_{k_{\mathrm{end}} i}$ for some $k_{\mathrm{end}} \le K$. It is well known that endogeneity occurs, for example, when a regressor correlated both with $y_i$ and with regressors in the model is unobserved; in the errors-in-variables model when the measurement error is independent of the underlying variable; when a regressor is determined simultaneously with the response variable $y_i$; in treatment effect models when the individuals can self-select to the treatment (see, e.g., Wooldridge (2002)). The quantities of interest that we would like to estimate are the components $\beta_k^*$ of $\beta^*$. They characterize the partial effect of the variable $x_{ki}$ on the outcome $y_i$ with the other variables held fixed. Having access to instrumental variables makes it possible to identify the coefficients $\beta_k^*$ in such a setting. A random vector $z_i$ of dimension $L \times 1$ with $L \ge K$ will be called a vector of instrumental variables (or instruments) if it satisfies

$$E[z_i u_i] = 0, \tag{1.2}$$

where $E[\,\cdot\,]$ denotes the expectation. Throughout the paper, we assume that the exogenous variables serve as their own instruments, which means that the components $z_{li}$ of $z_i$ satisfy $z_{li} = x_{l'i}$, where $l' = k_{\mathrm{end}} + l$, $l = 1, \dots, K - k_{\mathrm{end}}$. We consider the problem of inference on the parameter $\beta^*$ from $n$ independent realizations $(y_i, x_i^T, z_i^T)$, $i = 1, \dots, n$. We allow these observations and the unobserved error terms $u_i$ to be heteroscedastic.

In this paper, we are mainly interested in the high-dimensional setting where the sample size $n$ is small compared to $K$, and one of the following two assumptions is satisfied: (i) only few coefficients $\beta_k^*$ are non-zero ($\beta^*$ is sparse), (ii) most of the coefficients $\beta_k^*$ are too small to matter ($\beta^*$ is approximately sparse). In this setting, the system of equations (1.2) provides more moment conditions than observations, and $\beta^*$ cannot be identified by usual instrumental variables methods. Even if $L = K$ and the observations


are identically distributed, the empirical counterpart of the matrix $E[z_1 x_1^T]$ has rank at most $\min(n, K)$ and is not invertible for $K > n$. To our knowledge, no estimator for this setting is currently available.

Cross-country or cross-state regressions are typical situations where one may want to use high-dimensional procedures. The sample size is usually small and one may want to include many variables. Economic theory is indeed not always explicit about the variables that belong to the true model (see, e.g., Sala-i-Martin (1997) concerning development economics). Cross-country regressions are widely used in macroeconomics, development economics or international finance. One possible application is the estimation of Engel curves using aggregate data where the total expenditure is endogenous and we consider as regressors various transformations of the total expenditure.

There are other contexts in economics where high-dimensional methods can be used. For example, it is notably hard to obtain adequate data (legal issues, etc.) in contract economics, and the researcher may be interested in studying contracts between governors and public firms or state regulations of private telecommunications companies, etc. Even in contexts where sample sizes are relatively large, the full search over the models is exponentially hard in the number of parameters. High-dimensional methods can be extremely useful for this purpose since they provide computationally feasible methods of variable selection. There are indeed many cases where the theory asks for a rich and flexible specification. The list of possible regressors quickly increases when one considers interactions between variables or wants to explore the IV regression in a nonparametric setting using linear combinations of elementary functions to approximate the nonparametric function of interest. One may also want to control for many variables when there is a rich heterogeneity or to justify exclusion restrictions and the validity of instruments. Finally, even in cases where the theory is explicit and the selection of variables is not a priori an issue, it becomes important in a stratified analysis when models are estimated in small population sub-groups (for example, in estimating models by cells as defined by the value of an exogenous discrete variable).

Statistical inference under the sparsity scenario when the dimension is larger than the sample size is now an active and challenging field. The most studied techniques are the Lasso, the Dantzig selector (see, e.g., Candès and Tao (2007), Bickel, Ritov and Tsybakov (2009), Belloni and Chernozhukov (2011a); more references can be found in the recent books by Bühlmann and van de Geer (2011), as well as in the lecture notes by Koltchinskii (2011), Belloni and Chernozhukov (2011b)), and the Bayesian-type methods (see Dalalyan and Tsybakov (2008), Rigollet and Tsybakov (2011) and the papers cited therein). In recent years, these techniques became a reference in several areas, such as biostatistics and imaging. Some first applications are now available in economics. Thus,


Belloni and Chernozhukov (2011a) study the ℓ1 -penalized quantile regression and give an application to cross-country growth analysis. Belloni and Chernozhukov (2010) present various applications of the Lasso to economics including wage regressions, in particular, the selection of instruments in such models. Belloni, Chernozhukov and Hansen (2010) use the Lasso to estimate the optimal instruments with an application to the impact of eminent domain on economic outcomes. Caner (2009) studies a Lasso-type GMM estimator. Rosenbaum and Tsybakov (2010) deal with the high-dimensional errors-in-variables problem where the non-random regressors are observed with error and discuss an application to hedge fund portfolio replication. The high-dimensional setting in a structural model with endogenous regressors that we are considering here has not yet been analyzed. Note that the direct implementation of the Lasso or Dantzig selector fails in the presence of a single endogenous regressor as the zero coefficients in the structural equation (1.1) do not correspond to the zero coefficients in a linear projection type model. The main message of this paper is that, in model (1.1) containing endogenous regressors, the high-dimensional vector of coefficients can be estimated together with proper confidence intervals using instrumental variables. This is achieved by the STIV estimator (Self Tuning Instrumental Variables estimator) that we introduce below. Based on it, we can also perform variable selection. All our results are non-asymptotic and provide meaningful bounds when either (i) or (ii) above holds and log(L) is small as compared to n. In particular, they can be used in the still troublesome case K ≤ n < L, i.e., for models with relatively small number of variables and relatively large number of instruments. 
As exemplified by Angrist and Krueger (1991), under a stronger notion of exogeneity which is based on a zero conditional mean assumption, considering interactions of instruments or functionals of instruments can lead to a large number of instruments. This is related to the many instruments literature (see, e.g., Andrews and Stock (2007) for a review). Important problems in this context are selection of instruments (see, e.g., Donald and Newey (2001), Hall and Peixe (2003), Okui (2008), Bai and Ng (2009), and Belloni and Chernozhukov (2011b)), estimation of optimal instruments (see, e.g., Amemiya (1974), Chamberlain (1987), Newey (1990), and Belloni, Chen, Chernozhukov et al. (2010)) or various other issues in the many instruments asymptotics (see, e.g., Chao and Swanson (2005), Hansen, Hausman and Newey (2008), Hausman, Newey, Woutersen et al. (2009)). The number of instruments can be much larger than the sample size. Carrasco (2008), building on Carrasco and Florens (2000, 2008), analyzes this setting in the inverse problems framework and proposes a suitable regularization method. The STIV estimator also leads to a smoothing procedure which is able to handle this case.


The STIV estimator is an extension of the Dantzig selector of Candès and Tao (2007). Like the Square-root Lasso of Belloni, Chernozhukov and Wang (2010), the STIV estimator is a pivotal method independent of the variances of the errors, which are allowed to be heteroscedastic. The implementation of the STIV estimator corresponds to solving a conic optimization program. The results of this paper extend those on the Dantzig selector (see Candès and Tao (2007), Bickel, Ritov and Tsybakov (2009) and further references in Bühlmann and van de Geer (2011)) in several ways: by allowing for endogenous regressors when instruments are available, by working under weaker sensitivity assumptions than the restricted eigenvalue assumption of Bickel, Ritov and Tsybakov (2009), by imposing weak distributional assumptions, by introducing a procedure independent of the noise level and by providing finite sample confidence intervals.

We present basic definitions and notation in Section 2 and we introduce the STIV estimator in Section 3. In Section 4 we present the sensitivity characteristics, which play a major role in our error bounds and confidence intervals. They provide a generalization of the restricted eigenvalues to non-symmetric and non-square matrices. The main results on the STIV estimator are given in Section 5. In Section 6 we consider the setting where some instruments might be non-valid and we wish to detect this. Section 7 discusses some special cases and extensions, in particular, the STIV procedure with estimated linear projection type instruments, akin to two-stage least squares. Section 8 considers computational issues and presents a simulation study. All the proofs are given in the appendix. Also in the appendix, we compare our sensitivity analysis to the more standard one based on restricted eigenvalues in the case of symmetric square matrices.

2. Basic definitions and notation

We set $\mathbf{Y} = (y_1, \dots, y_n)^T$, $\mathbf{U} = (u_1, \dots, u_n)^T$, and we denote by $\mathbf{X}$ and $\mathbf{Z}$ the matrices of dimension $n \times K$ and $n \times L$ respectively with rows $x_i^T$ and $z_i^T$, $i = 1, \dots, n$. The sample mean is denoted by $E_n[\,\cdot\,]$. We use the notation

$$E_n[X_k^a U^b] \triangleq \frac{1}{n} \sum_{i=1}^n x_{ki}^a u_i^b, \qquad E_n[Z_l^a U^b] \triangleq \frac{1}{n} \sum_{i=1}^n z_{li}^a u_i^b,$$

where $x_{ki}$ is the $k$th component of vector $x_i$, and $z_{li}$ is the $l$th component of $z_i$ for some $k \in \{1, \dots, K\}$, $l \in \{1, \dots, L\}$, $a \ge 0$, $b \ge 0$. Similarly, we define the sample mean for vectors; for example, $E_n[UX]$ is a row vector with components $E_n[U X_k]$. We also define the corresponding population means:

$$E[X_k^a U^b] \triangleq \frac{1}{n} \sum_{i=1}^n E[x_{ki}^a u_i^b], \qquad E[Z_l^a U^b] \triangleq \frac{1}{n} \sum_{i=1}^n E[z_{li}^a u_i^b],$$


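and set

$$x_{k*} \triangleq \max_i |x_{ki}|, \qquad z_{l*} \triangleq \max_i |z_{li}|$$

for $k = 1, \dots, K$, $l = 1, \dots, L$. We denote by $\mathbf{D_X}$ and $\mathbf{D_Z}$ the diagonal $K \times K$ (respectively, $L \times L$) matrices with diagonal entries $x_{k*}^{-1}$, $k = 1, \dots, K$ (respectively, $z_{l*}^{-1}$, $l = 1, \dots, L$).

For a vector $\beta \in \mathbb{R}^K$, let $J(\beta) = \{k \in \{1, \dots, K\} : \beta_k \ne 0\}$ be its support, i.e., the set of indices corresponding to its non-zero components $\beta_k$. We denote by $|J|$ the cardinality of a set $J \subseteq \{1, \dots, K\}$ and by $J^c$ its complement: $J^c = \{1, \dots, K\} \setminus J$. The subset of indices $\{1, \dots, K\}$ corresponding to endogenous regressors is denoted by $J_{\mathrm{end}}$. The $\ell_p$ norm of a vector $\Delta$ is denoted by $|\Delta|_p$, $1 \le p \le \infty$. For $\Delta = (\Delta_1, \dots, \Delta_K)^T \in \mathbb{R}^K$ and a set of indices $J \subseteq \{1, \dots, K\}$, we consider

$$\Delta_J \triangleq (\Delta_1 \mathbb{1}_{\{1 \in J\}}, \dots, \Delta_K \mathbb{1}_{\{K \in J\}})^T,$$

where $\mathbb{1}_{\{\cdot\}}$ is the indicator function. For a vector $\beta \in \mathbb{R}^K$, we set $\overrightarrow{\mathrm{sign}(\beta)} \triangleq (\mathrm{sign}(\beta_1), \dots, \mathrm{sign}(\beta_K))$ where

$$\mathrm{sign}(t) \triangleq \begin{cases} 1 & \text{if } t > 0, \\ 0 & \text{if } t = 0, \\ -1 & \text{if } t < 0. \end{cases}$$

For $a \in \mathbb{R}$, we set $a_+ \triangleq \max(0, a)$, $a_+^{-1} \triangleq (a_+)^{-1}$, and $a/0 \triangleq \infty$ for $a > 0$. We adopt the convention $0/0 \triangleq 0$ and $1/\infty \triangleq 0$.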

3. The STIV estimator

The sample counterpart of the moment conditions (1.2) can be written in the form

$$\frac{1}{n} \mathbf{Z}^T (\mathbf{Y} - \mathbf{X}\beta^*) = 0. \tag{3.1}$$

This is a system of $L \ge K$ equations with $K$ unknown parameters. If $L > K$, it is overdetermined; if $L = K$ the matrix $\mathbf{Z}^T \mathbf{X}$ is not invertible in the high-dimensional case $K > n$, since its rank is at most $\min(n, K)$. Furthermore, replacing the population equations (1.2) by (3.1) induces a huge error when $L, K > n$. So, looking for the exact solution of (3.1) in the high-dimensional setting makes no sense. However, we can stabilize the problem by restricting our attention to a suitable "small" candidate set of vectors $\beta$, for example, to those satisfying the constraint

$$\left| \frac{1}{n} \mathbf{Z}^T (\mathbf{Y} - \mathbf{X}\beta) \right|_\infty \le \tau, \tag{3.2}$$

where $\tau > 0$ is chosen such that (3.2) holds for $\beta = \beta^*$ with high probability. We can then refine the search of the estimator in this "small" random set of vectors $\beta$ by minimizing an appropriate


criterion. It is possible to consider different small sets in (3.2); however, the use of the sup-norm makes the inference robust when some (not all) instruments for each endogenous variable are weak. In what follows, we use this idea with suitable modifications. First, notice that it makes sense to normalize the matrix $\mathbf{Z}$. This is quite intuitive because, otherwise, the larger the instrumental variable, the more influential it is on the estimation of the vector of coefficients. For technical reasons, we choose normalization by the maximal absolute value, i.e., multiplying $\mathbf{Z}$ by $\mathbf{D_Z}$. Then the constraint (3.2) is modified as follows:

$$\left| \frac{1}{n} \mathbf{D_Z} \mathbf{Z}^T (\mathbf{Y} - \mathbf{X}\beta) \right|_\infty \le \tau. \tag{3.3}$$

Along with the constraint of the form (3.3), we include a second constraint to account for the unknown level $\sigma$ of the "noise" $u_i$; in particular, if the errors $u_i$ are i.i.d., $\sigma^2$ corresponds to their unknown variance. Specifically, we say that a pair $(\beta, \sigma) \in \mathbb{R}^K \times \mathbb{R}^+$ satisfies the IV-constraint if it belongs to the set

$$\widehat{\mathcal{I}} \triangleq \left\{ (\beta, \sigma) : \ \beta \in \mathbb{R}^K, \ \sigma > 0, \ \left| \frac{1}{n} \mathbf{D_Z} \mathbf{Z}^T (\mathbf{Y} - \mathbf{X}\beta) \right|_\infty \le \sigma r, \ \widehat{Q}(\beta) \le \sigma^2 \right\} \tag{3.4}$$

for some $r > 0$ (to be specified below), and

$$\widehat{Q}(\beta) \triangleq \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \beta)^2.$$
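The IV-constraint (3.4) can be checked numerically for a candidate pair $(\beta, \sigma)$. The following is a minimal sketch in plain Python, not the authors' implementation: the function names are hypothetical, the data are passed as nested lists (rows are observations), and every instrument column is assumed to have a non-zero entry so that $z_{l*} > 0$.

```python
def residuals(Y, X, beta):
    """Residuals y_i - x_i' beta for each observation i."""
    return [y - sum(x_k * b_k for x_k, b_k in zip(x, beta)) for y, x in zip(Y, X)]

def q_hat(Y, X, beta):
    """Q_hat(beta) = (1/n) * sum_i (y_i - x_i' beta)^2."""
    res = residuals(Y, X, beta)
    return sum(e * e for e in res) / len(res)

def iv_moment_sup_norm(Y, X, Z, beta):
    """Sup-norm of (1/n) D_Z Z'(Y - X beta), with D_Z = diag(1 / max_i |z_li|)."""
    res = residuals(Y, X, beta)
    n, n_instruments = len(Y), len(Z[0])
    worst = 0.0
    for l in range(n_instruments):
        z_star = max(abs(Z[i][l]) for i in range(n))  # z_{l*}, assumed non-zero
        moment = sum(Z[i][l] * res[i] for i in range(n)) / (n * z_star)
        worst = max(worst, abs(moment))
    return worst

def satisfies_iv_constraint(Y, X, Z, beta, sigma, r):
    """True if (beta, sigma) belongs to the set defined in (3.4)."""
    return (sigma > 0
            and iv_moment_sup_norm(Y, X, Z, beta) <= sigma * r
            and q_hat(Y, X, beta) <= sigma ** 2)
```

Such a check only tests feasibility; finding the minimizer in Definition 3.1 still requires a conic solver.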

Definition 3.1. We call the STIV estimator any solution $(\widehat{\beta}, \widehat{\sigma})$ of the following minimization problem:

$$\min_{(\beta, \sigma) \in \widehat{\mathcal{I}}} \left( \left| \mathbf{D_X}^{-1} \beta \right|_1 + c\sigma \right), \tag{3.5}$$

where $0 < c < 1$.

We use $\widehat{\beta}$ as an estimator of $\beta^*$. Finding the STIV estimator is a conic program; it can be efficiently solved, see Section 8.1. Note that the STIV estimator is not necessarily unique. Minimizing the $\ell_1$ criterion $|\mathbf{D_X}^{-1} \beta|_1$ is a convex relaxation of minimizing the $\ell_0$ norm, i.e., the number of non-zero coordinates of $\beta$. This usually ensures that the resulting solution is sparse. The term $c\sigma$ is included in the criterion to prevent $\sigma$ from being chosen arbitrarily large; indeed, the IV-constraint alone does not rule this out. The matrix $\mathbf{D_X}^{-1}$ arises from re-scaling of $\mathbf{X}$, which is similar to the re-scaling of $\mathbf{Z}$ discussed above. For the particular case where $\mathbf{Z} = \mathbf{X}$, the STIV estimator provides an extension of the Dantzig selector to the setting with unknown variance of the noise. In this particular case, the STIV estimator can also be related to the Square-root Lasso of Belloni, Chernozhukov and Wang (2010), which solves


the problem of unknown variance in high-dimensional regression with deterministic regressors and i.i.d. errors. The definition of the STIV estimator contains the additional constraint (3.3), which is not present in the conic program for the Square-root Lasso. This is due to the fact that we have to handle the endogeneity.

Remark 3.2. Other normalizations can be used. Instead of the matrices $\mathbf{D_X}$ and $\mathbf{D_Z}$ we can take the diagonal matrices with entries $E_n[X_1^2]^{-1/2}, \dots, E_n[X_K^2]^{-1/2}$ and $E_n[Z_1^2]^{-1/2}, \dots, E_n[Z_L^2]^{-1/2}$ respectively. Then the proofs become more complicated and we need extra assumptions, in the spirit that for every $l = 1, \dots, L$, and $i = 1, \dots, n$, the variables $u_i^2$ and $z_{li}^2$ are "almost" uncorrelated.

4. Sensitivity characteristics

The identifiability of $\beta^*$ relies on the sensitivity characteristics of the problem. In the usual linear regression in low dimension, when $\mathbf{Z} = \mathbf{X}$ and the Gram matrix $\mathbf{X}^T \mathbf{X}/n$ is positive definite, the sensitivity is given by the minimal eigenvalue of this matrix. In high-dimensional regression, the theory of the Lasso and the Dantzig selector comes up with a more sophisticated sensitivity analysis; there the Gram matrix cannot be positive definite and the eigenvalue conditions are imposed on its sufficiently small submatrices. This is typically expressed via the restricted isometry property of Candès and Tao (2007) or the more general restricted eigenvalue condition of Bickel, Ritov and Tsybakov (2009). In our structural model with endogenous regressors, these sensitivity characteristics cannot be used, since instead of a symmetric Gram matrix we have a rectangular matrix $\mathbf{Z}^T \mathbf{X}/n$ involving the instruments. More precisely, we will deal with its normalized version

$$\Psi_n \triangleq \frac{1}{n} \mathbf{D_Z} \mathbf{Z}^T \mathbf{X} \mathbf{D_X}.$$

In general, $\Psi_n$ is not a square matrix. For $L = K$, it is a square matrix but, in the presence of at least one endogenous regressor, $\Psi_n$ is not symmetric. Since the endogenous variables are assumed to be the first variables in the model, the lower right block of the matrix $\Psi_n$ is, up to a scaling, the sample correlation matrix of the exogenous variables (when considering $x_i$ as centered). The upper left block accounts for the relation between the endogenous variables and the instruments. We now introduce some scalar sensitivity characteristics related to the action of the matrix $\Psi_n$ on vectors in the cone

$$C_J \triangleq \left\{ \Delta \in \mathbb{R}^K : \ |\Delta_{J^c}|_1 \le \frac{1+c}{1-c} |\Delta_J|_1 \right\},$$

where $0 < c < 1$ is the constant in the definition of the STIV estimator, $J$ is a subset of $\{1, \dots, K\}$, and $J^c$ denotes the complement of $J$. When the cardinality of $J$ is small, the vectors $\Delta$ in the cone $C_J$


have a substantial part of their mass concentrated on a set of small cardinality. We call $C_J$ the cone of dominant coordinates. The use of similar cones to define sensitivity characteristics is standard in the literature on the Lasso and the Dantzig selector (see Bickel, Ritov and Tsybakov (2009)); the particular choice of the constant $\frac{1+c}{1-c}$ will become clear from the proofs. It follows from the definition of $C_J$ that

$$|\Delta|_1 \le \frac{2}{1-c} |\Delta_J|_1 \le \frac{2}{1-c} |J|^{1-1/p} |\Delta_J|_p, \qquad \forall\, \Delta \in C_J, \ 1 \le p \le \infty. \tag{4.1}$$
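Membership in the cone of dominant coordinates and inequality (4.1) are easy to check numerically for a given vector. The sketch below is an illustration only: the function names are hypothetical, and the index set `J` uses 0-based positions rather than the 1-based indices of the text.

```python
INF = float("inf")

def l1(v):
    """l_1 norm of a vector given as a list."""
    return sum(abs(t) for t in v)

def lp_norm(v, p):
    """l_p norm for 1 <= p <= infinity (use float('inf') for the sup-norm)."""
    if p == INF:
        return max(abs(t) for t in v)
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

def in_cone(delta, J, c):
    """True if |delta_{J^c}|_1 <= (1+c)/(1-c) * |delta_J|_1."""
    on_J = sum(abs(t) for k, t in enumerate(delta) if k in J)
    off_J = sum(abs(t) for k, t in enumerate(delta) if k not in J)
    return off_J <= (1 + c) / (1 - c) * on_J

def bound_41(delta, J, c, p):
    """Right-hand side of (4.1): 2/(1-c) * |J|^(1-1/p) * |delta_J|_p."""
    delta_J = [t if k in J else 0.0 for k, t in enumerate(delta)]
    exponent = 1.0 if p == INF else 1.0 - 1.0 / p
    return 2.0 / (1 - c) * len(J) ** exponent * lp_norm(delta_J, p)
```

For instance, with `delta = [3.0, -1.0, 0.5, 0.5]`, `J = {0, 1}` and `c = 0.5`, the vector lies in the cone and `l1(delta)` does not exceed `bound_41(delta, J, c, p)` for `p` equal to 1, 2 or infinity.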

For $p \in [1, \infty]$, we define the $\ell_p$ sensitivity as the following random variable:

$$\kappa_{p,J} \triangleq \inf_{\Delta \in C_J : \ |\Delta|_p = 1} |\Psi_n \Delta|_\infty.$$

These quantities are similar to the cone invertibility factors defined in Ye and Zhang (2010). Given a subset $J_0 \subset \{1, \dots, K\}$, we define the $J_0$ block sensitivity as

$$\kappa^*_{J_0,J} \triangleq \inf_{\Delta \in C_J : \ |\Delta_{J_0}|_1 = 1} |\Psi_n \Delta|_\infty. \tag{4.2}$$

By convention, we set $\kappa^*_{\emptyset, J(\beta^*)} = \infty$. We use the notation $\kappa^*_{k,J}$ for coordinate-wise sensitivities, i.e., for block sensitivities when $J_0 = \{k\}$ is a singleton:

$$\kappa^*_{k,J} \triangleq \inf_{\Delta \in C_J : \ \Delta_k = 1} |\Psi_n \Delta|_\infty.$$

Note that here we restrict the minimization to vectors $\Delta$ with positive $k$th coordinate, $\Delta_k = 1$, since replacing $\Delta$ by $-\Delta$ yields the same value of $|\Psi_n \Delta|_\infty$.

The finite sample bounds that we obtain below (see, e.g., Theorem 5.2) show that the inverses of the sensitivities drive the width of the confidence interval for the true parameter. Thus, it is important to have computable bounds on these characteristics. The following proposition will be useful.

Proposition 4.1.
(i) Let $J$, $\widehat{J}$ be two subsets of $\{1, \dots, K\}$ such that $J \subseteq \widehat{J}$. Then $\kappa_{p,J} \ge \kappa_{p,\widehat{J}}$ and $\kappa^*_{J_0,J} \ge \kappa^*_{J_0,\widehat{J}}$ for all $p \in [1, \infty]$;
(ii) For all $J_0 \subset \{1, \dots, K\}$, $\kappa^*_{J_0,J} \ge \kappa_{1,J}$.
(iii) For all $p \in [1, \infty]$,

$$\left( \frac{2|J|}{1-c} \right)^{-1/p} \kappa_{\infty,J} \le \kappa_{p,J} \le \frac{2}{1-c} |J|^{1-1/p} \kappa_{1,J}. \tag{4.3}$$

The proof of Proposition 4.1 is given in Section 9.3. We can control $\kappa_{p,J(\beta^*)}$ without knowing $J(\beta^*)$ by means of a sparsity certificate. Assume that we have an upper bound $s$ on the sparsity of $\beta^*$, i.e., we know that $|J(\beta^*)| \le s$ for some integer $s$. Meaningful values of $s$ are small, presuming that only few regressors are relevant. In view of (4.1), if


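$|J| \le s$, then for any $\Delta$ in the cone $C_J$ we have $|\Delta|_1 \le \frac{2s}{1-c} |\Delta|_\infty$. Thus, for all $J$ such that $|J| \le s$, we can bound the coordinate-wise sensitivities as follows:

$$\kappa^*_{k,J} \ \ge \inf_{\Delta_k = 1, \ |\Delta|_1 \le a |\Delta|_\infty} |\Psi_n \Delta|_\infty \ \ge \ \min_{j = 1, \dots, K} \left\{ \min_{\Delta_k = 1, \ |\Delta|_1 \le a |\Delta_j|} |\Psi_n \Delta|_\infty \right\} \triangleq \kappa^*_k(s), \tag{4.4}$$

where $a = \frac{2s}{1-c}$. For given $s$, this bound is data-driven since the minimum in curly brackets can be computed by linear programming (see Section 8.1). Then we can deduce a lower bound on $\kappa_{\infty,J}$ from

$$\kappa_{\infty,J} \ge \min_{k = 1, \dots, K} \kappa^*_{k,J}. \tag{4.5}$$

Using (4.3)-(4.5) we get computable lower bounds for all $\kappa_{p,J}$, $p \in [1, \infty]$, which depend only on $s$ and on the data. In particular, for $|J| \le s$,

$$\kappa_{1,J} \ge \frac{1-c}{2s} \min_{k = 1, \dots, K} \kappa^*_k(s) \triangleq \kappa_1(s). \tag{4.6}$$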
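The inner minimum in (4.4) can be rewritten as a pair of linear programs. The formulation below is one possible sketch, given here only to make the linear-programming claim concrete; the auxiliary variables $v$ and $w$ and the sign $\epsilon$ are notation introduced for this illustration, and the formulation the authors actually use is the one referred to in Section 8.1.

```latex
\min_{\Delta_k = 1,\ |\Delta|_1 \le a |\Delta_j|} |\Psi_n \Delta|_\infty
\;=\;
\min_{\epsilon \in \{-1, 1\}} \
\min_{(\Delta, w, v) \in \mathbb{R}^K \times \mathbb{R}^K \times \mathbb{R}} \ v
\quad \text{subject to} \quad
\begin{cases}
-v \le (\Psi_n \Delta)_l \le v, & l = 1, \dots, L, \\
-w_i \le \Delta_i \le w_i, & i = 1, \dots, K, \\
\sum_{i=1}^K w_i \le a\, \epsilon\, \Delta_j, \\
\Delta_k = 1.
\end{cases}
```

For fixed $\epsilon$ all constraints and the objective are linear. Any feasible point satisfies $|\Delta|_1 \le \sum_i w_i \le a \epsilon \Delta_j \le a |\Delta_j|$, and conversely any $\Delta$ with $\Delta_k = 1$ and $|\Delta|_1 \le a|\Delta_j|$ is feasible with $w_i = |\Delta_i|$, $\epsilon = \mathrm{sign}(\Delta_j)$ and $v = |\Psi_n \Delta|_\infty$. When $j = k$ only $\epsilon = 1$ can be feasible, since $\Delta_k = 1 > 0$.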

Analogously to (4.4), the sparsity certificate approach yields a bound for block sensitivities:

$$\kappa^*_{J_0,J} \ \ge \inf_{|\Delta_{J_0}|_1 = 1, \ |\Delta|_1 \le a |\Delta|_\infty} |\Psi_n \Delta|_\infty \ \ge \ \min_{j = 1, \dots, K} \left\{ \min_{|\Delta_{J_0}|_1 = 1, \ |\Delta|_1 \le a |\Delta_j|} |\Psi_n \Delta|_\infty \right\} \triangleq \kappa^*_{J_0}(s). \tag{4.7}$$

In Section 8.1 we show that the expression in curly brackets in (4.7) can be computed by solving $2^{|J_0|}$ linear programs. Thus, the values $\kappa^*_{J_0}(s)$ can be readily obtained for sets $J_0$ of small cardinality.

An alternative to the sparsity certificate approach is to compute $\kappa_{1,J}$ and $\kappa^*_{k,J}$ directly, which is numerically feasible for $J$ of small cardinality. In Section 8.1 we show that obtaining the coordinate-wise sensitivities corresponds to solving $2^{|J|}$ linear programs. Using (4.3) and (4.5) we obtain computable lower bounds for all $\kappa_{p,J}$, $p \in [1, \infty]$. The lower bounds are valid for any given index set $J$. However, we will need to compute the characteristics for the inaccessible set $J = J(\beta^*)$, where $\beta^*$ is the true unknown parameter. To circumvent this problem, we can plug in an estimator $\widehat{J}$ of $J(\beta^*)$. For example, we can take $\widehat{J} = J(\widehat{\beta})$. The confidence bounds remain valid whenever $J(\beta^*) \subseteq \widehat{J}$, since then $\kappa_{p,J(\beta^*)} \ge \kappa_{p,\widehat{J}}$ by Proposition 4.1 (i). Theoretical guarantees for the inclusion $J(\beta^*) \subseteq J(\widehat{\beta})$ to hold with probability close to 1 require $|\beta_k^*|$ to be not too small on the support of $\beta^*$ (see Theorem 5.7 (iv)). On the other hand, one typically observes in simulations that the relevant set $J(\beta^*)$ is either estimated exactly or overestimated by its empirical counterpart $\widehat{J} = J(\widehat{\beta})$, so that the required inclusion is satisfied for such a simple choice of $\widehat{J}$.


We show in Section 9.1 that the assumption that the sensitivities $\kappa_{p,J}$ are positive is weaker and more flexible than the restricted eigenvalue (RE) assumption of Bickel, Ritov and Tsybakov (2009). Unlike the RE assumption, it is applicable to non-square non-symmetric matrices and thus allows one to consider the case where several instruments are used for the same endogenous variable.

In the next proposition, we present a simple lower bound on $\kappa_{p,J}$ for general $L \times K$ rectangular matrices $\Psi_n$. Its proof, as well as other lower bounds on $\kappa_{1,J}$, can be found in Section 9. It is important to note that adding rows to the matrix $\Psi_n$ (i.e., adding instruments) increases the sup-norm $|\Psi_n \Delta|_\infty$, and thus potentially increases the sensitivities $\kappa_{p,J(\beta^*)}$. This has a positive effect since the inverses of the sensitivities drive the width of the confidence set for $\beta^*$, see Theorem 5.2. Thus, adding instruments potentially improves the confidence set, which is quite intuitive. On the other hand, the price for adding instruments in terms of the rate of convergence is only logarithmic in the number of instruments, as we will see in the next section.

Proposition 4.2. Fix $J \subseteq \{1, \dots, K\}$. Assume that there exist $\eta_1 > 0$ and $0 < \eta_2 < 1$ such that

$$\forall k \in J, \ \exists\, l(k) : \quad |(\Psi_n)_{l(k)k}| \ge \frac{\eta_1}{1-c}, \qquad \frac{\max_{k' \ne k} |(\Psi_n)_{l(k)k'}|}{|(\Psi_n)_{l(k)k}|} \le \frac{(1-\eta_2)(1-c)}{2|J|}. \tag{4.8}$$

Then

$$\kappa_{p,J} \ge (2|J|)^{-1/p} (1-c)^{-1+1/p} \eta_1 \eta_2.$$

The proof of Proposition 4.2 is given in Section 9.3. Assumption (4.8) is similar in spirit to the coherence condition introduced by Donoho, Elad and Temlyakov (2006) for symmetric matrices, but it is more general because it deals with rectangular matrices. Since the regressors and instruments are random, the values $\eta_1$ and $\eta_2$ can, in general, be random. Remarkably, for estimation of the coefficients of the endogenous variables, it suffices to have a "good" row of the matrix $\Psi_n$. This means that it is enough to have, among all instruments, one good instrument. The way the instruments are ordered is not important. Good instruments correspond to the rows $l(k)$ for which the value $|(\Psi_n)_{l(k)k}|$, measuring the relevance of the instrument for the $k$th variable, is high. On the other hand, the value $\max_{k' \ne k} |(\Psi_n)_{l(k)k'}|$ accounting for the relation between the instrument and the other variables should be small. An instrument which is well "correlated" with two variables of the model is not satisfactory for this assumption.
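Condition (4.8) can be verified directly on a computed matrix $\Psi_n$. The sketch below is an illustration under stated assumptions rather than part of the paper: the function names are hypothetical, `psi` is a nested list with one row per instrument and one column per regressor, and `J` holds 0-based column indices.

```python
def find_good_rows(psi, J, c, eta1, eta2):
    """For each k in J, return a row index l(k) satisfying (4.8), or None."""
    threshold = (1 - eta2) * (1 - c) / (2 * len(J))
    rows = {}
    for k in J:
        rows[k] = None
        for l, row in enumerate(psi):
            relevance = abs(row[k])
            if relevance == 0.0 or relevance < eta1 / (1 - c):
                continue  # instrument l is not relevant enough for regressor k
            others = max((abs(row[j]) for j in range(len(row)) if j != k),
                         default=0.0)
            if others / relevance <= threshold:
                rows[k] = l
                break
    return rows

def sensitivity_lower_bound(J, c, eta1, eta2, p):
    """Bound of Proposition 4.2: (2|J|)^(-1/p) * (1-c)^(-1+1/p) * eta1 * eta2."""
    inv_p = 0.0 if p == float("inf") else 1.0 / p
    return (2 * len(J)) ** (-inv_p) * (1 - c) ** (-1 + inv_p) * eta1 * eta2
```

The bound from `sensitivity_lower_bound` is only justified when `find_good_rows` returns a row for every index in `J`.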


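5. Main results

We start with introducing some assumptions.

Assumption 5.1. There exists $\delta > 0$ such that, for all $i = 1, \dots, n$, $l = 1, \dots, L$, the following conditions hold:

$$E[|z_{li} u_i|^{2+\delta}] < \infty, \qquad E[z_{li} u_i] = 0,$$

and none of the variables $z_{li} u_i$ is almost surely equal to 0.

Define

$$d_{n,\delta} \triangleq \min_{l = 1, \dots, L} \frac{\sqrt{\sum_{i=1}^n E[z_{li}^2 u_i^2]}}{\left( \sum_{i=1}^n E[|z_{li} u_i|^{2+\delta}] \right)^{1/(2+\delta)}}.$$

If, for any fixed $l \in \{1, \dots, L\}$, the variables $z_{li} u_i$ are i.i.d., then

$$d_{n,\delta} = n^{\frac{\delta}{4+2\delta}} \min_{l = 1, \dots, L} \frac{(E[z_{l1}^2 u_1^2])^{1/2}}{(E[|z_{l1} u_1|^{2+\delta}])^{1/(2+\delta)}}.$$

For $A \ge 1$ set

$$\alpha = 2L \left\{ 1 - \Phi\left( A \sqrt{2 \log(L)} \right) \right\} + 2 A_0 \frac{\left( 1 + A \sqrt{2 \log(L)} \right)^{1+\delta}}{L^{A^2 - 1} d_{n,\delta}^{2+\delta}}, \tag{5.1}$$

where $A_0 > 0$ is the absolute constant from Theorem 9.4, and $\Phi(\cdot)$ is the standard normal c.d.f.

Theorem 5.2. Let Assumption 5.1 hold. For $A \ge 1$, define $\alpha$ by (5.1), and set

$$r = A \sqrt{\frac{2 \log(L)}{n}}.$$

Assume that $L \le \exp(d_{n,\delta}^2 / (2A^2))$. Then, with probability at least $1 - \alpha$, for any solution $(\widehat{\beta}, \widehat{\sigma})$ of the minimization problem (3.5) we have

$$\left| \mathbf{D_X}^{-1} (\widehat{\beta} - \beta^*) \right|_p \le \frac{2 \widehat{\sigma} r}{\kappa_{p,J(\beta^*)}} \left( 1 - \frac{r}{\kappa^*_{J_{\mathrm{end}},J(\beta^*)}} - \frac{r^2}{\kappa^*_{J^c_{\mathrm{end}},J(\beta^*)}} \right)_+^{-1}, \qquad \forall\, p \in [1, \infty], \tag{5.2}$$

and, for all $k = 1, \dots, K$,

$$|\widehat{\beta}_k - \beta_k^*| \le \frac{2 \widehat{\sigma} r}{x_{k*} \, \kappa^*_{k,J(\beta^*)}} \left( 1 - \frac{r}{\kappa^*_{J_{\mathrm{end}},J(\beta^*)}} - \frac{r^2}{\kappa^*_{J^c_{\mathrm{end}},J(\beta^*)}} \right)_+^{-1}. \tag{5.3}$$

Furthermore,

$$\widehat{\sigma} \le \sqrt{\widehat{Q}(\beta^*)} \left( 1 + \frac{r}{c \, \kappa^*_{J(\beta^*),J(\beta^*)}} \right) \left( 1 - \frac{r}{c \, \kappa^*_{J(\beta^*),J(\beta^*)}} \right)_+^{-1}. \tag{5.4}$$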
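Once lower bounds on the sensitivities are available, the right-hand side of (5.3) is a simple arithmetic expression. The following sketch evaluates it; the function names are hypothetical, the sensitivity arguments are assumed to be data-driven lower bounds of the kind discussed in Section 4, and the conventions $a/0 = \infty$ for $a > 0$, $0/0 = 0$ and $1/\infty = 0$ from Section 2 are implemented explicitly.

```python
INF = float("inf")

def ratio(a, b):
    """a / b with the conventions a/0 = infinity for a > 0 and 0/0 = 0."""
    if b == 0:
        return 0.0 if a == 0 else INF
    return a / b  # a / infinity evaluates to 0.0 for finite a

def pos_inv(a):
    """(a)_+^{-1}: inverse of the positive part, infinite when a <= 0."""
    return ratio(1.0, max(0.0, a))

def tau1(r, kappa_end, kappa_exo):
    """1 - r / kappa*_{J_end,J} - r^2 / kappa*_{J_end^c,J}.

    Pass float('inf') for a block sensitivity whose index set is empty.
    """
    return 1.0 - ratio(r, kappa_end) - ratio(r * r, kappa_exo)

def coordinate_half_width(sigma_hat, r, x_star_k, kappa_k, kappa_end, kappa_exo):
    """Right-hand side of (5.3) for coordinate k."""
    return (ratio(2.0 * sigma_hat * r, x_star_k * kappa_k)
            * pos_inv(tau1(r, kappa_end, kappa_exo)))
```

With `sigma_hat = 1.0`, `r = 0.1`, `x_star_k = 2.0`, `kappa_k = 0.5`, `kappa_end = 0.5` and `kappa_exo = 0.25`, the correction factor is 0.76 and the half-width is 0.2 / 0.76. When the sensitivities are too small relative to `r`, the positive part vanishes and the bound is infinite, i.e., uninformative.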


The proof of Theorem 5.2 is given in Section 9.3. c By convention, κ∗∅,J(β ∗ ) = ∞, so when either Jend or Jend is empty, the term with the cor-

responding sensitivity disappears from the right hand-side of (5.2) and (5.3). If Jend = ∅, L = K and X = Z, the STIV estimator yields a pivotal extension of the Dantzig selector of Cand`es and Tao (2007), in the sense that it allows for the unknown distribution of errors. For this model, which is in the focus of the literature on high-dimensional regression in the recent years, we provide a considerable improvement, since Cand`es and Tao (2007), Bickel, Ritov and Tsybakov (2009) and the subsequent papers (cf. B¨ uhlmann and van de Geer (2011)) treat the case of i.i.d. errors, which are either Gaussian with known variance or have bounded exponential moment with known parameter. We also improve the Dantzig selector in other aspects by allowing for endogenous regressors, by using weaker sensitivity assumptions than in the previous work, and by providing finite sample confidence intervals. The bounds (5.2) and (5.3) are meaningful if r is small, i.e., n ≫ log(L). Then under the r

2

− κ∗ c r ∗ p Jend ,J (β ) is close to 1 and the bound on the estimation error in (5.3) is of the order O(r) = O( log(L)/n). p Thus, we have an extra log(L) factor as compared to the usual root-n rate, which is a modest price appropriate conditions on the sensitivities (cf. Remark 5.3), the factor τ1 , 1 − κ∗

Jend ,J (β ∗ )

for using a large number L of instruments.

Remark 5.3. Simple sufficient conditions for $\tau_1$ to be close to 1 can be derived from Propositions 4.1 and 4.2. By Proposition 4.1 (ii), we have $\kappa^*_{J_{\mathrm{end}},J(\beta^*)} \ge \kappa_{1,J(\beta^*)}$ and $\kappa^*_{J_{\mathrm{end}}^c,J(\beta^*)} \ge \kappa_{1,J(\beta^*)}$. Thus, neglecting the $O(r^2)$ term, we get that $\tau_1$ can be approximately replaced by $1 - r/\kappa_{1,J(\beta^*)}$ in (5.2) and (5.3). Therefore, under the premise of Proposition 4.2, for $\tau_1 \approx 1$ it is sufficient to have $|J(\beta^*)| \le C r^{-1} = O(\sqrt{n/\log(L)})$ where $C > 0$ is a proper constant. This is quite a reasonable condition on the sparsity $|J(\beta^*)|$ of the true vector $\beta^*$. Moreover, we get even better conditions if the set of endogenous regressors $J_{\mathrm{end}}$ is small. Then the sensitivity $\kappa^*_{J_{\mathrm{end}},J(\beta^*)}$ is large, whereas the small sensitivity of its complement $\kappa^*_{J_{\mathrm{end}}^c,J(\beta^*)}$ is compensated by the small value $r^2$ in the numerator. In the extreme case $J_{\mathrm{end}} = \emptyset$ we have $r/\kappa^*_{J_{\mathrm{end}},J(\beta^*)} = 0$, so that

$$\tau_1 \ge 1 - \frac{r^2}{\kappa_{1,J(\beta^*)}},$$

and it is sufficient to have $|J(\beta^*)| \le C r^{-2} = O(n/\log(L))$.

The assumption $L \le \exp(d_{n,\delta}^2/(2A^2))$ in Theorem 5.2 is relatively mild. Indeed, in the i.i.d. case, it is equivalent to the condition that $L \le \exp(C n^{\delta/(2+\delta)})$ for some $C > 0$.
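The exponent $\delta/(2+\delta)$ can be traced by a short computation. The sketch below assumes that $d_{n,\delta}$ is defined in the same way as $d_{n,\delta,1}$ in Section 6, with the products $z_{li}u_i$ in place of the centered ones; the constant $\rho_\delta$ is notation introduced only for this illustration. In the i.i.d. case,

$$d_{n,\delta} = \min_{l=1,\dots,L} \frac{\sqrt{\sum_{i=1}^n \mathbb{E}[z_{li}^2u_i^2]}}{\left(\sum_{i=1}^n \mathbb{E}[|z_{li}u_i|^{2+\delta}]\right)^{1/(2+\delta)}} = n^{\frac12 - \frac{1}{2+\delta}}\,\rho_\delta = n^{\frac{\delta}{2(2+\delta)}}\,\rho_\delta, \qquad \rho_\delta \triangleq \min_{l=1,\dots,L} \frac{\left(\mathbb{E}[z_{l1}^2u_1^2]\right)^{1/2}}{\left(\mathbb{E}[|z_{l1}u_1|^{2+\delta}]\right)^{1/(2+\delta)}},$$

so that $d_{n,\delta}^2/(2A^2) = \rho_\delta^2\, n^{\delta/(2+\delta)}/(2A^2)$. Under this reading, the constant is $C = \rho_\delta^2/(2A^2)$, provided $\rho_\delta$ stays bounded away from zero.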

The value $d_{n,\delta}$ depends on the distribution of the errors and in practice it is unknown. However, in the high-dimensional setting when $L$ is large, the term involving $d_{n,\delta}$ in (5.1) is negligible for reasonable values of $A$ (say, $A \ge 2$) and for moderate sample size $n$. Thus, in practice, we can drop this term, and choose $A$ large enough to have

$$2L\left\{1 - \Phi\left(A\sqrt{2\log(L)}\right)\right\} = \alpha, \tag{5.5}$$

where $\alpha$ is a suitable confidence level. This yields

$$r = \frac{1}{\sqrt{n}}\,\Phi^{-1}\left(1 - \frac{\alpha}{2L}\right).$$

Theorem 5.2 holds for arbitrary tuning constant $0 < c < 1$. This constant appears in the definition of the STIV estimator. Choosing a small $c$ increases the sensitivities in the denominators of the bounds in Theorem 5.2 since the cone of dominant coordinates shrinks as $c$ decreases. On the other hand, this yields less penalization for large values of $\sigma$ and results in higher $\hat\sigma$.
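As an illustration of the practical rule (5.5), the following minimal sketch computes $A$ and $r$ from $n$, $L$ and $\alpha$ using only the Python standard library. The function name `stiv_tuning` and its interface are hypothetical and are not part of the paper; the sketch relies on $r = A\sqrt{2\log(L)/n}$ from Theorem 5.2.

```python
from math import log, sqrt
from statistics import NormalDist


def stiv_tuning(n: int, L: int, alpha: float) -> tuple[float, float]:
    """Return (A, r) such that 2L(1 - Phi(A*sqrt(2 log L))) = alpha
    and r = A*sqrt(2 log(L) / n), following the practical rule (5.5)."""
    if n < 1 or L < 2 or not (0.0 < alpha < 1.0):
        raise ValueError("need n >= 1, L >= 2 and 0 < alpha < 1")
    # Quantile of the standard normal law at level 1 - alpha/(2L).
    q = NormalDist().inv_cdf(1.0 - alpha / (2.0 * L))
    r = q / sqrt(n)
    A = q / sqrt(2.0 * log(L))
    return A, r


if __name__ == "__main__":
    # Values of n, L and alpha used in the simulation study of Section 8.2.
    A, r = stiv_tuning(n=49, L=50, alpha=0.05)
    print(f"A = {A:.3f}, r = {r:.3f}")
```

Since $r$ depends on $L$ only through a normal quantile, it grows slowly with the number of instruments.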

The proof of Theorem 5.2 relies on a bound for moderate deviations for self-normalized sums of random variables, cf. Jing, Shao and Wang (2003). This is a useful tool that was first applied in the context of high-dimensional estimation by Belloni, Chen, Chernozhukov et al. (2010). There, “asymptotically valid penalty loadings” are required but we do not need such an assumption.

The only unknown ingredient of the inequalities (5.2) and (5.3) is the set $J(\beta^*)$ that determines the sensitivities. To turn these inequalities into valid confidence bounds, it suffices to provide data-driven lower estimates on the sensitivities. As discussed in Section 4, there are two ways to do it. The first one is based on the sparsity certificate, i.e., assuming some known upper bound $s$ on $|J(\beta^*)|$; then we get bounds depending only on $s$ and on the data. The second way is to plug in, instead of $J(\beta^*)$, some data-driven upper estimate $\hat J$, i.e., a set satisfying $J(\beta^*) \subseteq \hat J$ with probability close to 1. The next theorem (Theorem 5.7) provides examples of such estimators $\hat J$. In particular, assertion (iv) of Theorem 5.7 guarantees that, under some assumptions, the estimator $\hat J = J(\hat\beta)$ has the required property. Moreover, Theorem 5.7 establishes upper bounds on the rate of convergence of $\hat\beta$ in terms of population characteristics. To state the theorem, we need the following additional assumptions. The first one introduces the population “noise level” $\sigma_*$.

Assumption 5.4. There exist constants $\sigma_* > 0$ and $0 < \gamma_1 < 1$ such that

$$P\left(\mathbb{E}_n[U^2] \le \sigma_*^2\right) \ge 1 - \gamma_1.$$

The second assumption concerns the population counterparts of the sensitivities. It is stated in terms of subsets $J_0$ of $\{1, \dots, K\}$ and constants $p \ge 1$, $k \in \{1, \dots, K\}$ that can differ from case to case and will be specified later.

Assumption 5.5. There exist constants $c_p > 0$, $c^*_{J_0} > 0$, and $0 < \gamma_2 < 1$ such that, with probability at least $1 - \gamma_2$,

$$\kappa_{p,J(\beta^*)} \ge c_p\,|J(\beta^*)|^{-1/p}, \tag{5.6}$$

$$\kappa^*_{J_0,J(\beta^*)} \ge c^*_{J_0}. \tag{5.7}$$

If $J_0 = \{k\}$ is a singleton we write for brevity $c^*_{J_0} = c^*_k$.

The dependence on $|J(\beta^*)|$ of the right-hand side of (5.6) is motivated by Proposition 4.2. In (5.7), we do not indicate the dependence of the bounds on $|J(\beta^*)|$ explicitly because it can be different for different sets $J_0$. For general $J_0$, combining Proposition 4.1 (ii) and Proposition 4.2 suggests that the value $c^*_{J_0}$ can be bounded from below by a quantity of the order $|J(\beta^*)|^{-1}$. Note, however, that this is a coarse bound valid for any set $J_0$. The last assumption defines a population counterpart of $x_{k*}$.

Assumption 5.6. There exist constants $v_k > 0$ and $0 < \gamma_3 < 1$ such that

$$P\left(x_{k*} \ge v_k,\ \forall\, k \in J(\beta^*)\right) \ge 1 - \gamma_3.$$

We set $\gamma = \alpha + \sum_{j=1}^{3}\gamma_j$, and

$$\tau^* \triangleq \left(1 + \frac{r}{c\,c^*_{J(\beta^*)}}\right)\left(1 - \frac{r}{c\,c^*_{J(\beta^*)}}\right)_+^{-1}\left(1 - \frac{r}{c^*_{J_{\mathrm{end}}}} - \frac{r^2}{c^*_{J_{\mathrm{end}}^c}}\right)_+^{-1}.$$

Theorem 5.7. Under the assumptions of Theorem 5.2 and Assumption 5.4, the following holds.

(i) Let part (5.7) of Assumption 5.5 with $J_0 = J(\beta^*)$ be satisfied. Then, with probability at least $1 - \alpha - \gamma_1 - \gamma_2$, for any solution $\hat\sigma$ of (3.5) we have

$$\hat\sigma \le \sigma_*\left(1 + \frac{r}{c\,c^*_{J(\beta^*)}}\right)\left(1 - \frac{r}{c\,c^*_{J(\beta^*)}}\right)_+^{-1}.$$

(ii) Fix $p \in [1, \infty]$. Let Assumption 5.5 be satisfied, where (5.7) holds simultaneously for $J_0 = J(\beta^*)$, $J_0 = J_{\mathrm{end}}$ and $J_0 = J_{\mathrm{end}}^c$. Then, with probability at least $1 - \alpha - \gamma_1 - \gamma_2$, for any solution $\hat\beta$ of (3.5) we have

$$\left|D_X^{-1}(\hat\beta - \beta^*)\right|_p \le \frac{2\sigma_* r\,|J(\beta^*)|^{1/p}\,\tau^*}{c_p}. \tag{5.8}$$

(iii) Let Assumptions 5.5 and 5.6 be satisfied, where (5.7) holds simultaneously for $J_0 = \{k\}$, $\forall k$, and $J_0 = J(\beta^*)$, $J_0 = J_{\mathrm{end}}$, $J_0 = J_{\mathrm{end}}^c$. Then, with probability at least $1 - \gamma$, for any solution $\hat\beta$ of (3.5) we have

$$|\hat\beta_k - \beta_k^*| \le \frac{2\sigma_* r\,\tau^*}{c^*_k v_k}, \qquad k = 1, \dots, K. \tag{5.9}$$

(iv) Let the assumptions of (iii) hold, and $|\beta_k^*| > \dfrac{2\sigma_* r\,\tau^*}{c^*_k v_k}$ for all $k \in J(\beta^*)$. Then, with probability at least $1 - \gamma$, for any solution $\hat\beta$ of (3.5) we have:

$$J(\beta^*) \subseteq J(\hat\beta).$$

The proof of Theorem 5.7 is given in Section 9.3.

For reasonably large sample size ($n \gg \log(L)$), the value $r$ is small, and $\tau^*$ is a constant approaching 1 as $r \to 0$. Thus, the bounds (5.8) and (5.9) are of the order of magnitude $O(r|J(\beta^*)|^{1/p})$ and $O(r)$ respectively. These are the same rates, in terms of the sparsity $|J(\beta^*)|$, the dimension $L$, and the sample size $n$, that were proved for the Lasso and Dantzig selector in high-dimensional regression with Gaussian errors and without endogenous variables in Candès and Tao (2007), Bickel, Ritov and Tsybakov (2009), Lounici (2008) (see also Bühlmann and van de Geer (2011) for references to more recent work).

From (5.3) and Theorem 5.7 (iv), we obtain the following confidence intervals of level $1 - \gamma$ for $\beta_k^*$:

$$|\hat\beta_k - \beta_k^*| \le \frac{2\hat\sigma r}{x_{k*}\,\kappa^*_{k,J(\hat\beta)}}\left(1 - \frac{r}{\kappa^*_{J_{\mathrm{end}},J(\hat\beta)}} - \frac{r^2}{\kappa^*_{J_{\mathrm{end}}^c,J(\hat\beta)}}\right)_+^{-1}. \tag{5.10}$$
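The half-width in (5.10) is straightforward to evaluate once the sensitivities (or lower bounds on them) are available. The sketch below is only an illustration: the function name and argument names are hypothetical, the sensitivities are taken as inputs rather than computed, and the inverse positive part $(x)_+^{-1}$ is read as $+\infty$ when $x \le 0$, which is assumed here to be the convention.

```python
from math import inf


def ci_halfwidth(sigma_hat, r, x_k_star, kappa_k, kappa_end, kappa_end_c):
    """Half-width of the interval (5.10) for one coordinate k.

    kappa_k     : coordinate-wise sensitivity kappa*_{k,J}
    kappa_end   : sensitivity kappa*_{J_end,J} (math.inf if J_end is empty)
    kappa_end_c : sensitivity kappa*_{J_end^c,J} (math.inf if J_end^c is empty)
    """
    slack = 1.0 - r / kappa_end - r ** 2 / kappa_end_c
    if slack <= 0.0:
        # The positive part vanishes: the bound is not informative.
        return inf
    return 2.0 * sigma_hat * r / (x_k_star * kappa_k) / slack


if __name__ == "__main__":
    # Purely illustrative numbers, not taken from the paper.
    print(ci_halfwidth(0.25, 0.1, 1.0, 0.5, 1.0, inf))
```

Passing `math.inf` for a sensitivity reproduces the convention $\kappa^*_{\emptyset,J} = \infty$, under which the corresponding term drops out.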

Theorem 5.7 (iv) provides an upper estimate on the set of non-zero components of $\beta^*$. We now consider the problem of the exact selection of variables. For this purpose, we use the thresholded STIV estimator whose coordinates are defined by

$$\tilde\beta_k(\omega_k) \triangleq \begin{cases} \hat\beta_k & \text{if } |\hat\beta_k| > \omega_k, \\ 0 & \text{otherwise,} \end{cases} \tag{5.11}$$

where $\hat\beta_k$ are the coordinates of the STIV estimator $\hat\beta$, and $\omega_k > 0$, $k = 1, \dots, K$, are thresholds that will be specified below. We will use the sparsity certificate approach, so that the thresholds will depend on the upper bound $s$ on the number of non-zero components of $\beta^*$. We will need the following modification of Assumption 5.5.

Assumption 5.8. Fix an integer $s \ge 1$. There exist constants $c^*_{J(\beta^*)} > 0$, $c^*_{J_0}(s) > 0$, and $0 < \gamma_2 < 1$ such that, with probability at least $1 - \gamma_2$,

$$\kappa^*_{J(\beta^*),J(\beta^*)} \ge c^*_{J(\beta^*)} \quad\text{and}\quad \kappa^*_{J_0}(s) \ge c^*_{J_0}(s) \tag{5.12}$$

for $J_0 = \{k\}$, $\forall k$, and $J_0 = J_{\mathrm{end}}$, $J_0 = J_{\mathrm{end}}^c$. If $J_0 = \{k\}$ is a singleton we write for brevity $c^*_{J_0}(s) = c^*_k(s)$.

Set

$$\tau^*(s) \triangleq \left(1 + \frac{r}{c\,c^*_{J(\beta^*)}}\right)\left(1 - \frac{r}{c\,c^*_{J(\beta^*)}}\right)_+^{-1}\left(1 - \frac{r}{c^*_{J_{\mathrm{end}}}(s)} - \frac{r^2}{c^*_{J_{\mathrm{end}}^c}(s)}\right)_+^{-1}.$$

The following theorem shows that, based on thresholding of the STIV estimator, we can reconstruct exactly the set of non-zero coefficients $J(\beta^*)$ with probability close to 1. Moreover, we achieve sign consistency, i.e., we reconstruct exactly the vector of signs of the coefficients of $\beta^*$ with probability close to 1.

Theorem 5.9. Let the assumptions of Theorem 5.2 and Assumptions 5.4, 5.6, 5.8 be satisfied. Assume that $|J(\beta^*)| \le s$, and $|\beta_k^*| > \dfrac{4\sigma_* r\,\tau^*(s)}{c^*_k(s)\,v_k}$ for all $k \in J(\beta^*)$. Take the thresholds

$$\omega_k(s) \triangleq \frac{2\hat\sigma r}{\kappa^*_k(s)\,x_{k*}}\left(1 - \frac{r}{\kappa^*_{J_{\mathrm{end}}}(s)} - \frac{r^2}{\kappa^*_{J_{\mathrm{end}}^c}(s)}\right)_+^{-1},$$

and consider the estimator $\tilde\beta$ with coordinates $\tilde\beta_k(\omega_k(s))$, $k = 1, \dots, K$. Then, with probability at least $1 - \gamma$, we have

$$J(\tilde\beta) = J(\beta^*). \tag{5.13}$$

As a consequence,

$$\overrightarrow{\mathrm{sign}(\tilde\beta)} = \overrightarrow{\mathrm{sign}(\beta^*)}.$$

The proof of Theorem 5.9 is given in Section 9.3.
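The thresholding rule (5.11) and the recovered support and sign vector can be written in a few lines. This is a minimal sketch with hypothetical function names; the thresholds $\omega_k$ are supplied by the caller, and coordinates are numbered from 1 as in the text.

```python
def threshold(beta_hat, omega):
    """Thresholded estimator (5.11): keep beta_hat[k] only if |beta_hat[k]| > omega[k]."""
    if len(beta_hat) != len(omega):
        raise ValueError("beta_hat and omega must have the same length")
    return [b if abs(b) > w else 0.0 for b, w in zip(beta_hat, omega)]


def support(beta):
    """Set J(beta) of indices (1-based) of the non-zero coordinates."""
    return {k for k, b in enumerate(beta, start=1) if b != 0}


def sign_vector(beta):
    """Vector of signs, with sign(0) = 0."""
    return [(b > 0) - (b < 0) for b in beta]


if __name__ == "__main__":
    # Illustrative values only.
    beta_tilde = threshold([1.03, -0.40, 0.02, 0.00], [0.1, 0.1, 0.1, 0.1])
    print(beta_tilde, support(beta_tilde), sign_vector(beta_tilde))
```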

Remark 5.10. Inspection of the proof of Theorem 5.9 shows that the same conclusion as in Theorem 5.9 holds with other definitions of the thresholds. Indeed, $\kappa^*_k(s)$, $\kappa^*_{J_{\mathrm{end}}}(s)$, and $\kappa^*_{J_{\mathrm{end}}^c}(s)$ in the definition of $\omega_k(s)$ are lower bounds for the sensitivities $\kappa^*_{k,J(\beta^*)}$, $\kappa^*_{J_{\mathrm{end}},J(\beta^*)}$, and $\kappa^*_{J_{\mathrm{end}}^c,J(\beta^*)}$. They can be replaced by other $s$-dependent lower bounds on these sensitivities. Then Theorem 5.9 remains valid, with modifications only in the value of the lower bound on $|\beta_k^*|$ required for $k \in J(\beta^*)$, and in a slightly different formulation of Assumption 5.8. For example, if there is only one endogenous variable, $J_{\mathrm{end}} = \{1\}$, we can take the thresholds

$$\omega_k(s) \triangleq \frac{2\hat\sigma r}{\kappa^*_k(s)\,x_{k*}}\left(1 - \frac{r}{\kappa^*_1(s)} - \frac{r^2}{\kappa_1(s)}\right)_+^{-1}.$$

We now consider the approximately sparse setting. The sparsity assumption is quite natural in empirical economics since usually only a moderate number of covariates is included in the model. However, one might also be interested in the case when the true vector $\beta^*$ is only approximately sparse. This means that most of the coefficients $\beta_k^*$ are not exactly zero but too small to matter, whereas the remaining ones are relatively large. This setting has received some attention in the statistical literature. For example, the performance of the Dantzig selector and the MU-selector under such assumptions is studied by Candès and Tao (2007) and Rosenbaum and Tsybakov (2010) respectively. We will derive a similar result for the STIV estimator. Consider the enlarged cone

$$\tilde C_J \triangleq \left\{\Delta \in \mathbb{R}^K : |\Delta_{J^c}|_1 \le \frac{2+c}{1-c}\,|\Delta_J|_1\right\}$$

and define, for $p \in [1, \infty]$ and $J_0 \subset \{1, \dots, K\}$,

$$\tilde\kappa_{p,J} \triangleq \inf_{\Delta \in \mathbb{R}^K :\ |\Delta|_p = 1,\ \Delta \in \tilde C_J} |\Psi_n\Delta|_\infty \quad\text{and}\quad \tilde\kappa^*_{J_0,J} \triangleq \inf_{\Delta \in \mathbb{R}^K :\ |\Delta_{J_0}|_1 = 1,\ \Delta \in \tilde C_J} |\Psi_n\Delta|_\infty.$$

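The following theorem is an analog of the above results for the approximately sparse case.

Theorem 5.11. Let $A$, $\alpha$, and $r$ satisfy the same conditions as in Theorem 5.2. Assume that $L \le \exp(d_{n,\delta}^2/(2A^2))$ and fix $p \in [1, \infty]$. Let Assumption 5.1 be satisfied. Then with probability at least $1 - \alpha$ for any solution $\hat\beta$ of (3.5) we have

$$\left|D_X^{-1}(\hat\beta - \beta^*)\right|_p \le \min_{J \subset \{1,\dots,K\}} \max\left\{\frac{2\hat\sigma r}{\tilde\kappa_{p,J}}\left(1 - \frac{r}{\tilde\kappa^*_{J_{\mathrm{end}},J}} - \frac{r^2}{\tilde\kappa^*_{J_{\mathrm{end}}^c,J}}\right)_+^{-1},\ \frac{6}{1-c}\left|\left(D_X^{-1}\beta^*\right)_{J^c}\right|_1\right\}. \tag{5.14}$$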

We can interpret Theorem 5.11 as the fact that the STIV estimator automatically realizes a “bias/variance” trade-off related to a non-linear approximation. Inequality (5.14) means that this estimator performs as well as if the optimal subset $J$ were known.

6. Models with possibly non-valid instruments

In this section, we propose a modification of the STIV estimator for the model with possibly non-valid instruments. The main purpose of the suggested method is to construct confidence intervals for non-validity indicators, and to detect non-valid instruments. This question has been addressed in the non high-dimensional case, for example, in Andrews (1999) and Hahn and Hausman (2002) among others; one of the most recent papers is Liao (2010) where one can find more references. The model can be written in the form:

$$y_i = x_i^T\beta^* + u_i, \tag{6.1}$$

$$\mathbb{E}[z_iu_i] = 0, \tag{6.2}$$

$$\mathbb{E}[\bar z_iu_i] = \theta^*, \tag{6.3}$$

where $x_i$, $z_i$, and $\bar z_i$ are vectors of dimensions $K$, $L$ and $L_1$, respectively. The instruments are decomposed in two parts, $z_i$ and $\bar z_i$, where $\bar z_i^T = (\bar z_{1i}, \dots, \bar z_{L_1i})$ is a vector of possibly non-valid instruments. A component of the unknown vector $\theta^* \in \mathbb{R}^{L_1}$ is equal to zero when the corresponding instrument is indeed valid. The component $\theta_l^*$ of $\theta^*$ will be called the non-validity indicator of the instrument $\bar z_{li}$. Our study covers the models with dimensions $K$, $L$ and $L_1$ that can be much larger than the sample size. As above, we assume independence and allow for heteroscedasticity. The difference from the previous sections is only in introducing equation (6.3). In addition to $x_i$, $y_i$, $z_i$, we observe the realizations of mutually independent random vectors $\bar z_i$, $i = 1, \dots, n$, with components $\bar z_{li}$ satisfying $\mathbb{E}[\bar z_{li}u_i] = \theta_l^*$ for all $l = 1, \dots, L_1$, $i = 1, \dots, n$. We denote by $\bar Z$ the matrix of dimension $n \times L_1$ with rows $\bar z_i^T$, $i = 1, \dots, n$. Set

$$\bar z_* = \max_{l=1,\dots,L_1}\left(\frac{1}{n}\sum_{i=1}^n \bar z_{li}^2\right)^{1/2}.$$

In this section, we assume that we have a pilot estimator $\hat\beta$ and a statistic $\hat b$ such that, with probability close to 1,

$$\left|D_X^{-1}(\hat\beta - \beta^*)\right|_1 \le \hat b. \tag{6.4}$$

For example, $\hat\beta$ can be the STIV estimator based only on the vectors of valid instruments $z_1, \dots, z_n$. In this case, an explicit expression for $\hat b$ can be obtained from (5.2) by replacing there $J(\beta^*)$ by a suitable estimator or upper estimator $\hat J$ (see Theorem 5.7 (iv) and Theorem 5.9).

We define the STIV-NV estimator $(\hat\theta, \hat\sigma_1)$ as any solution of the problem

$$\min_{(\theta,\sigma_1) \in \hat I_1}\left(|\theta|_1 + c\,\sigma_1\right), \tag{6.5}$$

where $0 < c < 1$,

$$\hat I_1 \triangleq \left\{(\theta, \sigma_1) :\ \theta \in \mathbb{R}^{L_1},\ \sigma_1 > 0,\ \left|\frac{1}{n}\bar Z^T(Y - X\hat\beta) - \theta\right|_\infty \le \sigma_1 r_1 + \hat b\,\bar z_*,\ F(\theta, \hat\beta) \le \sigma_1 + \hat b\,\bar z_*\right\}$$

for some $r_1 > 0$ (to be specified below), where for all $\theta = (\theta_1, \dots, \theta_{L_1}) \in \mathbb{R}^{L_1}$, $\beta \in \mathbb{R}^K$,

$$F(\theta, \beta) \triangleq \max_{l=1,\dots,L_1}\sqrt{\hat Q_l(\theta_l, \beta)}$$

with

$$\hat Q_l(\theta_l, \beta) \triangleq \frac{1}{n}\sum_{i=1}^n\left(\bar z_{li}(y_i - x_i^T\beta) - \theta_l\right)^2.$$

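It is not hard to see that the optimization problem (6.5) can be re-written as a linear program.

Assumption 6.1. There exists $\delta > 0$ such that, for all $i = 1, \dots, n$, $l = 1, \dots, L_1$, the following conditions hold:

$$\mathbb{E}\left[|\bar z_{li}u_i|^{2+\delta}\right] < \infty, \qquad \mathbb{E}[\bar z_{li}u_i] = \theta_l^*,$$

where $\theta_l^*$ is the $l$th component of $\theta^*$ and none of the variables $\bar z_{li}u_i - \theta_l^*$ is almost surely equal to 0.

Define

$$d_{n,\delta,1} \triangleq \min_{l=1,\dots,L_1}\frac{\sqrt{\sum_{i=1}^n \mathbb{E}\left[|\bar z_{li}u_i - \theta_l^*|^2\right]}}{\left(\sum_{i=1}^n \mathbb{E}\left[|\bar z_{li}u_i - \theta_l^*|^{2+\delta}\right]\right)^{1/(2+\delta)}}.$$

For $A \ge 1$ set

$$\alpha_1 = 2L_1\left\{1 - \Phi\left(A\sqrt{2\log(L_1)}\right)\right\} + 2A_0\,\frac{\left(1 + A\sqrt{2\log(L_1)}\right)^{1+\delta}}{L_1^{A^2-1}\,d_{n,\delta,1}^{2+\delta}}, \tag{6.6}$$

where $A_0 > 0$ is the absolute constant from Theorem 9.4, and $\Phi(\cdot)$ is the standard normal c.d.f. The following theorem provides a basis for constructing confidence intervals for the non-validity indicators.

Theorem 6.2. Let Assumption 6.1 hold. For $A \ge 1$, define $\alpha_1$ by (6.6), and set

$$r_1 = A\sqrt{\frac{2\log(L_1)}{n}}.$$

Assume that $L_1 \le \exp(d_{n,\delta,1}^2/(2A^2))$, and that $\hat\beta$ is an estimator satisfying (6.4) with probability at least $1 - \alpha_2$ for some $0 < \alpha_2 < 1$. Then, with probability at least $1 - \alpha_1 - \alpha_2$, for any solution $(\hat\theta, \hat\sigma_1)$ of the minimization problem (6.5) we have

$$|\hat\theta - \theta^*|_\infty \le \frac{2\left[\hat\sigma_1 r_1 + \left(1 + r_1(1-c)^{-1}\right)\hat b\,\bar z_*\right]}{\left(1 - 2r_1(1-c)^{-1}|J(\theta^*)|\right)_+} \triangleq V(\hat\sigma_1, \hat b, |J(\theta^*)|), \tag{6.7}$$

and

$$|\hat\theta - \theta^*|_1 \le \frac{2\left[2|J(\theta^*)|\left(\hat\sigma_1 r_1 + (1 + r_1)\hat b\,\bar z_*\right) + c\,\hat b\,\bar z_*\right]}{\left(1 - c - 2r_1|J(\theta^*)|\right)_+}. \tag{6.8}$$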
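For numerical work it can help to have the function $V$ from (6.7) available explicitly. The sketch below reads (6.7) as $V(\sigma_1, b, s_1) = 2[\sigma_1 r_1 + (1 + r_1/(1-c))\,b\,\bar z_*]/(1 - 2r_1 s_1/(1-c))_+$, and treats a non-positive denominator as an uninformative (infinite) bound; both the reading and the function signature are assumptions made for illustration.

```python
from math import inf


def V(sigma1, b, s1, r1, c, z_bar_star):
    """Right-hand side of (6.7), with s1 standing for the sparsity |J(theta*)|.

    Returns math.inf when the positive part in the denominator vanishes."""
    if not (0.0 < c < 1.0):
        raise ValueError("the tuning constant c must lie in (0, 1)")
    denom = 1.0 - 2.0 * r1 * s1 / (1.0 - c)
    if denom <= 0.0:
        return inf
    numer = 2.0 * (sigma1 * r1 + (1.0 + r1 / (1.0 - c)) * b * z_bar_star)
    return numer / denom


if __name__ == "__main__":
    # Illustrative values only: the bound degrades as r1 * s1 grows.
    for s1 in (0, 1, 2, 5):
        print(s1, V(sigma1=1.0, b=0.05, s1=s1, r1=0.1, c=0.5, z_bar_star=1.0))
```

The same function can be evaluated at $(\hat\sigma_1, \hat b, s_1)$ to obtain a data-driven threshold, or at population constants to obtain a rate.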

This theorem should be naturally applied when $r_1$ is small, i.e., $n \gg \log(L_1)$. In addition, we need a small $\hat b$, which is guaranteed by the results of Section 5 under the condition $n \gg \log(L)$ if the pilot estimator $\hat\beta$ is the STIV estimator. Note also that the bounds (6.7) and (6.8) are meaningful if their denominators are positive, which is roughly equivalent to the following bound on the sparsity of $\theta^*$: $|J(\theta^*)| = O(1/r_1) = O(\sqrt{n/\log(L_1)})$.

Bounds for all the norms $|\hat\theta - \theta^*|_p$, $\forall\, 1 < p < \infty$, follow immediately from (6.7) and (6.8) by the standard interpolation argument. We note that, in Theorem 6.2, $\hat\beta$ can be any estimator satisfying (6.4), not necessarily the STIV estimator.

To turn (6.7) and (6.8) into valid confidence bounds, we can replace there $|J(\theta^*)|$ by $|J(\hat\theta)|$, as follows from Theorem 6.4 (ii) below. In addition, Theorem 6.4 establishes the rate of convergence of the STIV-NV estimator and justifies the selection of non-valid instruments by thresholding. To state the theorem, we will need an extra assumption that the random variable $F(\theta^*, \beta^*)$ is bounded in probability by a constant $\sigma_1^* > 0$:

Assumption 6.3. There exist constants $\sigma_1^* > 0$ and $0 < \varepsilon < 1$ such that, with probability at least $1 - \varepsilon$,

$$\max_{l=1,\dots,L_1}\frac{1}{n}\sum_{i=1}^n\left(\bar z_{li}u_i - \theta_l^*\right)^2 \le (\sigma_1^*)^2. \tag{6.9}$$

As in (5.11) we define a thresholded estimator

$$\tilde\theta_l \triangleq \begin{cases}\hat\theta_l & \text{if } |\hat\theta_l| > \omega, \\ 0 & \text{otherwise,}\end{cases} \tag{6.10}$$

where $\omega > 0$ is some threshold. For $b_* > 0$, $s_1 > 0$, define

$$\sigma^* = \left(1 - \frac{4r_1s_1}{c\,(1 - c - 2r_1s_1)_+}\right)_+^{-1}\left(\sigma_1^* + \frac{2b_*\bar z_*\left(1 + 2(1 + r_1)s_1/c\right)}{(1 - c - 2r_1s_1)_+}\right).$$

Theorem 6.4. Let the assumptions of Theorem 6.2 and Assumption 6.3 be satisfied. Then the following holds.

(i) Let $\hat\beta$ be an estimator satisfying

$$\left|D_X^{-1}(\hat\beta - \beta^*)\right|_1 \le b_* \tag{6.11}$$

with probability at least $1 - \alpha_2$ for some $0 < \alpha_2 < 1$ and some constant $b_*$. Assume that $|J(\theta^*)| \le s_1$. Then, with probability at least $1 - \alpha_1 - \alpha_2 - \varepsilon$, for any solution $\hat\theta$ of the minimization problem (6.5) we have

$$|\hat\theta - \theta^*|_\infty \le V(\sigma^*, b_*, s_1). \tag{6.12}$$

(ii) Let $(\hat\beta, \hat\sigma)$ be the STIV estimator, and let the assumptions of all the items of Theorem 5.7 be satisfied (with $p = 1$ in item (ii)). Assume that $|J(\theta^*)| \le s_1$, $|J(\beta^*)| \le s$, and $|\theta_l^*| > V(\sigma^*, b_*, s_1)$ for all $l \in J(\theta^*)$, where

$$b_* = \frac{2\sigma_* r s\,\tau^*(s)}{c_1}. \tag{6.13}$$

Then, with probability at least $1 - \alpha_1 - \varepsilon - \gamma$, for any solution $\hat\theta$ of the minimization problem (6.5) we have

$$J(\theta^*) \subseteq J(\hat\theta). \tag{6.14}$$

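(iii) Let the assumptions of item (ii) and Assumption 5.8 hold. Assume that $|\theta_l^*| > 2V(\sigma^*, b_*, s_1)$ for all $l \in J(\theta^*)$. Let $\tilde\theta$ be the thresholded estimator defined in (6.10) where $\hat\theta$ is any solution of the minimization problem (6.5), and the threshold is defined by $\omega = V(\hat\sigma_1, \hat b, s_1)$ with

$$\hat b = \frac{2\hat\sigma r s}{\kappa_1(s)}\left(1 - \frac{r}{\kappa^*_{J_{\mathrm{end}}}(s)} - \frac{r^2}{\kappa^*_{J_{\mathrm{end}}^c}(s)}\right)_+^{-1}.$$

Then, with probability at least $1 - \alpha_1 - \varepsilon - \gamma$, we have

$$J(\tilde\theta) = J(\theta^*). \tag{6.15}$$

As a consequence,

$$\overrightarrow{\mathrm{sign}(\tilde\theta)} = \overrightarrow{\mathrm{sign}(\theta^*)}.$$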

In practice, the parameter $s_1$ may not be known and it can be replaced by $|J(\hat\theta)|$; this is a reasonable upper bound on $|J(\theta^*)|$ as suggested by Theorem 6.4 (ii).

It is interesting to analyze the dependence of the rate of convergence in (6.12) on $r$, $r_1$, $s$, and $s_1$. As discussed above, a meaningful framework is to consider small $r$, $r_1$ and sparsities $s$, $s_1$ such that $rs$, $r_1s_1$ are comfortably smaller than 1. In this case, the value $b_*$ given in (6.13) is of the order $O(rs)$ and the rate of convergence in (6.12) is of the order $O(r_1) + O(rs)$. We see that the rate does not depend on the sparsity $s_1$ of $\theta^*$ but it does depend on the sparsity $s$ of $\beta^*$. It is interesting to explore whether this rate is optimal, i.e., whether it can be improved by estimators different from the STIV-NV estimator.

7. Complements

7.1. Non-pivotal STIV estimator. We consider first a simpler version of the STIV estimator which is not pivotal in the sense that it depends on the upper bound $\sigma_*$ on the “noise level” appearing in Assumption 5.4. The estimator that we consider here is a solution $\hat\beta$ of the following minimization problem:

$$\min_{\beta \in \hat I_{np}} |D_X^{-1}\beta|_1, \tag{7.1}$$

where

$$\hat I_{np} \triangleq \left\{\beta \in \mathbb{R}^K :\ \left|\frac{1}{n}D_ZZ^T(Y - X\beta)\right|_\infty \le \sigma_* r\right\}.$$

It is not hard to see that (7.1) can be written as a linear program. We have the following bounds on the $\ell_p$-errors of this estimator.

Theorem 7.1. Let Assumptions 5.1 and 5.4 hold. For $A \ge 1$, define $\alpha$ by (5.1), and set

$$r = A\sqrt{\frac{2\log(L)}{n}}.$$

Assume that $L \le \exp(d_{n,\delta}^2/(2A^2))$. Then, with probability at least $1 - \alpha - \gamma_1$, for any solution $\hat\beta$ of the minimization problem (7.1) we have

$$\left|D_X^{-1}(\hat\beta - \beta^*)\right|_p \le \frac{2\sigma_* r}{\kappa_{p,J(\beta^*)}}, \qquad \forall\, p \in [1, \infty], \tag{7.2}$$

and, for all $k = 1, \dots, K$,

$$|\hat\beta_k - \beta_k^*| \le \frac{2\sigma_* r}{x_{k*}\,\kappa^*_{k,J(\beta^*)}}. \tag{7.3}$$

Here the sensitivities $\kappa_{p,J(\beta^*)}$ and $\kappa^*_{k,J(\beta^*)}$ are defined on the cone $C_J$ with $c = 0$. The proof of this result is easily obtained by simplifying the proof of Theorem 5.2.

7.2. STIV estimator with linear projection instruments. The results of the previous sections show that the STIV estimator can handle a very large number of instruments, up to an exponential in the sample size. Moreover, adding instruments always improves the sensitivities. In this section, we consider the case where we look for a smaller set of instruments, namely, of size $K$. A classical solution with one endogenous regressor in low dimension is the two-stage least squares estimator (see, e.g., Wooldridge (2002)). Under the stronger zero conditional mean assumption, the solution in low dimensions is given by the optimal instruments (see Amemiya (1974), Chamberlain (1987), and Newey (1990)). In the homoscedastic case, it corresponds to the projection of the endogenous variables on the space of variables measurable with respect to all the instruments. These optimal instruments are expressed in terms of conditional expectations that are not available in practice and should be estimated. When $K$ is large, we typically face the curse of dimensionality and extremely large samples would be needed to obtain precise estimates of these ideal instruments. In this setting, Belloni, Chen, Chernozhukov et al. (2010a) propose to use the Lasso. Then they consider the heteroscedastic robust IV estimator with these instruments.

We propose to proceed in a different way. As discussed after Proposition 4.2, we can expect to get higher sensitivities and thus to obtain tighter bounds if for each endogenous regressor we use a “good instrument”, i.e., an instrument correlated as much as possible with the endogenous variable. Akin to two-stage least squares, we consider instruments which are the projections of the endogenous variables on the linear span of all the instruments, and do not make the stronger zero conditional mean assumption. Note that, for every $k = 1, \dots, k_{\mathrm{end}}$, we can write the reduced form equations

$$x_{ki} = \sum_{l=1}^{L} z_{li}\zeta_{kl} + v_{ki}, \qquad i = 1, \dots, n, \tag{7.4}$$

where $\zeta_{kl}$ are unknown coefficients of the linear combination of instruments, and

$$\mathbb{E}[z_{li}v_{ki}] = 0 \tag{7.5}$$

for $i = 1, \dots, n$, $l = 1, \dots, L$. The representation (7.4)-(7.5) holds whenever $x_{ki}$ and $z_{li}$ have finite second moments. We call $\sum_{l=1}^{L} z_{li}\zeta_{kl}$ the linear projection instrument.

We now estimate the unknown coefficients $\zeta_{kl}$. If $L \ge K > n$ and if the reduced form model (7.4) has some sparsity, it is natural to use a high-dimensional procedure, such as the Lasso, the Dantzig selector or the Square-root Lasso, to produce estimators $\hat\zeta_{kl}$ of the coefficients. (Since there is no endogeneity in (7.4) we need not apply the STIV estimator, which requires more computations.) Then we replace the initial $L$-dimensional vector of instruments by a $K$-dimensional vector $\hat x_i = (\hat x_{1i}, \dots, \hat x_{Ki})$ whose first $k_{\mathrm{end}}$ coordinates are $\sum_{l=1}^{L}\hat\zeta_{kl}z_{li}$ and the remaining coordinates are the exogenous variables. These are estimators of the linear projection instruments that we use on the second stage to estimate $\beta^*$. Specifically, on the second stage we apply the STIV estimator where we replace the matrices $Z$, $D_Z$, and $\Psi_n$ by their estimated counterparts corresponding to new vectors of instruments of size $K$ instead of $L$ (just use $\hat x_i$ instead of $z_i$). Intuitively this should yield larger sensitivities $\kappa_{p,J}$, $\kappa^*_{k,J}$ and others since the new instruments are better correlated with the endogenous variables. Also, the $\log(L)$ term in the expression for $r$ and in the rates is reduced to a $\log(K)$ term.

We do not discuss here a theoretical justification of this method. In Section 8 we show that it works successfully in simulations. Note also that a quick proof can be obtained using the sample splitting argument. Indeed, if the linear projection instruments are obtained from the first subsample, whereas the second subsample, independent of the first one, is used to estimate $\beta^*$, then $\hat x_i$ are valid instruments. Therefore, conditioning on the first subsample, we can apply the theory of Section 5. However, in practice, it seems reasonable to use the whole data set on both steps of the two-stage procedure.

Finally, note that another type of two-stage procedure, not motivated by endogeneity, is discussed in the literature on sparsity in high-dimensional linear models (see, e.g., Candès and Tao (2007) and Belloni and Chernozhukov (2010)). At the first stage, the support of the true vector is estimated with a high-dimensional procedure, such as the Lasso or Dantzig selector, and at the second stage the OLS is used on the estimated support. Belloni and Chernozhukov (2010) study the theoretical properties of such two-stage procedures. An analog of this approach for the setting that we consider here would be a two-stage procedure with the STIV estimator at the first stage and some classical IV estimator (such as the GMM) at the second stage.

8. Practical implementation

8.1. Computational aspects. Finding a solution $(\hat\beta, \hat\sigma)$ of the minimization problem (3.5) reduces to the following conic program: find $\beta \in \mathbb{R}^K$ and $t > 0$ ($\sigma = t/\sqrt{n}$), which achieve the minimum

$$\min_{(\beta,t,v,w) \in \mathcal{V}}\left(\sum_{k=1}^{K} w_k + c\,\frac{t}{\sqrt{n}}\right) \tag{8.1}$$

where $\mathcal{V}$ is the set of $(\beta, t, v, w)$ satisfying:

$$v = Y - X\beta, \qquad -w \le D_X^{-1}\beta \le w, \qquad w \ge 0,$$

$$-rt\mathbf{1} \le \frac{1}{\sqrt{n}}D_ZZ^T(Y - X\beta) \le rt\mathbf{1}, \qquad (t, v) \in C.$$

Here and below $\mathbf{0}$ and $\mathbf{1}$ are vectors of zeros and ones respectively, the inequality between vectors is understood in the componentwise sense, and $C$ is a cone: $C \triangleq \{(t, v) \in \mathbb{R} \times \mathbb{R}^n : t \ge |v|_2\}$. Conic programming is a standard tool in optimization and many open source toolboxes are available to implement it (see, e.g., Sturm (1999)).

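The expression in curly brackets in the lower bound (4.4) is equal to the value of the following optimization program:

$$\min_{\epsilon = \pm 1}\ \min_{(w,\Delta,v) \in \mathcal{V}_{k,j}} v \tag{8.2}$$

where $\mathcal{V}_{k,j}$ is the set of $(w, \Delta, v)$ with $w \in \mathbb{R}^K$, $\Delta \in \mathbb{R}^K$, $v \in \mathbb{R}$ satisfying:

$$v \ge 0, \qquad -v\mathbf{1} \le \Psi_n\Delta \le v\mathbf{1}, \qquad w \ge 0, \qquad w_I = 0, \qquad \Delta_k = 1, \qquad \epsilon\Delta_j \ge 0,$$

$$-w_{I^c} \le \Delta_{I^c} \le w_{I^c} \ \text{ for } I = \{j, k\}, \qquad \sum_{i=1}^{K} w_i + 1 \le \epsilon(a + g)\Delta_j,$$

where $g$ is the constant such that

$$g = \begin{cases} 0 & \text{if } k = j, \\ -1 & \text{otherwise.}\end{cases}$$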

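Note that, here, $\epsilon$ is the sign of $\Delta_j$, and (8.2) is the minimum of two terms, each of which is the value of a linear program. Analogously, the expression in curly brackets in (4.7) can be computed by solving $2^{|J_0|}$ linear programs. The reduction is done in the same way as in (8.2) with the only difference that instead of $\epsilon$ we introduce a vector $(\epsilon_k)_{k \in J_0}$ of signs of the coordinates $\Delta_k$ for indices $k \in J_0$.

The coordinate-wise sensitivities

$$\kappa^*_{k,J} = \inf_{\Delta_k = 1,\ |\Delta_{J^c}|_1 \le \frac{1+c}{1-c}|\Delta_J|_1} |\Psi_n\Delta|_\infty$$

can be efficiently computed for given $J$ when the cardinality $|J|$ is small. Indeed, it is enough to find the minimum of the values of $2^{|J|}$ linear programs:

$$\min_{(\epsilon_j)_{j \in J} \in \{-1,1\}^{|J|}}\ \min_{(w,\Delta,v) \in \mathcal{U}_{k,J}} v \tag{8.3}$$

where $\mathcal{U}_{k,J}$ is the set of $(w, \Delta, v)$ with $w \in \mathbb{R}^K$, $\Delta \in \mathbb{R}^K$, $v \in \mathbb{R}$ satisfying:

$$v \ge 0, \qquad -v\mathbf{1} \le \Psi_n\Delta \le v\mathbf{1}, \qquad w \ge 0, \qquad w_I = 0, \qquad \Delta_k = 1,$$

$$-w_{I^c} \le \Delta_{I^c} \le w_{I^c} \ \text{ for } I = J \cup \{k\}, \qquad \epsilon_j\Delta_j \ge 0 \ \text{ for } j \in J, \qquad \sum_{i=1}^{K} w_i \le \frac{1+c}{1-c}\sum_{j \in J}\epsilon_j\Delta_j + g.$$

Here $(\epsilon_j)_{j \in J}$ is the vector of signs of the coordinates $\Delta_j$ with $j \in J$ and $g$ is the constant defined by

$$g = \begin{cases} 0 & \text{if } k \in J, \\ -1 & \text{otherwise.}\end{cases}$$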
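The outer minimization in (8.3) is a plain enumeration of sign patterns, which the following sketch makes explicit. The linear program itself is not implemented here: `solve_lp` is a hypothetical user-supplied callback that returns the optimal value of the inner program for a given $k$, $J$ and sign pattern (for instance via any LP solver), and the function names are illustrative only.

```python
from itertools import product


def sign_patterns(J):
    """Yield all 2^{|J|} sign vectors (eps_j)_{j in J} as dicts {j: +1 or -1}."""
    indices = sorted(J)
    for eps in product((-1, 1), repeat=len(indices)):
        yield dict(zip(indices, eps))


def kappa_star(k, J, solve_lp):
    """Coordinate-wise sensitivity as the minimum over sign patterns of LP values.

    solve_lp(k, J, eps) is assumed to return the value of the inner linear
    program in (8.3) for the sign pattern eps."""
    return min(solve_lp(k, J, eps) for eps in sign_patterns(J))


if __name__ == "__main__":
    J = {1, 2, 3, 4, 5}
    print(sum(1 for _ in sign_patterns(J)))  # number of linear programs to solve
```

The cost doubles with each additional index in $J$, which is why the text restricts this approach to small $|J|$.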

8.2. Simulations. In this section, we consider the performance of the STIV estimator on simulated data. The model is as follows:

$$y_i = \sum_{k=1}^{K} x_{ki}\beta_k^* + u_i,$$

$$x_{1i} = \sum_{l=1}^{L-K+1} z_{li}\zeta_l + v_i,$$

$$x_{l'i} = z_{li} \quad \text{for } l' = l - L + K \text{ and } l \in \{L - K + 2, \dots, L\},$$

where $(y_i, x_i^T, z_i^T, u_i, v_i)$ are i.i.d., $(u_i, v_i)$ have the joint normal distribution

$$N\left(0, \begin{pmatrix}\sigma_{\mathrm{struct}}^2 & \rho\,\sigma_{\mathrm{struct}}\sigma_{\mathrm{end}} \\ \rho\,\sigma_{\mathrm{struct}}\sigma_{\mathrm{end}} & \sigma_{\mathrm{end}}^2\end{pmatrix}\right),$$

$z_i^T$ is a vector of independent standard normal random variables, and $z_i^T$ is independent of $(u_i, v_i)$.
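A minimal sketch of this data-generating process, using only the Python standard library, is given below. The function name, the seed handling and the list-of-lists layout are illustrative choices and not part of the paper; the default arguments are the parameter values reported in this section ($n = 49$, $L = 50$, $K = 25$, $\sigma_{\mathrm{struct}} = \sigma_{\mathrm{end}} = \rho = 0.3$, $\zeta_l = 0.15$, five coefficients equal to 1).

```python
import random
from math import sqrt


def simulate(n=49, L=50, K=25, sigma_struct=0.3, sigma_end=0.3, rho=0.3,
             zeta=0.15, beta=None, seed=0):
    """Draw (Y, X, Z) from the simulation design of Section 8.2.

    Y is a list of n responses, X a list of n rows of length K,
    Z a list of n rows of length L."""
    rng = random.Random(seed)
    if beta is None:
        beta = [1.0] * 5 + [0.0] * (K - 5)
    Y, X, Z = [], [], []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(L)]
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # (u, v) is bivariate normal with correlation rho.
        u = sigma_struct * e1
        v = sigma_end * (rho * e1 + sqrt(1.0 - rho ** 2) * e2)
        # Endogenous regressor: driven by the first L-K+1 instruments.
        x1 = zeta * sum(z[: L - K + 1]) + v
        # Exogenous regressors: the last K-1 instruments.
        x = [x1] + z[L - K + 1:]
        y = sum(xk * bk for xk, bk in zip(x, beta)) + u
        Y.append(y)
        X.append(x)
        Z.append(z)
    return Y, X, Z


if __name__ == "__main__":
    Y, X, Z = simulate()
    print(len(Y), len(X[0]), len(Z[0]))
```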

Clearly, in this model $\mathbb{E}[z_iu_i] = 0$. We take $n = 49$, $L = 50$, $K = 25$, $\sigma_{\mathrm{struct}} = \sigma_{\mathrm{end}} = \rho = 0.3$, $\beta^* = (1, 1, 1, 1, 1, 0, \dots, 0)^T$ and $\zeta_l = 0.15$ for $l = 1, \dots, L - K + 1$. We have 50 instruments and only 49 observations, so we are in a framework of application of high-dimensional techniques. We set $c = 0.1$ and take $A$ satisfying (5.5) with $\alpha = 0.05$. The three columns on the left of Table 1 present simulation results for the STIV estimator. It is straightforward to see that only the first five variables (the true support of $\beta^*$) are eligible to be considered as relevant. This set will be denoted by $\hat J$. The second and third columns in Table 1 present the true coordinate-wise sensitivities $\kappa^*_{k,\hat J}$ as well as their lower bounds $\kappa^*_k(5)$ obtained via the sparsity certificate with $s = 5$. These lower bounds are easy to compute, and we see that they yield reasonable approximations from below of the true sensitivities. The estimate $\hat\sigma$ is 0.247, which is quite close to $\sigma_{\mathrm{struct}}$. Next, based on (5.3), the fact that $J_{\mathrm{end}} = \{1\}$, and the bounds on the sensitivities in Proposition 4.1 and in (4.4)-(4.7), we have the following formulas for the confidence intervals:

$$|\hat\beta_k - \beta_k^*| \le \frac{2\hat\sigma r}{x_{k*}\,\kappa^*_{k,\hat J}}\left(1 - \frac{r}{\kappa^*_{1,\hat J}} - \frac{r^2}{\kappa_{1,\hat J}}\right)_+^{-1}, \tag{8.4}$$

$$|\hat\beta_k - \beta_k^*| \le \frac{2\hat\sigma r}{x_{k*}\,\kappa^*_k(s)}\left(1 - \frac{r}{\kappa^*_1(s)} - \frac{r^2}{\kappa_1(s)}\right)_+^{-1}, \qquad \text{with } s = 5. \tag{8.5}$$

Here, $\kappa^*_{k,\hat J}$ and $\kappa^*_k(s)$ are computed directly via the programs (8.3) and (8.2) respectively. The value $\kappa_1(s)$ is then obtained from (4.6), and for $\kappa_{1,\hat J}$ we use a lower bound analogous to (4.6):

$$\kappa_{1,\hat J} \ge \frac{1-c}{2|\hat J|}\min_{k=1,\dots,K}\kappa^*_{k,\hat J}.$$
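The last lower bound is a one-line computation. The sketch below evaluates it with $c = 0.1$, $|\hat J| = 5$ and the coordinate-wise sensitivities displayed in column (1) of Table 1; the rows that the table replaces by dots are not available and are left out, which is an assumption of this illustration. The function name is hypothetical.

```python
def kappa1_lower_bound(kappa_star_values, card_J, c):
    """Lower bound (1 - c) / (2 |J|) * min_k kappa*_{k,J} on the l1-sensitivity."""
    if card_J < 1 or not (0.0 < c < 1.0):
        raise ValueError("need |J| >= 1 and 0 < c < 1")
    return (1.0 - c) / (2.0 * card_J) * min(kappa_star_values)


if __name__ == "__main__":
    # Coordinate-wise sensitivities shown in Table 1, column (1).
    shown = [0.107, 0.308, 0.129, 0.150, 0.253, 0.166, 0.155, 0.154,
             0.287, 0.243, 0.141]
    print(round(kappa1_lower_bound(shown, card_J=5, c=0.1), 4))  # about 0.0096
```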

Table 1. Results for the STIV estimator without and with estimated instruments, n = 49

| Coefficient | $\hat\beta$ (1) | $\kappa^*_{k,\hat J}$ (1) | $\kappa^*_k(5)$ (1) | $\hat\beta$ (2) | $\kappa^*_{k,\hat J}$ (2) | $\kappa^*_k(5)$ (2) |
|---|---|---|---|---|---|---|
| $\beta_1^*$ | 1.03 | 0.107 | 0.103 | 1.03 | 0.085 | 0.068 |
| $\beta_2^*$ | 0.98 | 0.308 | 0.157 | 0.98 | 0.367 | 0.075 |
| $\beta_3^*$ | 0.96 | 0.129 | 0.103 | 0.96 | 0.126 | 0.071 |
| $\beta_4^*$ | 0.95 | 0.150 | 0.109 | 0.95 | 0.115 | 0.057 |
| $\beta_5^*$ | 0.90 | 0.253 | 0.175 | 0.90 | 0.177 | 0.086 |
| $\beta_6^*$ | 0.00 | 0.166 | 0.095 | 0.00 | 0.126 | 0.065 |
| $\beta_7^*$ | 0.00 | 0.155 | 0.080 | 0.00 | 0.148 | 0.060 |
| $\beta_8^*$ | 0.00 | 0.154 | 0.110 | 0.00 | 0.122 | 0.056 |
| ... | ... | ... | ... | ... | ... | ... |
| $\beta_{23}^*$ | 0.02 | 0.287 | 0.170 | 0.02 | 0.231 | 0.128 |
| $\beta_{24}^*$ | 0.00 | 0.243 | 0.137 | 0.00 | 0.195 | 0.105 |
| $\beta_{25}^*$ | 0.00 | 0.141 | 0.109 | 0.00 | 0.106 | 0.067 |

We use dots because the values that do not appear are similar. (1): With all the 50 instruments. (2): With 25 instruments including an estimate of the linear projection instrument.

We get κ∗

1,Jb

= 0.0096 and κ∗1 (5) = 0.0072. In particular, we have r/κ∗

1,Jb

= 4.40 > 1, so that (8.4) and

(8.5) do not provide confidence intervals in this numerical example. The columns on the right in Table 1 present the results where we use the same data, estimate the linear projection instrument by the Square-root Lasso and then take only K instruments: zil , P b b l = L − K + 2, . . . , L, and x bi1 = L l=1 ζl zil , where ζl are the Square-root Lasso estimators of ζl , l =

1, . . . , L. The Square-root Lasso with parameter c√Lasso = 1.1 recommended in Belloni, Chernozhukov and Wang (2010)

1

yields all coefficients equal to zero when keeping only the first three digits. This

is disappointing since we get an instrument equal to zero. It should be noted that estimation in this setting is a hard problem since the dimension L is larger than the sample size, the number of non-zero coefficients ζl is large (L − K + 1 = 26), and their values are relatively small (equal to 0, 15).

To improve the estimation, we adjusted the parameter $c_{\sqrt{\rm Lasso}}$ empirically, based on the value of the estimates. Ultimately, we have chosen $c_{\sqrt{\rm Lasso}} = 0.3$. This choice is not covered by the theory of Belloni, Chernozhukov and Wang (2010) because there $c_{\sqrt{\rm Lasso}}$ should be greater than 1. However, it leads to $\sqrt{\widehat Q(\widehat\beta)} = 0.309$, which is very close to $\sigma_{\rm end}$. The corresponding estimates $\widehat\zeta_l$ are given in Table 2. We see that they are not very close to the true $\zeta_l$; some of the relevant coefficients are erroneously set to 0 and several superfluous variables are included, sometimes with significant coefficients, such as $\widehat\zeta_{32}$. We get $\kappa^*_{1,\widehat J} = 0.0076$ and $\kappa^*_1(5) = 0.0040$. Again, $r/\kappa^*_{1,\widehat J} > 1$, so that we cannot use (8.4) and (8.5) to get the confidence intervals.

Note that this approach based on the estimated linear projection instrument gives sensitivities that are lower than with the full set of instruments. This is mainly due to the fact that the estimation of the linear projection instrument is quite imprecise. Indeed, we add an instrument $\widehat x_{i1}$, which is not so good, and at the same time we drop a large number of other instruments, which may be not so bad. The overall effect on the sensitivities turns out to be negative.

Table 2. Estimates of the coefficients of the linear projection instrument

  l        1       2       3       4       6       8       9      10      14      15      16      17      18      20
  ζ̂_l    0.084   0.130   0.190   0.142   0.115   0.083   0.104   0.126   0.176   0.030   0.023   0.157   0.135   0.082

  l       21      23      24      25      26      27      32      33      34      44      47      49      50
  ζ̂_l    0.100   0.125   0.038   0.025  -0.009  -0.063   0.033   0.026  -0.058   0.108   0.005  -0.053  -0.006

We only show the non-zero coefficients.

¹ The constant $c_{\sqrt{\rm Lasso}}$, denoted by c in Belloni, Chernozhukov and Wang (2010), should not be mixed up with $c = c_{\rm STIV}$ in the definition of the STIV estimator; $c_{\sqrt{\rm Lasso}}$ is an equivalent of $\sqrt{n}/c_{\rm STIV}$, up to constants.

Recall that since the sensitivities involve the maximum of the scalar products of the rows of Ψn with ∆, the more rows (i.e., instruments) we have, the higher the sensitivity. The same deterioration of the sensitivities occurred in other simulated data sets. In conclusion, the approach based on estimation of the linear projection instrument was not helpful to realize the above confidence intervals in this small sample situation. However, we will see that it achieves the task when the sample size gets large.

Although in this numerical example we were not able to use (8.4) and (8.5) for the confidence intervals, we got evidence that the performance of the STIV estimator is quite satisfactory. Table 3 shows a Monte-Carlo study where we keep the same values of the parameters of the model, of the sample size n = 49, and of the parameter A defining the set $\widehat{\mathcal I}$, simulate 1000 data sets, and compute 1000 estimates. The empirical performance of the STIV estimator is extremely good, even for the endogenous variable. The Monte-Carlo estimation of the variability of $\widehat\beta_1$ is very similar to that of the exogenous variables. With c = 0.1 the estimate $\widehat\sigma$ is larger than $\sigma_{\rm struct}$ in 95% of the simulations. This suggests that there remains some margin to penalize less for the “variance” in (3.5), i.e., to decrease c and thus to obtain higher sensitivities.

Next, we study the empirical behavior of the non-pivotal STIV estimator. We consider the same model and the same values of all the parameters, and we choose $\sigma_* = 2 \cdot 0.233$ where 0.233 is the median of $\widehat\sigma$ from Table 3. Indeed, $P(E_n[U^2] \le \sigma_*^2)$ should be close to 1 (see Assumption 5.4).

The results are given in Table 4. The non-pivotal procedure seems to better estimate as zeros the zero coefficients. This is because we minimize the ℓ1 norm of the coefficients without an additional cσ term. On the other hand, the non-zero coefficients are better estimated using the pivotal estimator. The non-pivotal procedure yields some shrinkage to zero (especially for large $\sigma_*$). Using the pivotal procedure in the first place allows us to have a good initial guess of $\sigma_*$.

Table 3. Monte-Carlo study, 1000 replications

           5th percentile    Median    95th percentile
  β1∗           0.872         0.986         1.093
  β2∗           0.877         0.970         1.048
  β3∗           0.879         0.970         1.049
  β4∗           0.886         0.971         1.051
  β5∗           0.877         0.968         1.049
  β6∗          -0.048         0.000         0.055
  β7∗          -0.059         0.000         0.063
  β8∗          -0.057         0.000         0.055
  β9∗          -0.052         0.000         0.059
  ...            ...           ...           ...
  β23∗         -0.051         0.000         0.051
  β24∗         -0.057         0.000         0.051
  β25∗         -0.053         0.000         0.049
  σ̂             0.181         0.233         0.291

Table 4. Monte-Carlo study of the non-pivotal estimator, 1000 replications

           5th percentile    Median    95th percentile
  β1∗           0.714         0.914         1.110
  β2∗           0.803         0.909         1.010
  β3∗           0.789         0.904         1.019
  β4∗           0.793         0.904         1.023
  β5∗           0.796         0.907         1.017
  β6∗           0.000         0.000         0.020
  β7∗           0.000         0.000         0.021
  β8∗          -0.003         0.000         0.016
  β9∗           0.000         0.000         0.024
  β10∗          0.000         0.000         0.018
  ...            ...           ...           ...
  β23∗          0.000         0.000         0.021
  β24∗          0.000         0.000         0.016
  β25∗          0.000         0.000         0.005

Let us now increase n to see whether we can obtain interval estimates and take advantage of thresholding for variable selection. We consider the same model as above and the same values of the parameters of the method but we replace n = 49 by n = 8000. Then we are no longer in a situation where we must use specific high-dimensional techniques. However, it is still a challenging task to select among 25 candidate variables, one of them being endogenous. Indeed, classical selection procedures like the BIC would require solving $2^{25}$ least squares problems. Our methods are much less numerically intensive. They are based on linear and conic programming, and their computational cost is polynomial in the dimension. We study both the setting with all the 50 instruments and the setting where we estimate the linear projection instrument.

Table 5. Confidence intervals and selection of variables, n = 8000

          β̂_l,SC   β̂_l,Ĵ     β̂      β̂_u,Ĵ   β̂_u,SC   κ∗_k,Ĵ   κ∗_k(5)   ω_k,Ĵ    ω_k,SC
  β1∗      0.131    0.135    1.048    1.960    1.965    0.134    0.134     0.912    0.917
  β2∗      0.795    0.804    0.995    1.185    1.195    0.897    0.855     0.191    0.200
  β3∗      0.824    0.829    1.004    1.179    1.185    0.796    0.775     0.175    0.180
  β4∗      0.817    0.822    0.998    1.173    1.178    0.858    0.833     0.175    0.181
  β5∗      0.833    0.834    1.001    1.168    1.168    0.793    0.790     0.167    0.168
  β6∗     -0.163   -0.160    0.003    0.166    0.169    0.807    0.791     0.163    0.166
  β7∗     -0.173   -0.168    0.002    0.172    0.177    0.846    0.823     0.170    0.175
  β8∗     -0.173   -0.170    0.001    0.173    0.175    0.789    0.779     0.172    0.174
  ...       ...      ...      ...      ...      ...      ...      ...       ...      ...
  β23∗    -0.190   -0.188    0.003    0.194    0.197    0.802    0.793     0.191    0.193
  β24∗    -0.171   -0.166    0.001    0.168    0.172    0.842    0.821     0.167    0.172
  β25∗    -0.172   -0.169   -0.005    0.158    0.162    0.828    0.811     0.163    0.167

Consider first the case where we use all the instruments. Set for brevity
\[
\widehat w \triangleq \Big(1 - \frac{r}{\kappa^*_{1,\widehat J}} - \frac{r^2}{\kappa^*_{1,\widehat J}}\Big)_+^{-1},
\qquad
w(5) \triangleq \Big(1 - \frac{r}{\kappa^*_1(5)} - \frac{r^2}{\kappa^*_1(5)}\Big)_+^{-1}.
\]
These are the quantities appearing in (8.4) and (8.5). As above, we take $\widehat J$ equal to the set of the first five coordinates; w(5) corresponds to the sparsity certificate approach with s = 5. Computing the exact coordinate-wise sensitivities we obtain the bound $\widehat w \le 1.6277$. The sparsity certificate approach with s = 5 yields $w(5) \le 1.6306$. We obtain $\widehat\sigma = 0.2970$ and the estimates in Table 5. The values $\widehat\beta_{l,\widehat J}$ and $\widehat\beta_{u,\widehat J}$ are the lower and upper confidence limits respectively obtained from (8.4); $\widehat\beta_{l,SC}$ and $\widehat\beta_{u,SC}$ are the lower and upper confidence limits obtained from (8.5) (sparsity certificate approach with s = 5). The thresholds $\omega_{k,\widehat J}$ and $\omega_k(5)$ are computed from the formulas
\[
\omega_{k,\widehat J} = \frac{2 \cdot 1.6277\, \widehat\sigma r}{x_{k*}\, \kappa^*_{k,\widehat J}},
\qquad
\omega_k(5) = \frac{2 \cdot 1.6306\, \widehat\sigma r}{x_{k*}\, \kappa^*_k(5)}.
\]

Table 5 shows that in this example thresholding works well: The true support of β ∗ is recovered exactly by selecting the variables, for which the estimated coefficient is larger than the threshold. Note that the threshold for the endogenous variable is very close to the estimate of the first coefficient βb1 since the confidence intervals are wider for the endogenous variable.
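The thresholding step can be sketched in a few lines of Python. This is only an illustration built from the rows printed in Table 5 (coordinates 9 to 22 are elided there and are therefore absent here); the names `TABLE5_ROWS`, `threshold` and `interval` are hypothetical helpers, not part of the paper. The thresholding rule (keep the estimate when its absolute value exceeds the threshold, set it to zero otherwise) follows the proof of Theorem 5.9. The check that the tabulated limits agree with "estimate plus or minus threshold" is an observation about the printed numbers, not a restatement of (8.4).

```python
# Rows of Table 5 as printed: (k, estimate, threshold omega_{k,J-hat},
# lower limit, upper limit). Coordinates 9 to 22 are elided in the table.
TABLE5_ROWS = [
    (1, 1.048, 0.912, 0.135, 1.960),
    (2, 0.995, 0.191, 0.804, 1.185),
    (3, 1.004, 0.175, 0.829, 1.179),
    (4, 0.998, 0.175, 0.822, 1.173),
    (5, 1.001, 0.167, 0.834, 1.168),
    (6, 0.003, 0.163, -0.160, 0.166),
    (7, 0.002, 0.170, -0.168, 0.172),
    (8, 0.001, 0.172, -0.170, 0.173),
    (23, 0.003, 0.191, -0.188, 0.194),
    (24, 0.001, 0.167, -0.166, 0.168),
    (25, -0.005, 0.163, -0.169, 0.158),
]


def threshold(estimate, omega):
    """Thresholded estimate: keep it if |estimate| > omega, else return 0."""
    return estimate if abs(estimate) > omega else 0.0


def interval(estimate, omega):
    """Symmetric interval estimate +/- omega (assumed form, see lead-in)."""
    return estimate - omega, estimate + omega


# Indices whose thresholded estimate is non-zero.
selected = [k for k, est, om, _, _ in TABLE5_ROWS if threshold(est, om) != 0.0]

# Largest discrepancy between estimate +/- omega and the tabulated limits.
max_gap = max(
    max(abs(interval(est, om)[0] - lo), abs(interval(est, om)[1] - hi))
    for _, est, om, lo, hi in TABLE5_ROWS
)

if __name__ == "__main__":
    print(selected)  # [1, 2, 3, 4, 5]
    print(max_gap)   # small: the printed values are rounded to three digits
```

On the printed rows, the selected set is exactly the first five coordinates, which is the true support in this simulation design.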

We now consider the case where we use only 25 instruments; the 24 exogenous variables serve as their own instruments and the Square-root Lasso estimator of the linear projection instrument is used for the endogenous variable. This time, we apply the Square-root Lasso with the recommended choice $c_{\sqrt{\rm Lasso}} = 1.1$. We get $\sqrt{\widehat Q(\widehat\beta)} = 0.3012$. The estimates $\widehat\zeta_l$ are given in Table 6. Next, we use (8.4) and (8.5) to obtain the confidence intervals. Computing the exact coordinate-wise sensitivities we get the bound $\widehat w \le 1.0941$. The sparsity certificate approach with s = 5 yields $w(5) \le 1.0990$. We also get $\widehat\sigma = 0.2970$. The thresholds $\omega_{k,\widehat J}$ and $\omega_k(5)$ are obtained from the formulas
\[
\omega_{k,\widehat J} = \frac{2 \cdot 1.0941\, \widehat\sigma r}{x_{k*}\, \kappa^*_{k,\widehat J}},
\qquad
\omega_k(5) = \frac{2 \cdot 1.0990\, \widehat\sigma r}{x_{k*}\, \kappa^*_k(5)}.
\]

Table 6. Estimates of the coefficients in the linear projection instrument

  l        1       2       3       4       5       6       7       8       9      10      11      12      13      14
  ζ̂_l    0.142   0.145   0.134   0.136   0.137   0.135   0.139   0.139   0.134   0.140   0.146   0.140   0.134   0.136

  l       15      16      17      18      19      20      21      22      23      24      25      26
  ζ̂_l    0.137   0.138   0.141   0.128   0.142   0.137   0.133   0.135   0.135   0.142   0.137   0.138

We only show the non-zero coefficients (keeping only three digits).

Table 7. Confidence intervals and selection of variables, n = 8000

          β̂_l,SC   β̂_l,Ĵ     β̂      β̂_u,Ĵ   β̂_u,SC   κ∗_k,Ĵ   κ∗_k(5)   ω_k,Ĵ    ω_k,SC
  β1∗      0.901    0.909    1.048    1.187    1.194    0.556    0.531     0.139    0.146
  β2∗      0.872    0.883    0.995    1.106    1.118    0.968    0.882     0.111    0.123
  β3∗      0.896    0.905    1.004    1.103    1.112    0.888    0.821     0.099    0.108
  β4∗      0.885    0.893    0.998    1.102    1.110    0.907    0.848     0.105    0.113
  β5∗      0.899    0.902    1.001    1.100    1.103    0.843    0.823     0.099    0.102
  β6∗     -0.098   -0.092    0.003    0.099    0.104    0.868    0.822     0.095    0.101
  β7∗     -0.103   -0.098    0.002    0.102    0.107    0.907    0.869     0.100    0.105
  β8∗     -0.099   -0.095    0.001    0.098    0.102    0.886    0.853     0.096    0.101
  ...       ...      ...      ...      ...      ...      ...      ...       ...      ...
  β23∗    -0.115   -0.109    0.003    0.115    0.121    0.862    0.825     0.112    0.118
  β24∗    -0.104   -0.099    0.001    0.101    0.106    0.888    0.848     0.100    0.105
  β25∗    -0.109   -0.104   -0.005    0.093    0.098    0.870    0.830     0.098    0.103

The results are summarized in Table 7. Note that the confidence intervals and the thresholds are sharper than in the approach including all the instruments. The particularly good news is that the confidence interval for the coefficient of the endogenous variable becomes much tighter. In conclusion, when the sample size is large, the coordinate-wise sensitivities based on the sparsity certificate work remarkably well for estimation, confidence intervals, and variable selection.

We also get a significant improvement from using the two-stage procedure with estimated linear projection instrument.

9. Appendix

9.1. Lower bounds on κp,J for square matrices Ψn. The following propositions establish lower bounds on $\kappa_{p,J}$ when Ψn is a square K × K matrix. For any J ⊆ {1, . . . , K} we define the following restricted eigenvalue (RE) constants
\[
\kappa_{RE,J} \triangleq \inf_{\Delta \in \mathbb{R}^K \setminus \{0\}:\ \Delta \in C_J} \frac{|\Delta^T \Psi_n \Delta|}{|\Delta_J|_2^2},
\qquad
\kappa'_{RE,J} \triangleq \inf_{\Delta \in \mathbb{R}^K \setminus \{0\}:\ \Delta \in C_J} \frac{|J|\, |\Delta^T \Psi_n \Delta|}{|\Delta_J|_1^2}.
\]

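Proposition 9.1. For any J ⊆ {1, . . . , K} we have
\[
\kappa_{1,J} \ge \frac{(1-c)^2}{4|J|}\, \kappa'_{RE,J} \ge \frac{(1-c)^2}{4|J|}\, \kappa_{RE,J}.
\]

Proof. For ∆ such that $|\Delta_{J^c}|_1 \le \frac{1+c}{1-c} |\Delta_J|_1$ we have $|\Delta|_1 \le \frac{2}{1-c} |\Delta_J|_1$. Thus,
\[
\frac{|\Delta^T \Psi_n \Delta|}{|\Delta_J|_1^2}
\le \frac{|\Psi_n \Delta|_\infty |\Delta|_1}{|\Delta_J|_1^2}
\le \frac{4}{(1-c)^2}\, \frac{|\Psi_n \Delta|_\infty}{|\Delta|_1}.
\]
This proves the first inequality of the proposition. The second inequality is obvious.

Proposition 9.2. Let J ⊆ {1, . . . , K} be such that
\[
\inf_{\Delta \in \mathbb{R}^K \setminus \{0\}:\ \Delta \in C_J} \frac{|X D_X \Delta|_2}{\sqrt{n}\, |\Delta_J|_2} \ge \widetilde\kappa
\tag{9.1}
\]
for some $\widetilde\kappa > 0$, and let there exist 0 < δ < 1 such that
\[
\Big| \frac{1}{n} (X D_X - Z D_Z)^T X D_X \Big|_\infty \le \frac{\delta (1-c)^2 \widetilde\kappa^2}{4|J|}.
\tag{9.2}
\]
Then
\[
\kappa_{1,J} \ge \frac{(1-\delta)(1-c)^2 \widetilde\kappa^2}{4|J|}.
\]

Proof. We have
\[
|\Psi_n \Delta|_\infty |\Delta|_1 \ge |\Delta^T \Psi_n \Delta|
\ge \frac{1}{n} \Delta^T D_X X^T X D_X \Delta - \Big| \frac{1}{n} \Delta^T (X D_X - Z D_Z)^T X D_X \Delta \Big|,
\]
where
\[
\Big| \frac{1}{n} \Delta^T (X D_X - Z D_Z)^T X D_X \Delta \Big|
\le \Big| \frac{1}{n} (X D_X - Z D_Z)^T X D_X \Big|_\infty |\Delta|_1^2
\le \frac{\delta (1-c)^2 \widetilde\kappa^2}{4|J|}\, |\Delta|_1^2.
\]
Combining these inequalities and using that $|\Delta|_1^2 \le \frac{4}{(1-c)^2} |J|\, |\Delta_J|_2^2$ for all ∆ ∈ C_J (cf. proof of Proposition 9.1) we get the result.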
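The two elementary cone inequalities used in the proofs of Propositions 9.1 and 9.2, namely $|\Delta|_1 \le \frac{2}{1-c}|\Delta_J|_1$ and $|\Delta|_1^2 \le \frac{4|J|}{(1-c)^2}|\Delta_J|_2^2$ for ∆ in the cone $|\Delta_{J^c}|_1 \le \frac{1+c}{1-c}|\Delta_J|_1$, can be checked numerically. The sketch below is a sanity check on random vectors, not a proof; the function names and the default values K = 25, J = first five coordinates, c = 0.1 are illustrative assumptions.

```python
import random


def l1(v):
    """l1 norm of a list of floats."""
    return sum(abs(x) for x in v)


def l2_squared(v):
    """Squared l2 norm of a list of floats."""
    return sum(x * x for x in v)


def random_cone_vector(K, J, c, rng):
    """Draw delta in R^K with |delta_{J^c}|_1 <= (1+c)/(1-c) * |delta_J|_1."""
    delta = [rng.gauss(0.0, 1.0) for _ in range(K)]
    off_support = [k for k in range(K) if k not in J]
    on_mass = l1([delta[k] for k in J])
    off_mass = l1([delta[k] for k in off_support])
    # Rescale the off-support part so that it uses a random share of the budget.
    budget = rng.random() * (1.0 + c) / (1.0 - c) * on_mass
    if off_mass > 0.0:
        for k in off_support:
            delta[k] *= budget / off_mass
    return delta


def check_cone_inequalities(K=25, J=(0, 1, 2, 3, 4), c=0.1, trials=1000, seed=0):
    """Return True if both inequalities hold on all sampled cone vectors."""
    rng = random.Random(seed)
    slack = 1.0 + 1e-9  # relative tolerance for floating-point rounding
    for _ in range(trials):
        delta = random_cone_vector(K, J, c, rng)
        delta_J = [delta[k] for k in J]
        if l1(delta) > slack * 2.0 / (1.0 - c) * l1(delta_J):
            return False
        bound = 4.0 * len(J) / (1.0 - c) ** 2 * l2_squared(delta_J)
        if l1(delta) ** 2 > slack * bound:
            return False
    return True


if __name__ == "__main__":
    print(check_cone_inequalities())
```

The second inequality follows from the first together with $|\Delta_J|_1^2 \le |J|\,|\Delta_J|_2^2$ (Cauchy-Schwarz), which is why the factor |J| appears.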

Note that (9.1) is the restricted eigenvalue condition of Bickel, Ritov and Tsybakov (2009) for the Gram matrix of X-variables, up to the normalization by $D_X$. Relation (9.2) accounts for the closeness between the instruments and the original set of variables suspected to be endogenous.

We now obtain bounds for sensitivities $\kappa_{p,J}$ with 1 < p ≤ 2. For any s ≤ K, we consider a uniform version of the restricted eigenvalue constant: $\kappa_{RE}(s) \triangleq \min_{|J| \le s} \kappa_{RE,J}$.

Proposition 9.3. For any s ≤ K/2 and 1 < p ≤ 2, we have
\[
\kappa_{p,J} \ge C(p)\, s^{-1/p}\, \kappa_{RE}(2s), \qquad \forall\, J : |J| \le s,
\]
where $C(p) = 2^{-1/p - 1/2} (1-c) \Big( 1 + \frac{1+c}{1-c} (p-1)^{-1/p} \Big)^{-1}$.

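Proof. For $\Delta \in \mathbb{R}^K$ and a set J ⊂ {1, . . . , K}, let $J_1 = J_1(\Delta, J)$ be the subset of indices in {1, . . . , K} corresponding to the s largest in absolute value components of ∆ outside of J. Define $J_+ = J \cup J_1$. If |J| ≤ s we have $|J_+| \le 2s$. It is easy to see that the kth largest absolute value of elements of $\Delta_{J^c}$ satisfies $|\Delta_{J^c}|_{(k)} \le |\Delta_{J^c}|_1 / k$. Thus,
\[
|\Delta_{J_+^c}|_p^p \le |\Delta_{J^c}|_1^p \sum_{k \ge s+1} \frac{1}{k^p} \le \frac{|\Delta_{J^c}|_1^p}{(p-1)\, s^{p-1}}.
\]
For ∆ ∈ C_J, this implies
\[
|\Delta_{J_+^c}|_p \le \frac{|\Delta_{J^c}|_1}{(p-1)^{1/p} s^{1-1/p}} \le \frac{c_0 |\Delta_J|_1}{(p-1)^{1/p} s^{1-1/p}} \le \frac{c_0 |\Delta_J|_p}{(p-1)^{1/p}},
\]
where $c_0 = \frac{1+c}{1-c}$. Therefore, for ∆ ∈ C_J,
\[
|\Delta|_p \le \big(1 + c_0 (p-1)^{-1/p}\big) |\Delta_{J_+}|_p \le \big(1 + c_0 (p-1)^{-1/p}\big) (2s)^{1/p - 1/2} |\Delta_{J_+}|_2.
\tag{9.3}
\]
Using (9.3) and the fact that $|\Delta|_1 \le \frac{2}{1-c} |\Delta_J|_1 \le \frac{2\sqrt{s}}{1-c} |\Delta_J|_2$ for ∆ ∈ C_J, we get
\[
\frac{|\Delta^T \Psi_n \Delta|}{|\Delta_{J_+}|_2^2}
\le \frac{|\Delta|_1 |\Psi_n \Delta|_\infty}{|\Delta_{J_+}|_2^2}
\le \frac{2\sqrt{s}\, |\Psi_n \Delta|_\infty}{(1-c)\, |\Delta_{J_+}|_2}
\le \frac{s^{1/p}\, |\Psi_n \Delta|_\infty}{C(p)\, |\Delta|_p}.
\]
Since $|J_+| \le 2s$, this proves the proposition.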
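Two elementary facts drive the proof of Proposition 9.3: the kth largest absolute entry of a vector is at most its ℓ1 norm divided by k, and the tail sum $\sum_{k \ge s+1} k^{-p}$ is at most $\frac{1}{(p-1) s^{p-1}}$ for p > 1 (comparison with the integral of $t^{-p}$ over $[s, \infty)$). The following sketch checks both numerically. The tail sum is truncated after a finite number of terms, so the comparison is a sanity check rather than a proof; all function names are illustrative.

```python
def truncated_tail_sum(s, p, terms=50000):
    """Partial sum of k**(-p) for k = s+1, ..., s+terms (a lower bound on the full tail)."""
    return sum(k ** (-p) for k in range(s + 1, s + 1 + terms))


def integral_bound(s, p):
    """Upper bound 1 / ((p-1) * s**(p-1)) on the full tail sum, valid for p > 1."""
    return 1.0 / ((p - 1.0) * s ** (p - 1.0))


def order_statistics_bounded(v):
    """Check that the k-th largest |v_i| is at most |v|_1 / k for every k."""
    total = sum(abs(x) for x in v)
    ranked = sorted((abs(x) for x in v), reverse=True)
    return all(a <= total / k * (1.0 + 1e-12) for k, a in enumerate(ranked, start=1))


if __name__ == "__main__":
    for s in (1, 5, 20):
        for p in (1.1, 1.5, 2.0):
            print(s, p, truncated_tail_sum(s, p) <= integral_bound(s, p))
    print(order_statistics_bounded([3.0, -1.0, 0.5, -2.5, 0.0]))
```

The order-statistics bound holds because k times the kth largest absolute value is at most the sum of the k largest absolute values, which in turn is at most the ℓ1 norm.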

The lower bounds in Propositions 9.1 and 9.3 require controlling $|\Delta^T \Psi_n \Delta|$ from below (where Ψn is a non-symmetric, possibly non-positive definite matrix) by a quadratic form with many zero eigenvalues for vectors in a cone of dominant coordinates. This is potentially a strong restriction on the instruments that we can use. In other words, the sensitivity characteristics $\kappa_{p,J}$ can be much larger than the above bounds. The propositions of this section imply that, even in the case of symmetric matrices, these characteristics are more general and potentially lead to better results than the restricted eigenvalues $\kappa_{RE}(\cdot)$ appearing in the usual RE condition of Bickel, Ritov and Tsybakov (2009).

9.2. Moderate deviations for self-normalized sums. We use the following result from Jing, Shao and Wang (2003), formula (2.11).

Theorem 9.4. Let $X_1, \dots, X_n$ be independent random variables such that, for every i, $E[X_i] = 0$ and $0 < E[|X_i|^{2+\delta}] < \infty$ for some 0 < δ ≤ 1. Set
\[
S_n = \sum_{i=1}^n X_i, \qquad V_n^2 = \sum_{i=1}^n X_i^2, \qquad B_n^2 = \sum_{i=1}^n E[X_i^2],
\]
\[
L_{n,\delta} = \sum_{i=1}^n E\big[|X_i|^{2+\delta}\big], \qquad d_{n,\delta} = B_n / L_{n,\delta}^{1/(2+\delta)}.
\]
Then
\[
\forall\, 0 \le x \le d_{n,\delta}, \qquad \big| P(S_n / V_n \ge x) - (1 - \Phi(x)) \big| \le A_0 (1+x)^{1+\delta} e^{-x^2/2} / d_{n,\delta}^{2+\delta},
\]
where $A_0 > 0$ is an absolute constant.

9.3. Proofs.

Proof of Proposition 4.1. Parts (i) and (ii) of the proposition are straightforward. The upper bound in (4.3) follows immediately from (4.1). Next, obviously, $|\Delta|_p \le |\Delta|_1^{1/p} |\Delta|_\infty^{1-1/p}$ and we get that, for ∆ ≠ 0,
\[
\frac{|\Psi_n \Delta|_\infty}{|\Delta|_p} \ge \frac{|\Psi_n \Delta|_\infty}{|\Delta|_\infty} \Big( \frac{|\Delta|_\infty}{|\Delta|_1} \Big)^{1/p}.
\]
Furthermore, (4.1) implies $|\Delta|_1 \le \frac{2}{1-c} |J|\, |\Delta|_\infty$ for ∆ ∈ C_J. Combining this with the above inequality we obtain the lower bound in (4.3).

Proof of Proposition 4.2. For all 1 ≤ k ≤ K and 1 ≤ l ≤ L,
\[
|(\Psi_n \Delta)_l - (\Psi_n)_{lk} \Delta_k| \le |\Delta|_1 \max_{k' \ne k} |(\Psi_n)_{lk'}|,
\]
which yields
\[
|(\Psi_n)_{lk}|\, |\Delta_k| \le |\Delta|_1 \max_{k' \ne k} |(\Psi_n)_{lk'}| + |(\Psi_n \Delta)_l|.
\]
The two inequalities of the assumption yield
\[
|(\Psi_n)_{l(k)k}|\, |\Delta_k| \le |\Delta|_1 \frac{(1-\eta_2)(1-c)}{2|J|} |(\Psi_n)_{l(k)k}| + \frac{1-c}{\eta_1} |(\Psi_n \Delta)_{l(k)}|\, |(\Psi_n)_{l(k)k}|.
\]
Now, using that $|(\Psi_n \Delta)_{l(k)}| \le |\Psi_n \Delta|_\infty$ we obtain
\[
|\Delta_j| \le |\Delta|_1 \frac{(1-\eta_2)(1-c)}{2|J|} + \frac{1-c}{\eta_1} |\Psi_n \Delta|_\infty.
\tag{9.4}
\]
Summing the inequalities over j in J yields
\[
|\Delta_J|_1 \le \frac{(1-\eta_2)(1-c)}{2} |\Delta|_1 + \frac{|J|(1-c)}{\eta_1} |\Psi_n \Delta|_\infty.
\]
This and the first inequality in (4.1) imply that we can take
\[
\kappa_{1,J} = \frac{\eta_1 \eta_2}{2|J|}.
\tag{9.5}
\]
Next, from (9.4) and (9.5) we deduce
\[
|\Delta_j| \le \Big( \frac{1-\eta_2}{\eta_1 \eta_2} + \frac{1}{\eta_1} \Big) (1-c)\, |\Psi_n \Delta|_\infty \le \frac{1-c}{\eta_1 \eta_2}\, |\Psi_n \Delta|_\infty,
\]
which implies
\[
\kappa_{\infty,J} \ge \frac{\eta_1 \eta_2}{1-c}.
\]
This and the lower bound in (4.3) yield the result.
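Theorem 9.4 of Section 9.2, which is applied in the proof of Theorem 5.2, states that the tail of the self-normalized sum $S_n / V_n$ is close to the standard Gaussian tail for moderate x. A small Monte-Carlo experiment can illustrate this. The sketch below is only an illustration under assumed settings: the summands are independent, centred and heteroscedastic (uniform on [−1, 1] times a scale that cycles through 1, 2, 3), and the sample size, number of replications and function names are arbitrary choices, not taken from the paper.

```python
import math
import random


def normal_tail(x):
    """1 - Phi(x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def self_normalized_tail(n, x, reps, rng):
    """Monte-Carlo estimate of P(S_n / V_n >= x) for heteroscedastic summands."""
    hits = 0
    for _ in range(reps):
        s = 0.0
        v2 = 0.0
        for i in range(n):
            # Centred, symmetric, with a scale depending on i (heteroscedastic).
            xi = (1 + i % 3) * rng.uniform(-1.0, 1.0)
            s += xi
            v2 += xi * xi
        if s / math.sqrt(v2) >= x:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    rng = random.Random(2011)
    for x in (0.5, 1.0, 2.0):
        print(x, self_normalized_tail(100, x, 4000, rng), normal_tail(x))
```

Because the statistic is self-normalized, no knowledge of the individual variances is needed, which is the property exploited to handle heteroscedastic errors.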

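Proof of Theorem 5.2. Define the event
\[
\mathcal{G} = \Big\{ \Big| \frac{1}{n} D_Z Z^T U \Big|_\infty \le \sqrt{\widehat Q(\beta^*)}\; r \Big\}.
\]
Since $\widehat Q(\beta^*) = E_n[U^2]$, the union bound yields
\[
P(\mathcal{G}^c) \le \sum_{l=1}^L P\Bigg( \frac{\big| \frac{1}{n} \sum_{i=1}^n z_{li} u_i \big|}{z_{l*} \sqrt{E_n[U^2]}} \ge r \Bigg)
\le \sum_{l=1}^L P\Bigg( \frac{\big| \sum_{i=1}^n z_{li} u_i \big|}{\sqrt{\sum_{i=1}^n (z_{li} u_i)^2}} \ge A \sqrt{2 \log(L)} \Bigg).
\tag{9.6}
\]
By Theorem 9.4, for all l = 1, . . . , L,
\[
P\Bigg( \frac{\big| \sum_{i=1}^n z_{li} u_i \big|}{\sqrt{\sum_{i=1}^n (z_{li} u_i)^2}} \ge A \sqrt{2 \log(L)} \Bigg)
\le 2 \big( 1 - \Phi(A \sqrt{2 \log(L)}) \big) + 2 A_0 \frac{\big(1 + A \sqrt{2 \log(L)}\big)^{1+\delta}}{L^{A^2} d_{n,\delta}^{2+\delta}}.
\tag{9.7}
\]
Thus, the event $\mathcal{G}$ holds with probability at least 1 − α, by the definition of α in (5.1). Set $\Delta \triangleq D_X^{-1} (\widehat\beta - \beta^*)$. On the event $\mathcal{G}$ we have:
\[
|\Psi_n \Delta|_\infty \le \Big| \frac{1}{n} D_Z Z^T (Y - X \widehat\beta) \Big|_\infty + \Big| \frac{1}{n} D_Z Z^T (Y - X \beta^*) \Big|_\infty
\le r \widehat\sigma + \Big| \frac{1}{n} D_Z Z^T U \Big|_\infty
\le r \Big( \widehat\sigma + \sqrt{\widehat Q(\beta^*)} \Big).
\tag{9.8}
\]
Notice that, on the event $\mathcal{G}$, the pair $\big( \beta^*, \sqrt{\widehat Q(\beta^*)} \big)$ belongs to the set $\widehat{\mathcal I}$. On the other hand, $(\widehat\beta, \widehat\sigma)$ minimizes the criterion $|D_X^{-1} \beta|_1 + c\sigma$ on the same set $\widehat{\mathcal I}$. Thus, on the event $\mathcal{G}$,
\[
|D_X^{-1} \widehat\beta|_1 + c \widehat\sigma \le |D_X^{-1} \beta^*|_1 + c \sqrt{\widehat Q(\beta^*)}.
\tag{9.9}
\]
This implies, again on the event $\mathcal{G}$,
\[
\begin{aligned}
|\Delta_{J(\beta^*)^c}|_1 = \sum_{k \in J(\beta^*)^c} |x_{k*} \widehat\beta_k|
&\le \sum_{k \in J(\beta^*)} \big( |x_{k*} \beta_k^*| - |x_{k*} \widehat\beta_k| \big) + c \Big( \sqrt{\widehat Q(\beta^*)} - \sqrt{\widehat Q(\widehat\beta)} \Big) \\
&\le |\Delta_{J(\beta^*)}|_1 + c \Big( \sqrt{\widehat Q(\beta^*)} - \sqrt{\widehat Q(\widehat\beta)} \Big) \\
&\le |\Delta_{J(\beta^*)}|_1 + c\, \frac{\big| E_n[U X^T] D_X \Delta \big|}{\sqrt{E_n[U^2]}} \qquad \text{(by convexity of } \beta \mapsto \sqrt{\widehat Q(\beta)}\text{)} \\
&\le |\Delta_{J(\beta^*)}|_1 + c\, \Bigg| \frac{E_n[U X^T] D_X}{\sqrt{E_n[U^2]}} \Bigg|_\infty |\Delta|_1 \\
&\le |\Delta_{J(\beta^*)}|_1 + c |\Delta|_1 \qquad \text{(by the Cauchy-Schwarz inequality).}
\end{aligned}
\tag{9.10}
\]
Note that (9.10) can be re-written as a cone condition:
\[
|\Delta_{J(\beta^*)^c}|_1 \le \frac{1+c}{1-c}\, |\Delta_{J(\beta^*)}|_1.
\tag{9.11}
\]
Thus, $\Delta \in C_{J(\beta^*)}$ on the event $\mathcal{G}$. Using (9.8) and arguing as in (9.10) we find
\[
\begin{aligned}
|\Psi_n \Delta|_\infty
&\le r \Big( 2 \widehat\sigma + \sqrt{\widehat Q(\beta^*)} - \widehat\sigma \Big) \\
&\le r \Big( 2 \widehat\sigma + \sqrt{\widehat Q(\beta^*)} - \sqrt{\widehat Q(\widehat\beta)} \Big) \qquad \text{(since } \sqrt{\widehat Q(\widehat\beta)} \le \widehat\sigma\text{)} \\
&\le r \Bigg( 2 \widehat\sigma + \frac{\big| E_n[U X^T] D_X \Delta \big|}{\sqrt{E_n[U^2]}} \Bigg) \\
&\le r \Bigg( 2 \widehat\sigma + \max_{j \in J_{\rm end}} \frac{|E_n[U X_j]|}{\sqrt{E_n[X_j^2 U^2]}}\, |\Delta_{J_{\rm end}}|_1 + \max_{j \in J_{\rm end}^c} \frac{|E_n[U X_j]|}{\sqrt{E_n[X_j^2 U^2]}}\, |\Delta_{J_{\rm end}^c}|_1 \Bigg) \\
&\le r \Bigg( 2 \widehat\sigma + |\Delta_{J_{\rm end}}|_1 + \max_{j \in J_{\rm end}^c} \frac{|E_n[U X_j]|}{\sqrt{E_n[X_j^2 U^2]}}\, |\Delta_{J_{\rm end}^c}|_1 \Bigg) \qquad \text{(by the Cauchy-Schwarz inequality).}
\end{aligned}
\tag{9.12}
\]
Since L ≥ K and $x_{ji} = z_{j'i}$ where $j' = j - k_{\rm end}$, $j \in J_{\rm end}^c = \{k_{\rm end} + 1, \dots, K\}$ (the exogenous variables serve as their own instruments), from (9.7) we obtain that, on the event $\mathcal{G}$,
\[
\max_{j \in J_{\rm end}^c} \frac{|E_n[U X_j]|}{\sqrt{E_n[X_j^2 U^2]}} \le r.
\]
Combining this with (9.12) and using the definition of the block sensitivity $\kappa^*_{J_0, J(\beta^*)}$ with $J_0 = J_{\rm end}$, $J_0 = J_{\rm end}^c$, we get that, on the event $\mathcal{G}$,
\[
|\Psi_n \Delta|_\infty \le r \Bigg( 2 \widehat\sigma + \frac{|\Psi_n \Delta|_\infty}{\kappa^*_{J_{\rm end}, J(\beta^*)}} + r\, \frac{|\Psi_n \Delta|_\infty}{\kappa^*_{J_{\rm end}^c, J(\beta^*)}} \Bigg),
\tag{9.13}
\]
which implies
\[
|\Psi_n \Delta|_\infty \le 2 \widehat\sigma r \Bigg( 1 - \frac{r}{\kappa^*_{J_{\rm end}, J(\beta^*)}} - \frac{r^2}{\kappa^*_{J_{\rm end}^c, J(\beta^*)}} \Bigg)_+^{-1}.
\tag{9.14}
\]
This inequality and the definition of the sensitivities yield (5.2) and (5.3). To prove (5.4), it suffices to note that, by (9.9) and by the definition of $\kappa^*_{J(\beta^*), J(\beta^*)}$,
\[
c \widehat\sigma \le |\Delta_{J(\beta^*)}|_1 + c \sqrt{\widehat Q(\beta^*)} \le \frac{|\Psi_n \Delta|_\infty}{\kappa^*_{J(\beta^*), J(\beta^*)}} + c \sqrt{\widehat Q(\beta^*)},
\]
and to combine this inequality with (9.8).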
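The union-bound step (9.6)-(9.7) can be made concrete with a little arithmetic. The sketch below assumes the relation $r = A \sqrt{2 \log(L) / n}$, which is inferred from the passage from the first to the second bound in (9.6), and evaluates only the Gaussian part $2L\,(1 - \Phi(A\sqrt{2\log L}))$ of the resulting bound; the remainder term involving $A_0$ and $d_{n,\delta}$ is left out because $A_0$ is an unspecified absolute constant. The function names and the value A = 1.1 are illustrative assumptions.

```python
import math


def normal_tail(x):
    """1 - Phi(x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def penalty_level(A, L, n):
    """r = A * sqrt(2 log(L) / n), the level assumed in the definition of the event G."""
    return A * math.sqrt(2.0 * math.log(L) / n)


def gaussian_part_of_bound(A, L):
    """Sum over l = 1..L of 2 * (1 - Phi(A * sqrt(2 log L))), i.e. the
    leading term of the bound on P(G^c); the A_0 remainder is ignored."""
    return 2.0 * L * normal_tail(A * math.sqrt(2.0 * math.log(L)))


if __name__ == "__main__":
    L = 50  # number of instruments in the simulation study
    for n in (49, 8000):
        print(n, penalty_level(1.1, L, n))
    for A in (1.0, 1.1, 1.5, 2.0):
        print(A, gaussian_part_of_bound(A, L))
```

The Gaussian part decreases quickly as A grows, while r grows only linearly in A, which is the trade-off between confidence level and tightness of the bounds.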

Proof of Theorem 5.7. Part (i) of the theorem is a consequence of (5.4) and Assumptions 5.4 and 5.5. Parts (ii) and (iii) follow immediately from (5.2), (5.3), and Assumptions 5.4 and 5.5. Part (iv) is straightforward in view of (5.9).


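Proof of Theorem 5.9. Let $\mathcal{G}_j$ be the events of probabilities at least $1 - \gamma_j$ respectively appearing in Assumptions 5.4, 5.6, 5.8. Assume that all these events hold, as well as the event $\mathcal{G}$. Then
\[
\omega_k(s) \le \frac{2 \sigma_* r}{c^*_k(s)\, v_k} \Bigg( 1 + \frac{r}{c\, c^*_{J(\beta^*)}} \Bigg) \Bigg( 1 - \frac{r}{c\, c^*_{J(\beta^*)}} \Bigg)_+^{-1} \Bigg( 1 - \frac{r}{c^*_{J_{\rm end}}(s)} - \frac{r^2}{c^*_{J_{\rm end}^c}(s)} \Bigg)_+^{-1} \triangleq \omega_k^*.
\]
By assumption, $|\beta_k^*| > 2 \omega_k^*$ for $k \in J(\beta^*)$. Note that the following two cases can occur. First, if $k \in J(\beta^*)^c$ (so that $\beta_k^* = 0$) then, using (5.3) and Assumptions 5.4 and 5.8, we obtain $|\widehat\beta_k| \le \omega_k$, which implies $\widetilde\beta_k = 0$. Second, if $k \in J(\beta^*)$, then using again (5.3) we get $\big| |\beta_k^*| - |\widehat\beta_k| \big| \le |\beta_k^* - \widehat\beta_k| \le \omega_k \le \omega_k^*$. Since $|\beta_k^*| > 2 \omega_k^*$ for $k \in J(\beta^*)$, we obtain that $|\widehat\beta_k| > \omega_k$, so that $\widetilde\beta_k = \widehat\beta_k$ and the signs of $\beta_k^*$ and $\widehat\beta_k$ coincide. This yields the result.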

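Proof of Theorem 5.11. Fix an arbitrary subset J of {1, . . . , K}. Acting as in (9.10) with J instead of $J(\beta^*)$, we get:
\[
\begin{aligned}
\sum_{k \in J^c} |x_{k*} \widehat\beta_k| + \sum_{k \in J^c} |x_{k*} \beta_k^*|
&\le \sum_{k \in J} \big( |x_{k*} \beta_k^*| - |x_{k*} \widehat\beta_k| \big) + 2 \sum_{k \in J^c} |x_{k*} \beta_k^*| + c \Big( \sqrt{\widehat Q(\beta^*)} - \sqrt{\widehat Q(\widehat\beta)} \Big) \\
&\le |\Delta_J|_1 + 2 \big| (D_X^{-1} \beta^*)_{J^c} \big|_1 + c |\Delta|_1.
\end{aligned}
\]
This yields
\[
|\Delta_{J^c}|_1 \le |\Delta_J|_1 + 2 \big| (D_X^{-1} \beta^*)_{J^c} \big|_1 + c |\Delta|_1.
\tag{9.15}
\]
Assume now that we are on the event $\mathcal{G}$. Consider the two possible cases. First, if $2 \big| (D_X^{-1} \beta^*)_{J^c} \big|_1 \le |\Delta_J|_1$, then $\Delta \in \widetilde C_J$ and, in particular, (9.13) holds with the sensitivities $\widetilde\kappa_{\bullet, J}$ instead of $\kappa_{\bullet, J(\beta^*)}$. From this, using the definition of the sensitivity $\widetilde\kappa_{p,J}$, we get that $|\Delta|_p$ is bounded from above by the first term of the maximum in (5.14). Second, if $2 \big| (D_X^{-1} \beta^*)_{J^c} \big|_1 > |\Delta_J|_1$, then for any p ∈ [1, ∞] we have a simple bound
\[
|\Delta|_p \le |\Delta|_1 = |\Delta_{J^c}|_1 + |\Delta_J|_1 \le \frac{6}{1-c}\, \big| (D_X^{-1} \beta^*)_{J^c} \big|_1.
\]
In conclusion, $|\Delta|_p$ is smaller than the maximum of the two bounds.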

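Proof of Theorem 6.2. Throughout the proof, we assume that we are on the event of probability at least $1 - \alpha_2$ where (6.4) holds. It follows easily from (6.4) that
\[
\Big| \frac{1}{n} \overline{Z}^T X (\widehat\beta - \beta^*) \Big|_\infty \le \widehat b\, \overline{z}_*.
\tag{9.16}
\]
Next, an argument similar to (9.6) and Theorem 9.4 yield that, with probability at least $1 - \alpha_1$,
\[
\Big| \frac{1}{n} \overline{Z}^T U - \theta^* \Big|_\infty \le r_1 \max_{l = 1, \dots, L_1} \sqrt{\frac{1}{n} \sum_{i=1}^n (\overline{z}_{li} u_i - \theta_l^*)^2} = r_1 F(\theta^*, \beta^*).
\tag{9.17}
\]
In what follows, we assume that we are on the event of probability at least $1 - \alpha_1 - \alpha_2$ where both (9.16) and (9.17) are satisfied. We will use the properties of F(θ, β) stated in the next lemma that we prove in Section 9.4.

Lemma 9.5. We have
\[
F(\theta^*, \widehat\beta) - F(\widehat\theta, \widehat\beta) \le |\widehat\theta - \theta^*|_1,
\tag{9.18}
\]
\[
|F(\theta^*, \widehat\beta) - F(\theta^*, \beta^*)| \le \overline{z}_* \big| D_X^{-1} (\widehat\beta - \beta^*) \big|_1 \le \widehat b\, \overline{z}_*.
\tag{9.19}
\]

We proceed now to the proof of Theorem 6.2. First, we show that the pair $(\theta, \sigma_1) = (\theta^*, F(\theta^*, \beta^*))$ belongs to the set $\widehat{\mathcal I}_1$. Indeed, from (9.16) and (9.17) we get
\[
\Big| \frac{1}{n} \overline{Z}^T (Y - X \widehat\beta) - \theta^* \Big|_\infty
\le \Big| \frac{1}{n} \overline{Z}^T U - \theta^* \Big|_\infty + \Big| \frac{1}{n} \overline{Z}^T X (\widehat\beta - \beta^*) \Big|_\infty
\le r_1 F(\theta^*, \beta^*) + \widehat b\, \overline{z}_*.
\]
Thus, the pair $(\theta, \sigma_1) = (\theta^*, F(\theta^*, \beta^*))$ satisfies the first constraint in the definition of $\widehat{\mathcal I}_1$. It satisfies the second constraint as well, since $F(\theta^*, \widehat\beta) \le F(\theta^*, \beta^*) + \widehat b\, \overline{z}_*$ by (9.19). Now, as $(\theta^*, F(\theta^*, \beta^*)) \in \widehat{\mathcal I}_1$ and $(\widehat\theta, \widehat\sigma_1)$ minimizes $|\theta|_1 + c \sigma_1$ over $\widehat{\mathcal I}_1$, we have
\[
|\widehat\theta|_1 + c \widehat\sigma_1 \le |\theta^*|_1 + c F(\theta^*, \beta^*),
\tag{9.20}
\]
which implies
\[
|\Delta_{J(\theta^*)^c}|_1 \le |\Delta_{J(\theta^*)}|_1 + c \big( F(\theta^*, \beta^*) - \widehat\sigma_1 \big),
\tag{9.21}
\]
where $\Delta = \widehat\theta - \theta^*$. Using the fact that $F(\widehat\theta, \widehat\beta) \le \widehat\sigma_1 + \widehat b\, \overline{z}_*$, (9.18), and (9.19) we obtain
\[
F(\theta^*, \beta^*) - \widehat\sigma_1 \le F(\theta^*, \beta^*) - F(\widehat\theta, \widehat\beta) + \widehat b\, \overline{z}_* \le |\widehat\theta - \theta^*|_1 + 2 \widehat b\, \overline{z}_*.
\tag{9.22}
\]
This inequality and (9.21) yield
\[
|\Delta_{J(\theta^*)^c}|_1 \le |\Delta_{J(\theta^*)}|_1 + c |\widehat\theta - \theta^*|_1 + 2 c\, \widehat b\, \overline{z}_*,
\]
or equivalently,
\[
|\Delta_{J(\theta^*)^c}|_1 \le \frac{1+c}{1-c}\, |\Delta_{J(\theta^*)}|_1 + \frac{2c}{1-c}\, \widehat b\, \overline{z}_*.
\tag{9.23}
\]
Next, using (9.16), (9.17) and the second constraint in the definition of $(\widehat\theta, \widehat\sigma_1)$, we find
\[
|\widehat\theta - \theta^*|_\infty
\le \Big| \frac{1}{n} \overline{Z}^T (Y - X \widehat\beta) - \widehat\theta \Big|_\infty + \Big| \frac{1}{n} \overline{Z}^T U - \theta^* \Big|_\infty + \Big| \frac{1}{n} \overline{Z}^T X (\widehat\beta - \beta^*) \Big|_\infty
\le r_1 \big( \widehat\sigma_1 + F(\theta^*, \beta^*) \big) + 2 \widehat b\, \overline{z}_*.
\]
This and (9.22) yield
\[
|\widehat\theta - \theta^*|_\infty \le r_1 \big( 2 \widehat\sigma_1 + |\widehat\theta - \theta^*|_1 \big) + 2 (1 + r_1)\, \widehat b\, \overline{z}_*.
\tag{9.24}
\]
On the other hand, (9.23) implies
\[
|\widehat\theta - \theta^*|_1 \le \frac{2}{1-c}\, |\Delta_{J(\theta^*)}|_1 + \frac{2c}{1-c}\, \widehat b\, \overline{z}_*
\le \frac{2 |J(\theta^*)|}{1-c}\, |\widehat\theta - \theta^*|_\infty + \frac{2c}{1-c}\, \widehat b\, \overline{z}_*.
\tag{9.25}
\]
Inequalities (6.7) and (6.8) follow from solving (9.24) and (9.25) with respect to $|\widehat\theta - \theta^*|_\infty$ and $|\widehat\theta - \theta^*|_1$ respectively.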

Proof of Theorem 6.4. We first prove part (i). We will assume that we are on the event of probability at least 1 − α1 − α2 − ε where (9.17), (6.9), and (6.11) are simultaneously satisfied. From (9.20) and the fact that (6.9) can be written as F (θ ∗ , β ∗ ) ≤ σ1∗ we obtain σ b1 ≤ |θb − θ ∗ |1 /c + σ1∗ .

(9.26)

Note also that the argument in the proof of Theorem 6.2 and the results of that theorem remain obviously valid with bb replaced by b∗ . Thus, we can use (6.8) with bb replaced by b∗ , and combining it with (9.26) we obtain

(9.27) This and (6.7) yield (6.12).

σ b1 ≤ σ ∗ .

We now prove part (ii) of the theorem. In the rest of the proof, we assume that we are on the event G ′ of probability at least 1 − α1 − ε − γ where (9.17), (6.9), and the events G, Gj defined in the proofs of Theorems 5.2, 5.7 are simultaneously satisfied. Then item (ii) of Theorem 5.7 with p = 1 implies (6.11) with b∗ defined in (6.13). This and (6.12) easily give part (ii) of the theorem.


To prove part (iii), note that, by Theorem 5.7 (i) and Assumption 5.8, !−1 2 2b σ rs r r bb = (9.28) ≤ b∗ 1− ∗ − κ1 (s) κJend (s) κ∗J c (s) +

end

b ≤ for b∗ defined in (6.13). This and (9.27) imply that the threshold ω satisfies ω , V (b σ1 , bb, J(θ))

V (σ ∗ , b∗ , s1 ) , ω ∗ on the event G ′ . On the other hand, (6.7) guarantees that |θbl − θl∗ | ≤ ω and, by

assumption, |θl∗ | > 2ω ∗ for all l ∈ J(θ ∗ ). In addition, by (5.2) and (6.7) for all l ∈ J(θ ∗ )c we have |θl∗ | < ω, which implies θel = 0. We finish the proof in the same way as the proof of Theorem 5.7. 9.4. Proof of Lemma 9.5. Set fl (θl ) ,

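$\sqrt{\widehat Q_l(\theta_l, \widehat\beta)}$ and $f(\theta) \triangleq \max_{l = 1, \dots, L_1} f_l(\theta_l) \equiv F(\theta, \widehat\beta)$. The mappings $\theta \mapsto f_l(\theta_l)$ are convex, so that by the Dubovitsky-Milutin theorem (see, e.g., Alekseev, Tikhomirov and Fomin (1987), Chapter 2), the subdifferential of their maximum f is contained in the convex hull of the union of the subdifferentials of the $f_l$:
\[
\partial f \subseteq \mathrm{Conv} \Bigg( \bigcup_{l=1}^{L_1} \partial f_l \Bigg).
\tag{9.29}
\]
Since, obviously, $\partial f_l(\theta_l) \subseteq [-1, 1]$, we find that $\partial f(\theta) \subseteq \{ w \in \mathbb{R}^{L_1} : |w|_\infty \le 1 \}$ for all $\theta \in \mathbb{R}^{L_1}$. Using this property and the convexity of f, we get
\[
f(\theta^*) - f(\widehat\theta) \le \langle w, \theta^* - \widehat\theta \rangle \le |\widehat\theta - \theta^*|_1, \qquad \forall\, w \in \partial f(\theta^*),
\]
where $\langle \cdot, \cdot \rangle$ denotes the standard inner product in $\mathbb{R}^{L_1}$. This yields (9.18).

The proof of (9.19) is based on similar arguments. Instead of $f_l$, we now introduce the functions $g_l$ defined by $g_l(\beta) \triangleq \sqrt{\widehat Q_l(\theta_l^*, \beta)}$, and set $g(\beta) \triangleq \max_{l = 1, \dots, L_1} g_l(\beta) \equiv F(\theta^*, \beta)$. Next, notice that the subdifferential of $g_l$ satisfies $\partial g_l(\beta) \subseteq \{ w \in \mathbb{R}^K : |w_k| \le a_{lk},\ k = 1, \dots, K \}$ for all $\beta \in \mathbb{R}^K$, $l = 1, \dots, L_1$, where
\[
a_{lk} = \frac{\Big| \frac{1}{n} \sum_{i=1}^n \overline{z}_{li} x_{ki} \big( \overline{z}_{li} (y_i - x_i^T \beta) - \theta_l^* \big) \Big|}{\sqrt{\frac{1}{n} \sum_{i=1}^n \big( \overline{z}_{li} (y_i - x_i^T \beta) - \theta_l^* \big)^2}}.
\]
Consequently, by the Cauchy-Schwarz inequality, $D_X \partial g_l(\beta) \subseteq \{ w \in \mathbb{R}^K : |w|_\infty \le \overline{z}_* \}$ for all $\beta \in \mathbb{R}^K$, $l = 1, \dots, L_1$. This and (9.29) with g, $g_l$ instead of f, $f_l$ imply $D_X \partial g(\beta) \subseteq \{ w \in \mathbb{R}^K : |w|_\infty \le \overline{z}_* \}$ for all $\beta \in \mathbb{R}^K$. Using this property and the convexity of g, we get
\[
g(\beta) - g(\beta') \le \langle w, \beta - \beta' \rangle \le |D_X w|_\infty \big| D_X^{-1} (\beta - \beta') \big|_1 \le \overline{z}_* \big| D_X^{-1} (\beta - \beta') \big|_1, \qquad \forall\, w \in \partial g(\beta),
\]
for any $\beta, \beta' \in \mathbb{R}^K$. This proves (9.19).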
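The Lipschitz property behind (9.18) can be checked numerically. For a function of the form $f(\theta) = \max_l \sqrt{\frac{1}{n}\sum_i (v_{li} - \theta_l)^2}$, each term is 1-Lipschitz in its own coordinate, so $|f(\theta) - f(\theta')| \le |\theta - \theta'|_\infty \le |\theta - \theta'|_1$. The sketch below tests this on random inputs. It is an illustration only: the array `V` stands in for the products $\overline{z}_{li}(y_i - x_i^T \widehat\beta)$ and is filled with synthetic Gaussian numbers, and all names and sizes are assumptions.

```python
import math
import random


def f(theta, V):
    """max over l of sqrt( (1/n) * sum_i (V[l][i] - theta[l])**2 )."""
    return max(
        math.sqrt(sum((v - t) ** 2 for v in row) / len(row))
        for row, t in zip(V, theta)
    )


def check_l1_lipschitz(L1=8, n=30, trials=500, seed=1):
    """Return True if |f(a) - f(b)| <= |a - b|_1 on all sampled pairs (a, b)."""
    rng = random.Random(seed)
    # Synthetic stand-in for the data-dependent quantities (hypothetical values).
    V = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(L1)]
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(L1)]
        b = [rng.gauss(0.0, 1.0) for _ in range(L1)]
        gap = abs(f(a, V) - f(b, V))
        if gap > sum(abs(x - y) for x, y in zip(a, b)) + 1e-12:
            return False
    return True


if __name__ == "__main__":
    print(check_l1_lipschitz())
```

The same reasoning, with the subgradient bounds $a_{lk}$ in place of 1, gives the weighted ℓ1 bound used for (9.19).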

References

[1] Alekseev, V. M., V. M. Tikhomirov, and S. V. Fomin (1987): Optimal Control. Consultants Bureau, New York.
[2] Amemiya, T. (1974): “The Non-Linear Two-Stage Least Squares Estimator”. Journal of Econometrics, 2, 105–110.
[3] Andrews, D. W. K. (1999): “Consistent Moment Selection Procedures for Generalized Method of Moments Estimation”. Econometrica, 67, 543–564.
[4] Andrews, D. W. K., and J. H. Stock (2007): “Inference with Weak Instruments”, in: Advances in Economics and Econometrics Theory and Applications, Ninth World Congress, Blundell, R., W. K. Newey, and T. Persson, Eds, 3, 122–174, Cambridge University Press.
[5] Angrist, J. D., and A. B. Krueger (1991): “Does Compulsory School Attendance Affect Schooling and Earnings?”. Quarterly Journal of Economics, 106, 979–1014.
[6] Bai, J., and S. Ng (2009): “Selecting Instrumental Variables in a Data Rich Environment”. Journal of Time Series Econometrics, 1, 105–110.
[7] Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2010): “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain”. Preprint: arXiv:1010.4345.
[8] Belloni, A., and V. Chernozhukov (2010): “Post L1-Penalized Estimators in High-Dimensional Linear Regression models”. Preprint: arXiv:1001.0188v2.
[9] Belloni, A., V. Chernozhukov, and L. Wang (2010): “Square-Root Lasso: Pivotal Recovery of Sparse Signals Via Conic Programming”. Preprint: arXiv:1009.5689.
[10] Belloni, A., and V. Chernozhukov (2011a): “L1-Penalized Quantile Regression in High-Dimensional Sparse Models”. The Annals of Statistics, 39, 82–130.
[11] Belloni, A., and V. Chernozhukov (2011b): “High Dimensional Sparse Econometric Models: an Introduction”, in: Inverse Problems and High Dimensional Estimation, Stats in the Château 2009, Alquier, P., E. Gautier, and G. Stoltz, Eds., Lecture Notes in Statistics, 203, 127–162, Springer, Berlin.
[12] Bickel, P., J. Y. Ritov, and A. B. Tsybakov (2009): “Simultaneous Analysis of Lasso and Dantzig Selector”. The Annals of Statistics, 37, 1705–1732.
[13] Bühlmann, P., and S. A. van de Geer (2011): Statistics for High-Dimensional Data. Springer, New-York.
[14] Caner, M. (2009): “LASSO Type GMM Estimator”. Econometric Theory, 25, 1–23.
[15] Candès, E., and T. Tao (2007): “The Dantzig Selector: Statistical Estimation when p is Much Larger Than n”. The Annals of Statistics, 35, 2313–2351.
[16] Carrasco, M., and J. P. Florens (2000): “Generalization of GMM to a Continuum of Moment Conditions”. Econometric Theory, 16, 797–834.
[17] Carrasco, M., and J.-P. Florens (2008): “On the Asymptotic Efficiency of GMM”. Working Paper.
[18] Carrasco, M. (2008): “A Regularization Approach to the Many Instruments Problem”. Working Paper.
[19] Chamberlain, G. (1987): “Asymptotic Efficiency in Estimation with Conditional Moment Restrictions”. Journal of Econometrics, 34, 305–334.
[20] Chao, J. C., and N. R. Swanson (2005): “Consistent Estimation with a Large Number of Weak Instruments”. Econometrica, 73, 1673–1692.
[21] Dalalyan, A., and A. B. Tsybakov (2008): “Aggregation by Exponential Weighting, Sharp PAC-Bayesian Bounds and Sparsity”. Journal of Machine Learning Research, 72, 39–61.
[22] Donald, S. G., and W. K. Newey (2001): “Choosing the Number of Instruments”. Econometrica, 69, 1161–1191.
[23] Donoho, D. L., M. Elad, and V. N. Temlyakov (2006): “Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise”. IEEE Transactions on Information Theory, 52, 6–18.
[24] Hahn, J., and J. Hausman (2002): “A New Specification Test for the Validity of Instrumental Variables”. Econometrica, 70, 163–189.
[25] Hall, A. R., and F. P. M. Peixe (2003): “A Consistent Method for the Selection of Relevant Instruments”. Econometric Reviews, 22, 269–287.
[26] Hansen, C., J. Hausman, and W. K. Newey (2008): “Estimation with Many Instrumental Variables”. Journal of Business and Economic Statistics, 26, 398–422.
[27] Hausman, J., W. K. Newey, T. Woutersen, J. Chao, and N. Swanson (2009): “Instrumental Variable Estimation with Heteroskedasticity and Many Instruments”. Working Paper.
[28] Jing, B.-Y., Q. M. Shao, and Q. Wang (2003): “Self-Normalized Cramér-Type Large Deviations for Independent Random Variables”. The Annals of Probability, 31, 2167–2215.
[29] Koltchinskii, V. (2009): “The Dantzig Selector and Sparsity Oracle Inequalities”. Bernoulli, 15, 799–828.
[30] Koltchinskii, V. (2011): Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Forthcoming in Lecture Notes in Mathematics, Springer, Berlin.
[31] Liao, Z. (2010): “Adaptive GMM Shrinkage Estimation with Consistent Moment Selection”. Working Paper.
[32] Lounici, K. (2008): “Sup-Norm Convergence Rate and Sign Concentration Property of the Lasso and Dantzig Selector”. Electronic Journal of Statistics, 2, 90–102.
[33] Newey, W. K. (1990): “Efficient Instrumental Variables Estimation of Nonlinear Models”. Econometrica, 58, 809–837.
[34] Okui, R. (2008): “Instrumental Variable Estimation in the Presence of Many Moment Conditions”. Journal of Econometrics, forthcoming.
[35] Rigollet, P., and A. B. Tsybakov (2011): “Exponential Screening and Optimal Rates of Sparse Estimation”. The Annals of Statistics, 35, 731–771.
[36] Rosenbaum, M., and A. B. Tsybakov (2010): “Sparse Recovery Under Matrix Uncertainty”. The Annals of Statistics, 38, 2620–2651.
[37] Sala-i-Martin, X. (1997): “I Just Ran Two Million Regressions”. The American Economic Review, 87, 178–183.
[38] Sturm, J. F. (1999): “Using SeDuMi 1.02, a Matlab Toolbox for Optimization Over Symmetric Cones”. Optimization Methods and Software, 11, 625–653.
[39] Wooldridge, J. M. (2002): Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge.
[40] Ye, F., and C.-H. Zhang (2010): “Rate Minimaxity of the Lasso and Dantzig Selector for the lq Loss in lr Balls”. Journal of Machine Learning Research, 11, 3519–3540.
