Notes on “The Asymptotic Variance of Semiparametric Estimators” (Newey, 1994)


0. Background: Semiparametric Model

Reference: Lecture notes from Prof. Bodhisattva Sen (Columbia University)

A semiparametric model is a statistical model that involves both parametric and nonparametric (infinite-dimensional) components. However, we are mostly interested in estimation and inference of a finite-dimensional parameter in the model.


Example 1 (population mean)

Suppose that $X_1, \ldots, X_n$ are i.i.d. $P$ belonging to a class $\mathcal{P}$ of distributions. Let $\psi(P) \equiv \mathbb{E}_P\left[X_1\right]$, the mean of the distribution, be the parameter of interest.

Question:

  • Suppose that $\mathcal{P}$ is the class of all distributions that have a finite variance.
  • What is the most efficient estimator of $\psi(P)$, i.e., what is the estimator with the best asymptotic performance?
A model $\mathcal{P}$ is simply a collection of probability distributions for the data we observe.
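
As a quick numerical illustration (not from the notes themselves), here is a minimal Monte Carlo sketch, assuming only `numpy`; the exponential choice of $P$ and the sample sizes are illustrative:

```python
import numpy as np

# Monte Carlo sketch: the sample mean's sampling variance matches Var(X)/n,
# the nonparametric efficiency bound when P is restricted only to have a
# finite variance.  The exponential P below is an arbitrary illustration.
rng = np.random.default_rng(0)
n, n_reps = 200, 5000
draws = rng.exponential(scale=2.0, size=(n_reps, n))   # Var(X) = 4

sample_means = draws.mean(axis=1)
print("empirical variance of sample mean:", sample_means.var())
print("bound Var(X)/n                   :", 4.0 / n)
```

The two numbers should agree up to simulation error, consistent with the sample mean being an efficient estimator in this model (a fact returned to in Section 2).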

Example 2 (partial linear regression model)

Suppose that we observe i.i.d. data $\left\{X_i \equiv\left(Y_i, Z_i, V_i\right): i=1, \ldots, n\right\}$ from the following partial linear regression model: $$ Y_i=Z_i^{\top} \beta+g\left(V_i\right)+\epsilon_i $$

  • $Y_i$ is the scalar response variable
  • $Z_i$ and $V_i$ are vectors of predictors
  • $g(\cdot)$ is the unknown (nonparametric) function
  • $\epsilon_i$ is the unobserved error.

For simplicity and to focus on the semiparametric nature of the problem:

  • Assume that $\left(Z_i, V_i\right)$ has a known joint density $f(\cdot, \cdot)$ and is independent of $\epsilon_i \sim N\left(0, \sigma^2\right)$, with $\sigma^2$ known.
  • Under these assumptions, the model has a parametric component $\beta$ and a nonparametric component $g(\cdot)$; a simulation sketch of this data-generating process is given below.
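
A minimal data-generating sketch for this model; the particular $\beta$, $g$, and $\sigma$ below are illustrative choices, not part of the model:

```python
import numpy as np

# Simulate the partial linear model Y = Z'beta + g(V) + eps.
# beta_true, g, and sigma are illustrative; the model only requires
# (Z, V) independent of eps ~ N(0, sigma^2), with f and sigma known.
rng = np.random.default_rng(1)
n = 500
beta_true = np.array([1.0, -0.5])            # parametric component beta
g = lambda v: np.sin(2.0 * np.pi * v)        # nonparametric component g
Z = rng.normal(size=(n, 2))
V = rng.uniform(size=n)
eps = rng.normal(scale=0.3, size=n)          # error, independent of (Z, V)
Y = Z @ beta_true + g(V) + eps               # observed response
```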


“Separated semiparametric model”: We say that the model $\mathcal{P}=\left\{P_{\nu, \eta}\right\}$ is a “separated semiparametric model” when $\nu$ is a “Euclidean parameter” and $\eta$ runs through a nonparametric class of distributions (or some infinite-dimensional set). This gives a semiparametric model in the strict sense, in which we aim to estimate $\nu$ and treat $\eta$ as a nuisance parameter.

“Frequent questions for semiparametric models”: consider the estimation of a parameter of interest $\nu=\nu(P)$, where the data has distribution $P \in \mathcal{P}$:

  • (Q1) How well can we estimate $\nu=\nu(P)$? What is our “gold standard”?
  • (Q2) Can we compare absolute “in principle” standards for estimation of $\nu$ in a model $\mathcal{P}$ with estimation of $\nu$ in a submodel $\mathcal{P}_0 \subset \mathcal{P}$? What is the effect of not knowing $\eta$ on estimation of $\nu$ when $\mathcal{P}=\left\{P_\theta: \theta \equiv(\nu, \eta) \in \Theta\right\}$?
  • (Q3) How do we construct efficient estimators of $\nu(P)$ ?

1. Introduction

  • Develop a general form for the asymptotic variance of semiparametric estimators that depend on nonparametric estimators of functions.
  • The formula is often straightforward to derive, requiring only some calculus.
  • Although the formula is not based on primitive conditions, it should be useful for semiparametric estimators, just as analogous formulae are for parametric estimators.
  • The formula gives the form of remainder terms, which facilitates specification of primitive conditions.
  • It can also be used to make asymptotic efficiency comparisons and to find an efficient estimator in some class.
Outline of the paper:

  • Derivation of the formula: **Section 2**
  • Propositions about semiparametric estimators: **Sections 3 and 4**
  • High-level regularity conditions: **Section 5**
  • Conditions for $\sqrt{n}$-consistency and asymptotic normality: **Section 6**
  • Primitive conditions for the examples: **Section 7**

2. The Pathwise Derivative Formula For the Asymptotic Variance

Preliminary: one-step estimators and pathwise derivatives

The formula is based on the observation that $\sqrt{n}$-consistent nonparametric estimators are often efficient.

For example, the sample mean is known to be an efficient estimator of the population mean in a nonparametric model where no restrictions, other than regularity conditions (e.g. existence of the second moment), are placed on the distribution of the data.

Idea

  • Calculate the asymptotic variance of a semiparametric estimator as the variance bound for the functional that it nonparametrically estimates.
  • In other words, the formula is the variance bound for the functional that is the limit of the estimator under general misspecification.

Let $z_1, \ldots, z_n$ be i.i.d. data, with (true) distribution $F_0$ of $z_i$, and let $\hat{\beta}=\beta_n\left(z_1, \ldots, z_n\right)$ denote a $q \times 1$ vector of estimators. Suppose $\hat{\beta}$ can be associated with a family of distributions and a functional as $$ \hat{\beta} \rightarrow \begin{cases}\mathscr{F}=\{F\} ; & \text { general family of distributions of } z \\ \mu: \mathscr{F} \rightarrow \mathbb{R}^q ; & \text { if } z_i \text { has distribution } F \text { then } {\color{red}\operatorname{plim}(\hat{\beta})=\mu(F)}\end{cases} $$

  • The word “general” is taken to mean that $\mathscr{F}$ is unrestricted, except for regularity conditions, and allows for general misspecification.
  • This equation also specifies that $\mu(F)$ is the limit of $\hat{\beta}$ when $z_i$ has distribution $F$.
  • $\mu(F)$ traces out the limits of $\hat{\beta}$ as $F$ varies within the general family $\mathscr{F}$.
  • The variance formula for $\hat{\beta}$ is the semiparametric bound for estimation of $\mu(F)$, calculated as in Koshevnik and Levit (1976), Pfanzagl and Wefelmeyer (1982), and others.

Let $\{F_\theta: F_\theta \in \mathscr{F}\}$ denote a one-dimensional subfamily of $\mathscr{F}$, i.e. a path in $\mathscr{F}$, that is equal to the true distribution $F_0$ when $\theta=0$.

  • Suppose that $F_\theta$ has a density $d F_\theta$ and a corresponding score $${\color{red}S(z)=\frac{\partial \ln \left(d F_\theta\right)}{ \partial \theta}\Big|_{\theta=0}}.$$
  • Suppose that the set of scores can approximate in mean square any mean zero, finite variance function of $z$.
  • Let $E[\cdot]$ denote the expectation at the true distribution $F_0$.

The pathwise derivative of $\mu(F)$ is a $q \times 1$ vector $d(z)$ with $E[d(z)]=0$ and $E\left[|d(z)|^2\right]<\infty$ such that for every path, $$ {\color{red}\frac{\partial \mu\left(F_\theta\right)}{\partial \theta}=E[d(z) S(z)].}\qquad(\star) $$

  • The variance bound for estimation of $\mu(F)$ is $\operatorname{Var}(d(z))$.
  • Thus, the asymptotic variance formula suggested here is the variance of the pathwise derivative of the functional $\mu(F)$ that is estimated under general misspecification.
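
As a sanity check of $(\star)$, consider the mean functional $\mu(F)=E_F[z]$ (scalar case); this is a standard verification, not taken from the paper. Differentiating along a path and using $E[S(z)]=0$, $$ \frac{\partial \mu\left(F_\theta\right)}{\partial \theta}\Big|_{\theta=0}=\int z \,\frac{\partial\, d F_\theta}{\partial \theta}\Big|_{\theta=0}\, d z=E[z S(z)]=E[(z-E[z]) S(z)], $$ so $d(z)=z-E[z]$ and the bound $\operatorname{Var}(d(z))=\operatorname{Var}(z)$ is exactly the asymptotic variance of the sample mean, matching the efficiency claim above.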
Example

  • Consider the parameter $\beta_0=\int f_0(z)^2 d z$, where $f_0(z)$ is the density function of $z_i$.
  • One estimator is $\tilde{\beta}=\sum_{i=1}^n \hat{f}\left(z_i\right) / n$, where $\hat{f}(z)$ is a nonparametric estimator (e.g. a kernel estimator) of the density of $z_i$.
  • Suppose $z$ is symmetrically distributed around zero. Then one might hope to improve efficiency by using the antithetic estimate $\hat{f}(-z)$ of the density to form $$\hat{\beta}=\sum_{i=1}^n\frac{[\hat{f}(z_i)+\hat{f}(-z_i)]}{2}.$$
  • The asymptotic variance can be found by calculating the limit of $\hat{\beta}$ under general misspecification, where $z$ need not be symmetric about zero, and the pathwise derivative of this limit.
    • Let $E_F[\cdot]$ denote the expectation at a distribution $F$ and let $E_\theta[\cdot]=E_{F_\theta}[\cdot]$ for a path $F_\theta$.
    • By an appropriate uniform law of large numbers the limit of $\hat{\beta}$ is $\mu(F)=\int[f(z)+f(-z)] f(z) d z / 2$.
  • Assuming that differentiation inside the integral is allowed, $$ \begin{aligned} \frac{\partial \mu\left(F_\theta\right)}{ \partial \theta} &=\int\left[\frac{\partial f_\theta(z) }{ \partial \theta}\right] f_0(z)\, d z +\frac{1}{2}\left\{\int\left[\frac{\partial f_\theta(-z)}{ \partial \theta}\right] f_0(z)\, d z +\int\left[\frac{\partial f_\theta(z) }{ \partial \theta}\right] f_0(-z)\, d z\right\}\\ &=E\left[\left\{f_0(z)+f_0(-z)\right\} S(z)\right]\\ &=E[d(z) S(z)], \end{aligned} $$ where the second equality uses the change of variables $z \mapsto -z$ in the middle integral and $\partial f_\theta(z) / \partial \theta=S(z) f_0(z)$ at $\theta=0$.
  • Thus, in this example the asymptotic variance formula is ${\color{red}\operatorname{Var}\left(2 f_0(z)\right)}$, which is the well known asymptotic variance of $\tilde{\beta}$, so no efficiency improvement results.
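
A Monte Carlo sketch checking this formula; the U-statistic variant of $\tilde{\beta}$ below, the fixed bandwidth, and the normal $f_0$ are all illustrative choices:

```python
import numpy as np

# Check the formula Var(2 f0(z)) for beta0 = \int f0(z)^2 dz, with f0 the
# N(0,1) density.  beta_hat is the leave-one-out kernel (U-statistic) form
# (1/(n(n-1))) sum_{i != j} K_h(z_i - z_j) of the estimator of \int f^2.
rng = np.random.default_rng(2)
phi = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def beta_hat(z, h=0.2):
    k = phi((z[:, None] - z[None, :]) / h) / h
    n = z.size
    return (k.sum() - np.trace(k)) / (n * (n - 1))

n, n_reps = 500, 1000
stats = [beta_hat(rng.standard_normal(n)) for _ in range(n_reps)]
beta0 = 1.0 / (2.0 * np.sqrt(np.pi))                 # \int phi^2 dz
var_formula = 4.0 * (1.0 / (2.0 * np.pi * np.sqrt(3.0)) - beta0**2)
print("n * empirical variance of beta_hat:", n * np.var(stats))
print("Var(2 f0(z)) from the formula     :", var_formula)
```

The two values should roughly agree, up to smoothing bias and simulation error; here $E\left[f_0(z)^2\right]=\int \phi^3=1 /(2 \pi \sqrt{3})$ and $E\left[f_0(z)\right]=\int \phi^2=1 /(2 \sqrt{\pi})$.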

The pathwise derivative generalizes the Gateaux derivative formula for von-Mises estimators:

  • The pathwise derivative formula works for estimators that are explicit functions of densities or expectations, where the domain of $\mu(F)$ may only include continuous distributions.
  • The Gateaux derivative formula, by contrast, only applies when the domain of $\mu(F)$ also includes discrete distributions.

A precise justification for the asymptotic variance formula is available when $\hat\beta$ is asymptotically equivalent to a sample average: define $\hat{\beta}$ to be asymptotically linear with influence function $\psi(z)$ when $z_i$ has distribution $F_0$: $$ \begin{aligned} & \sqrt{n}\left(\hat{\beta}-\beta_0\right)=\sum_{i=1}^n \psi\left(z_i\right) / \sqrt{n}+o_p(1), \\ & E[\psi(z)]=0, \quad \operatorname{Var}(\psi(z)) \text { finite. } \end{aligned} $$

Asymptotic linearity and the central limit theorem imply $\hat{\beta}$ is asymptotically normal with variance $\operatorname{Var}(\psi(z))$.

Regular path & regular estimator

  • Define the path $\left\{F_\theta: \theta \in(-\varepsilon, \varepsilon) \subset \mathbb{R}, \varepsilon>0, F_\theta \in \mathscr{F}\right\}$ to be regular if each distribution is absolutely continuous w.r.t. the same dominating measure and $S(z)$ satisfies the mean-square derivative condition $$ \int\left[\theta^{-1}\left(d F_\theta^{1 / 2}-d F_0^{1 / 2}\right)-\frac{1}{2} S(z) d F_0^{1 / 2}\right]^2 d z \rightarrow 0, \text { as } \quad \theta \rightarrow 0 . $$

  • Define $\hat{\beta}$ to be a regular estimator of $\mu(F)$ if for any regular path and $\theta_n=O(1 / \sqrt{n})$, when $z_i$ has distribution $F_{\theta_n}$, $\sqrt{n}\left(\hat{\beta}-\mu\left(F_{\theta_n}\right)\right)$ has a limiting distribution that does not depend on $\left\{\theta_n\right\}_{n=1}^{\infty}$.
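
A standard way to exhibit regular paths with (approximately) any prescribed score, useful for verifying condition (ii) of Theorem 2.1 below (this construction is standard in the literature rather than specific to the paper): for any bounded $s(z)$ with $E[s(z)]=0$, set $$ d F_\theta(z)=\big(1+\theta\, s(z)\big)\, d F_0(z), \qquad |\theta|<1 / \sup_z|s(z)|, $$ which is a valid probability distribution with score $\partial \ln d F_\theta / \partial \theta\big|_{\theta=0}=s(z)$; truncation and mean-square approximation arguments then extend this to arbitrary mean-zero, finite-variance scores.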

The pathwise derivative asymptotic variance formula

THEOREM 2.1: Suppose that

  • (i) the set of scores for regular paths is linear;
  • (ii) for any $\varepsilon>0$ and measurable $s(z)$ with $E[s(z)]=0$ and $E\left[s(z)^2\right]<\infty$, there is a regular path with score $S(z)$ satisfying $E\left[|s(z)-S(z)|^2\right]<\varepsilon$;
  • (iii) $\hat{\beta}$ is asymptotically linear and regular.

Then there is a $d(z)$ such that equation $(\star)$ is satisfied and $\psi(z)=d(z)$.

  • Condition (ii), that the scores can approximate any mean zero function, is the precise version of the “generality” property of $\mathscr{F}$.
  • Regularity of $\hat{\beta}$ is the precise condition that specifies that $\hat{\beta}$ is a nonparametric estimator of $\mu(F)$.
  • The innovation is to calculate the bound for the functional $\mu(F)$ that is nonparametrically estimated by $\hat\beta$.
  • Asymptotic linearity and regularity imply pathwise differentiability
  • Condition (ii) implies that there is only one influence function and that it equals the pathwise derivative

  • Theorem 2.1 gives a justification for the pathwise derivative formula, rather than an approach to showing asymptotic normality.
  • A better approach:
    • Solve equation $(\star)$ for the pathwise derivative, as a candidate for the influence function
    • Formulate regularity conditions for the remainder $\sqrt{n}\left(\hat{\beta}-\beta_0\right)-\sum_{i=1}^n \psi\left(z_i\right) / \sqrt{n}$ to be small.
  • The formula is a very important part of this approach, because it provides the form of the remainder.
  • This approach, with formal calculation followed by regularity conditions, is similar to that used in parametric asymptotic theory (e.g. for Edgeworth expansions).

3. Semiparametric $M$-estimators

  • Let $h$ denote a function that can depend on the parameters $\beta$ and the data $z$.
  • Let $m(z, \beta, h)$ be a vector of functions with the same dimension as $\beta$. Here $m(z, \beta, h)$ can depend on the entire function $h$, rather than just its value at particular points, so $m(z, \beta, h)$ is a vector of functionals.
  • Suppose that $E\left[m\left(z, \beta_0, h_0\right)\right]=0$ for the true values $\beta_0$ and $h_0$.
  • Let $\hat{h}$ denote an estimator of $h$. A semiparametric $m$-estimator is one that solves a moment equation of the form $$ {\color{red}\frac{1}{n}\sum_{i=1}^n m\left(z_i, \beta, \hat{h}\right)=0} .\qquad (\Delta) $$
  • The general idea here is that $\hat{\beta}$ is obtained by a procedure that “plugs-in” an estimated function $\hat{h}$.
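
A minimal sketch of this plug-in structure; everything here (the toy model, $h_0(x)=E[y \mid x]$, the scalar moment function $m(z, \beta, h)=h(x)-\beta$, and the bandwidth) is an illustrative assumption, not the paper's construction:

```python
import numpy as np
from scipy import optimize

# Plug-in m-estimation as in equation (Delta): first estimate h
# nonparametrically, then solve the sample moment equation in beta.
# Toy case: h0(x) = E[y|x], beta0 = E[h0(x)], m(z, beta, h) = h(x) - beta.
rng = np.random.default_rng(3)
n = 400
x = rng.uniform(-1.0, 1.0, size=n)
y = x**2 + rng.normal(scale=0.2, size=n)           # so beta0 = E[x^2] = 1/3

def h_hat(x0, bw=0.2):
    # Nadaraya-Watson kernel regression estimate of E[y | x = x0]
    w = np.exp(-0.5 * ((x - x0) / bw) ** 2)
    return np.sum(w * y) / np.sum(w)

def sample_moment(beta):
    # (1/n) sum_i m(z_i, beta, h_hat)
    return np.mean([h_hat(xi) for xi in x]) - beta

beta_est = optimize.brentq(sample_moment, -10.0, 10.0)
print("beta_hat:", beta_est, " (beta0 = 1/3)")
```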
Example 3.1: Quasi-Maximum Likelihood for a Conditional Mean Index

  • The conditional mean index model: $E[y \mid x]=\tau\left(v\left(x, \beta_0\right)\right)$ for a known function $v(x, \beta)$ and an unknown function $\tau(\cdot)$.
  • Let $\hat{h}(x, \beta)$ be a nonparametric estimator of $E[y \mid v(x, \beta)]$, such as a kernel estimator.
  • An estimator of $\beta_0$ suggested by Ichimura (1993) minimizes $\sum_{i=1}^n\left[y_i-\hat{h}\left(x_i, \beta\right)\right]^2$.
  • When $y_i$ is binary, Klein and Spady (1993) have suggested maximizing $$\sum_{i=1}^n\left\{y_i \ln [\hat{h}(x_i, \beta)]+(1-y_i) \ln [1-\hat{h}(x_i, \beta)]\right\}$$
  • A generalization of these estimators is a quasi-maximum likelihood estimator (QMLE) for an exponential family.
  • The estimator will be efficient when the true distribution has the exponential form.
  • To describe the estimator, let $l(u, \nu)=\exp (A(\nu)+B(u)+C(\nu) u)$ be a linear exponential density, with mean $\nu$.
  • Consider an estimator $\hat{\beta}$ that maximizes $\sum_{i=1}^n \ln l\left(y_i, \hat{h}\left(x_i, \beta\right)\right)$.
  • The first order conditions for this estimator make it a special case of equation $(\Delta)$, with $$ m(z, \beta, h)=\left[A_\nu(h(x, \beta))+C_\nu(h(x, \beta)) y\right] \frac{\partial h(x, \beta) } {\partial \beta} $$
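
A runnable sketch of Ichimura's least-squares version; the logistic $\tau$ used to simulate data, the bandwidth, and the normalization $\beta_1=1$ are all illustrative assumptions:

```python
import numpy as np
from scipy import optimize

# Semiparametric least squares for E[y|x] = tau(x'beta): estimate
# E[y | v(x, beta)] by leave-one-out kernel regression on the index,
# then minimize the sum of squared residuals over beta.
rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=(n, 2))
beta0 = np.array([1.0, 2.0])                       # normalized: beta_1 = 1
y = 1.0 / (1.0 + np.exp(-x @ beta0)) + rng.normal(scale=0.1, size=n)

def loo_kernel_regression(v, bw=0.3):
    # Leave-one-out kernel estimate of E[y | v] at each sample point
    k = np.exp(-0.5 * ((v[:, None] - v[None, :]) / bw) ** 2)
    np.fill_diagonal(k, 0.0)                       # drop own observation
    return k @ y / k.sum(axis=1)

def sls_objective(b2):
    v = x @ np.array([1.0, b2])                    # index v(x, beta)
    return np.sum((y - loo_kernel_regression(v)) ** 2)

res = optimize.minimize_scalar(sls_objective, bounds=(0.1, 5.0), method="bounded")
print("estimated beta_2:", res.x, " (true value 2.0)")
```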

Example 3.2: Inverse Density Weighted Least Squares

  • Let $w(x)=r\left((x-\zeta)^{\prime} \Omega(x-\zeta)\right)$ be an elliptically symmetric density function, where $\zeta$ is a vector and $\Omega$ a positive definite matrix.
  • Let $\hat{h}\left(x_i\right)$ be an estimator of the density of $x$, such as a kernel estimator.
  • As shown by Ruud (1986), the weighted least squares estimator $\hat{\beta}=\left[\sum_{i=1}^n w\left(x_i\right) \hat{h}\left(x_i\right)^{-1} x_i x_i^{\prime}\right]^{-1} \sum_{i=1}^n w\left(x_i\right) \hat{h}\left(x_i\right)^{-1} x_i y_i$ will be consistent up to scale for the coefficients $\gamma_0$ of an index model $E[y \mid x]=\tau\left(x^{\prime} \gamma_0\right)$.
  • This example is a special case of equation $(\Delta)$ with $$ m(z, \beta, h)=w(x) h(x)^{-1} x\left(y-x^{\prime} \beta\right) . $$
  • This estimator will be used to illustrate the correction term for density estimates.
  • Asymptotic normality and $\sqrt{n}$-consistency, when $\hat{h}(x)$ is a kernel estimator, are shown in Newey and Ruud (1991).
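
A sketch of this estimator; the $t$-distributed design, the spherical normal choice of $w$, and the use of `scipy.stats.gaussian_kde` for $\hat{h}$ are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# Inverse-density-weighted least squares (Ruud 1986): reweighting by
# w(x)/h_hat(x) makes the design behave like the elliptically symmetric
# density w, so least squares on the reweighted data is consistent up
# to scale for the index coefficients gamma0.
rng = np.random.default_rng(5)
n = 2000
x = rng.standard_t(df=8, size=(n, 2))              # non-normal design
gamma0 = np.array([1.0, -1.0])
y = np.tanh(x @ gamma0) + rng.normal(scale=0.1, size=n)

w = stats.multivariate_normal(mean=[0.0, 0.0]).pdf(x)  # spherical normal w
h_hat = stats.gaussian_kde(x.T)(x.T)                   # kernel density at x_i
omega = w / h_hat                                      # weights w(x_i)/h_hat(x_i)

A = (x * omega[:, None]).T @ x
b = (x * omega[:, None]).T @ y
beta_est = np.linalg.solve(A, b)
print("beta_hat:", beta_est)                       # roughly proportional to gamma0
print("coefficient ratio:", beta_est[0] / beta_est[1], " (gamma ratio = -1)")
```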


To use the pathwise derivative formula in this derivation, it is necessary to identify the functional that is nonparametrically estimated by $\hat{\beta}$.

  • Let $h(F)$ denote the limit of $\hat{h}$ when $z$ has distribution $F$.
  • The limit $\mu(F)$ of $\hat{\beta}$ for a general $F$ should be the solution to $$ E_F[m(z, \mu, h(F))]=0 .\qquad(\odot) $$
  • Equation $(\Delta)$ sets $\hat{\beta}$ so that sample moments are zero, and the sample moments have a limit of $E_F[m(z, \beta, h(F))]$
    • by the law of large numbers and $h(F)$ equal to the limit of $\hat{h}$
  • $\Rightarrow$ $\hat{\beta}$ is consistent for the solution of $(\odot)$.

Remarks

  • The asymptotic variance depends only on the limit $h(F)$, and not on the particular form of the estimator $\hat{h}$.
  • Different nonparametric estimators of the same functions should result in the same asymptotic variance.

Proposition 1: The asymptotic variance of semiparametric estimators depends only on the function that is nonparametrically estimated, and not on the type of estimator.


  • For a path $\left\{F_\theta\right\}$, let $h(\theta)=h\left(F_\theta\right)$. Here, $\mu\left(F_\theta\right)$ will satisfy the population moment equation $$ E_\theta[m(z, \mu, h(\theta))]=0 . $$
  • Let $m(z, h)=m\left(z, \beta_0, h\right)$. Differentiation under the integral gives $$ \frac{\partial E_\theta\left[m\left(z, h_0\right)\right]}{ \partial \theta}=\int m\left(z, h_0\right)\left[\frac{\partial d F_\theta }{ \partial \theta}\right] d z=E\left[m\left(z, h_0\right) S(z)\right] . $$
  • Using the chain rule, it follows that $$ \frac{\partial E_\theta[m(z, h(\theta))] }{ \partial \theta}=E\left[m\left(z, h_0\right) S(z)\right]+\frac{\partial E[m(z, h(\theta))] }{ \partial \theta} . $$
  • Assuming $M \equiv \partial E\left[m\left(z, \beta, h_0\right)\right] /\left.\partial \beta\right|_{\beta_0}$ is nonsingular, by the implicit function theorem $$ \frac{\partial \mu\left(F_\theta\right) }{ \partial \theta}=-M^{-1}\left\{E\left[m\left(z, h_0\right) S(z)\right]+\frac{\partial E[m(z, h(\theta))] }{ \partial \theta}\right\} . $$
  • The first term is already in an outer product form, so the pathwise derivative can be found by putting the second term in the same outer product form.
  • Suppose there is an $\alpha(z)$ such that $E[\alpha(z)]=0$ and $$\frac{\partial E[m(z, h(\theta))] }{ \partial \theta}=E[\alpha(z) S(z)].\qquad(\oplus)$$
  • Moving $-M^{-1}$ inside the expectation, it follows that the pathwise derivative is $d(z)=-M^{-1}\left\{m\left(z, h_0\right)+\alpha(z)\right\}$.
  • By Theorem 2.1 the influence function of $\hat{\beta}$ is $$ \psi(z)=-M^{-1}\left\{m\left(z, \beta_0, h_0\right)+\alpha(z)\right\} $$

Remarks
  • The leading term $-M^{-1} m\left(z, \beta_0, h_0\right)$ is the usual formula for the influence function of an $m$-estimator with moment functions $m\left(z, \beta, {\color{red}h_0}\right)$.
  • The solution to equation $(\oplus)$ is an adjustment term for the estimation of $h_0$.
  • Solving equation $(\oplus)$ is therefore the essential step in discovering how the estimation of $h$ affects the asymptotic variance.
  • This solution can be interpreted as
    • the pathwise derivative of the functional $-M^{-1} E\left[m\left(z, \beta_0, h(F)\right)\right]$
    • the influence function of $-M^{-1} \int m\left(z, \beta_0, \hat{h}\right) d F_0(z)$.
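
As a toy check of this machinery (the moment function below is an illustrative choice, not from the paper), take scalar $z$, $h(F)=E_F[z]$ (so $\hat{h}=\bar{z}$) and $m(z, \beta, h)=z h-\beta$, which gives $\mu(F)=\left(E_F[z]\right)^2$ and $M=-1$. Since $h(\theta)=E_\theta[z]$, $$ \frac{\partial E[m(z, h(\theta))]}{\partial \theta}=h_0 \frac{\partial h(\theta)}{\partial \theta}=h_0 E[z S(z)]=E[\underbrace{h_0\left(z-h_0\right)}_{\alpha(z)} S(z)], $$ so $\psi(z)=-M^{-1}\left\{m\left(z, h_0\right)+\alpha(z)\right\}=h_0\left(z-h_0\right)+h_0\left(z-h_0\right)=2 h_0\left(z-h_0\right)$, exactly the delta-method influence function of $\bar{z}^2$; ignoring the estimation of $h$ would wrongly halve it.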

When will the adjustment term be zero?

  • If the adjustment term is zero, then it should not be necessary to account for the presence of $\hat{h}$, i.e. $\hat{h}$ can be treated as if it were equal to $h_0$.
  • One case: equation $(\Delta)$ is the first-order condition to a maximization problem, and $\hat{h}$ has a limit that maximizes the population value of the same function.

To be specific, suppose that there is a function $q(z, \beta, h)$ and a set of functions $\mathscr{H}(\beta)$, possibly depending on $\beta$ but not on the distribution $F$ of $z$, such that $$ \begin{aligned} & m(z, \beta, h)=\partial q(z, \beta, h) / \partial \beta, \\ & h(F)=\operatorname{argmax}_{\tilde{h} \in \mathscr{H}(\beta)} E_F[q(z, \beta, \tilde{h})] .\qquad(\otimes) \end{aligned} $$

  • $m(z, \beta, h)$ are the first order conditions for a maximum of the function $q$
  • $h(F)$ maximizes the expected value of the same function
    • i.e. $h(F)$ has been “concentrated out”.

For any path $F_\theta$, $h(\theta)=h\left(F_\theta\right)$ lies in $\mathscr{H}(\beta)$ for every $\theta$, while $h(0)=h\left(F_0\right)$ maximizes $E[q(z, \beta, \tilde{h})]$ over $\mathscr{H}(\beta)$; hence $\theta \mapsto E[q(z, \beta, h(\theta))]$ is maximized at $\theta=0$ and $\partial E[q(z, \beta, h(\theta))] / \partial \theta=0$. Differentiating again with respect to $\beta$, $$ \begin{aligned} 0 & =\partial^2 E[q(z, \beta, h(\theta))] / \partial \theta \partial \beta\\ &=\partial E[\partial q(z, \beta, h(\theta)) / \partial \beta] / \partial \theta \\ & =\partial E[m(z, \beta, h(\theta))] / \partial \theta . \end{aligned} $$

  • Evaluating this equation at $\beta_0$, it follows that $\alpha(z)=0$ will solve equation $(\oplus)$, and hence the adjustment term is zero.

Proposition 2: If equation $(\otimes)$ is satisfied, then the estimation of $h$ can be ignored in calculating the asymptotic variance, i.e. it is the same as if $\hat{h}=h_0$. For instance, Ichimura's estimator in Example 3.1 fits this case with $q(z, \beta, h)=-[y-h(x, \beta)]^2 / 2$ and $\mathscr{H}(\beta)$ a set of functions of the index $v(x, \beta)$, since $E_F[y \mid v(x, \beta)]$ maximizes $E_F[q]$ over that set.

A more direct condition

  • Suppose that $m(z, h)$ depends on $h$ only through its value $h(v)$ at a subvector $v$ of $z$, i.e. $m(z, h)=m(z, h(v))$, where (with slight abuse of notation) the latter denotes a function of $z$ and a real vector argument.
  • Let $h(v, \theta)$ denote the limiting value of $\hat{h}\left(v, \beta_0\right)$ for a path. For $D(z)=\partial m\left(z, \beta_0, h\right) /\left.\partial h\right|_{h=h_0(v)}$, differentiation gives $$ \frac{\partial E[m(z, h(\theta))] }{ \partial \theta}=E\left[D(z) \frac{\partial h(v, \theta) }{ \partial \theta}\right]=\frac{\partial E[D(z) h(v, \theta)] }{ \partial \theta} . $$
  • If this derivative is zero for all $h(v, \theta)$, then $\alpha(z)=0$ will solve equation $(\oplus)$, and the adjustment term is zero.
  • One simple condition for this is $E[D(z) \mid v]=0$: then $E[D(z) h(v, \theta)]=E[E[D(z) \mid v]\, h(v, \theta)]=0$ by iterated expectations.

More generally, the adjustment term will be zero if $h(v, \theta)$ is an element of a set to which $D(z)$ is orthogonal.

Proposition 3: If $E[D(z) \mid v]=0$, or more generally, for all $F, h(v, F)$ is an element of a set $\mathscr{H}$ such that $E[D(z) \tilde{h}(v)]=0$ for all $\tilde{h} \in \mathscr{H}$, then estimation of $h$ can be ignored in calculating the asymptotic variance.

Remarks:

This condition can be checked by straightforward calculation, unlike Proposition 2, which requires finding $q(z, \beta, h)$ satisfying equation $(\otimes)$.

4. Functions of Mean-square Projection and Densities

5. Regularity Conditions

6. Series Estimation of Projection Functionals

7. Power Series Estimators For Semiparametric Individual Effects and Average Derivatives


References

  1. KOSHEVNIK, Y. A., AND B. Y. LEVIT (1976): “On a Non-parametric Analogue of the Information Matrix,” Theory of Probability and Its Applications, 21, 738–753.
  2. NEWEY, W. K. (1994): “The Asymptotic Variance of Semiparametric Estimators,” Econometrica, 62, 1349–1382.
  3. PFANZAGL, J., AND W. WEFELMEYER (1982): Contributions to a General Asymptotic Statistical Theory. New York: Springer-Verlag.