---
title: "Higher-Order MLE Analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Higher-Order MLE Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

```{r setup}
library(nabla)
```

## Motivation

The `D` operator composes to arbitrary order: `D(f, x, order = k)` returns
the $k$-th derivative tensor. Gradients ($k = 1$) and Hessians ($k = 2$)
are standard MLE tools. Third-order derivatives unlock additional diagnostic
information — in particular, the **asymptotic skewness** of the MLE.

This vignette demonstrates `D(f, x, order = 3)` on a Gamma MLE problem,
computing:

1. The score and observed information (first and second derivatives)
2. The third-order derivative tensor (a $2 \times 2 \times 2$ array)
3. The asymptotic skewness of the MLE
4. Monte Carlo validation of the theoretical skewness

## The Gamma model

Given $x_1, \ldots, x_n \sim \text{Gamma}(\alpha, \beta)$ with shape
$\alpha > 0$ and rate $\beta > 0$, the log-likelihood in terms of
sufficient statistics $S_1 = \sum \log x_i$ and $S_2 = \sum x_i$ is:

$$\ell(\alpha, \beta) = n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha - 1)S_1 - \beta S_2$$

This model is ideal for third-order AD because:

- It involves `lgamma`, exercising `nabla`'s special function support
- The Hessian depends only on $(\alpha, \beta)$, not the data
  (observed information = expected information)
- Third derivatives involve `psigamma(alpha, deriv = 2)` (tetragamma)

```{r}
# Simulate data
set.seed(42)
n <- 200
alpha_true <- 3
beta_true <- 2
x_data <- rgamma(n, shape = alpha_true, rate = beta_true)

# Sufficient statistics
S1 <- sum(log(x_data))
S2 <- sum(x_data)

# Log-likelihood using sufficient statistics
ll_gamma <- function(theta) {
  alpha <- theta[1]; beta <- theta[2]
  n * alpha * log(beta) - n * lgamma(alpha) +
    (alpha - 1) * S1 - beta * S2
}
```

## Finding the MLE

`optim` with AD gradients finds the MLE:

```{r}
fit <- optim(
  par = c(2, 1),
  fn = function(theta) -ll_gamma(theta),
  gr = function(theta) -gradient(ll_gamma, theta),
  method = "L-BFGS-B",
  lower = c(1e-4, 1e-4)
)
theta_hat <- fit$par
names(theta_hat) <- c("alpha", "beta")
theta_hat
```

## Score and observed information

At the MLE, the score (gradient of the log-likelihood) should be
approximately zero. The observed information matrix $J$ is the negative
Hessian.

```{r}
# Score at MLE — should be near zero
score <- gradient(ll_gamma, theta_hat)
score

# Observed information
H <- hessian(ll_gamma, theta_hat)
J <- -H
cat("Observed information matrix:\n")
J
```

Verify against analytical formulas. For the Gamma model:

- $J_{11} = n \cdot \psi'(\alpha)$ where $\psi'$ is trigamma
- $J_{12} = -n / \beta$
- $J_{22} = n\alpha / \beta^2$

```{r}
alpha_hat <- theta_hat[1]; beta_hat <- theta_hat[2]
J_analytical <- matrix(c(
  n * trigamma(alpha_hat),   -n / beta_hat,
  -n / beta_hat,              n * alpha_hat / beta_hat^2
), nrow = 2, byrow = TRUE)

cat("Max |AD - analytical| in J:", max(abs(J - J_analytical)), "\n")
```

Standard errors from the inverse information:

```{r}
J_inv <- solve(J)
se <- sqrt(diag(J_inv))
cat("SE(alpha_hat) =", se[1], "\n")
cat("SE(beta_hat)  =", se[2], "\n")
```

## Third-order derivative tensor

`D(f, x, order = 3)` returns the third-order derivative tensor
$T_{rst} = \partial^3 \ell / \partial\theta_r\partial\theta_s\partial\theta_t$.
For a two-parameter model, this is a $2 \times 2 \times 2$ array with 8
entries (symmetry reduces the distinct values).

```{r}
T3 <- D(ll_gamma, theta_hat, order = 3)
T3
```

Verify each entry analytically. For the Gamma model:

| Entry | Formula | Note |
|-------|---------|------|
| $T_{111}$ | $-n \cdot \psi''(\alpha)$ | tetragamma |
| $T_{112} = T_{121} = T_{211}$ | $0$ | |
| $T_{122} = T_{212} = T_{221}$ | $n / \beta^2$ | |
| $T_{222}$ | $-2n\alpha / \beta^3$ | |

```{r}
T3_analytical <- array(0, dim = c(2, 2, 2))
T3_analytical[1, 1, 1] <- -n * psigamma(alpha_hat, deriv = 2)
# T3[1,1,2] and permutations = 0 (already zero)
T3_analytical[1, 2, 2] <- n / beta_hat^2
T3_analytical[2, 1, 2] <- n / beta_hat^2
T3_analytical[2, 2, 1] <- n / beta_hat^2
T3_analytical[2, 2, 2] <- -2 * n * alpha_hat / beta_hat^3

cat("Max |AD - analytical| in T3:", max(abs(T3 - T3_analytical)), "\n")
```

`★ Insight ─────────────────────────────────────`
The `D(f, x, order = 3)` call applies the `D` operator three times, running
$p = 2$ forward passes at each level for $2^3 = 8$ total passes. Each pass
propagates triply-nested dual numbers through `lgamma`, which internally
chains through `digamma`, `trigamma`, and `psigamma(alpha, deriv = 2)`.
`─────────────────────────────────────────────────`

## Asymptotic skewness of the MLE

For a regular model where the Hessian is non-random (as in the Gamma model),
the asymptotic skewness of the MLE $\hat\theta_a$ is:

$$\gamma_1(\hat\theta_a) = \frac{1}{\sqrt{n}} \sum_{r,s,t}
\frac{J^{-1}_{ar} J^{-1}_{as} J^{-1}_{at} \cdot m_{rst}}{[J^{-1}_{aa}]^{3/2}}$$

where $m_{rst} = 4 \cdot T_{rst} / n$ captures the per-observation third
derivative contribution to the asymptotic third cumulant.

```{r}
mle_skewness <- function(J_inv, T3, n) {
  p <- nrow(J_inv)
  skew <- numeric(p)
  for (a in seq_len(p)) {
    s <- 0
    for (r in seq_len(p)) {
      for (s_ in seq_len(p)) {
        for (t in seq_len(p)) {
          s <- s + J_inv[a, r] * J_inv[a, s_] * J_inv[a, t] *
            4 * T3[r, s_, t] / n
        }
      }
    }
    skew[a] <- s / J_inv[a, a]^(3/2)
  }
  skew
}

theoretical_skew <- mle_skewness(J_inv, T3, n)
cat("Theoretical skewness of alpha_hat:", theoretical_skew[1], "\n")
cat("Theoretical skewness of beta_hat: ", theoretical_skew[2], "\n")
```

A positive skewness for $\hat\alpha$ and $\hat\beta$ means their sampling
distributions are right-skewed — there is a longer tail of overestimates
than underestimates. This is typical for shape and rate parameters, which
are bounded below by zero.

## Monte Carlo validation

Validating the theoretical skewness by simulating many datasets and
computing the MLE for each:

```{r}
skewness <- function(x) {
  m <- mean(x); s <- sd(x)
  mean(((x - m) / s)^3)
}

set.seed(42)
B <- 5000
alpha_hats <- numeric(B)
beta_hats <- numeric(B)

for (b in seq_len(B)) {
  x_b <- rgamma(n, shape = alpha_true, rate = beta_true)
  S1_b <- sum(log(x_b))
  S2_b <- sum(x_b)

  ll_b <- function(theta) {
    a <- theta[1]; be <- theta[2]
    n * a * log(be) - n * lgamma(a) + (a - 1) * S1_b - be * S2_b
  }

  fit_b <- optim(
    par = c(2, 1),
    fn = function(theta) -ll_b(theta),
    method = "L-BFGS-B",
    lower = c(1e-4, 1e-4)
  )
  alpha_hats[b] <- fit_b$par[1]
  beta_hats[b] <- fit_b$par[2]
}

empirical_skew <- c(skewness(alpha_hats), skewness(beta_hats))
```

Compare theoretical vs empirical skewness:

```{r}
comparison <- data.frame(
  parameter = c("alpha", "beta"),
  theoretical = round(theoretical_skew, 4),
  empirical = round(empirical_skew, 4)
)
comparison
```

The agreement between the theoretical prediction (computed from third-order
AD) and the Monte Carlo estimate validates both the AD computation and the
asymptotic skewness formula.

## Visualizing MLE asymmetry

The sampling distributions of the MLEs show the asymmetric shape predicted
by the skewness analysis:

```{r fig-mle-histograms, fig.width=7, fig.height=3.5}
oldpar <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# alpha_hat
hist(alpha_hats, breaks = 50, freq = FALSE, col = "grey85", border = "grey60",
     main = expression(hat(alpha)),
     xlab = expression(hat(alpha)))
curve(dnorm(x, mean = mean(alpha_hats), sd = sd(alpha_hats)),
      col = "steelblue", lwd = 2, add = TRUE)
abline(v = alpha_true, col = "firebrick", lty = 2, lwd = 2)
legend("topright",
       legend = c("Normal approx", expression(alpha[true])),
       col = c("steelblue", "firebrick"), lty = c(1, 2), lwd = 2,
       bty = "n", cex = 0.8)

# beta_hat
hist(beta_hats, breaks = 50, freq = FALSE, col = "grey85", border = "grey60",
     main = expression(hat(beta)),
     xlab = expression(hat(beta)))
curve(dnorm(x, mean = mean(beta_hats), sd = sd(beta_hats)),
      col = "steelblue", lwd = 2, add = TRUE)
abline(v = beta_true, col = "firebrick", lty = 2, lwd = 2)
legend("topright",
       legend = c("Normal approx", expression(beta[true])),
       col = c("steelblue", "firebrick"), lty = c(1, 2), lwd = 2,
       bty = "n", cex = 0.8)
par(oldpar)
```

The right tail is visibly heavier than the left, consistent with the
positive skewness values. The normal approximation (blue curve) misses
this asymmetry — which is precisely what third-order derivative analysis
captures.

## Summary

Third-order derivatives computed by `D(f, x, order = 3)` provide
information beyond the standard gradient and Hessian:

- **The $2 \times 2 \times 2$ tensor** captures how the curvature of the
  log-likelihood changes across the parameter space. For the Gamma model,
  `nabla` propagates through `lgamma` → `digamma` → `trigamma` →
  `psigamma(deriv = 2)` automatically.
- **Asymptotic skewness** of the MLE quantifies departure from normality.
  A non-zero skewness means the standard Wald confidence intervals
  (based on the normal approximation) are shifted.
- **Monte Carlo validation** confirms the theoretical predictions, giving
  confidence that the AD computation is correct.

The `D` operator makes this analysis composable: `D(f, x, order = 3)`
is just three applications of the same operator that gives gradients and
Hessians. No separate code paths, no symbolic algebra, no finite-difference
tuning.