---
title: "Real cancer drivers walkthrough"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Real cancer drivers walkthrough}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
# Skip evaluation of all chunks on CRAN's auto-check farm to fit the
# 10-minute build budget. Locally, on CI, and under devtools::check(),
# NOT_CRAN=true and all chunks evaluate normally. The vignette source
# (which CRAN users see in browseVignettes() / vignette()) is unchanged.
NOT_CRAN <- identical(tolower(Sys.getenv("NOT_CRAN")), "true")
knitr::opts_chunk$set(eval = NOT_CRAN)
```

# Real cancer drivers walkthrough

This vignette uses the bundled `dataset_real_cancer_drivers_4` dataset to
illustrate a real biological analysis: how do four canonical cancer driver
catalogs overlap?

The four sources are:

* **Vogelstein**: the 138-gene catalog from
  Vogelstein et al. (Science 2013), often cited as the "core" driver-gene set
  (it covers both oncogenes and tumor suppressors).
* **COSMIC_CGC**: the COSMIC Cancer Gene Census (Sondka et al. 2018), a
  curated list of genes causally implicated in cancer.
* **OncoKB**: genes from the MSK precision-oncology knowledge base with
  annotation level ≥ "Oncogenic" (Chakravarty et al. 2017).
* **IntOGen**: pan-cancer driver genes from the IntOGen pipeline
  (Martínez-Jiménez et al. 2020).

```{r setup-data}
library(vennDiagramLab)
ds <- load_sample("dataset_real_cancer_drivers_4")
ds@set_names
```

## Set sizes

```{r sizes}
sapply(ds@items, length)
```

The lists differ substantially in size: Vogelstein is the smallest curated
set, while OncoKB is the most permissive at this annotation tier.

## Universe

The dataset was built from a 20,000-gene background (`universe_size`):

```{r universe}
ds@universe_size
```

This is the population N used in the hypergeometric over-representation
tests (see `vignette("v05_statistics_deep_dive")`).
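
As a rough illustration of how the universe size enters such a test, the
sketch below computes an upper-tail hypergeometric p-value with base R's
`phyper()`. Only the universe size (20,000) and the Vogelstein catalog size
(138) come from this vignette; the second set size and the overlap are
hypothetical numbers chosen for illustration, and the package's own
implementation may differ in detail.

```{r hypergeom-sketch}
# Illustrative sketch only: toy_size_b and toy_overlap are hypothetical,
# NOT values from the bundled dataset.
toy_universe <- 20000   # population size N (the gene background)
toy_size_a   <- 138     # genes in set A
toy_size_b   <- 700     # genes in set B (hypothetical)
toy_overlap  <- 100     # genes shared by A and B (hypothetical)

# Overlap expected by chance if B were a random draw from the universe.
toy_expected <- toy_size_b * toy_size_a / toy_universe

# P(overlap >= observed). phyper() with lower.tail = FALSE gives P(X > q),
# so we pass q = observed - 1 to include the observed value itself.
toy_p_value <- phyper(
    toy_overlap - 1,
    toy_size_a,                  # "successes" in the population
    toy_universe - toy_size_a,   # "failures" in the population
    toy_size_b,                  # number of draws
    lower.tail = FALSE
)
toy_expected
toy_p_value
```

A larger universe lowers the expected chance overlap, so the same observed
overlap looks more significant; this is why the background size matters.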

## Analyze

```{r analyze}
result <- analyze(ds)
result@model
length(result@regions)
```

The default model for 4 sets is `venn-4-set` (Edwards-style).
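
For orientation, four sets can form at most 2^4 - 1 = 15 distinct regions (one
for every non-empty combination of sets). The base-R sketch below enumerates
those combinations. The letter-based labels are an assumption made for
illustration; the labels and region count that `analyze()` actually stores may
differ (for example, if empty regions are handled differently).

```{r regions-sketch}
# Illustrative sketch only: enumerate every non-empty combination of four
# sets. The label format ("AB", "ACD", ...) is a hypothetical choice.
toy_names <- c("A", "B", "C", "D")

# One row per in/out pattern across the four sets (2^4 = 16 rows).
toy_grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(toy_names)))
names(toy_grid) <- toy_names

# Drop the all-FALSE row: it describes items outside every set.
toy_grid <- toy_grid[rowSums(toy_grid) > 0, ]

# Build a label from the sets that are "in" for each pattern.
toy_region_labels <- apply(
    toy_grid, 1,
    function(row) paste(toy_names[row], collapse = "")
)
length(toy_region_labels)
```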

## Set sizes (inclusive) and intersection layout

```{r set-sizes-table}
result@set_sizes
```
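
Assuming "inclusive" here means that a set's size counts all of its genes,
including those shared with other sets, the contrast with an "exclusive" count
(genes found in that set only) can be sketched in base R. The toy gene lists
below are hypothetical and are not the bundled dataset.

```{r inclusive-sketch}
# Illustrative sketch only: hypothetical toy sets, NOT the bundled dataset.
toy_sets <- list(
    A = c("TP53", "KRAS", "PIK3CA", "PTEN"),
    B = c("TP53", "KRAS", "PIK3CA", "BRAF"),
    C = c("TP53", "EGFR")
)

# Inclusive size: every gene in the set, shared or not.
toy_inclusive <- lengths(toy_sets)

# Exclusive size: genes that appear in this set and in no other set.
toy_exclusive <- vapply(
    names(toy_sets),
    function(s) {
        others <- unlist(toy_sets[names(toy_sets) != s])
        length(setdiff(toy_sets[[s]], others))
    },
    integer(1)
)
rbind(inclusive = toy_inclusive, exclusive = toy_exclusive)
```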

## A summary at a glance

`broom::glance()` returns a one-row tibble with the headline numbers:

```{r glance}
broom::glance(result)
```

## Render the Venn diagram

The default render uses the dataset's set names as labels. To shorten them
for the diagram, pass a per-letter override:

```{r render-custom}
svg <- render_venn_svg(
    result,
    set_names = c(A = "Vogelstein", B = "COSMIC", C = "OncoKB", D = "IntOGen"),
    title = "Cancer driver overlap (4 sources)"
)
nchar(svg)
```

(See `vignette("v08_custom_styling_and_export")` for color overrides and
post-render SVG manipulation.)

## UpSet view

With four or more sets, an UpSet plot is often easier to read than a Venn
diagram: each intersection is drawn as a bar whose height is its size, and the
bars are sorted by cardinality.

```{r upset, eval = NOT_CRAN && (getRversion() >= "4.6")}
upset_plot <- render_upset(result, sort_by = "size")
upset_plot
```

(The chunk above is gated on `R >= 4.6` because the CRAN release of
`ComplexUpset` (1.3.3) is incompatible with `ggplot2 >= 4.0` on older R —
see `?vennDiagramLab::render_upset` for context.)
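
If the plot cannot be rendered in your environment, the numbers behind the
bars can still be approximated by hand. The base-R sketch below counts, for
hypothetical toy sets, how many genes fall into each exact combination of sets
and sorts the counts from largest to smallest. It does not call
`render_upset()`, and the `"A&B"` label format is an assumption made for
illustration.

```{r upset-counts-sketch}
# Illustrative sketch only: hypothetical toy sets, NOT the bundled dataset.
toy_upset_sets <- list(
    A = c("TP53", "KRAS", "PIK3CA", "PTEN"),
    B = c("TP53", "KRAS", "PIK3CA", "BRAF"),
    C = c("TP53", "EGFR")
)
toy_upset_genes <- unique(unlist(toy_upset_sets))

# For each gene, name the exact combination of sets that contain it.
toy_combo <- vapply(
    toy_upset_genes,
    function(g) {
        member <- vapply(toy_upset_sets, function(s) g %in% s, logical(1))
        paste(names(toy_upset_sets)[member], collapse = "&")
    },
    character(1)
)

# One count per combination, largest first (the bar order of an UpSet plot
# sorted by size).
toy_bars <- sort(table(toy_combo), decreasing = TRUE)
toy_bars
```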

## Top significant intersections

`broom::tidy()` returns one row per set pair, with all five pairwise metrics
plus the BH-FDR-adjusted hypergeometric p-value:

```{r tidy}
top_pairs <- broom::tidy(result)
top_pairs[order(top_pairs$p_adjusted), c("set_a", "set_b", "intersection",
                                          "jaccard", "p_adjusted",
                                          "significant")]
```

Every pair is significant at FDR < 0.05. This is expected: all four catalogs
aim to list the same class of genes (cancer drivers), so they overlap far more
than random gene sets of the same size would.
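
To make the columns above more concrete, here is a base-R sketch of how
Jaccard, Dice, and a BH-adjusted hypergeometric p-value could be computed for
every pair of sets. The toy sets and the universe size of 100 are hypothetical,
and this is not necessarily how `broom::tidy()` computes its values for a
result object; only the column names `set_a`, `set_b`, `intersection`,
`jaccard` and `p_adjusted` are borrowed from the table above.

```{r pairwise-sketch}
# Illustrative sketch only: hypothetical toy sets and universe size.
toy_pair_sets <- list(
    A = c("TP53", "KRAS", "PIK3CA", "PTEN"),
    B = c("TP53", "KRAS", "PIK3CA", "BRAF"),
    C = c("TP53", "EGFR")
)
toy_pair_universe <- 100

# One row of metrics per unordered pair of sets.
toy_rows <- lapply(
    combn(names(toy_pair_sets), 2, simplify = FALSE),
    function(p) {
        a <- toy_pair_sets[[p[1]]]
        b <- toy_pair_sets[[p[2]]]
        shared <- length(intersect(a, b))
        data.frame(
            set_a = p[1],
            set_b = p[2],
            intersection = shared,
            jaccard = shared / length(union(a, b)),
            dice = 2 * shared / (length(a) + length(b)),
            # Upper-tail hypergeometric test: P(overlap >= shared).
            p_value = phyper(shared - 1, length(a),
                             toy_pair_universe - length(a),
                             length(b), lower.tail = FALSE)
        )
    }
)
toy_pairwise <- do.call(rbind, toy_rows)

# Benjamini-Hochberg correction across all pairs.
toy_pairwise$p_adjusted <- p.adjust(toy_pairwise$p_value, method = "BH")
toy_pairwise
```

Jaccard and Dice describe how large the overlap is relative to the sets, while
the adjusted p-value describes how surprising it is given the universe; a pair
can score low on the first two and still be highly significant.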

## Item-level annotation

`broom::augment()` returns one row per gene with set-membership flags and
the region label.

```{r augment}
gene_table <- broom::augment(result)
head(gene_table)
nrow(gene_table)        # total unique genes across all four sets
table(gene_table$region_label)   # how many genes in each region
```

## Save the region summary

```{r save-summary, eval = FALSE}
to_region_summary_tsv(result, "cancer_drivers_regions.tsv")
```

## What's next

* `vignette("v05_statistics_deep_dive")` — interpret the Jaccard / Dice /
  hypergeometric numbers in detail.
* `vignette("v07_pdf_reports")` — turn this analysis into a multi-page PDF.
* `vignette("v08_custom_styling_and_export")` — customize colors, embed in a
  ggplot, export to PDF/PNG.