The moc.gapbk package implements the
Multi-Objective Clustering Algorithm Guided by a-Priori
Biological Knowledge (MOC-GaPBK) proposed by Parraga-Alava and
others (2018). The algorithm combines an NSGA-II engine with
intensification and diversification strategies such as Pareto Local
Search.

It receives two distance matrices and produces a set of non-dominated clustering solutions. The second matrix is typically used to encode a-priori biological knowledge (for example, semantic similarity between genes).
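
To make "non-dominated" concrete, the following minimal sketch filters a small set of candidate solutions by Pareto dominance, assuming both objectives are minimised. The objective values and the helper names (dominates, is_non_dominated) are hypothetical and are not part of the package; the package's own ranking code may differ.

```r
# Hypothetical objective values for five candidate solutions (lower is better).
# Rows are solutions; columns are the two objectives.
objectives <- rbind(
  s1 = c(0.20, 0.90),
  s2 = c(0.35, 0.40),
  s3 = c(0.50, 0.45),  # dominated by s2
  s4 = c(0.80, 0.10),
  s5 = c(0.90, 0.95)   # dominated by every other solution
)

# a dominates b when a is no worse in every objective and strictly better in one.
dominates <- function(a, b) all(a <= b) && any(a < b)

# TRUE for each row that no other row dominates.
is_non_dominated <- function(obj) {
  n <- nrow(obj)
  vapply(seq_len(n), function(i) {
    dominated <- vapply(seq_len(n), function(j) {
      j != i && dominates(obj[j, ], obj[i, ])
    }, logical(1))
    !any(dominated)
  }, logical(1))
}

front <- objectives[is_non_dominated(objectives), , drop = FALSE]
rownames(front)
```

The solutions left in front trade one objective against the other; none of them can be improved in one objective without getting worse in the second.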
library(moc.gapbk)
set.seed(2025)
# Toy data: 50 objects (e.g. genes) described by 20 features (e.g. samples).
x <- matrix(stats::runif(50 * 20, min = -5, max = 10),
            nrow = 50, ncol = 20)
# Two distance matrices over the same set of objects.
# Here we use amap if available (correlation distance is biologically
# common), and fall back to base R otherwise so the vignette knits
# under any configuration.
if (requireNamespace("amap", quietly = TRUE)) {
  d1 <- as.matrix(amap::Dist(x, method = "euclidean"))
  d2 <- as.matrix(amap::Dist(x, method = "correlation"))
} else {
  d1 <- as.matrix(stats::dist(x, method = "euclidean"))
  d2 <- as.matrix(stats::dist(x, method = "manhattan"))
}
res <- moc.gapbk(dmatrix1 = d1,
                 dmatrix2 = d2,
                 num_k = 3,
                 generation = 5,
                 pop_size = 6)

res$population contains the medoids that survived the
last generation, together with the values of the two objective
functions, the Pareto ranking and the crowding distance.
res$matrix.solutions is a data frame whose columns are
the clustering assignments produced by each non-dominated solution.
res$clustering exposes the same information as a list of
named integer vectors, ready to be passed to validation indices,
plotting helpers, etc.
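
As an illustration of what can be done with a list of named integer vectors, the sketch below tabulates cluster sizes and cross-tabulates two solutions. The clustering object here is a hand-built, hypothetical stand-in shaped like the description above (the element names solution1 and solution2 and the object names g1 to g6 are assumptions); it is not output from the package.

```r
# Hypothetical stand-in for res$clustering: two solutions over six objects.
clustering <- list(
  solution1 = c(g1 = 1L, g2 = 1L, g3 = 2L, g4 = 2L, g5 = 3L, g6 = 3L),
  solution2 = c(g1 = 1L, g2 = 2L, g3 = 2L, g4 = 2L, g5 = 3L, g6 = 3L)
)

# Cluster sizes for each solution.
sizes <- lapply(clustering, table)

# Cross-tabulate two solutions to see where they disagree.
# Off-diagonal counts are objects assigned to different clusters.
agreement <- table(clustering$solution1, clustering$solution2)
n_agree <- sum(diag(agreement))
```

The same vectors can be handed to any function that expects one integer label per object, provided the labels follow the row order of the distance matrices.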
The full algorithm activates the intensification and diversification
strategies through the local_search argument. Because
Pareto Local Search has a cost that is quadratic in the size of the
Pareto front, this option is left disabled in this vignette.
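
A call with the local search enabled might look like the sketch below. It assumes that local_search accepts a logical flag and that d1 and d2 are distance matrices such as the ones built earlier; check ?moc.gapbk for the actual defaults and accepted values. The wrapper name run_full_moc is hypothetical, and the call is wrapped in a function so that nothing expensive runs until it is invoked.

```r
# Not evaluated here: defining the function does not start the search.
run_full_moc <- function(d1, d2) {
  moc.gapbk::moc.gapbk(dmatrix1 = d1,
                       dmatrix2 = d2,
                       num_k = 3,
                       generation = 5,
                       pop_size = 6,
                       local_search = TRUE)  # assumed to be a logical flag
}
```

Calling run_full_moc(d1, d2) would then run the full algorithm; expect it to take noticeably longer than the default configuration as the Pareto front grows.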
In bioinformatics workflows, dmatrix1 is usually a
distance derived from numerical expression profiles (for example,
correlation or Euclidean distance on log-expression values), while
dmatrix2 is a distance derived from a-priori biological
knowledge (for example, semantic similarity between Gene Ontology
terms). The Xie-Beni validity index is computed independently on each
matrix, and the two resulting values serve as the two objective
functions of the NSGA-II engine.
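
The sketch below shows one common crisp, medoid-based form of the Xie-Beni index evaluated on two distance matrices. It is an illustration under stated assumptions, not the package's implementation: the function name xie_beni, the toy points and the use of Manhattan distance as a stand-in for a knowledge-based distance are all hypothetical, and the package's exact formulation may differ.

```r
# Crisp, medoid-based Xie-Beni index on a distance matrix (lower is better).
# d:       symmetric n x n distance matrix
# medoids: row indices of the k medoid objects
xie_beni <- function(d, medoids) {
  d <- as.matrix(d)
  stopifnot(nrow(d) == ncol(d), length(medoids) >= 2)
  to_medoids <- d[, medoids, drop = FALSE]
  # Compactness: squared distance from each object to its nearest medoid.
  compactness <- sum(apply(to_medoids, 1, min)^2)
  # Separation: smallest squared distance between two distinct medoids.
  between <- d[medoids, medoids]
  separation <- min(between[upper.tri(between)])^2
  compactness / (nrow(d) * separation)
}

# Toy data: two well-separated groups of three points each.
pts <- rbind(c(0, 0), c(1, 1), c(3, 1),
             c(10, 10), c(11, 11), c(12, 10))
d_expr <- as.matrix(stats::dist(pts, method = "euclidean"))
d_know <- as.matrix(stats::dist(pts, method = "manhattan"))  # stand-in only

# One objective value per matrix for the same pair of medoids.
xb_values <- c(f1 = xie_beni(d_expr, medoids = c(2, 5)),
               f2 = xie_beni(d_know, medoids = c(2, 5)))
```

Because the same medoids are scored against two different matrices, a solution that looks compact in expression space can still score poorly against the biological knowledge, which is what makes the problem multi-objective.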
Versions before 0.2.0 exported the function as moc.gabk
(without the p). That name is preserved as a deprecated
alias and emits a warning; all new code should call
moc.gapbk directly.
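
For readers unfamiliar with how deprecated aliases behave in R, the following sketch shows the general pattern with base R's .Deprecated(). The functions new_name and old_name are hypothetical stand-ins; this is not the package's source code, only an illustration of the kind of warning to expect when the old name is called.

```r
# Stand-in for the current function; hypothetical, not the package's code.
new_name <- function(x) x * 2

# Deprecated alias: signals a warning, then forwards all arguments.
old_name <- function(...) {
  .Deprecated("new_name")
  new_name(...)
}

# Calling the alias still returns a result, but a warning is signalled.
result <- withCallingHandlers(
  old_name(21),
  warning = function(w) {
    message("caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

In scripts that treat warnings as errors (for example with options(warn = 2)), a call through a deprecated alias would stop execution, which is one more reason to switch to the new name.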
Parraga-Alava, J., Dorn, M., Inostroza-Ponta, M. (2018). A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. BioData Mining 11(1), 1-16. https://doi.org/10.1186/s13040-018-0178-4