Generating interesting
high-dimensional data structures

Jayani P. G. Lakshika

Joint work with Prof Dianne Cook, Dr Paul Harrison, Dr Michael Lydeamore, Dr Thiyanga S. Talagala

Existing software


Lack of control over cluster shapes



cardinalR





A collection of high-dimensional data structures in R

What do we implement in our R package?

  • Generation of geometric structures in arbitrary dimensions.

  • Control over background noise, clustering,
    and sample positioning.

  • Generation of explainable, challenging synthetic datasets for
    benchmarking high-dimensional methods.
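
To make "geometric structures in arbitrary dimensions" concrete, here is a minimal base R sketch of one such structure: points on the surface of a unit sphere in p dimensions. The helper name `sketch_unit_sphere` is hypothetical and is not part of cardinalR; it only illustrates the idea that the same recipe works for any p.

```r
# Hypothetical helper (not a cardinalR function): draw n points uniformly
# on the surface of the unit sphere in p dimensions.
sketch_unit_sphere <- function(n, p) {
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)  # isotropic Gaussian draws
  z <- z / sqrt(rowSums(z^2))                    # rescale each row to length 1
  colnames(z) <- paste0("x", seq_len(p))
  as.data.frame(z)
}

set.seed(1)
sphere4 <- sketch_unit_sphere(n = 1000, p = 4)
head(sphere4)
```

Changing `p` is the only step needed to move the same structure into a higher-dimensional space.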

Shape generators

Branching

Cone

Cube

Linear

Mobius

Polynomial

Pyramid

Scurve

Sphere

Swiss roll

Trefoil

Trigonometric

Branching

expbranches <- gen_expbranches(n = 1000, p = 4, k = 4)
linearbranches <- gen_linearbranches(n = 1000, p = 4, k = 4)
curvybranches <- gen_curvybranches(n = 1000, p = 4, k = 4)
orglinearbranches <- gen_orglinearbranches(n = 1000, p = 4, k = 4)
orgcurvybranches <- gen_orgcurvybranches(n = 1000, p = 4, k = 4)

Pyramid

pyrrect <- gen_pyrrect(n = 1000, p = 4)
pyrtri <- gen_pyrtri(n = 1000, p = 4)
pyrstar <- gen_pyrstar(n = 1000, p = 4)
pyrholes <- gen_pyrholes(n = 1000, p = 4)

Cube

gridcube <- gen_gridcube(n = 1000, 
                         p = 4)
unifcube <- gen_unifcube(n = 1000, 
                         p = 4)
cubehole <- gen_cubehole(n = 3000, 
                         p = 4)

Polynomial

quadratic <- gen_quadratic(n = 1000, p = 4)
cubic <- gen_cubic(n = 1000, p = 4)

Sphere

circle <- gen_circle(n = 1000, p = 4)
curvycycle <- gen_curvycycle(n = 1000, p = 4)
unifsphere <- gen_unifsphere(n = 1000, p = 4)
gridedsphere <- gen_gridedsphere(n = 10000, p = 4)
clusteredspheres <- gen_clusteredspheres(n = c(1000, 100), k = 3, p = 4, r = c(15, 3),
                                         loc = 10 / sqrt(3)) |>
  dplyr::select(-cluster)
hemisphere <- gen_hemisphere(n = 1000, p = 4)

Trigonometric

crescent <- gen_crescent(n = 1000, p = 4)
curvycylinder <- gen_curvycylinder(n = 1000, p = 4, h = 10)
sphericalspiral <- gen_sphericalspiral(n = 1000, p = 4, spins = 1)
helicalspiral <- gen_helicalspiral(n = 1000, p = 4)
conicspiral <- gen_conicspiral(n = 1000, p = 4, spins = 1)
nonlinear <- gen_nonlinear(n = 1000, p = 4, hc = 1, non_fac = 0.5)






Implementation

pyrstar <- gen_pyrstar(
  n = 800, p = 4, h = 5, rb = 3)
conicspiral <- gen_conicspiral(
  n = 500, p = 4, spins = 1)
hemisphere <- gen_hemisphere(
  n = 1000, p = 4)

Let's combine!

Generate clusters

Clusters of different shapes (shape), each with its own

  • sample size (n)

  • location (loc)

  • scale (scale)

  • rotation (rotation)

with or without background noise (is_bkg)

three_clusts <- gen_multicluster(
 n = c(700, 300, 500), p = 4, k = 3,
 loc = matrix(c(
   0, 0, 0, 0,
   5, 0, 0, 0,
   3, 4, 10, 7
 ), nrow = 3, byrow = TRUE),
 scale = c(8, 2, 5),
 shape = c("pyrstar", "conicspiral", "hemisphere"),
 rotation = NULL,
 is_bkg = TRUE)
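
The roles of scale, rotation, location, and background noise can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: `place_cluster` and `plane_rotation` are hypothetical helpers, the rotation is given as an orthogonal p × p matrix (the format that `gen_multicluster` expects for `rotation` is not shown here), and background noise is assumed to be uniform over the bounding box of the data.

```r
# Hypothetical helper (not cardinalR code): scale, rotate, then shift a cluster.
place_cluster <- function(x, loc, scale = 1, rotation = diag(ncol(x))) {
  x <- as.matrix(x) * scale   # resize about the origin
  x <- x %*% t(rotation)      # rotate every row vector
  sweep(x, 2, loc, "+")       # move the cluster centre to `loc`
}

# Hypothetical helper: rotation by `angle` radians in the plane spanned by
# axes i and j of p-dimensional space.
plane_rotation <- function(p, i, j, angle) {
  r <- diag(p)
  r[i, i] <- cos(angle)
  r[j, j] <- cos(angle)
  r[i, j] <- -sin(angle)
  r[j, i] <- sin(angle)
  r
}

set.seed(2)
cluster <- matrix(rnorm(300 * 4), ncol = 4)
rot <- plane_rotation(p = 4, i = 1, j = 2, angle = pi / 6)
placed <- place_cluster(cluster, loc = c(5, 0, 0, 0), scale = 2, rotation = rot)

# Assumed noise model: 50 points uniform over the bounding box of the cluster.
rng <- apply(placed, 2, range)
bkg <- sapply(seq_len(4), function(j) runif(50, rng[1, j], rng[2, j]))
with_noise <- rbind(placed, bkg)
```

Applying scale before rotation and rotation before the shift keeps each cluster centred on its row of the `loc` matrix.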

Examples

mobgau <- make_mobiusgau(n = c(1000, 500), p = 4)

mobgau
# A tibble: 1,500 × 5
       x1      x2      x3      x4 cluster 
    <dbl>   <dbl>   <dbl>   <dbl> <chr>   
1 -0.109  -0.0144  0.108  -0.0820 cluster2
2 -0.450  -0.650  -0.122   0.0414 cluster1
3 -0.496  -0.411   0.230   0.0438 cluster1
4 -0.0147  0.0454 -0.0634 -0.0633 cluster2
5  0.370  -0.539   0.0263  0.0287 cluster1
6 -0.0122 -0.690   0.0270  0.0710 cluster1
# ℹ 1,494 more rows

loc_matrix <- matrix(
  c(0, 0, 0, 0,
    5, 9, 0, 0, 
    3, 4, 10, 7
 ), nrow = 3, byrow = TRUE)

multigau <- make_multigau(n = c(300, 200, 500), p = 4, k = 3,
                          loc = loc_matrix, scale = c(0.2, 1.5, 0.5))

multigau
# A tibble: 1,000 × 5
      x1    x2      x3      x4 cluster 
   <dbl> <dbl>   <dbl>   <dbl> <chr>   
1 2.89   4.13  10.1     7.15   cluster3
2 3.06   3.98   9.92    7.00   cluster3
3 0.0125 0.144 -0.0449  0.0189 cluster1
4 3.01   4.12   9.94    6.99   cluster3
5 5.46   9.39  -0.598  -0.380  cluster2
6 2.83   4.05  10.1     6.78   cluster3
# ℹ 994 more rows
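
For readers who want to see what a multi-Gaussian example involves, the following base R sketch builds a comparable dataset by hand. It is not the code behind `make_multigau`: the helper `sketch_multigau` is hypothetical, and treating `scale` as the per-cluster standard deviation is an assumption made for illustration.

```r
# Hypothetical helper (not cardinalR code): Gaussian clusters centred on the
# rows of `loc`, with `scale` assumed to be the per-cluster standard deviation.
sketch_multigau <- function(n, loc, scale) {
  p <- ncol(loc)
  parts <- lapply(seq_along(n), function(i) {
    x <- matrix(rnorm(n[i] * p, sd = scale[i]), ncol = p)
    x <- sweep(x, 2, loc[i, ], "+")            # shift to the cluster centre
    colnames(x) <- paste0("x", seq_len(p))
    data.frame(x, cluster = paste0("cluster", i))
  })
  out <- do.call(rbind, parts)
  out <- out[sample(nrow(out)), ]              # shuffle so clusters interleave
  rownames(out) <- NULL
  out
}

set.seed(3)
centres <- matrix(c(0, 0, 0, 0,
                    5, 9, 0, 0,
                    3, 4, 10, 7), nrow = 3, byrow = TRUE)
g <- sketch_multigau(n = c(300, 200, 500), loc = centres,
                     scale = c(0.2, 1.5, 0.5))
table(g$cluster)
```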

Application

positions <- geozoo::simplex(p = 4)$points
positions <- positions * 0.8

## To generate data
five_clusts <- gen_multicluster(n = c(2250, 1500, 750, 1250, 1750), 
                                p = 4, k = 5, loc = positions,
                                scale = c(0.4, 0.35, 0.3, 1, 0.3),
                                shape = c("helicalspiral", 
                                          "hemisphere", "unifcube", 
                                          "cone", "gaussian"),
                                rotation = NULL,
                                is_bkg = FALSE)

Data

Dimension reduction layouts

  a. tSNE, b. UMAP, c. PHATE, d. TriMAP, e. PaCMAP, and f. PCA.
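
Of the six layouts, only PCA is available in base R, so it makes a convenient minimal sketch of how a 2-D layout is obtained from 4-D data. The toy data below are an assumption for illustration (two Gaussian clusters built by hand), not the five-cluster dataset from the slides; the other methods need additional packages and are not shown.

```r
# Toy 4-D data (an assumption for illustration): two Gaussian clusters.
set.seed(4)
a <- matrix(rnorm(200 * 4, sd = 0.3), ncol = 4)
b <- sweep(matrix(rnorm(200 * 4, sd = 0.3), ncol = 4), 2, c(3, 4, 10, 7), "+")
dat <- rbind(a, b)

# PCA with stats::prcomp; the first two principal components give the layout.
pca <- prcomp(dat, center = TRUE, scale. = FALSE)
layout2d <- pca$x[, 1:2]

# Proportion of variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
```

Because the generating structure is known, any distortion in the 2-D layout can be attributed to the method rather than to unknown features of the data.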

Summary

✨ Key features

  • 🔢 Flexible dimensions (p) and number of clusters (k)
  • 📍 Custom cluster locations with loc matrix
  • 🎨 Geometric variety: "pyrstar", "conicspiral", "hemisphere", etc.
  • 🌀 Optional rotation for complexity
  • 🌫️ Background noise (is_bkg = TRUE) to simulate real-world conditions
  • ⚖️ Individual cluster scales and shapes

🚀 cardinalR empowers researchers to:

  • 🔬 Simulate interpretable high-dimensional data structures
  • 🧪 Benchmark and compare NLDR and clustering methods
  • 🧭 Explore algorithm performance under controlled challenges
  • 🎯 Develop and validate new analytical tools
  • 🌍 Enable reproducible, cross-domain experimentation

💡 Build better algorithms by knowing what your data really looks like.

Jayani P.G. Lakshika


Collaborators: Prof Dianne Cook, Dr Paul Harrison, Dr Michael Lydeamore, Dr Thiyanga S. Talagala