library(tidyverse)
library(tidyr)
library(naniar)
library(stringr)
library(janitor)
library(plotly)
library(factoextra)
library(gridExtra)
library(kableExtra)
library(knitr)
library(tableone)
library(fmsb)
Now we have our tidy dataset, superheroes. For this dataset, we draw box plots of the numerical variables intelligence, strength, speed, durability, power, combat, height_cm, and weight_kg, which help us understand their distributions. We also analyze the categorical variables publisher, alignment, gender, eye_color, and hair_color. At the end of this section, we perform a clustering analysis of all heroes without missing data and categorize them into three groups: the balanced group, the strength group, and the mediocre group.
superheros <- read_csv("data/superheroes.csv")
# Box plot
numeric_vars <- superheros |>
select(intelligence, strength, speed, durability, power, combat, height_cm, weight_kg)
plots <- lapply(names(numeric_vars), function(var) {
plot_ly(y = numeric_vars[[var]], type = "box", name = var)
})
boxplot_combined <- subplot(plots, nrows = 3, margin = 0.05)
boxplot_combined
# Correlation Matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
cor_heatmap <- plot_ly(
z = cor_matrix,
x = colnames(cor_matrix),
y = colnames(cor_matrix),
type = "heatmap",
colors = colorRampPalette(c("lightblue", "blue", "darkblue"))(100)
) %>%
layout(
title = "Correlation Matrix Heatmap",
xaxis = list(title = "Variables"),
yaxis = list(title = "Variables")
)
cor_heatmap
The first figure shows the distribution of different variables using box plots to visualize the statistical characteristics of multiple superhero attributes.
Intelligence: The median is around 50, with most values concentrated between 25 and 75, and some low outliers.
Strength: The distribution is fairly even, with almost no outliers. The median is approximately 50, and the range spans from 0 to 100.
Speed: Most data points are concentrated in the middle range, with a median slightly above 50, and no obvious outliers.
Durability: The median is around 50, and the overall distribution is even, ranging from 0 to 100, with almost no outliers.
Power: The median is close to 50, with a wide range of values. The overall distribution is relatively symmetric.
Combat: The median is near 50, and the data has a wide range, with an even distribution and no clear outliers.
Height (cm): Height shows considerable variation, with some significant outliers, particularly above 500 cm, indicating some extremely tall superheroes. The median is below 200 cm, with most data points below 500 cm.
Weight (kg): The median weight is around 100 kg. There are some very large outliers (over 400 kg), with most values concentrated below 200 kg.
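The outliers mentioned above are the points drawn beyond the box plot whiskers. As a minimal sketch of the usual 1.5 × IQR rule, the base R code below flags such points in a small made-up height vector. The values in toy_height are invented for illustration, and plotly may compute quartiles slightly differently from either method shown here.

```r
# Toy height values in cm (made up for illustration, not taken from the dataset).
toy_height <- c(165, 170, 175, 178, 180, 183, 185, 188, 190, 701)

# boxplot.stats() returns the box plot summary; $out holds points beyond the whiskers.
stats <- boxplot.stats(toy_height, coef = 1.5)

# Manual version of the same rule, using quartiles from quantile().
q <- quantile(toy_height, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
is_outlier <- toy_height < q[[1]] - 1.5 * iqr | toy_height > q[[2]] + 1.5 * iqr

stats$out               # outliers according to boxplot.stats()
toy_height[is_outlier]  # outliers according to the manual rule
```

Both approaches flag only the 701 cm value here, although they use slightly different quartile definitions and can disagree on borderline points.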
The second figure is a heatmap of the correlation matrix between the superhero attributes: intelligence, strength, speed, durability, power, combat, height (in cm), and weight (in kg). The color gradient ranges from light blue to dark blue, where darker shades indicate a higher correlation between two variables. Strength and durability are strongly correlated (0.65), which suggests that superheroes who are strong also tend to be durable. Height and weight are also strongly correlated (0.69), indicating that taller superheroes tend to be heavier.
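One detail worth keeping in mind when reading these correlations: cor() with use = "complete.obs" drops every row that has a missing value in any column, so all coefficients are computed on the same subset of heroes. The sketch below shows this on a made-up data frame; the toy values and the name toy are assumptions for illustration only.

```r
# Toy data with a few missing values (invented for illustration).
toy <- data.frame(
  strength   = c(10, 40, 55, 80, 95, NA),
  durability = c(20, 35, 60, 85, 90, 50),
  speed      = c(30, NA, 45, 50, 70, 60)
)

# Listwise deletion: only rows complete in every column are used.
cor_toy <- cor(toy, use = "complete.obs")

# Number of rows that actually enter the calculation.
n_used <- sum(complete.cases(toy))

round(cor_toy, 2)
n_used
```

If the goal were to use as many observations as possible for each pair of variables, use = "pairwise.complete.obs" would be the alternative, at the cost of each coefficient being based on a different subset.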
publisher_counts <- superheros |>
count(publisher) |>
arrange(desc(n))
publisher_bar <- plot_ly(publisher_counts, x = ~reorder(publisher, -n), y = ~n, type = 'bar', marker = list(color = '#aec7e8')) |>
layout(title = "Publisher Distribution (Bar Chart - Sorted by Count)",
xaxis = list(title = "Publisher"),
yaxis = list(title = "Count"))
publisher_bar
This bar chart represents the distribution of superheroes by publisher, sorted by the number of heroes. The chart shows that Marvel Comics and DC Comics are by far the most dominant publishers, with Marvel having the highest count at 339 heroes, followed by DC Comics with 188 heroes.
alignment_counts <- superheros |>
count(alignment)
alignment_pie <- plot_ly(alignment_counts, labels = ~alignment, values = ~n, type = 'pie',
marker = list(colors = c('#98df8a', '#ffbb78', '#9edae5', '#f7b6d2'))) |>
layout(title = "Alignment Distribution",
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
alignment_pie
This pie chart illustrates the distribution of superhero alignments. The majority of superheroes, about 67.6%, fall under the “Good” category, represented in orange. The second largest segment is the “Bad” alignment, accounting for 28.2%, shown in green. The “Neutral” category, depicted in light blue, represents 3.28% of the characters. Lastly, a tiny slice, labeled as “null” in pink, makes up 0.958% of the characters.
marvel_alignment <- superheros |>
filter(publisher == "Marvel Comics") |>
count(alignment)
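# Note: the filter uses "Dc Comics", presumably because publisher names were
# title-cased during data cleaning (hero names such as "Flash Ii" suggest this).
dc_alignment <- superheros |>
filter(publisher == "Dc Comics") |>
count(alignment)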
# Plot alignment distribution for Marvel Comics
marvel_plot <- plot_ly(marvel_alignment, labels = ~alignment, values = ~n, type = 'pie',
marker = list(colors = c('#aec7e8', '#ffbb78', '#98df8a', '#f4cccc'))) |>
layout(title = "Alignment Distribution - Marvel Comics",
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
# Plot alignment distribution for DC Comics
dc_plot <- plot_ly(dc_alignment, labels = ~alignment, values = ~n, type = 'pie',
marker = list(colors = c('#c5b0d5', '#c49c94', '#f9cb9c', '#d9d9d9'))) |>
layout(title = "Alignment Distribution - DC Comics",
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
marvel_plot
dc_plot
Since Marvel and DC are the most popular publishers, we checked the distributions of superhero alignments within these two publishers. The two pie charts show distributions similar to the overall alignment distribution.
eye_color_counts <- superheros |>
count(eye_color)
eye_color_bar <- plot_ly(eye_color_counts,
x = ~n,
y = ~reorder(eye_color, n),
type = 'bar',
orientation = 'h',
marker = list(color = "#dbdb8d")) |>
layout(title = "Eye Color Distribution",
xaxis = list(title = "Count"),
yaxis = list(title = "Eye Color"))
hair_color_counts <- superheros |>
count(hair_color)
hair_color_bar <- plot_ly(hair_color_counts,
x = ~n,
y = ~reorder(hair_color, n),
type = 'bar',
orientation = 'h',
marker = list(color = "#c7c7c7")) |>
layout(title = "Hair Color Distribution",
xaxis = list(title = "Count"),
yaxis = list(title = "Hair Color"))
eye_color_bar
hair_color_bar
This figure contains two bar charts displaying the distribution of superhero eye colors and hair colors. Overall, blue eyes and black hair are the most common features among superheroes.
superheros1 <- superheros |>
filter(!is.na(alignment), !is.na(gender))
vars <- c("intelligence", "strength", "speed", "durability", "power", "combat", "height_cm", "weight_kg")
alignment_summary <- CreateTableOne(vars = vars, strata = "alignment", data = superheros1, test = TRUE)
alignment_summary_df <- print(alignment_summary, quote = FALSE, noSpaces = TRUE, printToggle = FALSE)
alignment_summary_df |> as.data.frame() |>
rename(`p-value` = p) |>
mutate(test = "F-test") |>
kable()
variable | Bad | Good | Neutral | p-value | test |
---|---|---|---|---|---|
n | 199 | 475 | 22 | | F-test |
intelligence (mean (SD)) | 68.21 (21.80) | 63.09 (18.66) | 59.77 (17.49) | 0.012 | F-test |
strength (mean (SD)) | 48.77 (32.95) | 40.89 (32.23) | 54.14 (39.52) | 0.009 | F-test |
speed (mean (SD)) | 38.86 (23.55) | 39.92 (23.95) | 50.77 (28.77) | 0.092 | F-test |
durability (mean (SD)) | 63.02 (30.52) | 55.96 (29.12) | 70.55 (31.72) | 0.007 | F-test |
power (mean (SD)) | 66.48 (29.17) | 62.79 (29.57) | 73.82 (31.33) | 0.129 | F-test |
combat (mean (SD)) | 62.94 (22.42) | 61.73 (23.27) | 60.59 (23.34) | 0.818 | F-test |
height_cm (mean (SD)) | 191.61 (26.83) | 184.19 (57.20) | 237.41 (166.78) | 0.001 | F-test |
weight_kg (mean (SD)) | 140.12 (119.62) | 95.53 (80.17) | 209.50 (223.86) | <0.001 | F-test |
gender_summary <- CreateTableOne(vars = vars, strata = "gender", data = superheros1, test = TRUE)
gender_summary_df <- print(gender_summary, quote = FALSE, noSpaces = TRUE, printToggle = FALSE)
gender_summary_df |> as.data.frame() |>
rename(`p-value` = p) |>
mutate(test = "T-test") |>
kable()
variable | Female | Male | p-value | test |
---|---|---|---|---|
n | 200 | 496 | | T-test |
intelligence (mean (SD)) | 62.46 (15.21) | 65.27 (21.20) | 0.135 | T-test |
strength (mean (SD)) | 38.36 (32.53) | 45.75 (32.88) | 0.013 | T-test |
speed (mean (SD)) | 37.03 (21.17) | 41.21 (25.07) | 0.069 | T-test |
durability (mean (SD)) | 49.82 (28.06) | 62.10 (29.87) | <0.001 | T-test |
power (mean (SD)) | 61.44 (26.49) | 65.47 (30.66) | 0.153 | T-test |
combat (mean (SD)) | 60.65 (21.77) | 62.59 (23.45) | 0.376 | T-test |
height_cm (mean (SD)) | 175.54 (21.88) | 193.35 (67.45) | 0.002 | T-test |
weight_kg (mean (SD)) | 78.84 (76.98) | 126.88 (110.70) | <0.001 | T-test |
These two tables summarize the numeric attributes of superheroes by alignment (“Bad,” “Good,” “Neutral”) and by gender (“Female,” “Male”). The mean and standard deviation are provided for each attribute. Among the alignment categories, the F-test p-values indicate significant differences (p < 0.05) in intelligence, strength, durability, height, and weight. In the gender comparison, significant differences are found in strength, durability, height, and weight, with males showing higher mean values. These statistical tests highlight notable variations in superhero traits based on alignment and gender.
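The two test labels describe closely related procedures. With exactly two groups and equal variances assumed, a one-way ANOVA F-test and a two-sample t-test give the same p-value. The sketch below checks this on simulated data using base R only. The data are random draws, not the superhero data, and whether tableone uses oneway.test with equal variances by default is an assumption to confirm in the package documentation.

```r
set.seed(1)

# Simulated two-group data (not the real dataset).
toy <- data.frame(
  gender   = rep(c("Female", "Male"), times = c(30, 40)),
  strength = c(rnorm(30, mean = 40, sd = 30), rnorm(40, mean = 46, sd = 30))
)

# One-way ANOVA F-test with equal variances assumed.
p_f <- oneway.test(strength ~ gender, data = toy, var.equal = TRUE)$p.value

# Student's two-sample t-test with equal variances assumed.
p_t <- t.test(strength ~ gender, data = toy, var.equal = TRUE)$p.value

c(F_test = p_f, t_test = p_t)
```

The two p-values agree because, for two groups, the F statistic is the square of the t statistic.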
In this section, we use K-means to cluster the superheroes into groups based on 8 numeric variables: intelligence, strength, speed, durability, power, combat, height_cm, and weight_kg. Any superhero with a missing value in one of these 8 variables is excluded from the clustering analysis, which leaves 428 superheroes. The variables are standardized before clustering so that those measured on larger scales, such as weight_kg, do not dominate the distance calculation.
numeric_vars <- superheros |>
select(intelligence, strength, speed, durability, power, combat, height_cm, weight_kg)
superheros_clean <- superheros[complete.cases(numeric_vars), ]
numeric_vars_clean <- superheros_clean |>
select(intelligence, strength, speed, durability, power, combat, height_cm, weight_kg)
numeric_vars_scaled <- scale(numeric_vars_clean)
set.seed(123)
kmeans_result_2 <- kmeans(numeric_vars_scaled, centers = 2, nstart = 25)
plot_2 <- fviz_cluster(kmeans_result_2, data = numeric_vars_scaled,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal()) +
labs(title = "k = 2")
kmeans_result_3 <- kmeans(numeric_vars_scaled, centers = 3, nstart = 25)
plot_3 <- fviz_cluster(kmeans_result_3, data = numeric_vars_scaled,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal()) +
labs(title = "k = 3")
kmeans_result_4 <- kmeans(numeric_vars_scaled, centers = 4, nstart = 25)
plot_4 <- fviz_cluster(kmeans_result_4, data = numeric_vars_scaled,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal()) +
labs(title = "k = 4")
grid.arrange(plot_2, plot_3, plot_4, ncol = 3)
We clustered the 428 superheroes into 2, 3, and 4 groups. Judging from the cluster plots, two clusters under-segment the heroes and four clusters over-segment them, so we take three clusters as the best fit.
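A visual comparison like this can be complemented by a numeric check. One common option, which the analysis above does not use, is the elbow method: compute the total within-cluster sum of squares for several values of k and look for the point where the decrease levels off. The sketch below does this with base R on synthetic data; the three simulated groups and the names toy and wss are assumptions for illustration.

```r
set.seed(123)

# Synthetic data: three well-separated groups in two dimensions.
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1, ..., 6.
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

data.frame(k = ks, wss = round(wss, 1))
```

In this toy example the within-cluster sum of squares drops sharply up to k = 3 and only slightly afterwards, which is the elbow pattern. Applying the same loop to numeric_vars_scaled would show whether the superhero data support k = 3 in the same way.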
This is the list of superheroes in each clustering group. Let’s see where your superhero is!
superheros_clean$cluster <- kmeans_result_3$cluster
superheros_clean |>
group_by(cluster) |>
summarise(heroes = paste(name, collapse = ", ")) |>
kable()
cluster | heroes |
---|---|
1 | Abe Sapien, Abin Sur, Absorbing Man, Agent Zero, Air-Walker, Alan Scott, Animal Man, Annihilus, Apocalypse, Aqualad, Aquaman, Arachne, Ardina, Ares, Atlas, Atlas, Aurora, Azazel, Battlestar, Beast, Beast Boy, Beta Ray Bill, Big Barda, Bishop, Bizarro, Black Adam, Black Bolt, Black Manta, Blackout, Blob, Booster Gold, Brainiac, Cable, Cannonball, Captain Atom, Captain Britain, Captain Marvel, Captain Marvel, Captain Marvel Ii, Carnage, Century, Cheetah Iii, Citizen Steel, Cloak, Cyborg, Darth Vader, Deadman, Deadpool, Deathstroke, Doc Samson, Doctor Doom, Doctor Fate, Doctor Strange, Doppelganger, Elastigirl, Emma Frost, Etrigan, Evil Deadpool, Evilhawk, Exodus, Firelord, Firestorm, Flash, Flash Ii, Flash Iii, Flash Iv, Franklin Richards, Frenzy, Gamora, Ghost Rider, Gladiator, Goku, Gorilla Grodd, Guy Gardner, Hal Jordan, Hawkgirl, Hela, Hercules, Human Torch, Husk, Hybrid, Hyperion, Iceman, Impulse, Invisible Woman, Iron Fist, Iron Man, Jack Of Hearts, John Stewart, Kang, Klaw, Krypto, Kyle Rayner, Lady Deathstrike, Legion, Lex Luthor, Lizard, Loki, Luke Cage, Mach-Iv, Magneto, Man-Thing, Mantis, Martian Manhunter, Marvel Girl, Maxima, Mephisto, Mera, Metallo, Mimic, Miss Martian, Mister Sinister, Molten Man, Moonstone, Mr Incredible, Namor, Namora, Namorita, Naruto Uzumaki, Northstar, Nova, Nova, Odin, One Punch Man, Phoenix, Plastic Man, Polaris, Power Girl, Predator, Quicksilver, Ray, Red Tornado, Sabretooth, Sandman, Scarlet Spider, Scarlet Witch, Sentry, Shadow King, Shatterstar, She-Thing, Sif, Silver Surfer, Sinestro, Siren, Skaar, Spider-Girl, Spider-Gwen, Spider-Man, Spider-Woman, Starfire, Stargirl, Steel, Steppenwolf, Superboy, Superboy-Prime, Supergirl, Superman, T-1000, Thing, Thor, Thor Girl, Thunderstrike, Thundra, Tiger Shark, Toxin, Toxin, Triton, Ultragirl, Vegeta, Venom, Venom Ii, Vindicator, Vision, War Machine, Warlock, Warpath, Wolverine, Wonder Girl, Wonder Man, Wonder Woman, X-23, X-Man, Zoom |
2 | A-Bomb, Abomination, Alien, Amazo, Anti-Venom, Bloodaxe, Colossus, Darkseid, Destroyer, Doomsday, Drax The Destroyer, Hellboy, Hulk, Juggernaut, Killer Croc, Kilowog, Living Brain, Lobo, Machine Man, Modok, Onslaught, Red Hulk, Rhino, Sasquatch, Scorpion, She-Hulk, Solomon Grundy, Spawn, Thanos, Ultron, Venom Iii, Wolfsbane |
3 | Adam Strange, Agent Bob, Ajax, Alfred Pennyworth, Angel, Angel Dust, Angel Salvadore, Ant-Man, Ant-Man Ii, Archangel, Arclight, Ariel, Armor, Atom Girl, Atom Ii, Bane, Banshee, Bantam, Batgirl, Batgirl Iv, Batgirl Vi, Batman, Batman, Batman Ii, Big Man, Black Canary, Black Canary, Black Cat, Black Knight Iii, Black Lightning, Black Mamba, Black Panther, Black Widow, Blackwing, Blackwulf, Blade, Bling!, Blink, Blizzard Ii, Boom-Boom, Brainiac 5, Buffy, Bullseye, Bumblebee, Callisto, Captain America, Catwoman, Chamber, Changeling, Cheetah, Cheetah Ii, Clock King, Copycat, Cottonmouth, Crystal, Cyclops, Daredevil, Darkhawk, Darkstar, Dash, Dazzler, Deadshot, Deathlok, Demogoblin, Diamondback, Doctor Octopus, Domino, Electro, Elektra, Elongated Man, Enchantress, Falcon, Feral, Firebird, Firestar, Forge, Franklin Storm, Gambit, Goblin Queen, Gravity, Green Arrow, Green Goblin, Green Goblin Ii, Han Solo, Harley Quinn, Havok, Hawk, Hawkeye, Hawkeye Ii, Heat Wave, Hellcat, Hope Summers, Huntress, Hydro-Man, Indiana Jones, Ink, Jack-Jack, James T. Kirk, Jean Grey, Jennifer Kale, Jessica Jones, John Wraith, Joker, Jolt, Jubilee, Justice, Kingpin, Kraven Ii, Kraven The Hunter, Leader, Light Lass, Lightning Lad, Lightning Lord, Longshot, Luke Skywalker, Man-Wolf, Mandarin, Maverick, Medusa, Meltdown, Metron, Micro Lad, Mister Fantastic, Mister Freeze, Mockingbird, Moon Knight, Morlun, Moses Magnum, Mr Immortal, Ms Marvel Ii, Multiple Man, Mysterio, Mystique, Nebula, Nick Fury, Nightcrawler, Nightwing, Oracle, Paul Blart, Penguin, Phantom Girl, Plantman, Plastique, Poison Ivy, Professor X, Professor Zoom, Psylocke, Punisher, Purple Man, Pyro, Question, Quill, Ra’s Al Ghul, Rambo, Raven, Red Arrow, Red Hood, Red Robin, Red Skull, Rick Flag, Robin, Robin Ii, Robin Iii, Robin V, Rocket Raccoon, Rogue, Ronin, Rorschach, Sage, Scarecrow, Scarlet Spider Ii, Shadow Lass, Shadowcat, Shang-Chi, Shocker, Shriek, Silverclaw, Siryn, Snowbird, Songbird, Space Ghost, Spider-Woman Iii, Spock, Spyke, Star-Lord, Static, Storm, Sunspot, Swarm, Synch, Taskmaster, Tempest, The Comedian, Thunderbird, Tigra, Tinkerer, Toad, Triplicate Girl, Two-Face, Vanisher, Vibe, Violet Parr, Vixen, Vulture, Walrus, Warp, Wasp, Winter Soldier, Wyatt Wingfoot, Yellowjacket, Yellowjacket Ii, Yoda, Zatanna |
We find that Iron Man, the all-around hero we know, is placed in the first group, strength heroes such as the Hulk are placed in the second group, and mortal heroes such as Captain America are placed in the third group.
superheros_clean |>
group_by(cluster) |>
summarise(
intelligence_mean = mean(intelligence, na.rm = TRUE),
intelligence_sd = sd(intelligence, na.rm = TRUE),
strength_mean = mean(strength, na.rm = TRUE),
strength_sd = sd(strength, na.rm = TRUE),
speed_mean = mean(speed, na.rm = TRUE),
speed_sd = sd(speed, na.rm = TRUE),
durability_mean = mean(durability, na.rm = TRUE),
durability_sd = sd(durability, na.rm = TRUE),
power_mean = mean(power, na.rm = TRUE),
power_sd = sd(power, na.rm = TRUE),
combat_mean = mean(combat, na.rm = TRUE),
combat_sd = sd(combat, na.rm = TRUE),
height_mean = mean(height_cm, na.rm = TRUE),
height_sd = sd(height_cm, na.rm = TRUE),
weight_mean = mean(weight_kg, na.rm = TRUE),
weight_sd = sd(weight_kg, na.rm = TRUE)
) |>
pivot_longer(-cluster, names_to = c("variable", ".value"), names_sep = "_") |>
mutate(mean_sd = paste0(round(mean, 2), " (", round(sd, 2), ")")) |>
select(variable, cluster, mean_sd) |>
pivot_wider(names_from = cluster, values_from = mean_sd, names_prefix = "cluster") |>
kable()
variable | cluster1 | cluster2 | cluster3 |
---|---|---|---|
intelligence | 67.3 (18.18) | 64.16 (22.83) | 62.77 (19.05) |
strength | 63.93 (28.43) | 76.31 (26.18) | 19.15 (15.04) |
speed | 54.56 (25.02) | 43.06 (17.41) | 28.38 (14.3) |
durability | 80.62 (17.83) | 85.72 (19.8) | 38.65 (19.79) |
power | 81.31 (23.54) | 71.81 (28.23) | 49.62 (25.07) |
combat | 68.07 (19.15) | 69.47 (20.79) | 59.68 (24.07) |
height | 184.7 (15.56) | 238.72 (46.7) | 175.79 (15.58) |
weight | 111.26 (56.52) | 412.97 (180.88) | 74.5 (24.59) |
We then calculated the mean and standard deviation of the six capability values and two physical values for each cluster to dig into our clustering result, as shown in the table above. The first group of heroes, such as Iron Man, is more versatile (“They are Hexagonal Warriors”), so we call it the “Balanced Group”. The second group has more strength and durability and is far heavier on average (412.97 kg), such as Hulk: “Smash! Smash! Smash!” We call it the “Strength Group”. The third group, who are mostly mortals, has markedly lower strength, speed, durability, and power than the first two groups, while its intelligence is comparable. But they are also the leaders of the team, and their tenacity is infectious to the whole team, as in the case of Captain America. We call the third group the “Mediocre Group”.
Below, radar charts visualize the six capability values of each group from the previous section!
cluster_averages <- superheros_clean |>
group_by(cluster) |>
summarise(
intelligence = mean(intelligence, na.rm = TRUE),
strength = mean(strength, na.rm = TRUE),
speed = mean(speed, na.rm = TRUE),
durability = mean(durability, na.rm = TRUE),
power = mean(power, na.rm = TRUE),
combat = mean(combat, na.rm = TRUE)
)
radar_data <- as.data.frame(cluster_averages)
radar_data <- radar_data[, -1]
radar_data <- rbind(rep(100, 6), rep(0, 6), radar_data)
colors_border <- c('#f08080', '#90ee90', '#87cefa')
colors_in <- c('#ffcccb', '#c6e7c6', '#b0e0e6')
radarchart(radar_data, axistype = 2,
pcol = colors_border,
pfcol = adjustcolor(colors_in, alpha.f = 0.6), plwd = 2.5, plty = 1,
cglcol = "grey", cglty = 1,
axislabcol = "black")
legend(x = "topright", legend = c('Balanced Group', 'Strength Group', 'Mediocre Group'), pch = 20, col = colors_border, text.col = "black", cex = 0.8, pt.cex = 1)