library(tidyverse)
library(naniar)
library(stringr)
library(janitor)
library(dplyr)
library(plotly)
library(factoextra)
library(gridExtra)
library(kableExtra)
library(knitr)
library(tableone)
library(fmsb)
library(modelr)
library(mice)
library(mgcv)
library(randomForest)
library(readr)
library(patchwork)
library(car)
set.seed(1)
The superhero universe is rich with complex characters who embody a wide range of skills, powers, and moral alignments. This project aims to explore the attributes and dynamics of superheroes and villains, drawing comparisons between their strengths, weaknesses, affiliations, and other characteristics. Finally, We will build an interactive tool where users can input their desired weight of each capability, such as intelligence, strength, or power, to find heroes and villains that match their preferences. This feature will allow superhero comic fanatics to explore characters based on personalized criteria, creating a more engaging and customizable experience.
Our initial goal was broad: to compare superheroes across multiple universes and identify ones that resonate with individual preferences.
As we delved deeper into the data set and explored its diverse dimensions, our questions became more focused:
Are there correlations among superheroes’ power dimensions or
physical characteristics?
Do different publishers have preferences in creating characters with specific alignments or characteristics?
How do power levels vary across characters with different alignments?
Can we cluster superheroes based on their numeric attributes, and what insights can these clusters provide?
What are the key factors that determine a superhero’s overall power?
All of those questions drive our analysis and help to satisfy our curiosity.
Our data source is from kaggle
website. This dataset covers detailed information on 731 superheroes
and villains from various comic universes. It includes a wide range of
attributes such as capability statistics, biographical information,
physical appearance, and affiliations. The data was collected using the
SuperHero API.The original dataset is called
superheroes_data
.
superheroes_df <- read_csv(
"./data/superheroes_data.csv",
na = c("NA", ".", "", "null", "-")
) |>
janitor::clean_names() |>
mutate(
height_cm = str_extract(height, "\\d+(?= cm)") |> as.numeric(),
height_cm = if_else(height_cm == 0, NA_real_, height_cm),
weight_kg = str_extract(weight, "\\d+(?= kg)") |> as.numeric(),
weight_kg = if_else(weight_kg == 0, NA_real_, weight_kg),
hair_color = if_else(str_detect(hair_color, "/"), "Dual Color", hair_color),
eye_color = str_remove_all(eye_color, "\\(.*?\\)") |> str_trim(),
eye_color = if_else(str_detect(eye_color, "/"), "Dual Color", eye_color)
) |>
mutate(across(
where(is.character) & !any_of("url"),
~ str_to_title(.)
)) |>
dplyr::select(
-aliases, -height, -weight, -base, -occupation, -group_affiliation, -relatives, -alter_egos, -place_of_birth, -first_appearance, -race
)
superheros <- superheroes_df
publisher_counts <- superheros |>
group_by(publisher) |>
summarise(hero_count = n()) |>
arrange(desc(hero_count))
single_hero_publishers <- filter(publisher_counts, hero_count == 1) |>
pull(publisher)
superheros <- superheros |>
mutate(publisher = ifelse(publisher %in% single_hero_publishers, "Others", publisher))
alignment_counts <- superheros |>
group_by(alignment) |>
summarise(hero_count = n()) |>
arrange(desc(hero_count))
gender_counts <- superheros |>
group_by(gender) |>
summarise(hero_count = n()) |>
arrange(desc(hero_count))
eye_color_counts <- superheros |>
group_by(eye_color) |>
summarise(hero_count = n()) |>
arrange(desc(hero_count))
single_hero_eye_color <- filter(eye_color_counts, hero_count == 1) |>
pull(eye_color)
superheros <- superheros |>
mutate(eye_color = ifelse(eye_color %in% single_hero_eye_color, "Others", eye_color))
hair_color_counts <- superheros |>
group_by(hair_color) |>
summarise(hero_count = n()) |>
arrange(desc(hero_count))
single_hero_hair_color <- filter(hair_color_counts, hero_count == 1) |>
pull(hair_color)
superheros <- superheros |>
mutate(hair_color = ifelse(hair_color %in% single_hero_hair_color, "Others", hair_color))
However, we are not interested in each variables in
superheroes_data
dataset. In addition, some of the data in
this original dataset is not readable and analyzable. As a result, we
conducted the following data cleaning process:
When importing data, we set “NA”, “.”, ““,”null”, and “-” as missing values;
We harmonized the height of heroes in cm and their weight in kg;
We marked heroes with a height and weight of 0 as missing values: they are meaningless;
If a hero has two hair colors or two eye colors (e.g. Blue/Yellow), we marked this color as ‘Dual Color’;
We also reclassified some categorical variables. Specifically, if
a publisher
contains only one hero, then the publisher to
which this hero belongs will be noted as ‘Others’. We did the same for
the variables alignment
, eye_color
, and
hair_color
.
We selected only the variables we were interested and deleted
aliases
, base
, occupation
,
group_affiliation
, relatives
,
alter_egos
, place_of_birth
,
first_appearance
, race
. These variables are
documented in a way that is difficult to read and difficult to
clean.
The cleaned dataset is tidy, readable and analyzable. It includes the following variables:
id
: A unique identifier for each character.
name
: The superhero’s alias or code name.
intelligence
: A numerical representation of the
character’s intelligence level.
strength
: A numerical value representing the character’s
physical strength.
speed
: A numerical representation of how fast the
character can move.
durability
: A measure of the character’s resilience and
ability to withstand damage.
power
: A numerical value representing the character’s
overall power or abilities.
combat
: A score depicting the character’s combat skills
and experience.
full-name
: The character’s real or full name, as opposed
to their superhero alias.
publisher
: The company responsible for creating and
publishing the character.
alignment
: Whether the character is good, evil, or
neutral.
gender
: The gender of the character.
height_cm
: The character’s height, given in
centimeters.
weight_kg
: The character’s weight, provided in
kilograms.
eye-color
: The color of the character’s eyes.
hair-color
: The color of the character’s hair.
url
: A link to an image of the character or more
detailed information.
superheros <- read_csv("./data/superheroes.csv")
# Box plot
numeric_vars <- superheros |>
dplyr::select(intelligence, strength, speed, durability, power, combat, height_cm, weight_kg)
plots <- lapply(names(numeric_vars), function(var) {
plot_ly(y = numeric_vars[[var]], type = "box", name = var)
})
boxplot_combined <- subplot(plots, nrows = 3, margin = 0.05)
boxplot_combined
# Correlation Matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
cor_heatmap <- plot_ly(
z = cor_matrix,
x = colnames(cor_matrix),
y = colnames(cor_matrix),
type = "heatmap",
colors = colorRampPalette(c("lightblue", "blue", "darkblue"))(100)
) %>%
layout(
title = "Correlation Matrix Heatmap",
xaxis = list(title = "Variables"),
yaxis = list(title = "Variables")
)
cor_heatmap
The numerical data in the superhero dataset reveals detailed insights into the attributes of various characters.
Intelligence values range from 6 to 100, with a median of 63, and an interquartile range (IQR) from 50 to 75, indicating that most superheroes possess above-average intelligence, though there are a few outliers below 19.
Strength varies widely from 4 to 100, with a median of 35, and an IQR spanning 10 to 80, suggesting high variability in physical power among characters.
Speed also shows considerable variation, ranging from 1 to 100, with a median of 33 and an upper fence at 88, highlighting the presence of faster-than-average individuals as outliers.
Durability, which measures resilience, ranges from 1 to 110, with a median of 60 and an IQR from 30 to 90, suggesting that durability is skewed slightly higher for many characters.
Power spans from 0 to 100, with a median of 63.5 and a compact IQR between 38 and 100, indicating that many characters have significant abilities, although some have very low power levels.
Combat skills range from 5 to 101, with a median of 64 and an IQR from 42 to 80, suggesting that fighting ability is moderately high for most characters.
Physical traits exhibit greater variation. Height ranges from a minimum of 64 cm to an extreme outlier of 975 cm, with a median of 183 cm and an IQR from 175 to 191 cm.
Similarly, weight spans from 14 kg to an outlier of 900 kg, with a median of 81 kg and an IQR between 61.75 and 108 kg. The correlation between the weight and the height is going to be examed later.
The correlation matrix heatmap provides a comprehensive view of the relationships between the numerical variables in the superhero dataset. Notably, weight (kg) and height (cm) show a strong positive correlation, indicating that taller superheroes tend to weigh more, which aligns with physical expectations.
Strength is positively correlated with power and durability, suggesting that characters with greater strength often exhibit higher power and durability. Moreover, power and durability also have a noticeable positive correlation themselves, reflecting that superheroes with greater endurance often possess high energy-based abilities together.
Combat shows moderate positive correlations with intelligence and power, implying that characters proficient in combat tend to be both strategically intelligent and powerful. Speed, however, exhibits weaker correlations with other attributes, especially height, indicating it is a more distinct characteristic.
publisher_counts <- superheros |>
count(publisher) |>
arrange(desc(n))
publisher_bar <- plot_ly(publisher_counts, x = ~reorder(publisher, -n), y = ~n, type = 'bar', marker = list(color = '#aec7e8')) |>
layout(title = "Publisher Distribution (Bar Chart - Sorted by Count)",
xaxis = list(title = "Publisher"),
yaxis = list(title = "Count"))
publisher_bar
alignment_counts <- superheros |>
count(alignment)
alignment_pie <- plot_ly(alignment_counts, labels = ~alignment, values = ~n, type = 'pie',
marker = list(colors = c('#98df8a', '#ffbb78', '#9edae5', '#f7b6d2'))) |>
layout(title = "Alignment Distribution",
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
alignment_pie
marvel_alignment <- superheros |>
filter(publisher == "Marvel Comics") |>
count(alignment)
dc_alignment <- superheros |>
filter(publisher == "Dc Comics") |>
count(alignment)
# Plot alignment distribution for Marvel Comics
marvel_plot <- plot_ly(marvel_alignment, labels = ~alignment, values = ~n, type = 'pie',
marker = list(colors = c('#aec7e8', '#ffbb78', '#98df8a', '#f4cccc'))) |>
layout(title = "Alignment Distribution - Marvel Comics",
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
# Plot alignment distribution for DC Comics
dc_plot <- plot_ly(dc_alignment, labels = ~alignment, values = ~n, type = 'pie',
marker = list(colors = c('#c5b0d5', '#c49c94', '#f9cb9c', '#d9d9d9'))) |>
layout(title = "Alignment Distribution - DC Comics",
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
marvel_plot
dc_plot
The publisher distribution analysis reveals publishers of different characters. It is worth noting that Marvel Comics and DC Comics dominate the superhero dataset, collectively contributing the majority of characters. Marvel leads significantly with around 339 characters, followed by DC with approximately 188 characters. These two publishers are the most well-known in the superhero universe and represent the most iconic and expansive rosters. Due to their prominence and influence, further investigation will focus specifically on the data related to Marvel and DC characters to uncover patterns, differences, and trends unique to these two leading publishers.
In the complete dataset, 67.6% of characters are classified as good, indicating that most superheroes have a positive alignment. Characters with a bad alignment constitute 28.2%, while neutral characters make up only 3.28%, and less than 1% fall into the “null” or undefined category. Within Marvel, the proportion of good characters is slightly lower at 64.9%, suggesting a slightly more diverse representation of moral alignments compared to the overall dataset. Bad characters constitute 31.9%, indicating a larger share of villains in Marvel than neutral characters, which are only 2.95%. The “null” category is negligible at 0.295%. For DC Comics, 61.7% of characters are good, showing a similar but slightly smaller majority compared to Marvel. DC also has a significant portion of bad characters at 31.4%, closely mirroring Marvel’s percentage. However, DC has a higher proportion of neutral characters of 6.38%, suggesting a tendency to explore more morally ambiguous characters. The “null” category remains minimal at 0.532%.
In summary, both Marvel and DC emphasize good-aligned characters, but Marvel leans slightly more toward a traditional hero-villain dichotomy, while DC shows a higher representation of neutral or ambiguous characters. These differences reflect the nuanced storytelling and character development strategies employed by the two publishers.
eye_color_counts <- superheros |>
count(eye_color)
eye_color_bar <- plot_ly(eye_color_counts,
x = ~n,
y = ~reorder(eye_color, n),
type = 'bar',
orientation = 'h',
marker = list(color = "#dbdb8d")) |>
layout(title = "Eye Color Distribution",
xaxis = list(title = "Count"),
yaxis = list(title = "Eye Color"))
hair_color_counts <- superheros |>
count(hair_color)
hair_color_bar <- plot_ly(hair_color_counts,
x = ~n,
y = ~reorder(hair_color, n),
type = 'bar',
orientation = 'h',
marker = list(color = "#c7c7c7")) |>
layout(title = "Hair Color Distribution",
xaxis = list(title = "Count"),
yaxis = list(title = "Hair Color"))
eye_color_bar
hair_color_bar
The distribution of eye color and hair color among superheroes reveals interesting patterns in their physical characteristics:
The most common eye color is blue, observed in 225 characters, making it the dominant eye color in the dataset. This is followed by brown eyes with a significant count of 126, and green eyes rounding out the top three. Less common colors such as red, black, and yellow are also present, likely representing unique or supernatural traits among certain characters. Rare eye colors like purple, gold, and amber, as well as characters with dual-colored eyes, indicate a diverse spectrum of visual aesthetics.
For hair color, black is the most frequently observed, with 161 characters having this trait. Blond and brown hair are also highly common, 101 and 86 respectively, reflecting traditional human traits. Interestingly, the category “No Hair” is also prominent, suggesting a significant number of bald or non-human characters. Less common hair colors like red, white, and auburn add variety, while rare colors such as purple, green, and silver showcase the creative range in superhero designs. Unique colors like blue, yellow, and orange appear rarely, often associated with otherworldly or supernatural beings.
superheros1 <- superheros |>
filter(!is.na(alignment), !is.na(gender))
vars <- c("intelligence", "strength", "speed", "durability", "power", "combat", "height_cm", "weight_kg")
alignment_summary <- CreateTableOne(vars = vars, strata = "alignment", data = superheros1, test = TRUE)
alignment_summary_df <- print(alignment_summary, quote = FALSE, noSpaces = TRUE, printToggle = FALSE)
alignment_summary_df |> as.data.frame() |>
rename(`p-value` = p) |>
mutate(test = "F-test") |>
kable()
Bad | Good | Neutral | p-value | test | |
---|---|---|---|---|---|
n | 199 | 475 | 22 | F-test | |
intelligence (mean (SD)) | 68.21 (21.80) | 63.09 (18.66) | 59.77 (17.49) | 0.012 | F-test |
strength (mean (SD)) | 48.77 (32.95) | 40.89 (32.23) | 54.14 (39.52) | 0.009 | F-test |
speed (mean (SD)) | 38.86 (23.55) | 39.92 (23.95) | 50.77 (28.77) | 0.092 | F-test |
durability (mean (SD)) | 63.02 (30.52) | 55.96 (29.12) | 70.55 (31.72) | 0.007 | F-test |
power (mean (SD)) | 66.48 (29.17) | 62.79 (29.57) | 73.82 (31.33) | 0.129 | F-test |
combat (mean (SD)) | 62.94 (22.42) | 61.73 (23.27) | 60.59 (23.34) | 0.818 | F-test |
height_cm (mean (SD)) | 191.61 (26.83) | 184.19 (57.20) | 237.41 (166.78) | 0.001 | F-test |
weight_kg (mean (SD)) | 140.12 (119.62) | 95.53 (80.17) | 209.50 (223.86) | <0.001 | F-test |
gender_summary <- CreateTableOne(vars = vars, strata = "gender", data = superheros1, test = TRUE)
gender_summary_df <- print(gender_summary, quote = FALSE, noSpaces = TRUE, printToggle = FALSE)
gender_summary_df |> as.data.frame() |>
rename(`p-value` = p) |>
mutate(test = "T-test") |>
kable()
Female | Male | p-value | test | |
---|---|---|---|---|
n | 200 | 496 | T-test | |
intelligence (mean (SD)) | 62.46 (15.21) | 65.27 (21.20) | 0.135 | T-test |
strength (mean (SD)) | 38.36 (32.53) | 45.75 (32.88) | 0.013 | T-test |
speed (mean (SD)) | 37.03 (21.17) | 41.21 (25.07) | 0.069 | T-test |
durability (mean (SD)) | 49.82 (28.06) | 62.10 (29.87) | <0.001 | T-test |
power (mean (SD)) | 61.44 (26.49) | 65.47 (30.66) | 0.153 | T-test |
combat (mean (SD)) | 60.65 (21.77) | 62.59 (23.45) | 0.376 | T-test |
height_cm (mean (SD)) | 175.54 (21.88) | 193.35 (67.45) | 0.002 | T-test |
weight_kg (mean (SD)) | 78.84 (76.98) | 126.88 (110.70) | <0.001 | T-test |
This table summarizes the demographic characteristics of superheroes by alignment (“Bad,” “Good,” “Neutral”) and gender (“Female,” “Male”). The mean and standard deviation are provided for attributes such as intelligence, strength, speed, and others. Significant differences are observed among alignment categories for attributes like intelligence, strength, durability, height, and weight, as indicated by F-test p-values. In the gender comparison, significant differences are found in strength, durability, height, and weight, with males generally showing higher values. These statistical tests highlight notable variations in superhero traits based on alignment and gender.
Then, we performed a clustering analysis on superheroes using the
k-means algorithm. Initially, the attributes, intelligence
,
strength
, speed
, durability
,
power
, combat
, height_cm
, and
weight_kg
were extracted into a subset called
numeric_vars
. Missing values were removed from this subset
to ensure only complete cases (428 cases) were used in the clustering
analysis, which is crucial for obtaining reliable clustering results.
The cleaned data (numeric_vars_clean
) was then standardized
using the scale()
function, ensuring that all variables are
on the same scale and do not unduly influence the clustering due to
differences in magnitude.
The analysis involved applying the k-means algorithm with different
numbers of clusters (k = 2, k = 3, and k = 4) to the scaled data,
represented by numeric_vars_scaled
. The k-means algorithm
attempts to partition the data into k distinct clusters based on
minimizing within-cluster variances. The use of nstart = 25
indicates that the algorithm will be run 25 times with different initial
centroid configurations, and the best result (in terms of minimizing
within-cluster sum of squares) will be chosen. This approach helps avoid
local minima and provides a more robust clustering outcome. To help
users better understand the k-means, a flowchart about how this
algorithm works was attached here:
For each value of k (2, 3, and 4), the resulting clusters are
visualized using the fviz_cluster()
function, which plots
the superheroes based on their clustering assignment. The
geom = "point"
argument indicates that individual data
points are plotted, and the ellipse.type = "convex"
argument represents each cluster with a convex hull to visually
distinguish different groups.
numeric_vars <- superheros |>
dplyr::select(intelligence, strength, speed, durability, power, combat, height_cm, weight_kg)
superheros_clean <- superheros[complete.cases(numeric_vars), ]
numeric_vars_clean <- superheros_clean |>
dplyr::select(intelligence, strength, speed, durability, power, combat, height_cm, weight_kg)
numeric_vars_scaled <- scale(numeric_vars_clean)
set.seed(1)
kmeans_result_2 <- kmeans(numeric_vars_scaled, centers = 2, nstart = 25)
plot_2 <- fviz_cluster(kmeans_result_2, data = numeric_vars_scaled,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal()) +
labs(title = "k = 2")
kmeans_result_3 <- kmeans(numeric_vars_scaled, centers = 3, nstart = 25)
plot_3 <- fviz_cluster(kmeans_result_3, data = numeric_vars_scaled,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal()) +
labs(title = "k = 3")
kmeans_result_4 <- kmeans(numeric_vars_scaled, centers = 4, nstart = 25)
plot_4 <- fviz_cluster(kmeans_result_4, data = numeric_vars_scaled,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal()) +
labs(title = "k = 4")
grid.arrange(plot_2, plot_3, plot_4, ncol = 3)
The figure above shows the clustering analysis, where k-means method was applied with different values for the number of clusters (k=2,3,4).
In the first plot (k=2), the data points are grouped into two cluster. Compared to k = 3, two clusters are insufficient to adequately capture the diversity within the data, resulting in misclassification and poor separation between distinct groups.
In the third plot (k=4), the data is divided into four clusters. There is evidence of over-classification, with cluster 1 and cluster 3 overlapping to much with each other.
In the third plot (k=3), the clustering with three clusters appears to strike the right balance. The distinct groups are more clearly defined compared to k=2. Unlike k=4, there is no evidence of excessive fragmentation. This value, k=3, captures the major variations while maintaining well-separated and interpretable groups.
Here is the list of superheroes in each clustering group.You may find your favourite superhero here.
superheros_clean$cluster <- kmeans_result_3$cluster
superheros_clean |>
group_by(cluster) |>
summarise(heroes = paste(name, collapse = ", ")) |>
kable()
cluster | heroes |
---|---|
1 | Abe Sapien, Abin Sur, Absorbing Man, Agent Zero, Air-Walker, Alan Scott, Animal Man, Annihilus, Apocalypse, Aqualad, Aquaman, Arachne, Ardina, Ares, Atlas, Atlas, Aurora, Azazel, Battlestar, Beast, Beast Boy, Beta Ray Bill, Big Barda, Bishop, Bizarro, Black Adam, Black Bolt, Black Manta, Blackout, Blob, Booster Gold, Brainiac, Cable, Cannonball, Captain Atom, Captain Britain, Captain Marvel, Captain Marvel, Captain Marvel Ii, Carnage, Century, Cheetah Iii, Citizen Steel, Cloak, Cyborg, Darth Vader, Deadman, Deadpool, Deathstroke, Doc Samson, Doctor Doom, Doctor Fate, Doctor Strange, Doppelganger, Elastigirl, Emma Frost, Etrigan, Evil Deadpool, Evilhawk, Exodus, Firelord, Firestorm, Flash, Flash Ii, Flash Iii, Flash Iv, Franklin Richards, Frenzy, Gamora, Ghost Rider, Gladiator, Goku, Gorilla Grodd, Guy Gardner, Hal Jordan, Hawkgirl, Hela, Hercules, Human Torch, Husk, Hybrid, Hyperion, Iceman, Impulse, Invisible Woman, Iron Fist, Iron Man, Jack Of Hearts, John Stewart, Kang, Klaw, Krypto, Kyle Rayner, Lady Deathstrike, Legion, Lex Luthor, Lizard, Loki, Luke Cage, Mach-Iv, Magneto, Man-Thing, Mantis, Martian Manhunter, Marvel Girl, Maxima, Mephisto, Mera, Metallo, Mimic, Miss Martian, Mister Sinister, Molten Man, Moonstone, Mr Incredible, Namor, Namora, Namorita, Naruto Uzumaki, Northstar, Nova, Nova, Odin, One Punch Man, Phoenix, Plastic Man, Polaris, Power Girl, Predator, Quicksilver, Ray, Red Tornado, Sabretooth, Sandman, Scarlet Spider, Scarlet Witch, Sentry, Shadow King, Shatterstar, She-Thing, Sif, Silver Surfer, Sinestro, Siren, Skaar, Spider-Girl, Spider-Gwen, Spider-Man, Spider-Woman, Starfire, Stargirl, Steel, Steppenwolf, Superboy, Superboy-Prime, Supergirl, Superman, T-1000, Thing, Thor, Thor Girl, Thunderstrike, Thundra, Tiger Shark, Toxin, Toxin, Triton, Ultragirl, Vegeta, Venom, Venom Ii, Vindicator, Vision, War Machine, Warlock, Warpath, Wolverine, Wonder Girl, Wonder Man, Wonder Woman, X-23, X-Man, Zoom |
2 | A-Bomb, Abomination, Alien, Amazo, Anti-Venom, Bloodaxe, Colossus, Darkseid, Destroyer, Doomsday, Drax The Destroyer, Hellboy, Hulk, Juggernaut, Killer Croc, Kilowog, Living Brain, Lobo, Machine Man, Modok, Onslaught, Red Hulk, Rhino, Sasquatch, Scorpion, She-Hulk, Solomon Grundy, Spawn, Thanos, Ultron, Venom Iii, Wolfsbane |
3 | Adam Strange, Agent Bob, Ajax, Alfred Pennyworth, Angel, Angel Dust, Angel Salvadore, Ant-Man, Ant-Man Ii, Archangel, Arclight, Ariel, Armor, Atom Girl, Atom Ii, Bane, Banshee, Bantam, Batgirl, Batgirl Iv, Batgirl Vi, Batman, Batman, Batman Ii, Big Man, Black Canary, Black Canary, Black Cat, Black Knight Iii, Black Lightning, Black Mamba, Black Panther, Black Widow, Blackwing, Blackwulf, Blade, Bling!, Blink, Blizzard Ii, Boom-Boom, Brainiac 5, Buffy, Bullseye, Bumblebee, Callisto, Captain America, Catwoman, Chamber, Changeling, Cheetah, Cheetah Ii, Clock King, Copycat, Cottonmouth, Crystal, Cyclops, Daredevil, Darkhawk, Darkstar, Dash, Dazzler, Deadshot, Deathlok, Demogoblin, Diamondback, Doctor Octopus, Domino, Electro, Elektra, Elongated Man, Enchantress, Falcon, Feral, Firebird, Firestar, Forge, Franklin Storm, Gambit, Goblin Queen, Gravity, Green Arrow, Green Goblin, Green Goblin Ii, Han Solo, Harley Quinn, Havok, Hawk, Hawkeye, Hawkeye Ii, Heat Wave, Hellcat, Hope Summers, Huntress, Hydro-Man, Indiana Jones, Ink, Jack-Jack, James T. Kirk, Jean Grey, Jennifer Kale, Jessica Jones, John Wraith, Joker, Jolt, Jubilee, Justice, Kingpin, Kraven Ii, Kraven The Hunter, Leader, Light Lass, Lightning Lad, Lightning Lord, Longshot, Luke Skywalker, Man-Wolf, Mandarin, Maverick, Medusa, Meltdown, Metron, Micro Lad, Mister Fantastic, Mister Freeze, Mockingbird, Moon Knight, Morlun, Moses Magnum, Mr Immortal, Ms Marvel Ii, Multiple Man, Mysterio, Mystique, Nebula, Nick Fury, Nightcrawler, Nightwing, Oracle, Paul Blart, Penguin, Phantom Girl, Plantman, Plastique, Poison Ivy, Professor X, Professor Zoom, Psylocke, Punisher, Purple Man, Pyro, Question, Quill, Ra’s Al Ghul, Rambo, Raven, Red Arrow, Red Hood, Red Robin, Red Skull, Rick Flag, Robin, Robin Ii, Robin Iii, Robin V, Rocket Raccoon, Rogue, Ronin, Rorschach, Sage, Scarecrow, Scarlet Spider Ii, Shadow Lass, Shadowcat, Shang-Chi, Shocker, Shriek, Silverclaw, Siryn, Snowbird, Songbird, Space Ghost, Spider-Woman Iii, Spock, Spyke, Star-Lord, Static, Storm, Sunspot, Swarm, Synch, Taskmaster, Tempest, The Comedian, Thunderbird, Tigra, Tinkerer, Toad, Triplicate Girl, Two-Face, Vanisher, Vibe, Violet Parr, Vixen, Vulture, Walrus, Warp, Wasp, Winter Soldier, Wyatt Wingfoot, Yellowjacket, Yellowjacket Ii, Yoda, Zatanna |
We also digged into the average capability statistics within each cluster group to learn about the feature of each group.
superheros_clean |>
group_by(cluster) |>
summarise(
intelligence_mean = mean(intelligence, na.rm = TRUE),
intelligence_sd = sd(intelligence, na.rm = TRUE),
strength_mean = mean(strength, na.rm = TRUE),
strength_sd = sd(strength, na.rm = TRUE),
speed_mean = mean(speed, na.rm = TRUE),
speed_sd = sd(speed, na.rm = TRUE),
durability_mean = mean(durability, na.rm = TRUE),
durability_sd = sd(durability, na.rm = TRUE),
power_mean = mean(power, na.rm = TRUE),
power_sd = sd(power, na.rm = TRUE),
combat_mean = mean(combat, na.rm = TRUE),
combat_sd = sd(combat, na.rm = TRUE),
height_mean = mean(height_cm, na.rm = TRUE),
height_sd = sd(height_cm, na.rm = TRUE),
weight_mean = mean(weight_kg, na.rm = TRUE),
weight_sd = sd(weight_kg, na.rm = TRUE)
) |>
pivot_longer(-cluster, names_to = c("variable", ".value"), names_sep = "_") |>
mutate(mean_sd = paste0(round(mean, 2), " (", round(sd, 2), ")")) |>
dplyr::select(variable, cluster, mean_sd) |>
pivot_wider(names_from = cluster, values_from = mean_sd, names_prefix = "cluster") |>
kable()
variable | cluster1 | cluster2 | cluster3 |
---|---|---|---|
intelligence | 67.3 (18.18) | 64.16 (22.83) | 62.77 (19.05) |
strength | 63.93 (28.43) | 76.31 (26.18) | 19.15 (15.04) |
speed | 54.56 (25.02) | 43.06 (17.41) | 28.38 (14.3) |
durability | 80.62 (17.83) | 85.72 (19.8) | 38.65 (19.79) |
power | 81.31 (23.54) | 71.81 (28.23) | 49.62 (25.07) |
combat | 68.07 (19.15) | 69.47 (20.79) | 59.68 (24.07) |
height | 184.7 (15.56) | 238.72 (46.7) | 175.79 (15.58) |
weight | 111.26 (56.52) | 412.97 (180.88) | 74.5 (24.59) |
The table presents the average values (with standard deviations) of several attributes for superheroes categorized into three clusters. Each row represents a different variable, including intelligence, strength, speed, durability, power, combat, height, and weight. Cluster 1, characterized by moderate values for most attributes, shows relatively balanced abilities, particularly in intelligence and power. Cluster 2 displays the highest average values for strength, durability, height, and weight, indicating that members of this cluster are likely to be physically dominant. Conversely, Cluster 3 exhibits lower values for attributes like strength, speed, and weight, suggesting that the superheroes in this cluster might rely less on physical prowess compared to the other clusters.
The radar plot below is a visualization of superhero 6 capabilities values in each group from the previous section.
library(fmsb)
cluster_averages <- superheros_clean |>
group_by(cluster) |>
summarise(
intelligence = mean(intelligence, na.rm = TRUE),
strength = mean(strength, na.rm = TRUE),
speed = mean(speed, na.rm = TRUE),
durability = mean(durability, na.rm = TRUE),
power = mean(power, na.rm = TRUE),
combat = mean(combat, na.rm = TRUE)
)
radar_data <- as.data.frame(cluster_averages)
radar_data <- radar_data[, -1]
radar_data <- rbind(rep(100, 6), rep(0, 6), radar_data)
colors_border <- c('#f08080', '#90ee90', '#87cefa')
colors_in <- c('#ffcccb', '#c6e7c6', '#b0e0e6')
radarchart(radar_data, axistype = 2,
pcol = colors_border,
pfcol = adjustcolor(colors_in, alpha.f = 0.6), plwd = 2.5, plty = 1,
cglcol = "grey", cglty = 1,
axislabcol = "black")
legend(x = "topright", legend = paste("Cluster", 1:3), pch = 20, col = colors_border, text.col = "black", cex = 1.2, pt.cex = 2)
Among all these variables such as intelligence, strength, speed, durability, combat, height, weight, gender, alignment, we were interested in knowing what factors contribute the most to a superhero’s power, which is defined as representing the character’s overall power or abilities. Therefore, the goal is to identify key predictors of superhero power, evaluate model performance, and validate the results using advanced methods such as Random Forest.
A univariate analysis was performed to determine which predictors
(independent variables) are significantly associated with
power
(dependent variable) at 0.05 significance level.
superheros <- read_csv("./data/superheroes.csv", show_col_types = FALSE)
superheros_clean1 <- superheros %>%
filter(complete.cases(intelligence, strength, speed, durability, combat, height_cm, weight_kg, gender, alignment, power))
superheros_clean1$gender <- as.factor(superheros_clean1$gender)
superheros_clean1$alignment <- as.factor(superheros_clean1$alignment)
univariate_analysis = function(var, var_name){
model = summary(lm(power ~ var, data = superheros_clean1))
p_value = model$coefficients[2, 4]
return(data.frame(variable = var_name, p_value = p_value))
}
results = rbind(
univariate_analysis(superheros_clean1$intelligence, "Intelligence"),
univariate_analysis(superheros_clean1$strength, "Strength"),
univariate_analysis(superheros_clean1$speed, "Speed"),
univariate_analysis(superheros_clean1$durability, "Durability"),
univariate_analysis(superheros_clean1$combat, "Combat"),
univariate_analysis(superheros_clean1$height_cm, "Height"),
univariate_analysis(superheros_clean1$weight_kg, "Weight"),
univariate_analysis(superheros_clean1$gender, "Gender"),
univariate_analysis(superheros_clean1$alignment, "Alignment")
)
results$p_value <- round(results$p_value, 4)
library(knitr)
kable(results, col.names = c("Variable", "P-Value"), digits = 4, align = "c")
Variable | P-Value |
---|---|
Intelligence | 0.0000 |
Strength | 0.0000 |
Speed | 0.0000 |
Durability | 0.0000 |
Combat | 0.0615 |
Height | 0.1282 |
Weight | 0.0394 |
Gender | 0.3960 |
Alignment | 0.6718 |
Based on the results, intelligence
,
strength
, speed
, durability
,
weight
are significantly associated with the power with
p-value < 0.05.
Next, A linear regression model was built using the significant predictors from univariate analysis.
sig_model = lm(power ~ intelligence + strength + speed + durability + weight_kg, data = superheros_clean1)
sig_model %>%
broom::tidy() %>%
dplyr::select(term, estimate, p.value) %>%
knitr::kable(digits = 4)
term | estimate | p.value |
---|---|---|
(Intercept) | 19.6052 | 0.0001 |
intelligence | 0.2065 | 0.0010 |
strength | 0.1037 | 0.0491 |
speed | 0.3254 | 0.0000 |
durability | 0.2704 | 0.0000 |
weight_kg | -0.0165 | 0.1917 |
All of the variables are statistically significant associated with
the power at 0.05 level except for weight
. From the table,
we can conclude that as the intelligence
,
strength
, speed
, and durability
increase respectively, superheroes’ power
also increases,
as indicated by the positive estimates.
Forward, backward, and stepwise selection methods were used to refine the model. Based on these methods, three competing models were built:
Model A: Intelligence
, Speed
,
Durability
.
Model B: Intelligence
, Strength
,
Speed
, Durability
.
Model C: All variables: Intelligence
,
Strength
, Speed
, Durability
,
Combat
, Height
, Weight
,
Alignment
, Gender
.
Three models were evaluated using cross-validation repeated 100 times. Root Mean Squared Error (RMSE) distributions were compared.
From the figure, Model B has the relatively lowest RMSE distribution overall. In addition, AIC was also assessed, and Model B also has the lowest value. Therefore, Model B was selected as the final model due to its superior predictive performance and simplicity.
cv_df =
crossv_mc(superheros_clean1, 100) |>
mutate(
train = map(train, as_tibble),
test = map(test, as_tibble))
cv_df =
cv_df |>
mutate(
lm_modelA = map(train, \(df) lm(power ~ intelligence + speed + durability, data = df)),
lm_modelB = map(train, \(df) lm(power ~ intelligence + strength + speed + durability, data = df)),
lm_modelC = map(train, \(df) lm(power ~ intelligence + strength + speed + durability + combat + height_cm + weight_kg + alignment, data = df))) %>%
mutate(
rmse_modelA = map2_dbl(lm_modelA, test, \(mod, df) rmse(model = mod, data = df)),
rmse_modelB = map2_dbl(lm_modelB, test, \(mod, df) rmse(model = mod, data = df)),
rmse_modelC = map2_dbl(lm_modelC, test, \(mod, df) rmse(model = mod, data = df)))
cv_df |>
dplyr::select(starts_with("rmse")) |>
pivot_longer(
everything(),
names_to = "model",
values_to = "rmse",
names_prefix = "rmse_") |>
mutate(model = fct_inorder(model)) |>
ggplot(aes(x = model, y = rmse)) + geom_violin()
lm_modelA <- lm(power ~ intelligence + speed + durability, data = superheros_clean1)
lm_modelB <- lm(power ~ intelligence + strength + speed + durability, data = superheros_clean1)
lm_modelC <- lm(power ~ intelligence + strength + speed + durability + combat + height_cm + weight_kg + alignment + gender, data = superheros_clean1)
aic_values <- data.frame(
Model = c("Model A", "Model B", "Model C"),
AIC = c(
AIC(lm_modelA),
AIC(lm_modelB),
AIC(lm_modelC)
)
)
library(knitr)
kable(
aic_values,
col.names = c("Model", "AIC"),
digits = 2,
align = "c",
caption = "AIC Comparison for Models"
)
Model | AIC |
---|---|
Model A | 3869.10 |
Model B | 3868.38 |
Model C | 3875.80 |
To verify and enhance our model for predicting superhero powers, we used a random forest model to complement and validate insights from our statistical analyses, providing an additional predictive layer.
The random forest algorithm is a supervised learning method that constructs multiple decision trees using subsets of the data and features, then aggregates their predictions to improve accuracy and robustness. Compared to single decision trees, random forests are less prone to overfitting due to their ensemble approach, though careful tuning of hyperparameters remains important to ensure optimal performance. This method allowed us to validate our findings from a predictive perspective while mitigating the risk of overfitting inherent in individual tree-based models.
set.seed(1)
rf_model <- randomForest(power ~ combat + strength + speed + durability + intelligence + height_cm + weight_kg + gender + alignment, data = superheros_clean1, mtry = 3, ntree = 100, importance = TRUE)
importance <- importance(rf_model)
varImpPlot(rf_model)
From the variable importance plot, we found that speed, durability,
strength, and intelligence are the most important variables for
improving the model’s predictive performance and better fit as they have
a larger %IncMSE and increase in Node Purity, which is consistent with
our model. Therefore, intelligence
, Strength
,
Speed
, Durability
was identified as the
primary contributors to superheroes’ power
.
Final model we chose:
lm_modelB <- lm(power ~ intelligence + strength + speed + durability)
Look at Model B:
lm_modelB <- lm(power ~ intelligence + strength + speed + durability, data = superheros_clean1) %>%
broom::tidy() %>%
dplyr::select(term, estimate, p.value)
kable(lm_modelB, digits = 4)
term | estimate | p.value |
---|---|---|
(Intercept) | 18.5961 | 0.0001 |
intelligence | 0.2076 | 0.0010 |
strength | 0.0819 | 0.1012 |
speed | 0.3402 | 0.0000 |
durability | 0.2603 | 0.0000 |
Based on the results, intelligence
,
Strength
, Speed
, and Durability
are positively related to power
. For one-unit increase in
speed
, it causes the largest increase in
power
, while one-unit increase in strength
causes the least increase.
Users can use this shinny app to identify 5 top ranked heroes based on their preferences on speed, intelligence, durability, strength, combat, and power, then can pick their hero among these 5 heroes by filtering alighment, gender, hair color, and eye color.
We found that characters with a high combination of mental acuity (intelligence), physical prowess (strength and durability), and agility (speed) tend to possess greater overall power, which align with our expectations for superheroes, emphasizing a blend of physical and mental capabilities. In addition, from cross validation and random forest analysis, we could confirm that speed, durability, strength, and intelligence were the most critical variables for predicting power. One surprising finding was the high contribution of speed to power, as it often takes a backseat to strength or combat skills in storytelling.
Good alignment dominated the data set and this prevalence of good-aligned characters reflects traditional superhero narratives. However, we found the bad alignment had overall higher intelligence, strength, durability, power, combat, height, and weight than good alignment. One reason we thought might explain this was that villains were often designed to present formidable challenges to heroes, necessitating higher physical and mental attributes to justify their role as significant threats.
Male characters tended to have higher strength, durability, and physical attributes, while female characters often exhibited higher agility and intelligence. This gender differences in attributes aligned with stereotypes in superhero depictions, and we hope this analysis can help modern comics create more nuanced and diverse characters that break traditional norms.
To further distinguish superhero types, we clustered characters into three distinct groups: heroes in Cluster 1 had relatively balanced attributes, Cluster 2 included heroes physically dominated with exceptional strength and durability, and strategically inclined heroes with lower physical attributes but higher intelligence and speed were classified in Cluster 3.
While our analysis yielded valuable insights, there are areas for improvement. First, a more integrated dataset with clear and standardized criteria for assigning attribute values would enhance the robustness of the analysis. Moreover, developing an interactive model that allows users to input ability values and appearance characteristics as predictors to generate customized heroes as outputs would significantly improve user experience. This approach would enable users to design heroes tailored to their preferences, fostering creativity and engagement.
Our findings provide a foundation for guiding character creation and team compositions in storytelling, with the potential to enhance diversity in character design and balance in superhero teams. By integrating data-driven insights with creative applications, we hope this work will enrich the superhero genre, offering opportunities to break traditional norms and craft more inclusive narratives.