library(tidyverse)
library(naniar)
library(stringr)
library(janitor)
library(plotly)
library(factoextra)
library(gridExtra)
library(kableExtra)
library(knitr)
Our data comes from Kaggle. The dataset covers detailed information on
731 superheroes and villains from various comic universes. It includes a
wide range of attributes such as capability statistics, biographical
information, physical appearance, and affiliations. The data was
collected using the SuperHero API. The original dataset is called
superheroes_data.
However, we are not interested in every variable in the
superheroes_data dataset. In addition, some of the data in the original
dataset cannot be read or analyzed as-is. As a result, we carried out
the following data cleaning steps:
When importing data, we set “NA”, “.”, “”, “null”, and “-” as missing values;
We expressed each hero's height in cm and weight in kg;
We marked a height or weight of 0 as a missing value, since a zero measurement is meaningless;
If a hero has two hair colors or two eye colors (e.g. Blue/Yellow), we marked this color as ‘Dual Color’;
We also reclassified some categorical variables. Specifically, if a
publisher has only one hero, that hero's publisher is recoded as
‘Others’. We did the same for the variables eye_color and hair_color;
for alignment and gender, we only tabulated the counts.
We kept only the variables we were interested in and dropped aliases,
base, occupation, group_affiliation, relatives, alter_egos,
place_of_birth, first_appearance, and race. These variables are
recorded in a way that is difficult to read and difficult to clean.
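The unit parsing, zero handling, and dual-color steps listed above can be tried out on a small toy table before running them on the full file. This is only a sketch: the raw strings in `toy` are hypothetical, and the exact formatting of the real height, weight, and eye_color columns is an assumption.

```r
library(stringr)
library(dplyr)

# Hypothetical raw values (assumed formatting, not taken from the real file).
toy <- tibble(
  height    = c("6'8 // 203 cm", "0 cm", NA),
  weight    = c("980 lb // 441 kg", "0 kg", "135 lb // 61 kg"),
  eye_color = c("Blue/Yellow", "Green (left)", "Brown")
)

toy_clean <- toy |>
  mutate(
    # Keep only the digits that directly precede " cm" or " kg".
    height_cm = str_extract(height, "\\d+(?= cm)") |> as.numeric(),
    weight_kg = str_extract(weight, "\\d+(?= kg)") |> as.numeric(),
    # A measurement of 0 carries no information, so treat it as missing.
    height_cm = if_else(height_cm == 0, NA_real_, height_cm),
    weight_kg = if_else(weight_kg == 0, NA_real_, weight_kg),
    # Drop parenthetical notes, then collapse slash-separated pairs.
    eye_color = str_remove_all(eye_color, "\\(.*?\\)") |> str_trim(),
    eye_color = if_else(str_detect(eye_color, "/"), "Dual Color", eye_color)
  )

print(toy_clean)
```

Note that the lookahead only matches values followed by " cm" or " kg", so any entry recorded in another unit would come out as NA rather than being converted.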
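# Import the raw data, treating common placeholder strings as missing.
superheroes_df <- read_csv(
  "data/superheroes_data.csv",
  na = c("NA", ".", "", "null", "-")
) |>
  janitor::clean_names() |>
  mutate(
    # Extract the metric value; a measurement of 0 is treated as missing.
    height_cm = str_extract(height, "\\d+(?= cm)") |> as.numeric(),
    height_cm = if_else(height_cm == 0, NA_real_, height_cm),
    weight_kg = str_extract(weight, "\\d+(?= kg)") |> as.numeric(),
    weight_kg = if_else(weight_kg == 0, NA_real_, weight_kg),
    # Slash-separated colors (e.g. Blue/Yellow) become "Dual Color".
    hair_color = if_else(str_detect(hair_color, "/"), "Dual Color", hair_color),
    # Strip parenthetical notes from eye colors before checking for pairs.
    eye_color = str_remove_all(eye_color, "\\(.*?\\)") |> str_trim(),
    eye_color = if_else(str_detect(eye_color, "/"), "Dual Color", eye_color)
  ) |>
  # Use consistent title case for every text column except the URL.
  mutate(across(
    where(is.character) & !any_of("url"),
    ~ str_to_title(.)
  )) |>
  # Drop the raw height/weight columns and the hard-to-clean variables.
  dplyr::select(
    -aliases, -height, -weight, -base, -occupation, -group_affiliation,
    -relatives, -alter_egos, -place_of_birth, -first_appearance, -race
  )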
superheros <- superheroes_df

# Publishers: recode publishers with a single hero as "Others".
publisher_counts <- superheros |>
  group_by(publisher) |>
  summarise(hero_count = n()) |>
  arrange(desc(hero_count))
single_hero_publishers <- filter(publisher_counts, hero_count == 1) |>
  pull(publisher)
superheros <- superheros |>
  mutate(publisher = ifelse(publisher %in% single_hero_publishers, "Others", publisher))

# Alignment and gender: counts only, no recoding.
alignment_counts <- superheros |>
  group_by(alignment) |>
  summarise(hero_count = n()) |>
  arrange(desc(hero_count))
gender_counts <- superheros |>
  group_by(gender) |>
  summarise(hero_count = n()) |>
  arrange(desc(hero_count))

# Eye color: recode colors that occur only once as "Others".
eye_color_counts <- superheros |>
  group_by(eye_color) |>
  summarise(hero_count = n()) |>
  arrange(desc(hero_count))
single_hero_eye_color <- filter(eye_color_counts, hero_count == 1) |>
  pull(eye_color)
superheros <- superheros |>
  mutate(eye_color = ifelse(eye_color %in% single_hero_eye_color, "Others", eye_color))

# Hair color: recode colors that occur only once as "Others".
hair_color_counts <- superheros |>
  group_by(hair_color) |>
  summarise(hero_count = n()) |>
  arrange(desc(hero_count))
single_hero_hair_color <- filter(hair_color_counts, hero_count == 1) |>
  pull(hair_color)
superheros <- superheros |>
  mutate(hair_color = ifelse(hair_color %in% single_hero_hair_color, "Others", hair_color))
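The same count, filter, and recode pattern is repeated above for publisher, eye_color, and hair_color. One way to factor it out is a small helper; `lump_singletons` below is a hypothetical name, not part of the original analysis, and it uses only base R. It differs from the code above in one respect: `table()` ignores missing values, so an NA is never recoded to "Others", even if it occurs exactly once.

```r
# Hypothetical helper: recode values that occur exactly once as `other`.
lump_singletons <- function(x, other = "Others") {
  counts <- table(x)                      # NA values are not counted
  singles <- names(counts)[counts == 1]   # levels seen exactly once
  ifelse(x %in% singles, other, x)
}

# Toy example with made-up publisher labels.
publishers <- c("Marvel", "Marvel", "DC", "Image", NA)
lumped <- lump_singletons(publishers)
print(lumped)

# Possible use on the cleaned data (assumes dplyr is loaded):
# superheros |>
#   mutate(across(c(publisher, eye_color, hair_color), lump_singletons))
```

Keeping the explicit `*_counts` tables, as the original code does, is still useful when the counts themselves are reported later on.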
The cleaned dataset is tidy, readable and analyzable. It includes the following variables:
id: A unique identifier for each character.
name: The superhero’s alias or code name.
intelligence: A numerical representation of the character’s intelligence level.
strength: A numerical value representing the character’s physical strength.
speed: A numerical representation of how fast the character can move.
durability: A measure of the character’s resilience and ability to withstand damage.
power: A numerical value representing the character’s overall power or abilities.
combat: A score depicting the character’s combat skills and experience.
full_name: The character’s real or full name, as opposed to their superhero alias.
publisher: The company responsible for creating and publishing the character.
alignment: Whether the character is good, evil, or neutral.
gender: The gender of the character.
height_cm: The character’s height, given in centimeters.
weight_kg: The character’s weight, provided in kilograms.
eye_color: The color of the character’s eyes.
hair_color: The color of the character’s hair.
url: A link to an image of the character or more detailed information.