eric | Sept. 13, 2024, 5:51 p.m.
Data cleaning is a crucial first step in any data analysis or machine learning pipeline. Without clean and properly structured data, the results of any analysis can be misleading or invalid. In this guide, we will cover how to get data from various sources and clean it in R to ensure that it's ready for further analysis.
There are multiple ways to bring data into R from various formats like CSV, Excel, databases, APIs, and more.
The most common format for datasets is the CSV (Comma-Separated Values) file. The base R functions read.csv() and read.csv2() make it straightforward to load CSV files:
- read.csv() assumes that the field separator is a comma and the decimal mark is a period.
- read.csv2() assumes that the field separator is a semicolon and the decimal mark is a comma.
# Load data from a CSV file
data <- read.csv2("path/to/your/file.csv", stringsAsFactors = FALSE, skip = 1)
- stringsAsFactors = FALSE ensures that string columns are not automatically converted to factors, which can be useful for data cleaning (Note: this is overridden by as.is and colClasses, both of which allow finer control.)
- skip is useful when the first row(s) of the CSV file is/are empty or contain(s) data that you don't want.
Example: Assuming we have a file.csv with the content
bla | bla | bla | bla | bla |
A | B | C | D | E |
1 | 2 | 3 | 4 | 5 |
6 | 7 | 8 | 9 | 10 |
11 | 12 | 13 | 14 | 15 |
the command above gives us the output below. Note that because the file is comma-separated, read.csv2(), which expects semicolons, reads each row as a single field:
> data_csv <- read.csv2("/home/eric/Documents/file.csv", stringsAsFactors = FALSE, skip = 1)
> data_csv
       A.B.C.D.E
1      1,2,3,4,5
2     6,7,8,9,10
3 11,12,13,14,15
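For comparison, since the file in this example is actually comma-separated, read.csv() with the same skip argument splits it into the five columns A to E. A small sketch, using the same hypothetical path as above:
# read.csv() expects comma-separated fields, which matches this file
data_csv <- read.csv("/home/eric/Documents/file.csv", stringsAsFactors = FALSE, skip = 1)
# data_csv now has 3 rows and the five columns A, B, C, D, E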
For Excel files, the readxl package allows for reading data from .xls or .xlsx files.
library(readxl)
# Load data from an Excel file
data <- read_excel("path/to/your/file.xlsx", sheet = 1, skip = 1)
- sheet specifies which sheet to read from the Excel file.
- skip specifies the number of rows to skip before reading data.
Example: The command above, with the same data as in the CSV example, gives us:
> data <- read_excel("/home/eric/Documents/file.xlsx", sheet = 1, skip = 1)
> data
# A tibble: 3 × 5
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     3     4     5
2     6     7     8     9    10
3    11    12    13    14    15
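If a sheet contains stray cells around the table, readxl also lets you restrict the read to an explicit cell range instead of skipping rows. A small sketch; the range "A2:E5" is only an illustration and should be adjusted to where your table actually sits:
library(readxl)
# Read only the cells containing the header row and the three data rows
data <- read_excel("path/to/your/file.xlsx", sheet = 1, range = "A2:E5")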
To pull data from a MySQL or PostgreSQL database, you can use the DBI and RMySQL or RPostgres packages, respectively.
library(DBI)
# Connect to a postgresql database
con <- dbConnect(RPostgres::Postgres(), dbname = "mydb", host = "localhost",
user = "username", password = "password")
# Query data
data <- dbGetQuery(con, "SELECT * FROM my_table")
# Close the connection
dbDisconnect(con)
Example: Assuming we have a my_table that looks like this
A | B | C | D | E |
a1 | b1 | c1 | d1 | e1 |
a2 | b2 | c2 | d2 | e2 |
a3 | b3 | c3 | d3 | e3 |
a4 | b4 | c4 | d4 | e4 |
a5 | b5 | c5 | d5 | e5 |
the command above would give you
> con <- dbConnect(RPostgres::Postgres(), dbname = "mydb", host = "localhost",
+                  port = 5432, user = "your_username", password = "your_password")
> data <- dbGetQuery(con, "SELECT * FROM my_table")
> data
  id  a  b  c  d  e
1  1 a1 b1 c1 d1 e1
2  2 a2 b2 c2 d2 e2
3  3 a3 b3 c3 d3 e3
4  4 a4 b4 c4 d4 e4
5  5 a5 b5 c5 d5 e5
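If you only need part of a table, you can push the filtering into the database itself. A minimal sketch using a parameterized query with the connection from above (the WHERE clause and the value "a1" are made up for illustration; RPostgres uses $1, $2, ... as placeholders):
# Let the database filter the rows; params keeps values out of the SQL string
subset_data <- dbGetQuery(con, "SELECT * FROM my_table WHERE a = $1",
                          params = list("a1"))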
To get data from an API, you can use the httr2 and jsonlite packages to send requests and parse JSON data.
library(httr2)
library(jsonlite)
# Send a GET request to the API
response <- request("https://api.example.com/data") %>% req_perform()
# Check if the request was successful
if (resp_status(response) == 200) {
# Parse the JSON response
data <- response %>% resp_body_json()
}
Explanation:
- httr2::request() creates a request object, which can be further modified or sent using req_perform().
- resp_status() is used to extract the status code from the response. Code 200 means that the request was successful.
- resp_body_json() directly parses the JSON response body into an R object.
Example: Loading and parsing JSON data from the Star Trek API, finding data on spacecraft:
# Load the packages
library(httr2)
library(jsonlite)
# Define the API URL
url <- "https://stapi.co/api/v1/rest/spacecraft/search"
# Make the API request using httr2
response <- request(url) %>%
req_perform()
# Check if the request was successful
if (resp_status(response) == 200) {
print("Request was successful")
# Parse the JSON content
data <- response %>%
resp_body_json()
# View the structure of the parsed data
str(data)
# Extract and print the spacecraft's name
spacecraft <- data$spacecrafts[[1]]
print(spacecraft)
print(paste("Spacecraft's name:", spacecraft$name))
} else {
stop("Request failed with status: ", resp_status(response))
}
Running the code above gives us the data on the first spacecraft in the list from the API:
[1] "Request was successful"
$uid
[1] "SRMA0000245113"
$name
[1] "042"
$registry
[1] "042"
$status
[1] "Active"
$dateStatus
[1] "3189"
$spacecraftClass
$spacecraftClass$uid
[1] "SCMA0000226396"
$spacecraftClass$name
[1] "Worker bee"
$owner
NULL
$operator
NULL
[1] "Spacecraft's name: 042"
Once you've loaded the data into R, it's important to examine it before proceeding with cleaning.
# Get the structure of the data
str(data)
# View the first few rows of the data
head(data)
# Summary statistics of the data
summary(data)
- str() shows the structure of the dataset, including data types.
- head() provides a preview of the first few rows.
- summary() gives summary statistics such as mean, median, and missing values.
Example: Let's examine the data in this CSV file, which contains data on several Star Trek starships.
> data <- read.csv("Documents/star_trek_table.csv")
> str(data)
'data.frame':	20 obs. of  8 variables:
 $ Ship.Name                          : chr "USS Enterprise (NCC-1701)" "USS Enterprise (NCC-1701-A)" "USS Enterprise (NCC-1701-B)" "USS Enterprise (NCC-1701-C)" ...
 $ Registry.Number                    : chr "NCC-1701" "NCC-1701-A" "NCC-1701-B" "NCC-1701-C" ...
 $ Class                              : chr "Constitution-class" "Constitution-class" "Excelsior-class" "Ambassador-class" ...
 $ Captain.s.                         : chr "Capt. James T. Kirk, Capt. Pike" "Capt. James T. Kirk" "Capt. John Harriman" "Capt. Rachel Garrett" ...
 $ Launch.Date                        : chr "2245" "2286" "2293" "2344" ...
 $ Notable.Appearances..Series.Movies.: chr "Star Trek: TOS, Movies I-IV, VI" "Star Trek IV-VI" "Star Trek: Generations" "Star Trek: TNG (Episode: \"Yesterday's Enterprise\")" ...
 $ Ship.s.Complement                  : int 430 500 750 700 1014 885 150 50 136 80 ...
 $ Max.Speed..Warp.Factor.            : chr "Warp 8" "Warp 9" "Warp 9.6" "Warp 9" ...
> head(data)
                    Ship.Name Registry.Number              Class
1   USS Enterprise (NCC-1701)        NCC-1701 Constitution-class
2 USS Enterprise (NCC-1701-A)      NCC-1701-A Constitution-class
3 USS Enterprise (NCC-1701-B)      NCC-1701-B    Excelsior-class
4 USS Enterprise (NCC-1701-C)      NCC-1701-C   Ambassador-class
5 USS Enterprise (NCC-1701-D)      NCC-1701-D       Galaxy-class
6 USS Enterprise (NCC-1701-E)      NCC-1701-E    Sovereign-class
                       Captain.s. Launch.Date
1 Capt. James T. Kirk, Capt. Pike        2245
2             Capt. James T. Kirk        2286
3             Capt. John Harriman        2293
4            Capt. Rachel Garrett        2344
5           Capt. Jean-Luc Picard        2363
6           Capt. Jean-Luc Picard        2372
                 Notable.Appearances..Series.Movies. Ship.s.Complement Max.Speed..Warp.Factor.
1                    Star Trek: TOS, Movies I-IV, VI               430                  Warp 8
2                                    Star Trek IV-VI               500                  Warp 9
3                             Star Trek: Generations               750                Warp 9.6
4 Star Trek: TNG (Episode: "Yesterday's Enterprise")               700                  Warp 9
5                   Star Trek: TNG, Movies VII, VIII              1014                Warp 9.6
6                                      Movies VIII-X               885               Warp 9.99
> summary(data)
  Ship.Name         Registry.Number       Class             Captain.s.        Launch.Date
 Length:20          Length:20          Length:20          Length:20          Length:20
 Class :character   Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character
 Notable.Appearances..Series.Movies. Ship.s.Complement Max.Speed..Warp.Factor.
 Length:20                           Min.   :   4.0    Length:20
 Class :character                    1st Qu.: 122.0    Class :character
 Mode  :character                    Median : 390.0    Mode  :character
                                     Mean   : 421.2
                                     3rd Qu.: 712.5
                                     Max.   :1014.0
Data cleaning involves transforming raw data into a format that is consistent, accurate, and usable for analysis. This process typically includes handling missing data, removing duplicates, correcting inconsistencies, and transforming data types.
Missing data can occur for various reasons, and it's important to handle it carefully. You can either remove missing values or replace them with some meaningful statistic like the mean, median, or mode.
sum(is.na(data)) # Total missing values
colSums(is.na(data)) # Missing values per column
Example: For our starship data, we get:
> sum(is.na(data)) # Total missing values
[1] 7
> colSums(is.na(data)) # Missing values per column
                          Ship.Name                     Registry.Number
                                  0                                   0
                              Class                          Captain.s.
                                  0                                   1
                        Launch.Date Notable.Appearances..Series.Movies.
                                  6                                   0
                  Ship.s.Complement             Max.Speed..Warp.Factor.
                                  0                                   0
The launch date column (Launch.Date) is the only one with a lot of missing data (6 of 20 rows).
One way to resolve the problem of missing values is to simply remove the rows that contain them:
data_clean <- na.omit(data)
Example: Running this on our dataset leaves us with just 14 of the 20 rows:
> data_clean <- na.omit(data)
> summary(data_clean)
  Ship.Name         Registry.Number       Class             Captain.s.        Launch.Date
 Length:14          Length:14          Length:14          Length:14          Length:14
 Class :character   Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character
 Notable.Appearances..Series.Movies. Ship.s.Complement Max.Speed..Warp.Factor.
 Length:14                           Min.   :   4.0    Length:14
 Class :character                    1st Qu.: 139.5    Class :character
 Mode  :character                    Median : 355.5    Mode  :character
                                     Mean   : 424.6
                                     3rd Qu.: 737.5
                                     Max.   :1014.0
Instead of just removing data rows, you can replace missing values with the mean, median, or another statistic using the `dplyr` package.
library(dplyr)
# Replace missing values in a column with the mean
data <- data %>%
mutate(column_name = ifelse(is.na(column_name), mean(column_name, na.rm = TRUE), column_name))
- %>% pipes data into the mutate() function.
- mutate() creates new columns that are functions of existing variables. It can also modify columns (when the name matches an existing column, as above) and delete them (by setting their value to NULL).
- ifelse() has the form ifelse(test, yes, no).
- mean(column_name, na.rm = TRUE) calculates the mean of the column while ignoring NAs; the ifelse() call then replaces every NA with that mean.
Example: To use this strategy on our dataset, we would first need to convert the launch year (Launch.Date) from a string to an integer.
library(dplyr)
data <- data %>%
mutate(Launch.Date = suppressWarnings(as.integer(Launch.Date)))
data <- data %>%
mutate(Launch.Date = ifelse(is.na(Launch.Date), as.integer(mean(Launch.Date, na.rm = TRUE)), Launch.Date))
Duplicates can cause bias in your analysis, so it's important to remove them.
# Remove duplicate rows
data_clean <- data[!duplicated(data), ]
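Since we are already using dplyr, distinct() is an equivalent alternative, and it can also de-duplicate on selected columns only. A short sketch using the Ship.Name column from our example data:
library(dplyr)
# Remove rows that are complete duplicates
data_clean <- distinct(data)
# Or keep only the first row per ship name, retaining all other columns
data_unique_ships <- distinct(data, Ship.Name, .keep_all = TRUE)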
Every so often, the data types might not be appropriate for the analysis you're planning. For example, a numeric column might be mistakenly read as a character column.
# Convert character column to numeric
data$column_name <- as.numeric(data$column_name)
# Convert character to factor
data$column_name <- as.factor(data$column_name)
Example: In our starship dataset, head(data) and summary(data) above show us that everything except the ships' complement has been read as strings. However, both the launch year (Launch.Date) and the maximum speed should be numbers. Let's fix the launch year first:
> data$Launch.Date
 [1] "2245"  "2286"  "2293"  "2344"  "2363"  "2372"  "2371"  "2370"  "2256"  NA      "2256"
[12] "2287"  NA      "2374"  NA      NA      "2370s" "2285"  NA      NA
> data <- data %>%
+     mutate(Launch.Date = suppressWarnings(as.integer(Launch.Date)))
> data$Launch.Date
 [1] 2245 2286 2293 2344 2363 2372 2371 2370 2256   NA 2256 2287   NA 2374   NA   NA   NA 2285
[19]   NA   NA
> data$Launch.Date[17] <- 2375
> data$Launch.Date
 [1] 2245 2286 2293 2344 2363 2372 2371 2370 2256   NA 2256 2287   NA 2374   NA   NA 2375 2285
[19]   NA   NA
First, we convert the strings with mutate() and as.integer(). Second, we set the NA coming from "2370s" to 2375, which is most likely better than just leaving it as NA, depending on what we want to do next.
The maximum speed (warp factor) column is a bit more challenging. It would be more sensible to have "Warp" in the column heading instead of in front of the number in every row. That would leave only numbers, except for one row, the one for USS Discovery (NCC-1031), which also has a spore drive indicated in the max speed column. It would be tempting to just delete this string to get the number, but having a spore drive sounds like a big deal, and we don't want to lose important information. So, before we start cleaning up the Max.Speed..Warp.Factor. column, let's extract the spore drive information and put it in a separate column:
data <- data %>%
mutate(Spore.Drive = grepl("Spore Drive", Max.Speed..Warp.Factor., fixed = TRUE))
The last two columns in the dataset are now:
  Max.Speed..Warp.Factor. Spore.Drive
1                  Warp 8       FALSE
2                  Warp 9       FALSE
3                Warp 9.6       FALSE
4                  Warp 9       FALSE
5                Warp 9.6       FALSE
6               Warp 9.99       FALSE
7              Warp 9.975       FALSE
8              Warp 9.982       FALSE
9 Warp 9.99 / Spore Drive        TRUE
...
Next, we need to get rid of everything that isn't a number or a decimal point in the Max.Speed..Warp.Factor.-column:
data <- data %>%
mutate(Max.Speed..Warp.Factor. =
as.numeric(
gsub("[^0-9.-]", "", Max.Speed..Warp.Factor.)
)
)
The columns now look like this:
  Max.Speed..Warp.Factor. Spore.Drive
1                   8.000       FALSE
2                   9.000       FALSE
3                   9.600       FALSE
4                   9.000       FALSE
5                   9.600       FALSE
6                   9.990       FALSE
7                   9.975       FALSE
8                   9.982       FALSE
9                   9.990        TRUE
...
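As an alternative to writing the regular expression yourself, readr's parse_number() drops the surrounding text and parses the first number it finds, so "Warp 9.99 / Spore Drive" becomes 9.99. A sketch of that variant, applied to the original string column instead of the gsub() call above:
library(readr)
library(dplyr)
# parse_number() extracts the first number from each string
data <- data %>%
  mutate(Max.Speed..Warp.Factor. = parse_number(Max.Speed..Warp.Factor.))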
In addition to the string-to-number conversions, we should also convert the ships' classes to factors. Let's clean up the class names first, though. They all have "-class" in the name, which is superfluous.
> data <- data %>% mutate(Class = gsub("-class", "", Class))
> data$Class
 [1] "Constitution"  "Constitution"  "Excelsior"     "Ambassador"    "Galaxy"
 [6] "Sovereign"     "Intrepid"      "Defiant"       "Crossfield"    "California"
[11] "Walker"        "Excelsior"     "Miranda"       "Prometheus"    "Akira"
[16] "Galaxy"        "Nova"          "Miranda"       "Constellation" "Luna"
The Class column still just contains strings. We need to convert them to factors:
> data <- data %>% mutate(Class = as.factor(Class))
> data$Class
 [1] Constitution  Constitution  Excelsior     Ambassador    Galaxy        Sovereign
 [7] Intrepid      Defiant       Crossfield    California    Walker        Excelsior
[13] Miranda       Prometheus    Akira         Galaxy        Nova          Miranda
[19] Constellation Luna
16 Levels: Akira Ambassador California Constellation Constitution Crossfield ... Walker
Outliers are data points that are significantly different from other observations. While they are not always incorrect, they may require special attention.
# Identify outliers using boxplot
boxplot(data$column_name)
# Remove outliers based on interquartile range (IQR)
Q1 <- quantile(data$column_name, 0.25)
Q3 <- quantile(data$column_name, 0.75)
IQR <- Q3 - Q1
data_clean <- data[data$column_name >= (Q1 - 1.5 * IQR) & data$column_name <= (Q3 + 1.5 * IQR), ]
Example: Let's do boxplots of all the numerical columns:
boxplot(data$Launch.Date)
This is interesting. There is a big jump between the first launch and the rest. Why is that?
boxplot(data$Ship.s.Complement)
That a starship has a ship's complement ranging from a bit below 200 to above 1000 seems reasonable. However, there is one ship with a complement of 4. An error?
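To see which ship that is, we can look up the row with the smallest complement (using the column names as they are at this point, before renaming):
# Show the ship with the smallest complement
data[which.min(data$Ship.s.Complement),
     c("Ship.Name", "Class", "Ship.s.Complement")]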
boxplot(data$Max.Speed..Warp.Factor.)
These ranges appear to be reasonable, given that technology advances, and with it, warp speeds.
Renaming columns to more meaningful names can make your dataset easier to understand and work with.
# Rename columns
colnames(data) <- c("new_name1", "new_name2", "new_name3")
Example: Some of the column names in our dataset are pretty long-winded and ambiguous.
> colnames(data)
[1] "Ship.Name"                           "Registry.Number"
[3] "Class"                               "Captain.s."
[5] "Launch.Date"                         "Notable.Appearances..Series.Movies."
[7] "Ship.s.Complement"                   "Max.Speed..Warp.Factor."
[9] "Spore.Drive"
Let's fix that:
> colnames(data) <- c("Name", "RegNo", "Class", "Captain", "Launch", "Appearances", "Complement", "Warp Factor", "Spore Drive")
> colnames(data)
[1] "Name"        "RegNo"       "Class"       "Captain"     "Launch"      "Appearances"
[7] "Complement"  "Warp Factor" "Spore Drive"
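Note that colnames() replaces all the names at once, so the order has to match exactly. If you only want to rename a few columns, dplyr's rename() (new_name = old_name) is a less error-prone alternative; a sketch using the original names from before the colnames() call:
library(dplyr)
# Rename only selected columns; the others keep their current names
data <- data %>%
  rename(Name = Ship.Name, RegNo = Registry.Number)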
You may need to recode variables, especially when converting categorical data into a different format or merging categories.
> library(dplyr)
> x <- c(1, 2, 3, 4, 5, 6, 7)
> case_match(x,
+ 1 ~ "One",
+ 2 ~ "Two",
+ 3 ~ "Three",
+ 4 ~ "Four",
+ .default = "Other")
[1] "One" "Two" "Three" "Four" "Other" "Other" "Other"
case_match() provides many possibilities for matching levels, including ranges, conditions, and defaults. It replaces recode(). Note that for logical vectors, if_else() is recommended. For complex criteria, you may use case_when().
Example: Let's say neo-communists take over the known universe and start rewriting history. One of the first things to go are the historic starship classes. The historically correct classes are (suddenly):
> data -> communist_universe
> communist_universe$Class <- case_match(data$Class,
+     "Constellation" ~ "Lenin",
+     "Excelsior" ~ "Stalin",
+     .default = "Marx"
+ )
> communist_universe
               Name      RegNo  Class                            Captain Launch
1    USS Enterprise   NCC-1701   Marx                Capt. James T. Kirk   2245
2    USS Enterprise   NCC-1701   Marx                         Capt. Pike   2245
3    USS Enterprise NCC-1701-A   Marx                Capt. James T. Kirk   2286
4    USS Enterprise NCC-1701-B Stalin                Capt. John Harriman   2293
5    USS Enterprise NCC-1701-C   Marx               Capt. Rachel Garrett   2344
6    USS Enterprise NCC-1701-D   Marx              Capt. Jean-Luc Picard   2363
7    USS Enterprise NCC-1701-E   Marx              Capt. Jean-Luc Picard   2372
8       USS Voyager  NCC-74656   Marx              Capt. Kathryn Janeway   2371
9       USS Defiant   NX-74205   Marx               Capt. Benjamin Sisko   2370
10    USS Discovery   NCC-1031   Marx                Capt. Gabriel Lorca   2256
11    USS Discovery   NCC-1031   Marx                         Capt. Pike   2256
12     USS Cerritos  NCC-75567   Marx                Capt. Carol Freeman   2319
13     USS Shenzhou   NCC-1227   Marx            Capt. Philippa Georgiou   2256
14    USS Excelsior   NCC-2000 Stalin                  Capt. Hikaru Sulu   2287
15     USS Saratoga  NCC-31911   Marx                 Capt. J.G. Sanders   2319
16   USS Prometheus   NX-59650   Marx Various (Multivector Assault Mode)   2374
17 USS Thunderchild  NCC-63549   Marx                               <NA>   2319
18       USS Yamato  NCC-71807   Marx                Capt. Donald Varley   2319
19      USS Equinox  NCC-72381   Marx               Capt. Rudolph Ransom   2375
20      USS Reliant   NCC-1864   Marx                Capt. Clark Terrell   2285
21    USS Stargazer   NCC-2893  Lenin              Capt. Jean-Luc Picard   2319
22        USS Titan  NCC-80102   Marx                Capt. William Riker   2319
...
For more complex recoding, you can use case_match() or case_when():
case_when(
x %in% c("a", "b") ~ "A",
x %in% "c" ~ "C",
x %in% c("d", "e") ~ "D"
)
case_match(
x,
c("a", "b") ~ "A",
"c" ~ "C",
c("d", "e") ~ "D"
)
If your dataset contains text data, it might include extra whitespace, special characters, or inconsistent capitalization. These can be cleaned using the `stringr` package.
library(stringr)
# Trim leading and trailing whitespaces
data$column_name <- str_trim(data$column_name)
# Convert text to lowercase
data$column_name <- tolower(data$column_name)
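stringr has a few more helpers that come in handy here. A small sketch, not tied to our starship data:
library(stringr)
# Collapse repeated internal whitespace in addition to trimming the ends
str_squish("  USS   Enterprise  ")   # "USS Enterprise"
# Normalize capitalization
str_to_upper("ncc-1701")             # "NCC-1701"
str_to_title("uss enterprise")       # "Uss Enterprise"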
In addition, it might be a good idea to explore the textual information further, to see whether there is any superfluous or duplicate information.
Example: We can still improve our dataset. The Name column contains both the ship name and the registry number, and the latter is already available in the RegNo column. (The Class column had a similar issue, with "class" repeated in every value, which we already dealt with above.)
Each Name entry is of the form "USS Shipname (Registry-Number)", so if we look for a space followed by an opening parenthesis, we can remove everything from that position to the right:
> data$Name <- str_replace(data$Name, regex(" \\([^)]+\\)"), "")
> data$Name
 [1] "USS Enterprise"   "USS Enterprise"   "USS Enterprise"   "USS Enterprise"
 [5] "USS Enterprise"   "USS Enterprise"   "USS Voyager"      "USS Defiant"
 [9] "USS Discovery"    "USS Cerritos"     "USS Shenzhou"     "USS Excelsior"
[13] "USS Saratoga"     "USS Prometheus"   "USS Thunderchild" "USS Yamato"
[17] "USS Equinox"      "USS Reliant"      "USS Stargazer"    "USS Titan"
We can also change the Name from string to factor:
> data$Name <- as.factor(data$Name)
In some cases, you may need to reshape your dataset, i.e., transform it from wide format to long format or vice versa. The `tidyr` package offers helpful functions to do this.
library(tidyr)
data_long <- pivot_longer(data, cols = c("column1", "column2"), names_to = "variable", values_to = "value")
Example: If we use columns Name and Class from our starship dataframe, we get:
> library(tidyr)
> data_long <- pivot_longer(data, cols = c("Class", "Name"), names_to = "variable", values_to = "value")
> head(data_long)
# A tibble: 6 × 9
  RegNo      Captain    Launch Appearances Complement `Warp Factor` `Spore Drive` variable value
  <chr>      <chr>       <dbl> <chr>            <int>         <dbl> <lgl>         <chr>    <chr>
1 NCC-1701   Capt. Jam…   2245 Star Trek:…        430           8   FALSE         Class    Cons…
2 NCC-1701   Capt. Jam…   2245 Star Trek:…        430           8   FALSE         Name     USS …
3 NCC-1701-A Capt. Jam…   2286 Star Trek …        500           9   FALSE         Class    Cons…
4 NCC-1701-A Capt. Jam…   2286 Star Trek …        500           9   FALSE         Name     USS …
5 NCC-1701-B Capt. Joh…   2293 Star Trek:…        750           9.6 FALSE         Class    Exce…
6 NCC-1701-B Capt. Joh…   2293 Star Trek:…        750           9.6 FALSE         Name     USS …
This is just to show how it works. It probably wouldn't make much sense in this case.
separate_longer_delim(data, cols, delim, ...)
This function uses a delimiter (delim), like a comma or a space, to separate strings in given columns (cols) in the dataframe (data) into separate rows.
Example: The longer format could be useful here because some rows list several captains. If we want one captain per row:
> data <- separate_longer_delim(data, Captain, delim = ",")
> head(data)
            Name      RegNo        Class               Captain Launch
1 USS Enterprise   NCC-1701 Constitution   Capt. James T. Kirk   2245
2 USS Enterprise   NCC-1701 Constitution            Capt. Pike   2245
3 USS Enterprise NCC-1701-A Constitution   Capt. James T. Kirk   2286
4 USS Enterprise NCC-1701-B    Excelsior   Capt. John Harriman   2293
5 USS Enterprise NCC-1701-C   Ambassador  Capt. Rachel Garrett   2344
6 USS Enterprise NCC-1701-D       Galaxy Capt. Jean-Luc Picard   2363
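Because the captains were separated by a comma followed by a space, the second and later entries will likely keep a leading space after the split. The str_trim() call from the text-cleaning section takes care of that:
library(stringr)
# Remove the leading space left over from splitting on ","
data$Captain <- str_trim(data$Captain)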
We can now convert the strings in Captain to factors:
> data$Captain <- as.factor(data$Captain)
data_wide <- pivot_wider(data, names_from = "variable", values_from = "value")
Example: If we want an overview of who has been captain on each ship, we could make a dataframe with one column for each registry number (RegNo) and values from the Captain column:
> data_wide <- data %>% pivot_wider(names_from = RegNo, values_from = Captain)
> head(data_wide)
# A tibble: 6 × 27
  Name   Class Launch Appearances Complement `Warp Factor` `Spore Drive` `NCC-1701` `NCC-1701-A`
  <fct>  <fct>  <dbl> <chr>            <int>         <dbl> <lgl>         <fct>      <fct>
1 USS E… Cons…   2245 "Star Trek…        430          8    FALSE         Capt. Jam… NA
2 USS E… Cons…   2286 "Star Trek…        500          9    FALSE         NA         Capt. James…
3 USS E… Exce…   2293 "Star Trek…        750          9.6  FALSE         NA         NA
4 USS E… Amba…   2344 "Star Trek…        700          9    FALSE         NA         NA
5 USS E… Gala…   2363 "Star Trek…       1014          9.6  FALSE         NA         NA
6 USS E… Sove…   2372 "Movies VI…        885          9.99 FALSE         NA         NA
# ℹ 18 more variables: `NCC-1701-B` <fct>, `NCC-1701-C` <fct>, `NCC-1701-D` <fct>,
#   `NCC-1701-E` <fct>, `NCC-74656` <fct>, `NX-74205` <fct>, `NCC-1031` <fct>,
#   `NCC-75567` <fct>, `NCC-1227` <fct>, `NCC-2000` <fct>, `NCC-31911` <fct>, `NX-59650` <fct>,
#   `NCC-63549` <fct>, `NCC-71807` <fct>, `NCC-72381` <fct>, `NCC-1864` <fct>,
#   `NCC-2893` <fct>, `NCC-80102` <fct>
After cleaning, it’s common to group and summarize data to get insights before analysis.
library(dplyr)
# Group by a variable and calculate mean of another variable
data_summary <- data %>%
group_by(grouping_variable) %>%
summarise(mean_value = mean(numeric_variable, na.rm = TRUE))
Example: Let's group the starships by Class and calculate the mean Complement for each class:
> data_summary <- data %>%
+     group_by(Class) %>%
+     summarise(mean_value = mean(Complement, na.rm = TRUE))
> data_summary
# A tibble: 16 × 2
   Class         mean_value
   <fct>              <dbl>
 1 Akira               500
 2 Ambassador          700
 3 California           80
 4 Constellation       500
 5 Constitution        465
 6 Crossfield          136
 7 Defiant              50
 8 Excelsior           750
 9 Galaxy             1014
10 Intrepid            150
11 Luna                350
12 Miranda             158.
13 Nova                 80
14 Prometheus            4
15 Sovereign           885
16 Walker              215
Once the data is cleaned, you can save it back to your local machine or a database for future use.
write.csv(data_clean, "path/to/cleaned_data.csv", row.names = FALSE)
library(writexl)
write_xlsx(data_clean, "path/to/cleaned_data.xlsx")
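If the data originally came from a database, you can also write the cleaned table back with DBI. A minimal sketch; the table name my_table_clean is just an example:
library(DBI)
# Reuse a connection like the one from the database section above
con <- dbConnect(RPostgres::Postgres(), dbname = "mydb", host = "localhost",
                 user = "username", password = "password")
# Write the cleaned data frame to a new table, replacing it if it already exists
dbWriteTable(con, "my_table_clean", data_clean, overwrite = TRUE)
dbDisconnect(con)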
Data cleaning is a critical aspect of the data science process, ensuring that the data is free from errors and ready for analysis. R offers powerful packages and functions for handling missing data, correcting data types, dealing with outliers, and reshaping the data. By mastering these techniques, you'll be well on your way to preparing your data for meaningful insights and robust analyses.
Experienced dev and PM. Data science, DataOps, Python and R. DevOps, Linux, clean code and agile. 10+ years working remotely. Polyglot. Startup experience.
LinkedIn Profile
Statistics & R - a blog about - you guessed it - statistics and the R programming language.
R-blog
Erlang Explained - a blog on the marvellous programming language Erlang.
Erlang Explained