Getting And Cleaning Data With R

eric | Sept. 13, 2024, 5:51 p.m.

Data cleaning is a crucial first step in any data analysis or machine learning pipeline. Without clean and properly structured data, the results of any analysis can be misleading or invalid. In this guide, we will cover how to get data from various sources and clean it in R to ensure that it's ready for further analysis.

Getting Data in R

There are multiple ways to bring data into R from various formats like CSV, Excel, databases, APIs, and more.

1. Loading Data from a CSV File

The most common format for datasets is the CSV (Comma Separated Values) file. The read.csv() function in R makes it straightforward to load CSV files. There are two versions of read_csv():

- read_csv() assumes that the field separator is a comma and the decimal mark is a period.

- read_csv2() assumes that the field separator is a semicolon and the decimal mark is a comma.

# Load data from a CSV file
data <- read.csv2("path/to/your/file.csv", stringsAsFactors = FALSE, skip = 1)

- stringsAsFactors = FALSE ensures that string columns are not automatically converted to factors, which can be useful for data cleaning (Note: this is overridden by as.is and colClasses, both of which allow finer control.)

- skip is useful when the first row(s) of the CSV file is/are empty or contain(s) data that you don't want.

Example: Assuming we have a file.csv with the content

bla bla bla bla bla
A B C D E
1 2 3 4 5
6 7 8 9 10
11 12 13 14

15

the command above gives us

> data_csv <- read.csv2("/home/eric/Documents/file.csv", stringsAsFactors = FALSE, skip = 1)
> data_csv
       A.B.C.D.E
1      1,2,3,4,5
2     6,7,8,9,10
3 11,12,13,14,15

2. Loading Data from Excel Files

For Excel files, the readxl package allows for reading data from .xls or .xlsx files.

library(readxl)
# Load data from an Excel file
data <- read_excel("path/to/your/file.xlsx", sheet = 1, skip = 1)

- sheet specifies which sheet to read from the Excel file.

- skip specifies the number of rows to skip before reading data.

Example: The command above - with the same data as in Section 1 - gives us:

> data <- read_excel("/home/eric/Documents/file.xlsx", sheet = 1, skip = 1)                                                                                            
> data
# A tibble: 3 × 5
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     3     4     5
2     6     7     8     9    10
3    11    12    13    14    15

3. Loading Data from a Database

To pull data from a MySQL or PostgreSQL database, you can use the DBI and RMySQL or RPostgres packages, respectively.

library(DBI)
# Connect to a postgresql database
con <- dbConnect(RPostgres::Postgres(), dbname = "mydb", host = "localhost",
                 user = "username", password = "password")
# Query data
data <- dbGetQuery(con, "SELECT * FROM my_table")
# Close the connection
dbDisconnect(con)

Example: Assuming we have a my_table that looks like this

A B C D E
a1 b1 c1 d1 e1
a2 b2 c2 d2 e2
a3 b3 c3 d3 e3
a4 b4 c4 d4 e4
a5 b5 c5 d5 e5

the command above would give you

> con <- dbConnect(RPostgres::Postgres(),dbname = "mydb",host = "localhost", port = 5432, user = "your_username", password = "your_password")
> data <- dbGetQuery(con, "SELECT * FROM my_table")
> data
  id  a  b  c  d  e
1  1 a1 b1 c1 d1 e1
2  2 a2 b2 c2 d2 e2
3  3 a3 b3 c3 d3 e3
4  4 a4 b4 c4 d4 e4
5  5 a5 b5 c5 d5 e5
R has connectors for almost all types of databases. For most relational databases, the DBI package is a common interface to work with in R. For example, packages like RSQLite, RMySQL, and RPostgres work seamlessly with DBI to execute queries and fetch results. For MySQL, there is the RMySQL library, MariaDB; RMariaDB, PostgreSQL; RPostgreSQL or RPostgres, SQLite; RSQLite, SQL Server (Microsoft); odbc or RODBC, Oracle Database; ROracle or DBI and odbc. In general, any ODCB-compliant database can work with R using DBI and odbc (Microsoft Access, Teradata, etc.). There are also libraries for NoSQL databases such as MongoDB (mongolite), CouchDB (R4CouchDB), and Cassandra (RCassandra), among others. Finally, Google BigQuery (bigquery), Amazon Redshift (odbc or DBI with RPOstgres), Hive (RHive), and Impala (RImpala) are all covered, as well as in-memory databases like Redis (rredis).

 

4. Fetching Data from an API

To get data from an API, you can use the httr2 and jsonlite packages to send requests and parse JSON data.

library(httr2)
library(jsonlite)

# Send a GET request to the API
response <- request("https://api.example.com/data") %>% req_perform()

# Check if the request wa succesful
if (resp_status(response) == 200) {

  # Parse the JSON response
  data <- response %>% resp_body_json()

}

Explanation:

- httr2::request() creates a request object, which can be further modified or sent using req_perform().

- resp_status() is used to extract the status code from the response. Code 200 means that the request was successful.

- resp_body_json() directly parses the JSON response body into an R object.

Example: Loading and parsing JSON data from the Star Trek API, finding data on spacecraft:

# Load the packages
library(httr2)
library(jsonlite)

# Define the API URL
url <- "https://stapi.co/api/v1/rest/spacecraft/search"

# Make the API request using httr2
response <- request(url) %>%
  req_perform()

# Check if the request was successful
if (resp_status(response) == 200) {
  print("Request was successful")
 
  # Parse the JSON content
  data <- response %>%
    resp_body_json()
 
  # View the structure of the parsed data
  str(data)
 
  # Extract and print the spacecraft's name
  spacecraft <- data$spacecrafts[[1]]
  print(spacecraft)
 
  print(paste("Spacecraft's name:", spacecraft$name))
} else {
  stop("Request failed with status: ", resp_status(response))
}

Running the code above gives us the data on the first spacecraft in the list from the API:

[1] "Request was successful"
$uid
[1] "SRMA0000245113"

$name
[1] "042"

$registry
[1] "042"

$status
[1] "Active"

$dateStatus
[1] "3189"

$spacecraftClass
$spacecraftClass$uid
[1] "SCMA0000226396"

$spacecraftClass$name
[1] "Worker bee"

$owner
NULL

$operator
NULL

[1] "Spacecraft's name: 042"

Examining Your Data

Once you've loaded the data into R, it's important to examine it before proceeding with cleaning.

# Get the structure of the data
str(data)

# View the first few rows of the data
head(data)

# Summary statistics of the data
summary(data)

- str() shows the structure of the dataset, including data types.
- head() provides a preview of the first few rows.
- summary() gives summary statistics such as mean, median, and missing values.

Example: Let's try to examine the data in this CSV-file, containing data on several Star Trek starships.

> data <- read.csv("Documents/star_trek_table.csv")
> str(data)
'data.frame':	20 obs. of  8 variables:
 $ Ship.Name                          : chr  "USS Enterprise (NCC-1701)" "USS Enterprise (NCC-1701-A)" "USS Enterprise (NCC-1701-B)" "USS Enterprise (NCC-1701-C)" ...
 $ Registry.Number                    : chr  "NCC-1701" "NCC-1701-A" "NCC-1701-B" "NCC-1701-C" ...
 $ Class                              : chr  "Constitution-class" "Constitution-class" "Excelsior-class" "Ambassador-class" ...
 $ Captain.s.                         : chr  "Capt. James T. Kirk, Capt. Pike" "Capt. James T. Kirk" "Capt. John Harriman" "Capt. Rachel Garrett" ...
 $ Launch.Date                        : chr  "2245" "2286" "2293" "2344" ...
 $ Notable.Appearances..Series.Movies.: chr  "Star Trek: TOS, Movies I-IV, VI" "Star Trek IV-VI" "Star Trek: Generations" "Star Trek: TNG (Episode: \"Yesterday's Enterprise\")" ...
 $ Ship.s.Complement                  : int  430 500 750 700 1014 885 150 50 136 80 ...
 $ Max.Speed..Warp.Factor.            : chr  "Warp 8" "Warp 9" "Warp 9.6" "Warp 9" ...
> head(data)
                    Ship.Name Registry.Number              Class
1   USS Enterprise (NCC-1701)        NCC-1701 Constitution-class
2 USS Enterprise (NCC-1701-A)      NCC-1701-A Constitution-class
3 USS Enterprise (NCC-1701-B)      NCC-1701-B    Excelsior-class
4 USS Enterprise (NCC-1701-C)      NCC-1701-C   Ambassador-class
5 USS Enterprise (NCC-1701-D)      NCC-1701-D       Galaxy-class
6 USS Enterprise (NCC-1701-E)      NCC-1701-E    Sovereign-class
                       Captain.s. Launch.Date
1 Capt. James T. Kirk, Capt. Pike        2245
2             Capt. James T. Kirk        2286
3             Capt. John Harriman        2293
4            Capt. Rachel Garrett        2344
5           Capt. Jean-Luc Picard        2363
6           Capt. Jean-Luc Picard        2372
                 Notable.Appearances..Series.Movies. Ship.s.Complement Max.Speed..Warp.Factor.
1                    Star Trek: TOS, Movies I-IV, VI               430                  Warp 8
2                                    Star Trek IV-VI               500                  Warp 9
3                             Star Trek: Generations               750                Warp 9.6
4 Star Trek: TNG (Episode: "Yesterday's Enterprise")               700                  Warp 9
5                   Star Trek: TNG, Movies VII, VIII              1014                Warp 9.6
6                                      Movies VIII-X               885               Warp 9.99
> summary(data)
  Ship.Name         Registry.Number       Class            Captain.s.        Launch.Date       
 Length:20          Length:20          Length:20          Length:20          Length:20         
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                                
                                                                                               
 Notable.Appearances..Series.Movies. Ship.s.Complement Max.Speed..Warp.Factor.
 Length:20                           Min.   :   4.0    Length:20              
 Class :character                    1st Qu.: 122.0    Class :character       
 Mode  :character                    Median : 390.0    Mode  :character       
                                     Mean   : 421.2                           
                                     3rd Qu.: 712.5                           
                                     Max.   :1014.0                           
  Ship.Name         Registry.Number       Class            Captain.s.        Launch.Date       
 Length:20          Length:20          Length:20          Length:20          Length:20         
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character 

 

Cleaning Data

Data cleaning involves transforming raw data into a format that is consistent, accurate, and usable for analysis. This process typically includes handling missing data, removing duplicates, correcting inconsistencies, and transforming data types.

1. Handling Missing Data

Missing data can occur for various reasons, and it's important to handle it carefully. You can either remove missing values or replace them with some meaningful statistic like the mean, median, or mode.

- Identifying missing data

sum(is.na(data))  # Total missing values
colSums(is.na(data))  # Missing values per column

Example: For our starship data, we get:

> sum(is.na(data))  # Total missing values
[1] 7
> colSums(is.na(data))  # Missing values per column
                          Ship.Name                     Registry.Number 
                                  0                                   0 
                              Class                          Captain.s. 
                                  0                                   1 
                        Launch.Date Notable.Appearances..Series.Movies. 
                                  6                                   0 
                  Ship.s.Complement             Max.Speed..Warp.Factor. 
                                  0                                   0 

The column containing the launch data is the only one that has a lot of missing data (6 of 20 rows).

- Removing rows with missing values

One way to resolve the problem of missing values, might be to simply remove rows

data_clean <- na.omit(data)

Example: Running this on our dataset above gives us just 14 rows of 20:

> data_clean <- na.omit(data)
> summary(data_clean)
  Ship.Name         Registry.Number       Class            Captain.s.        Launch.Date       
 Length:14          Length:14          Length:14          Length:14          Length:14         
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
 Notable.Appearances..Series.Movies. Ship.s.Complement Max.Speed..Warp.Factor.
 Length:14                           Min.   :   4.0    Length:14              
 Class :character                    1st Qu.: 139.5    Class :character       
 Mode  :character                    Median : 355.5    Mode  :character       
                                     Mean   : 424.6                           
                                     3rd Qu.: 737.5                           
                                     Max.   :1014.0 

- Imputing missing values

Instead of just removing data rows, you can replace missing values with the mean, median, or another statistic using the `dplyr` package.

library(dplyr)
# Replace missing values in a column with the mean
data <- data %>%
  mutate(column_name = ifelse(is.na(column_name), mean(column_name, na.rm = TRUE), column_name))

- %>% - data is piped to the mutate function

- mutate() creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column, like above) and delete columns (by setting their value to NULL).

- ifelse has the form ifelse(test, yes, no)

- mean calculates the mean for the whole column and replaces any NA values with that mean

Note that only the column_name is used above, and not data$column_name. This is called data masking and is a feature of several dplyr verbs: arrange(), count(), filter(), group_by(), mutate(), and summarise(). The term "data masking" comes from the fact that it blurs the distinction between env-variables and data-variables, see also vignette("programming", package = "dplyr").

 

Example: To use this strategy on our dataset, we would first need to convert the launch year (Launch.Date) from a string to an integer.

library(dplyr)

data <- data %>%
  mutate(Launch.Date = suppressWarnings(as.integer(Launch.Date)))

data <- data %>%
  mutate(Launch.Date = ifelse(is.na(Launch.Date), as.integer(mean(Launch.Date, na.rm = TRUE)), Launch.Date))

2. Removing Duplicates

Duplicates can cause bias in your analysis, so it's important to remove them.

# Remove duplicate rows
data_clean <- data[!duplicated(data), ]

3. Transforming Data Types

Every so often, the data types might not be appropriate for the analysis you're planning. For example, a numeric column might be mistakenly read as a character column.

# Convert character column to numeric
data$column_name <- as.numeric(data$column_name)

# Convert character to factor
data$column_name <- as.factor(data$column_name)

Example: In our starship dataset, head(data) and summary(data) above, show us that all the data except for the ships' compliment has been read as strings. However, both the launch year (Launch.Date) and the speed should be numbers. Let's fix the launch year first:

> data$Launch.Date
 [1] "2245"  "2286"  "2293"  "2344"  "2363"  "2372"  "2371"  "2370"  "2256"  NA      "2256" 
[12] "2287"  NA      "2374"  NA      NA      "2370s" "2285"  NA      NA     
> data <- data %>%
+     mutate(Launch.Date = suppressWarnings(as.integer(Launch.Date)))
> data$Launch.Date
 [1] 2245 2286 2293 2344 2363 2372 2371 2370 2256   NA 2256 2287   NA 2374   NA   NA   NA 2285
[19]   NA   NA
> data$Launch.Date[17] <- 2375
> data$Launch.Date
 [1] 2245 2286 2293 2344 2363 2372 2371 2370 2256   NA 2256 2287   NA 2374   NA   NA 2375 2285
[19]   NA   NA

First, we convert the strings with mutate and the as.integer-function. Second, we set the NA coming from "2370s" to 2375, which is most likely better than just leaving it at NA, depending on what we want to do next.

The maximum speed (warp factor) columns is a bit more challenging. It would be reasonable to have Warp in the column heading instead of in every row in front of the number. That way we would get only numbers, except for one row — the one for USS Discovery (NCC-1031) — which has a spore drive indicated in the max speed column. It would be tempting to just delete this string to get the number, but having a spore drive sounds like a big deal, and we don't want to lose important information. So, before we start cleaning up the Max.Speed..Warp.Factor.-column, let's extract the spore drive information and put it in a separate column:

data <- data %>%
  mutate(Spore.Drive = grepl("Spore Drive", Max.Speed..Warp.Factor., fixed = TRUE))

The last two columns in the dataset are now:

Max.Speed..Warp.Factor. Spore.Drive
1                   Warp 8       FALSE
2                   Warp 9       FALSE
3                 Warp 9.6       FALSE
4                   Warp 9       FALSE
5                 Warp 9.6       FALSE
6                Warp 9.99       FALSE
7               Warp 9.975       FALSE
8               Warp 9.982       FALSE
9  Warp 9.99 / Spore Drive        TRUE

...

Next, we need to get rid of everything that isn't a number or a decimal point in the Max.Speed..Warp.Factor.-column:

data <- data %>%
  mutate(Max.Speed..Warp.Factor. =
           as.numeric(
             gsub("[^0-9.-]", "", Max.Speed..Warp.Factor.)
           )
         )

The columns now look like this:

   Max.Speed..Warp.Factor. Spore.Drive
1                    8.000       FALSE
2                    9.000       FALSE
3                    9.600       FALSE
4                    9.000       FALSE
5                    9.600       FALSE
6                    9.990       FALSE
7                    9.975       FALSE
8                    9.982       FALSE
9                    9.990        TRUE

...

In addition to string to number-conversion, we should also convert the ships' classes to factors. Let's clean up the class names first, though. They all have “-class” in the name, which is superfluous.

> data <- data %>% mutate(Class = gsub("-class", "", Class))
> data$Class
 [1] "Constitution"  "Constitution"  "Excelsior"     "Ambassador"    "Galaxy"       
 [6] "Sovereign"     "Intrepid"      "Defiant"       "Crossfield"    "California"   
[11] "Walker"        "Excelsior"     "Miranda"       "Prometheus"    "Akira"        
[16] "Galaxy"        "Nova"          "Miranda"       "Constellation" "Luna"   

The Class column still just contain strings. We need to convert them to factors:

> data <- data %>% mutate(Class = as.factor(Class))
> data$Class
 [1] Constitution  Constitution  Excelsior     Ambassador    Galaxy        Sovereign    
 [7] Intrepid      Defiant       Crossfield    California    Walker        Excelsior    
[13] Miranda       Prometheus    Akira         Galaxy        Nova          Miranda      
[19] Constellation Luna         
16 Levels: Akira Ambassador California Constellation Constitution Crossfield ... Walker

4. Dealing with Outliers

Outliers are data points that are significantly different from other observations. While they are not always incorrect, they may require special attention.

# Identify outliers using boxplot
boxplot(data$column_name)

# Remove outliers based on interquartile range (IQR)
Q1 <- quantile(data$column_name, 0.25)
Q3 <- quantile(data$column_name, 0.75)
IQR <- Q3 - Q1
data_clean <- data[data$column_name >= (Q1 - 1.5 * IQR) & data$column_name <= (Q3 + 1.5 * IQR), ]

Example: Let's do boxplots of all the numerical columns:

boxplot(data$Launch.Date)

Boxplot of starhip launch date

This is interesting. There is a big jump between the first launch and the rest. Why is that?

boxplot(data$Ship.s.Complement)

Boxplot of the ships' complement

That a starship has a ship's complement ranging from a bit below to 200 to above a 1000, seems reasonable. However, there is one ship with a complement of 4. An error?

boxplot(data$Max.Speed..Warp.Factor.)

Boxplot of max speed (warp factor)

These ranges appear to be reasonable, given that technology advances, and with it, warp speeds.

5. Renaming Columns

Renaming columns to more meaningful names can make your dataset easier to understand and work with.

# Rename columns
colnames(data) <- c("new_name1", "new_name2", "new_name3")

Example: Some names in our data set are pretty long-winded and ambiguous.

> colnames(data)
[1] "Ship.Name"                           "Registry.Number"                    
[3] "Class"                               "Captain.s."                         
[5] "Launch.Date"                         "Notable.Appearances..Series.Movies."
[7] "Ship.s.Complement"                   "Max.Speed..Warp.Factor."            
[9] "Spore.Drive" 

Let's fix that:

> colnames(data) <-c("Name","RegNo","Class","Captain","Launch","Appearances","Complement","Warp Factor","Spore Drive")
> colnames(data)
[1] "Name"        "RegNo"       "Class"       "Captain"     "Launch"      "Appearances"
[7] "Complement"  "Warp Factor" "Spore Drive"

6. Recoding Variables

You may need to recode variables, especially when converting categorical data into a different format or merging categories.

> library(dplyr)
> x <- c(1, 2, 3, 4, 5, 6, 7)
> case_match(x,
+            1 ~ "One",
+            2 ~ "Two",
+            3 ~ "Three",
+            4 ~ "Four",
+            .default = "Other")
[1] "One"   "Two"   "Three" "Four"  "Other" "Other" "Other"

case_match()  provides many possibilities for matching levels, including ranges, conditions, and defaults. It replaces recode(). Note that for logical vectors, if_else() is recommended. For complex criteria, you may use case_when().

Example: Let's say neo-communists take over the known universe and start rewriting history. One of the first things to go are the historic starship classes. The historically correct classes are (suddenly):

> data -> communist_universe
> communist_universe$Class <- case_match(data$Class,
+   "Constellation" ~ "Lenin",
+   "Excelsior" ~ "Stalin",
+   .default = "Marx"
+ )
> communist_universe
               Name      RegNo  Class                            Captain Launch
1    USS Enterprise   NCC-1701   Marx                Capt. James T. Kirk   2245
2    USS Enterprise   NCC-1701   Marx                         Capt. Pike   2245
3    USS Enterprise NCC-1701-A   Marx                Capt. James T. Kirk   2286
4    USS Enterprise NCC-1701-B Stalin                Capt. John Harriman   2293
5    USS Enterprise NCC-1701-C   Marx               Capt. Rachel Garrett   2344
6    USS Enterprise NCC-1701-D   Marx              Capt. Jean-Luc Picard   2363
7    USS Enterprise NCC-1701-E   Marx              Capt. Jean-Luc Picard   2372
8       USS Voyager  NCC-74656   Marx              Capt. Kathryn Janeway   2371
9       USS Defiant   NX-74205   Marx               Capt. Benjamin Sisko   2370
10    USS Discovery   NCC-1031   Marx                Capt. Gabriel Lorca   2256
11    USS Discovery   NCC-1031   Marx                         Capt. Pike   2256
12     USS Cerritos  NCC-75567   Marx                Capt. Carol Freeman   2319
13     USS Shenzhou   NCC-1227   Marx            Capt. Philippa Georgiou   2256
14    USS Excelsior   NCC-2000 Stalin                  Capt. Hikaru Sulu   2287
15     USS Saratoga  NCC-31911   Marx                 Capt. J.G. Sanders   2319
16   USS Prometheus   NX-59650   Marx Various (Multivector Assault Mode)   2374
17 USS Thunderchild  NCC-63549   Marx                               <NA>   2319
18       USS Yamato  NCC-71807   Marx                Capt. Donald Varley   2319
19      USS Equinox  NCC-72381   Marx               Capt. Rudolph Ransom   2375
20      USS Reliant   NCC-1864   Marx                Capt. Clark Terrell   2285
21    USS Stargazer   NCC-2893  Lenin              Capt. Jean-Luc Picard   2319
22        USS Titan  NCC-80102   Marx                Capt. William Riker   2319...

For more complex recoding, you can use case_match() or case_when():

case_when(
  x %in% c("a", "b") ~ "A",
  x %in% "c" ~ "C",
  x %in% c("d", "e") ~ "D"
)

case_match(
  x,
  c("a", "b") ~ "A",
  "c" ~ "C",
  c("d", "e") ~ "D"
)

7. String Cleaning

If your dataset contains text data, it might include extra whitespaces, special characters, or inconsistent capitalization. These can be cleaned using `stringr` package.

library(stringr)
# Trim leading and trailing whitespaces
data$column_name <- str_trim(data$column_name)

# Convert text to lowercase
data$column_name <- tolower(data$column_name)

In addition, it might be a good idea to explore the textual information further, to see whether there are any superfluous or duplicate information.

Example: We can still improve our dataset. The name contains both the name and the registry number. The latter is already available in the RegNo column. Moreover, in the Class-columns, the word "class" is repeated for each and every ship class.

Let's deal with the name-columns first. Each entry is of the form "USS Shipname (Registry-Number)”, so if we look for a space and a beginning parenthesis, we can get rid of everything from that position and to the right:

> data$Name <- str_replace(data$Name, regex(" \\([^)]+\\)"), "")
> data$Name
 [1] "USS Enterprise"   "USS Enterprise"   "USS Enterprise"   "USS Enterprise"  
 [5] "USS Enterprise"   "USS Enterprise"   "USS Voyager"      "USS Defiant"     
 [9] "USS Discovery"    "USS Cerritos"     "USS Shenzhou"     "USS Excelsior"   
[13] "USS Saratoga"     "USS Prometheus"   "USS Thunderchild" "USS Yamato"      
[17] "USS Equinox"      "USS Reliant"      "USS Stargazer"    "USS Titan"       

We can also change the Name from string to factor:

> data$Name <- as.factor(data$Name)

 

Reshaping Data

In some cases, you may need to reshape your dataset, i.e., transform it from wide format to long format or vice versa. The `tidyr` package offers helpful functions to do this.

1. Pivoting Data

Wide to Long format

library(tidyr)
data_long <- pivot_longer(data, cols = c("column1", "column2"), names_to = "variable", values_to = "value")

Example: If we use columns Name and Class from our starship dataframe, we get:

> library(tidyr)
> data_long <- pivot_longer(data, cols = c("Class", "Name"), names_to = "variable", values_to = "value")
> head(data_long)
# A tibble: 6 × 9
  RegNo      Captain    Launch Appearances Complement `Warp Factor` `Spore Drive` variable value
  <chr>      <chr>       <dbl> <chr>            <int>         <dbl> <lgl>         <chr>    <chr>
1 NCC-1701   Capt. Jam…   2245 Star Trek:…        430           8   FALSE         Class    Cons…
2 NCC-1701   Capt. Jam…   2245 Star Trek:…        430           8   FALSE         Name     USS …
3 NCC-1701-A Capt. Jam…   2286 Star Trek …        500           9   FALSE         Class    Cons…
4 NCC-1701-A Capt. Jam…   2286 Star Trek …        500           9   FALSE         Name     USS …
5 NCC-1701-B Capt. Joh…   2293 Star Trek:…        750           9.6 FALSE         Class    Exce…
6 NCC-1701-B Capt. Joh…   2293 Star Trek:…        750           9.6 FALSE         Name     USS …

This is just to show how it works. It probably wouldn't make much sense in this case.

Longer format

separate_longer_delim(data, cols, delim, ...)

This function uses a delimiter (delim), like a comma or a space, to separate strings in given columns (cols) in the dataframe (data) into separate rows.

Example: The longer format could be useful because we have several captains for each row. If we want a row with one captain in each:

> data <- separate_longer_delim(data, Captain, delim = ",")
> head(data)
            Name      RegNo        Class               Captain Launch
1 USS Enterprise   NCC-1701 Constitution   Capt. James T. Kirk   2245
2 USS Enterprise   NCC-1701 Constitution            Capt. Pike   2245
3 USS Enterprise NCC-1701-A Constitution   Capt. James T. Kirk   2286
4 USS Enterprise NCC-1701-B    Excelsior   Capt. John Harriman   2293
5 USS Enterprise NCC-1701-C   Ambassador  Capt. Rachel Garrett   2344
6 USS Enterprise NCC-1701-D       Galaxy Capt. Jean-Luc Picard   2363

We can now convert the strings in Captain to factors:

> data$Captain <- as.factor(data$Captain)

Long to Wide format

data_wide <- pivot_wider(data, names_from = "variable", values_from = "value")

Example: If we want an overview of who has been captain on each ship, we could make a dataframe with one column for each registry number (RegNo) and values from the Captain-column:

> data_wide <- data %>% pivot_wider(names_from = RegNo, values_from = Captain)
> head(data_wide)
# A tibble: 6 × 27
  Name   Class Launch Appearances Complement `Warp Factor` `Spore Drive` `NCC-1701` `NCC-1701-A`
  <fct>  <fct>  <dbl> <chr>            <int>         <dbl> <lgl>         <fct>      <fct>       
1 USS E… Cons…   2245 "Star Trek…        430          8    FALSE         Capt. Jam… NA          
2 USS E… Cons…   2286 "Star Trek…        500          9    FALSE         NA         Capt. James…
3 USS E… Exce…   2293 "Star Trek…        750          9.6  FALSE         NA         NA          
4 USS E… Amba…   2344 "Star Trek…        700          9    FALSE         NA         NA          
5 USS E… Gala…   2363 "Star Trek…       1014          9.6  FALSE         NA         NA          
6 USS E… Sove…   2372 "Movies VI…        885          9.99 FALSE         NA         NA          
# ℹ 18 more variables: `NCC-1701-B` <fct>, `NCC-1701-C` <fct>, `NCC-1701-D` <fct>,
#   `NCC-1701-E` <fct>, `NCC-74656` <fct>, `NX-74205` <fct>, `NCC-1031` <fct>,
#   `NCC-75567` <fct>, `NCC-1227` <fct>, `NCC-2000` <fct>, `NCC-31911` <fct>, `NX-59650` <fct>,
#   `NCC-63549` <fct>, `NCC-71807` <fct>, `NCC-72381` <fct>, `NCC-1864` <fct>,
#   `NCC-2893` <fct>, `NCC-80102` <fct>

2. Grouping and Summarizing Data

After cleaning, it’s common to group and summarize data to get insights before analysis.

library(dplyr)
# Group by a variable and calculate mean of another variable
data_summary <- data %>%
  group_by(grouping_variable) %>%
  summarise(mean_value = mean(numeric_variable, na.rm = TRUE))

Example: Let's group the starships by Class and calculate the mean Complement for each class:

> data_summary <- data %>%
+   group_by(Class) %>%
+   summarise(mean_value = mean(Complement, na.rm = TRUE))
> data_summary
# A tibble: 16 × 2
   Class         mean_value
   <fct>              <dbl>
 1 Akira               500 
 2 Ambassador          700 
 3 California           80 
 4 Constellation       500 
 5 Constitution        465 
 6 Crossfield          136 
 7 Defiant              50 
 8 Excelsior           750 
 9 Galaxy             1014 
10 Intrepid            150 
11 Luna                350 
12 Miranda             158.
13 Nova                 80 
14 Prometheus            4 
15 Sovereign           885 
16 Walker              215 

Saving Cleaned Data

Once the data is cleaned, you can save it back to your local machine or a database for future use.

1. Save to CSV

write.csv(data_clean, "path/to/cleaned_data.csv", row.names = FALSE)

 

2. Save to Excel

library(writexl)
write_xlsx(data_clean, "path/to/cleaned_data.xlsx")

 

Conclusion

Data cleaning is a critical aspect of the data science process, ensuring that the data is free from errors and ready for analysis. R offers powerful packages and functions for handling missing data, correcting data types, dealing with outliers, and reshaping the data. By mastering these techniques, you'll be well on your way to preparing your data for meaningful insights and robust analyses.

About Me

Experienced dev and PM. Data science, DataOps, Python and R. DevOps, Linux, clean code and agile. 10+ years working remotely. Polyglot. Startup experience.
LinkedIn Profile

By Me

Statistics & R - a blog about - you guessed it - statistics and the R programming language.
R-blog

Erlang Explained - a blog on the marvelllous programming language Erlang.
Erlang Explained