Top Data Wrangling and Data Cleaning Packages for R in 2023: A Comprehensive Guide

CiteDrive
6 min read · May 2, 2023

Data wrangling and cleaning are essential processes in data science, and R has been the go-to programming language for many data professionals. With the increasing demand for data-driven insights, the R community has developed several packages that make data wrangling and cleaning more accessible, efficient, and effective. In this article, we will discuss the top data wrangling and cleaning packages for R in 2023.

Tidyverse

The tidyverse is a collection of R packages that share a common design philosophy, grammar, and data structures. Its core includes dplyr, ggplot2, tidyr, readr, purrr, and stringr, among others, covering data import, manipulation, cleaning, and visualization. For instance, dplyr provides a grammar for data manipulation, while tidyr reshapes data between wide and long layouts. Calling library(tidyverse) attaches the core packages in one step, which is a large part of why the tidyverse has become a default starting point for data science in R.
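
As a minimal sketch of the reshaping that tidyr handles, the example below converts a small wide table into long format with pivot_longer(). The wide_df table and its values are made up for illustration and do not come from a real dataset.

```r
library(tidyr)

# Hypothetical wide table: one row per product, one column per day
wide_df <- data.frame(
  product = c("A", "B"),
  `2022-01-01` = c(10, 20),
  `2022-01-02` = c(15, 30),
  check.names = FALSE  # keep the date strings as column names
)

# Reshape to long format: one row per product/date combination
long_df <- pivot_longer(
  wide_df,
  cols = -product,        # every column except product holds quantities
  names_to = "date",      # former column names go here
  values_to = "quantity"  # former cell values go here
)
```

The long layout, with one observation per row, is the shape that dplyr and ggplot2 generally expect.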

dplyr

Let’s say you have a data frame df that contains information about sales transactions, including the date, product name, quantity, and price:

df <- data.frame(
  date = c("2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"),
  product = c("A", "B", "A", "B"),
  quantity = c(10, 20, 15, 30),
  price = c(5.99, 3.99, 6.99, 2.99)
)

Using dplyr, you can easily perform various data manipulation tasks. For example, you can:

  1. Filter the data to include only transactions for product A:
library(dplyr)
filtered_df <- filter(df, product == "A")

This will create a new data frame filtered_df that contains only rows where the product column equals "A".

  2. Group the data by date and summarize the total sales for each day:
grouped_df <- group_by(df, date)
summarized_df <- summarize(grouped_df, total_sales = sum(quantity * price))

This will group the original data frame df by the date column, and then calculate the total sales for each day by multiplying the quantity and price columns and summing the results. The resulting data frame summarized_df will have two columns: date and total_sales.

These are just a few examples of what you can do with dplyr. Its syntax is designed to be easy to read and understand, making it a popular choice for data manipulation tasks in R.
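
The grouping and summarizing steps above can also be chained into a single pipeline. The following is a sketch that assumes df is the sales data frame defined earlier; the revenue column and the daily_sales name are introduced here only for illustration.

```r
library(dplyr)

df <- data.frame(
  date = c("2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"),
  product = c("A", "B", "A", "B"),
  quantity = c(10, 20, 15, 30),
  price = c(5.99, 3.99, 6.99, 2.99)
)

daily_sales <- df %>%
  mutate(revenue = quantity * price) %>%                      # per-row revenue
  group_by(date) %>%                                          # one group per day
  summarise(total_sales = sum(revenue), .groups = "drop") %>% # total per day
  arrange(desc(total_sales))                                  # best day first
```

Each verb takes a data frame and returns a data frame, so the pipe avoids intermediate variables such as grouped_df.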

data.table

data.table is a package for fast, memory-efficient data manipulation. It extends the base R data frame with a compact dt[i, j, by] syntax, where i selects rows, j computes on columns, and by defines groups, so it feels familiar to anyone used to base R subsetting. It is particularly well suited to large datasets: it can add or update columns in place instead of copying the table, and its grouping, join, and subsetting operations are often faster than the equivalent data frame operations.
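
As a rough sketch of that syntax, the example below repeats the filtering and daily-total tasks from the dplyr section with data.table. The data is the same illustrative sales table; the names product_a, daily_sales, and revenue are assumptions made for this example.

```r
library(data.table)

dt <- data.table(
  date = c("2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"),
  product = c("A", "B", "A", "B"),
  quantity = c(10, 20, 15, 30),
  price = c(5.99, 3.99, 6.99, 2.99)
)

# i: keep only rows for product A
product_a <- dt[product == "A"]

# j and by: total sales per day
daily_sales <- dt[, .(total_sales = sum(quantity * price)), by = date]

# := adds a column by reference, without copying the table
dt[, revenue := quantity * price]
```

Because := modifies dt in place, no reassignment is needed after the last line.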

plyr

Plyr is a package that provides tools for splitting, applying, and combining data in R. It offers several functions, including ddply, dlply, and daply, which allow users to apply a function to a dataset and return the results in a convenient format. Plyr is particularly useful when working with data that requires grouping or summarizing.

Using the same data frame from the previous example:

df <- data.frame(
  date = c("2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"),
  product = c("A", "B", "A", "B"),
  quantity = c(10, 20, 15, 30),
  price = c(5.99, 3.99, 6.99, 2.99)
)

Using plyr, you can easily perform various data manipulation tasks. For example, you can:

  1. Split the data by product and apply a function to calculate the total sales for each product:
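# plyr masks several dplyr functions (such as summarise) when loaded after dplyr,
# so run this in a fresh session or load plyr first
library(plyr)
result_df <- ddply(df, .(product), summarise, total_sales = sum(quantity * price))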

This will split the original data frame df into two smaller data frames based on the unique values in the product column (i.e., products A and B), and then calculate the total sales for each product by multiplying the quantity and price columns and summing the results. The resulting data frame result_df will have two columns: product and total_sales.

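  2. Apply a custom function to each row of the data frame:
# Label a single row by its quantity
custom_function <- function(row) {
  if (row$quantity > 20) "large" else "small"
}
# adply() with margin 1 passes each row to the function as a one-row data frame
result_df <- adply(df, 1, function(row) data.frame(size = custom_function(row)))

This applies custom_function to each row of the data frame df. Base R's apply() is not suitable here, because it first converts the data frame to a character matrix, so row$quantity fails; adply() keeps each row as a data frame. The resulting data frame result_df contains the original columns plus a size column holding "large" or "small" for each row. For a simple condition like this one, the vectorized ifelse(df$quantity > 20, "large", "small") gives the same labels without a row-wise loop.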

janitor

janitor is a package for examining and cleaning messy datasets in R. Its best-known function, clean_names(), converts column names to a consistent format such as snake_case, which helps when data arrives from spreadsheets with spaces and punctuation in the headers. It also offers remove_empty() for dropping empty rows and columns, get_dupes() for finding duplicate records, and tabyl() for quick frequency tables.
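
A short sketch of a typical janitor cleanup follows. The messy_df table is hypothetical, built only to show inconsistent headers, an empty column, an empty row, and a duplicated record.

```r
library(janitor)

# Hypothetical messy input, e.g. as read from a spreadsheet
messy_df <- data.frame(
  `Product Name` = c("A", "B", "A", NA),
  `Unit Price` = c(5.99, 3.99, 5.99, NA),
  `Empty Col` = c(NA, NA, NA, NA),
  check.names = FALSE  # keep the spaces in the column names
)

# Standardize headers: "Product Name" becomes product_name, and so on
clean_df <- clean_names(messy_df)

# Drop rows and columns that contain only missing values
clean_df <- remove_empty(clean_df, which = c("rows", "cols"))

# List records that appear more than once (all columns are compared)
dupes <- get_dupes(clean_df)
```

Note that get_dupes() only reports duplicates; removing them is a separate step, for example with dplyr's distinct().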

stringr

Stringr is a package that provides tools for working with character strings in R. It offers functions for tasks such as pattern matching, string splitting, and string manipulation. Stringr is particularly useful when cleaning text data, such as social media posts, emails, and customer feedback.

library(stringr)

# create a character string
my_string <- "Hello, World!"

# extract a substring
# (named first_word so the result does not share a name with base R's substring())
first_word <- str_sub(my_string, start = 1, end = 5)

# count the number of characters in the string
num_chars <- str_length(my_string)

# split the string into a vector of words
word_vector <- str_split(my_string, pattern = ", ")[[1]]

In this example, we first load the stringr library. We then create a character string my_string. We use various stringr functions to extract a substring from the string (str_sub()), count the number of characters in the string (str_length()), and split the string into a vector of words (str_split()).

Beyond these basics, stringr covers pattern detection (str_detect()), replacement (str_replace_all()), and whitespace trimming (str_trim()). Its functions share a consistent str_ prefix and take the string as their first argument, which makes them easy to combine and can save a significant amount of time in data cleaning and analysis tasks.
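
To make the text-cleaning use case concrete, here is a small sketch that normalizes a few free-text entries. The feedback vector is invented for illustration, and the cleaning steps are one reasonable choice rather than a fixed recipe.

```r
library(stringr)

# Hypothetical customer feedback with inconsistent spacing, case, and punctuation
feedback <- c("  Great product!! ", "great   PRODUCT", "Terrible... would not buy")

cleaned <- str_squish(feedback)                         # trim ends, collapse repeated spaces
cleaned <- str_to_lower(cleaned)                        # normalize case
cleaned <- str_replace_all(cleaned, "[[:punct:]]", "")  # strip punctuation

# Flag entries that mention the word "product"
mentions_product <- str_detect(cleaned, "product")
```

After cleaning, the first two entries become identical ("great product"), which is what makes later counting or grouping of text values reliable.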

lubridate

Lubridate is a package that provides tools for working with dates and times in R. It offers functions for tasks such as extracting date components, converting dates to different formats, and performing arithmetic operations with dates. Lubridate is particularly useful when working with time-series data or when analyzing data with a temporal component.

library(lubridate)

# create a date object from a string
my_date <- ymd("2022-01-01")

# extract components of the date
date_year <- year(my_date)
date_month <- month(my_date)
date_day <- day(my_date)

# perform arithmetic with dates
next_month <- my_date + months(1)
next_year <- my_date + years(1)

# calculate the difference between two dates
start_date <- ymd("2022-01-01")
end_date <- ymd("2022-01-31")
days_between <- as.numeric(end_date - start_date)

# format dates for display
formatted_date <- format(my_date, "%A, %B %d, %Y")

In this example, we first load the lubridate library. We then create a date object my_date using the ymd() function, which parses a string whose components appear in year, month, day order. We then use lubridate functions to extract the year, month, and day components of the date and to add a month or a year to it. Subtracting one date from another gives the difference in days, and base R's format() function renders the date for display.

Lubridate provides a powerful set of tools for working with dates and times in R, making it easier to perform tasks such as date arithmetic, date parsing and formatting, and date component extraction.
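
One detail of date arithmetic is worth a sketch: adding months(1) to a date at the end of a month can land on a day that does not exist. The example below shows this and the %m+% operator that lubridate provides for such cases. The variable names are chosen for illustration.

```r
library(lubridate)

month_end <- ymd("2022-01-31")

# Plain addition yields NA, because February 31 does not exist
naive <- month_end + months(1)

# %m+% rolls back to the last valid day of the target month: 2022-02-28
rolled <- month_end %m+% months(1)

# floor_date() snaps a date to the start of its month: 2022-01-01
month_start <- floor_date(month_end, unit = "month")
```

For dates early in the month, such as the 2022-01-01 used above, both forms of addition give the same result.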

Conclusion

In conclusion, data wrangling and cleaning are critical processes in data science, and R offers several packages that make these tasks more accessible, efficient, and effective. The packages discussed in this article, including the tidyverse, data.table, plyr, janitor, stringr, and lubridate, are among the top data wrangling and cleaning packages for R in 2023. By leveraging these packages, data professionals can streamline their data preprocessing tasks and focus on data analysis and insights.
