Removing Duplicate Percentage Entries in R: Efficient Data Cleaning with dplyr

Understanding the Problem

The task is to clean a dataset by removing rows whose percentage is within 10% of another entry for the same subject and block. Once the rows in each subject-and-block group are sorted by percentage, the closest other entry to any row is always its previous or next neighbor, so comparing each row against those two values is enough to decide whether it should be removed.
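
To illustrate why the neighbor check is sufficient, here is a minimal base R sketch. The vector x and the cutoff of 10 are made-up illustration values (the cutoff assumes "within 10%" means 10 percentage points). It compares a brute-force check against every other entry with a check against the sorted neighbors only.

```r
# Hypothetical values for one subject/block group, already sorted ascending
x <- c(22.45, 26.53, 51.02)
threshold <- 10  # assumed cutoff, in the same units as x

# Brute force: absolute distance from each entry to every other entry
d <- abs(outer(x, x, "-"))
diag(d) <- Inf                      # ignore each entry's distance to itself
close_any <- apply(d, 1, min) <= threshold

# Neighbor check: gap to the previous and next value in sorted order
gap_prev <- c(Inf, diff(x))         # first entry has no previous value
gap_next <- c(diff(x), Inf)         # last entry has no next value
close_neighbor <- pmin(gap_prev, gap_next) <= threshold

# Both approaches flag the same entries
identical(close_any, close_neighbor)
```

In this example the first two values are flagged as close to each other and the third is not, whichever way the check is done.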

Background

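To approach this problem, we’ll use the dplyr library in R, which provides a powerful set of tools for data manipulation and analysis. Specifically, we’ll use mutate(), arrange(), group_by(), filter(), and ungroup(), together with lag() and lead() to access the previous and next rows, and parse_number() from the readr library to convert the percentage strings to numbers.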
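
As a quick illustration of the helper functions, the sketch below runs parse_number(), lag(), and lead() on a small made-up vector of percentage strings (not the article's dataset). The default argument fills the position that has no neighbor.

```r
library(dplyr)
library(readr)

# Made-up percentage strings, used only for illustration
p <- parse_number(c("10.5%", "12%", "40%"))
p                                  # numeric values 10.5, 12, 40

# lag() shifts values down by one position, lead() shifts them up;
# `default` fills the slot that has no neighbor
prev_p <- lag(p, default = -Inf)   # -Inf, 10.5, 12
next_p <- lead(p, default = Inf)   # 12, 40, Inf

# Gap from each value to its previous and next neighbor
p - prev_p                         # Inf, 1.5, 28
next_p - p                         # 1.5, 28, Inf
```

Using -Inf and Inf as defaults makes the missing-neighbor gap infinite, so the first and last values are never treated as close to a neighbor that does not exist.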

Solution Approach

The proposed solution involves the following steps:

  1. Convert the Percentage column to numeric values using the parse_number() function from the readr library.
  2. Sort the dataset by Subject, Block, and Percentage in ascending order using the arrange() function.
  3. Group the dataset by Subject and Block, then filter out rows where the percentage is within 10% of its previous or next value using the filter() function.
  4. Finally, ungroup the filtered dataset.
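
The four steps above can also be sketched as a single reusable function. The name drop_close_percentages() and its threshold argument are hypothetical, introduced only for illustration. The default of 10 assumes that "within 10%" means 10 percentage points on the 0 to 100 scale that parse_number() returns.

```r
library(dplyr)
library(readr)

# Hypothetical helper: drop rows whose Percentage is within `threshold`
# of another row in the same Subject/Block group.
# Assumes columns named Subject, Block, and Percentage (strings like "14.29%").
drop_close_percentages <- function(data, threshold = 10) {
  data %>%
    mutate(Percentage = parse_number(Percentage)) %>%   # step 1: strings to numbers
    arrange(Subject, Block, Percentage) %>%             # step 2: sort
    group_by(Subject, Block) %>%                        # step 3: group and filter
    filter(
      Percentage - lag(Percentage, default = -Inf) > threshold,
      lead(Percentage, default = Inf) - Percentage > threshold
    ) %>%
    ungroup()                                           # step 4: ungroup
}

# Tiny made-up input: 20% and 25% are close to each other, 60% is not
toy <- data.frame(
  Subject = c(1L, 1L, 1L),
  Block = c(1L, 1L, 1L),
  Percentage = c("25%", "60%", "20%")
)
result <- drop_close_percentages(toy)
result
```

Note that both rows of a close pair are dropped, so only the 60% row remains in this toy example.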

Step-by-Step Code

library(dplyr)
library(readr)

# Build the sample data as a dataframe
df <- structure(list(
  Stimuli = c(1L, 2L, 3L, 1L, 2L, 3L, 13L, 14L,
              15L, 1L),
  Subject = c(1L, 1L, 1L, 2L, 2L, 2L, 100L, 100L, 100L, 1002L),
  Block = c(13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L),
  TChosen = c(7L, 18L, 24L, 3L, 15L, 31L, 13L, 11L, 25L, 9L),
  Percentage = c("14.29%", "36.73%", "48.98%", "6.12%", "30.61%", 
                "63.27%", "26.53%", "22.45%", "51.02%", "18.37%")),
  class = "data.frame", row.names = c(NA, -10L))

# Convert Percentage column to numeric values
df <- df %>% mutate(Percentage = readr::parse_number(Percentage))

# Sort the dataset by Subject, Block, and Percentage in ascending order
df <- df %>% arrange(Subject, Block, Percentage)

# Group by Subject and Block, then keep only rows whose percentage is more than
# 10 percentage points away from both its previous and next value.
# parse_number() returns values on a 0-100 scale (e.g. 14.29), so the threshold
# of 10 is expressed in percentage points.
df <- df %>% group_by(Subject, Block) %>% 
            filter(Percentage - lag(Percentage, default = -Inf) > 10 & 
                   lead(Percentage, default = Inf) - Percentage > 10)

# Ungroup the filtered dataset
df <- df %>% ungroup()

Explanation

The provided code takes a step-by-step approach to solve the problem:

  1. The first two lines load the required libraries: dplyr for data manipulation and readr for parsing numeric values from strings.
  2. The structure() call builds the sample dataset as a dataframe, df.
  3. The mutate() function is used to convert the Percentage column to numeric values using the parse_number() function from the readr library.
  4. The arrange() function sorts the dataset by Subject, Block, and Percentage in ascending order.
  5. The group_by() function groups the dataset by Subject and Block.
  6. The filter() function keeps only the rows that satisfy the condition (Percentage - lag(Percentage, default = -Inf) > 10 & lead(Percentage, default = Inf) - Percentage > 10), which drops every row whose percentage is within 10 percentage points of its previous or next value. The defaults of -Inf and Inf make the first and last row of each group pass the comparison on the side where no neighbor exists. With the sample data, this removes the 22.45 and 26.53 rows for Subject 100 and keeps the other eight rows.
  7. Finally, the ungroup() function ungroups the filtered dataset.

Conclusion

By following this step-by-step approach and utilizing the power of dplyr for data manipulation, we can efficiently remove rows from a dataset where the percentage is within 10% of another entry for the same subject and block.


Last modified on 2025-03-30