grep() and sub() and regular expression


I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". That is, if a name ends with an underscore followed by a number, the trailing underscore and number should be removed.

But I'd like this to happen only if the variable name contains the word "Start".

I've got a muddled bit of code that doesn't work. Any help would be appreciated!

# Current data frame:    
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))

# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))

# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)
Tags: r, regex, string-substitution

Answers

answered 2 months ago akrun #1

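We can use sub with a capture group that matches everything up to the final suffix, provided the string contains '_Start' and ends with an underscore followed by one or more digits. In the replacement, the backreference \\1 keeps only the captured part, so the trailing underscore and digits are dropped; strings that do not match are returned unchanged. As there are multiple columns, use lapply to loop over the columns, apply sub to each, and assign the output back to a copy of the original data.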

out <- dfbefore
out[] <- lapply(dfbefore, sub, 
            pattern = "^(.*_Start.*)_\\d+$", replacement ="\\1")
out

dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE
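
To see which strings the pattern touches, it can help to run it on a plain character vector first. The following is a minimal sketch, not part of the original answer; the vector `x` and the name `res` are made up for illustration and reuse values from the question.

```r
# Illustrative input: values taken from the question's data frame
x <- c("pmm_StartTimev4_E2_C19_1",   # has "_Start" and a trailing "_1"
       "complete_E1_C12_1",          # trailing "_1" but no "Start"
       "pmm_StartTo_v4_E2_C19",      # has "_Start" but ends in "_C19"
       "delivery_C1_C12")            # neither

# Same pattern as above: keep group 1, drop the final "_<digits>"
res <- sub("^(.*_Start.*)_\\d+$", "\\1", x)
res
# [1] "pmm_StartTimev4_E2_C19" "complete_E1_C12_1"
# [3] "pmm_StartTo_v4_E2_C19"  "delivery_C1_C12"
```

Only the first element changes, because it is the only one that both contains "_Start" and ends with an underscore directly followed by digits.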

answered 2 months ago MrFlick #2

How about something like this using gsub().

stripcol <- function(x) {
  gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))  
}

dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)

We use the regular expression to look for "Start" and then capture everything except the trailing underscore and number. We use lapply to apply the function to all columns.
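
The question mentions variable names, while the example data stores the strings as column values. If the strings really are column names, the same pattern could be applied to names() instead. This is a sketch under that assumption; the data frame `df` and its contents are hypothetical.

```r
# Hypothetical data frame whose *column names* carry the suffix
df <- data.frame(pmm_StartTimev4_E2_C19_1 = 1:2,
                 complete_E1_C12_1        = 3:4,
                 pmm_StartTo_v4_E2_C19    = 5:6)

# Rewrite the names in place; names without "Start" or without a
# trailing "_<digits>" are returned unchanged by sub()
names(df) <- sub("(.*Start.*)_\\d+$", "\\1", names(df))
names(df)
# [1] "pmm_StartTimev4_E2_C19" "complete_E1_C12_1"      "pmm_StartTo_v4_E2_C19"
```

Stripping suffixes can produce duplicate column names (for example if both "x_Start_1" and "x_Start_2" exist), so checking anyDuplicated(names(df)) afterwards may be worthwhile.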

answered 2 months ago Hack-R #3

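doit <- function(x){
  x <- as.character(x)
  # Only touch values that contain "Start"
  if(grepl("Start",x)){
    # Remove a trailing underscore plus digits; anchored with $ so an
    # underscore followed by a digit elsewhere in the string is kept
    x <- sub("_[0-9]+$","",x)
  }
  return(x)
}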


apply(dfbefore,c(1,2),doit)
    a                          b                      
[1,] "pmm_StartTimev4_E2_C19"   "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"    
[3,] "delivery_C1_C12"          "pmm_StartTo_v4_E2_C19"
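
Note that apply() returns a character matrix rather than a data frame. A vectorized variant of the same idea, which keeps the data frame structure, might look like the sketch below. The helper name `strip_suffix` is made up for illustration, and the pattern is anchored with `$` so that only a trailing underscore-plus-digits is removed.

```r
# Data from the question, kept as character columns
dfbefore <- data.frame(
  a = c("pmm_StartTimev4_E2_C19_1", "pmm_StartTimev4_E2_E2_C1", "delivery_C1_C12"),
  b = c("pmm_StartTo_v4_E2_C19_2", "complete_E1_C12_1", "pmm_StartTo_v4_E2_C19"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: works on a whole column at once
strip_suffix <- function(x) {
  x <- as.character(x)
  has_start <- grepl("Start", x, fixed = TRUE)        # which values mention "Start"
  x[has_start] <- sub("_[0-9]+$", "", x[has_start])   # drop trailing "_<digits>"
  x
}

out <- dfbefore
out[] <- lapply(dfbefore, strip_suffix)   # out[] keeps the data.frame class
out
```

Because grepl() and sub() are already vectorized, no element-by-element loop is needed here.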
