Match two vectors and replace in string


I have two data frames, and I want to match a vector from data frame data1 against a vector from data frame data2.

data1 <- data.frame(v1 = c("horse", "duck", "bird"), v2 = c(1,2,3))
data2 <- data.frame(v1 = c("car, horse, mouse", "duck, bird", "bird"))

If a character string from data1 is found in data2, it should be replaced by the corresponding value of v2 from data1. My current solution uses a for loop:

for (i in seq_len(nrow(data1))) data2[, 1] <- gsub(data1[i, 1], data1[i, 2], data2[, 1], fixed = TRUE)

However, is there a vectorized solution that avoids the for loop and performs better on huge datasets?

Thanks in advance!


What happens if the two data frames don't have the same number of rows?

data2 <- data.frame(v1 = c("car, horse, mouse", "duck, bird","bird", "bird"))

When I use this solution:

data2$v1 <- mapply(sub, data1$v1, data1$v2, data2$v1)

Then I get the following warning message:

1: In mapply(sub, data1$v1, data1$v2, data2$v1) :
  longer argument not a multiple of length of shorter
2: In mapply(sub, data1$v1, data1$v2, data2$v1) :
  longer argument not a multiple of length of shorter

However, the mgsub solution works perfectly! Thank you!



answered 4 years ago A5C1D2H2I1M1N2O1R2T1 #1

Most functions in the "stringi" package accept vectorized arguments, so you should be able to use stri_replace_all_fixed, like this:

library(stringi)
stri_replace_all_fixed(data2$v1, data1$v1, data1$v2)
# [1] "car, 1, mouse" "2, bird"       "3"         

To get your data.frame:

data.frame(v1 = stri_replace_all_fixed(data2$v1, data1$v1, data1$v2))
#              v1
# 1 car, 1, mouse
# 2       2, bird
# 3             3
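If data2 has more rows than data1, as in the updated question, pairing the strings and patterns by position no longer lines up. One option is the vectorize_all = FALSE argument of stri_replace_all_fixed. This is only a sketch, assuming that every pattern should be applied to every string. The as.character call and stringsAsFactors = FALSE are additions here, to keep the inputs plain character vectors:

```r
library(stringi)

data1 <- data.frame(v1 = c("horse", "duck", "bird"), v2 = c(1, 2, 3),
                    stringsAsFactors = FALSE)
data2 <- data.frame(v1 = c("car, horse, mouse", "duck, bird", "bird", "bird"),
                    stringsAsFactors = FALSE)

# vectorize_all = FALSE applies each pattern/replacement pair in turn
# to every element of data2$v1, instead of pairing them by position
result <- stri_replace_all_fixed(data2$v1, data1$v1, as.character(data1$v2),
                                 vectorize_all = FALSE)
print(result)
```

The pairs are applied in the order given, so the order of rows in data1 matters when one pattern is contained in another.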

answered 4 years ago akrun #2

This uses the updated data2, where data1 and data2 have different numbers of rows. The assumption here is that any match from data1$v1 found in data2$v1 should be replaced by the corresponding v2 value from data1. mgsub comes from the "qdap" package:

library(qdap)
mgsub(as.character(data1$v1), data1$v2, data2$v1)
#[1] "car, 1, mouse" "2, 3"          "3"             "3"    
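For comparison, the same "apply every pair to every row" idea can be sketched in base R without extra packages. gsub is already vectorized over the text, so the only iteration left is over the rows of data1, which Reduce expresses compactly. This is an illustrative alternative, not what mgsub does internally:

```r
data1 <- data.frame(v1 = c("horse", "duck", "bird"), v2 = c(1, 2, 3),
                    stringsAsFactors = FALSE)
data2 <- data.frame(v1 = c("car, horse, mouse", "duck, bird", "bird", "bird"),
                    stringsAsFactors = FALSE)

# Start from data2$v1 and apply one pattern/replacement pair per step;
# each gsub call handles the whole text vector at once
replaced <- Reduce(
  function(text, i) gsub(data1$v1[i], as.character(data1$v2[i]), text, fixed = TRUE),
  seq_len(nrow(data1)),
  data2$v1
)
print(replaced)
```

The number of gsub calls equals nrow(data1), regardless of how many rows data2 has.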

Note that mgsub also handles the case where one pattern is a substring of another and both are in the list of strings to replace. Here's an example with horse and horses:

data1 <- data.frame(v1 = c("horse", "duck", "bird", "horse", "horses"), v2 = 1:5)
data2 <- data.frame(v1 = c("car, horses, mouse", "duck, bird, horse", "bird"))

stri_replace_all_fixed(data2$v1, data1$v1, data1$v2)

## [1] "car, 1s, mouse"    "2, bird, horse"    "3"                 "car, 4s, mouse"    "duck, bird, horse"
## Warning message:
## In stri_replace_all_fixed(data2$v1, data1$v1, data1$v2) :
##   longer object length is not a multiple of shorter object length

mgsub(as.character(data1$v1), data1$v2, data2$v1)

## [1] "car, 5, mouse" "2, 3, 4"       "3"  

mgsub makes sure the longer patterns are replaced first, which makes it slower but safer. Depending on your data and needs, either solution may be useful.
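The longest-first idea can be sketched in base R as follows. The helper name replace_longest_first is hypothetical, and this is an illustration of the ordering trick under simple assumptions (fixed patterns, ties broken by original position), not a reimplementation of mgsub:

```r
data1 <- data.frame(v1 = c("horse", "duck", "bird", "horse", "horses"),
                    v2 = 1:5, stringsAsFactors = FALSE)
data2 <- data.frame(v1 = c("car, horses, mouse", "duck, bird, horse", "bird"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: replace longer patterns before shorter ones so that
# "horses" is handled before its substring "horse"
replace_longest_first <- function(text, patterns, replacements) {
  # Sort by decreasing length; ties keep their original order
  ord <- order(-nchar(patterns), seq_along(patterns))
  for (i in ord) {
    text <- gsub(patterns[i], replacements[i], text, fixed = TRUE)
  }
  text
}

result <- replace_longest_first(data2$v1, data1$v1, as.character(data1$v2))
print(result)
```

With the duplicated horse entry, this sketch uses the first matching row, so the second string becomes "2, 3, 1" rather than the "2, 3, 4" shown for mgsub above.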
