Match a string based on regex pattern matching in Scala


I wrote the following regex:

val reg = ".+([A-Z_].+).(\\d{4})_(\\d{2})_(\\d{2})_(\\d{2})\\.orc".r 

It is supposed to parse strings such as "S3//bucket//TS11_YREDED.2018_09_28_02.orc". The parse method is:

val dataExtraction: String => Map[String, String] = {
  string: String => {
    string match {
      case reg(filename, year, month, day) =>
                 Map(FILE_NAME-> filename, YEAR -> year, MONTH -> month, DAY -> day)
      case _  => Map(FILE_NAME-> filename,YEAR -> "", MONTH -> "", DAY -> "")
    }
  }
}
val YEAR = "YEAR"
val MONTH = "MONTH"
val DAY = "DAY"
val FILE_NAME = "FILE_NAME"

But it doesn't work properly: it is supposed to omit the bucket name and parse the file name and the date.

So the expected output should be: Map(FILE_NAME -> TS11_YREDED, YEAR -> 2018, MONTH -> 09, DAY -> 28). Any idea how to fix it, please?

regex scala pattern-matching

Answers

answered 6 days ago Wiktor Stribiżew #1

The leading .+ is greedy: it first matches the whole string and then backtracks only as far as it has to, so ([A-Z_].+) captures just the little that must remain for the subsequent patterns to match.
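To make the greedy behaviour concrete, here is a minimal sketch that runs the regex from the question against the sample string and looks at group 1. The object name GreedyDemo and the value captured are illustrative names, not part of the original code; findFirstMatchIn is used instead of a pattern match so that the capture can be inspected directly.

```scala
object GreedyDemo {
  val text = "S3//bucket//TS11_YREDED.2018_09_28_02.orc"

  // The regex from the question, unchanged.
  val original = ".+([A-Z_].+).(\\d{4})_(\\d{2})_(\\d{2})_(\\d{2})\\.orc".r

  // Group 1 as captured by the original pattern: the leading .+ has
  // consumed the bucket and most of the file name.
  val captured: Option[String] =
    original.findFirstMatchIn(text).map(_.group(1))

  def main(args: Array[String]): Unit =
    println(captured) // Some(ED)
}
```

Only the last two letters of the file name end up in the group, because the leading .+ gives back no more characters than the rest of the pattern needs.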

You may use

"""(?:.*/)?(.*)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc""".r

See this regex demo

Note that the dot must be escaped to match a literal dot.

Details

  • (?:.*/)? - an optional non-capturing group: any 0+ chars other than linebreak chars, as many as possible, up to the last / and including it
  • (.*) - Capturing group 1: any 0+ chars, other than linebreak chars, as many as possible
  • \. - a dot
  • (\d{4}) - Capturing group 2: four digits
  • _ - an underscore
  • (\d{2}) - Capturing group 3: two digits
  • _ - an underscore
  • (\d{2}) - Capturing group 4: two digits
  • _\d{2}\.orc - _, 2 digits, . and orc at the end of the string.
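As a rough illustration of the breakdown above, the sketch below checks that the optional (?:.*/)? prefix lets the same pattern handle input with and without a bucket path, and that a name missing the trailing two digits is rejected. The object PrefixDemo and the helper fileName are hypothetical names introduced only for this example.

```scala
object PrefixDemo {
  val reg = """(?:.*/)?(.*)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc""".r

  // Returns the captured file name, or None when the whole string
  // does not match the pattern.
  def fileName(path: String): Option[String] = path match {
    case reg(name, _, _, _) => Some(name)
    case _                  => None
  }

  def main(args: Array[String]): Unit = {
    println(fileName("S3//bucket//TS11_YREDED.2018_09_28_02.orc")) // Some(TS11_YREDED)
    println(fileName("TS11_YREDED.2018_09_28_02.orc"))             // Some(TS11_YREDED)
    println(fileName("TS11_YREDED.2018_09_28.orc"))                // None
  }
}
```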

Scala demo:

val text = "S3//bucket//TS11_YREDED.2018_09_28_02.orc"
val reg = """(?:.*/)?(.*)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc""".r

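// Map keys; constants, so declared with val rather than var
val YEAR = "YEAR"
val MONTH = "MONTH"
val DAY = "DAY"
val FILE_NAME = "FILE_NAME"

val dataExtraction: String => Map[String, String] = {
  string: String => {
    string match {
      // The pattern has four capturing groups, one per binder
      case reg(filename, year, month, day) =>
        Map(FILE_NAME -> filename, YEAR -> year, MONTH -> month, DAY -> day)
      // No match: each key is mapped to its own name as a placeholder
      case _ => Map(FILE_NAME -> FILE_NAME, YEAR -> YEAR, MONTH -> MONTH, DAY -> DAY)
    }
  }
}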

println(dataExtraction(text))
// => Map(FILE_NAME -> TS11_YREDED, YEAR -> 2018, MONTH -> 09, DAY -> 28)

Since you are not using the last capturing group, it can be omitted from the pattern.
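The number of capturing groups matters here because a Regex extractor only matches when the pattern lists exactly one binder per group. The sketch below compares a five-group and a four-group variant of the pattern against four binders; GroupCountDemo and its member names are made up for this illustration, and the bucket prefix is left out to keep it short.

```scala
object GroupCountDemo {
  val text = "TS11_YREDED.2018_09_28_02.orc"

  // Five capturing groups: the trailing two digits are captured too.
  val fiveGroups = """(.*)\.(\d{4})_(\d{2})_(\d{2})_(\d{2})\.orc""".r
  // Four capturing groups: the trailing two digits are matched, not captured.
  val fourGroups = """(.*)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc""".r

  // Four binders against five groups: falls through to the default case.
  val fiveGroupsMatch: Boolean = text match {
    case fiveGroups(_, _, _, _) => true
    case _                      => false
  }

  // Four binders against four groups: matches.
  val fourGroupsMatch: Boolean = text match {
    case fourGroups(_, _, _, _) => true
    case _                      => false
  }

  def main(args: Array[String]): Unit = {
    println(fiveGroupsMatch) // false
    println(fourGroupsMatch) // true
  }
}
```

An alternative to dropping the group is to keep it and add a fifth binder, for example an underscore, when the value is not needed.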

answered 6 days ago stack0114106 #2

Check this out:

val file_name = "TS11_YREDED.2018_09_28_02.orc"
val reg = """(.*?)\.(\d{4})_(\d{2})_(\d{2})_(\d{2})\.orc""".r
// The buffer itself is mutable, so the reference can be a val
val file_details = scala.collection.mutable.ArrayBuffer[String]()
// Collect the captured groups of every match
reg.findAllIn(file_name).matchData.foreach(m => file_details.appendAll(m.subgroups))
val names = Array("FILE_NAME", "YEAR", "MONTH", "DAY", "DUMMY")
// A Map with five entries has no guaranteed iteration order
for ((x, y) <- names.zip(file_details).toMap)
  println(x + "->" + y)

//DUMMY->02
//DAY->28
//FILE_NAME->TS11_YREDED
//MONTH->09
//YEAR->2018
