How to filter a dataframe by multiple columns?

Frankie Source

I have a problem as below:

I have two dataframes

Dataframe DF1:

ID, Name, age
1   name1  18
2   name2  20

DataFrame DF2:

ID, Name, age
1   name1 18
3   name3 19

I am attempting to filter DF2 to exclude records contained in DF1 by ID and Name so that I can get new DF2 like

ID, Name, age
3   name3  19

and then union these two dataframes to get final result:

ID, Name, age
1   name1  18
2   name2  20
3   name3  19

To do this in T-SQL, I can write a statement like

INSERT INTO DF1 
SELECT ID, Name, age FROM DF2 WHERE NOT EXISTS
(SELECT 1 FROM DF1 WHERE DF1.ID = DF2.ID AND DF1.Name = DF2.Name)

But I find that "insert" is not supported in dataframe in sparkSQL. So my questions are:

How can I filter a dataframe based on multiple column?

How can I union two dataframes together? I am appreciate for any solution.

scalaapache-sparkapache-spark-sql

Answers

answered 2 years ago zero323 #1

UNION followed by DISTINCT

Assuming that the records are unique the simplest way to achieve what you want is to take UNION and follow it by DISTINCT:

val df1 = Seq((1, "name1", 18), (2, "name2", 20)).toDF("ID", "Name", "age")
val df2 = Seq((1, "name1", 18), (3, "name3", 19)).toDF("ID", "Name", "age")

df1.unionAll(df2).distinct.show

// +---+-----+---+
// | ID| Name|age|
// +---+-----+---+
// |  1|name1| 18|
// |  2|name2| 20|
// |  3|name3| 19|
// +---+-----+---+

Characteristics:

  • has to access df1 only once
  • shuffles both df1 and df2 independent of the size

EXCEPT followed by UNION

Another approach is to use EXCEPT followed by UNION:

df1.unionAll(df2.except(df1)).show  // df2.distinct.except to drop duplicates

// +---+-----+---+
// | ID| Name|age|
// +---+-----+---+
// |  1|name1| 18|
// |  2|name2| 20|
// |  3|name3| 19|
// +---+-----+---+

Properties:

  • has to access df1 twice
  • shuffles both frames independent of the size (?)
  • can be used with three frames (df3.unionAll(df2.except(df1)))

LEFT OUTER JOIN followed by SELECT with filter followed by UNION

Finally if you want only partial match LEFT OUTER JOIN with filter followed by UNION should do the trick:

df2.as("df2")
  .join(
    df1.select("id", "name").as("df1"),
    // join on id and name
    $"df1.id" === $"df2.id" && $"df1.name" === $"df2.name",
    "leftouter")
  // This could be replaced by .na.drop(...)
  .where($"df1.id".isNull && $"df1.Name".isNull)
  .select($"df2.id", $"df2.name", $"df2.age")
  .unionAll(df1)
  .show

// ---+-----+---+
// | ID| Name|Age|
// +---+-----+---+
// |  3|name3| 19|
// |  1|name1| 18|
// |  2|name2| 20|
// +---+-----+---+

Properties:

  • has to access df1 twice
  • if one the data frames is small enough to broadcasted it may not require shuflle
  • can be used with three data frames

comments powered by Disqus