PySpark - Compare DataFrames

user10176078

I'm new to PySpark, so apologies if this is a little simple. I have found other questions that compare DataFrames, but none quite like this, so I do not consider it a duplicate. I'm trying to compare two DataFrames with a similar structure: 'name' will be unique in each, but the counts could differ.

If the counts differ, I would like to produce a DataFrame or a Python dictionary, just like below. Any ideas on how I would achieve something like this?

DF1

+-------+---------+
|name   | count_1 |
+-------+---------+
|  Alice|   1500  |
|    Bob|   1000  |
|Charlie|   150   |
| Dexter|   100   |
+-------+---------+

DF2

+-------+---------+
|name   | count_2 |
+-------+---------+
|  Alice|   1500  |
|    Bob|   200   |
|Charlie|   150   |
| Dexter|   10    |
+-------+---------+

To produce the outcome:

Mismatch

+-------+-------------+--------------+
|name   | df1_count   | df2_count    |
+-------+-------------+--------------+
|    Bob|   1000      |    200       |
| Dexter|   100       |     10       |
+-------+-------------+--------------+

Match

+-------+-------------+--------------+
|name   | df1_count   | df2_count    |
+-------+-------------+--------------+
|  Alice|   1500      |   1500       |
|Charlie|   150       |    150       |
+-------+-------------+--------------+
Tags: python, apache-spark, pyspark, apache-spark-sql, pyspark-sql

Answers

answered 2 months ago user10176078 #1

So I create a third DataFrame by joining df1 and df2, rename the count columns to match the desired output, and then filter on them to check whether they are equal or not:

Mismatch:

df3 = df1.join(df2, on='name', how='inner') \
         .withColumnRenamed('count_1', 'df1_count') \
         .withColumnRenamed('count_2', 'df2_count')
df3.filter(df3.df1_count != df3.df2_count).show()

Match:

df3 = df1.join(df2, on='name', how='inner') \
         .withColumnRenamed('count_1', 'df1_count') \
         .withColumnRenamed('count_2', 'df2_count')
df3.filter(df3.df1_count == df3.df2_count).show()

Hope this is useful to someone.

answered 2 months ago BDA #2

You can create a temp view on top of each DataFrame and write a Spark SQL query to do the join. Direct DataFrame joins and Spark SQL joins are both options worth exploring here; SQL is more intuitive, so it may be easier for newcomers.

answered 2 months ago Chandan Ray #3

Note that the snippets below are Scala, not PySpark:

MATCH

Df1.join(Df2, Df1.col("name") === Df2.col("name") && Df1.col("count_1") === Df2.col("count_2"), "inner").drop(Df2.col("name")).show

MISMATCH

Df1.join(Df2, Df1.col("name") === Df2.col("name") && Df1.col("count_1") === Df2.col("count_2"), "left_anti").as("Df3").join(Df2, Df2.col("name") === col("Df3.name"), "inner").drop(Df2.col("name")).show
