How to subtract one Scala Spark DataFrame from another (Normalise to the mean)

veryprettyboy

I have two Spark DataFrames:

df1 with 80 columns

CO01...CO80

+----+----+
|CO01|CO02|
+----+----+
|2.06|0.56|
|1.96|0.72|
|1.70|0.87|
|1.90|0.64|
+----+----+

and df2, also with 80 columns,

avg(CO01)...avg(CO80)

which holds the mean of each column of df1

+------------------+------------------+
|         avg(CO01)|         avg(CO02)|
+------------------+------------------+
|2.6185106382978716|1.0080985915492937|
+------------------+------------------+

How can I subtract df2 from df1 for the corresponding columns?

I'm looking for a solution that does not require listing all the columns.

P.S.

In pandas this could be done simply with:

df2=df1-df1.mean()
Tags: scala, pandas, apache-spark

Answers

#1 — m-bhole, answered 3 months ago

Here is what you can do:

scala> val df = spark.sparkContext.parallelize(List(
     | (2.06,0.56),
     | (1.96,0.72),
     | (1.70,0.87),
     | (1.90,0.64))).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: double, c2: double]

scala>

scala> def subMean(mean: Double) = udf[Double, Double]((value: Double) => value - mean)
subMean: (mean: Double)org.apache.spark.sql.expressions.UserDefinedFunction

scala>

scala> val result = df.columns.foldLeft(df)( (df, col) =>
     | { val avg = df.select(mean(col)).first().getAs[Double](0);
     | df.withColumn(col, subMean(avg)(df(col)))
     | })
result: org.apache.spark.sql.DataFrame = [c1: double, c2: double]

scala>

scala> result.show(10, false)
+---------------------+---------------------+
|c1                   |c2                   |
+---------------------+---------------------+
|0.15500000000000025  |-0.13749999999999996 |
|0.05500000000000016  |0.022499999999999964 |
|-0.20499999999999985 |0.1725               |
|-0.004999999999999893|-0.057499999999999996|
+---------------------+---------------------+

Hope this helps!

Please note that this will work for any number of columns, as long as all columns in the DataFrame are of numeric type.
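As a side note, the foldLeft above triggers one Spark job per column (one mean computed at a time) and relies on a UDF, which Catalyst cannot optimize through. A sketch of an alternative that computes all the means in a single pass and uses plain column expressions instead (column names c1/c2 are the sample names from above, not the asker's CO01...CO80):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, lit}

object CenterColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("center").getOrCreate()
    import spark.implicits._

    val df = Seq((2.06, 0.56), (1.96, 0.72), (1.70, 0.87), (1.90, 0.64)).toDF("c1", "c2")

    // One job computes every column mean at once.
    val means = df.select(df.columns.map(c => avg(col(c)).as(c)): _*).first()

    // Subtract each mean with a column expression -- no UDF needed,
    // so the optimizer can still see through the plan.
    val centered = df.select(df.columns.zipWithIndex.map {
      case (c, i) => (col(c) - lit(means.getDouble(i))).as(c)
    }: _*)

    centered.show(false)
    spark.stop()
  }
}
```

This should produce the same centered values as the foldLeft version, while scanning the data twice in total (once for the means, once for the subtraction) rather than once per column.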
