As I understand from the documentation, in order to use the window and watermark functions, you must...
Okay, I want to iterate over a DataFrame and operate on each row using a DataFrame variable from the outer scope, but I am...
I have a CSV file that has embedded JSON, as shown in the figure: csv snapshot, link to data, link to...
I'm using spark-2.3.0-bin-hadoop2.7 and sparkling-water-2.3.5 on Windows 10 64 bit. I've taken the...
I'm writing a simple join operation in Spark and trying to retrieve the values of the map. Why I'm...
I am passing a function to Spark. That function solves an optimization problem which takes about a...
I have a CSV file in HDFS: /hdfs/test.csv. I'd like to group the data below using Spark and Scala. I...
I have an RDD of some mutable.Map[(Int, Array[Double])] and I would like to reduce the maps by Int...
First, I have two variables at the beginning of the code. numericColumnNames = [] categoricalColumnsNames...
I have seen two functions passed as arguments to the aggregateByKey method in the Spark core API....
I am new to Spark SQL. I want to generate the following 5-second interval time series every day...
I am trying to write a flatMap function in Python in a list-comprehension style! simpleRDD =...
Let's say I have a key/value RDD like this: ('15', '1188263642') ('15', '1188263867') ('20',...
This question already has an answer here: Why is predicate pushdown...
How can we write a Hive query with a SELECT statement for the logic below? If a column value is...
I have a problem with a large object (400 MB pickled) I need to use in a UDF. The object is...
I'm doing a simple assignment in Apache Spark using Python. Let's say I have an RDD: [('python',...
I have a DataFrame in Spark (Python) with a date column. The issue is that these dates are in...
I got the exception below while printing the data of a DataFrame returned from mapPartitions. Caused by:...
I have a 6-node cluster with 8 cores and 32 GB of RAM each. I am reading a simple CSV file from Azure...
I am new to Apache Spark, Scala and Hadoop tools. I have set up a new local single-node Hadoop...
I'm working on a project which needs to generate Parquet files from a huge PostgreSQL database. The...
Overview Currently my product maintains a DAL that is separated from business logic and exposed...
Environment: AWS EMR emr-5.11.1, Zeppelin 0.7.3, Spark 2.2.1. Problem: Zeppelin pyspark...
We have a Spark process that takes around 22 min in Spark 2.2.1 on EMR 5.12.1 and took 7h (Yes,...
I am using Spark 2.3 with Java 1.8. I have an RDD of CSV records, say:...
I am inserting 21 million records into a Cassandra table using Spark. The Spark job takes around...
When I create a table using SQL in Spark, for example: sql('CREATE TABLE example SELECT a, b FROM...
I am trying to create a pyspark dataframe from data stored in an external database. I use the...
I am trying to solve a data cleaning step in a Machine Learning problem where I should group all...