Engine
Cluster
Partition

Partitioning a large file into multiple smaller pieces gives every worker node in the Spark cluster work to do. Without partitioning, only one node is utilized while the other nodes sit idle.

pyspark.sql.DataFrame.repartition is a lazy operation. Keep in mind that repartitioning your data is fairly expensive because it performs a full shuffle. Spark also has an optimized version of repartition() called coalesce() that avoids full data movement, but only if you are decreasing the number of RDD partitions.
Spark Context
