hadoop - What affects the amount of data shuffled in Spark?


For example, I'm executing queries on Spark, and in the Spark UI I can see that some queries have more shuffle than others. The shuffle metric seems to be the amount of data read locally plus the amount read between executors.

But there is one thing I don't understand: the example query below loaded 7 GB from HDFS, yet its shuffle read + shuffle write was more than 10 GB. I have seen other queries load 7 GB from HDFS and shuffle only 500 KB. Can anyone help me understand this? Is the amount of data shuffled not related to the amount of data read from HDFS?

select nation, o_year, sum(amount) as sum_profit
from
  (select n_name as nation, year(o_orderdate) as o_year,
          l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
   from orders o join
     (select l_extendedprice, l_discount, l_quantity, l_orderkey, n_name, ps_supplycost
      from part p join
        (select l_extendedprice, l_discount, l_quantity, l_partkey, l_orderkey,
                n_name, ps_supplycost
         from partsupp ps join
           (select l_suppkey, l_extendedprice, l_discount, l_quantity, l_partkey,
                   l_orderkey, n_name
            from
              (select s_suppkey, n_name
               from nation n join supplier s on n.n_nationkey = s.s_nationkey
              ) s1 join lineitem l on s1.s_suppkey = l.l_suppkey
           ) l1 on ps.ps_suppkey = l1.l_suppkey and ps.ps_partkey = l1.l_partkey
        ) l2 on p.p_name like '%green%' and p.p_partkey = l2.l_partkey
     ) l3 on o.o_orderkey = l3.l_orderkey
  ) profit
group by nation, o_year
order by nation, o_year desc;

The shuffle is Spark's mechanism for re-distributing data so that it is grouped differently across partitions. It typically involves copying data across executors and machines. From this it should be clear that the amount of shuffled data does not depend on the amount of input data; rather, it depends on the operations you perform on that input data, because those operations are what cause data to move between executors (and hence machines). Please go through http://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations to understand why shuffling is a costly process.
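As a rough illustration (only a sketch, not tested against your data): the two Spark SQL queries below read the same TPC-H tables that appear in your query, but produce very different shuffle volumes. The first is a scan plus filter, which is done map-side and shuffles almost nothing; the second joins and aggregates, which forces Spark to repartition rows by the join and grouping keys, so its shuffle volume is driven by how many rows reach each exchange, not by the size of the HDFS input. The column l_discount and the alias revenue are used here purely for illustration, and exact plans depend on your Spark version; running EXPLAIN should show Exchange operators only where a shuffle occurs.

    -- Reads the whole table from HDFS but shuffles almost nothing:
    -- the filter and projection are map-side only (a tiny shuffle may
    -- still appear for the final global aggregation of count(*)).
    select count(*)
    from lineitem
    where l_discount between 0.05 and 0.07;

    -- Reads the same input but shuffles a lot: the join and the group by
    -- both require rows with the same key to end up in the same partition,
    -- so Spark exchanges (shuffles) data between executors.
    select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue
    from lineitem l join orders o on l.l_orderkey = o.o_orderkey
    group by l_orderkey;

    -- Inspect where the shuffles happen: look for Exchange operators in the plan.
    explain
    select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue
    from lineitem l join orders o on l.l_orderkey = o.o_orderkey
    group by l_orderkey;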

Looking at the query you have pasted, it is doing a lot of join operations (I haven't looked deeply enough to understand what the final result is meant to be), and joins call for moving data across partitions. This can be handled by revisiting the query and optimizing it, or by manipulating or pre-processing the input data in a manner that leads to less movement of data (for example, colocating the data that has to be joined so that matching rows fall in the same partition). Again, that is just an example; you will have to determine what works best for your use case.
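For instance (again only a sketch, not tuned to your data): in Spark SQL you can broadcast small dimension tables so that only the small side is copied to every executor and the large lineitem table is not shuffled for that join, and you can bucket large tables on their join key so that matching rows are already colocated and the join can avoid an exchange. The hint and DDL below assume a reasonably recent Spark version (broadcast hints and bucketed data source tables); table and column names are the TPC-H ones from your query, and the bucket count 64 is an arbitrary placeholder you would tune. Depending on your version you may need to create the bucketed tables first and then insert into them instead of using CREATE TABLE ... AS SELECT.

    -- Broadcast the small tables: Spark ships them to every executor
    -- instead of shuffling the big lineitem table for these joins.
    -- (Tables below spark.sql.autoBroadcastJoinThreshold are broadcast automatically.)
    select /*+ BROADCAST(n), BROADCAST(s) */ n.n_name, count(*)
    from nation n
      join supplier s on n.n_nationkey = s.s_nationkey
      join lineitem l on s.s_suppkey = l.l_suppkey
    group by n.n_name;

    -- Pre-process / colocate: store the big tables bucketed by the join key,
    -- so rows with the same orderkey land in the same bucket and the join
    -- between them can be done without a shuffle.
    create table lineitem_bucketed
    using parquet
    clustered by (l_orderkey) into 64 buckets
    as select * from lineitem;

    create table orders_bucketed
    using parquet
    clustered by (o_orderkey) into 64 buckets
    as select * from orders;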

