hadoop - What affects amount of data shuffled in spark -
for example im executing queries on spark, , in spark ui can see queries have more shuffle , , shuffle seems amount of data read locally , read between executors.
but im not understanding 1 thing, example query below loaded 7gb hdfs suffle read + shuffled write more 10gb. saw other queries load 7gb hdfs , shuffle 500kb. im not understanding this, can please help? amount of data shuffled not related in data read hdfs?
select nation, o_year, sum(amount) sum_profit ( select n_name nation, year(o_orderdate) o_year, l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity amount orders o join (select l_extendedprice, l_discount, l_quantity, l_orderkey, n_name, ps_supplycost part p join (select l_extendedprice, l_discount, l_quantity, l_partkey, l_orderkey, n_name, ps_supplycost partsupp ps join (select l_suppkey, l_extendedprice, l_discount, l_quantity, l_partkey, l_orderkey, n_name (select s_suppkey, n_name nation n join supplier s on n.n_nationkey = s.s_nationkey ) s1 join lineitem l on s1.s_suppkey = l.l_suppkey ) l1 on ps.ps_suppkey = l1.l_suppkey , ps.ps_partkey = l1.l_partkey ) l2 on p.p_name '%green%' , p.p_partkey = l2.l_partkey ) l3 on o.o_orderkey = l3.l_orderkey )profit group nation, o_year order nation, o_year desc;
the shuffle spark’s mechanism re-distributing data it’s grouped differently across partitions. typically involves copying data across executors , machines. pretty clear here shuffled data not dependent on amount of input data. however, depends upon operations perform on input data, leads movement of data across executors( , hence machines). please go through http://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations know , understand why shuffling costly process.
looking @ query have pasted, seems doing lot of join operations (haven't looked deep understand ultimate operation doing). , calls moving data across partitions. problem can handled revisiting query , optimizing same or manipulating or pre-procesing input data in manner leads less movement of data ( ex: colocating data has joined fall in same partition). again, example , have determine use case on works best you.
Comments
Post a Comment