When groupByKey() is applied to a dataset of (K, V) pairs, the data is shuffled according to the key K into another RDD. This transformation moves a lot of unnecessary data over the network, because every value is sent to the partition that holds its key.
Spark spills the grouped data to disk when more data is shuffled onto a single executor machine than can fit in its memory.
Example:
// Create a pair RDD of (Char, Int) tuples spread across 3 partitions
val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)), 3)
// groupByKey() ships every value across the network; collect() brings the grouped pairs to the driver
val group = data.groupByKey().collect()
group.foreach(println)
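Collecting yields one (key, values) pair per distinct key. The ordering of the pairs can vary between runs, but the contents should look roughly like:

(t,CompactBuffer(8))
(p,CompactBuffer(7, 5))
(k,CompactBuffer(5, 6))
(s,CompactBuffer(3, 4))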
When reduceByKey is applied to a dataset of (K, V) pairs, pairs on the same machine that share a key are combined (using the supplied reduce function) before the data is shuffled, so far less data crosses the network.
Example:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
// Map each word to a (word, 1) pair, then sum the counts per key;
// reduceByKey combines counts locally on each partition before shuffling
val data = spark.sparkContext.parallelize(words).map(w => (w, 1)).reduceByKey(_ + _)
data.collect.foreach(println)
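For contrast, here is a minimal sketch of the same word count written with groupByKey. It produces the same (word, count) pairs, but every (word, 1) pair travels over the network before any summing happens, which is exactly why reduceByKey is preferred for aggregations like this:

val counts = spark.sparkContext.parallelize(words)
  .map(w => (w, 1))
  .groupByKey()        // shuffles every (word, 1) pair across the network
  .mapValues(_.sum)    // summing happens only after the shuffle
counts.collect.foreach(println)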