SUM function in Apache Pig

I have been learning Big Data tools, Apache Pig among them. Initially I thought that, once the data was loaded, applying SUM inside a FOREACH would return the total amount, but that is not the case. SUM is an aggregate function that operates on a bag of values, so the data has to be grouped first; otherwise Pig reports an error.
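
To see what the grouping step produces, here is a minimal sketch that reuses the load statement from the script further down. The commented-out alias bad is only an illustration of the failing attempt, not part of the final script. DESCRIBE prints the schema of a relation, which should show that the grouped relation holds all rows in a single bag.

```pig
data = LOAD 'sales.txt' USING PigStorage(',') AS (name:chararray, price:int, country:chararray);

-- Not valid: on a plain row, price is a single int, not a bag, so SUM cannot be applied
-- bad = FOREACH data GENERATE SUM(price);

-- GROUP ... ALL collects every row into one bag, named after the relation (data)
grouped = GROUP data ALL;

-- Print the schema of grouped; its second field should be a bag of (name, price, country) tuples
DESCRIBE grouped;
```

Once the rows are in a bag, data.price projects the price column out of that bag, and that is what SUM can work on.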

The script below does this:

-- This Pig script computes the total amount of all sales
-- First load the data from the sales.txt file
data = LOAD 'sales.txt' USING PigStorage(',') AS (name:chararray, price:int, country:chararray);

-- Group the data
grouped = GROUP data ALL;

-- Once grouped generate total sum of all sales
total = FOREACH grouped GENERATE SUM(data.price);

-- Print to screen
DUMP total;

Save the above code with a .pig extension, for example totalsales.pig. The test data is loaded from the sales.txt file shown below:

Alice,3000,us
Alice,2000,us
Bob,500,ca
Juan,500,mx
Hans,2000,de
Joan,1000,fr
Piero,6000,it

Then run the script in local mode:

pig -x local totalsales.pig
17/01/04 14:28:57 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
17/01/04 14:28:57 INFO util.ProcessTree: setsid exited with exit code 0
17/01/04 14:28:58 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
(15000) 
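
The same pattern applies when a total per group is wanted instead of one grand total. The following is a small sketch, assuming the same sales.txt file and schema as above; the alias names by_country and country_totals are my own and purely illustrative.

```pig
-- Load the same sales data
data = LOAD 'sales.txt' USING PigStorage(',') AS (name:chararray, price:int, country:chararray);

-- Group the rows by country instead of putting them all in one group
by_country = GROUP data BY country;

-- For each country, emit the grouping key and the sum of the prices in its bag
country_totals = FOREACH by_country GENERATE group AS country, SUM(data.price) AS total;

-- Print to screen
DUMP country_totals;
```

With the test data above, this should print one tuple per country. For example, the two us rows (3000 and 2000) add up to 5000. The order of the output tuples is not guaranteed.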

Reference:
Thomas Henson
