Wednesday, September 19, 2012

Difference between PERCENTILE_CONT and PERCENTILE_DISC

Treading on the previous post on MEDIAN , I would like to point out the Difference between PERCENTILE_CONT and  PERCENTILE_DISC( MEDIAN being special case of PERCENTILE_CONT) . 

There are subtle differences between PERCENTILE_CONT and  PERCENTILE_DISC.  When the count of record set is odd , there are no differences between them .  The calculation of MEDIAN is identical to PERCENTILE_CONT when computed with percentile value of  0.5

Let us look at the first example.


SQL> DROP
  2    TABLE t purge;

Table dropped.

SQL> CREATE
  2    TABLE t AS
  3  SELECT
  4    'TEST' item ,
  5    'E'   AS region ,
  6    level AS wk ,
  7    ROUND(dbms_random.value(1,500)) forecastqty
  8  FROM
  9    dual
 10    CONNECT BY level <= 3 ;

Table created.


SQL>     SELECT
  2        t.* ,
  3        PERCENTILE_CONT(0.5)
  4        WITHIN GROUP ( ORDER BY forecastqty)
  5        OVER (PARTITION BY ITEM , region ) AS PERCENTILE_CONT ,
  6        MEDIAN(forecastqty)
  7        OVER (PARTITION BY ITEM , region ) AS MEDIAN ,
  8        PERCENTILE_DISC(0.5)
  9        WITHIN GROUP ( ORDER BY forecastqty)
 10        OVER (PARTITION BY ITEM , region ) AS PERCENTILE_DISC
 11      FROM
 12    t ;

ITEM REGION             WK FORECASTQTY PERCENTILE_CONT     MEDIAN PERCENTILE_DISC
---- ---------- ---------- ----------- --------------- ---------- ---------------
TEST E                   2          10              41         41              41
TEST E                   1          41              41         41              41
TEST E                   3         326              41         41              41




As you can see there are no difference between the values of PERCENTILE_CONT , PERCENTILE_DISC and MEDIAN. The computation involves the ordering of the result set ( 10 , 41 , 326 ) and picking up the middle value ( 41 ) . 


Let us at look at the second example.



SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for 64-bit Windows: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production




SQL> DROP
  2    TABLE t purge;

Table dropped.

SQL> CREATE
  2    TABLE t AS
  3  SELECT
  4    'TEST' item ,
  5    'E'   AS region ,
  6    level AS wk ,
  7    ROUND(dbms_random.value(1,500)) forecastqty
  8  FROM
  9    dual
 10    CONNECT BY level <= 4 ;

Table created.

SQL>   column region format a10
SQL>   set linesize 800
SQL> select * from t order by forecastqty;

ITEM REGION             WK FORECASTQTY
---- ---------- ---------- -----------
TEST E                   3         137
TEST E                   2         190
TEST E                   1         232
TEST E                   4         400



SQL>     SELECT

  2        t.* ,
  3        PERCENTILE_CONT(0.5)
  4        WITHIN GROUP ( ORDER BY forecastqty)
  5        OVER (PARTITION BY ITEM , region ) AS PERCENTILE_CONT ,
  6        MEDIAN(forecastqty)
  7        OVER (PARTITION BY ITEM , region ) AS MEDIAN ,
  8        PERCENTILE_DISC(0.5)
  9        WITHIN GROUP ( ORDER BY forecastqty)
 10        OVER (PARTITION BY ITEM , region ) AS PERCENTILE_DISC
 11      FROM
 12    t ;



ITEM REGION             WK FORECASTQTY PERCENTILE_CONT     MEDIAN PERCENTILE_DISC
---- ---------- ---------- ----------- --------------- ---------- ---------------
TEST E                   3         137             211        211             190
TEST E                   2         190             211        211             190
TEST E                   1         232             211        211             190
TEST E                   4         400             211        211             190



There are differences between the values of PERCENTILE_CONT , PERCENTILE_DISC . 
The computation of the  PERCENTILE_CONT   and MEDIAN involves the ordering of the result set ( 137 , 190 , 232 , 400  ) , aggregates the middle values ( 190 , 232 ) and then makes the mean / average of the aggregate [ ( 190 + 232 ) / 2  ] .  


The computation of the   PERCENTILE_DISC   involves the ordering of the result set ( 137 , 190 , 232 , 400  ) ,  pickups the first of the middle values ( 190 , in this case ) .  In other words , it picks up the value where cume_dist is lowest of the middle values. 

PERCENTILE_CONT and MEDIAN is not guaranteed to return one of the value in the distribution  . This may not be acceptable for certain business process . In that case , we would use PERCENTILE_DISC.


Note : 
SQL Server 2012 introduced these functions as well.  

Comments welcome.






3 comments: