Monday, August 22, 2011

Histograms Vs Extended Statistics

Oracle 11g introduced "extended statistics "  , by which we can create statistics on combination on columns  that you usually query with .  In theory , by generating extended statistics , optimizer  will generate better plan , with the correct selectivity .


Let us look at an example .


SQL> desc course_ext_stats
 Name                                      Null?    Type
 ----                                      -------  --------
 ID                                                 NUMBER
 COURSE_YEAR                               NOT NULL NUMBER
 COURSE_TERM                               NOT NULL NUMBER
 COURSE_NO                                 NOT NULL VARCHAR2(5)




SQL> exec dbms_stats.gather_table_stats(user , 'COURSE_EXT_STATS');
PL/SQL procedure successfully completed.


SQL> select column_name , num_distinct , density , sample_size , histogram from user_tab_col_statistics where table_name ='COURSE_EXT_STATS';

COLUMN_NAME                    NUM_DISTINCT    DENSITY SAMPLE_SIZE HISTOGRAM
------------------------------ ------------ ---------- ----------- ---------------
ID                                   432879 2.3101E-06      432879 NONE
COURSE_YEAR                              16      .0625      432879 NONE
COURSE_TERM                               4        .25      432879 NONE
COURSE_NO                             82656 .000012098      432879 NONE


SQL> set autotrace on
SQL> select * from course_ext_stats where course_year = 2011 and course_term = 5 ;

.....
....
....

3568 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 185418477

--------------------------------------------------------------------------------------
| Id  | Operation         | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                  |  6764 |   118K|   390   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COURSE_EXT_STATS |  6764 |   118K|   390   (3)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("COURSE_YEAR"=2011 AND "COURSE_TERM"=5)

Note
-----
   - automatic DOP: skipped because of IO calibrate statistics are missing


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1615  consistent gets
          0  physical reads
          0  redo size
      77978  bytes sent via SQL*Net to client
       2971  bytes received via SQL*Net from client
        239  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       3568  rows processed
 

By looking at the plan , the optimizer estimated 6764 rows , in fact ,  the query fetched only 3568 rows.       


Let us create extended statistics on these columns (  course_year and course_term ) .

SQL> Select  dbms_stats.create_extended_stats(user , 'COURSE_EXT_STATS' , '(COURSE_YEAR, COURSE_TERM)')  from dual ;

DBMS_STATS.CREATE_EXTENDED_STATS(USER,'COURSE_EXT_STATS','(COURSE_YEAR,COURSE_TERM)')
----------------------------------------------------------------------------------------------------------------------
SYS_STUNAZ4NUJOSOZ98MCGXG90RF7      

While  creating extended statistics , Oracle created another column as shown above.  The optimizer gives relatively better number.

SQL>  exec dbms_stats.gather_table_stats(user , 'COURSE_EXT_STATS' , method_opt=>'for all columns size 1' );
PL/SQL procedure successfully completed.

SQL>select * from course_ext_stats where course_year = 2011 and course_term = 5 ;
.....


--------------------------------------------------------------------------------------
| Id  | Operation         | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                  |  3928 | 70704 |   390   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COURSE_EXT_STATS |  3928 | 70704 |   390   (3)| 00:00:01 |
--------------------------------------------------------------------------------------



Now , let us generate histograms up to 254 for all columns and see the output.

SQL> exec dbms_stats.delete_table_stats(user , 'COURSE_EXT_STATS');
PL/SQL procedure successfully completed.


SQL> exec dbms_stats.drop_extended_stats(user , 'COURSE_EXT_STATS' , '(COURSE_YEAR, COURSE_TERM)') ;
PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats(user , 'COURSE_EXT_STATS' , method_opt=>'for all columns size 254' );
PL/SQL procedure successfully completed.


SQL> select column_name , num_distinct , density , sample_size , histogram , num_buckets from user_tab_col_statistics where table_name ='COURSE_EXT_STATS';

COLUMN_NAME                    NUM_DISTINCT    DENSITY SAMPLE_SIZE HISTOGRAM       NUM_BUCKETS
------------------------------ ------------ ---------- ----------- --------------- -----------
ID                                   432879 2.3186E-06        5480 HEIGHT BALANCED         254
COURSE_YEAR                              16 1.1593E-06        5480 FREQUENCY                16
COURSE_TERM                               4 1.1593E-06        5480 FREQUENCY                 4
COURSE_NO                             82656 .000026582        5480 HEIGHT BALANCED         254



Execution Plan
----------------------------------------------------------
Plan hash value: 185418477

--------------------------------------------------------------------------------------
| Id  | Operation         | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                  |  3264 | 58752 |   390   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COURSE_EXT_STATS |  3264 | 58752 |   390   (3)| 00:00:01 |
--------------------------------------------------------------------------------------
 

Now , the optimizer estimated yet another number for expected rows.
So , which one  is better ?


I guess ,  the answer is "it depends" :-) 


Also , I would need to poke around a bit.