Showing posts with label Sample size. Show all posts
Showing posts with label Sample size. Show all posts

Monday, August 22, 2011

Histograms Vs Extended Statistics

Oracle 11g introduced "extended statistics "  , by which we can create statistics on combination on columns  that you usually query with .  In theory , by generating extended statistics , optimizer  will generate better plan , with the correct selectivity .


Let us look at an example .


SQL> desc course_ext_stats
 Name                                      Null?    Type
 ----                                      -------  --------
 ID                                                 NUMBER
 COURSE_YEAR                               NOT NULL NUMBER
 COURSE_TERM                               NOT NULL NUMBER
 COURSE_NO                                 NOT NULL VARCHAR2(5)




SQL> exec dbms_stats.gather_table_stats(user , 'COURSE_EXT_STATS');
PL/SQL procedure successfully completed.


SQL> select column_name , num_distinct , density , sample_size , histogram from user_tab_col_statistics where table_name ='COURSE_EXT_STATS';

COLUMN_NAME                    NUM_DISTINCT    DENSITY SAMPLE_SIZE HISTOGRAM
------------------------------ ------------ ---------- ----------- ---------------
ID                                   432879 2.3101E-06      432879 NONE
COURSE_YEAR                              16      .0625      432879 NONE
COURSE_TERM                               4        .25      432879 NONE
COURSE_NO                             82656 .000012098      432879 NONE


SQL> set autotrace on
SQL> select * from course_ext_stats where course_year = 2011 and course_term = 5 ;

.....
....
....

3568 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 185418477

--------------------------------------------------------------------------------------
| Id  | Operation         | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                  |  6764 |   118K|   390   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COURSE_EXT_STATS |  6764 |   118K|   390   (3)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("COURSE_YEAR"=2011 AND "COURSE_TERM"=5)

Note
-----
   - automatic DOP: skipped because of IO calibrate statistics are missing


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1615  consistent gets
          0  physical reads
          0  redo size
      77978  bytes sent via SQL*Net to client
       2971  bytes received via SQL*Net from client
        239  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       3568  rows processed
 

By looking at the plan , the optimizer estimated 6764 rows , in fact ,  the query fetched only 3568 rows.       


Let us create extended statistics on these columns (  course_year and course_term ) .

SQL> Select  dbms_stats.create_extended_stats(user , 'COURSE_EXT_STATS' , '(COURSE_YEAR, COURSE_TERM)')  from dual ;

DBMS_STATS.CREATE_EXTENDED_STATS(USER,'COURSE_EXT_STATS','(COURSE_YEAR,COURSE_TERM)')
----------------------------------------------------------------------------------------------------------------------
SYS_STUNAZ4NUJOSOZ98MCGXG90RF7      

While  creating extended statistics , Oracle created another column as shown above.  The optimizer gives relatively better number.

SQL>  exec dbms_stats.gather_table_stats(user , 'COURSE_EXT_STATS' , method_opt=>'for all columns size 1' );
PL/SQL procedure successfully completed.

SQL>select * from course_ext_stats where course_year = 2011 and course_term = 5 ;
.....


--------------------------------------------------------------------------------------
| Id  | Operation         | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                  |  3928 | 70704 |   390   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COURSE_EXT_STATS |  3928 | 70704 |   390   (3)| 00:00:01 |
--------------------------------------------------------------------------------------



Now , let us generate histograms up to 254 for all columns and see the output.

SQL> exec dbms_stats.delete_table_stats(user , 'COURSE_EXT_STATS');
PL/SQL procedure successfully completed.


SQL> exec dbms_stats.drop_extended_stats(user , 'COURSE_EXT_STATS' , '(COURSE_YEAR, COURSE_TERM)') ;
PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats(user , 'COURSE_EXT_STATS' , method_opt=>'for all columns size 254' );
PL/SQL procedure successfully completed.


SQL> select column_name , num_distinct , density , sample_size , histogram , num_buckets from user_tab_col_statistics where table_name ='COURSE_EXT_STATS';

COLUMN_NAME                    NUM_DISTINCT    DENSITY SAMPLE_SIZE HISTOGRAM       NUM_BUCKETS
------------------------------ ------------ ---------- ----------- --------------- -----------
ID                                   432879 2.3186E-06        5480 HEIGHT BALANCED         254
COURSE_YEAR                              16 1.1593E-06        5480 FREQUENCY                16
COURSE_TERM                               4 1.1593E-06        5480 FREQUENCY                 4
COURSE_NO                             82656 .000026582        5480 HEIGHT BALANCED         254



Execution Plan
----------------------------------------------------------
Plan hash value: 185418477

--------------------------------------------------------------------------------------
| Id  | Operation         | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                  |  3264 | 58752 |   390   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| COURSE_EXT_STATS |  3264 | 58752 |   390   (3)| 00:00:01 |
--------------------------------------------------------------------------------------
 

Now , the optimizer estimated yet another number for expected rows.
So , which one  is better ?


I guess ,  the answer is "it depends" :-) 


Also , I would need to poke around a bit.