Oracle and Linux I/O schedulers (part 1)

Following the paper on block size, I decided to write something more about the Linux I/O schedulers and their interaction with Oracle.

This paper describes a series of tests that stress Oracle with a TPC-C workload while the database runs on different Linux I/O schedulers.

The purpose of the I/O scheduler is to sort and merge the I/O requests in the I/O queues in order to increase efficiency and boost performance.

Using the /sys pseudo file system you can change and tune the I/O scheduler for a given block device.
For each scheduler a different directory tree exposes the tuning options.
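As a quick sketch, the active scheduler and its tunables can be inspected like this (the device name sdb and the helper function are illustrative, not part of any tool):

```shell
# Sketch: show the active I/O scheduler and its iosched/ tunables
# for a block device. "sdb" below is just an example device name.
show_iosched() {
  local dev="$1"
  local q="/sys/block/$dev/queue"
  if [ ! -d "$q" ]; then
    echo "no such block device: $dev"
    return 1
  fi
  # the scheduler file lists all schedulers, with the active one in brackets
  echo "scheduler: $(cat "$q/scheduler")"
  # each scheduler exposes its own tunables under iosched/
  for f in "$q"/iosched/*; do
    [ -f "$f" ] && echo "$(basename "$f") = $(cat "$f")"
  done
}

show_iosched sdb || true   # substitute your own device name
```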

There are four schedulers available at the moment.

The noop scheduler is a FIFO queue. Only I/O merging is provided. Good if your application already sorts the I/O. Its tree:

/sys/block/sdb/queue/scheduler
/sys/block/sdb/queue/max_sectors_kb
/sys/block/sdb/queue/max_hw_sectors_kb
/sys/block/sdb/queue/read_ahead_kb
/sys/block/sdb/queue/nr_requests

The deadline scheduler uses a round-robin algorithm to minimize the latency of any I/O request. It implements merging and sorting plus a deadline mechanism to avoid starvation. It prefers reads over writes. Its tree:

/sys/block/sdc/queue/iosched/fifo_batch
/sys/block/sdc/queue/nr_requests

The anticipatory scheduler tries to predict the future workload, delaying the I/O in order to merge requests and decrease the number of seeks. It implements merging and sorting plus an algorithm to minimize disk head movements. It is suggested for workstations and old hardware. Its tree:

/sys/block/sdb/queue/iosched/write_batch_expire
/sys/block/sdb/queue/iosched/read_batch_expire
/sys/block/sdb/queue/iosched/antic_expire
/sys/block/sdb/queue/iosched/write_expire
/sys/block/sdb/queue/iosched/read_expire
/sys/block/sdb/queue/iosched/est_time
/sys/block/sdb/queue/scheduler
/sys/block/sdb/queue/max_sectors_kb
/sys/block/sdb/queue/max_hw_sectors_kb
/sys/block/sdb/queue/read_ahead_kb
/sys/block/sdb/queue/nr_requests

The cfq scheduler is the default for SLES10 (and SLES9). It uses a round-robin algorithm, trying to be fair by dividing the available I/O bandwidth amongst all I/O requests. It implements merging and sorting. Its tree:

/sys/block/sdb/queue/iosched/max_depth
/sys/block/sdb/queue/iosched/slice_idle
/sys/block/sdb/queue/iosched/slice_async_rq
/sys/block/sdb/queue/iosched/slice_async
/sys/block/sdb/queue/iosched/slice_sync
/sys/block/sdb/queue/iosched/back_seek_penalty
/sys/block/sdb/queue/iosched/back_seek_max
/sys/block/sdb/queue/iosched/fifo_expire_async
/sys/block/sdb/queue/iosched/fifo_expire_sync
/sys/block/sdb/queue/iosched/queued
/sys/block/sdb/queue/iosched/quantum
/sys/block/sdb/queue/scheduler
/sys/block/sdb/queue/max_sectors_kb
/sys/block/sdb/queue/max_hw_sectors_kb
/sys/block/sdb/queue/read_ahead_kb
/sys/block/sdb/queue/nr_requests

The command:

# cat /sys/block/sdb/queue/scheduler
noop [anticipatory] deadline cfq

tells you which scheduler you are using (the active one is shown in square brackets).

On newer kernels you can change the scheduler without a reboot by simply issuing:

# echo cfq > /sys/block/sdb/queue/scheduler
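The switch takes effect immediately, so it is worth verifying it. A minimal sketch, wrapping the two commands above (the function name is mine, and sdb is again just an example):

```shell
# Sketch: switch the I/O scheduler for a device at runtime and verify.
# Requires root; "sdb" is an example device name, not a recommendation.
set_iosched() {
  local dev="$1" sched="$2"
  local f="/sys/block/$dev/queue/scheduler"
  if [ ! -w "$f" ]; then
    echo "cannot write $f (missing device or not root)"
    return 1
  fi
  echo "$sched" > "$f"
  # the active scheduler is the one shown in square brackets
  grep -q "\[$sched\]" "$f" && echo "$dev now uses $sched"
}

set_iosched sdb cfq || true
```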

The testing software:

The chosen tool is hammerora, which generates a TPC-C workload trying to “hammer” Oracle as much as possible. Definitely a good stress test.
With the latest version (1.26) I had scalability problems. The number of transactions per minute (tpm) was low, and I noticed lots of ‘read by other session’ wait events in my DB.
Investigating further, I saw that the ITEM table (used by hammerora) was growing and lots of table scans were performed on it.
I simply created an index with this DDL:

    PCTFREE 60;

And the problem disappeared.
I also increased INITRANS to 255 for every index and table, trying to improve concurrency.

The difference was 100-fold in the number of tpm.

The virtual users for the initial tests were 10.

My DB:

Oracle 10g Release 2 with the first patchset.

SQL> show sga

Total System Global Area  838860800 bytes
Fixed Size                  1263572 bytes
Variable Size              83888172 bytes
Database Buffers          746586112 bytes
Redo Buffers                7122944 bytes

SQL> show parameter sga_target

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
sga_target                           big integer 800M

SQL> show parameter pga_aggregate_target

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target                 big integer 103M

Asynchronous I/O is enabled while direct I/O is disabled (to make sure the features of the I/O scheduler are actually used).

I configured the AWR to take a snapshot every 10 minutes.

I’m going to measure the results using the reports created with AWR (similar to the old Statspack).

Disk Layout:

For the first test all the database files are on the same disk: sdb.
They are split across two ReiserFS file systems: one for the datafiles, with a 4 KB block size, and one for the redo logs, with a 512-byte block size.


Hardware:

IBM x335
2 Xeon(TM) 2.00 GHz CPUs
1.5 GB RAM
6 disks of 36 GB in three different RAID 1 arrays (/dev/sda, /dev/sdb, /dev/sdc)

Operating system:

SUSE Linux Enterprise Server 10 beta8.

I chose this version since it is going to be certified with Oracle soon, and because it is the first SUSE Enterprise release where the I/O scheduler can be changed on the fly.

This last characteristic is really important.

With a simple command like:

echo deadline > /sys/block/sdb/queue/scheduler

the scheduler is changed.

On older SUSE versions like SLES9 the I/O scheduler can be changed only at boot time with the kernel parameter elevator=[name of the scheduler], where the name can be: noop, deadline, as, cfq.

Unfortunately, with this method you have one scheduler for all the block devices of the system.
It is not possible to combine several I/O schedulers, so the tuning capabilities are limited.
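For completeness, the boot-time method means appending the parameter to the kernel line of the boot loader entry. A sketch of what that looks like in GRUB (the paths, root device and title are illustrative, not taken from my system):

```shell
# /boot/grub/menu.lst -- illustrative entry; adjust paths to your install
title SUSE Linux Enterprise Server 9
    kernel /boot/vmlinuz root=/dev/sda2 elevator=deadline
    initrd /boot/initrd
```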

Testing methodology:

With 10 virtual users a constant workload is kept on the database.
After 30 minutes the scheduler is changed. The default parameters are kept in place.

After three cycles of all I/O schedulers the AWR snapshots are used to generate reports and to compare them.
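The cycle described above can be sketched as a small script (the helper function is mine; the device name sdb and the 30-minute interval mirror the setup described in the text):

```shell
# Sketch of the test cycle: three passes over the four schedulers,
# switching every 30 minutes while the TPC-C load runs.
cycle_schedulers() {
  local dev="$1" secs="$2"
  local f="/sys/block/$dev/queue/scheduler"
  local pass sched
  for pass in 1 2 3; do
    for sched in noop anticipatory deadline cfq; do
      # switch only when the device exists and we may write to it
      [ -w "$f" ] && echo "$sched" > "$f"
      echo "pass $pass: $sched for ${secs}s"
      sleep "$secs"
    done
  done
}

# Example (1800 s = 30 min per scheduler, as in the test plan):
# cycle_schedulers sdb 1800
```

Each 30-minute run then spans three 10-minute AWR snapshots, which is what makes the per-scheduler reports comparable.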

First results:

For each scheduler you can see an AWR report by following the links: noop, anticipatory, deadline, cfq.

Compared metrics: transactions per second, log file sync %, user calls, physical reads, physical writes.

The winner is the deadline scheduler. It is interesting to see that cfq and anticipatory have the lowest number of transactions per second (around 23, against more than 70 for deadline and noop).
Probably this is due to the high ‘log file sync’ waits of cfq and anticipatory: they are the clear losers on the redo log file writes!

This is worrying, since cfq is the default scheduler of the SLES distributions (and of RedHat AS).

If you are going to implement an OLTP system, you had better test your application using different schedulers. Maybe the default is not right for you.

Deadline seems the best scheduler for this kind of workload, but it only narrowly beats noop.

It would be interesting to separate the redo logs from the datafiles on different block devices, set the deadline scheduler on the redo-log device, and then retest while switching the scheduler only on the datafile device (you can set a different scheduler for each block device).
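That per-device setup can be sketched as follows (the helper function is mine, and the mapping of sdc to redo logs and sdb to datafiles is an assumption about the layout; on your host the device names will differ):

```shell
# Sketch: a different scheduler per block device -- deadline for the
# latency-sensitive redo logs, the scheduler under test for datafiles.
# Requires root; device names are assumptions, adjust to your layout.
assign_sched() {
  local dev="$1" sched="$2"
  local f="/sys/block/$dev/queue/scheduler"
  if [ -w "$f" ]; then
    echo "$sched" > "$f" && echo "$dev -> $sched"
  else
    echo "skip $dev: $f not writable"
  fi
}

assign_sched sdc deadline || true   # redo-log device
assign_sched sdb cfq      || true   # datafile device
```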

This second test is going to be performed here.