I have recently built a Red Hat Network Satellite system on Red Hat Enterprise Linux 6.1 x86_64 with Satellite 5.4.1 and the embedded database (Oracle - not sure which version yet). Also noteworthy is the fact that this virtual machine is running on VMware vSphere 4.1.0 using the LSI Logic Parallel SCSI controller. Anyhow - I have built this type of system numerous times previously, and this particular one is running rather poorly. At the time I am writing this, I am still unsure what is causing the issue(s). The symptoms so far:
1.) processes are taking more time than I am accustomed to (I sound like a typical user now?)
2.) I/O wait on this host seems to be relatively high most of the time
Now - while I am attempting to troubleshoot, I am running a satellite-sync of one of my RHN channels. The software has all been downloaded, and at this point I believe it is simply being catalogued and inserted into the Satellite DB.
I have plenty of memory dedicated to this system, and I also assigned 2 CPUs to this host - after noticing that there were 2 to 3 blocked processes, I had hoped another proc would help. It seems to have improved response times when the network is involved.
[root@rhnsat01 ~]# free -m
total used free shared buffers cached
Mem: 3833 3715 118 0 18 2029
-/+ buffers/cache: 1667 2166
Swap: 6015 29 5986
[root@rhnsat01 ~]# vmstat 2 3
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 30188 115300 18544 2072444 0 1 349 532 396 775 9 3 36 51 0
0 2 30188 112324 18732 2075204 0 0 1118 540 339 459 5 2 39 55 0
1 2 30188 107216 19180 2079660 0 0 2446 120 447 702 6 3 49 41 0
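If I want to know whether both vCPUs are actually busy or mostly just sitting in iowait, mpstat (from the same sysstat package as iostat) breaks it down per CPU; a quick sketch:
# per-CPU utilization and iowait, 3 samples at 2-second intervals
mpstat -P ALL 2 3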
[root@rhnsat01 ~]# ps aux | grep \ D
oracle 1617 0.6 19.3 1283752 758144 ? Ds 16:19 1:39 ora_dbw0_rhnsat
oracle 1619 0.6 19.2 1282216 754064 ? Ds 16:19 1:38 ora_dbw1_rhnsat
root 2847 12.3 9.0 863964 353428 pts/0 D+ 16:38 31:16 /usr/bin/python /usr/bin/satellite-sync --channel rhel-x86_64-server-5 --email --traceback-mail=blah_blah@blah.com
root 6364 0.0 0.0 103232 832 pts/1 S+ 20:51 0:00 grep D
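As an aside, a slightly cleaner way to catch processes stuck in uninterruptible sleep (state D) than grepping ps output might be:
# list processes currently in uninterruptible sleep
ps -eo state,pid,user,comm | awk '$1 ~ /^D/'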
Now things get a little ridiculous (and a bit over my head at this point, but I thought I'd start to look anyhow...). I'm going to strace one of the database processes (this one happens to be a DB writer) and see if something looks amiss.
[root@rhnsat01 ~]# strace -p 1617
Process 1617 attached - interrupt to quit
times(NULL) = 431066880
semtimedop(425986, {{9, -1, 0}}, 1, {0, 860000000}) = -1 EAGAIN (Resource temporarily unavailable)
getrusage(RUSAGE_SELF, {ru_utime={12, 757060}, ru_stime={87, 712665}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={12, 757060}, ru_stime={87, 712665}, ...}) = 0
times(NULL) = 431066969
times(NULL) = 431066969
times(NULL) = 431066970
semtimedop(425986, {{9, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
times(NULL) = 431067270
times(NULL) = 431067270
times(NULL) = 431067270
pwrite(26, "\6\242\0\0\350\r\200\1\252\333O\0\0\0\1\6{\314\0\0\2\0\0\0,)\0\0\251\333O\0"..., 8192, 29163520) = 8192
So - I notice the 8192 and I assume that indicates a block-size, which makes me wonder about a few things.
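One quick sanity check here (hypothetical, but reusing the PID and fd number from the strace above) would be to see which file descriptor 26 actually points at - a datafile, the redo log, etc.:
# see what fd 26 of the DB writer (PID 1617) points to
ls -l /proc/1617/fd/26
# 8192 is also Oracle's default db_block_size, which can be confirmed
# from sqlplus with: show parameter db_block_size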
Leaving a few details out of this effort, I happen to know that the database has its own volume.
[root@rhnsat01 ~]# df -k
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/vg_rhnsat01-lv_root
30584700 4166684 24864404 15% /
tmpfs 1962640 0 1962640 0% /dev/shm
/dev/sda1 495844 88764 381480 19% /boot
/dev/mapper/vg_satellite-lv_varsatellite
61927420 38073216 20708476 65% /var/satellite
/dev/mapper/vg_satellite-lv_varcacherhn
10321208 1166960 8629960 12% /var/cache/rhn
/dev/mapper/vg_satellite-lv_rhnsat
16513960 7200656 8474444 46% /rhnsat
[root@rhnsat01 ~]# stat /dev/mapper/vg_satellite-lv_rhnsat
File: `/dev/mapper/vg_satellite-lv_rhnsat' -> `../dm-2'
Size: 7 Blocks: 0 IO Block: 4096 symbolic link
Device: 5h/5d Inode: 10272 Links: 1
Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2011-12-04 16:18:46.153088686 -0600
Modify: 2011-12-04 16:18:45.960093540 -0600
Change: 2011-12-04 16:18:45.960093540 -0600
Hmm... the stat output above reports an I/O block of 4096 while Oracle is writing 8192-byte blocks, so I wonder if this means there are 2 writes for each 8K database block, rather than 1. As much as this bothers me, it doesn't matter much, as mkfs will only accept 1 of 3 block sizes (1024, 2048 and 4096).
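To double-check the actual ext4 block size of that filesystem (rather than inferring it from the stat output above), something like this should show it:
# ext4 superblock info for the DB volume - look for "Block size:"
dumpe2fs -h /dev/mapper/vg_satellite-lv_rhnsat | grep -i 'block size'
# or ask the mounted filesystem directly
stat -f /rhnsat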
My database resides on an LVM volume that was created with default options, and the mkfs and mount options are defaults as well. Time to investigate these things...
I determine that I am running:
SQL> SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
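(That banner is just what sqlplus prints on disconnect; to check the version on demand, something along these lines should work:)
su - oracle
sqlplus / as sysdba
SQL> select banner from v$version;
SQL> quit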
In my research I came across a site recommending some ext4 journal and mount-option tweaks. I determined that my ext4 filesystem already has a journal, so I'll set the writeback option the author recommends:
[root@rhnsat01 tmp]# dumpe2fs /dev/mapper/vg_satellite-lv_rhnsat | head -50 > /var/tmp/dumpe2fs.20111204.1
[root@rhnsat01 tmp]# tune2fs -o journal_data_writeback /dev/mapper/vg_satellite-lv_rhnsat
[root@rhnsat01 tmp]# dumpe2fs /dev/mapper/vg_satellite-lv_rhnsat | head -50 > /var/tmp/dumpe2fs.20111204.2
[root@rhnsat01 tmp]# sdiff dumpe2fs.20111204.1 dumpe2fs.20111204.2
dumpe2fs 1.41.12 (17-May-2010) <
Filesystem volume name: lv_rhnsat Filesystem volume name: lv_rhnsat
Last mounted on: /rhnsat Last mounted on: /rhnsat
Filesystem UUID: 48d7816e-0276-4e76-ac51-432d8fe827a Filesystem UUID: 48d7816e-0276-4e76-ac51-432d8fe827a
Filesystem magic number: 0xEF53 Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic) Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode d Filesystem features: has_journal ext_attr resize_inode d
Filesystem flags: signed_directory_hash Filesystem flags: signed_directory_hash
Default mount options: (none) | Default mount options: journal_data_writeback
However, the mount options seem reasonably compelling (and those options are actually the reason I found his site in the first place). Unfortunately my database is in use and I can't simply remount the filesystem at this time.
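My tentative plan for the next maintenance window (once the database can be shut down cleanly) is to mount the DB filesystem with the writeback and noatime options; roughly something like this in /etc/fstab, though I haven't applied it yet:
# planned fstab entry for the DB volume (not yet applied)
/dev/mapper/vg_satellite-lv_rhnsat  /rhnsat  ext4  defaults,noatime,data=writeback  1 2
# then, with the database stopped:
umount /rhnsat && mount /rhnsat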
I am really hoping this will fix things. Otherwise, I may have to look at separating the different parts of the database (redo, data, index - at least I can assume that this database has all of those components, like a typical Oracle database).
One part, in particular, that is really throwing me is why there is very little I/O, yet the system appears to be waiting on I/O.
[root@rhnsat01 tmp]# iostat 2 3
Linux 2.6.32-131.21.1.el6.x86_64 (rhnsat01.ncell.com) 12/04/2011 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
7.71 0.00 3.06 55.02 0.00 34.22
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdb 75.67 1028.61 1718.27 23746366 39667848
sda 2.24 79.21 19.87 1828622 458690
dm-0 3.55 78.05 16.17 1801938 373184
dm-1 0.49 0.19 3.70 4432 85488
dm-2 261.66 170.33 1671.87 3932154 38596584
dm-3 7.13 51.21 44.06 1182154 1017272
dm-4 8.06 807.03 2.34 18630994 54000
avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 0.25 72.32 0.00 26.93
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdb 12.00 0.00 112.00 0 224
sda 0.00 0.00 0.00 0 0
dm-0 0.00 0.00 0.00 0 0
dm-1 0.00 0.00 0.00 0 0
dm-2 23.00 0.00 112.00 0 224
dm-3 0.00 0.00 0.00 0 0
dm-4 0.00 0.00 0.00 0 0
avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 0.50 63.41 0.00 35.59
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdb 18.50 0.00 216.00 0 432
sda 0.00 0.00 0.00 0 0
dm-0 0.00 0.00 0.00 0 0
dm-1 0.00 0.00 0.00 0 0
dm-2 41.50 0.00 224.00 0 448
dm-3 0.00 0.00 0.00 0 0
dm-4 0.00 0.00 0.00 0 0
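To dig into why the device looks busy even though the throughput is tiny, the extended iostat statistics (await, svctm, %util) would probably tell me more than the summary numbers above; a sketch:
# per-request latency and utilization for the Satellite disk, 2-second samples
iostat -dxk sdb 2 5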
You can reverse the journal update with the following:
# Delete has_journal option
tune2fs -O ^has_journal /dev/sda10
# Required fsck
e2fsck -f /dev/sda10
# Check fs options
dumpe2fs /dev/sda10 |m
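Note that those commands remove the journal entirely. If I only wanted to back out the default mount option I set with tune2fs earlier, my understanding is that the same flag with a leading caret clears it:
# clear the journal_data_writeback default mount option
tune2fs -o ^journal_data_writeback /dev/mapper/vg_satellite-lv_rhnsat
# verify
dumpe2fs -h /dev/mapper/vg_satellite-lv_rhnsat | grep -i 'default mount options'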
I wonder if adding another processor to the VM would possibly help? The processes sitting in D state are still the Oracle writers:
-bash-4.1$ ps aux | grep \ D
oracle 1617 0.5 19.9 1283752 784684 ? Ds 16:19 2:17 ora_dbw0_rhnsat
oracle 1619 0.5 19.9 1282216 781256 ? Ds 16:19 2:16 ora_dbw1_rhnsat
oracle 1621 0.8 0.7 1295304 29636 ? Ds 16:19 3:37 ora_lgwr_rhnsat
UPDATE: The change to the filesystem journal seems to have helped. I still experience some I/O wait, but the CPU utilization and the I/O numbers have improved tremendously (in comparison to how they were before, of course).
[jradtke@rhnsat01 ~]$ iostat 2 3
Linux 2.6.32-131.21.1.el6.x86_64 (rhnsat01.ncell.com) 12/05/2011 _x86_64_(2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
5.06 0.00 1.09 11.78 0.00 82.07
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 2.55 131.65 12.65 691254 66394
sdb 20.00 337.59 197.35 1772542 1036224
dm-0 3.53 126.94 12.64 666522 66376
dm-1 0.06 0.50 0.00 2616 0
dm-2 16.71 216.64 101.71 1137474 534024
dm-3 12.57 14.43 95.64 75778 502176
dm-4 11.37 106.44 0.00 558850 24
avg-cpu: %user %nice %system %iowait %steal %idle
3.52 0.00 2.76 45.48 0.00 48.24
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0.00 0.00 0.00 0 0
sdb 132.50 33188.00 72.00 66376 144
dm-0 0.00 0.00 0.00 0 0
dm-1 0.00 0.00 0.00 0 0
dm-2 11.00 0.00 88.00 0 176
dm-3 0.00 0.00 0.00 0 0
dm-4 131.00 33188.00 0.00 66376 0
avg-cpu: %user %nice %system %iowait %steal %idle
7.91 0.00 2.04 54.85 0.00 35.20
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0.00 0.00 0.00 0 0
sdb 131.50 29004.00 292.00 58008 584
dm-0 0.00 0.00 0.00 0 0
dm-1 0.00 0.00 0.00 0 0
dm-2 24.00 0.00 192.00 0 384
dm-3 1.50 0.00 12.00 0 24
dm-4 136.50 29004.00 72.00 58008 144
/dev/sdb is the device that hosts my entire RHN Satellite environment (packages, cache and database).
[root@rhnsat01 ~]# pvdisplay -m /dev/sdb
--- Physical volume ---
PV Name /dev/sdb
VG Name vg_satellite
PV Size 100.00 GiB / not usable 4.00 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 25599
Free PE 3583
Allocated PE 22016
PV UUID Q9qv0z-12gu-5aNN-Ae8L-Ua5T-oL1e-s0uKgH
--- Physical Segments ---
Physical extent 0 to 4095:
Logical volume /dev/vg_satellite/lv_rhnsat
Logical extents 0 to 4095
Physical extent 4096 to 6655:
Logical volume /dev/vg_satellite/lv_varcacherhn
Logical extents 0 to 2559
Physical extent 6656 to 22015:
Logical volume /dev/vg_satellite/lv_varsatellite
Logical extents 0 to 15359
Physical extent 22016 to 25598:
FREE
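For reference, the dm-N names in the iostat output map back to these logical volumes; the mapping can be confirmed with something like:
# device-mapper names and their dm-N nodes
ls -l /dev/mapper/
dmsetup ls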