SVM - Submirrors - One in Needs maintenance & another in Last Erred
Main Point to Note:
Always replace components in the “Maintenance” state first, followed by those in the “Last Erred” state. After a component is replaced and resynchronized, use the metastat command to verify its state. Then, validate the data.
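As a quick illustration of that ordering, here is a minimal sketch using the device names from the example later in this post (d3 is the mirror, d13 is the submirror in the "Maintenance" state, c0t1d0s3 is the "Last Erred" component; adapt these to your own configuration):
# metastat d3                  <- confirm which submirror is in which state
# metasync d3                  <- resync the submirror in the "Maintenance" state first
# metastat d3                  <- wait until that submirror shows Okay
# metareplace -e d3 c0t1d0s3   <- only then re-enable the "Last Erred" component
# metastat d3                  <- both submirrors should now show Okay
Finish by validating the data on the mirror, for example with fsck on an unmounted file system or an application-level consistency check.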
Theory from Oracle:
When a component in a RAID-1 or RAID-5 volume experiences errors, Solaris Volume Manager puts the component in the “Maintenance” state. No further reads or writes are performed to a component in the “Maintenance” state.
Sometimes a component goes into a “Last Erred” state. For a RAID-1 volume, this usually occurs with a one-sided mirror. The volume experiences errors. However, there are no redundant components to read from. For a RAID-5 volume this occurs after one component goes into “Maintenance” state, and another component fails. The second component to fail goes into the “Last Erred” state.
When either a RAID-1 volume or a RAID-5 volume has a component in the “Last Erred” state, I/O is still attempted to the component marked “Last Erred.” This I/O attempt occurs because a “Last Erred” component contains the last good copy of data from Solaris Volume Manager's point of view. With a component in the “Last Erred” state, the volume behaves like a normal device (disk) and returns I/O errors to an application. Usually, at this point, some data has been lost.
Subsequent errors on other components in the same volume are handled differently, depending on the type of volume.
- RAID-1 Volume
- A RAID-1 volume might be able to tolerate many components in the “Maintenance” state and still be read from and written to. If components are in the “Maintenance” state, no data has been lost. You can safely replace or enable the components in any order. If a component is in the “Last Erred” state, you cannot replace it until you first replace the components in the “Maintenance” state. Replacing or enabling a component in the “Last Erred” state usually means that some data has been lost. Be sure to validate the data on the mirror after you repair it.
- RAID-5 Volume
- A RAID-5 volume can tolerate a single component in the “Maintenance” state. You can safely replace a single component in the “Maintenance” state without losing data. If an error on another component occurs, it is put into the “Last Erred” state. At this point, the RAID-5 volume is a read-only device. You need to perform some type of error recovery so that the state of the RAID-5 volume is stable and the possibility of data loss is reduced. If a RAID-5 volume reaches a “Last Erred” state, there is a good chance it has lost data. Be sure to validate the data on the RAID-5 volume after you repair it.
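To see at a glance which components need attention before deciding on the order of repair, a minimal sketch (the patterns assume the state strings printed by metastat on Solaris 10, as in the output further below):
# metastat | egrep -i "needs maintenance|last erred"
Components matching the first pattern can be repaired safely in any order; any component matching "Last Erred" should only be touched after the "Maintenance" ones are fixed.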
Actual Issue
- After a disk replacement, one of the mirror devices ended up with both of its submirrors needing attention.
- Point to note: one submirror is in "Needs maintenance" and the other is in "Last Erred".
- # metastat d3
d3: Mirror
Submirror 0: d13
State: Needs maintenance
Submirror 1: d23
State: Needs maintenance
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 10247232 blocks
d13: Submirror of d3
State: Needs maintenance
Invoke: metasync d3
Size: 10247232 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t0d0s3 0 No Okay
d23: Submirror of d3
State: Needs maintenance
Invoke: after replacing "Maintenance" components:
metareplace d3 c0t1d0s3 <new device>
Size: 10247232 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t1d0s3 0 No Last Erred
- Try:
- # metasync d3
- Submirror d13 tries to resync from d23. If the resync is successful, no further troubleshooting is required: check the state of both submirrors with the metastat command and you are good.
- >> If the above step is not successful:
- The issue is not necessarily with the submirror itself; it may be with the disk that backs submirror d23. d13 is simply trying to resync from d23, so the main culprit here is d23, which needs attention.
- >> Check for read errors in /var/adm/messages on the disk c0t1d0
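A rough way to pull the relevant entries from the log (the keywords are illustrative; note that the kernel often logs the disk by its sd driver instance and device path rather than the c0t1d0 name, so you may first need to map the name with ls -l /dev/dsk/c0t1d0s0 or iostat -En):
# grep c0t1d0 /var/adm/messages | egrep -i "error|retryable|fatal"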
- >> Check for errors in the iostat -En output:
- # iostat -En
c0t1d0 Soft Errors: 14 Hard Errors: 6850 Transport Errors: 7189
Vendor: SEAGATE Product: ST373207LSUN72G Revision: 045A Serial No: 3432BEKC
Size: 73.40GB <73400057856 bytes>
Media Error: 5707 Device Not Ready: 0 No Device: 1142 Recoverable: 0
Illegal Request: 14 Predictive Failure Analysis: 736
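If more than one disk is suspect, a rough one-liner to flag every disk reporting hard or transport errors (the awk field positions assume the iostat -En layout shown above and may differ on other releases):
# iostat -En | awk '/Hard Errors/ && ($7 > 0 || $10 > 0) {print $1, "hard errors:", $7, "transport errors:", $10}'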
- Check whether the steps below resolve the issue.
- # prtvtoc /dev/rdsk/c0t1d0s3
* /dev/rdsk/c0t1d0s3 partition map
*
* Dimensions:
* 512 bytes/sector
* 424 sectors/track
* 24 tracks/cylinder
* 10176 sectors/cylinder
* 14089 cylinders
* 14087 accessible cylinders
*
* Flags:
* 1: unmountable
* 10: read-only
*
* First Sector Last
* Partition Tag Flags Sector Count Sector Mount Directory
0 2 00 0 4100928 4100927
1 3 01 4100928 4100928 8201855
2 5 00 0 143349312 143349311
3 7 00 8201856 10247232 18449087
4 0 00 18449088 10247232 28696319
5 4 00 28696320 20484288 49180607
6 0 00 49180608 94148352 143328959
7 0 00 143328960 20352 143349311
>> Slice 3 runs from sector 8201856 to sector 18449087; the defective blocks reported by the surface analysis below (around block 14559059) fall inside this range, i.e. on the slice backing submirror d23.
>> format > analyze > read
analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? y
pass 0
Medium error during read: block 14559059 (0xde2753) (1430/17/171)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559059 (1430/17/171)...ok.
Medium error during read: block 14559060 (0xde2754) (1430/17/172)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559060 (1430/17/172)...ok.
Medium error during read: block 14559061 (0xde2755) (1430/17/173)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559061 (1430/17/173)...ok.
Medium error during read: block 14559062 (0xde2756) (1430/17/174)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559062 (1430/17/174)...ok.
Medium error during read: block 14559063 (0xde2757) (1430/17/175)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559063 (1430/17/175)...ok.
Medium error during read: block 14559064 (0xde2758) (1430/17/176)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559064 (1430/17/176)...ok.
Medium error during read: block 14559065 (0xde2759) (1430/17/177)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559065 (1430/17/177)...ok.
Medium error during read: block 14559066 (0xde275a) (1430/17/178)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559066 (1430/17/178)...ok.
Medium error during read: block 14559067 (0xde275b) (1430/17/179)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559067 (1430/17/179)...ok.
Medium error during read: block 14559068 (0xde275c) (1430/17/180)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559068 (1430/17/180)...ok.
Medium error during read: block 14559069 (0xde275d) (1430/17/181)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559069 (1430/17/181)...ok.
Medium error during read: block 14559070 (0xde275e) (1430/17/182)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559070 (1430/17/182)...ok.
Medium error during read: block 14559071 (0xde275f) (1430/17/183)
ASC: 0x11 ASCQ: 0x0
Repairing hard error on 14559071 (1430/17/183)...ok.
14086/23/304
pass 1
14086/23/304
Total of 13 defective blocks repaired.
- REMEMBER THE FIRST INSTRUCTION: SYNC THE DEVICE IN THE "MAINTENANCE" STATE FIRST, NOT THE "LAST ERRED" ONE.
- # metasync d3
- LET THE WHOLE RESYNC OF D13 COMPLETE. DO NOT INTERRUPT IT (a quick way to watch the progress is sketched after these steps).
- CHECK:
- # metastat d3 (verify that submirror d13 is now Okay)
- Now re-enable and resync the "Last Erred" component:
- #metareplace -e d3 c0t1d0s3
- # metastat d3
- BOTH SUBMIRRORS OKAY? YOU ARE GOOD. LEAVE YOUR COMMENTS BELOW.
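One last tip: while either resync above is running, you can watch its progress with something like the loop below (metastat prints a "Resync in progress" percentage on the submirror while the resync runs; the 60-second interval is arbitrary, and you stop the loop with CTRL-C once the line disappears):
# while true; do metastat d3 | grep -i "resync in progress"; sleep 60; done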
- CHECK THE OTHER REFERENCE BLOGS:
- http://1shiftg.blogspot.com.au/2012/08/solaris-10-svm-mirrors-maintenence-last.html
- http://tad1982.blogspot.com.au/2011/05/both-submirrors-in-needs-maintenance.html