HP EVA4400 Storage: Handling a Failed Hard Disk Replacement

Failure description

The customer reported a hard disk failure on an HP EVA4400 storage array. The fault indicator on the enclosure pointed to the disk in slot2.

Before replacing the faulty disk, the engineers logged in to the EVA management console. The console flagged controller 2 as needing attention (Figure 1) and marked the default disk group with an exclamation point (Figure 2). The error log showed that the link on port DP1B of controller 2 had been lost.

During the replacement, removing the failed hard disk completed successfully, but adding the new hard disk to the disk group failed with the error shown in Figure 3:

图1.png

Fig.1

图2.png

Fig.2

图3.png

Fig.3

Error message shown after the page refreshes: "Operation failed! The target device is not in the right condition to perform the operation"

Failure Analysis

The error message on the EVA4400 console indicates that the array is not in a state that allows a disk to be added to the disk group, so the data leveling that would follow cannot run. The storage system blocks adding the replacement disk to the group because its data leveling mechanism has triggered the protection mechanism for data consistency and data security. If the new disk were manually added back to the original disk group while the controller 2 DP1B link is lost, the leveling operation would likely fail and could leave the data on the array inconsistent.

Therefore, when several faults occur in a storage system at the same time, they must be analyzed together: how the faults are related, whether they can spread, their scope of impact, their risk level, and whether they can lead to inconsistent business data, all in light of the storage protection mechanism and the specific failure situation.

In this case, one hard disk has failed and, at the same time, one of the paths between the controller and the I/O module has been lost. The redundancy (protection) mode of the storage is either double or single (Note: in theory, double mode reserves the space of 4 disks to cope with failed disks, and single mode reserves the space of 2 disks). Under this fault protection mechanism, the storage system's data protection policy determines that the failed disk can be ejected normally. The array itself moves the faulty disk to the ungrouped disks group, and a disk in that group can be pulled out without affecting the use of the storage or data security.

In addition, the array's hard disk replacement mechanism is MANUAL, meaning that the physical replacement of a disk is carried out by hand. The storage system was designed with an allowance for the time such a swap takes. On the HP EVA4400 this allowance is called the disk replacement delay, and its default threshold is 1 minute. When replacing a failed hard disk, a gap that is too long (more than 1 minute) or too short (less than 15 seconds) may cause the storage to fail to recognize the new disk and to fail to add it to the group.

With the disk replacement delay in mind, the engineers removed and reinserted hard disks in two different slots as a test. This made it clear that the failure was caused neither by the interval between removing and inserting the disk nor by the disk enclosure slots. Instead, another fault with a higher risk level had to be resolved first.
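
The timing window described above can be written down as a small check. This is only a sketch under the figures given in this article (15 seconds minimum, 1 minute default disk replacement delay). The function and constant names are hypothetical and are not part of any HP tool.

```python
# Illustrative only: names are hypothetical, values come from the text above.
MIN_GAP_SECONDS = 15  # shorter gaps are "too short" according to the article
MAX_GAP_SECONDS = 60  # default disk replacement delay threshold (1 minute)


def replacement_gap_ok(gap_seconds: float) -> bool:
    """Return True if the time between pulling the old disk and inserting
    the new one falls inside the window the article describes."""
    return MIN_GAP_SECONDS <= gap_seconds <= MAX_GAP_SECONDS


if __name__ == "__main__":
    for gap in (10, 30, 90):
        status = "ok" if replacement_gap_ok(gap) else "outside window"
        print(f"{gap}s: {status}")
```

If the configured disk replacement delay differs from the default, MAX_GAP_SECONDS would need to be changed to match.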

Troubleshooting

Analysis confirmed the cause of the failure: the loss of the controller's DP1B link triggered the storage system's data protection policy, which blocks operator actions that could do harm.

The guiding principle for this fault is that the hard disk must not be replaced first. The link has to be restored first. Once the links are back to normal and the storage's protection policy judges that the data and the storage are redundantly protected, the hard disk can be replaced normally.

As shown in Figure 4 and Figure 5, the problem lies in the link between the I/O B module of disk enclosure DE2 and port DP1B of controller 2. Figure 4 shows the port as Failed, but that alone does not tell us whether the fault is in the DP1B port of controller 2 or in the I/O B module of DE2. The fault should be analyzed from three angles:

a. Controller 2 failure (the port is integrated into the controller)
b. DE2 disk enclosure I/O B module failure
c. A loose cable between controller 2 and the I/O B module of the DE2 disk enclosure (this can be checked in advance)

图4.png

Fig. 4: Screenshot of controller DP1B status

图5.png

Fig.5: Screenshot of disk enclosure I/O module status

After analysis, the points of failure are ranked by risk as follows:

DE2 disk enclosure I/O B module failure = controller failure > disk failure

In this case the DP1B port link of controller 2 is lost while the DP1A port link is normal. Logically, there are three possible causes of the alarm on controller 2:

a. Failure of the DP1B port on controller 2 itself;
b. Failure of the I/O B module of disk enclosure DE2, to which controller 2 is connected;
c. Failure of the data cable between the DP1B port of controller 2 and the I/O B IN port of disk enclosure DE2.

The DP1B port indicator on controller 2 is off. However, the DP1B port is integrated into the controller's main board rather than being a separately replaceable module, and DP1A on the same controller shows as normal, so controller 2 itself is unlikely to be damaged. Damage to controller 2 is therefore less likely than damage to the I/O module. Damage to the cable is the least likely of the three, but the cable can be checked directly by hand, so it is checked first. The order of checks and spare-part replacement in this case is therefore as follows:

1. Check the cable at the DP1B port of controller 2 and at the IN port of the DE2 I/O B module (direct manual check)
2. Replace the DE2 disk enclosure I/O B module
3. Replace controller 2 (Note: if replacing the I/O module clears the fault, skip this step)
4. Replace the DE2 slot2 disk
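
The ordering above follows one rule: work through the link-recovery steps one at a time, stop as soon as the link is back, and only then replace the disk. The sketch below illustrates that rule only. The `link_restored_after` callback is a hypothetical stand-in for an engineer checking the DP1B status in the console after each step, and nothing here talks to real hardware.

```python
from typing import Callable, List

# Link-recovery steps in the order given above.
LINK_STEPS = [
    "check the cable between controller 2 DP1B and the DE2 I/O B IN port",
    "replace the DE2 I/O B module",
    "replace controller 2",
]
DISK_STEP = "replace the DE2 slot2 disk"


def plan_actions(link_restored_after: Callable[[str], bool]) -> List[str]:
    """Return the steps to carry out, in order.

    link_restored_after is a hypothetical callback: given the step just
    performed, it reports whether the DP1B link now shows as normal.
    """
    performed: List[str] = []
    for step in LINK_STEPS:
        performed.append(step)
        if link_restored_after(step):
            # The link is healthy again, so the disk can now be replaced.
            performed.append(DISK_STEP)
            return performed
    # The link never recovered: the disk replacement is not attempted.
    return performed


if __name__ == "__main__":
    # Example: the link comes back after the I/O module is replaced.
    for action in plan_actions(lambda step: step == LINK_STEPS[1]):
        print(action)
```

In the example run, the controller replacement is skipped because the link recovers after the I/O module swap, which matches the note on step 3.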

01 Replacing the Disk Enclosure I/O Module (Disk Enclosure 2 I/O B)

1. Physically disconnect the cables attached to the disk enclosure 2 I/O B module, and record or label them;
2. Physically pull out the DE2 I/O B module;
3. Wait 15 seconds, then insert the new I/O module part of the way;
4. Reconnect the cable between the controller and the DE2 disk enclosure I/O B module;
5. Push the I/O module fully into the I/O B slot of the DE2 disk enclosure;
6. Wait for the storage controller to recognize the I/O module;
7. Wait 1 minute. If the I/O B module is still not recognized properly, restart controller 2.

Restarting the controller:

In the web interface, click Power down or restart system in the navigation menu. The right-hand side of the window refreshes and shows the page in Figure 6:

图6.png

Fig.6

On that page, find the Restart a controller section. In the drop-down list next to the restart button, select controller 2 and click restart. Controller 2 then goes through its reboot process. Wait 1 minute, refresh the EVA console page, and check whether the overall status of the storage is back to normal and whether the Failed flag on the DP1B port of controller 2 has cleared. If it has not, go on to replace the controller or to restart the management module.

02 Replacing the Controller

If the DP1B link loss on the controller persists after the management module and the controller have been restarted, controller 2 needs to be replaced:

1. Disconnect and label the cables connected to controller 2;
2. Release the physical latch on controller 2 and pull the controller out (the cache module needs to be removed from it);
3. Insert the new controller (fit it with the cache module taken from the faulty controller);
4. Reconnect the cables between controller 2 and the I/O B IN/OUT ports of the DE2 disk enclosure in their original positions;
5. Wait for the power-on self-test of controller 2 to complete;
6. Refresh the EVA console and check whether the Failed status of controller 2 has cleared.

Note: After the controller has been replaced, clicking the Launch Command View EVA button in the left-hand navigation menu may fail to open the console. In that case, restarting the management module resolves it.

On the Restart the Management Module page (Figure 7), click the restart button, wait about 1 minute, refresh the current login page, and log in again.

图7.jpg

Fig.7

03 Replacing the Hard Disk

1. Locate the physically failed hard disk;
2. Click the disk under ungrouped disks;
3. Verify that the physical location (slot2) and the logical location (slot2) of the failed disk are the same, as in Figure 8;

图8.png

Fig.8

4. Click the hard disk shown in the slot2 position, then click the REMOVE button to remove the disk in slot2;
5. Physically pull the faulty hard disk out of DE2 slot2;
6. Wait 15 seconds, insert the new hard disk, and wait for the storage to recognize it;
7. The newly inserted disk appears under ungrouped disks again. Click the disk and add it back to the original disk group (Note: this array currently has only the default disk group. If the disk array has more than one disk group, make sure the disk goes back to the group it came from).

The operation is shown in the following figures:

图9.png

Fig.9

图10.png

Fig.10

图11.png

Fig.11

图12.png

Fig.12

At this point the hard disk replacement is complete, data leveling is in progress, and the troubleshooting is finished.

Summary of experience

1. In this case, a fault at the IN port of I/O module B caused the replacement hard disk to fail to be recognized. The cause of the failure needs to be analyzed as a whole and the handling process worked out first. Spare parts should not simply be swapped mechanically.

2. When repairing a hard disk failure on a storage system, weigh all of the fault symptoms together and determine the nature of the fault before acting. Do not rush to repair a single point of failure.

For more information, please visit Antute's official website: zcq.amuralha.net
