結論: 上天保佑

這個發生在有 RAID的情況下 (目前為止我看到的)

從以下的文章我整理出

可以用 fsck /dev/sdbX 來檢查硬碟磁區是否有壞軌 (損壞的檔案)

(以下未親自測試過)

如果未顯示任何壞軌,就試 fsck -b XXX /dev/sdbX

而 XXX 就是從 dumpe2fs -h /dev/sdbX 的 Blocks per group 得到   32768 這個數值

且作者說明如果硬碟不能掛載、讀取的話,可以先把journal 刪除 (我也不知道為何要刪除,刪除的話不就無法恢復資料  ??)

查看硬碟有沒有has_journal ,一樣用 dumpe2fs ,如果有的話在 Filesystem features 可以看到 has_journal

刪除journal

tune2fs -f -O ^has_journal /dev/sdbX

以下原文引用

Quote: http://chowroc.blogspot.com/2007/09/trouble-shooting-for-ext3fs-with-disk.html

今天放数据库的主机不能正常运行了,因为以前出过文件系统错误,当时我把数据磁盘 umount,再做 fsck.ext3 /dev/sdb1 后就直接从 journal 恢复了。但今天却不行,当运行 fsck.ext3 /dev/sdb1 的时候,出现
(Attempt to read block from filesystem resulted in short read).  Ignore error?这样的错误。
于是检查一下系统日志 /var/log/message
Sep 23 04:07:38 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 23 04:09:31 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 23 04:28:51 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 23 04:31:38 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 23 04:34:01 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 23 04:37:40 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has aborted……Sep 23 12:07:13 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 23 12:07:36 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 23 12:40:05 db kernel: __journal_remove_journal_head: freeing b_frozen_dataSep 23 12:40:05 db kernel: __journal_remove_journal_head: freeing b_committed_dataSep 23 12:40:05 db kernel: __journal_remove_journal_head: freeing b_frozen_dataSep 23 12:40:59 db kernel: kjournald starting.  Commit interval 5 secondsSep 23 12:40:59 db kernel: EXT3 FS on sdb1, internal journalSep 23 12:40:59 db kernel: EXT3-fs: mounted filesystem with ordered data mode.Sep 27 02:34:00 db kernel: (scsi1:A:1:0): Unexpected busfree in Data-in phaseSep 27 02:34:00 db kernel: SEQADDR == 0x82Sep 27 02:34:00 db kernel: SCSI error : <1 0 1 0> return code = 0x10000Sep 27 02:34:00 db kernel: end_request: I/O error, dev sdb, sector 272737767Sep 27 02:34:01 db kernel: SCSI error : <1 0 1 0> return code = 0x10000Sep 27 02:34:01 db kernel: end_request: I/O error, dev sdb, sector 272737775Sep 27 02:34:01 db kernel: SCSI error : <1 0 1 0> return code = 0x10000Sep 27 02:34:01 db kernel: end_request: I/O error, dev sdb, sector 272737783Sep 27 02:34:01 db kernel: SCSI error : <1 0 1 0> return code = 0x10000Sep 27 02:34:01 db kernel: end_request: I/O error, dev sdb, sector 272737791Sep 27 02:34:01 db kernel: SCSI error : <1 0 1 0> return code = 0x10000Sep 27 02:34:01 db kernel: end_request: I/O error, dev sdb, sector 272737799Sep 27 02:34:02 db kernel: SCSI error : <1 0 1 0> return code = 0x10000……Sep 27 02:34:14 db kernel: SCSI error : <1 0 1 0> return code = 0x8000002Sep 27 02:34:14 db kernel: Info fld=0x0, Current sdb: sense key Aborted CommandSep 27 02:34:14 db kernel: end_request: I/O error, dev sdb, sector 272737767Sep 27 02:34:14 db kernel: (scsi1:A:1): 160.000MB/s transfers (80.000MHz DT, offset 127, 16bit)Sep 27 02:34:15 db kernel: (scsi1:A:1:0): Unexpected busfree in Data-in phaseSep 27 02:34:15 db kernel: SEQADDR == 0x56Sep 27 02:34:15 db kernel: SCSI error : <1 0 1 0> return code = 0x10000Sep 27 02:34:15 db kernel: end_request: I/O error, dev sdb, sector 272872423Sep 27 02:34:16 db kernel: SCSI error : <1 0 1 0> return code = 0x10000Sep 27 02:34:16 db kernel: end_request: I/O error, dev sdb, sector 272872431……Sep 27 02:34:30 db kernel: SCSI error : <1 0 1 0> return code = 0x8000002Sep 27 02:34:30 db kernel: Info fld=0x0, Current sdb: sense key Aborted CommandSep 27 02:34:30 db kernel: end_request: I/O error, dev sdb, sector 272872639Sep 27 02:34:30 db kernel: (scsi1:A:1): 160.000MB/s transfers (80.000MHz DT, offset 127, 16bit)Sep 28 02:35:10 db kernel: (scsi1:A:1:0): Unexpected busfree in Data-in phaseSep 28 02:35:10 db kernel: SEQADDR == 0x53Sep 28 02:35:10 db kernel: scsi1:A:15: parity error detected while idle. SEQADDR(0x1) SCSIRATE(0xc2)Sep 28 02:35:10 db kernel:      No terminal CRC packet receviedSep 28 02:35:10 db kernel: SCSI error : <1 0 1 0> return code = 0x10000Sep 28 02:35:10 db kernel: end_request: I/O error, dev sdb, sector 270637663……Sep 28 02:35:27 db kernel: SCSI error : <1 0 1 0> return code = 0x8000002Sep 28 02:35:27 db kernel: Info fld=0x0, Current sdb: sense key Aborted CommandSep 28 02:35:27 db kernel: end_request: I/O error, dev sdb, sector 120062031Sep 28 02:35:27 db kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block – inode=7503878, block=15007746Sep 28 02:35:27 db kernel: Aborting journal on device sdb1.Sep 28 02:35:27 db kernel: SCSI error : <1 0 1 0> return code = 0x8000002Sep 28 02:35:27 db kernel: Info fld=0x0, Current sdb: sense key Aborted CommandSep 28 02:35:27 db kernel: end_request: I/O error, dev sdb, sector 273146143……Sep 28 02:35:29 db kernel: SCSI error : <1 0 1 0> return code = 0x8000002Sep 28 02:35:29 db kernel: Info fld=0x0, Current sdb: sense key Aborted CommandSep 28 02:35:29 db kernel: end_request: I/O error, dev sdb, sector 63Sep 28 02:35:29 db kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: IO failureSep 28 02:35:29 db kernel: SCSI error : <1 0 1 0> return code = 0x8000002Sep 28 02:35:29 db kernel: Info fld=0x0, Current sdb: sense key Aborted CommandSep 28 02:35:29 db kernel: end_request: I/O error, dev sdb, sector 272573927Sep 28 02:35:29 db kernel: SCSI error : <1 0 1 0> return code = 0x8000002Sep 28 02:35:29 db kernel: Info fld=0x0, Current sdb: sense key Aborted CommandSep 28 02:35:29 db kernel: end_request: I/O error, dev sdb, sector 272574783Sep 28 02:35:29 db kernel: ext3_abort called.Sep 28 02:35:29 db kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journalSep 28 02:35:29 db kernel: Remounting filesystem read-onlySep 28 02:35:29 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 28 02:35:29 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 28 02:35:29 db kernel: (scsi1:A:1): 160.000MB/s transfers (80.000MHz DT, offset 127, 16bit)Sep 28 02:35:29 db kernel: EXT3-fs error (device sdb1) in ext3_dirty_inode: IO failureSep 28 02:35:29 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 28 02:35:29 db kernel: __journal_remove_journal_head: freeing b_frozen_dataSep 28 02:35:29 db kernel: __journal_remove_journal_head: freeing b_frozen_dataSep 28 02:35:55 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 28 02:41:25 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 28 03:10:40 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 28 03:42:08 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 28 03:51:37 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has abortedSep 28 03:54:38 db kernel: EXT3-fs error (device sdb1) in start_transaction: Journal has aborted……基本上,可以肯定是磁盘损坏了,出现了坏道!
因为这里是没有使用 RAID 设施的,所以只能使用 rescue 主机替代,这属于高可用(HA: High Availability)的问题了,这里不讨论。回过头来,需要检查和确认这块磁盘的问题。
首先使用另外一个 IP 启动这台主机(因为 HA 已经使用了原来的 IP 地址)。在这里我遇到的情况是在重启时第二块磁盘(/dev/sdb)还是被“正常”挂载了(/dev/sdb1),这意味着你仍然可以从里面把数据拷贝出来,当然最好是只读使用!——运气不错,不是吗?但这并不意味着这块磁盘就没有问题了,需要先进行检查:
sh# fsck.ext3 /dev/sdb1e2fsck 1.35 (28-Feb-2004)/dev/sdb1: clean, 39195/17924096 files, 22706784/35843015 blocks好像没有问题?但真的是这样吗?
应该使用其他的超级块检查一下,因为很可能这个超级块本身就有问题。从 Ext3 文件系统使用磁盘的基本原理上看,一个磁盘被分成若干 groups,每个 groups 包含很多物理 blocks(通常为 4KB,和内存的一个 page 的大小一致。注意,这里和使用 fdisk 是显示的 Blocks 是不一样的,后者是以 1KB 为单元的),通常一个 group 可能包含 8192/16384/32768 个 blocks,取决于磁盘的大小或创建分区文件系统时的设置,而每一个 group 上面都有一个超级块(super block),只有第一个 group 的超级块被实际使用,其他 group 的超级块都作为备用。因此,尝试使用另外一个超级块:
sh# fsck.ext3 -b 8192 /dev/sdb1e2fsck 1.35 (28-Feb-2004)fsck.ext3: Bad magic number in super-block while trying to open /dev/sdb1The superblock could not be read or does not describe a correct ext2filesystem.  If the device is valid and it really contains an ext2filesystem (and not swap or ufs or something else), then the superblockis corrupt, and you might try running e2fsck with an alternate superblock:    e2fsck -b 8193 sh# fsck.ext3 -b 8193 /dev/sdb1e2fsck 1.35 (28-Feb-2004)fsck.ext3: Bad magic number in super-block while trying to open /dev/sdb1The superblock could not be read or does not describe a correct ext2filesystem.  If the device is valid and it really contains an ext2filesystem (and not swap or ufs or something else), then the superblockis corrupt, and you might try running e2fsck with an alternate superblock:    e2fsck -b 8193 两者都不行!这说明一个 group 的大小不是 8192,第二个 super block 不在第 8193 个 block 上,那么可以通过 tune2fs 来获得文件系统的信息,进而找到下一个超级块。
sh# tune2fs -l /dev/sdb1tune2fs 1.35 (28-Feb-2004)Filesystem volume name:   Last mounted on:          Filesystem UUID:          204a2d09-12d5-4623-bb0d-75b2e340401dFilesystem magic number:  0xEF53Filesystem revision #:    1 (dynamic)Filesystem features:      has_journal resize_inode filetype sparse_super large_fileDefault mount options:    (none)Filesystem state:         cleanErrors behavior:          ContinueFilesystem OS type:       LinuxInode count:              17924096Block count:              35843015Reserved block count:     1792150Free blocks:              13136231Free inodes:              17884901First block:              0Block size:               4096Fragment size:            4096Reserved GDT blocks:      1024Blocks per group:         32768Fragments per group:      32768Inodes per group:         16384Inode blocks per group:   512Filesystem created:       Sun Jan  7 15:50:06 2007Last mount time:          Fri Sep 28 10:05:06 2007Last write time:          Fri Sep 28 11:01:56 2007Mount count:              1Maximum mount count:      35Last checked:             Fri Sep 28 10:05:05 2007Check interval:           15552000 (6 months)Next check after:         Wed Mar 26 10:05:05 2008Reserved blocks uid:      0 (user root)Reserved blocks gid:      0 (group root)First inode:              11Inode size:               128Journal inode:            8Default directory hash:   teaDirectory Hash Seed:      dd1084ed-a28f-489e-8e47-c4a3a49f5e6dJournal backup:           inode blocks可以看出每个块组拥有 32768 个 blocks,所以应该指定 -b 32768 来进行检查:
sh# fsck.ext3 -b 32768 /dev/sdb1e2fsck 1.35 (28-Feb-2004)/dev/sdb1 contains a file system with errors, check forced.Pass 1: Checking inodes, blocks, and sizesDuplicate blocks found… invoking duplicate block passes.Pass 1B: Rescan for duplicate/bad blocksDuplicate/bad block(s) in inode 311389: 712136Duplicate/bad block(s) in inode 311425: 680600 680633 680634 680635 680636 680637 680638 680639 680641 680642 680643 680644 680645 680646 680647 680665 680666 680667 680668 680669 680670 680671 680673 681752 682704 684104 684352 685688 686064Duplicate/bad block(s) in inode 311560: 657856 658872 662192 665192 669600 686504 689824 694104 696144 696904 698816 703032Duplicate/bad block(s) in inode 344144: 705776 706088 706784Duplicate/bad block(s) in inode 344151: 708952Duplicate/bad block(s) in inode 344170: 709200Duplicate/bad block(s) in inode 377354: 810496Error reading block 22872266 (Attempt to read block from filesystem resulted in short read).  Ignore error?已经报告出现了 bad blocks。
short read 通常意味着文件系统中的一个 i 节点指向一个不能再读取的块,或者关于文件系统的某个元数据位于一个或者几个不能读取的块。对于日志文件系统,如果文件系统的日志的任何一部分存储在一个坏块上,也会出现这个错误,因为不能读取事务。
如果运气不够好,磁盘分区完全不能挂载和读取,可以先尝试把日志删除,再尝试挂接和访问数据。先用 debugfs 来看看文件系统的特性:
sh# debugfs /dev/sdb1debugfs 1.35 (28-Feb-2004)debugfs:  featuresFilesystem features: has_journal resize_inode filetype sparse_super large_filedebugfs:  open /dev/sdb1open: Filesystem /dev/sdb1 is still open.  Close it first.debugfs:  open /dev/sdb1open: Filesystem /dev/sdb1 is still open.  Close it first.debugfs:  close /dev/sdb1close: Usage: close_filesysdebugfs:debugfs:  open /dev/sdb1open: Filesystem /dev/sdb1 is still open.  Close it first.debugfs:  quit可以看到 has_journal 特性。使用 tune2fs 可以设置和清除文件系统特性。下面的命令删除日志:
sh# tune2fs -f -O ^has_journal /dev/sdb1tune2fs 1.35 (28-Feb-2004)sh# debugfs /dev/sdb1debugfs 1.35 (28-Feb-2004)debugfs:  featuresFilesystem features: resize_inode filetype sparse_super large_filedebugfs:  open /dev/sdb1open: Filesystem /dev/sdb1 is still open.  Close it first.debugfs:  quithas_journal 特性已经被清除。这时候,再次运行 fsck.ext3,不使用 -b 指定 super block,也可以看到报错信息:
e2fsck 1.35 (28-Feb-2004)/dev/sdb1 contains a file system with errors, check forced.Pass 1: Checking inodes, blocks, and sizesError reading block 35750122 (Attempt to read block from filesystem resulted in short read) while doing inode scan.  Ignore error?总之,要从一个损坏的磁盘上恢复数据总是没有保证的,所以备份至关重要——”热镜像 + 轮转拷贝”才是根本之道。
参考:《LINUX SERVER HACKS 卷二》, Hacks #94 “从崩溃的磁盘恢复数据”。
另外,我的项目 cutils包含了 mirrord/fs_mirror 镜像和轮转工具。

 

Related posts 相關文章
WD 新策略 ? 將在 3 年後提示你的硬碟需要更換了
More...
救援一顆變成 RAW 格式的硬碟
More...
硬碟分割方式 2016
More...
Linux 檢查硬碟壞軌的工具/方式
More...

作者

留言

撰寫回覆或留言

發佈留言必須填寫的電子郵件地址不會公開。