Did I just brick my SAS drive?

I was trying to make a pool with the other 5 drives and this one kept giving errors. As a complete beginner I turned to gpt…

What can I do? Is that drive bricked for good?

Don’t clown on me, I understand my mistake in running shell scripts from AI…

EMPTY DRIVES NO DATA

The initial error was:

Edit: sde and sda are the same drive; the name just changed for some reason. Also, I know it was 100% my fault and preventable 😞

**Edit:** from LM22, output of sudo sg_format -vv /dev/sda (broken link)

BIG EDIT:

For people that can help (btw, thx a lot), some more relevant info:

Exact drive model: SEAGATE ST4000NM0023 XMGG

HBA model and firmware: output of lspci | grep -i raid: 00:17.0 RAID bus controller: Intel Corporation SATA Controller [RAID mode]. It’s an LSI card. Bought it here

Kernel version / distro: I was using TrueNAS when I formatted it. I’m now troubleshooting on another PC running Linux Mint 22 (kernel 6.8.0-38-generic)

Whether the controller supports DIF/DIX (T10 PI): output of lspci -vv (broken link)

Whether other identical drives still work in the same slot/cable: yes, all the other 5 drives worked when I set up a RAIDZ2, and a couple of them are the exact same model of HDD

COMMANDS This is what I got for each command: (broken link)


Solved by y0din! Thank you soo much!


Thanks for all the help 😁

  • y0din@lemmy.world
    2 months ago

    Right now there isn’t enough information to conclude that the drive is “bricked”.

    sg_format on a SAS drive with DIF enabled can absolutely make the disk temporarily unusable to the OS if the format parameters no longer match what the HBA/driver expects, but that is very different from a dead drive.

    To make any determination, more data is required. At minimum (boot with a live Linux USB drive if you are unable to get to this information):

    Please provide verbatim output from:

    • dmesg -T (from boot and when the drive is detected)
    • lsblk -o NAME,MODEL,SIZE,PHY-SeC,LOG-SeC
    • fdisk -l /dev/sdX
    • sg_inq /dev/sdX
    • sg_readcap -l /dev/sdX
    • sg_modes -a /dev/sdX

    Also specify:

    • Exact drive model
    • HBA model and firmware
    • Kernel version / distro
    • Whether the controller supports DIF/DIX (T10 PI)
    • Whether other identical drives still work in the same slot/cable

    Common possibilities (none can be confirmed without logs):

    • Drive formatted with DIF enabled but HBA/OS not configured for it
    • Logical/physical block size mismatch (e.g. 520/528 vs 512/4096)
    • Format still in progress or left the drive in a non-ready state
    • Mode pages changed that Linux does not like by default

    Things that are usually recoverable on SAS drives:

    • Re-formatting with correct sector size and DIF disabled
    • Clearing protection information
    • Power-cycling the drive after format completion
    • Formatting from a controller that fully supports the drive’s feature set

    Actual permanent bricking from sg_format alone is rare unless firmware flashing or vendor-specific commands were involved.

    Until logs are posted, all anyone can honestly say is:

    The drive is not currently usable, but there is no evidence yet that it is permanently damaged.

    If you can share this information it might be possible to get the drive back online, though I make no promises.

    (edit typos)

    • rook@lemmy.zipOP
      2 months ago

      Thank you for helping! Like I said I’m a complete beginner with little knowledge of all this, means a lot 🤗

      Just so you know, I connected the drive to my Dell PC, so it’s just the one broken drive, not all 6.

      Exact drive model: SEAGATE ST4000NM0023 XMGG

      HBA model and firmware: output of lspci | grep -i raid: 00:17.0 RAID bus controller: Intel Corporation SATA Controller [RAID mode]. It’s an LSI card. Bought it here

      Kernel version / distro: I was using TrueNAS when I formatted it. I’m now troubleshooting on another PC running Linux Mint 22 (kernel 6.8.0-38-generic)

      Whether the controller supports DIF/DIX (T10 PI): output of lspci -vv

      Whether other identical drives still work in the same slot/cable: yes, all the other 5 drives worked when I set up a RAIDZ2, and a couple of them are the exact same model of HDD

      COMMANDS This is what I got for each command: verbatim output from

      Edit: from LM22, output of sudo sg_format -vv /dev/sda

      I really appreciate your knowledge and help 🙂
      Let me know if anything else is needed

      • y0din@lemmy.world
        2 months ago

        Thanks for the additional details, that helps, but there are still some critical gaps that prevent a proper diagnosis.

        Two important points first:

        1. The dmesg output needs to be complete, from boot until the moment the affected drive is first detected.
        What you posted is cut short and misses the most important part: the SCSI/SAS negotiation, protection information handling, block size reporting, and any sense errors when the kernel first sees the disk.

        Please reboot, then run as root or use sudo:

        dmesg -T > dmesg-full.txt

        Do not filter or truncate it. Upload the full file.

        2. All diagnostic commands must be run with sudo/root, otherwise capabilities, mode pages, and protection features may not be visible or may be incomplete.

        Specifically, please re-run and provide full output (verbatim) of the following, all with sudo or as root, on the problem drive and (if possible) on a working identical drive for comparison:

        sudo lspci -nnkvv

        sudo lsblk -o NAME,MODEL,SIZE,PHY-SeC,LOG-SeC,ROTA

        sudo fdisk -l /dev/sdX

        sudo sg_inq -vv /dev/sdX

        sudo sg_readcap -ll /dev/sdX

        sudo sg_modes -a /dev/sdX

        sudo sg_vpd -a /dev/sdX

        Replace /dev/sdX with the correct device name as it appears at that moment.
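To avoid skipping a command or truncating output, the per-drive commands listed above could be wrapped in a small script that writes everything for one drive into one file. This is only a minimal sketch: it assumes sg3_utils and util-linux are installed and that it is run as root, and the function name collect_diag and the output file name are made up for illustration.

```shell
#!/bin/sh
# collect_diag DEVICE OUTFILE  (hypothetical helper, not part of sg3_utils)
# Runs each per-drive diagnostic command from the list above and appends
# its full, unfiltered output to OUTFILE under a header line.
collect_diag() {
    dev=$1
    out=$2
    : > "$out"    # start with an empty output file
    for cmd in \
        "lsblk -o NAME,MODEL,SIZE,PHY-SEC,LOG-SEC,ROTA $dev" \
        "fdisk -l $dev" \
        "sg_inq -vv $dev" \
        "sg_readcap -ll $dev" \
        "sg_modes -a $dev" \
        "sg_vpd -a $dev"
    do
        printf '===== %s =====\n' "$cmd" >> "$out"
        # $cmd is left unquoted on purpose so it splits into command + arguments.
        $cmd >> "$out" 2>&1 || printf '(command failed or is not installed)\n' >> "$out"
    done
}
```

After sourcing the file in a root shell, something like collect_diag /dev/sdb sdb-diag.txt would cover the problem drive; repeating it for the working drive gives a second file to compare against. lspci and dmesg are system-wide rather than per-drive, so they are left out of the loop.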

        Why this matters:

        • The Intel SATA controller you listed is not the LSI HBA. We need to see exactly which controller the drive is currently attached to and what features the kernel believes it supports.

        • That Seagate model is a 520/528-capable SAS drive with DIF/T10 PI support. If it was formatted with protection enabled and is now attached to a controller/driver path that does not expect DIF, Linux will report I/O errors even though the drive itself is fine.

        • sg_format -vv output alone does not tell us the current logical block size, protection type, or mode page state.

        Important clarification:

        • Formatting the drive under TrueNAS (with a proper SAS HBA) and then attaching it to a different system/controller is a very common way to trigger exactly this situation.

        • This is still consistent with a recoverable configuration mismatch, not a permanently damaged disk.

        Once we have:

        • Full boot-time dmesg

        • Root-level SCSI inquiry, mode pages, and read capacity

        • Confirmation of which controller is actually in use

        …it becomes possible to say concretely whether the drive needs:

        • Reformatting to 512/4096 with protection disabled

        • A controller that supports DIF

        • Or if there is actual media or firmware failure (less likely)

        At this point, the drive is “unusable”, not proven “bricked”. The missing data is the deciding factor.

        One more important thing to verify, given the change of machines:

        Please confirm whether the controller in the original TrueNAS system is the same type of LSI/Broadcom SAS HBA as the one in the current troubleshooting system.

        This matters because:

        DIF/T10 PI is handled by the HBA and driver, not just the drive.

        A drive formatted with protection information on one controller may appear broken when moved to a different controller that does not support (or is not configured for) DIF.

        Many onboard SATA/RAID controllers and some HBAs will enumerate a DIF-formatted drive but fail all I/O.

        If the original TrueNAS machine used:

        • A proper SAS HBA with DIF support

        then the best recovery path may be to put the drive back into that original system and either:

        • Reformat it there with protection disabled, or

        • Access it normally if the controller and OS were already DIF-aware

        If the original controller was different:

        • Please provide lspci -nnkvv output from that system as well (using sudo or run as root)

        • And confirm the exact HBA model and firmware used in the TrueNAS SAS controller

        At the moment, the controller change introduces an unknown that can fully explain the symptoms by itself. Verifying controller parity between systems is necessary before assuming the drive itself is at fault.

        (edit:)

        One last thing, how long did you let sg_format run for?

        It can take hours to complete one percent if the drive is large, probably a full day or more considering the capacity of your drive.

        I was just wondering if it might have been cut short for some reason and just needs to be restarted on the original hardware to complete the process and bring the drive back online.

        • rook@lemmy.zipOP
          2 months ago

          Thanks for the continued support! ❤

          I’ve attached an identical Seagate SAS drive from the server.

          To confirm, it is the same LSI card that was in the TrueNAS server. I pulled it out of the server and put it into the troubleshooting machine, where I ran the commands.

          It is this one: 01:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 [1000:0087] (rev 05)

          I did not see your other reply lol, I will also try this command that you recommended:

          sudo sg_format --format --size=512 --fmtpinfo=0 --pfu=0 /dev/sdb

          Also, the sg_format ran for less than 5 minutes, very quick. However, if I can recall, it did say it was completed.

          **Note:** the “bricked” drive is now sdb

          Identical working drive installed as sda

          Here is the dmesg -T > dmesg-full.txt with the identical drive

          Here is the output from the following commands (for each drive, separately):

          sudo lspci -nnkvv

          sudo lsblk -o NAME,MODEL,SIZE,PHY-SeC,LOG-SeC,ROTA

          sudo fdisk -l /dev/sdX

          sudo sg_inq -vv /dev/sdX

          sudo sg_readcap -ll /dev/sdX

          sudo sg_modes -a /dev/sdX

          sudo sg_vpd -a /dev/sdX

          Thanks again for all the help, I await your reply. :)

          I will let you know the results of (sudo sg_format --format --size=512 --fmtpinfo=0 --pfu=0 /dev/sdb), as soon as it’s done.

          • y0din@lemmy.world
            2 months ago

            Thanks for the update, that’s helpful.

            Confirming that the controller is a Broadcom / LSI SAS2308 and that it’s the same HBA that was used in the original TrueNAS system removes one major variable. It means the drive is now being tested under the same controller path it was previously attached to.

            The device mapping you described is clear:

            sda = known-good identical drive

            sdb = the problematic drive

            Running:

            sudo sg_format --format --size=512 --fmtpinfo=0 --pfu=0 /dev/sdb

            as you did is the correct next step to normalize the drive’s format and protection settings.
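Since sg_format is destructive and device names can shift between boots, one cautious habit is to print the exact command first and only execute it after checking the device name. A minimal sketch of that idea, using only the flags quoted above; the wrapper name format_plain_512 and the RUN variable are hypothetical and not part of sg3_utils:

```shell
#!/bin/sh
# format_plain_512 DEVICE  (hypothetical wrapper)
# Builds the sg_format command from the thread: 512-byte logical blocks,
# protection information off (--fmtpinfo=0 --pfu=0).
# By default it only prints the command; it runs it only when RUN=yes is set.
format_plain_512() {
    dev=$1
    set -- sg_format --format --size=512 --fmtpinfo=0 --pfu=0 "$dev"
    if [ "${RUN:-no}" = "yes" ]; then
        "$@"                          # really formats the drive: all data is lost
    else
        printf 'dry run: %s\n' "$*"   # show what would be executed
    fi
}
```

Calling format_plain_512 /dev/sdb prints the command for review; RUN=yes format_plain_512 /dev/sdb, run as root, would actually start the format.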

            A few general notes while this is in progress:

            • Some drives report completion before all internal states have fully settled. The operation then finishes in the background, and performance is reduced until it does
            • A power cycle after completion is recommended before testing the drive again

            At this point it makes sense to pause any further investigation until the current sg_format has fully completed and the system has been power-cycled.

            Once that’s done, the next step will be a direct comparison between sdb and the known-good sda using:

            sudo sg_readcap -lla

            • Reported logical and physical sector sizes

            • Protection / PI status

            As a general note going forward: on Linux and FreeBSD it’s safer to reference disks by persistent identifiers (e.g. /dev/disk/by-id/ or a UUID on Linux, glabel on FreeBSD) rather than /dev/sdX. UUIDs are safer still, though less human-readable. Device names can change across boots or after hardware reordering, as you have now experienced.
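On Linux, the persistent names for a disk are symlinks under /dev/disk/by-id that point at the current /dev/sdX node, so they can be looked up by resolving those links. A minimal sketch, assuming a readlink that supports -f (GNU coreutils or BusyBox); the function name list_persistent_names is made up for illustration:

```shell
#!/bin/sh
# list_persistent_names DEVICE [BY_ID_DIR]  (hypothetical helper)
# Prints every symlink in BY_ID_DIR (default: /dev/disk/by-id) that
# resolves to the same node as DEVICE, e.g. /dev/sdb.
list_persistent_names() {
    target=$(readlink -f "$1") || return 1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue    # skip anything that is not a symlink
        if [ "$(readlink -f "$link")" = "$target" ]; then
            printf '%s\n' "$link"
        fi
    done
}
```

Running list_persistent_names /dev/sdb shows which by-id names currently map to sdb; those names can then be used in place of /dev/sdb in later commands.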

            Post the results once the sg_format has completed, and we can continue from there.

            • rook@lemmy.zipOP
              2 months ago

              Great News!

              Format completed and now the drive is viewable in “Disks” (however, it is still shown as unknown compared to the other one; it might just need a normal format).

              The code for the comparison returns invalid option, I assumed you need just -l comparison:

              sudo sg_readcap -l /dev/sdb and sudo sg_readcap -l /dev/sda

              One question I have is: what do you mean by powercycle? Is that another command to run on the problematic drive? If you mean turn off the pc and turn it back on, I will do that right now, just after the drive has completed formatting.

              After PowerCycle (turned pc off and on)

              sudo sg_readcap -l /dev/sdb and sudo sg_readcap -l /dev/sda

              Would the next step be formatting of some kind?

              • y0din@lemmy.world
                2 months ago

                That’s good news — what you’re seeing now is the expected state.

                A quick clarification first:

                Power cycle means exactly what you did: shut the machine down completely and turn it back on. There is no command involved. You did the right thing.

                Regarding the current status:

                The drive showing up in Disks but marked as unknown is normal

                At this point the disk has:

                • No partition table

                • No filesystem

                “Unknown” here does not indicate a problem, only that nothing has been created on it yet

                About sg_readcap:

                sg_readcap -l is correct

                There is no direct “comparison” mode; running it separately on sda and sdb is exactly what was intended

                The important thing is that both drives now report sane, consistent values (logical block size, capacity, no protection enabled)
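If the sg_readcap -l output of each drive is saved to a file, the two values that matter most here can be pulled out and compared side by side. A minimal sketch, assuming the output contains a line with prot_en=<n> and a line of the form "Logical block length=<n> bytes", which is how sg3_utils prints it as far as I know (check against your own output); the function name readcap_summary is hypothetical:

```shell
#!/bin/sh
# readcap_summary FILE  (hypothetical helper)
# FILE holds saved output of: sg_readcap -l /dev/sdX
# Prints one line with the protection flag and the logical block length,
# or "?" for a value that could not be found in the file.
readcap_summary() {
    prot=$(sed -n 's/.*prot_en=\([0-9]*\).*/\1/p' "$1" | head -n 1)
    len=$(sed -n 's/.*Logical block length=\([0-9]*\) bytes.*/\1/p' "$1" | head -n 1)
    printf 'prot_en=%s block_length=%s\n' "${prot:-?}" "${len:-?}"
}
```

For example, after sudo sg_readcap -l /dev/sda > sda.txt and the same for sdb, running readcap_summary on both files should print identical lines if the two drives are now formatted the same way.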

                Next steps:

                Yes, the next step is normal disk setup, just like with any new drive:

                1. Create a partition table (GPT is typical)

                2. Create one or more partitions

                3. Create a filesystem (or add it back into ZFS if that’s your goal)

                At this stage the drive has transitioned from “unusable” to functionally recovered. From here on, you’re no longer fixing a problem — you’re just provisioning storage.

                If you plan to put it back into TrueNAS/ZFS, it’s usually best to let TrueNAS handle partitioning and formatting itself rather than doing it manually on Linux.

                Nice work sticking with the process and verifying things step by step.

                • rook@lemmy.zipOP
                  2 months ago

                  Oh my, what a ride! I got everything up and running in a RAIDZ2 with the 6 x 4TB drives! (Soon I will add another 4 x 1TB in an Icy Dock as a separate vdev.)

                  Everything works now with no errors! 🥳

                  I could not have fixed this without your help. You are a lifesaver and probably saved this drive from the landfill lol. I honestly can’t thank you enough for your continuous support throughout many days!

                  You are the light that shows that there are still good people on the internet that want to help, and not just lurkers that laugh and move on and treat everything as content instead of a person on the other side sharing something that is important to them.

                  In my case I was in need of help, and like one comment put it: Out of the 50 messages of ridicule, one person will actually go out of their way and help.

                  I learned soo much and a good lesson too!

                  Thanks again for your help, and I will remember this interaction for the rest of my self-hosting journey! I’m serious.

                  Keep helping others and sharing your knowledge. I will pay this kind gesture forward in the new year, and help others more with the things that I know. 🫡

                  (Please don’t delete this convo, might help someone in the future)

                  Thanks again and Happy Holidays!

                  I wish you all the best in the New Year! 🤗 🎉

  • 6nk06@sh.itjust.works
    2 months ago

    As a complete beginner I turned to gpt

    I tell people not to do that all the time. They’d rather listen to the statistical vomit machine.

    • squaresinger@lemmy.world
      2 months ago

      Can you blame them?

      The manuals are written by experts for experts and in most cases entirely useless for complete beginners who likely won’t be able to even find the right manual page (or even the right manual to begin with).

      Tutorial pages are overwhelmingly AI vomit too, but AI vomit from last year’s AI, so even worse than asking AI right now.

      Asking for help online just gets you a “lol, RTFM, noob!”

      Look at this thread right now and count how many snarky bullshit answers are there that don’t even try to answer the question, how many answers like “I got no idea” are there and then how many actually helpful answers are here.

      Can you really blame anyone who turns to AI, because that garbage at least sounds like it tries to help you?

      • Brewchin@lemmy.world
        2 months ago

        Can you really blame anyone who turns to AI, because that garbage at least sounds like it tries to help you?

        A comfortable lie is still a lie. Everything that comes out of an LLM is a lie until proven otherwise. (“Lie” is a bit misleading, though, as they don’t have agency or intent: they’re a variation of your phone keyboard’s next-word text prediction algorithm. With added flattery and confidence.)

        There’s a reason experienced people stress hard to others about not using them as shortcuts to your own knowledge. This is the outcome.

        Another way to look at it is “trust, but verify”. If you’re intent on relying on probabilistic text as an answer, instead of bothering to learn, then take what it’s given you and verify what that does before doing it. You could learn to be an effective sloperator with just that common sense.

        But if you’re going to give an LLM root/admin access to a production environment, then expect to be laughed at, because you had plenty of opportunities to not destroy something and actively chose not to use them.

        • squaresinger@lemmy.world
          2 months ago

          I had a problem with Fedora 42, where the performance of my games would be fine one day and abysmal another day. Couldn’t find a pattern. I googled a ton, tried to debug myself, asked on reddit, stackexchange, the fedora forum and lemmy. I only got answers like “Works fine on my machine, noob” and “I have that problem too”. It only affected games running in proton on heroic, everything else was fine.

          After about a year of on-and-off debugging and asking around, I swallowed my pride and asked ChatGPT.

          First answer from that thing was correct: I had run dnf update without doing a flatpak update right afterwards. Turns out, flatpak has its own copy of Nvidia drivers and if the system driver is updated without the flatpak copy being updated, it falls back to software rendering. So the performance was crap until I did flatpak update the next time, and broke again when I ran dnf update.

          I still haven’t found that in any documentation so far.

          AI is crap more often than not, but it does at least try to help and sometimes it actually does.

          Look in this thread here. Is there even a single answer that tries to help OP, or is every single answer here just dumb snark?

        • irmadlad@lemmy.world
          2 months ago

          Everything that comes out of an LLM is a lie until proven otherwise.

          Everything that comes off of a tutorial, or web page is paddling the same boat, without exception.

          • lambalicious@lemmy.sdf.org
            2 months ago

            Are you really putting LLM output on the same level of… hallucination-ness as a GameFAQs tutorial for a SNES game from the late 90s?

            I know tiktok has deep-fried and rotted the brains of entire generations but this is just ridiculous.

      • non_burglar@lemmy.world
        2 months ago

        Can you blame them?

        Yes. Using an LLM doesn’t make anyone any less responsible for what they do with its output.

        If your dumb friend gave you bad advice and you followed it, you are ultimately still responsible for your decisions.

        • squaresinger@lemmy.world
          2 months ago

          What’s your point? “Don’t use Linux unless you are a professional user”?

          Beginners have to begin somewhere and they need to get info from somewhere.

          A lot of Linux UX is still at the level where it doesn’t give enough relevant information to a non-technical user in a way that the user can actually make an informed decision. That is the core problem.

          Whether users get their wrong information from AI, Stackoverflow, random tutorials, Google, a friend or somewhere else hardly matters.

          Take for example a look at the setup process of a Synology NAS. A 10yo can successfully navigate that process, because it’s so well done. We need more of that, especially for FOSS stuff.

          Too much of Linux is built by engineers for engineers.