Ok, so here's an update on the thing (thread incoming):

To recap, I suspected a thermal issue, and it seems that I was right about that. https://twitter.com/SamantazFox/status/1250057327711002624
Here is the HP iLO interface for system management. You can see a map (not very accurate, tho) of the temperatures reported by the different sensors across the board.

This could have been really helpful, but well, it wasn't and I'll explain why in a bit.
See that spike on the left?
This is the sensor labelled "Chipset-1". Atm it idles around 60°C.

Before "the fix" its temp was around 75°C, but given the "caution" and "critical" temperature ranges, I wasn't too worried about it, and eventually ignored it.
I posted my original thread, where some nice peeps recommended checking the iLO firmware.

So I did. I went to HPE's site, grabbed the latest firmware in RPM format, and extracted the .bin file in order to upgrade from the web UI.

But my mind was telling me "that's not the issue!"...
I tried to figure out if I could install SUSE Linux Enterprise Server (SLES) to see if the "Proliant Service Pack" toolset would help.

Maybe the current OS doesn't properly manage the cooling? Maybe the HP tools would help?
It turned out to be:

* too complicated (I couldn't use a USB key, so I wanted to do PXE, but my girlfriend wasn't OK with changing the DHCP config)

* quite expensive (I had to buy a new SD card to avoid destroying my current Debian install)

to the point where I decided to give up and look elsewhere.
As a reminder, I was losing IPMI *and* iLO access every time this issue happened.

So, I did a test run, during the day (when it's hot) and let it warm up.

I opened the case, inspected the board, and took a close look at it, until I noticed "the blank" on the PCIe-3/6 bay.
This thing (red/#1) is a 3D printed blank panel for the second, optional, PCIe bay.

It was greatly limiting the airflow in the chassis, to the point where the "Chipset-1" (blue/#2) was so hot that I couldn't keep a finger on it.
Even if the iLO reports okay-ish temps, the chip seems to have reached its thermal shutdown limit.

For later in this thread, note that my BIOS is currently set to "minimum cooling": it means that it tries to cool the system with as little noise as possible.
Which leads me to the following hypotheses:

1/ The cooling settings and/or system are designed for an empty PCIe2 bay fitted with an adequate blank (i.e. with holes in it)

2/ There is a T° reading issue on this chipset (I don't have a thermal camera to confirm this)
3/ The cooling system doesn't take the chipset temperature into account, in either of these cases:

3.1) this is managed by the board embedded firmware (linked to iLO)

3.2) this is managed by the OS (likely the Linux kernel)
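
One way to poke at hypotheses 2/ and 3.2) is to compare what the Linux kernel itself sees with what iLO reports. Below is a minimal sketch, assuming the kernel exposes its sensors through the standard hwmon sysfs interface (`/sys/class/hwmon`, values in millidegrees Celsius). Whether the "Chipset-1" sensor actually shows up there on this board is an assumption, and `read_hwmon_temps` is just a name made up for this example:

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN/chip/label": degrees_celsius} for every readable temp input."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        # Each hwmon directory normally has a "name" file identifying the driver.
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        for input_file in sorted(chip_dir.glob("temp*_input")):
            # tempN_label is optional; fall back to "tempN" when it is missing.
            label_file = input_file.with_name(
                input_file.name.replace("_input", "_label")
            )
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = input_file.name[: -len("_input")]
            try:
                millidegrees = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor present but unreadable: skip it
            temps[f"{chip_dir.name}/{chip}/{label}"] = millidegrees / 1000.0
    return temps
```

Calling `read_hwmon_temps()` during a warm-up run and comparing the values with the iLO temperature page would show whether the OS sees a hotter chipset than iLO does, or doesn't see that sensor at all (which would point towards 3.1).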