The purpose of a home lab is to learn. For me, it’s in many ways a revisit of the work with computers I started in my teenage years. Although the systems I’ve worked on have evolved, as has my knowledge, some things remain the same. First and foremost: you have to put the investment in to get the return.

I am working through the initial iteration of this home lab. What that means is I have TrueNAS set up serving iSCSI, and I have three nested Proxmox VMs configured. Disk passthrough was configured for the TrueNAS guest, and I put four 10K 1.2TB disks into it to serve as the ZFS pool. The high-level architecture is shown in the diagram at the top of this post.

The base Proxmox VE image was a template, and the others were cloned from that template post-install. I set up multipathing over iSCSI and configured shared LVM per the guidelines Proxmox lays out.
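For reference, the rough sequence looks something like the following. This is a sketch of the standard approach, not my exact commands; the portal address, target IQN, device name, and storage/VG names are all placeholders:

```shell
# Discover the target and log in from each Proxmox node
# (192.0.2.10 and the IQN below are placeholder values).
iscsiadm -m discovery -t sendtargets -p 192.0.2.10
iscsiadm -m node -T iqn.2005-10.org.freenas.ctl:lab -p 192.0.2.10 --login

# Verify multipathd has assembled the paths into a single device.
multipath -ll

# Create the shared volume group on the multipath device (placeholder name).
pvcreate /dev/mapper/mpatha
vgcreate vg_san /dev/mapper/mpatha

# Register it in Proxmox as shared LVM storage so all nodes can use it.
pvesm add lvm san-lvm --vgname vg_san --shared 1
```

The `--shared 1` flag is what tells Proxmox the volume group is visible from every node, which is what makes live migration between the nested hosts possible.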

I then used this logical volume to provision storage to a VM inside the nested Proxmox instances. Because the storage is shared LVM, I can migrate VMs between them as I please.

This is where things fell apart. The machine was horribly slow to boot, and even mundane things like logging in were slow. I started using dd for disk testing and found I was getting writes of about 35 MB/s and reads of about 16 MB/s. I was also seeing horrible iowait on even basic actions.
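For anyone repeating this, the dd invocations I mean are roughly the following (the test path is a placeholder; `conv=fdatasync` matters, since without it dd reports the cached write speed rather than what actually reached the disk):

```shell
# Sequential write test: 256 MiB, flushed to disk before the rate is reported.
# /mnt/test/ddtest.bin is a placeholder path on the storage under test.
dd if=/dev/zero of=/mnt/test/ddtest.bin bs=1M count=256 conv=fdatasync

# Sequential read test: drop the page cache first (as root) so the reads
# actually hit the disk instead of RAM.
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/test/ddtest.bin of=/dev/null bs=1M

rm -f /mnt/test/ddtest.bin
```

dd is a blunt instrument compared to fio, but it’s more than enough to spot a 16 MB/s problem.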

At this point, I changed as many variables as I could think of. I bumped the MTU to 9000, started looking through caching, altered NUMA settings, altered core/socket layouts, altered the CPU type, basically anything I could think of. I asked other people and didn’t get far. I knew that the “SAN” was capable of much more, because when I connected to iSCSI from the host node or an external workstation, performance was about 200 MB/s read and 100 MB/s write.

I’d been seeing massive amounts of repetitive error messages in the console of the TrueNAS VM. I couldn’t figure out what these meant, so I left them be. This was likely a mistake, as we’ll find out momentarily.

Iostat was showing that a test disk I wasn’t even using was seeing significant IO traffic. That seemed really odd. After looking through running processes in top, I found that udev was constantly hitting the disk. As it turns out, it was probing. I found that out via:

udevadm monitor --udev

After a while and some searching, that cascaded into:

iscsiadm -m session -P 3 | grep sd

This led me to discover that my disks were alternating between a blocked and an unblocked state.
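That flapping can also be watched directly in sysfs, without going through iscsiadm: each SCSI disk exposes its state there, where “running” is healthy and “blocked” means the kernel has quiesced the device during error recovery. A quick loop makes it easy to poll:

```shell
# Print the SCSI device state for every sd* disk on the system.
for state in /sys/block/sd*/device/state; do
    printf '%s: %s\n' "$state" "$(cat "$state")"
done
```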

I then forced a logout of the iSCSI sessions and restarted the iSCSI daemon:

iscsiadm -m session -u

iscsiadm -m session

systemctl restart iscsid

For some reason performance recovered, then degraded again. After that, I knew it wasn’t an inherent architectural issue, so what was the issue? A simple visit to the journal would have told me what was happening. These errors were scrolling by as fast as they could:

Feb 04 20:47:56 pveV1 kernel: sd 4:0:0:1: Power-on or device reset occurred

Feb 04 20:47:56 pveV1 iscsid[2349493]: Kernel reported iSCSI connection 3:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)

Feb 04 20:47:56 pveV1 iscsid[2349493]: connection3:0 is operational after recovery (1 attempts)

Feb 04 20:47:56 pveV1 iscsid[2349493]: Kernel reported iSCSI connection 3:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)

What did it mean? Obviously that resets were happening, but what was the root cause?

As it turns out, the iSCSI initiator name was the same on all three VMs. I changed it on two hosts, did some iSCSI logouts and iscsid daemon restarts, and resorted to rebooting one of the Proxmox VMs. I’m now getting close to the same performance I was seeing at the host. A nice boost, and my test VM inside the nested Proxmox system is actually performant.

A follow-up I need to work on is determining whether the ID was generated on the VMs by derivation from some other duplicated ID, or whether the file was already present in the template and simply copied over. Looking at it, I believe I should be able to delete the /etc/iscsi/initiatorname.iscsi file from the template, and a new one will be generated on boot.
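Until I confirm that, there’s a manual fix: open-iscsi ships an iscsi-iname utility that prints a fresh random IQN, so each clone can be given a unique initiator name by hand. A sketch, to be run as root on each affected node:

```shell
# Generate a new random initiator IQN, install it, and restart the
# daemon so the new name is used for subsequent logins.
echo "InitiatorName=$(iscsi-iname)" > /etc/iscsi/initiatorname.iscsi
systemctl restart iscsid
```

Existing sessions keep the old name until they’re logged out and re-established, so a logout/login cycle (or a reboot, as I ended up doing) is still needed afterward.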

What lessons are there from this misadventure?

  • Don’t ignore repetitive log entries, even if you don’t know what they mean.
  • If you don’t know what log entries mean, look at the other side if you can. Maybe those errors are easier to interpret.
  • Configuration management is a great asset when working with systems like this, as is a clearly defined understanding of the “identifiers” in a system.
  • “Nothing interesting happens when things work” really is true.
  • Getting past a bump in the road can take time, patience and effort. The knowledge is the reward, and is difficult to replicate short of doing.
  • This probably would never have happened if I had installed three physical machines. It’s important to understand that the compromises made in a home lab can be strange to troubleshoot.

Up next I’ll try to put some of my opinions of this configuration to paper. I have some more testing I want to do, and I have some more

By utadmin
