Investigating network storage disconnections

Operators at one of our customers kept asking for a re-scan of the Nearline (network storage with media files). So I went to investigate the problem!


Symptoms

The directory contents shown by IPDirector were out of date: newer files were missing from the Database Explorer view. Yet when the same network share was opened in Windows Explorer, it contained the newer files that were not present in IPD.
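A quick way to quantify this symptom is to diff the share's listing against the clip names the application knows about. The sketch below is only an illustration: the function name missing_from_catalog is made up, and it assumes the catalog's file names have already been exported to a plain list by some other means (how to get them out of IPD is not covered here).

```python
import os
from typing import Iterable, Set


def missing_from_catalog(share_path: str, catalog_names: Iterable[str]) -> Set[str]:
    """Return names of files on the share that the catalog does not know about."""
    # Windows file names are case-insensitive, so compare in lower case.
    known = {name.lower() for name in catalog_names}
    with os.scandir(share_path) as entries:
        on_disk = {entry.name for entry in entries if entry.is_file()}
    return {name for name in on_disk if name.lower() not in known}
```

Any name this returns is a file that sits on the share but never reached the application's view.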

 

Investigation

The IPDirector SyncroDB service logs showed that the application had decided to stop managing the nearline storage:

Error\DirectoryEngine.log 3775 2016-01-31 15:57:51,2033599 – The directory \\x\xxxxx\xxxxxx\Media\Processed_Clips\ doesn’t exists or is not available.
Error\DirectoryEngine.log 3776 2016-01-31 15:57:51,2033599 –
Nearline mangement for \\x\xxxxxx\xxxxxx\Media\Processed_Clips\ must be stopped
error:False
NotAccessible:False
NotAvailable:True
NotReadable:False
shouldBeManaged:Yes (Accessible True MaxNLReached False) ).
Stopping management.

Why does the Directory Engine think that the nearline is not available? Because its ping check timed out:

Error\Ping.log        1446        2016-01-31 15:57:51,2033599 – Ping failed after 3 tries: TimeOut

IPD then stops tracking changes in the network folder because it considers the folder unavailable: the FileSystemWatcher is removed, so no further notifications reach SyncroDB about changes in the directory. Consequently, no new clips appear in IPD.

Log\DirectoryManager_11266.log        31046        2016-01-31 15:57:51,2033599 – Stopping management of directory : \\x\xxxx\xxxx\Media\Processed_Clips\

Log\DirectoryManager_11266.log        31047        2016-01-31 15:57:51,2033599 – Removing FileSystemWatcher for \\x\xxxx\xxxxx\Media\Processed_Clips\
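
The chain of events in these log lines can be sketched as follows. This is only a Python illustration of the behaviour the logs suggest, not IPDirector's actual code: DirectoryState, host_reachable and check_directory are invented names, and the probe is passed in as a function so the sketch does not depend on any particular ping implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DirectoryState:
    path: str
    managed: bool = True          # directory is being tracked
    watcher_active: bool = True   # change notifications are being delivered


def host_reachable(probe: Callable[[], bool], tries: int = 3) -> bool:
    """Return True as soon as one probe succeeds, False if every try fails."""
    for _ in range(tries):
        if probe():
            return True
    return False


def check_directory(state: DirectoryState, probe: Callable[[], bool],
                    tries: int = 3) -> DirectoryState:
    """Stop management and drop the watcher when the host looks unavailable."""
    if not host_reachable(probe, tries):
        state.managed = False
        state.watcher_active = False
    return state
```

Note that nothing in this sketch switches the watcher back on once the host answers again. Whether IPDirector retries on its own is not visible in the logs above, but a one-way switch like this would explain why a manual re-scan was needed after a short burst of timeouts.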

 

But do the pings actually fail or time out, or is this some application-specific glitch? One way to check is with independent third-party software, so I started pinging the hostnames with the PingInfoView utility:

PingInfoView

As you can see, around 5% of the pings time out (note that these are timeouts, not any other kind of failure).


One of my colleagues suggested an excellent way to test this behaviour further: ping the IP addresses of all the actual Media Servers (also known as ‘nodes’) and see which pings fail. I was kindly provided with the list of addresses from the Isilon OneFS management page and started pinging them all. An interesting pattern emerged:

PingInfoView_failures

Only a few of the MSs have this issue – the majority respond to ping in 100% of cases!

I then highlighted the problematic addresses in the list of network interfaces below:

Interfaces_timeouts

Looks like two of the nodes are having a particularly bad time responding to pings!
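
The same per-address comparison can be done on raw ping results without a GUI tool. The sketch below assumes the results have already been collected as a list of booleans per address (True for a reply, False for a timeout); the function names are invented for illustration.

```python
from typing import Dict, List


def timeout_rate(results: List[bool]) -> float:
    """Fraction of pings that timed out; each entry is True for a reply."""
    if not results:
        raise ValueError("no ping results")
    return results.count(False) / len(results)


def flag_problem_hosts(samples: Dict[str, List[bool]],
                       threshold: float = 0.0) -> List[str]:
    """Return the addresses whose timeout rate is above the threshold."""
    return sorted(host for host, results in samples.items()
                  if timeout_rate(results) > threshold)
```

With the default threshold of 0.0, any address that missed even a single ping is flagged, which matches the "100% or not" split seen in the screenshots.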

I consulted friends who work with *nix systems all the time, and they suggested that we could be experiencing (deep breath!) kernel buffer overloads within FreeBSD, which runs on the Isilon media servers. They suggested checking:

cat /proc/interrupts

iostat

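to see what may be causing the system hangs. (Note that /proc/interrupts is a Linux path; on stock FreeBSD the interrupt counters are shown by vmstat -i.) As we will see below, the hypothesis about system overload was correct.
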

Solution

 

Luckily, the solution turned out to be quite simple. After evaluating my findings, the Broadcast IT department implemented two changes on the Isilon and the network:

Load balancing on the Isilon storage was changed from ‘Connection Count’ to ‘CPU Usage’. This means that each DNS query is now answered with the IP address of an Isilon node that has a lower CPU load, rather than one that merely has fewer open connections.

It turned out that LACP had been disabled on some nodes, so it was enabled for Nodes 2 and 11 (.129, .128, .132, .131).
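
The difference between the two balancing policies in the first change can be sketched in a few lines. This is a simplified model, not the way Isilon implements it: the Node class, both function names and the sample figures are made up for illustration, and the addresses come from a documentation range.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    ip: str
    connections: int    # open client connections
    cpu_percent: float  # current CPU load


def pick_by_connection_count(nodes: List[Node]) -> str:
    """Hand out the node with the fewest open connections."""
    return min(nodes, key=lambda n: n.connections).ip


def pick_by_cpu_usage(nodes: List[Node]) -> str:
    """Hand out the node with the lowest CPU load."""
    return min(nodes, key=lambda n: n.cpu_percent).ip


# Hypothetical cluster: one node has few clients but is nearly saturated.
nodes = [
    Node("192.0.2.1", connections=3, cpu_percent=95.0),
    Node("192.0.2.2", connections=10, cpu_percent=20.0),
]
busy_choice = pick_by_connection_count(nodes)  # picks the saturated node
idle_choice = pick_by_cpu_usage(nodes)         # picks the node with headroom
```

A node with only a handful of heavy clients looks attractive under ‘Connection Count’, so new clients keep being sent to a machine that is already struggling to answer; ‘CPU Usage’ steers them away from it.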

The new configuration has been successfully tested – no more errors in logs or missing clips, and the operators are happy!