One of our customers kept getting requests from operators to re-scan a Nearline (network storage holding media files) for them. So I went to investigate the problem!
It turned out that the directory contents shown by IPDirector were stale: newer files were missing from the Database Explorer view, yet they were there when the same network share was opened in Windows Explorer.
In the IPDirector SyncroDB service logs we could see that the application had decided to stop managing the nearline storage:
Error\DirectoryEngine.log 3775 2016-01-31 15:57:51,2033599 – The directory \\x\xxxxx\xxxxxx\Media\Processed_Clips\ doesn’t exists or is not available.
Error\DirectoryEngine.log 3776 2016-01-31 15:57:51,2033599 –
Nearline mangement for \\x\xxxxxx\xxxxxx\Media\Processed_Clips\ must be stopped
shouldBeManaged:Yes (Accessible True MaxNLReached False) ).
Why does the Directory Engine think that the nearline is not available?
Error\Ping.log 1446 2016-01-31 15:57:51,2033599 – Ping failed after 3 tries: TimeOut
IPD then stops tracking changes in the network folder because it considers the folder inaccessible: the FileSystemWatcher is removed, so no further notifications will tell SyncroDB about changes in the directory. Consequently, no new clips will appear in IPD.
Log\DirectoryManager_11266.log 31046 2016-01-31 15:57:51,2033599 – Stopping management of directory : \\x\xxxx\xxxx\Media\Processed_Clips\
Log\DirectoryManager_11266.log 31047 2016-01-31 15:57:51,2033599 – Removing FileSystemWatcher for \\x\xxxx\xxxxx\Media\Processed_Clips\
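To make the correlation between these log files easier to spot, here is a minimal Python sketch that groups log lines by timestamp. The line layout (file, sequence number, date, time with a fractional part, dash, message) is only my assumption based on the excerpts above, and group_by_second is a hypothetical helper name, not anything IPDirector provides.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the excerpts in this post:
#   <file> <sequence> <date> <time>,<fraction> – <message>
LINE_RE = re.compile(
    r"^(?P<file>\S+)\s+(?P<seq>\d+)\s+"
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\s+[–-]\s*(?P<message>.*)$"
)


def group_by_second(lines):
    """Map each timestamp (to the second) to the (file, message) pairs logged then."""
    events = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # wrapped continuation lines do not match and are skipped
            events[match["stamp"]].append((match["file"], match["message"]))
    return dict(events)


if __name__ == "__main__":
    sample = [
        r"Error\Ping.log 1446 2016-01-31 15:57:51,2033599 – Ping failed after 3 tries: TimeOut",
        r"Log\DirectoryManager_11266.log 31046 2016-01-31 15:57:51,2033599 – Stopping management of directory",
    ]
    for stamp, entries in group_by_second(sample).items():
        print(stamp, [name for name, _ in entries])
```

Run over the merged logs, a ping failure and a "Stopping management" entry landing in the same second is the pattern to look for.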
But do the pings actually fail or time out, or is this some application-specific glitch? The way to prove it is to use third-party software. I started pinging the hostnames with a utility called PingInfoView:
As you can see, around 5% of the pings time out (note that these are timeouts, not any other kind of failure).
One of my colleagues suggested an excellent way to test this behaviour further: ping the actual IP addresses of all the Media Servers (also known as ‘nodes’) and see which pings fail. I was kindly provided with the list of addresses from the Isilon OneFS management page and started pinging them all. An interesting pattern emerged:
Only a few of the MSs have this issue – the majority respond to 100% of the pings!
Then I highlighted the problematic addresses in the list of network interfaces below:
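I did this survey with PingInfoView, but the same per-node check can be scripted. Below is a rough Python sketch under a few assumptions: it shells out to the system ping command (the -n/-w flags on Windows, -c/-W as understood by Linux iputils), treats a zero exit code as a reply, and uses placeholder addresses from the 192.0.2.0/24 documentation range instead of the real node IPs. All function names are hypothetical.

```python
import platform
import subprocess


def ping_once(host, timeout_s=1):
    """Send one ICMP echo with the system ping command; True if it exits with 0.

    Assumption: exit code 0 means a reply arrived. On Windows, some
    "unreachable" answers also exit with 0, so treat this as approximate.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    done = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return done.returncode == 0


def loss_percent(results):
    """Percentage of failed probes in a list of booleans (True = reply)."""
    if not results:
        return 0.0
    return 100.0 * results.count(False) / len(results)


def survey(hosts, probes=20, probe=ping_once):
    """Probe every host `probes` times and return {host: loss percentage}."""
    return {host: loss_percent([probe(host) for _ in range(probes)]) for host in hosts}


def flaky(report, threshold=0.0):
    """Hosts whose loss percentage exceeds the threshold, sorted."""
    return sorted(host for host, loss in report.items() if loss > threshold)


if __name__ == "__main__":
    # Placeholder addresses, not the real Isilon node IPs.
    report = survey(["192.0.2.10", "192.0.2.11"], probes=5)
    print(report, flaky(report))
```

The probe function is passed in as a parameter so the summary logic can be exercised with a fake probe, without touching the network.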
Looks like two of the nodes are having a particularly bad time responding to pings!
I consulted friends of mine who work with *nix systems all the time, and they suggested that we could be experiencing (deep breath!) kernel buffer overloads in FreeBSD, which runs on the Isilon media servers. They suggested checking:
to see what might be causing the system hangs. As we will see below, this hypothesis about system overload was correct.
Luckily, the solution turned out to be quite simple. After evaluating my findings, the Broadcast IT department implemented two changes on the Isilon and the network:
Load balancing on the Isilon storage was changed from ‘Connection Count’ to ‘CPU Usage’. This means the DNS server now answers each DNS query with the IP of an Isilon node that has less CPU load on it.
It turned out that LACP had been disabled on some nodes, so it was enabled for Nodes 2 & 11 (.129, .128, .132, .131).
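To illustrate why the load-balancing change matters, here is a toy Python model of the two policies. It is only a sketch of the idea, not how Isilon implements it: the Node class, both picker functions, the addresses and the numbers are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    ip: str             # placeholder address, not a real node IP
    connections: int    # currently open client connections
    cpu_percent: float  # current CPU load


def pick_by_connection_count(nodes):
    """'Connection Count' idea: hand out the node with the fewest connections."""
    return min(nodes, key=lambda n: n.connections)


def pick_by_cpu_usage(nodes):
    """'CPU Usage' idea: hand out the node with the lowest CPU load."""
    return min(nodes, key=lambda n: n.cpu_percent)


if __name__ == "__main__":
    nodes = [
        Node("192.0.2.2", connections=3, cpu_percent=95.0),   # few clients, but heavy ones
        Node("192.0.2.3", connections=10, cpu_percent=20.0),  # many light clients
    ]
    print(pick_by_connection_count(nodes).ip)
    print(pick_by_cpu_usage(nodes).ip)
```

In this made-up case the connection-count policy keeps sending clients to the node that is already struggling, while the CPU-based policy steers them away from it.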
The new configuration has been tested successfully: no more errors in the logs, no more missing clips, and the operators are happy!