Micrologix Driver Losing Connection

I have a system I am field testing now. I have six remote MicroLogix 1100 PLCs connected over VPNs to the server. I am running Ignition 7.0.8 (b4866) with MicroLogix driver 1.1.1 (b1). I am having intermittent dropouts from my PLCs, and the console shows “[plcname] Connection lost due to IOException.” I am also getting an occasional message of “[plcname] Initialization request failed due to INVALID_DRIVER_STATE.” These dropouts are occurring approximately every 5 seconds. Right now, I have my Minimum Sampling Interval set to 20000 ms and my Stale Threshold set to 6. Any ideas on what I should check? Thanks.

Try increasing the Communication Timeout in the MicroLogix settings from the default of 2000 ms. I’m thinking that it may be too short when communicating through a VPN.

I just checked, and the following are my MicroLogix settings:

Browse Timeout: 30000
Read Timeout: 30000
Write Timeout: 30000
Communications Timeout: 20000
Browse Cache Timeout: 120000

Thanks.

OK, I restarted the context and watched for a while. I’m also getting the following errors:

[plcname] Connection lost due to IOException.
com.inductiveautomation.xopc.drivers.allenbradley.MicroLogixDriver
Level ERROR Thread http-8088-3

com.inductiveautomation.ignition.gateway.web.pages.config.systemconsole.LogViewer$SerializableLoggingEvent$ClonedThrowable: An established connection was aborted by the software in your host machine
sun.nio.ch.SocketDispatcher.read0(Native Method)
sun.nio.ch.SocketDispatcher.read(Unknown Source)
sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
sun.nio.ch.IOUtil.read(Unknown Source)
sun.nio.ch.SocketChannelImpl.read(Unknown Source)
com.inductiveautomation.xopc.driver.api.AbstractEthernetDriver$SelectionManager.readFromChannel(AbstractEthernetDriver.java:445)
com.inductiveautomation.xopc.driver.api.AbstractEthernetDriver$SelectionManager.run(AbstractEthernetDriver.java:582)
java.lang.Thread.run(Unknown Source)

Update:

I think I tracked down part of the problem to an issue in the VPN endpoint at the server. It looks like the router is bottlenecking my connection to around 700kbps. This wouldn’t be an issue for data collection, but when you add the client traffic on top of that, it appears to be bogging the network down enough to cause intermittent dropouts. I’m working on a resolution for this issue.

Question:

Would installing Panel Edition on the client PC help with any of the client traffic to the server, or is that essentially fixed? If I need to, I can probably get a second connection at the server and dedicate one to client traffic and one to PLC traffic; I’m just not sure how long that will take to set up.

I’m not sure how to answer that, because I’m fuzzy on your architecture topology. Where is the client PC? What does it communicate with? Is there a database? A diagram might be helpful.

Yeah, a diagram might help. I wasn’t completely awake yet when I posted this morning. :laughing:

All of the PLCs, the client, and the server are sitting on a VPN. Stability is pretty good when I initially restart the server, with a site or two dropping out every few minutes, but over time the dropouts become more frequent and longer in duration.

I’m spending this afternoon working on the kinks in the VPN endpoint, routers, and cable modems, so if there is anything you can think of for me to check, I’ll have access to the system. Thanks.


I think your diagram is missing some things. How does this VPN work? Is it a separate router, software, or built into the cable modem? Is there a cable modem attached to each of the outer nodes?

Sorry, all I have to work with is MS Paint, and I have no artistic skills, so I was trying to keep it simple. Here is a (hopefully) complete layout.


OK, that’s better. So each device is on its own, huh? Everything must go through the VPN to talk to anything else. In that case, putting Panel Edition on the client won’t do a thing. My instinct tells me that your VPN routers are to blame (I’ve found that the quality of “VPN” implementations varies greatly from vendor to vendor and flavor to flavor). Unfortunately it is difficult to troubleshoot this sort of thing.

Your best clue is the message “An established connection was aborted by the software in your host machine”. This seems to be caused when the TCP layer (in Windows) doesn’t receive an acknowledgement that data was sent correctly over a connection.
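If it helps to see where that message comes from, here is a stripped-down sketch of the kind of socket read the stack trace shows (this is not the driver source; the PLC address is a placeholder and 44818 is assumed to be the EtherNet/IP port the driver uses). When the local TCP stack gives up on unacknowledged data and tears the connection down, the read fails with exactly that IOException:

[code]
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Sketch only -- not the driver source. The PLC address is a placeholder and
// 44818 is assumed to be the EtherNet/IP port the driver talks to.
public class ReadProbe {
    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "10.1.1.10";
        int port = 44818;

        try (SocketChannel channel = SocketChannel.open(new InetSocketAddress(host, port))) {
            ByteBuffer buffer = ByteBuffer.allocate(4096);
            // Block on reads until the remote end closes the connection or the
            // local TCP stack aborts it (e.g. after repeated unacknowledged retransmits).
            while (channel.read(buffer) != -1) {
                buffer.clear(); // we only care about the failure mode, not the data
            }
            System.out.println("Connection closed normally by the remote end.");
        } catch (IOException e) {
            // This is the condition the gateway console reports as
            // "Connection lost due to IOException".
            System.out.println("Connection lost due to IOException: " + e.getMessage());
        }
    }
}
[/code]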

You might try bringing up the system slowly. For example, bring up the server, client, and one PLC. See if it can handle that. Then add a second PLC, etc. You may find that there is a saturation point at which the VPN and/or internet connection can no longer keep up.
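If you want to quantify that rather than eyeball it, timing a plain TCP connect to each PLC while you bring sites online will show whether the link itself is degrading. A rough sketch, with placeholder addresses and the EtherNet/IP port assumed:

[code]
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch only: the addresses are placeholders and 44818 is the assumed EtherNet/IP port.
// Run it on the server while adding sites; connect times that climb steadily or start
// timing out point at the VPN / internet link rather than at Ignition or the driver.
public class VpnLatencyProbe {
    private static final String[] PLCS = {"10.1.1.10", "10.1.2.10", "10.1.3.10"};
    private static final int PORT = 44818;
    private static final int TIMEOUT_MS = 5000;

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (String host : PLCS) {
                long start = System.currentTimeMillis();
                try (Socket socket = new Socket()) {
                    socket.connect(new InetSocketAddress(host, PORT), TIMEOUT_MS);
                    System.out.printf("%s  connect OK in %d ms%n",
                            host, System.currentTimeMillis() - start);
                } catch (IOException e) {
                    System.out.printf("%s  connect FAILED after %d ms: %s%n",
                            host, System.currentTimeMillis() - start, e.getMessage());
                }
            }
            Thread.sleep(60_000); // repeat once a minute
        }
    }
}
[/code]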

This sort of troubleshooting isn’t my strongest point. Perhaps someone with some stronger networking skills can chime in?

I’d appreciate any recommendations anyone has. The strange (to me) thing is that we started seeing these errors after bringing the first site online, and it has just gotten progressively worse the more sites we bring up. I’m still tweaking settings in the routers to try to stabilize it, but according to the routers, the VPN tunnels are never dropping out. This is making me think that the transmissions are being garbled or something similar. I’m going to try adjusting the MTU setting tonight and see if that improves anything.

With Wireshark you will be able to see the packets being transmitted and received. It will show the entire protocol stack, including TCP connections, and will detail any timeouts, errors, etc.

It is free and can be downloaded from www.wireshark.org

+1, Wireshark will be your friend in this case.

Could you describe your equipment and VPN setup/settings in more detail? We’re going to have to dig up more troubleshooting details to get to the root of the problem.

Yeah, I should have thought of Wireshark sooner. Here is the setup:

The Client PC and PLCs are sitting behind Cisco RVS4000 VPN Routers.
The server sits behind a Cisco RV082 VPN Router.
All cable modems are Scientific Atlanta model 2100.

VPN setup is as follows:
IPSec IKE with Preshared key
Subnet Security Group
768 bit
3DES encryption
MD5 Phase 1 Auth
SHA1 Phase 2 Auth
3600 sec key lifetime

I’m beginning to believe the problem is in the RV082. I have a separate VPN running between four of the PLCs for critical messaging, and it is running without a hitch on just the RVS4000 routers. I don’t have anything jumping out at me on Wireshark as a glaring error. I have a spare RVS4000 I am going to set up and see if that evens things out, and will report back whatever I find. Any ideas anyone comes up with before then are appreciated. Thanks.

Update:

I temporarily switched the RV082 over to an RVS4000, and I am getting the same results.

I am able to get better information from the router log on the RVS4000. Right now, I am showing no VPN errors on the routers. Just for giggles, I also upped the key lifetimes to 86400 seconds to see if that did anything, but there was no change.

It does appear, however, that I am having fewer dropouts using the RVS4000 than I was with the RV082. The problem is that the RVS4000 only supports 5 tunnels, and I need at least seven right now, which is why we used the RV082 at the server to begin with.

A. Please post the VPN error logs that you’re getting from the 4000.

B. Also, could you diagram/describe the VPN connections:

  1. What’s logically connected to what?
  2. Are all tunnels terminated where the PC is, or do you have a mesh of devices connected to each other?
  3. What does the IP scheme/routing look like at each node?

C. Could you save and post a Wireshark capture from one of the PLC sites and one from the PC side? One thing we’re looking for is dropped packets (i.e., packets out of sequence).

D. What is the MTU size set to on the cable modems? Too small could lead to packet fragmentation.

I suspect the following: routing loops or packet fragmentation.

Nathan,

Attached is an excerpt from the RVS4000 error log and a Wireshark capture file from the server. I’ll have to grab a capture from one of the PLC sites tomorrow. I don’t know if these will show much, as the system is fairly stable at the moment.

Right now, everything except the Client PC is on a tunnel to the Server location. Client is simply connecting through port 8088 on the router to the server, not through the VPN. Additionally, three of the PLCs have tunnels to a fourth PLC for backup comms. Everything is on a 10.1.x.x address, with a 255.255.255.0 subnet.

MTU should be 1500 on all sites right now. I’d have to verify this.

I’ve got a meeting with the cable company tomorrow, and they are going to run some diagnostics on their end to make sure they don’t have something causing problems. It could be a complete coincidence, but it makes me somewhat suspicious when the system becomes fairly stable after 10PM and starts dropping out badly again around 8-9AM.
ServerCap.zip (767 KB)

Another Update:

Things are still somewhat more stable today than yesterday (dropouts approximately every 5 minutes, rather than every 30 seconds or so).

I was looking at the console this morning, and caught a new event. Before a site drops out, I see the following:

Message waitForPendingToEmpty() took longer than 20 seconds. Failing remaining requests.
Logger com.inductiveautomation.xopc.drivers.allenbradley.MicroLogixDriver
Time 4/22/10 9:15:38 AM Level WARN Thread http-8088-22
Exception
Error is empty.

This seems to be pretty consistent. I don’t know if this is a symptom of the dropouts, or something else entirely.

[quote=“jgreenewv”]Another Update:

Things are still somewhat more stable today than yesterday (dropouts approximately every 5 minutes, rather than every 30 seconds or so).

I was looking at the console this morning, and caught a new event. Before a site drops out, I see the following:

Message waitForPendingToEmpty() took longer than 20 seconds. Failing remaining requests.
Logger com.inductiveautomation.xopc.drivers.allenbradley.MicroLogixDriver
Time 4/22/10 9:15:38 AM Level WARN Thread http-8088-22
Exception
Error is empty.

This seems to be pretty consistent. I don’t know if this is a symptom of the dropouts, or something else entirely.[/quote]

This is probably caused by the dropouts. The message is basically saying that it couldn’t process all of the requests in the pending queue within 20 seconds. If the connection to the PLC went down, this would absolutely be the case.

The waitForPendingToEmpty() took longer than 20 seconds message is normal if the connection to a PLC is lost, because pending communications cannot be completed.
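For what it’s worth, here is a rough illustration of the pattern that message implies (this is not the Ignition driver source, just a sketch with made-up request names): outstanding requests are tracked, and after a disconnect the driver waits up to 20 seconds for them to finish before failing whatever is left.

[code]
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustration only -- not the Ignition driver source. Each outstanding request
// counts down a latch when its response arrives; after a disconnect the driver
// waits up to 20 seconds for the queue to drain and fails whatever is left.
public class PendingQueueSketch {

    static class Request {
        final String name;
        final CountDownLatch done = new CountDownLatch(1);
        Request(String name) { this.name = name; }
        void complete() { done.countDown(); }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two hypothetical reads; only the first ever gets a response.
        List<Request> pending = Arrays.asList(new Request("read N7:0"), new Request("read B3:0"));
        pending.get(0).complete();

        long deadline = System.currentTimeMillis() + TimeUnit.SECONDS.toMillis(20);
        for (Request request : pending) {
            long remaining = deadline - System.currentTimeMillis();
            // Note: this demo really does block for the full 20 seconds, just as the
            // "waitForPendingToEmpty() took longer than 20 seconds" message implies.
            if (remaining <= 0 || !request.done.await(remaining, TimeUnit.MILLISECONDS)) {
                System.out.println("Failing remaining request: " + request.name);
            }
        }
    }
}
[/code]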

Reviewing the Wireshark file, we see slow responses and lost transmissions in both directions between Ignition and the MicroLogix that are causing TCP retries and taking 10-12 seconds to recover. With the Communication Timeout set high enough, as yours is at 20 seconds, this should not be a problem. Please send us the Ignition log files using the following steps:

  1. Make sure at least one MicroLogix device is in the OPC-UA device list.

  2. In the Gateway, select Console and then select the Levels tab.

  3. Type in MicroLogix and click the Set button.

  4. Set the trace level for all 5 entries to “TRACE”.

  5. Remove all MicroLogix devices, then add one back (if your process will allow).

  6. Wait for the problem to reoccur.

  7. Set the trace level for the 5 entries back to “INFO”.

  8. Zip up and email the wrapper.log files to support@inductiveautomation.com. The wrapper.log files are in the Ignition installation directory.

This will provide us with more details that may help.