ChipChop Support Forum
Log in
Log out
Join the forum
My Details
REPLIES: 8
VIEWS: 268
BACK
REPLY TO THIS POST
Hossrod
03 Feb 2024
ESP32 started to not be able to establish socket connection
Hello, I left my ESP32 running since yesterday and today it suddenly started to not be able to send heartbeat. I verified it was indeed connected to wifi (checked router, also my code turns a red LED on if connection is lost). Here is log from when this started happening...



16:57:27.771 -> {"api_call":"heartbeat",
16:57:27.989 -> {"status":"ok","timestamp":1706893049436}
16:57:37.814 -> {"api_call":"heartbeat",
16:57:38.033 -> {"status":"ok","timestamp":1706893059445}
16:57:47.819 -> {"api_call":"heartbeat",
16:57:57.843 -> {"api_call":"heartbeat",
16:58:07.819 -> {"api_call":"heartbeat",
16:58:17.866 -> {"api_call":"heartbeat",
16:58:27.868 -> {"api_call":"heartbeat",
16:58:37.895 -> {"api_call":"heartbeat",
16:58:47.905 -> {"api_call":"heartbeat",
16:58:57.909 -> {"api_call":"heartbeat",
16:59:07.921 -> {"api_call":"heartbeat",
16:59:17.911 -> {"api_call":"heartbeat",
16:59:27.947 -> {"api_call":"heartbeat",
16:59:37.966 -> {"api_call":"heartbeat",
16:59:47.976 -> {"api_call":"heartbeat",
16:59:57.980 -> {"api_call":"heartbeat",
17:00:07.997 -> {"api_call":"heartbeat",
17:00:18.013 -> {"api_call":"heartbeat",
17:00:27.997 -> {"api_call":"heartbeat",
17:00:38.040 -> {"api_call":"heartbeat",
17:00:48.045 -> {"api_call":"heartbeat",
17:00:58.047 -> {"api_call":"heartbeat",
17:01:08.063 -> {"api_call":"heartbeat",
17:01:18.057 -> {"api_call":"heartbeat",
17:01:58.098 -> {"api_call":"heartbeat",
17:01:58.098 -> ChipChop => connection closed. Don't panic, we'll try again in 4294939 seconds
17:02:05.091 -> ChipChop => Waiting to establish socket connection
17:02:08.090 -> ChipChop => Can not send heartbeat, waiting for handshake to complete.
17:02:18.100 -> ChipChop => Can not send heartbeat, waiting for handshake to complete.
17:02:27.095 -> ChipChop => Waiting to establish socket connection
17:02:30.108 -> ChipChop => Can not send heartbeat, waiting for handshake to complete.
17:02:40.111 -> ChipChop => Can not send heartbeat, waiting for handshake to complete.
17:02:49.117 -> ChipChop => Waiting to establish socket connection


Note that at first, it started to not get the status ok response. The last 3 lines repeated for several hours until I noticed it wasn't working. Also, notice it says it will try again in 4294939 seconds which doesn't sound correct.

FYI, I had to remove the command details as it made the post more than the allowed character limit.
Gizmo
03 Feb 2024


hmmm... I've checked the server logs for your location (you are on API2, right) and can't see anything out of the ordinary? Haven't had any reports from any users on that api server but it's now on my watch.


I have to say I have never seen anything like that "we'll try again in 4294939 seconds"??? that's like 50 days in the future :-)

I't kinda feels like somehow you are getting multiple instances of the websockets trying to run in parallel and compounding the wait time!
It's definitely a code issue, your log is showing a cut-off chunks of the requests so if ChipChop is only getting incomplete requests it may decide to block the communication on that socket but I would expect it to completely drop the connection although you may not be even seeing that message!

Can you please ping me the code via email so I can have a looksie, and also if you go in the Dev Console > My Account at the bottom you can generate an Access Key and send that as well so I can then track your device socket in real time and see if there is anything funny going on there.

It's probably something simple but I am not discounting a server problem so I am also going to monitor that one with scrutiny (the bugger is not even too busy currently chillin at 2.4% cpu!?)

p.s. I've responded to your question in the Notifications post, hope it will make sense!
Gizmo
03 Feb 2024

@Hossrod

This has been bugging me all day and just remembered the issues from last week with the esp-idf and the ESP32-C3. You said in the other post that you are using a C6 which is also a RiscV processor like the C3 so it feels too much of a coincidence!?

It seems that there is something going on with the WiFiSecure part of the sdk and I have managed on few occasions to reproduce the problem and the esp just keeps going in circles and dropping the socket connections (connect-drop-connect-drop) kinda similar to what you have described.

If you are using PlatformIO you can downgrade the ESP sdk which seems to be more stable but I have no clue how to do it in Arduino IDE?

I will keep investigating what's going on but one thing you can do in the meantime is to change the String server_uri to use ws:// instead of wss:// , that would switch the code to use the plain WiFiClient which seems to be fine and it doesn't matter to ChipChop which type of connection you use.

Keep me posted please, I want to get to the bottom of this!
Hossrod
03 Feb 2024

@Gizmo

Thanks! I emailed you my source code. I'm using Arduino IDE.
Hossrod
03 Feb 2024

@Gizmo

FYI, I looked in Arduino IDE and the only Espressif/ESP thing I've installed is this under Board Manager. Is this Arduino IDE's version of IDF SDK? Should I try a lower number? I installed this version after following a tutorial to setup my board. Not sure if my ESP32 C6 is available in an earlier version.



Attached images
Gizmo
03 Feb 2024

I've pinged you an email earlier but give it a go and see if it compiles (the numbers are different than what I have in PlatformIO so I'm not sure?)
Hossrod
04 Feb 2024

I think my issue was triggered by my wifi access point going offline for a few seconds for some reason.

While working on a different ESP32 (WROOM model), I turned off my AP to test something and noticed the ESP32 C6 thats heartbeating ChipChop did the exact same thing as before... Sending heartbeats, but not receiving status ok responses, then after a minute or so (a timeout in ChipChop library?) ChipChop logs "Don't panic, we'll try again in 4294937 seconds" and that it can not send heartbeat.

The odd thing here is that the heartbeat code on my ESP32 is wrapped in a check that wifi is connected...


if (WiFi.status() == WL_CONNECTED) {
// Update all component status
ChipChop.updateStatus("BlueLED", blue_led_status);
ChipChop.updateStatus("WhiteLED", white_led_status);
ChipChop.updateStatus("LightSensor", light_sensor_value);
} else {
Serial.println("Wifi connection lost. Attempting to reconnect...");
WiFi.reconnect();
}


So it seems that WiFi is continuing to report WL_CONNECTED even though its for sure not still connected as the AP is offline.

I think this is a bug in Espressifs code for the C6 as my WROOM device doesn't do this and hits the else logic path correctly.

Tomorrow I'll try running my ChipChop program on my WROOM device just to confirm and see if I can get it to auto reconnect if wifi goes down.

On a side note, you now have a repro case for the "Don't panic, we'll try again in 4294937 seconds" thing. Just move call to ChipChop.updateStatus() outside the WiFi status check, then turn off your wifi AP/Router and wait a minute or two.
Gizmo
04 Feb 2024
ACCEPTED

Well, butter my bum and call me a biscuit! Sir, you are a genius, please send my warmest regards to your dodgy wifi router!

That's it, I've managed to reproduce the issue and it is caused by WiFi fluctuation.

I had a bit of trouble killing the wifi to the little ESP32-C3, I have a mesh network around the house so even killing two routers didn't work (have 6) . Then I brought out the big guns and tried enclosing the little bugger in metal, starting with a small metal box then a soup can and finally gave up when a combo of cake tin and large cooking pot didn't do the trick!
I mean seriously, how on earth that little ceramic antenna can pick up the signal so well!?



Long story short, in the end I've tethered it to my phone and then stoped the wifi and bam, same messages and slightly different number from ChipChop but in the similar range.

So, the story with the library code. The wifi check that you use in your main loop is actually not required, it's only there in that example just to show a good coding practice because is saves a bit of work for the processor.
Technically you can update the statuses at any time (even non-stop) and the ChipChop library will still handle all the timings and the websockets will do their own connection checks. Have a looksie inside the "smart-button" example, there's no checks at all but the requests are still sent in correct intervals i.e. like a heartbeat.

That weird long number I have no idea, probably some wonky c++ maths with negative integers, easy to fix now that I know what to look for.

But, I do know now that the WebSocket connection check gets incorrect report back from the underlying Espressiff sdk. It's not the WiFi at all but the http client layer that incorrectly reports back that there is no client connection with the server and kills a completely valid connection and sometimes keeps doing it in a loop.

Now, thanks to your detective work and knowing that this is all caused by a "hiccup" in the wifi client I can potentially do some fail safe checks and try to dispose of the entire wifi client and restart the wifi or maybe dispose of the websockets or something like that. Alternatively I can maybe give an option (i.e. aggressive keep alive) and restart the entire ESP if the connection is hanging for too long? (that could maybe force the reconnection and on an ESP a restart/reconnect would take maybe 1-2 seconds)

Thank very much for this find, this has potentially saved me a week of going in circles.


Attached images
lampix
04 Feb 2024

so who suggested that we should downgrade the esp idf? 🙈