backend Feb 14

Debugging the un-debuggable


We all debug differently. Some of us rely on dumps or logs, others on dedicated debugging tools. But what do you do when you can’t use any of those? In the past few months, I worked on several features that required connecting to an external service or API. During development, I ran into many different responses, errors, and unusual bugs.

One problem is that every external system has its own way of handling requests, its own error messages, and its own bugs and workarounds. On top of that, many of these third-party solutions are poorly documented. Yet another problem can be very slow communication with the solution’s developers or customer support. So what do you do when you receive an error response from third-party software and don’t know how to solve it? In most cases, you will still have to wait for an answer from somebody working on that software. The strategies below can help you debug part of the problem yourself, so that you at least know what to ask when you reach customer support.

The timeout problem

Whether you receive an “Operation timed out” response, get an infinite load, or cannot ping the server IP address, in about 90% of cases there is a connection problem. You may not always get a timeout error; you can also get something like this:

* Hostname {serverIp} was found in DNS cache
*   Trying {serverIp}...
* TCP_NODELAY set
* Connected to {serverIp} ({serverIp}) port {serverPort} (#0)
> POST /{url} HTTP/1.1
Host: {serverIp}:{serverPort}
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17
Content-type: text/xml;charset="utf-8"
Accept: text/xml
Cache-Control: no-cache
Pragma: no-cache
Content-length: 629
* upload completely sent off: 629 out of 629 bytes
< HTTP/1.1 401 Unauthorized
< Content-Length: 0
< Server: Microsoft-HTTPAPI/2.0
< WWW-Authenticate: Negotiate
< Date: Thu, 13 Sep 2018 08:46:30 GMT
< 
* Connection #0 to host {serverIp} left intact

While the messages you receive may differ, the main symptom is not being able to successfully send a packet to the server. The easiest way to confirm this is to ping the server: if none of the sent packets go through, you most likely have a whitelist problem. The solution is to get a static IP address and ask customer support to whitelist it.
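Ping relies on ICMP, which some servers and firewalls block even for whitelisted clients, so it can help to also test the actual port your requests go to. The following is a minimal sketch of such a check in Python; the function name `can_connect` is made up for this example, and the host and port are whatever your provider gave you.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, DNS failures, and unreachable hosts.
        return False
```

If this returns False from your machine but True from a whitelisted coworker's machine, that points to the network path rather than to your request.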

If whitelisting doesn’t solve your problem, ask someone else with a whitelisted IP to try to connect to the server in question. If someone else can access the server and you, for whatever reason, cannot, that difference can be a lead on how to debug the problem. For instance, I once couldn’t connect to a server from a Linux-based OS, while a coworker with a Windows machine could. We were trying to connect to a Microsoft Dynamics NAV ERP system that used Windows NTLM user authentication. I was the first client they had who worked on a non-Windows laptop, so it was also the first time they faced this issue. As we didn’t find a way for me to connect, we had to move the ERP solution to another server without NTLM authentication.

Timeouts

A timeout can also happen the other way around. If you are listening for push notifications, your server can cause a timeout for the sender of the notification. If you are developing against a local server, your local address is probably not accessible from the Internet, so you cannot expect it to receive the push notification. I haven’t found a solution to this problem, so I just faked the calls the service would have made. But if your dev/staging/live server is returning the timeout, this is a problem that you have to solve. Depending on the configuration and security level of your server, you will have to either whitelist the notification sender’s IP address or open an additional port. If that doesn’t work, make an exception in your authentication.
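Faking the sender's calls against a local server can be as simple as posting a recorded or hand-written notification body to your own endpoint. Here is a rough sketch of that idea in Python; `send_fake_notification` is a hypothetical helper, and the URL, body, and content type are assumptions you would replace with whatever the real service sends.

```python
import urllib.request


def send_fake_notification(url: str, body: bytes, content_type: str,
                           timeout: float = 5.0) -> int:
    """POST `body` to `url` and return the HTTP status code of the reply.

    urllib raises urllib.error.HTTPError for 4xx/5xx replies, so a returned
    value means the local handler accepted the notification.
    """
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # An empty ProxyHandler bypasses any system proxy, since the target is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout) as response:
        return response.status
```

For example, `send_fake_notification("http://localhost:8000/notify", b"<event/>", "text/xml")` would exercise a local handler listening on that (assumed) address.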

Bad request response code

If you receive a response with response code 400, without any other errors, like this one:

* Hostname {serverIp} was found in DNS cache
*   Trying {serverIp}...
* TCP_NODELAY set
* Connected to {serverIp} ({serverIp}) port {serverPort} (#0)
> GET /{url} HTTP/1.1
Host: {serverIp}:{serverPort}
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17
Content-type: text/xml;charset="utf-8"
Accept: text/xml
Cache-Control: no-cache
Pragma: no-cache
Content-length: 610
* upload completely sent off: 610 out of 610 bytes
< HTTP/1.1 400 Bad Request
< Content-Length: 0
< Server: Microsoft-HTTPAPI/2.0
< WWW-Authenticate: Negotiate
< Date: Thu, 27 Sep 2018 09:54:05 GMT
< 
* Connection #0 to host {serverIp} left intact

it is most likely that your request is not correct. In this case, you should make sure that:

  • your request headers are correct
  • you are not missing an envelope if you are using a protocol that requires one
  • all authentication data is correct
  • your request is correctly formatted
  • all your data is in the correct format
  • all the constants are correct
  • the URL or method name is correct
  • you are not missing any required fields
  • you have compared everything to documentation.
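Part of this checklist can be automated before the request ever leaves your machine. The sketch below is one possible way to do it in Python: it only covers missing headers and missing or empty fields, and the function name, header names, and field names are all invented for illustration, not taken from any particular service.

```python
from typing import Any, Iterable, List, Mapping


def find_request_problems(
    headers: Mapping[str, str],
    fields: Mapping[str, Any],
    required_headers: Iterable[str],
    required_fields: Iterable[str],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # HTTP header names are case-insensitive, so compare them in lower case.
    present = {name.lower() for name in headers}
    for name in required_headers:
        if name.lower() not in present:
            problems.append(f"missing header: {name}")
    for name in required_fields:
        value = fields.get(name)
        if value is None or value == "":
            problems.append(f"missing or empty field: {name}")
    return problems
```

The lists of required headers and fields would come from the service's documentation, which is also a good way to force yourself to compare the request against it line by line.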

If all of the above checks out and the problem still persists, the documentation may be out of date or something may have changed in the request structure. In this case, send the request and response data to customer service and ask for their advice.

Internal server error

Depending on how the system handles errors, this error can mean the same as the Bad request error, so make sure to check the list above. If the Internal server error appears suddenly, on a previously working request or on all requests to that service, the cause is most probably an update on the external system.

It can also be caused by sending faulty data to the service. If you receive an error text that has some relation to the data you sent, the data is probably wrong. But if the error text has nothing to do with your request (or is always the same, no matter what you send), it is most likely not your fault, and you should inform customer service.
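One way to test the "always the same, no matter what you send" case is to send a few deliberately different payloads and compare the error texts. The following Python sketch assumes you already have some function that sends a payload and returns the error text; here it is passed in as `send`, and both that callable and the function name are hypothetical.

```python
from typing import Callable, Iterable


def error_depends_on_input(send: Callable[[str], str], payloads: Iterable[str]) -> bool:
    """Return True if different payloads produce different error texts.

    Caveat: if the service embeds timestamps or request IDs in its errors,
    strip those out in `send` first, or every response will look unique.
    """
    error_texts = {send(payload) for payload in payloads}
    return len(error_texts) > 1
```

A result of False suggests the error is unrelated to your data, which is useful evidence to include when you contact customer service.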

Tips & tricks

Listed below are random things I found useful while working with external APIs and services:

  • If you are using an SDK or a library to process your requests and responses, there is usually validation logic somewhere in its code. Find it and use it as a guide when building your requests; it will prevent a lot of mistakes.
  • Research every possibility and try every trick before you send a question to customer service. You can solve a lot of problems on your own.
  • Most of the problems with external services, for me, ended up being connection issues or errors on the service side. Don’t be afraid to be a bit of a nuisance to customer service until you solve the error.
  • Log requests and responses, especially on the production system; they can be very useful when an error happens.
  • Read the documentation of the service or API thoroughly. Important details are easy to miss if you just scroll through it, so I found it helps to highlight them as you go.
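As an illustration of the logging tip, here is a small Python sketch that wraps whatever function actually sends the request. The `send` callable, the logger name, and the choice to mask the `Authorization` and `Cookie` headers are assumptions made for this example; adapt them to your own client and to what you are allowed to store in production logs.

```python
import logging
from typing import Callable, Mapping, Tuple

logger = logging.getLogger("external_api")

# Headers whose values should not end up in log files (an assumption; extend as needed).
SENSITIVE_HEADERS = {"authorization", "cookie"}


def redact(headers: Mapping[str, str]) -> dict:
    """Return a copy of `headers` with sensitive values masked."""
    return {k: ("***" if k.lower() in SENSITIVE_HEADERS else v) for k, v in headers.items()}


def logged_call(
    send: Callable[[str, str, Mapping[str, str], str], Tuple[int, str]],
    method: str,
    url: str,
    headers: Mapping[str, str],
    body: str,
) -> Tuple[int, str]:
    """Call `send`, logging the request before and the response (or failure) after."""
    logger.info("request %s %s headers=%s body=%s", method, url, redact(headers), body)
    try:
        status, response_body = send(method, url, headers, body)
    except Exception:
        # Log the traceback, then let the caller handle the error as usual.
        logger.exception("request %s %s failed", method, url)
        raise
    logger.info("response %s %s status=%s body=%s", method, url, status, response_body)
    return status, response_body
```

With both sides of the exchange in the log, you can paste the exact request and response into a support ticket instead of trying to reproduce the error.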

Above are the things I’ve noticed and the solutions or techniques that worked for me. Having a list like this would’ve made my life a lot easier a few months ago, especially when I had a problem that I stared at for hours, not knowing what to do.

Even if you don’t find the cause of the problem by checking the list, you will know what is not the cause. That leaves a shorter list of things to try and debug.