Hi Rayanp,
Thank you for looking into this; your efforts are very much appreciated!
However, I’m afraid the issue lies somewhere else this time, not with a lack of credits.
I’m quite confident in this, for a few reasons:
- Our system already implements reliable error handling for the lack-of-credits error.
According to the GraphQL API docs, ShipHero returns an error with "code": 30 when the user runs out of credits, and that is indeed what we would normally get in the situation you mentioned. We do receive this response from time to time, and when we do, it comes with an HTTP 200 status. Our system then parses the number of seconds from "time_remaining", adds an extra bit of time for good measure, and waits that long before re-sending the query. (Example: we didn’t have enough credits for request_id: 63739557b4d88e2a3d34ec95, so we waited and re-submitted a copy as request_id: 6373955dada88deded131ac5.) A simplified sketch of this retry logic follows the example response. Here’s what the expected response looks like (as opposed to the plain HTML quoted in the OP):
{
  "errors": [
    {
      "code": 30,
      "message": "There are not enough credits to perform the requested operation, which requires 1101 credits, but the are only 1039 left. In 3 seconds you will have enough credits to perform the operation",
      "operation": "orders",
      "request_id": "63736536691d7637a920f06b",
      "required_credits": 1101,
      "remaining_credits": 1039,
      "time_remaining": "3 seconds"
    }
  ],
  ...
}
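For reference, here is a minimal sketch of how that handling works on our side (Python). It is illustrative only: send_query is a hypothetical callable that performs the HTTP request and returns the parsed JSON body, and BUFFER_SECONDS is an arbitrary value, not our production configuration.

import re
import time

BUFFER_SECONDS = 2  # the extra bit of time "for good measure" (illustrative value)

def execute_with_credit_backoff(send_query, query, variables=None, max_retries=5):
    """Run a GraphQL query and wait out ShipHero's code-30 credit errors."""
    for _ in range(max_retries + 1):
        body = send_query(query, variables)  # parsed JSON of a 200 response
        credit_error = next(
            (e for e in body.get("errors") or [] if e.get("code") == 30), None
        )
        if credit_error is None:
            return body  # success, or an unrelated error handled elsewhere

        # "time_remaining" looks like "3 seconds"; fall back to 1s if the format changes
        match = re.search(r"\d+", str(credit_error.get("time_remaining", "")))
        wait_seconds = int(match.group()) if match else 1
        time.sleep(wait_seconds + BUFFER_SECONDS)

    raise RuntimeError("Still short on credits after retrying")

The point is simply that a 200 response carrying a code 30 error is handled gracefully; the 403 in question never gets this far, because its body is plain HTML rather than JSON.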
- The failing query is sometimes the first one we’ve sent in hours, so it cannot be earlier requests that drained the credits. We also never submit queries that would be more expensive than our account limit (and even if we tried, our system would throw and handle an appropriate exception; a rough sketch of that check follows this point).
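To illustrate that second point, here is roughly the kind of pre-submission check I mean (Python). MAX_ACCOUNT_CREDITS and estimate-related names are hypothetical labels from our side, not ShipHero API features; 2000 is simply the full-quota figure we see in the user_quota extension.

class QueryTooExpensiveError(Exception):
    """Raised before submitting a query whose estimated cost exceeds the account limit."""

MAX_ACCOUNT_CREDITS = 2000  # full-quota figure reported back in user_quota

def guard_query_cost(estimated_complexity: int) -> None:
    # estimated_complexity comes from our own cost model for the query,
    # mirroring the "estimated_complexity" field ShipHero reports back
    if estimated_complexity > MAX_ACCOUNT_CREDITS:
        raise QueryTooExpensiveError(
            f"Query needs {estimated_complexity} credits; "
            f"account limit is {MAX_ACCOUNT_CREDITS}"
        )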
- The issue is just as likely to happen on “cheap” queries when credits are full.
There’s a query with "estimated_complexity": 2 and "cost": 2 that returned without a problem. In the response’s user_quota extension you can see "credits_remaining": 2000; you can verify this yourself under "request_id": 63738a3cf60d3047a0f485a0. What you won’t be able to see is that, for the exact same 2-credit query, we received the HTTP 403 with the previously quoted headers just 3 seconds earlier. How could the reason for the 403 be a lack of credits when, 3 seconds later, a complexity-2 query confirms that our credits are full? There were no requests made in between.
- The 403 Forbidden is served by a different system than the API responses.
I believe your API controller never even gets a chance to evaluate the credit cost of our query, because our requests most likely never reach the API endpoint at all. I’m basing this on the Server header in the two kinds of responses, compared below (a short sketch of how we tell them apart follows the header dumps):
403 Headers:
{
  "Date": "Tue, 15 Nov 2022 13:34:08 GMT",
  "Connection": "keep-alive",
  "Server": "awselb/2.0",
  "Content-Type": "text/html",
  "Content-Length": "118"
}
API response headers:
{
  "X-Content-Type-Options": "nosniff",
  "Strict-Transport-Security": "max-age=5184000; includeSubDomains",
  "Server": "nginx",
  "Transfer-Encoding": "chunked",
  "Content-Encoding": "gzip",
  "Connection": "keep-alive",
  "Pragma": "no-cache",
  "Content-Type": "application/json",
  "Expires": "0",
  "Cache-Control": "no-cache",
  "X-XSS-Protection": "1; mode=block",
  "Date": "Tue, 15 Nov 2022 13:34:16 GMT",
  "X-Frame-Options": "sameorigin"
}
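To make the distinction concrete, this is roughly how we classify the failures on our side (Python, using the requests library); classify_response is our own hypothetical helper, not anything from your API.

import requests

def classify_response(resp: requests.Response) -> str:
    """Rough heuristic for telling API-level errors from infra-level blocks."""
    server = resp.headers.get("Server", "")
    content_type = resp.headers.get("Content-Type", "")

    if resp.status_code == 403 and server.startswith("awselb"):
        # Plain-HTML 403 straight from the AWS load balancer layer;
        # the request apparently never reached the GraphQL application
        return "blocked-before-api"
    if "application/json" in content_type and server == "nginx":
        # nginx-served JSON, including 200 responses carrying "code": 30 errors
        return "api-response"
    return "unknown"

Under this heuristic, the failing requests we’ve seen all fall into the first bucket, which is why I suspect the load balancer / WAF layer rather than the credit system.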
I’d be happy to find and provide more evidence supporting my view that this is not caused by a lack of credits, if you think more information is necessary. Based on the difference in the response headers, I think the issue might lie somewhere in your infrastructure. Unfortunately, there isn’t much information to parse out of the 403 on our end; if I had to guess, I’d say your WAF is blocking a certain GCP IP address (here’s a full list of possible IP addresses, randomly allocated for API calls), or, considering the potentially increased traffic with the approaching peak season, some of your systems might not be scaling properly. Of course, it could be any number of other reasons, but there isn’t much visibility on our end. In either case, I was hoping you (or someone from your infra team) could shed more light on why this might be happening and hopefully fix the issue quickly.
As previously mentioned, our system had been live for nearly 4 months before this started happening: sporadically at first, now with increasing frequency. The error is severely impacting our ability to fulfil orders with customisable line items, and we’re concerned that we have neither the capacity nor the time to adapt our process to work around it before the expected uptick in volume for Christmas.