Seemingly random 403 Forbidden

TL;DR
We are receiving HTTP 403 responses with an HTML body, seemingly at random. We haven’t been able to find any pattern in when or why this happens.

Description
We are running a script in a GCP project to pull order and line item data for our customisation station at fulfilment. The basic process is:

  1. An operator scans the totes that have arrived at their desk into our system
  2. Our script identifies the relevant line items in the totes
  3. Our system displays this information to our operator in the format they require

Step 2 is itself quite convoluted, because answering the question “What orders / items are in this tote?” is not trivial through the API (^1, ^2). Finding all the relevant information therefore requires several different queries.

What we tried
Our system implements some error handling: when we receive an error in the expected JSON response, we wait the appropriate amount of time before resubmitting the query. Running out of credits should therefore not be the issue.

Re-running our script a couple of times does eventually “resolve” the problem (i.e. the HTTP 403 response with an HTML body doesn’t appear in that particular run), so authentication should not be the issue.

We can receive the unexpected response on any of the queries, and re-running the script sometimes does not reproduce the error, so an incorrectly formatted query should not be the issue.

This system has been operational for three months without a single instance of this issue, until a few weeks ago when it started popping up semi-regularly.

The error

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>

Meta
There’s a recent topic that could potentially be related, but as there isn’t a lot of information in there, I thought it better to create a new topic.

Before implementing a brute-force solution whereby we’d automatically re-run the script every time an unexpected HTML response is received, I was hoping someone here could provide insight into how / why this might be happening. If any further info is required to help with the investigation, please let me know!
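For anyone landing here with the same symptom, the brute-force fallback mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the `send` callable, the `(status, content_type, body)` tuple and the retry limits are hypothetical and not part of our actual script or of any ShipHero client.

```python
import time
from typing import Callable, Tuple

# Hypothetical response shape: (HTTP status, Content-Type header, body text).
Response = Tuple[int, str, str]


def is_html_403(status: int, content_type: str) -> bool:
    """True for the HTML 403 described above, as opposed to a JSON API reply."""
    return status == 403 and content_type.lower().startswith("text/html")


def send_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns something other than the HTML 403.

    The delay doubles after each failed attempt (1s, 2s, 4s, ...). The last
    response is returned even if it is still a 403, so the caller can log it.
    """
    response = send()
    for attempt in range(1, max_attempts):
        status, content_type, _ = response
        if not is_html_403(status, content_type):
            break
        sleep(base_delay * 2 ** (attempt - 1))
        response = send()
    return response
```

Retrying a single request rather than the whole script keeps the cost of a stray 403 small, but it only hides the symptom; it does not explain where the 403 comes from.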

Hey @Gergely,

Thanks for reaching out!
I think you’re absolutely right that it can’t be an improperly formatted query. I’d like to take a deeper look at this. Do you by chance have the request IDs of any of the queries returning that 403 error?

Please let me know if there’s anything I can do to assist!

Best,
RayanP

Hi Rayanp,
Thanks for the quick reply!

Unfortunately, there’s no JSON data in the response, so there’s no request_id field in there either. In the response body we only receive the basic HTML quoted above. I believe the connection might be denied by the load balancer / WAF before the application server has a chance to handle the GraphQL query at all.

Here are the headers from that response as well:

  Server: 'awselb/2.0',
  Content-Length: '118',
  Connection: 'keep-alive',
  Content-Type: 'text/html',
  Date: 'Wed, 09 Nov 2022 12:40:52 GMT'

And this is all the info we get in the response.

Hey @Gergely,

Just wanted to let you know I’m still looking into this issue and trying to see exactly what’s causing these 403s.
I’ll have an update for you shortly.

Let me know if you have any questions or concerns in the meantime!

Best,
RayanP


Hey @Gergely,

Thank you for hanging in there while I looked into this.

Almost all of the errors associated with your account in the past two weeks have been a lack of credits error. The query that seems to be returning this error is:

{"query":"query {orders(fulfillment_status: \"Customisation Complete\") {data{edges {node {order_number fulfillment_status line_items(first: 10) {edges {node {tote_picks {tote_id}}}}}}}}}"}

It looks like this query costs 1101 credits, and on average I would say you have around 1080 to 1090 credits when you try to run it, which results in the error. Would it be possible to increase the time between queries? The normal increment rate is 30 credits per second, so just a few more seconds should do it.
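As a rough worked example of that arithmetic, the sketch below computes how long a client would have to wait for a given shortfall. It assumes credits refill linearly at the 30 per second quoted above; the function name and signature are purely illustrative.

```python
import math


def seconds_until_affordable(cost: int, remaining: int, refill_per_second: int = 30) -> int:
    """Whole seconds to wait before `remaining` credits have grown to `cost`.

    Assumes a constant, linear refill rate (an assumption for illustration).
    """
    shortfall = cost - remaining
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / refill_per_second)
```

With a cost of 1101 and 1080 credits remaining, the shortfall is 21 credits, so one extra second of waiting would be enough under this assumption.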

Again, thank you for your patience. Let me know if there’s anything I can do to assist!

Best,
RayanP

Hi Rayanp,

Thank you for looking into this, your efforts are very much appreciated!
However, I’m afraid the issue lies somewhere else this time, and not with the lack of credits.

I’m quite confident in this, for a few reasons:

  1. Our system already implements reliable error handling for the lack of credits error.
    According to the GraphQL API docs, ShipHero returns an error with "code": 30 when the user runs out of credits, and this is indeed what we would normally get in the situation that you mentioned. We do in fact receive this response from time to time, and when we do, it comes with the HTTP 200 status. Our system then parses the number of seconds from "time_remaining", adds a little extra time for good measure, and waits that long before sending the query again. (Example: we didn’t have enough credits for request_id: 63739557b4d88e2a3d34ec95, so we waited and re-submitted a copy as request_id: 6373955dada88deded131ac5.) Here’s an example of what the expected response looks like (as opposed to the plain HTML quoted in the OP):
{
  "errors": [
    {
      "code": 30,
      "message": "There are not enough credits to perform the requested operation, which requires 1101 credits, but the are only 1039 left. In 3 seconds you will have enough credits to perform the operation",
      "operation": "orders",
      "request_id": "63736536691d7637a920f06b",
      "required_credits": 1101,
      "remaining_credits": 1039,
      "time_remaining": "3 seconds"
    }
  ],
  ...
}
  2. The failing query is sometimes the first one in hours.
    We also never submit queries that would be more expensive than our account limit. (Even if we did, our system would throw and handle an appropriate exception.)
  3. The issue is just as likely to happen on “cheap” queries when credits are full.
    There’s a query with "estimated_complexity": 2 and "cost": 2 that returned without a problem. In the response’s user_quota extension we can see the field "credits_remaining": 2000. You can see this as well under "request_id": 63738a3cf60d3047a0f485a0. What you won’t be able to see is that we received the HTTP 403, with the previously quoted headers, for the exact same 2-credit query just 3 seconds earlier. How could the reason for the 403 be a lack of credits when, 3 seconds later, a complexity=2 query confirms that our credits are full? There were no requests made in between.
  4. The 403 Forbidden is received from a different server than the API responses.
    I believe your API controller doesn’t even get a chance to evaluate the credit cost of our query, as our requests likely never reach the API endpoint. I am basing this on the Server header in the response:

403 Headers:

{
  "Date": "Tue, 15 Nov 2022 13:34:08 GMT",
  "Connection": "keep-alive",
  "Server": "awselb/2.0",
  "Content-Type": "text/html",
  "Content-Length": "118"
}

API response headers:

{
  "X-Content-Type-Options": "nosniff",
  "Strict-Transport-Security": "max-age=5184000; includeSubDomains",
  "Server": "nginx",
  "Transfer-Encoding": "chunked",
  "Content-Encoding": "gzip",
  "Connection": "keep-alive",
  "Pragma": "no-cache",
  "Content-Type": "application/json",
  "Expires": "0",
  "Cache-Control": "no-cache",
  "X-XSS-Protection": "1; mode=block",
  "Date": "Tue, 15 Nov 2022 13:34:16 GMT",
  "X-Frame-Options": "sameorigin"
}
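To make the distinction between the two kinds of response concrete, here is a minimal sketch of how a client might tell them apart. The function names and the one-second padding are hypothetical; the header values, the "code": 30 error and the "time_remaining" field are taken from the responses quoted in this thread, and the "awselb" check is only a heuristic based on the Server header we observed.

```python
import json
import re
from typing import Mapping, Optional


def came_from_load_balancer(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic: an HTML 403 whose Server header names the AWS load balancer."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    return (
        status == 403
        and lowered.get("server", "").startswith("awselb")
        and lowered.get("content-type", "").startswith("text/html")
    )


def credit_wait_seconds(body: str, padding: int = 1) -> Optional[int]:
    """Seconds to wait if body is a code-30 credit error, else None."""
    try:
        payload = json.loads(body)
    except ValueError:
        return None  # not JSON at all, e.g. the HTML 403 page
    if not isinstance(payload, dict):
        return None
    for error in payload.get("errors") or []:
        if isinstance(error, dict) and error.get("code") == 30:
            # "time_remaining" looks like "3 seconds"; take the leading number.
            match = re.search(r"\d+", str(error.get("time_remaining", "")))
            if match:
                return int(match.group()) + padding
    return None
```

The point of the split is that the credit error arrives as JSON from the API itself and carries its own wait time, whereas the HTML 403 carries nothing the client can act on.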

I’d be happy to find and provide more evidence that this is not caused by a lack of credits, if you think more information is necessary. Based on the difference in the response headers, I think the issue might lie somewhere in your infra. Unfortunately, there isn’t much information to parse out of the 403 on this end. If I had to guess, I’d say your WAF is blocking a certain GCP IP address (here’s a full list of possible IP addresses, randomly allocated for API calls), or, considering the potentially increased traffic with the approaching peak season, some of your systems might not be scaling properly. Of course, it could be any number of other reasons, but there isn’t much visibility on our end. In either case, I was hoping you (or someone from your infra team) could shed more light on why this might be happening and hopefully fix this issue quickly.

As previously mentioned, our system had been live for nearly 4 months before this started happening randomly: sporadically at first, now with increasing frequency. This error is severely impacting our ability to fulfil orders with customisable line items, and we’re concerned that we have neither the capacity nor the time to adapt our process to work around this issue before the expected uptick in volume for Christmas.


Hey @Gergely,

Thank you for the detailed response!

For the most part, I was only able to see errors that resulted from a lack of credits in that part of our logs. As you mentioned, though, if this was the first query run after some time, the error couldn’t have been caused by that. I’ll continue looking into this, and thank you for the tip regarding our WAF.

Please let me know if you have any questions or concerns.

Best,
RayanP


Hey @Gergely,

Going to have to escalate this one to our Engineering Team. I’ll let you know when a fix has been pushed.

Thank you again for the detailed information, it’s super helpful!
Let me know if there’s anything I can do in the meantime to assist.

Best,
RayanP
