endCursor in pageInfo is unique?

heyjess · December 12, 2020, 7:05pm

Hi
It looks like endCursor is unique. Does that mean that if I request shipments data with a fixed order_date_from (one that never changes) plus an endCursor, I will get the correct data?
For example, order_date_from is always 2018-10-11, and I pull 10 shipments at a time using each endCursor from the pageInfo.

I just wonder: if endCursor is not unique, could the same endCursor exist under different date settings?

jeremyw · December 16, 2020, 12:52pm

Hey, uh, heyjess!

Are you talking about using the same exact query, day after day, and only changing the “after” field for the cursor as you need more records? If so, that’s one of those things that can usually work, but not always.

The reason is that the cursors are tied to both the query and the data. As long as everything stays consistent, you may have some luck getting it to work that way, but if anything ever changes in the data you’re pulling or the way it’s sorted, the cursors will also change. Cursors are not guaranteed to match the same record over the long haul.

Also, there is a performance aspect to this that could get in the way. The more data that gets pulled into the “main” part of the query (before your “before” or “after” cursor filter is applied), the slower the query is going to be in returning your results, and you could actually run into a timeout.

When I first started using the ShipHero API, I used to do something similar with the picks_per_day query. I wanted to export new picks every day so I could build a master table on my side for performance analysis of our pickers. I thought the same thing as you, that always keeping the date the same and only changing the cursor would simplify sending the query, but in the end I found that it wasn’t worth it. First, as I mentioned, the response time kept growing, and eventually I kept running into timeouts. Second, I did run into occasions where there were either gaps or overlaps in the data I was pulling because something had changed in the data (I’m not sure what the change was, I just know that it caused a misalignment of the endCursors I was using), and that resulted in inaccurate analysis. Finally, if you ever lose the endCursor you used last, you don’t have a great way of picking up where you left off.

The recommended way is to narrow down the range you want to look at using the date fields. FYI, the date doesn’t have to be “just a date”. You can put a full timestamp there. So you can start with 2020-12-15, which is the equivalent of 2020-12-15 00:00:00, or if you know the last order timestamp you previously pulled was 2020-12-15 16:58:23, you can use that (or add a second) in your “from” date and continue from there.
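
To illustrate the "continue from the last timestamp" idea, here is a minimal Python sketch. The function name `next_date_from` and the `skip_last` flag are made up for this example; the only thing taken from the thread is that the "from" date accepts a full timestamp.

```python
from datetime import datetime, timedelta


def next_date_from(last_created_at: str, skip_last: bool = True) -> str:
    """Turn the newest timestamp already stored into the next "from" value.

    skip_last=True adds one second so the last stored record is not
    fetched again; skip_last=False re-requests that second instead.
    """
    last = datetime.fromisoformat(last_created_at)
    if skip_last:
        last += timedelta(seconds=1)
    # Format as "YYYY-MM-DD HH:MM:SS", the timestamp style used in this thread.
    return last.strftime("%Y-%m-%d %H:%M:%S")


print(next_date_from("2020-12-15 16:58:23"))  # 2020-12-15 16:58:24
```

One thing to weigh: adding a second could skip other records that share the same second as the last one you stored, while not adding it means you will see that last record again and need to de-duplicate by id on your side.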

I’m not 100% sure if that’s what you were looking for, and anyone that knows more details about how the cursors tie to the query and data is welcome to chime in, but I hope that helps, at least somewhat.

Regards,
Jeremy

heyjess · December 16, 2020, 3:00pm

I really appreciate your reply.
It helps me get a better understanding about endCursor.

My company has a large amount of data in ShipHero.
Since ShipHero changed to GraphQL, we haven’t pulled data from ShipHero for a while.
There is too much data to pull in one day.

For example, we need to pull ShipHero data from 2020-04-01 up to now.
order_date_from is set to 2020-04-01.
I pull 10 shipments at a time, save the data into our DB, also save the endCursor, use that endCursor for the next request, and so on. Let’s say I get 10000 records today.
The next day, order_date_from is still 2020-04-01, and I use the endCursor saved in our DB yesterday (from the last pageInfo) to start pulling data again from that point.

So, it means that I depend on endcursor to pull data from ShipHero.
I hope I can get all the data without missing a single record.

jeremyw · December 16, 2020, 7:57pm

Hi @heyjess,

First of all, I’m sorry for the huge wall of text.

Thanks for letting me know some of the details of what you’re up against. I’ve been there, too, so I get where you’re coming from.

I think you would be better served if you break up your workload into chunks, whether you want to do hours, days or weeks, and work from there. Since you can’t grab everything from April through today in one day, doing that lets you keep better tabs on where you left off without relying on the endCursor and risking missing or duplicating anything.

I’m going to use the picks_per_day query as an example since it’s a pretty flat data set (no nested lines or anything like that) and I can make the example pretty simple, but the same idea should apply to pulling shipments. You would just need to adjust your record requests if you’re pulling lines along with the shipments.

Anyway, for my company, we average about 5k picks per day. If I want to get that data all the way from 2020-04-01, I’d probably want to work on a day-by-day basis. So, when sending the query, I’d set the start and end dates for that day.

query {
    picks_per_day (date_from:"2020-04-01", date_to:"2020-04-02") {
        request_id
        complexity
        data (first:100) {
            pageInfo {
                hasPreviousPage
                hasNextPage
                endCursor
            }
            edges {
                cursor
                node {
                    id
                    created_at
                    # ...
                }
            }
        }
    }
}

I’d do the normal pagination, 100 records at a time using the endCursor loops, until I get to the end of the day, saving the data to my database. Then I’d tweak the query to have the date range for the next day, and do it again:

query {
    picks_per_day (date_from:"2020-04-02", date_to:"2020-04-03") {
    #...
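
Generating those day-by-day ranges can be automated. The following is a rough Python sketch, not part of any ShipHero tooling; `day_windows` is a hypothetical helper name. Each window's date_to is the next window's date_from, matching the two queries above.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def day_windows(start: date, end: date) -> Iterator[Tuple[str, str]]:
    """Yield (date_from, date_to) strings, one pair per day, from start up to end."""
    current = start
    while current < end:
        following = current + timedelta(days=1)
        yield current.isoformat(), following.isoformat()
        current = following


windows = list(day_windows(date(2020, 4, 1), date(2020, 4, 4)))
for date_from, date_to in windows:
    print(date_from, date_to)
# 2020-04-01 2020-04-02
# 2020-04-02 2020-04-03
# 2020-04-03 2020-04-04
```

If you store the date_to of the last window you finished, that value is all you need to resume the next day; no cursor has to survive between runs.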

It’s worth noting that the date_from criteria translates to “greater than or equal to”, but the date_to criteria is “less than”. That means that a query with date_from 2020-04-01 and date_to 2020-04-02 would actually pull everything from 2020-04-01 00:00:00 up through 2020-04-01 23:59:59, but it will not cross over to 2020-04-02 00:00:00. For a real-world example of this, I ran the query below. Note that I’m requesting the first 5, but only receiving 3 in the results. That’s because only 3 picks happened in the 6-second window I gave the query.

query {
    picks_per_day (date_from:"2020-12-15 11:15:18", date_to:"2020-12-15 11:15:24") {
        request_id
        complexity
        data (first:5, sort: "created_at") {
            pageInfo {
                hasPreviousPage
                hasNextPage
                endCursor
            }
            edges {
                cursor
                node {
                    id
                    created_at
                }
            }
        }
    }
}

My results:

{
  "data": {
    "picks_per_day": {
      "request_id": "5fda56dcdeba16eb4f1e7e28",
      "complexity": 6,
      "data": {
        "pageInfo": {
          "hasPreviousPage": false,
          "hasNextPage": false,
          "endCursor": "YXJyYXljb25uZWN0aW9uOjI="
        },
        "edges": [
          {
            "cursor": "YXJyYXljb25uZWN0aW9uOjA=",
            "node": {
              "id": "UGlja1Jlc3VsdDo3OTAyNTU1Nw==",
              "created_at": "2020-12-15T11:15:18+00:00"
            }
          },
          {
            "cursor": "YXJyYXljb25uZWN0aW9uOjE=",
            "node": {
              "id": "UGlja1Jlc3VsdDo3OTAyNTU1OQ==",
              "created_at": "2020-12-15T11:15:20+00:00"
            }
          },
          {
            "cursor": "YXJyYXljb25uZWN0aW9uOjI=",
            "node": {
              "id": "UGlja1Jlc3VsdDo3OTAyNTU2Mw==",
              "created_at": "2020-12-15T11:15:22+00:00"
            }
          }
        ]
      }
    }
  }
}

Notice my last “created_at” is less than the date_to field from the query. If I change the times and move my old date_to timestamp to date_from:

query {
    picks_per_day (date_from:"2020-12-15 11:15:24", date_to:"2020-12-15 11:15:59") {
    ...

my results now include the pick that happened on that timestamp:

{
  "data": {
    "picks_per_day": {
      "request_id": "5fda608c1902064733a02f0c",
      "complexity": 6,
      "data": {
        "pageInfo": {
          "hasPreviousPage": false,
          "hasNextPage": false,
          "endCursor": "YXJyYXljb25uZWN0aW9uOjI="
        },
        "edges": [
          {
            "cursor": "YXJyYXljb25uZWN0aW9uOjA=",
            "node": {
              "id": "UGlja1Jlc3VsdDo3OTAyNTU2Nw==",
              "created_at": "2020-12-15T11:15:24+00:00" # << THIS ONE
            }
          },
          {
            "cursor": "YXJyYXljb25uZWN0aW9uOjE=",
            "node": {
              "id": "UGlja1Jlc3VsdDo3OTAyNTU2OQ==",
              "created_at": "2020-12-15T11:15:26+00:00"
            }
          },
          {
            "cursor": "YXJyYXljb25uZWN0aW9uOjI=",
            "node": {
              "id": "UGlja1Jlc3VsdDo3OTAyNTU4Nw==",
              "created_at": "2020-12-15T11:15:52+00:00"
            }
          }
        ]
      }
    }
  }
}

Also, notice the endCursor for this set versus the endCursor for the first set. They’re exactly the same!

YXJyYXljb25uZWN0aW9uOjI= (from the first result)
YXJyYXljb25uZWN0aW9uOjI= (from the second result)
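
The cursor values in these two results look like base64, and decoding them hints at why they collide. The short Python sketch below decodes the sample values shown above; `decode_cursor` is a made-up helper name, and what it reveals is an observation about these particular values, not a documented guarantee of how the API builds cursors.

```python
import base64


def decode_cursor(cursor: str) -> str:
    """Decode a base64 cursor string into readable text."""
    return base64.b64decode(cursor).decode("utf-8")


# The three edge cursors from the results above (the last one is also the endCursor).
sample_cursors = [
    "YXJyYXljb25uZWN0aW9uOjA=",
    "YXJyYXljb25uZWN0aW9uOjE=",
    "YXJyYXljb25uZWN0aW9uOjI=",
]
decoded = [decode_cursor(c) for c in sample_cursors]
print(decoded)
# ['arrayconnection:0', 'arrayconnection:1', 'arrayconnection:2']
```

In these samples the cursor appears to encode a position within the result set (0, 1, 2) rather than anything about the record itself, which would explain why two different date ranges produce the same endCursor and why a saved cursor can point at a different record once the underlying result set shifts.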

Finally, as a side note, you probably know this already, but changing the dates/cursors directly in the query isn’t the best way to do that. You would instead use variables in the queries, and then send the variable JSON along with the query request.

A shipment query with variables defined:

query(
    # These define the variables to be used
    $mystartdate: ISODateTime,
    $myenddate: ISODateTime,
    $cursor: String,
    $record_count: Int) {
  shipments (
    order_date_from: $mystartdate
    order_date_to: $myenddate
  ) {
    complexity
    request_id
    data (after: $cursor, first: $record_count, sort: "created_date") {
    ...

The variables JSON (leave out "cursor" completely on the first send with each date range):

{
    "mystartdate": "2020-04-01",
    "myenddate": "2020-04-02",
    "record_count": 100,
    "cursor": "yourcursorid"
}
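
Putting the variables and the endCursor loop together for one date range could look roughly like the Python sketch below. It is only an outline under stated assumptions: `run_query` is a hypothetical callable that you would write to send the query plus variables and return the parsed `data` connection (the object holding `pageInfo` and `edges`), and `fake_run_query` is a stand-in that exists only so the sketch runs without network access.

```python
from typing import Any, Callable, Dict, List, Optional

Variables = Dict[str, Any]
# Hypothetical: sends query + variables, returns the "data" connection object.
RunQuery = Callable[[Variables], Dict[str, Any]]


def fetch_window(run_query: RunQuery, start: str, end: str,
                 page_size: int = 100) -> List[Dict[str, Any]]:
    """Page through a single date range, following endCursor until the last page."""
    nodes: List[Dict[str, Any]] = []
    cursor: Optional[str] = None
    while True:
        variables: Variables = {
            "mystartdate": start,
            "myenddate": end,
            "record_count": page_size,
        }
        if cursor is not None:  # omit "cursor" entirely on the first request
            variables["cursor"] = cursor
        connection = run_query(variables)
        nodes.extend(edge["node"] for edge in connection["edges"])
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            return nodes
        cursor = page_info["endCursor"]


def fake_run_query(variables: Variables) -> Dict[str, Any]:
    """Stand-in for a real API call: serves 5 fake records, position-based cursors."""
    records = [{"id": n} for n in range(5)]
    first = int(variables.get("cursor", "-1")) + 1
    page = records[first:first + variables["record_count"]]
    last = first + len(page) - 1
    return {
        "pageInfo": {"hasNextPage": last < len(records) - 1, "endCursor": str(last)},
        "edges": [{"cursor": str(first + i), "node": n} for i, n in enumerate(page)],
    }


result = fetch_window(fake_run_query, "2020-04-01", "2020-04-02", page_size=2)
print(len(result))  # 5
```

The cursor here only lives for the duration of one date range, so nothing breaks if it is lost between runs; the date range itself is the checkpoint.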

Anyway, I hope this helps. I know the date changing feels tedious, but in the long run it really will help to ensure the integrity of the results coming out of the API by lessening the chances that something will change in the underlying data, causing your endCursor to get out of whack.

Regards,
Jeremy

heyjess · December 17, 2020, 6:31pm

Thank you very much!
I really appreciate your post and your time!
Everything is super clear.
