Hi @heyjess,
First of all, I’m sorry for the huge wall of text.
Thanks for letting me know some of the details of what you’re up against. I’ve been there, too, so I get where you’re coming from.
I think you would be better served if you break up your workload into chunks, whether you want to do hours, days or weeks, and work from there. By doing that, since you can’t grab everything from April through today in one day, you can keep a better tab of where you left off without worrying relying on the endCursor and risk missing or duplicating anything.
I’m going to use the picks_per_day query as an example since it’s a pretty flat data set (no nested lines or anything like that) and I can make the example pretty simple, but the same idea should apply to pulling shipments. You would just need to adjust your record requests if you’re pulling lines along with the shipments.
Anyway, for my company, we average about 5k picks per day. If I want to get that data all the way from 2020-04-01, I’d probably want to work on a day-by-day basis. So, when sending the query, I’d set the start and end dates for that day.
query {
picks_per_day (date_from:"2020-04-01", date_to:"2020-04-02") {
request_id
complexity
data (first:100) {
pageInfo {
hasPreviousPage
hasNextPage
endCursor
}
edges {
cursor
node {
id
created_at
# ...
}
}
}
}
}
I’d do the normal pagination, 100 records at a time using the endCursor loops, until I get to the end of the day, saving the data to my databbase. Then I’d tweak the query to have the date range for the next day, and do it again:
query {
picks_per_day (date_from:"2020-04-02", date_to:"2020-04-03") {
#...
It’s worth noting that the date_from criteria translates to “greater than or equal to”, but the date_to criteria is “less than”. That means that the query would actually pull everything from 2020-04-01 00:00:00 up through 2020-04-01 23:59:59, but it will not cross over to 2020-04-02 00:00:00. For a real-world example of this, I ran the query below. Note that I’m requesting the first 5, but only receiveing 3 in the results. That’s because in the 6-second window I gave the query, only 3 picks happened in that time.
query {
picks_per_day (date_from:"2020-12-15 11:15:18", date_to:"2020-12-15 11:15:24") {
request_id
complexity
data (first:5, sort: "created_at") {
pageInfo {
hasPreviousPage
hasNextPage
endCursor
}
edges {
cursor
node {
id
created_at
}
}
}
}
}
My results:
{
"data": {
"picks_per_day": {
"request_id": "5fda56dcdeba16eb4f1e7e28",
"complexity": 6,
"data": {
"pageInfo": {
"hasPreviousPage": false,
"hasNextPage": false,
"endCursor": "YXJyYXljb25uZWN0aW9uOjI="
},
"edges": [
{
"cursor": "YXJyYXljb25uZWN0aW9uOjA=",
"node": {
"id": "UGlja1Jlc3VsdDo3OTAyNTU1Nw==",
"created_at": "2020-12-15T11:15:18+00:00"
}
},
{
"cursor": "YXJyYXljb25uZWN0aW9uOjE=",
"node": {
"id": "UGlja1Jlc3VsdDo3OTAyNTU1OQ==",
"created_at": "2020-12-15T11:15:20+00:00"
}
},
{
"cursr": "YXJyYXljb25uZWN0aW9uOjI=",
"node": {
"id": "UGlja1Jlc3VsdDo3OTAyNTU2Mw==",
"created_at": "2020-12-15T11:15:22+00:00"
}
}
]
}
}
}
}
Notice my last “created_at” is less than the date_to field from the query. If I change the times and move my old date_to timestamp to date_from:
query {
picks_per_day (date_from:"2020-12-15 11:15:24", date_to:"2020-12-15 11:15:59") {
...
my results now include the pick that happened on that timestamp:
{
"data": {
"picks_per_day": {
"request_id": "5fda608c1902064733a02f0c",
"complexity": 6,
"data": {
"pageInfo": {
"hasPreviousPage": false,
"hasNextPage": false,
"endCursor": "YXJyYXljb25uZWN0aW9uOjI="
},
"edges": [
{
"cursor": "YXJyYXljb25uZWN0aW9uOjA=",
"node": {
"id": "UGlja1Jlc3VsdDo3OTAyNTU2Nw==",
"created_at": "2020-12-15T11:15:24+00:00" # << THIS ONE
}
},
{
"cursor": "YXJyYXljb25uZWN0aW9uOjE=",
"node": {
"id": "UGlja1Jlc3VsdDo3OTAyNTU2OQ==",
"created_at": "2020-12-15T11:15:26+00:00"
}
},
{
"cursor": "YXJyYXljb25uZWN0aW9uOjI=",
"node": {
"id": "UGlja1Jlc3VsdDo3OTAyNTU4Nw==",
"created_at": "2020-12-15T11:15:52+00:00"
}
}
]
}
}
}
}
Also, notice the endCursor for this set versus the endCursor for the first set. They’re exactly the same!
YXJyYXljb25uZWN0aW9uOjI= (from the first result)
YXJyYXljb25uZWN0aW9uOjI= (from the second result)
Finally, as a side note, you probably know this already, but changing the dates/cursors directly in the query isn’t the best way to do that. You would instead use variables in the queries, and then send the variable JSON along with the query request.
A shipment query with variables defined:
query(
# These define the variables to be used
$mystartdate: ISODateTime,
$myenddate: ISODateTime,
$cursor: String,
$record_count: Int) {
shipments (
order_date_from: $mystartdate
order_date_to: $myenddate
) {
complexity
request_id
data (after: $cursor, first: $record_count, sort: "created_date") {
...
The variables JSON:
{
"mystartdate": "2020-04-01",
"myenddate": "2020-04-02",
"record_count": 100,
"cursor": "yourcursorid" # Remove this completely on the first send with this date range
}
Anyway, I hope this helps. I know the date changing feels tedious, but in the long run it really will help to ensure the integrity of the results coming out of the API by lessening the chances that something will change in the underlying data, causing your endCursor to get out of whack.
Regards,
Jeremy