Technology Inside Out!


Hail GraphQL


We all start with the basics, but at some point we have to upgrade. Why? Because the newer solution to an old problem is often more elegant and much faster! For me, that upgrade was GraphQL. It's a query language that follows you. The mantra? The structure remains the same. Let's see how to get started with GraphQL, using GitHub's API as an example.

Background

I was working on CPython-Pull-Requests, which queries GitHub to show files and the list of PRs opened against them. (Want to check it out? Go here) The previous work by Cheryl Sabella was awesome, but it could do with a few touch-ups, which CuriousLearner, GeekyShacklebolt, and I made. The previous version had one big drawback: it took ~10 minutes to get all the relevant data from GitHub, because GitHub's REST API returned much more data than we required. Then we were advised to use GraphQL. And guess what? The query takes ~27 seconds now!

What is GraphQL

GraphQL is a query language (the QL in GraphQL) used to fetch data from servers. What makes it different is how you query the server: your query has the same structure as the response you want. Because of this, you get only what you ask for. Nothing extra.

GitHub terminologies

Object

An object is a resource that you can access. An object has related connections, and each connection has multiple edges, each leading to a different node. Examples of objects are repositories, issues, gists, blames, etc. Complete list: https://developer.github.com/v4/object/

Connection

A connection lets you query for related objects. For example, the repository object has a list of connections available under it.

Edge

An edge is a link between nodes: you go through an edge to reach a node. Since an edge always takes you to a node, you don't need to specify the edge in your query. Providing it isn't wrong, and it clarifies the meaning, but since it's obvious, it's optional. Like

edges {
  node {
    stuff
  }
}

Example: RepositoryEdge
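To make the edge/node relationship concrete, here is a small sketch in Python. It assumes a connection has already been fetched and decoded into a dict, and that it was queried either through edges { node { ... } } or through the nodes shortcut that GitHub's connections also offer. The helper name nodes_of and the sample data are made up for illustration.

```python
def nodes_of(connection):
    """Return the node objects of a decoded connection.

    Works whether the query asked for `nodes { ... }` directly or went
    through `edges { node { ... } }`.
    """
    if "nodes" in connection:
        return list(connection["nodes"])
    # Each edge wraps exactly one node, so unwrap them one by one.
    return [edge["node"] for edge in connection.get("edges", [])]


# Made-up sample data shaped like the two styles of response.
via_edges = {"edges": [{"node": {"name": "repo-a"}}, {"node": {"name": "repo-b"}}]}
via_nodes = {"nodes": [{"name": "repo-a"}, {"name": "repo-b"}]}

print(nodes_of(via_edges) == nodes_of(via_nodes))  # prints True
```

Either way you end up with the same list of objects, which is why spelling out the edge is optional.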

Nodes

Objects, nodes, and fields are sometimes used interchangeably. What helps me clarify a node is that it is the final part of your query, the part that returns scalars (values with a base data type, like Int, String, or Boolean). Every field you select in a node should resolve to a scalar. If it does not, you have to include subfields until they do. In a node, you specify which fields you require; just remember that the final field returned inside a node should be a scalar.

NOTE: To find the various fields, connections, and nodes, you can refer here: https://developer.github.com/v4/query/

Enough talk. Let's get started!

What better way to understand something new than to see it work and play with it? Let's take the one I used (GitHub). You can use GitHub's explorer to test queries straight away, but it asks for a LOT of permissions to support every type of query. Another way is to make a smaller version yourself, with fewer permissions. Here's a short version that I wrote. For this, you just need an authentication token generated at https://github.com/settings/tokens with no scopes other than the default ones, i.e. you don't need to select any option from the list of scopes provided in settings.

Minimal Explorer

This is the minimal explorer that I wrote. Since we're not taking loads of permissions, we're a little limited, but it's enough for getting started (and for getting big too: CPython-Pull-Requests uses only the default scopes, no additional ones).

So here's our first query.

{
  viewer{
    login
  }
}

Output:

{
  "data": {
    "viewer": {
      "login": "storymode7"
    }
  }
}

(See the similarity? The "data" in the response to our query has exactly the same structure.) viewer is a User object. There is also a user field. The difference is that viewer represents the currently authenticated user, whereas user looks up a user by their login. So, in this case, the response was the login of the currently authenticated user (storymode7).
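If you would rather send this query from your own Python code, the sketch below shows one way to do it with only the standard library. It is not the minimal explorer mentioned above, just an illustration: the function names build_request and run_query are made up, and it assumes GitHub's GraphQL endpoint at https://api.github.com/graphql with the token passed as a bearer token in the Authorization header.

```python
import json
import urllib.request

# Assumption: GitHub's GraphQL API is served from this single endpoint.
GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

VIEWER_QUERY = """
{
  viewer {
    login
  }
}
"""


def build_request(query, token):
    """Build (but do not send) a POST request carrying a GraphQL query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=body,
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_query(query, token):
    """Send the query and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(build_request(query, token)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a real token, run_query(VIEWER_QUERY, token) should return a dict shaped like the output above, with your own login in place of storymode7.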

You can also prepend the query keyword to your query. It becomes necessary once you pass variables to your query.

query {
  rateLimit {
    limit
    remaining
  }
}

Output:

{
  "data": {
    "rateLimit": {
      "limit": 5000,
      "remaining": 5000
    }
  }
}

rateLimit returns a RateLimit object, which contains fields with the rate limit info of the currently logged-in client.
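If you plan to send many queries, these two fields are enough for a quick budget check. The sketch below assumes the response has already been decoded into a Python dict. has_budget and its needed parameter are hypothetical names, and the numbers in the sample are copied from the output above.

```python
def has_budget(response, needed):
    """Return True if `remaining` in a rateLimit response covers `needed`."""
    rate_limit = response["data"]["rateLimit"]
    return rate_limit["remaining"] >= needed


# Sample copied from the rateLimit output shown above.
sample = {"data": {"rateLimit": {"limit": 5000, "remaining": 5000}}}

print(has_budget(sample, 11))  # prints True
```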

Using variables in your query

Like in a programming language, you can use variables to make a query easy to modify. For example, if you want to change the number of results returned by a query, you don't need to edit the query every time; you just change the relevant variable without touching the query at all. To use a variable, you need to declare its type when you write your query. While using variables, it helps to keep this in mind: you can use a field of an object as a parameter, and you can also list that field under nodes to display it. Variables are prefixed with a $ sign wherever they appear in a query. In the variables part, the name is written without the $ and enclosed in "". Example:

query($user_name:String!) {
  user(login: $user_name) {
    repositories(first: 1) {
      nodes{
        name
      }
    }
  }
}

{
  "user_name": "storymode7"
}

Executing this query in GitHub's explorer: paste the variables part (the last part, with "user_name") under QUERY VARIABLES.

Executing this query in my minimal explorer: paste the variables part (the last part, with "user_name") under variables, so that it looks like:

variables = '''\
{
  "user_name": "storymode7"
}
'''

Output:

{
  "data": {
    "user": {
      "repositories": {
        "nodes": [
          {
            "name": "django-init"
          }
        ]
      }
    }
  }
}

Look closely at this line: query($user_name:String!) { The ! is there to make sure that nullability matches, i.e. if the field requires an argument (required types end with a !), then your variable definition should mark the variable as required too! Look up the login field here, and you'll notice it's defined as: login (String!)
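When you send such a query yourself rather than through an explorer, the query text and its variables travel together in one JSON body. Here is a minimal sketch of building that body in Python. build_payload is a hypothetical helper name, and it assumes the usual GraphQL-over-HTTP convention of a JSON object with "query" and "variables" keys.

```python
import json

USER_REPOS_QUERY = """
query($user_name: String!) {
  user(login: $user_name) {
    repositories(first: 1) {
      nodes {
        name
      }
    }
  }
}
"""


def build_payload(query, variables=None):
    """Return the JSON request body for a query and its optional variables."""
    payload = {"query": query}
    if variables is not None:
        # Keys are the variable names without the leading `$`.
        payload["variables"] = variables
    return json.dumps(payload)


body = build_payload(USER_REPOS_QUERY, {"user_name": "storymode7"})
```

Changing the user now means changing only the dict passed as variables; the query string stays untouched.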

Pagination

A query cannot list more than a hundred resources in one request. For example:

query {
  repository(owner:"python", name:"cpython") {
    pullRequests(states: OPEN, first: 101) {
      nodes {
        title
      }
    }
  }
}

Output:

Requesting 101 records on the `pullRequests` connection exceeds the `first` limit of 100 records.

This is where pagination comes in. Variables, the before & after fields, cursors, and Python all come together! Under a connection, you can see there's a pageInfo field. Example: PullRequestConnection.

Since we're paging forward, the fields of interest to us in pageInfo are endCursor & hasNextPage. To get all the info, we keep turning the page until we're on the last page. We can check whether we're on the last page by checking the value of hasNextPage, which is a Boolean. If a next page is available, we copy the endCursor value and update the query for the next request with that endCursor specified in the after field. So now we only get details after those that were at the 'end' of the previous 'page'. In short, to paginate:

* Check hasNextPage (if false, quit)
* Copy endCursor
* Update the after field in the query with endCursor's value
* Send the query
* Repeat
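The steps above can be sketched as a small Python loop. This is an illustration under stated assumptions, not the script used in CPython-Pull-Requests: fetch_all_titles, the send callable, and the variable names ($owner, $name, $page_size, $cursor) are all made up, and fake_send stands in for a real HTTP call with two invented pages so the loop can run offline.

```python
PR_TITLES_QUERY = """
query($owner: String!, $name: String!, $page_size: Int!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    pullRequests(states: OPEN, first: $page_size, after: $cursor) {
      pageInfo {
        endCursor
        hasNextPage
      }
      nodes {
        title
      }
    }
  }
}
"""


def fetch_all_titles(send, owner, name, page_size=100):
    """Collect the nodes of every page.

    `send(query, variables)` must return the decoded response dict.
    """
    nodes = []
    cursor = None  # a null `after` starts from the first page
    page = 0
    while True:
        variables = {"owner": owner, "name": name,
                     "page_size": page_size, "cursor": cursor}
        response = send(PR_TITLES_QUERY, variables)
        connection = response["data"]["repository"]["pullRequests"]
        nodes.extend(connection["nodes"])
        page += 1
        print(f"Page {page} fetched")
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            return nodes
        # Ask the next request to start after the end of this page.
        cursor = page_info["endCursor"]


def fake_send(query, variables):
    """Stand-in transport with two made-up pages, keyed by the cursor sent."""
    pages = {
        None: (["PR one", "PR two"], "cursor-1", True),
        "cursor-1": (["PR three"], "cursor-2", False),
    }
    titles, end_cursor, has_next = pages[variables["cursor"]]
    return {"data": {"repository": {"pullRequests": {
        "pageInfo": {"endCursor": end_cursor, "hasNextPage": has_next},
        "nodes": [{"title": title} for title in titles],
    }}}}


all_nodes = fetch_all_titles(fake_send, "python", "cpython", page_size=2)
```

Passing the transport in as send keeps the paging logic separate from the HTTP details, so the same loop can be tried against canned data before pointing it at the real API.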

Let's see how endCursor, hasNextPage, etc. look in the response to a small query:

Query:

query {
  repository(owner:"python", name:"cpython") {
    pullRequests(states: OPEN, first: 1) {
      pageInfo {
        endCursor
        hasNextPage
      }
      totalCount
    }
  }
}

Output:

{
  "data": {
    "repository": {
      "pullRequests": {
        "pageInfo": {
          "endCursor": "Y3Vyc29yOnYyOpHOBlW8Wg==",
          "hasNextPage": true
        },
        "totalCount": 1048
      }
    }
  }
}

Now here's an example script with pagination that fetches the titles of all PRs for the "cpython" repo with owner "python". Since we're getting a list of all the nodes (PR titles), I capture the nodes of every response and append them to the ones collected so far. I don't attach the accumulated list back to the response, because I've already achieved what I wanted: the titles of all PRs opened in the "CPython" repo. Also, since printing such a large amount of information is useless, I'm sharing a small snippet of the output:

Output:

Page 1 fetched
Page 2 fetched
Page 3 fetched
Page 4 fetched
Page 5 fetched
Page 6 fetched
Page 7 fetched
Page 8 fetched
Page 9 fetched
Page 10 fetched
Page 11 fetched
[
  {
    "title": "bpo-29553: Fixed ArgumentParses format_usage for mutually exclusive groups"
  },
  {
    "title": "Alternarive for bpo-29553 - Fixed ArgumentParses format_usage for mutually exclusive groups"
  },
  {
...

Note: I'm only displaying the node part here since that is the focus of this query.

Is this all?

No! GraphQL has a lot of features. This post covers only the few that I came across while working on CPython-Pull-Requests. There is a lot more, like mutations, which you can use to modify data. That is the part that requires extra permissions, and it is the reason why GitHub's explorer is so permission-heavy: it supports every type of GraphQL query that you can run on GitHub, including mutations. You can find more features in GitHub's GraphQL API reference.

© The Geeky Way. Built using Pelican. Theme by Giulio Fidente on github.
