RethinkDB: querying and using jq for post-query processing.

By John Keyes

August 19, 2024 at 12:34

rethinkdb jq json

Introduction

I was recently tasked with compiling some aggregate data for the FaceItDown project. The project stores its data in RethinkDB, and because the dataset is large and of how the database is configured (and maybe even just because it’s RethinkDB), running queries against it is slow.

To speed this up, I decided to dump the relevant data to a local JSON file, and then use some Python and jq to extract the information I needed.

Journey data

The journey data is stored in the Journey table, but for the purposes of the questions I was asked to answer, only certain fields from these records are required: the user ID and the number of points.

We are also only interested in successful journeys, so the query excludes everything else.

The following query returns all the documents that represent successful journeys over the past 13 months. 13 months may seem arbitrary, but for data privacy reasons any journey data older than 13 months is deleted once per day.

import json
from collections import defaultdict

# rethinkdb Python driver 2.4+ style import
from rethinkdb import RethinkDB

r = RethinkDB()


def save_journeys(conn):
    # if a key does not exist in journey_data, accessing it creates it
    # with an empty list value
    journey_data = defaultdict(list)

    # get all the journeys we're interested in from the database;
    # status 4 marks a successful journey
    journeys = (
        r.db("faceitdown")
        .table("Journey")
        .filter({"status": 4})
        .filter(r.row["updated_at"] >= "2023-07-19")
        .filter(r.row["updated_at"] <= "2024-08-19")
        .run(conn)
    )
    # for each document in the journeys cursor, insert or
    # update the key in journey_data with the points earned
    # by that journey
    for doc in journeys:
        journey_data[doc["userId"]].append(doc["points"])
    # write the abbreviated journey data to the journeys.json file
    with open("journeys.json", "w") as f:
        json.dump(journey_data, f, indent=2)
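
To run it, connect with the driver and pass the connection in; the host and port below are just the driver defaults, so adjust them for your own deployment:

# localhost/28015 are the rethinkdb driver defaults; swap in
# your own connection details
conn = r.connect(host="localhost", port=28015)
save_journeys(conn)
conn.close()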

The result of running this is a journeys.json file like so:

{
  "669814a0-ce69-410e-bf85-a26e25baaac2": [
    4
  ],
  "51dcd5e1-23f4-46a0-837d-a6ba07e0af08": [
    60,
    66,
    62
  ],
  "114eeafd-3293-4906-963f-d1dcc50c5a2e": [
    1
  ]
}

Sorted journey data

I’d like to sort this data so that the users with the most successful journeys come first. To do this we can use jq:

jq 'to_entries | sort_by(.value | length) | reverse | from_entries' \
  journeys.json > sorted_journeys.json

The output from this command is stored in sorted_journeys.json, which will be used next.
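
Applied to the small journeys.json sample above, it would produce something like:

{
  "51dcd5e1-23f4-46a0-837d-a6ba07e0af08": [
    60,
    66,
    62
  ],
  "114eeafd-3293-4906-963f-d1dcc50c5a2e": [
    1
  ],
  "669814a0-ce69-410e-bf85-a26e25baaac2": [
    4
  ]
}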

User report

The user summary can then be created. This is a JSON file, with an entry per user containing the number of successful journeys and the cumulative points total. This could probably be done with jq too (a sketch follows the function below), but for the sake of getting the work done quickly I stuck to what I knew here.

def user_summary():
    # open user_summary.json for writing and sorted_journeys.json
    # for reading
    with open("user_summary.json", "w") as rf, open("sorted_journeys.json", "r") as f:
        # list for the processed user entries
        results = []
        # read all the journey data
        journeys = json.load(f)
        # iterate over the users; idx is just a progress counter and
        # user_id is the key of the current object
        for idx, user_id in enumerate(journeys):
            print(f"{idx}: user_id -> {user_id}")
            results.append(
                {
                    "userId": user_id,
                    "num_journeys": len(journeys[user_id]),
                    "points": sum(journeys[user_id]),
                }
            )
        # write the results to user_summary.json
        rf.write(json.dumps(results, indent=2))
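
For the curious, the all-jq equivalent would be something like the sketch below (untested against the real data), using length and add to count and sum each points array:

jq 'to_entries | map({userId: .key, num_journeys: (.value | length), points: (.value | add)})' \
  sorted_journeys.json > user_summary.json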

When the Python version has run, user_summary.json has entries like the following:

[
  {
    "userId": "b1e1ec76-a615-4a59-a118-df4edfa7f118",
    "num_journeys": 19,
    "points": 544
  },
  {
    "userId": "61c8d3de-8f6d-42b2-ab05-6f5875053464",
    "num_journeys": 19,
    "points": 1607
  }
]

It would be nice to have this data ordered by most points first, so again we can use jq to do this efficiently:

jq 'sort_by(.points) | reverse' user_summary.json > user_report.json

This time we sort by the value of the points field and reverse again. The results are written to user_report.json.
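
Incidentally, sort_by(-.points) would give the same descending order in a single step, since negating the numeric sort key flips the comparison.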

Nerd 2 Human Translation

We now have all the data we need, but it would be nice to give our non-techie clients a human-readable version of this report. The best way to do this is as a spreadsheet, so we need to convert the JSON to CSV.

There are many ways to do this, but for this post we’re going to use, yeah you guessed it, jq :)

jq -r '["userId", "num_journeys", "points"], (.[] | [.userId, .num_journeys, .points]) | @csv' \
  user_report.json > user_report.csv

In this command, we explicitly add the column titles as their own row in the output. Then for each object in user_report.json we create an array with the values of the specified keys. The remaining step is to convert each array into a CSV row using @csv.
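
Run against the two sample entries shown earlier, the output looks like this (@csv quotes strings and leaves numbers unquoted):

"userId","num_journeys","points"
"61c8d3de-8f6d-42b2-ab05-6f5875053464",19,1607
"b1e1ec76-a615-4a59-a118-df4edfa7f118",19,544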

The CSV can then be imported into any spreadsheet software, e.g. Google Sheets.

jq is powerful

If you work with JSON data and you haven’t heard of jq, I would recommend taking a look at it. There is a helpful online jq playground where you can experiment.

Last updated: August 19, 2024 at 12:34