Learn How to Use Group in Mongodb Aggregation Pipeline (With Exercise)

Last updated on 18 January 2022

#databases#mongodb

The group stage in the MongoDB aggregation pipeline helps us group data using any field in the MongoDB document. It is one of the most important and commonly used stages in the MongoDB aggregation pipeline.

In this article, we'll look at the $group stage and use various features that it provides. We'll be working with a sample collection of movies. There's going to be a playground link for each query so you can practice and learn by doing.

There's also an exercise at the end of this article for you to try out. It will help you solidify your understanding once you've finished this article.

Here are the things we will cover in this article:

Find distinct using group by ♟
Group by using multiple fields 🎳
Using accumulator functions 🧩
Using with $project ⚙️
Sorting the results 📈
$group vs $project stage 🪞
Thing to note 📝
Conclusion 🎬
Exercise 🎐

Establishing the data

Before we jump into the aggregation pipeline and the group stage, we need some data to work with. I'm taking an example Movies collection for understanding the concept here. Again, there'll be links to the playground for each query throughout the article.

Here's the Movies collection with only 5 documents containing random data:

1{
2  "name": "Spidey One way home",
3  "release_year": "2021",
4  "rating": 9,
5  "starring": [
6    "Tom Hanks",
7    "Tom Holland",
8    "Mark Zucks",
9    "Samy"
10  ],
11  "runtime": 120,
12  "totalReviews": 2000,
13  "director": "Jon What"
14},
15{
16  "name": "The Arrival of a Train",
17  "release_year": "1896",
18  "rating": 6,
19  "starring": [
20    "Shawn Ching",
21    "Looker Blindspot",
22    "Tom Hanks"
23  ],
24  "runtime": 115,
25  "totalReviews": 720,
26  "director": "Ricky"
27},
28{
29  "name": "Lost persuit of adventure",
30  "release_year": "2005",
31  "rating": 7.1,
32  "starring": [
33    "Jimmy simmon",
34    "Catarina"
35  ],
36  "runtime": 150,
37  "totalReviews": 823,
38  "director": "Ricky"
39},
40{
41  "name": "Jungle Warrior",
42  "release_year": "2016",
43  "rating": 5.9,
44  "starring": [
45    "Stormer",
46    "Carmony",
47    "Tom Hanks"
48  ],
49  "runtime": 150,
50  "totalReviews": 1368,
51  "director": "Whim Wailer"
52},
53{
54  "name": "The Last of the us all",
55  "release_year": "2005",
56  "rating": 8.5,
57  "starring": [
58    "Samy",
59    "George wise",
60    "Pennywise"
61  ],
62  "runtime": 120,
63  "totalReviews": 1800,
64  "director": "Jon What"
65}

Now that we have our sample collection, it's time to explore the $group stage ⚡

Find distinct using group by

To find the distinct items in a collection we can use the group stage on any field that we want to group by. This field will be unique in the output. Let's group the movies by their release year:

Run Code

1{
2  $group: {
3    _id: "$release_year"
4  }
5}

Here's the output of the above query. Note that we only got unique release year values in the output.

1[
2  { "_id": "1896" },
3  { "_id": "2016" },
4  { "_id": "2021" },
5  { "_id": "2005" }
6]

Group by using multiple fields

Similar to grouping by a single field, we might want to group the data with more than one field as per our use case. MongoDB aggregation pipeline allows us to group by as many fields as we want.

Whatever we put inside the _id field is used to group the documents i.e., it returns all the fields present inside the _id field and groups by all of them.

Let's group the movies by their release year and their runtime:

Run Code

1{
2  $group: {
3    _id: {
4      "release_year": "$release_year",
5      "runtime": "$runtime"
6    }
7  }
8}

Grouping by release year and their runtime gives us this output:

1[
2  {
3    "_id": {
4      "release_year": "2005",
5      "runtime": 150
6    }
7  },
8  {
9    "_id": {
10      "release_year": "2021",
11      "runtime": 120
12    }
13  },
14  {
15    "_id": {
16      "release_year": "2016",
17      "runtime": 150
18    }
19  },
20  {
21    "_id": {
22      "release_year": "2005",
23      "runtime": 120
24    }
25  },
26  {
27    "_id": {
28      "release_year": "1896",
29      "runtime": 115
30    }
31  }
32]

Instead of using a single field to group by, we are using multiple fields in above scenario. The combination of release year and runtime acts as the unique identifier for each document.

Using accumulator functions

There are a lot of accumulator functions available in the group stage which can be used to aggregate the data. They help us carry out some of most common operations on the grouped data. Let's take a look at some of them:

$count accumulator

$count accumulator is used to count the number of documents in the group. This can be combined with our group by query to get the total number of documents in the group.

Let's apply this to our movies collection:

Run Code

1{
2  $group: {
3    _id: "$release_year",
4    totalMovies: { $count: {} }
5  }
6}

We'll get the total movies released in each year:

1[
2  {
3    "_id": "2016",
4    "totalMovies": 1
5  },
6  {
7    "_id": "1896",
8    "totalMovies": 1
9  },
10  {
11    "_id": "2021",
12    "totalMovies": 1
13  },
14  {
15    "_id": "2005",
16    "totalMovies": 2
17  }
18]

$sum accumulator

We can use the $sum accumulator to add up all the values in a field. Let's group the movies by their rating and sum up the reviews to understand if there's a correlation between movie rating and the number of reviews.

Run Code

1{
2  $group: {
3    _id: "$rating",
4    totalMovies: {
5      $sum: "$totalReviews"
6    }
7  }
8}

And here we can see that there is a slight correlation between the number of reviews and the movie rating:

1[
2  {
3    "_id": 9,
4    "totalMovies": 2000
5  },
6  {
7    "_id": 6,
8    "totalMovies": 720
9  },
10  {
11    "_id": 5.9,
12    "totalMovies": 1368
13  },
14  {
15    "_id": 8.5,
16    "totalMovies": 1800
17  },
18  {
19    "_id": 7.1,
20    "totalMovies": 823
21  }
22]

$avg accumulator

We might want to examine which year has the highest average movies rating for analytical purposes. Let's see how we can get those stats from our data:

Run Code

1{
2  $group: {
3    _id: {
4      year: "$release_year",
5    },
6    avgRating: {
7      $avg: "$rating"
8    }
9  }
10}

We are first grouping the movies by the release year and then calculating the average rating for each release year. Here's the output of the above query:

1[
2  {
3    "_id": { "year": "2016" },
4    "avgRating": 5.9
5  },
6  {
7    "_id": { "year": "1896" },
8    "avgRating": 6
9  },
10  {
11    "_id": { "year": "2021" },
12    "avgRating": 9
13  },
14  {
15    "_id": { "year": "2005" },
16    "avgRating": 7.8
17  }
18]

$push accumulator

We want to look at all the ratings movies received for every release year. Let's use the $push accumulator to get all the movie names for each year:

Run Code

1{
2  $group: {
3    _id: {
4      year: "$release_year",
5    },
6    ratings: {
7      $push: "$rating"
8    }
9  }
10}

All the movie ratings for each release year are pushed into an array:

1[
2  {
3    "_id": { "year": "1896" },
4    "ratings": [ 6 ]
5  },
6  {
7    "_id": { "year": "2016" },
8    "ratings": [ 5.9 ]
9  },
10  {
11    "_id": { "year": "2021" },
12    "ratings": [ 9 ]
13  },
14  {
15    "_id": { "year": "2005" },
16    "ratings": [ 7.1, 8.5 ]
17  }
18]

$addToSet accumulator

You can consider this to be like the $push accumulator. $addToSet only adds the value to the array if it doesn't exist already. This is the only difference between $addToSet and $push.

Let's group by rating and see which (unique) release years produced each:

Run Code

1{
2  $group: {
3    _id: {
4      rating: "$rating"
5    },
6    releasedIn: {
7      "$addToSet": "$release_year"
8    }
9  }
10}

We get ratings along with their unique release years:

1[
2  {
3    "_id": { "rating": 8.5 },
4    "releasedIn": [ "2005" ]
5  },
6  {
7    "_id": { "rating": 7.1 },
8    "releasedIn": [ "2005" ]
9  },
10  {
11    "_id": { "rating": 9 },
12    "releasedIn": [ "2021" ]
13  },
14  {
15    "_id": { "rating": 6 },
16    "releasedIn": [ "1896" ]
17  },
18  {
19    "_id": { "rating": 5.9 },
20    "releasedIn": [ "2016" ]
21  }
22]

$min accumulator

Let's say we want to find out successful release years for the movies. A year is considered successful if the all the movies released during that year have rating greater than 7. Let's use the $min accumulator to get the successful years:

Run Code

1{
2  $group: {
3    _id: {
4      year: "$release_year"
5    },
6    minRating: { $min: "$rating" }
7  }
8},
9{
10  "$match": {
11    minRating: { $gt: 7 }
12  }
13}

We have grouped the movies collection using the release_year field.
In addition to that, we have added minRating field which maintains the minimum rating for each release year.
We have also applied a $match stage to filter out the years which don't have a minimum rating greater than 7.

1[
2  {
3    "_id": {
4      "year": "2021"
5    },
6    "minRating": 9
7  },
8  {
9    "_id": {
10      "year": "2005"
11    },
12    "minRating": 7.1
13  }
14]

$first accumulator

This accumulator is different from the $first array operator which gives first element in an array. For each grouped documents, $first accumulator gives us the first one.

Let's fetch the highest rated movie for every release year. Since we want to get the highest rated document from the each group, we need to sort the documents before passing them to the group stage.

Run Code

1{
2  "$sort": {
3    "release_year": 1,
4    "rating": -1
5  }
6},
7{
8  $group: {
9    _id: "$release_year",
10    highestRating: {
11      $first: "$rating"
12    }
13  }
14}

We are sorting using two fields here, release_year and rating. Let's understand the output of sort stage first:

Run Code

1[
2  {
3    "rating": 6,
4    "release_year": "1896"
5  },
6  {
7    "rating": 8.5,
8    "release_year": "2005"
9  },
10  {
11    "rating": 7.1,
12    "release_year": "2005"
13  },
14  {
15    "rating": 5.9,
16    "release_year": "2016"
17  },
18  {
19    "rating": 9,
20    "release_year": "2021"
21  }
22]

The output is first sorted on the basis of ascending release year and then for each year, the movies are sorted in descending order of rating.

This sorted output is then passed to the group stage which groups the documents by their release year. For example, group stage is working with two documents for release year 2005:

1{
2  "rating": 8.5,
3  "release_year": "2005"
4},
5{
6  "rating": 7.1,
7  "release_year": "2005"
8}

Let's call these "shortlisted documents" for release year 2005. This happens for all (unique) release years. Group stage picks the first element from these shortlisted documents (which has the highest rating because ratings are sorted in descending order).

Combining the sort and group stages, here's the final output of the query:

1[
2  {
3    "_id": "2016",
4    "highestRating": 5.9
5  },
6  {
7    "_id": "1896",
8    "highestRating": 6
9  },
10  {
11    "_id": "2021",
12    "highestRating": 9
13  },
14  {
15    "_id": "2005",
16    "highestRating": 8.5
17  }
18]

Passing sorted documents to $group stage does not guarantee that the order will be preserved.

Using with $project

The movie rating is a floating point number. We'll round that off to the nearest integer to get the movie rating as a whole number. Let's also group movies by their modified ratings:

Run Code

1{
2  "$project": {
3    rating: {
4      "$round": "$rating"
5    }
6  }
7},
8{
9  $group: {
10    _id: "$rating",
11    movies: {
12      $sum: 1
13    }
14  }
15}

We used $project stage to round off the rating to the nearest integer.
We used $group stage to group the movies by their modified rating.

Here's the output of the above query:

1[
2  {
3    "_id": 7,
4    "movies": 1
5  },
6  {
7    "_id": 8,
8    "movies": 1
9  },
10  {
11    "_id": 9,
12    "movies": 1
13  },
14  {
15    "_id": 6,
16    "movies": 2
17  }
18]

The possibilities are endless. You can combine many other stages, perform some filters, put conditions or even $$REMOVE the documents.

Sorting the results

The year with the highest movie minutes might give us some insights on movies production and its correlation with audience attention spans over the years. So let's understand how to achieve that:

Run Code

1{
2  $group: {
3    _id: "$release_year",
4    totalRuntime: {
5      "$sum": "$runtime"
6    }
7  }
8},
9{
10  "$sort": {
11    "totalRuntime": -1
12  }
13}

We are fetching the total runtime of all the movies released in a particular year and then sorting them in descending order with the help of $sort stage:

1[
2  {
3    "_id": "2005",
4    "totalRuntime": 270
5  },
6  {
7    "_id": "2016",
8    "totalRuntime": 150
9  },
10  {
11    "_id": "2021",
12    "totalRuntime": 120
13  },
14  {
15    "_id": "1896",
16    "totalRuntime": 115
17  }
18]

It is evident from this query that the attention spans of the target audience have been decreasing in non-uniform way over the years.

$group vs $project stage

We have an n:1 relationship between input and output documents in the group stage. But, we have a 1:1 relationship in the $project stage.

In group stage we usually get a count, sum, average of documents based on the grouping key (or _id), or even build an array. All of these operations take n number of documents and the output of group is a single document with the aggregated values.

On the other hand, we include/exclude fields, perform field transformations within a single document in case of project stage in aggregation pipeline,

Thing to note

$group stage has a limit of 100 megabytes of RAM

If you're working with a massive dataset and you receive an error during group stage execution, you might be hitting the memory limit. If you want to increase it, use allowDiskUse option to enable the $group stage to write to temporary files on disk.

The reason for this issue is very well stated in the mongoDB docs:

Pipeline stages operate on streams of documents with each pipeline stage taking in documents, processing them, and then outputting the resulting documents.

Some stages can't output any documents until they have processed all incoming documents. These pipeline stages must keep their stage output in RAM until all incoming documents are processed. As a result, these pipeline stages may require more space than the 100 MB limit.

Conclusion

And that was how the group stage works in mongoDB aggregation pipeline. We looked at how we can group data using single fields (distinct count), multiple fields, sort them, how we can carry out complex computations by adding conditions to group stage and the subtle difference between group and project stage. I hope you find this useful and interesting.

Let me know your thoughts and feedback on Twitter.

Exercise

To make sure you understand the concepts, I have curated a couple of questions related to what we've learned in this article. You can download exercise PDF below. It also contains working mongoDB playground links containing the answers for all the questions. Be honest, don't cheat 🙂.

5 quick questions on group stage with answers