As I mentioned above, we have a little data set of 10 entries. I'm posting a sample entry here so the rest of the tutorial makes sense. Each entry represents a single tweet made by a sample user.
```json
{
  "entities": {
    "user_mentions": [
      {
        "indices": [ 3, 15 ],
        "id_str": "178253493",
        "screen_name": "mikalabrags",
        "name": "Mika Labrague",
        "id": 178253493
      }
    ],
    "urls": [],
    "hashtags": [ "#Confused" ]
  },
  "in_reply_to_screen_name": null,
  "text": "RT @mikalabrags: Bipolar weather #Confused",
  "id_str": "210621130703245313",
  "place": null,
  "retweeted_status": {
    "entities": {
      "user_mentions": [],
      "urls": [],
      "hashtags": []
    },
    "in_reply_to_screen_name": null,
    "text": "Bipolar weather",
    "id_str": "210619512855343105",
    "place": null,
    "in_reply_to_status_id": null,
    "contributors": null,
    "retweet_count": 0,
    "favorited": false,
    "truncated": false,
    "source": "http://ubersocial.com",
    "in_reply_to_status_id_str": null,
    "created_at": "Thu Jun 07 06:29:39 +0000 2012",
    "in_reply_to_user_id_str": null,
    "in_reply_to_user_id": null,
    "user": {
      "lang": "en",
      "profile_background_image_url": "http://a0.twimg.com/profile_background_images/503549271/tumblr_m25lrjIjgT1qb6nmgo1_500.jpg",
      "id_str": "178253493",
      "default_profile_image": false,
      "statuses_count": 13635,
      "profile_link_color": "06544a",
      "favourites_count": 819,
      "profile_image_url_https": "https://si0.twimg.com/profile_images/2240536982/AtRKA77CIAAJRHT_normal.jpg",
      "following": null,
      "profile_background_color": "373d3a",
      "description": "No fate but what we make",
      "notifications": null,
      "profile_background_tile": true,
      "time_zone": "Alaska",
      "profile_sidebar_fill_color": "1c1c21",
      "listed_count": 1,
      "contributors_enabled": false,
      "geo_enabled": true,
      "created_at": "Sat Aug 14 07:31:28 +0000 2010",
      "screen_name": "mikalabrags",
      "follow_request_sent": null,
      "profile_sidebar_border_color": "08080a",
      "protected": false,
      "url": null,
      "default_profile": false,
      "name": "Mika Labrague",
      "is_translator": false,
      "show_all_inline_media": true,
      "verified": false,
      "profile_use_background_image": true,
      "followers_count": 214,
      "profile_image_url": "http://a0.twimg.com/profile_images/2240536982/AtRKA77CIAAJRHT_normal.jpg",
      "id": 178253493,
      "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/503549271/tumblr_m25lrjIjgT1qb6nmgo1_500.jpg",
      "utc_offset": -32400,
      "friends_count": 224,
      "profile_text_color": "352e4d",
      "location": "Mnl"
    },
    "retweeted": false,
    "id": 2.106195128553431E+17,
    "coordinates": null,
    "geo": null
  },
  "in_reply_to_status_id": null,
  "contributors": null,
  "retweet_count": 0,
  "favorited": false,
  "truncated": false,
  "source": "<a href=\"http://blackberry.com/twitter\" rel=\"nofollow\">Twitter for BlackBerry®</a>",
  "in_reply_to_status_id_str": null,
  "created_at": "Thu Jun 07 06:36:05 +0000 2012",
  "in_reply_to_user_id_str": null,
  "in_reply_to_user_id": null,
  "user": {
    "lang": "en",
    "profile_background_image_url": "http://a0.twimg.com/profile_background_images/542537222/534075_10150809727636812_541871811_10087628_844237475_n_large.jpg",
    "id_str": "37200018",
    "default_profile_image": false,
    "statuses_count": 5715,
    "profile_link_color": "CC3366",
    "favourites_count": 46,
    "profile_image_url_https": "https://si0.twimg.com/profile_images/2276155427/photo_201_1_normal.jpg",
    "following": null,
    "profile_background_color": "dbe9ed",
    "description": "protège-moi de mes désirs  23107961 ☍",
    "notifications": null,
    "profile_background_tile": true,
    "time_zone": "Singapore",
    "profile_sidebar_fill_color": "ffffff",
    "listed_count": 2,
    "contributors_enabled": false,
    "geo_enabled": true,
    "created_at": "Sat May 02 13:55:49 +0000 2009",
    "screen_name": "yoursweetiethea",
    "follow_request_sent": null,
    "profile_sidebar_border_color": "91f50e",
    "protected": false,
    "url": "http://yoursweetiethea.tumblr.com",
    "default_profile": false,
    "name": "Althea Arellano",
    "is_translator": false,
    "show_all_inline_media": false,
    "verified": false,
    "profile_use_background_image": true,
    "followers_count": 306,
    "profile_image_url": "http://a0.twimg.com/profile_images/2276155427/photo_201_1_normal.jpg",
    "id": "37200018",
    "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/542537222/534075_10150809727636812_541871811_10087628_844237475_n_large.jpg",
    "utc_offset": 28800,
    "friends_count": 297,
    "profile_text_color": "fa3c6b",
    "location": "Christian's Heart"
  },
  "retweeted": false,
  "id": "210621130703245300",
  "coordinates": null,
  "geo": null
}
```
The ANTLR grammar
Here is the ANTLR grammar we are going to use:
```antlr
grammar Search;

expr: term (op term)*;
term: exactText | hashText | toText | fromText;
op: AND | OR;

toText: 'to:' ID;
fromText: 'from:' ID;
hashText: '#' ID;
exactText: EXACTTEXT;

// lexer rules
EXACTTEXT: '"' ~'"'* '"';
OR: 'OR';
AND: 'AND';
ID: [a-zA-Z_] [a-zA-Z0-9_]*;
WS: [ \n\t\r]+ -> skip;
```
It's a small and simple grammar, like the one we used in our scripting language tutorial. Since our REST resource query is essentially an expression, our entry rule will be expr. The railroad diagram for expr looks like the following.
This means we can have a single search term or multiple search terms chained by the op rule. The op rule is nothing but the two logical operators we support: AND and OR.
Alongside the op rule, we also need to know what the term rule stands for. The term rule defines the search term formats we are allowed to use. In our sample query stated above, we have terms like from:terminator, to:robocop, "I'm back" or a hashtag like #HastaLaVista. That's why we have four rules defining all of these cases, and the term rule is an OR relationship between them.
I'm not going to post the railroad diagrams for the toText, fromText, hashText and exactText rules since they are pretty self-explanatory from a cursory look at the grammar.
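To make the grammar concrete, here are a few expressions it accepts, reusing the sample terms from above:

```text
"I'm back"
to:robocop
from:terminator AND #HastaLaVista
"I'm back" OR to:robocop AND from:terminator
```

Each line is a complete expr: either a single term, or several terms chained with AND/OR.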
So, what are we waiting for? Let's start writing the little codebase that will parse this query string and translate it to an Azure DocumentDB SQL query we can use to fetch the tweets. For that, we need a small repository that connects to our desired database and collection in DocumentDB and lets us fetch some items. I only added the methods that allow us to connect to the database and read the tweets; I skipped the rest since you can always look those up in the Azure DocumentDB quick start.
Here's our small, rough database repository. Please remember to keep your endpoint and key strings somewhere secret and safe in a production environment; since this is a tutorial, I went with whatever was easy and fast for a proof of concept.
```csharp
namespace TweetQuery.Lib
{
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents.Client;
    using Microsoft.Azure.Documents.Linq;

    public class CosmosDBRepository<T> where T : class
    {
        private readonly string Endpoint = "https://tweet.documents.azure.com:443/";
        private readonly string Key = "fjp9Z3qKPxSOfE0KS1aaKvUY27B8IoL347sdtMBMjkCQqPmoaKjGXoyltrItNXNN6h4QjAYLSY5nyb2djWWUOQ==";
        private readonly string DatabaseId = "tweetdb";
        private readonly string CollectionId = "tweets";
        private DocumentClient client;

        public CosmosDBRepository()
        {
            client = new DocumentClient(new Uri(Endpoint), Key);
        }

        public async Task<IEnumerable<T>> GetItemsAsync(string sql)
        {
            if (string.IsNullOrEmpty(sql))
                throw new ArgumentNullException(nameof(sql));

            FeedOptions queryOptions = new FeedOptions { MaxItemCount = -1 };

            var query = this.client.CreateDocumentQuery<T>(
                UriFactory.CreateDocumentCollectionUri(DatabaseId, CollectionId),
                sql, queryOptions)
                .AsDocumentQuery();

            List<T> results = new List<T>();
            while (query.HasMoreResults)
            {
                results.AddRange(await query.ExecuteNextAsync<T>());
            }

            return results;
        }
    }
}
```
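With the repository in place, fetching tweets is just a matter of handing it raw SQL. A quick sketch of its usage (it needs a live DocumentDB endpoint and the Tweet model from the sample code, so treat it as illustrative only):

```csharp
// Inside an async method somewhere in the application.
var repository = new CosmosDBRepository<Tweet>();

// Any valid DocumentDB SQL works here; "twt" is just the alias we use throughout.
var tweets = await repository.GetItemsAsync(
    "SELECT * FROM twt WHERE twt.user.screen_name = \"mikalabrags\"");
```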
I opted for executing an SQL query instead of a LINQ expression because constructing SQL is easier and simpler for a tutorial. Additionally, it decouples the query structure from the compile-time POCOs we use as our models.
I created a DocumentDbListener class based on the SearchBaseListener that ANTLR auto-generated from our grammar. The sole purpose of this class is to generate a simple SQL query from our search expression. To search inside nested arrays, I used a user-defined function in Azure DocumentDB. All of this is written very crudely, so forgive the roughness; since this is just a tutorial, I tried to keep things as simple as possible.
```javascript
function matchArrayElement(array, match) {
    if (!Array.isArray(array)) return false;

    for (var index = 0; index < array.length; index++) {
        var element = array[index];

        if (typeof match === "object") {
            for (var key in match) {
                if (match.hasOwnProperty(key) && element.hasOwnProperty(key)) {
                    // Return only on a hit, so the remaining
                    // array elements still get scanned.
                    if (match[key] == element[key]) return true;
                }
            }
        }
        else if (element == match) {
            return true;
        }
    }

    return false;
}
```
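Because the UDF is plain JavaScript, you can sanity-check the matching logic in Node.js before registering it on the collection. The snippet below is a standalone copy of the function, exercised against data shaped like our sample tweet:

```javascript
// Standalone copy of the UDF for local testing.
function matchArrayElement(array, match) {
    if (!Array.isArray(array)) return false;

    for (var index = 0; index < array.length; index++) {
        var element = array[index];

        if (typeof match === "object") {
            for (var key in match) {
                if (match.hasOwnProperty(key) && element.hasOwnProperty(key)) {
                    if (match[key] == element[key]) return true;
                }
            }
        }
        else if (element == match) {
            return true;
        }
    }

    return false;
}

// Object match: does any user mention have this screen_name?
var mentions = [{ screen_name: "mikalabrags", id: 178253493 }];
console.log(matchArrayElement(mentions, { screen_name: "mikalabrags" })); // true
console.log(matchArrayElement(mentions, { screen_name: "robocop" }));     // false

// Scalar match: hashtags in our sample data are plain strings.
var hashtags = ["#Confused"];
console.log(matchArrayElement(hashtags, "#Confused"));     // true
console.log(matchArrayElement(hashtags, "#HastaLaVista")); // false
```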
All this function does is try to find a nested array element that matches the criterion we pass in. You can achieve the same result through JOINs or the built-in ARRAY_CONTAINS function in Azure DocumentDB, but I preferred a user-defined function since it serves my purpose easily.
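Once the function is registered on the collection under the name matchArrayElement, it can be invoked directly from a query with the udf. prefix. For example, to find tweets carrying a given hashtag in our sample data:

```sql
SELECT * FROM twt
WHERE udf.matchArrayElement(twt.entities.hashtags, "#Confused")
```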
Constructing SQL from the query expression
To understand how the SQL is generated from the query expression, let's begin with the to:UserAccount expression. Since we start with the rule expr, let's override the SearchBaseListener method EnterExpr first.
```csharp
namespace TweetSearch.CosmosDb.DocumentDb
{
    using Antlr4.Runtime.Misc;
    using TweetSearch.CosmosDb.Util;

    public class DocumentDbListener : SearchBaseListener
    {
        private string projectionClause = "SELECT * FROM twt";
        private string whereClause;

        public string Output
        {
            get { return projectionClause + " " + whereClause; }
        }

        public override void EnterExpr([NotNull] SearchParser.ExprContext context)
        {
            this.whereClause = "WHERE";
        }
    }
}
```
The approach I took here is the simplest one possible: I handle the event fired the moment ANTLR enters a specific rule, and I keep appending to the whereClause string. Since entering the expr rule means I will need a WHERE clause in the SQL, I initialize whereClause with "WHERE". Notice that I assign here instead of concatenating, because the grammar guarantees this event fires exactly once.
Following the same trail, the next thing to handle would be the EnterTerm event. But term is nothing but an OR relationship between four other rules, and handling those rules individually gives me smaller, more readable methods. For example, to handle the to:UserAccount expression, a simple method like the following is sufficient for our use case:
```csharp
public override void EnterToText([NotNull] SearchParser.ToTextContext context)
{
    var screenName = context.GetText().Substring(3).Enquote();
    this.whereClause = string.Concat(whereClause, " ", $"udf.matchArrayElement(twt.entities.user_mentions, {{ \"screen_name\" : {screenName} }} )");
}
```
This is also where our user-defined function comes into play: I'm trying to find any tweet that has a user mention of the account parsed from the query.
Following the same pattern, I implemented the remaining rules, and the full listener class looks like this:
```csharp
namespace TweetSearch.CosmosDb.DocumentDb
{
    using Antlr4.Runtime.Misc;
    using TweetSearch.CosmosDb.Util;

    public class DocumentDbListener : SearchBaseListener
    {
        private string projectionClause = "SELECT * FROM twt";
        private string whereClause;

        public string Output
        {
            get { return projectionClause + " " + whereClause; }
        }

        public override void EnterExpr([NotNull] SearchParser.ExprContext context)
        {
            this.whereClause = "WHERE";
        }

        public override void EnterFromText([NotNull] SearchParser.FromTextContext context)
        {
            var screenName = context.GetText().Substring(5).Enquote();
            this.whereClause = string.Concat(whereClause, " ", "twt.user.screen_name = ", screenName);
        }

        public override void EnterOp([NotNull] SearchParser.OpContext context)
        {
            var text = context.GetText();
            this.whereClause = string.Concat(this.whereClause, " ", text.ToUpper());
        }

        public override void EnterToText([NotNull] SearchParser.ToTextContext context)
        {
            var screenName = context.GetText().Substring(3).Enquote();
            this.whereClause = string.Concat(whereClause, " ", $"udf.matchArrayElement(twt.entities.user_mentions, {{ \"screen_name\" : {screenName} }} )");
        }

        public override void EnterHashText([NotNull] SearchParser.HashTextContext context)
        {
            var hashtag = context.GetText().Enquote();
            this.whereClause = string.Concat(whereClause, " ", $"udf.matchArrayElement(twt.entities.hashtags, {hashtag})");
        }

        public override void EnterExactText([NotNull] SearchParser.ExactTextContext context)
        {
            var text = context.GetText();
            this.whereClause = string.Concat(whereClause, " ", $"CONTAINS(twt.text, {text})");
        }
    }
}
```
We got our listener ready! Now all we need is a context class that bootstraps the lexer, token stream and parser so the input expression is transpiled and the output SQL generated. Just like in our last work on ANTLR, the TweetQueryContext class looks like the following:
```csharp
namespace TweetSearch.CosmosDb.DocumentDb
{
    using Antlr4.Runtime;
    using Antlr4.Runtime.Tree;

    public class TweetQueryContext
    {
        private DocumentDbListener listener;

        public TweetQueryContext()
        {
            this.listener = new DocumentDbListener();
        }

        public SearchParser.ExprContext GenerateAST(string input)
        {
            var inputStream = new AntlrInputStream(input);
            var lexer = new SearchLexer(inputStream);
            var tokens = new CommonTokenStream(lexer);
            var parser = new SearchParser(tokens);
            parser.ErrorHandler = new BailErrorStrategy();

            return parser.expr();
        }

        public string GenerateQuery(string inputText)
        {
            var astree = this.GenerateAST(inputText);
            ParseTreeWalker.Default.Walk(listener, astree);
            return listener.Output;
        }
    }
}
```
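To see the whole pipeline in action, here's a quick sketch (it assumes the SearchLexer and SearchParser classes generated from the grammar above, and that the Enquote helper wraps its argument in double quotes):

```csharp
var context = new TweetQueryContext();
var sql = context.GenerateQuery("from:terminator AND #HastaLaVista");

// With the listener above, sql should read:
// SELECT * FROM twt WHERE twt.user.screen_name = "terminator" AND udf.matchArrayElement(twt.entities.hashtags, "#HastaLaVista")
```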
Whew! That was easy, right?
Bootstrapping the API layer
We have everything we need except the API itself. Thanks to ASP.NET Core, that is only a couple of clicks away: open Visual Studio and create a new ASP.NET Core Web API project. Our TweetsController class looks like the following:
```csharp
namespace TweetQuery.Controllers
{
    using Microsoft.AspNetCore.Mvc;
    using System.Threading.Tasks;
    using TweetQuery.Lib;
    using TweetQuery.Lib.Model;
    using TweetSearch.CosmosDb.DocumentDb;

    [Route("api/[controller]")]
    public class TweetsController : Controller
    {
        private CosmosDBRepository<Tweet> repository;
        private TweetQueryContext context;

        public TweetsController(CosmosDBRepository<Tweet> repository)
        {
            this.repository = repository;
            this.context = new TweetQueryContext();
        }

        [HttpGet("search")]
        public async Task<IActionResult> Search([FromQuery] string q)
        {
            if (string.IsNullOrEmpty(q))
                return BadRequest();

            var querySql = this.context.GenerateQuery(q).Trim();
            var result = await repository.GetItemsAsync(querySql);
            return Ok(result);
        }
    }
}
```
I reused the database repository we created earlier; as you can see, it is dependency-injected into the controller, which you have to configure in the ConfigureServices method of your Startup class. I'm not adding that specific code here since it is already in the sample code and outside the scope of this tutorial. The same goes for the model class Tweet and the classes it uses.
Time to test!
The project is hosted on GitHub, and I also attached it with this article. Clone or download the code, then build and run it from Visual Studio. As a sample query, try the following:
```text
http://localhost:5000/api/tweets/search?q=to:hatena_sugoi AND from:maeta_tw OR %23HopesUp OR "Surely June is a summer"
```
I URL-encoded the hashtag here just to be nice to whatever REST client you might use. I highly suggest Postman if you don't want anything heavy.
I only took the minimalist's way of using ANTLR here; you can build your own expression tree on top of the auto-generated listener and do so much more if you want.
I hope this was fun. Happy RESTing!