Friday, January 21, 2022

CosmosDB: compare queries performance against strings and array properties

Imagine that we have CosmosDB instance running in Azure (created with SQL syntax enabled) which has groups entities. Each group has members property - JSON array of members of this group. Also each group has the same members serialized to string in membersStr property:

{
    "id": "cf66e4ea-39b6-43bf-a237-8ae997c8cd6a",
    "membersStr": "[{\"loginName\":\"user.automatic17@mytenant.onmicrosoft.com\",\"displayName\":\"User Automatic17\",\"id\":\"53bcf941-fa64-4af2-ac47-c7f48430df29\"},{\"loginName\":\"user.automatic52@mytenant.onmicrosoft.com\",\"displayName\":\"User Automatic52\",\"msGraphUserId\":\"515ee5ae-71dc-4c5a-a844-41af5f9d419c\"},{\"loginName\":\"user.automatic67@mytenant.onmicrosoft.com\",\"displayName\":\"User Automatic67\",\"msGraphUserId\":\"21106ae1-25fb-43ac-a5e5-aa1f0d3f9f2f\"}]",
    "members": [{
        "loginName": "user.automatic17@mytenant.onmicrosoft.com",
        "displayName": "User Automatic17",
        "id": "53bcf941-fa64-4af2-ac47-c7f48430df29"
    }, {
        "loginName": "user.automatic52@mytenant.onmicrosoft.com",
        "displayName": "User Automatic52",
        "msGraphUserId": "515ee5ae-71dc-4c5a-a844-41af5f9d419c"
    }, {
        "loginName": "user.automatic67@mytenant.onmicrosoft.com",
        "displayName": "User Automatic67",
        "msGraphUserId": "21106ae1-25fb-43ac-a5e5-aa1f0d3f9f2f"
    }]
	...
}

The question which we wanted to clarify is if we want to get groups where specific user is member - what query will be faster:

1. which uses membersStr property and LIKE operator:

SELECT c FROM c WHERE c.membersStr LIKE '%{userId}%'

2. or which uses array and object-oriented query against members properties:

SELECT c FROM c JOIN m IN c.members WHERE m.id = '{userId}'

For running this test I used CosmosDB with 10K entities with random users data. For testing I used the following C# methods:

private static List<object> GetGroupsByArray(string userId)
{
    string query = $"SELECT c FROM c JOIN m IN c.members WHERE m.id = '{userId}'";
    QueryDefinition queryDefinition = new QueryDefinition(query);
    var groupsContainer = cosmosClient.GetContainer(DBName, ContainerName);
    FeedIterator<object> queryResultSetIterator = groupsContainer.GetItemQueryIterator<object>(queryDefinition);
    var groups = new List<object>();
    while (queryResultSetIterator.HasMoreResults)
    {
        FeedResponse<object> currentResultSet = queryResultSetIterator.ReadNextAsync().GetAwaiter().GetResult();
        foreach (var group in currentResultSet)
        {
            groups.Add(group);
        }
    }

    return groups;
}

private static List<object> GetGroupsByString(string userId)
{
    string query = $"SELECT c FROM c WHERE c.membersStr LIKE '%{userId}%'";
    QueryDefinition queryDefinition = new QueryDefinition(query);
    var groupsContainer = cosmosClient.GetContainer(DBName, ContainerName);
    FeedIterator<object> queryResultSetIterator = groupsContainer.GetItemQueryIterator<object>(queryDefinition);
    var groups = new List<object>();
    while (queryResultSetIterator.HasMoreResults)
    {
        FeedResponse<object> currentResultSet = queryResultSetIterator.ReadNextAsync().GetAwaiter().GetResult();
        foreach (var group in currentResultSet)
        {
            groups.Add(group);
        }
    }

    return groups;
}

Each method was called with 10 random user ids. In both cases time of query depended from number of groups user belongs to (the more groups user was member of the more time query took). I made 10 test runs of both methods and measured execution time with System.Diagnostics.Stopwatch. Then I calculated average execution time for each run of 1st method and separately for 2nd method. Finally I measured average execution time for all 10 test runs.

Result was that query against string property (JSON-serialized array) was a bit faster (929.53 milliseconds) than query agianst array property (1362.3). Conclusion is that if performance is not critical for you and you want to have nice-looking entities in CosmosDB you may use array properties. But if performance is critical you may consider to use JSON-serialized property (or use both as an option).

Update 2022-04-14: we also compared internal query engine execution time and got different results. For array-based approach performance was better. For string-based approach query took 38ms:

while for array-based approach only 18ms:

Probably .Net SDK which was used during 1st test added own delays. Consider that when you will choose approach for your scenario.

No comments:

Post a Comment