Recently I dealt with a large IoT dataset where data was flowing into a SQL Server database, fed from a REST API call that our developers were making to a third-party service. The results returned by those REST calls were in JSON format, and the developers chose to write this data to a table exactly as received. Later, Business Intelligence was asked to build several SAP BusinessObjects reports based on this data. That led to the conclusion that we needed to transform the data into a SQL data store, since that would allow for consumption in SAP BO and for merging with other SQL-based data.
In this article, I’ll explain some of the mechanics of the solution, using fake data to “hide the innocent” but still offer some concrete details you can use in your own solutions.
Problem Set
We could have gone a few different directions with this project. The data we were seeing looked something like this,
An individual record here includes JSON with a certain amount of nesting involved – for example,
{
  "Device": {
    "ID": 2388122,
    "Status": "Active"
  },
  "Box": {
    "Tracking": "GSK922391",
    "Measures": {
      "M1": 821.2,
      "M2": 9328,
      "M3": {
        "Hi": 42.6,
        "Low": 18.3
      }
    },
    "Alarms": [
      {
        "Code": 4,
        "TriggeredBy": 2
      },
      {
        "Code": 12,
        "TriggeredBy": 3
      }
    ]
  }
}
Of note, we have Alarms as an array, nested within the Box object. What we’d like to do is build reports that could look something like this,
Here we’ve taken some data that would be sitting in JSON and combined it with other data in our SQL data warehouse (e.g. arrival and departure dates, where we might have looked these up using a key such as the tracking or device ID).
We were dealing with SQL Server 2014 on-prem, so we didn’t have the native JSON parsing functionality that became available in SQL 2016. We did find a free, third-party JSON parser implemented as a SQL user-defined function (UDF). This would have allowed our team, who know SQL well, to easily “join in place” raw data from our IoT feed with the other data – but I was concerned about performance and scalability. With SQL 2016, forethought has been given to offering well-performing solutions that involve JSON, including the ability to create indexes against JSON “values.” However, even that approach doesn’t address more advanced parsing – such as splitting arrays into multiple rows, or concatenating values into a list – and having the results persisted in a high-performance manner.
Also, we were seeing growth of about 500,000 rows per month (and likely increasing), and rather soon it would be nearly impossible to run live queries against a “full set” of data using what looks like a CPU-heavy, scan-hungry UDF. One option might have been to look only at new records since the “last load” and shred the JSON into a different, more SQL-friendly format. That could have meant, for example, moving data into a table that looks like this,
CREATE TABLE Demo.IoTEvent
(IoTDumpID int NOT NULL PRIMARY KEY,
 TrackingNumber varchar(20) NOT NULL,
 M1 decimal(12,4) NOT NULL,
 M2 int NOT NULL,
 M3Low decimal(12,4) NOT NULL,
 M3High decimal(12,4) NOT NULL,
 AlarmList varchar(100) NULL)
Here our IoTDumpID is retained from the source table, for traceability back to the row’s origins. Explicit fields for each tracked attribute can be strongly typed and named in a meaningful way. If we wanted to call out alarm codes in a one-to-(zero, one, or many) fashion, a child table would be warranted. However, in our case there was no indication of any value in alarm codes beyond a simple, concatenated list.
An Alternative: Power Query
What other technology lets us consume JSON and build reports or otherwise see it in a tabular way? Power BI can do it, as illustrated here,
Here we’re seeing Power BI offering a way to split JSON objects and arrays into columns. But it’s Power Query, as used by Power BI, that’s the key here; what we’d really like is a way to use it programmatically and include it in an ETL process. One way to do so is by using Analysis Services. As a SQL Server shop, we could have done this using on-prem SQL licensing, but we had no existing tabular AS instances configured. (Nor did we plan to: we were already using AAS.) Spinning up AAS instances is very easy – so friction is very low. Power Query has been available to SSAS in both the on-prem and Azure-based (AAS) versions for a while.
The ability to use Power Query through AAS for this purpose can be achieved using this general data flow,
Notice that our landed JSON data is on-prem, so we’re using our installed On-Premises Data Gateway. The gateway is out of scope for this article, but the general principle is that we can craft the solution as if everything were on-prem: we refer to on-prem server names, then deploy the model to AAS – and it just keeps working!
Had we stuck with JSON parsing in the SQL layer, it would have meant,
- Since we’d only ever persist a subset of fields (the real JSON had about 100 attributes), if the business later decided they wanted a field from the original JSON that was not included in the shredded output, we’d potentially have had to do a “retroactive load” of older data. Given the dataset size, this would eventually become a big, slow deal, having to run a UDF over millions of rows. Using a tabular model layer means we flatten every single field available, up-front. Our choice of which subset to put in a SQL table then comes down to the query we run against our tabular model (e.g. using DAX – a query that can itself be sophisticated, if needed).
- The effort to build a tabular model from the shredded relational data would have been “extra” work. A natural byproduct of using AAS as our “shredder of JSON” is that we get a BISM that could stand on its own or be part of a larger model. (If someone asks for a field we didn’t shred to SQL – why not use Power BI against the existing AAS model? Our general trend is to encourage use of Power BI over SAP BO, anyhow.)
- We’d need to cut SQL code to shred the JSON, rather than using – oh, say – a visual designer!
Here I’ve created a tabular model project in Visual Studio, with a connection to our database, and I’ve added the table that our JSON was being landed in,
If we look at “Table Properties”, we get a pop-up that looks like this,
If we click “Design”, another pop-up comes up. This one is the “Power Query Editor.”
This is just a graphical interface sitting on top of the “Sql statement” shown on the Table Properties window, which is in fact showing us a Power Query Formula Language (aka “M”) expression. More on that shortly. In my example, we first need to have the APIData column formally recognized as JSON,
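For reference, the M behind that step is tiny. Here’s a minimal sketch of what the query might look like at this point – the server, database, and landing-table names are placeholders for my real ones, and in practice the designer generates this rather than you typing it,

    let
        // Connect to the on-prem database holding the raw feed (names are placeholders)
        Source = Sql.Database("MyOnPremServer", "IoTLanding"),
        Landing = Source{[Schema = "Demo", Item = "IoTDump"]}[Data],
        // Treat the APIData text column as JSON so the designer can split it
        Parsed = Table.TransformColumns(Landing, {{"APIData", Json.Document}})
    in
        Parsed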
With that done, the arrow on the column changes into a “split” icon. Clicking on it, we see that Power Query is prepared to split two more objects – “Device” and “Box” – into new columns,
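Each split we accept in the designer becomes a Table.ExpandRecordColumn step appended to the query. Continuing the sketch above (and trimming the column lists to my sample JSON), the generated steps look roughly like this,

        // Split the top-level record into its Device and Box columns
        #"Expanded APIData" = Table.ExpandRecordColumn(Parsed, "APIData",
            {"Device", "Box"}, {"APIData.Device", "APIData.Box"}),
        // Keep splitting: Device into ID and Status...
        #"Expanded APIData.Device" = Table.ExpandRecordColumn(#"Expanded APIData", "APIData.Device",
            {"ID", "Status"}, {"APIData.Device.ID", "APIData.Device.Status"}),
        // ...and Box into Tracking, Measures, and Alarms
        #"Expanded APIData.Box" = Table.ExpandRecordColumn(#"Expanded APIData.Device", "APIData.Box",
            {"Tracking", "Measures", "Alarms"},
            {"APIData.Box.Tracking", "APIData.Box.Measures", "APIData.Box.Alarms"})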
Notice here, each column name is built by default from the object hierarchy it comes from. Also, “Record” is still showing, indicating that each column is still encapsulating a JSON object. We can continue to split, further and further, until we’re nearly at a flattened table. (As you do this, notice how your “Applied Steps” builds up to reflect each split.) When we come to an array, we get a different option,
In this case, it certainly seems like we would want to “Extract Values…” – the option even asks us what delimiter character we want to use (comma, please!). However, our list elements being JSON objects themselves causes problems for the designer. One option is to expand to new rows and do the desired concatenation in a DAX expression (e.g. using CONCATENATEX). However, I was convinced we could do this strictly with M, and sure enough, with some tinkering, the following seems to work (added as an extra column called “AlarmList”, to illustrate),
- #"Added Custom" = Table.AddColumn(#"Expanded APIData.Box.Measures.M3", "AlarmList",
- each
- if Table.IsEmpty(Table.FromRecords([APIData.Box.Alarms])) then null else
- Text.Combine(
- Table.ToList(
- Table.SelectColumns(
- Table.TransformColumnTypes(
- Table.FromRecords([APIData.Box.Alarms]),
- {{"Code", type text}}
- ),
- "Code"
- )
- ),
- ","
- )
- )
At this point, we’re going off-road into hand-written M, but it is possible and emphasizes the “power” part of Power Query.
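As an aside, if the Table.* gymnastics feel heavy, the same idea can be expressed a bit more compactly with List.Transform – I haven’t run this variant against the real feed, so treat it as a sketch rather than the production expression,

    #"Added Custom" = Table.AddColumn(#"Expanded APIData.Box.Measures.M3", "AlarmList",
        each
            // Same intent as above: null when there are no alarms, otherwise
            // a comma-separated list of the Code values
            if List.IsEmpty([APIData.Box.Alarms]) then null else
            Text.Combine(List.Transform([APIData.Box.Alarms], each Text.From([Code])), ","))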
In terms of deployment – where are we deploying this? Again: you could use a local SSAS instance or Azure. With Azure, this is the starting point,
Functionally, the only real difference between on-prem and AAS is that your server reference will be a URL with a scheme of “asazure” (for example, asazure://&lt;region&gt;.asazure.windows.net/&lt;servername&gt;). The key now is that we’ve got a tabular model, connected to our data source, that will be updated when the model is processed. This processing step gets worked into our larger solution, described next.
Working Solution
The tabular model is only one piece of our solution. Per the architecture diagram, we want to pull data out from the tabular model, into SQL tables for select fields. This enables our requirement to join with other relational data for reporting. To coordinate the flow of data, I’ll use SSIS (already in use by IT). The package structure might look like this,
Of note, I’m using a script task that performs processing of the tabular model, details of which I’ll make available later on GitHub. A “full process” is possible, but for performance’s sake, I opted to do incremental processing and loading. To achieve this, I determine the “last record” that was shredded, based on the IDENTITY column that’s present on our JSON landing table. (This value is carried through the entire process, so we can look at the maximum value we have in the final, shredded destination table – IoTEvent in my example.)
The data flow here is moving data from our AAS model to a SQL landing table,
The data source here is making a connection not to a SQL database, but to our AAS database, using an OLE DB (for Analysis Services) connection,
We could potentially use a direct mapping, but in my case the source is a DAX expression, to support incremental loading. To do that, we can use an SSIS variable that is composed based on the “last ID” mentioned previously. A preview of our DAX is shown here,
Notice that FILTER is being used, and the “> 0” is really based on “> the last IoTDumpID present in the target.” With this in place for our source, our destination can simply map our fields (with conversions performed for Unicode strings),
Realized Benefits – Model Changes
Sure enough, a new requirement came along a couple of months later: our data vendor was going to add a new data element to the existing JSON. If we didn’t need this new data, our existing solution would continue to work fine: new fields would simply not be imported into our model. However, we did want this data, and it was an array of values, where each array element should become a unique row in both our tabular model and our SQL representation. Using my example app, an analogue would be if our JSON changed to look like this,
{
  "Device": {
    "ID": 2388122,
    "Status": "Active"
  },
  "Box": {
    "EventTypes": [
      "Scan",
      "RFIDFence"
    ],
    "Tracking": "GSK922391",
    "Measures": {
      "M1": 821.2,
      "M2": 9328,
      "M3": {
        "Hi": 42.6,
        "Low": 18.3
      }
    },
    "Alarms": [
      {
        "Code": 4,
        "TriggeredBy": 2
      },
      {
        "Code": 12,
        "TriggeredBy": 3
      }
    ]
  }
}
Notice “EventTypes” is new, and we’d like a new column called EventType in our IoTEvent table, with one row per array element that repeats all the other details but carries that element’s event type text (we expect at least one array element, and that the elements are unique). All that was needed to make this work was to go into the tabular model, open “Table Properties”, invoke the “Power Query Editor”, and locate our new array,
Using zero code, I simply elected to “split” this field into multiple rows,
In my example, the 4th and 5th rows came from my sample JSON above. Keep in mind, if we’d gone the UDF route, things wouldn’t have been nearly as easy, since it wasn’t as simple as extracting a new value from the JSON: we had to change the very granularity of the target data.
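For the curious, the step the designer writes for this kind of split is Table.ExpandListColumn. A tiny, self-contained sketch of the behavior – using a stand-in table, since the real query operates on the flattened columns from earlier – looks like this,

    let
        // Stand-in for one flattened row that still carries the EventTypes list
        Sample = Table.FromRecords({
            [Tracking = "GSK922391", EventTypes = {"Scan", "RFIDFence"}]
        }),
        // Each list element becomes its own row; the other columns are repeated
        #"Expanded EventTypes" = Table.ExpandListColumn(Sample, "EventTypes")
    in
        #"Expanded EventTypes"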
With the change, our model now emits one or more rows for each “ID” value of our landing table. If we’d previously created, say, a primary key in our SQL representation of the data that used this ID value, we’d need to adjust it, expanding it to a compound key (including the EventType). In the end, the overall solution has succeeded well in production: the SSIS package runs for about three minutes, on the hour, and data has been flowing reliably for reporting. Making the change for the new field was shockingly simple and well received.
I may discuss related topics further, so stay tuned. Also, there are other related technologies that attempt to do what we’ve done with Power Query here, and I’m interested in hearing about other approaches, too! Did I not explain something well, above? Leave a comment and I’ll try to clarify.