Problem Statement
The Power BI REST API Datasets – Execute Queries endpoint has the following limitations:
- A maximum of 100,000 rows or 1,000,000 values per query (whichever is hit first). For example, if you query for 5 columns, you can get back a maximum of 100,000 rows. If you query for 20 columns, you can get back a maximum of 50,000 rows (1,000,000 divided by 20).
- A maximum of 15 MB of data per query. Once 15 MB is exceeded, the current row is completed, but no additional rows are written.
Is there an automated way to extract data from a Power BI dataset when the result exceeds 15 MB or 100,000 rows?
Prerequisites
- Power BI Report / Dataset
- Azure Data Factory / Synapse
- Azure SQL Database / SQL Database
- Azure Blob Storage
- Logic App (this can be skipped if we leverage the Web activity / Dataflow in Data Factory / Synapse)
In an earlier article, we leveraged Service Principal authentication for data extraction from Power BI; for a change of scenario, this use case leverages a Logic App with Azure AD user authentication to extract data from the dataset.
Solution
The pipeline uses 4 variables: RowCount, LowerRange, UpperRange, and FinalQuery.
- To overcome the row/size limitation, we need to get the overall row count (the total number of rows within the dataset table), partition the dataset into batch ranges, extract the data partition by partition, and finally merge the partitions into a single file.
- To get the overall row count, we leverage a Web activity to execute a query on the Power BI dataset.
URL : @concat('https://api.powerbi.com/v1.0/myorg/datasets/',pipeline().parameters.PBIDatasetId,'/executeQueries')
Body :
{"queries":[{"query":"@{pipeline().parameters.RowCountQuery}"}],"serializerSettings":{"includeNulls":true}}
where the value of the RowCountQuery parameter is a DAX query that returns the row count:
EVALUATE SUMMARIZE(DataSharkX,"Count Row",COUNT(DataSharkX[EmployeeID]))
Below is my Power BI Dataset table.
Output
We would now need to extract the row count from the JSON output and assign it to the RowCount variable.
Expression:
@replace(string(last(split(string(activity('Row Count').output.results[0].tables[0].rows[0]),':'))),'}','')
Output
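For reference, the executeQueries response for the row-count query has roughly the following shape (the value and the "[Count Row]" key are illustrative, matching the alias in the DAX above); the expression above simply isolates the number after the last ':' in the single returned row.
{
  "results": [
    {
      "tables": [
        {
          "rows": [
            { "[Count Row]": 270000 }
          ]
        }
      ]
    }
  ]
}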
For the current scenario, we plan to extract the data in batches of 90,000 rows. Hence, we have assigned the value 90000 to the PartitionRange pipeline parameter.
Now, based on the overall row count and the partition range, we need to identify the number of iterations/batches in which we will extract the data.
For our use case, we have used a Lookup Activity to execute a SQL query on an Azure SQL Database.
-- Query Value: builds a '|'-delimited list of lower;upper rank ranges from the total row count and the partition size
DECLARE @@initialize INT = 1;
DECLARE @@partition INT = @{pipeline().parameters.PartitionRange};
DECLARE @@count INT = @{variables('RowCount')};
DECLARE @@maxrun INT = CEILING(@@count / @@partition);
DECLARE @@dividendcount INT = @@partition;
DECLARE @@con VARCHAR(MAX) = (SELECT CONCAT(@@initialize, ';', @@dividendcount));
DECLARE @@counter INT = 0;
WHILE (@@counter < @@maxrun)
BEGIN
SET @@initialize = (SELECT @@dividendcount);
SET @@dividendcount = (SELECT @@partition + @@dividendcount);
SET @@con = (SELECT CONCAT(@@con, '|', @@initialize, ';', @@dividendcount));
SET @@counter = (SELECT @@counter + 1);
END
SELECT @@con AS ConValue;
Output
Here we have generated 3 partitions with the following ranges:
- 1;90000
- 90000;180000
- 180000;270000
Split the ConValue (the output from step #4) on '|' and iterate over it with a ForEach activity.
Items :
@split(string(activity('Generate Partitions').output.value[0].ConValue),'|')
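With the sample ConValue above, the Items expression yields an array like the following, and each element becomes item() inside the ForEach activity:
[ "1;90000", "90000;180000", "180000;270000" ]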
6. Within the ForEach activity
Set the values of the LowerRange and UpperRange variables (for the item '1;90000', LowerRange evaluates to 1 and UpperRange to 90000).
- LowerRange
Expression :
@substring(item(),0,indexOf(item(),';'))
Output
- UpperRange
Expression :
@substring(item(),add(indexOf(item(),';'),1),sub(sub(length(item()),indexOf(item(),';')),1))
- Generate the FinalQuery to execute on the Power BI dataset and extract the data in batches.
- The DAX query below is equivalent to the ROW_NUMBER function in T-SQL: we generate a Rank column over the source table and, in every iteration, filter the rows whose Rank falls between LowerRange and UpperRange.
DEFINE
VAR Employee1 =
SUMMARIZECOLUMNS(
DataSharkX[EmployeeID],
DataSharkX[EmployeeName],
DataSharkX[Rating],
"EID",
VALUES(DataSharkX[EmployeeID])
)
VAR EmployeeTable =
ADDCOLUMNS(
Employee1,
"Rank",
RANK.EQ([EID], DataSharkX[EmployeeID], ASC)
)
EVALUATE
FILTER(
EmployeeTable,
[Rank] >= @{variables('LowerRange')} &&
[Rank] < @{variables('UpperRange')}
)
Output
- Trigger the Logic App synchronously (via the Webhook activity); the Logic App executes this FinalQuery on the Power BI dataset and uploads the resulting file to Azure Blob Storage. A sketch of the Webhook body is shown below.
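As a sketch (property names are assumed to match the Logic App trigger parameters, and quote characters inside the DAX may need escaping depending on how the body is constructed), the Webhook activity body could look like:
Body :
{
  "Query": "@{variables('FinalQuery')}",
  "Iteration": "@{variables('LowerRange')}"
}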
Logic App Design
The highlighted SharePoint section can be ignored in the current context if need be.
- HTTP Request
Query and Iteration are the input parameters that map to the FinalQuery and LowerRange variable values passed from ADF/Synapse; a sample trigger schema is sketched below.
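A minimal sketch of the request body JSON schema for the "When an HTTP request is received" trigger, assuming the parameter names above; callBackUri is appended to the body automatically by the ADF/Synapse Webhook activity:
{
  "type": "object",
  "properties": {
    "Query": { "type": "string" },
    "Iteration": { "type": "string" },
    "callBackUri": { "type": "string" }
  }
}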
- Run Query against Dataset
Sign in with the user ID that has the necessary access to the Power BI dataset, and select the appropriate workspace name and dataset.
- Create CSV Table
- Create Blob
Connect to the resource group and provide the storage account details.
The blob name uses the expression below:
concat(utcNow(),triggerBody()?['Iteration'],'pbi.csv')
- (Optional) Create File (SharePoint)
where the file name uses the expression below:
concat(utcNow(),triggerBody()?['Iteration'],'pbi.csv')
- For synchronous calls from the ADF/Synapse Webhook activity, instead of using a Request-Response pattern, we use an HTTP POST to the callBackUri provided in the request body (a sketch of the HTTP action follows).
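A minimal sketch of that final HTTP action, assuming the default Webhook behavior where a POST to the callBackUri marks the activity as complete (the body shown is illustrative and can be left empty):
Method : POST
URI : @{triggerBody()?['callBackUri']}
Body :
{}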
Output after all iterations
To merge the individual partitioned files into a single file, leverage the Copy activity (a sketch of the relevant source/sink settings follows the screenshots below).
Source Setting
Source Dataset
Sink Setting
Sink Dataset
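A minimal sketch of the relevant Copy activity source/sink settings for a delimited-text (CSV) dataset, assuming the wildcard pattern matches the partition file names generated by the Logic App:
{
  "source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
      "type": "AzureBlobStorageReadSettings",
      "wildcardFileName": "*pbi.csv"
    }
  },
  "sink": {
    "type": "DelimitedTextSink",
    "storeSettings": {
      "type": "AzureBlobStorageWriteSettings",
      "copyBehavior": "MergeFiles"
    }
  }
}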
Output
Finally, delete the individual partitioned data files.
Note. One can leverage the Web activity in place of the Webhook/Logic App to execute the queries as well, but the partition range needs to be much lower than 100,000 because the Web activity has a smaller response size limit.
Error message when the response size is too large
In such a case, one can use the Dataflow activity to execute the queries instead.
File Output