Problem Statement
Data processing issues that can be resolved with a big data implementation
For example, the rapidly growing digital world brings us numerous benefits but, on the other hand, gives rise to various kinds of fraud as well. As part of auditing, auditors verify transactions to identify fraudulent or fake ones.
As I understand it, auditors collect samples from the full set of transactions and analyse only those; they then base their decisions on the output of that sample analysis.
This is not accurate, since the decision rests on samples rather than on all transactions, yet analysing every transaction causes serious performance issues.
For example, when analysing the transactions of larger companies, the number of transactions can run into the millions or beyond. Similarly, there are any number of scenarios in which we must analyse huge volumes of transactions. A SQL-based approach to transaction volumes of this size is neither easy nor efficient.
Consider one specific scenario.
We faced many issues while dealing with huge data sets: for example, SQL tables holding a huge number of records, and stored procedures that take days to process them.
As part of journal entry analysis and processing, we run a set of fraud detection logic, which is the most important input auditors need to audit accurately.
For our larger clients the transaction volume is huge, and the current implementation, based on SQL stored procedures, takes 3 to 4 days to process it.
Big data tools such as Spark SQL can resolve the issues above. With traditional SQL methods we cannot exploit parallel processing for faster data analysis; with big data technologies this is easily achieved.
The main goal behind this initiative was to find a better solution to improve the performance of the fraud detection mechanism: we wanted all of the fraud detection logic to finish against even our biggest clients within 4 to 6 hours at most.
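To make the parallelism claim concrete: Spark can split even the initial read of a SQL table across executors by partitioning on a numeric key. The sketch below is a minimal illustration only, not our production code; the connection string, table, and column names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ParallelReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JournalEntryParallelRead")
      .getOrCreate()

    // Read the journal entry table over JDBC, split into 64 partitions
    // on a numeric key so executors can scan ranges concurrently.
    // Server, table, and column names here are illustrative only.
    val journalEntries = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://<server>;databaseName=<db>")
      .option("dbtable", "dbo.JournalEntryTransactions")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("partitionColumn", "TransactionId")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "64")
      .load()

    journalEntries.createOrReplaceTempView("journal_entries")
    // Any fraud detection query now runs distributed across the cluster.
    spark.sql("SELECT COUNT(*) FROM journal_entries").show()
  }
}
```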
Current State - SQL-based approach
In the current implementation we have a separate stored procedure for each piece of fraud detection logic, and they are triggered one after another. Once the process starts, it runs the logic over every record in the journal entry transaction table. There is no batch-wise operation: it starts at the first record and runs through to the last.
On average, the entire processing takes 3 to 4 days to complete.
To improve performance we have done the following:
- SQL file partitioning
- NUMA node settings
- Columnstore indexes
- Blocking all other parallel operations on these tables during processing
Even so, performance is still not up to the mark, and we also see intermittent issues.
Expectation with the new approach - Spark SQL-based approach
The problem we faced was the performance of stored procedures that run a tremendous amount of fraud detection logic over a huge number of SQL records.
While working with data integration techniques, we got the idea of using high-performance big data mechanisms to process the data in parallel with the most efficient processing methods available.
The execution time of these stored procedures varies with the logic, the data in each specific scenario, the number of concurrent operations, and so on. On average it ranged from 12 hours to 3 days.
A Spark SQL-based approach gives us an easier and more efficient way to achieve the same results.
The problem above can be readily solved by applying the required logic within the architecture described below.
Solution – Implement big data for finding fraud/fake transactions
We implemented Spark SQL using HDInsight clusters and Scala. This increased execution performance dramatically: with the new implementation, the average execution time dropped from 12 hours-3 days to 1-4 hours.
Achieving a performance improvement of this magnitude was a great result.
One problem we faced during the Spark SQL implementation was the effort needed to create Spark SQL queries for the fraud detection logic.
Rather than writing new queries from scratch, we converted the existing stored procedures to Spark SQL. We could achieve this with a small number of changes, such as replacing a few built-in SQL functions that Spark SQL does not support.
Process Flow
Convert existing SQL stored procedures to Spark SQL
With minimal effort we can convert the existing SQL stored procedures to Spark SQL. Some built-in functions are not supported in Spark SQL, so those need to be replaced with equivalent logic, as in the sketch below.
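As an illustration of the kind of substitutions involved (the query fragment and column names are hypothetical, not from our actual procedures), a few T-SQL built-ins and the Spark SQL equivalents one would swap in:

```scala
// Illustrative only: common T-SQL built-ins that Spark SQL does not
// support, mapped to the substitutes used during conversion.
object FunctionReplacements {
  // Original stored procedure fragment (T-SQL):
  //   SELECT ISNULL(Amount, 0), GETDATE(), CHARINDEX('REV', DocRef)
  //   FROM dbo.JournalEntryTransactions
  // Converted Spark SQL equivalent:
  val convertedQuery: String =
    """SELECT coalesce(Amount, 0),   -- ISNULL(x, y)      -> coalesce(x, y)
      |       current_timestamp(),   -- GETDATE()         -> current_timestamp()
      |       instr(DocRef, 'REV')   -- CHARINDEX(sub, s) -> instr(s, sub)
      |FROM journal_entries""".stripMargin
}
```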
Create jobs in Scala
We need jobs to execute the Spark SQL created in the step above.
We can use the JetBrains IntelliJ IDEA IDE to develop the Scala jobs. Each job executes the Spark SQL we created for the fraud detection logic; a minimal sketch follows.
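A minimal sketch of such a job, assuming the journal entries have already been landed in cluster storage as Parquet; the object name, storage paths, and the placeholder rule body are illustrative, not our actual implementation:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical job skeleton for one fraud detection logic (FDL103).
object FraudDetectionJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FraudDetectionJob-FDL103")
      .getOrCreate()

    // Journal entries previously landed in cluster storage as Parquet.
    spark.read
      .parquet("wasbs://data@<account>.blob.core.windows.net/journal_entries")
      .createOrReplaceTempView("journal_entries")

    // Execute the converted fraud detection logic (placeholder rule body).
    val flagged = spark.sql(
      """SELECT TransactionId, Amount, PostedBy
        |FROM journal_entries
        |WHERE Amount >= 1000000
        |  AND hour(PostedAt) NOT BETWEEN 8 AND 18
        |""".stripMargin)

    // Persist the flagged transactions for the auditors to review.
    flagged.write.mode("overwrite")
      .parquet("wasbs://results@<account>.blob.core.windows.net/fdl103_flags")

    spark.stop()
  }
}
```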
Create clusters with HDInsight and host the Scala jobs
Once the HDInsight cluster has been created successfully, we host the Scala jobs on it. These jobs can be triggered from different platforms; in our case we triggered them from .NET code, since our existing implementation was in .NET.
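The trigger mechanism is not spelled out above; one common way to start a hosted Spark job on HDInsight from external code is the cluster's Livy REST endpoint, so the following is an assumption-laden sketch only. It is shown in Scala for consistency with the other examples (our actual trigger was .NET), and the cluster name, credentials, and jar path are placeholders.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Base64

// Assumption: the hosted job is submitted as a Livy batch over HTTPS.
object SubmitViaLivy {
  def main(args: Array[String]): Unit = {
    val auth = Base64.getEncoder.encodeToString("admin:<password>".getBytes("UTF-8"))
    val body =
      """{
        |  "file": "wasbs://jobs@<account>.blob.core.windows.net/fraud-detection.jar",
        |  "className": "FraudDetectionJob"
        |}""".stripMargin

    val request = HttpRequest.newBuilder()
      .uri(URI.create("https://<cluster>.azurehdinsight.net/livy/batches"))
      .header("Content-Type", "application/json")
      .header("X-Requested-By", "admin")
      .header("Authorization", s"Basic $auth")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(s"Livy responded: ${response.statusCode()} ${response.body()}")
  }
}
```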
Implementing big data here improves the quality of the auditing and speeds it up, since the analysis is done end to end by tools.
There are any number of rules that we (auditors) use to identify fake transactions, and these rules can be folded into the big data implementation; a sketch of running several rules concurrently follows.
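Because a SparkSession is thread-safe, several rule queries can be submitted concurrently from one driver. This is a minimal sketch under that assumption; the rule bodies below are invented placeholders, and only the FDL identifiers come from the results tables further down.

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Sketch: each fraud detection rule kept as a named Spark SQL query,
// all submitted concurrently against the same session.
object RuleRunner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FraudRuleRunner").getOrCreate()
    spark.read
      .parquet("wasbs://data@<account>.blob.core.windows.net/journal_entries")
      .createOrReplaceTempView("journal_entries")

    // Placeholder rule bodies; only the FDL ids are from the results below.
    val rules = Seq(
      "FDL102" -> "SELECT * FROM journal_entries WHERE Amount < 0",
      "FDL103" -> "SELECT * FROM journal_entries WHERE PostedBy = ApprovedBy",
      "FDL108" -> "SELECT * FROM journal_entries WHERE dayofweek(PostedAt) IN (1, 7)"
    )

    // Run each rule as a concurrent Spark job; results land side by side.
    val runs = rules.map { case (name, query) =>
      Future {
        spark.sql(query).write.mode("overwrite")
          .parquet(s"wasbs://results@<account>.blob.core.windows.net/$name")
      }
    }
    Await.result(Future.sequence(runs), Duration.Inf)
    spark.stop()
  }
}
```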
Assess Impact
With the new big data-oriented approach we achieved a huge performance gain. With traditional SQL methods we could not exploit parallel processing and faster data analysis; with big data technologies this was easily achieved.
The main goal behind this initiative was to find a better solution to improve the performance of the fraud detection mechanism. We were aiming to run all of the fraud detection logic against even our biggest clients within 4 to 6 hours at most, and we achieved exactly that.
Result
Stored Procedure Approach

| Sl. No | Fraud Detection Logic | No. of Records | Execution Time | Description |
|--------|-----------------------|----------------|----------------|-------------|
| 1 | FDL103 | 10+ Cr (100M+) | 14+ hrs | Timed out on 3 of 10 attempts |
| 2 | FDL108 | 10+ Cr (100M+) | 9+ hrs | Timed out on 1 of 10 attempts |
| 3 | FDL102 | 10+ Cr (100M+) | 12+ hrs | Timed out on 3 of 10 attempts |
Spark SQL Approach

| Sl. No | Fraud Detection Logic | No. of Records | Execution Time | Description |
|--------|-----------------------|----------------|----------------|-------------|
| 1 | FDL103 | 10+ Cr (100M+) | 45 minutes | No errors |
| 2 | FDL108 | 10+ Cr (100M+) | 27 minutes | No errors |
| 3 | FDL102 | 10+ Cr (100M+) | 31 minutes | No errors |