Bulk Processing vs Single Item Processing
Avoid bulk processing whenever possible; focus on single item processing.
Popularity and Problems of Bulk Processing
In today's world, many software applications revolve around large amounts of data and the problems of processing it. A large amount of data means many rows or objects in a database or other kind of storage, where many of them need to go through the same kind of process. When developers solve such assignments, they tend to think that the obvious answer is bulk processing.
I must say, there is nothing wrong with bulk processing in specific circumstances. It's not a silver bullet, though. In many concrete cases (which also happen to be the most common ones), you should avoid bulk processing and favor single item processing instead.
Let's assume that we are writing a fairly simple application, where a user provides some kind of input, and then the program makes a single kind of modification to many rows in the database. If that's all the program does, we should not worry at all and simply use bulk processing. Indeed, if we look at the code or try to debug it, we expect that many rows will be looped through and modified by the program code.
Unfortunately, most of the applications we write professionally are not that simple, and bulk processing can work against our productivity in the long run.
To make this point clearer, let's assume that the program mentioned above needs to be upgraded so that it makes several kinds of changes to the same set of rows, for the same kind of input. Specifically, the database holds a list of employees, their bonuses, and their addresses. The user enters a bonus percentage (multiplier), which is the same for all employees and based on the company's performance. The steps the program should perform include:
- For each employee record, calculate bonus amount based on the company's performance multiplier.
- Then, for each employee record, find the hiring date, and subtract the prorated amount from the bonus amount.
- Once these numbers are found, send a request to the HR department to issue a check for that amount to each employee's address.
It's an unfortunate reality that requirements are often phrased exactly as written above. Don't get me wrong - it's absolutely natural to speak about expectations in this manner - just by listing all the things that need to happen and describing targets for each kind of operation. It's only unfortunate from the technical standpoint, because developers often take it literally and implement it as written - as bulk processing - without applying any modeling or design thinking to the problem.
So, if we naively design it as it sounds, we will have three consecutive loops, one for each item in the above list of steps; each loop goes through all the records (either employees, their bonuses, or their addresses) and performs the actions described above for each row it finds.
Below is sample pseudo code for it, kept very naive for the sake of example. In real-life code, the different loops are spread across several method bodies, all modifying the same collection in a series of procedures that pass it around everywhere (do you recall seeing something like that?):
//bulk processing - warning: not recommended!
public void ProcessAllEmployees(int bonusMultiplier)
{
    var allEmployees = GetAllEmployees();
    ApplyMultiplier(allEmployees, bonusMultiplier); //loop inside.
    SubtractProratedBonusDueToHiringDate(allEmployees); //loop inside.
    SendHRCheckRequest(allEmployees); //loop inside.
}
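To make the "loop inside" comments concrete, here is a minimal sketch of what a helper like ApplyMultiplier typically hides - yet another pass over the same shared collection. It's written in Java for runnability, and the Employee fields and bonus formula are assumptions for illustration:

```java
import java.util.List;

// Hypothetical Employee shape; field names are assumptions for illustration.
class Employee {
    double baseSalary;
    double bonus;
}

class BulkSteps {
    // What a helper like ApplyMultiplier typically hides: another pass
    // over the same shared collection, mutating every element in place.
    static void applyMultiplier(List<Employee> allEmployees, int bonusMultiplier) {
        for (Employee employee : allEmployees) {
            employee.bonus = employee.baseSalary * bonusMultiplier / 100.0;
        }
    }
}
```

Each of the other helpers repeats the same pattern, so the collection is walked three separate times.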
This is bulk processing, and here are the problems it brings:
- It's hard to grasp the overall idea behind the code once it's written and time has passed. We just see three loops doing something, somehow related to each other. Code authors try to simplify matters by adding many comments describing what each line does and how it connects to the previous line. This indeed helps, but then we spend time reading and understanding comments, while we should be learning and improving the code instead.
- It's hard to catch problems in the resulting outcome of the program. If one of the employees didn't receive a check, we can't tell whether it's because the first loop malfunctioned, or the second, or the third. We just know that the results of all those loops are wrong as a whole. Data leaks from one loop into the next, but there is no clear indication of what could go wrong there. Maybe we just didn't pass the data correctly between the loops? Or is the problem in a loop's body? Which loop's body, then? Loops, in general, loosen the logic and make the big picture hard to digest; that's why they get in the way when we investigate issues or try to debug the code.
- It's hard to unit test the loops that exist due to bulk processing. Okay, I can unit test the case when the loop is empty, and when there is a single element in it; now I suspect it may not work properly when there are two items; what about 10? What about 100+? Will I ever be able to say that my unit tests cover enough cases for the loop (let alone three loops in a single function)? I can also cross my fingers and hope that it will work fine for any number of records, but I won't ever be sure. Unit tests covering the same thing for different numbers of loop iterations also seem redundant.
- It's hard to scale bulk processing code. When we want to horizontally scale our application, we add servers, trying to evenly distribute the load across them. However, if we have loops and process all rows in one go (and thus on a single node, while all other nodes are resting), then what's the point of horizontal scaling? We can also split the work between the different kinds of loops - so one node does one kind of operation using a loop, and other nodes do the other kinds. But what if the first step of the operation is heavier than the others and its loop cannot complete in reasonable time? Then all other nodes sit waiting until the first loop is over - so the throughput is far from optimal. It becomes a very challenging task if we go this route and try to figure out the optimal number of loop iterations per node and per operation type, and then keep calibrating it over time, since we keep changing the loop's body after all. Why bother with these problems at all, when the whole industry around scaling tries to take us in the opposite direction - item by item processing?
Workflow and Benefits of Single Item Processing
The requirements described above can be rewritten in a slightly different manner, which makes the relevance of single item processing apparent.
Take a single employee and run the following actions for it:
- Given the multiplier, calculate bonus amount for the employee.
- Given the employee's hiring date and bonus, calculate prorated bonus amount for the employee.
- Given the employee's most recent address and prorated bonus amount, issue a request to the HR system to send a check.
Repeat the steps above for each employee in the company.
Though uncommon, if the requirements were written like I just showed above, even the same naive developers would probably think about writing the code without loops. Maybe just one loop which initiates the process for a single employee, but not three different loops in a row. This approach is not bulk processing anymore, since the flow is no longer expressed as loops flowing into each other. Instead, we have a continuous chain of actions for a single employee. This makes our intentions apparent, since we just do the three steps in a row.
Pseudo code is below. In projects with senior technical people, most of these methods would be called on the employee instance itself. The code below looks like procedural, data-driven code just for simplicity's sake:
//single item processing - recommended!
public void ProcessAllEmployees(int bonusMultiplier)
{
    var allEmployees = GetAllEmployees();
    foreach (var employee in allEmployees)
        ProcessSingleEmployee(employee, bonusMultiplier); //whole chain for one record.
}
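The per-employee chain itself could look like the following sketch, written in Java for runnability. The method names, Employee fields, proration rule, and HR call are all illustrative assumptions, not part of the original pseudo code:

```java
import java.time.LocalDate;

// Hypothetical Employee shape; field names are assumptions for illustration.
class Employee {
    double baseSalary;
    LocalDate hiringDate;
    String address;
}

class SingleItemProcessor {
    // The whole chain for one record: no loop in sight until the caller.
    void processSingleEmployee(Employee employee, int bonusMultiplier) {
        double bonus = calculateBonus(employee, bonusMultiplier);
        double prorated = subtractProratedAmount(employee, bonus);
        sendHRCheckRequest(employee.address, prorated);
    }

    double calculateBonus(Employee employee, int bonusMultiplier) {
        return employee.baseSalary * bonusMultiplier / 100.0;
    }

    // Assumed proration rule: employees hired during the current year lose
    // one twelfth of the bonus for each full month before their hiring month.
    double subtractProratedAmount(Employee employee, double bonus) {
        LocalDate yearStart = LocalDate.now().withDayOfYear(1);
        if (employee.hiringDate.isBefore(yearStart)) {
            return bonus; // served the full year, nothing to subtract
        }
        int monthsMissed = employee.hiringDate.getMonthValue() - 1;
        return bonus - bonus * monthsMissed / 12.0;
    }

    void sendHRCheckRequest(String address, double amount) {
        // A real system would call the HR service here; this is a stub.
        System.out.printf("Requesting a check of %.2f to %s%n", amount, address);
    }
}
```

Notice that each step reads like a sentence of the rewritten requirements, one employee at a time.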
Here are the benefits that we gain, compared to the problems described with bulk processing before:
- The overall idea is very clearly expressed in the code. We are not distracted by loops in between the steps, not until the single employee's processing is over; then we turn to the next employee record, which has nothing to do with the previous one (no pending actions for the previous record, as with bulk processing). We also don't need to handle passing or holding data between loops. We just write clear logic for the employee record.
- When debugging a problem, we can simply debug the code specific to a single employee. No need to debug how data flows between the loops representing different steps. If a single employee didn't receive a check, we can simply rerun the single employee method for that one record and see where it went wrong.
- Unit testing has never been easier. You write a unit test for a single employee's processing logic only. Your unit test invokes this processing and checks whether the results for this single employee are as expected. Another unit test, much simpler, just checks that the same method is invoked for all the employees in the loop, regardless of the total number of employees. Since we unit test the whole flow for a single employee separately, there is no need to keep testing the behavior inside a loop. Just test that the behavior's entry point (method) is invoked, and that's all. Unit testing is simplified greatly because the tests only need to cover a single trivial loop, not three loops coupled with each other.
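A minimal sketch of this testing split, in Java: the per-employee logic gets an ordinary value-based test, while the loop only needs proof that the entry point is invoked once per employee. A plain counter stands in for a mocking framework, and all names are illustrative assumptions:

```java
import java.util.List;

// Pure per-employee logic: a single test case covers it, with no
// loop-size permutations needed.
class PerEmployeeLogic {
    static double bonusFor(double baseSalary, int bonusMultiplier) {
        return baseSalary * bonusMultiplier / 100.0;
    }
}

// The loop only needs one test: that the entry point is invoked once per
// employee. The invocation counter stands in for a mocking framework.
class LoopDispatch {
    int invocations = 0;

    void processAll(List<String> employees) {
        for (String employee : employees) {
            processSingleEmployee(employee);
        }
    }

    void processSingleEmployee(String employee) {
        invocations++;
    }
}
```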
- Scaling follows best practices - deploy the single employee processing code to every node, each waiting for a command to process one particular record. Then dispatch commands, each referring to a single employee of the company. Thanks to messaging infrastructures and the way they handle commands, messages will be delivered to the nodes evenly, until all the employee records are processed. Throughput is optimal, since no node waits for others to finish processing.
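The dispatch model can be simulated in-process. In the Java sketch below (all names are illustrative assumptions), a `LinkedBlockingQueue` stands in for the message broker and a small thread pool stands in for the nodes; each node pulls one employee command at a time until the queue is drained:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class CommandDispatcher {
    // Dispatches one command per employee id; worker threads ("nodes") pull
    // commands until the queue is drained. Returns how many were processed.
    static int dispatchAll(int employeeCount, int nodeCount) {
        LinkedBlockingQueue<Integer> commands = new LinkedBlockingQueue<>();
        for (int id = 1; id <= employeeCount; id++) {
            commands.add(id); // one command per employee record
        }
        AtomicInteger processed = new AtomicInteger();
        ExecutorService nodes = Executors.newFixedThreadPool(nodeCount);
        for (int n = 0; n < nodeCount; n++) {
            nodes.submit(() -> {
                Integer employeeId;
                while ((employeeId = commands.poll()) != null) {
                    // a real node would run the single employee chain here
                    processed.incrementAndGet();
                }
            });
        }
        nodes.shutdown();
        try {
            nodes.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```

Whichever node is free grabs the next command, so the load balances itself without any per-node calibration.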
When to Prefer Bulk Processing
The example described above - although the most common kind among requirements - is somewhat specific: processing a single employee record is not tied to other employee records. If I had been asked to calculate the average age of all the employees, we wouldn't be able to solve it without bulk processing. That's because to calculate the average, we need to loop through all the records (or all the ages, specifically) within a single loop, sum them up, and then divide the sum by the total number of rows.
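A quick Java sketch of such an inherently bulk computation (the empty-list convention is an assumption):

```java
import java.util.List;

class AverageAge {
    // Inherently bulk: the result depends on every record at once,
    // so a single pass over the whole set cannot be avoided.
    static double averageAge(List<Integer> ages) {
        if (ages.isEmpty()) {
            return 0.0; // assumed convention for an empty company
        }
        long sum = 0;
        for (int age : ages) {
            sum += age;
        }
        return (double) sum / ages.size();
    }
}
```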
It's very important to understand that most of the time bulk processing can be avoided, even though the requirements may sound otherwise. Given the strong benefits of the single item processing approach, I highly recommend that you keep an eye out for opportunities to use it.