Optimizing assignment of people to projects is a very complex problem and classical optimization techniques are not very useful. The topic this post is a project assignment optimization problem where people should be assigned to projects in a way that will minimize the cost.
This kind of optimization problems involving discrete or categorical variables are called combinatorial optimization problems and they generally don’t have analytical solution. You have to resort to other non conventional techniques. However, these alternative techniques won’t guarantee optimal solution. Simulated Annealing is one such technique and broadly comes under category of algorithms called Stochastic Optimization.
We will discuss solution of the project assignment optimization problem using Simulated Annealing implemented on Spark. The Scala based implementation is available in my open source project avenir.
In optimization problem, given a set of variables and cost as a function of the variables, we want to find the combination of variable values that will result in minimum cost. Additionally, if there are constraints on the variables, then it’s a constrained optimization problem.
Mathematically, it can be stated as follows: minimize f(x), subject to g1(x), g2(x), .. gn(x) where x is the vector of decision variables, f(x) is the cost or objective function and gi(x) are the constraints.
Classical optimization algorithms e.g., linear programming, quadratic programming etc work well when the following restrictions apply
- All variables are numeric
- Cost function is analytically expressible
- Cost function has well define gradients (for some algorithms)
- Number of variables and constraints are within some reasonable limit
However these restrictions are not applicable for many real world problems and they often have these characteristics
- Variables are discrete or categorical. These are know combinatorial optimization problem
- Cost function is not mathematically expressible. However, it can be calculated given the variables.
- Even if the variables are numeric and cost function is matehematically expressible, it’s not not convex i.e., multi model
- Constraints are not analytically expressible.
- Have too many decision variables and / or constraints
For such optimization problems, one has to resort to stochastic optimization techniques which are non deterministic and won’t guarantee the optimal solution.
Stochastic Optimization Techniques
Stochastic optimization algorithms are the only viable option for many real world problems and they have the following characteristics.
- As the name suggests, there is an element of randomness in these algorithms and they are no deterministic. Repeated execution under the same settings will most likely produce different results.
- They are heuristic in nature with the focus being generation of good enough approximate solution, They trade off accuracy for reduced computation efforts.
- They can not provide a rigorous explanation of why an algorithms works. They are inductive and not deductive in nature.
- Some of the algorithms are inspired by natural or biological phenomena.
The search strategies for the algorithms are either trajectory based or population based. With trajectory based strategy, it starts with an initial solution and gradually moves towards the optimal solution, while keeping track of the best solution found so far. Simulated annealing uses this strategy.
In population based strategy, an initial population is used to start. Then the population is gradually improved through various techniques borrowed from natural evolution and genetics, while keeping track of the best solution found so far.
Jason Brownlee in his excellent book Clever Algorithms: Nature-Inspired Programming Recipes categorizes the algorithms into 6 categories as follows
- Stochastic : Introduces randomness into the heuristics e.g., Greedy Randomized Adoptive Search
- Evolutionary: Inspired by theory of evolution and natural selection e.g., Differential Evolution
- Physical: Inspired by physical and social systems e.g., Simulated Annealing
- Probabilistic: Uses probabilistic model for candidate solutions e.g., Cross Entropy Method
- Swarm: Uses collective intelligence of homogeneous agents in nature e.g., Ant Colony System
- Immune: Inspired by adaptive immune system of vertebrates e.g., Negative Selection
Simulated Annealing comes under the category of Physical. It draws on the physical process of a cooling material.
Now that we have set the context of various stochastic optimization techniques, let’s delve into the specific problem we are trying to solve. There are two main entities : projects and people. A project has the following attributes
- Start date
- End date
- Skill set required
We need to assign people to projects in a way that will minimize cost.Here are the attributes of a person
- Skill set
If there are m projects and n persons, it may appear to be a simple combinatorial problem with the number of combination to explore is m! / ((m-n)!. However. there is a caveat. Same people may be assigned to multiple projects, subject to some constraints. So the number of possible solutions is far more than what the factorial formula implies.
In our example, the cost that we alluded to, has the following components. Some of them are independent of the person assigned to a project.
- Per diem
- Cost due to skill mismatch
Let’s discuss constraint for this problem. The only constraint is that if the same person is assigned to multiple projects, there should be a minimum gap between the end date of one project and the start date of the next project assigned to the same person.
As mentioned before, Simulated Annealing (SA) belongs to the category of Physical process inspired algorithms. It is inspired by the physical process of material cooling and how material cools into a minimum energy crystalline structure.
Here is an analogy for an intuitive understanding of the algorithm. Imagine a bouncing ball that is dropped from above into a mountainous region with many valleys and peaks. The valleys correspond to the various minimum points in the cost function. The lowest point of the lowest valley corresponds to the global minimum.
In the beginning, the ball has high energy i.e. high temperature in SA context, and it’s more likely to escape one valley into another. In the SA context, it means we are able to escape local minimum for more exploration and hopefully navigate towards a global minimum.
As the energy level drops, the ball is less able to bounce off one valley into another. In SA terms, it’s ability to escape local minimum decreases. In other words, in the beginning the algorithm does more exploration of the search space with more ability to escape local minimum. As iteration continues, the focus shifts to exploitation of already known minimum. Here is the algorithm is pseudo code
create a solution as current solution get cost for the current solution best solution = current solution best solution cost = current solution cost for 1 to max iteration get next solution get cost for the next solution if next solution cost < current solution cost current solution = next solution current solution cost = next solution cost if current solution cost < best solutiuon cost best solution = current solution best solution cost = current solution cost else calculate acceptanace probability based on current temerature if random() < acceptanace probability current solution = next solution current solution cost = next solution cost decrease temperature return best solution and best solution cost
The domain callback object is responsible for creating the initial solution, subsequent solutions and the cost of a solution.
The initial solution is created randomly. The next solution is created by making a random mutation of the current solution. For project assignment, we select a project randomly and then replace assignment with a randomly selected person.
Whenever a solution is created or mutated to a create another solution, it’s always checked for constraints and only a valid solution is returned.
Sometimes, there may be islands of valid solutions in the search space and it’s not possible to navigate from one valid solution to another without going through one or more invalid solutions. In such cases invalid solutions may be returned by the domain callback object. Such solutions are given higher than normal cost.
Search Problem Domain
The search domain is problem specific and it needs to be decoupled from the core algorithm, so that the algorithm can be used as a general purpose solution. I have defined an abstract base class to encapsulate the problem domain specific logic. This base class will also be used by other stochastic optimization algorithms I will be implementing in future.
For any new problem, the base class needs to be extended and all the abstract methods implemented. An instance of this class is used as the callback handler by the Spark implementation of SA. For the project assignment problem, I have extended this class to implement all the problem specific logic.
The solution is represented as a string. For your specific problem, you have to come up with a string encoding format to represent a solution. For the current problem, the solution is a delimeter separated list of assignments, where each assignment is delimeter separated project ID and employee ID.
Meta Optimization and Parallel Processing
Many stochastic optimization, including Simulated Annealing are sequential and state full in nature. You might be wondering how Spark helps, given that Spark processing is parallel and state less.
It turns out that although the core algorithm can not be parallelized, we can leverage parallelism in Spark at the meta optimization level i.e. a level above the core algorithm, as follows
- Multiple restarts : Since the algorithm is non deterministic, typically we run it multiple times and every time with a random starting point and then pick the best solution among the multiple runs. These executions can be run in parallel
- Parameter tuning: The algorithm uses a set of configuration parameters. It should be re executed for different combination of values of these parameters. These executions can also be run in parallel.
I have used only multiple restart parallel processing in my implementation. Each restart is defined by an initial solution. the initial solutions can be auto generated or provided through a file. The set of initial solutions define the Spark input RDD.
The Spark implemetation include only map partition operations. The domain callback object is broadcast to spark nodes. Each partition before it starts processing, creates a clone of the object. For each each record i.e initial solution in the partition the SA algorithm is executed. the output consists of the best solution and associated cost.
Here is some sample output. We used 8 initial solutions and each for invocation of SA, the iteration count was 300. Each initial solution yielded one lowest cost solution.
(6FU70K4HGKRJ:BPA8AF1R61;K31WEUG60UUF:3NDZVY6684;2WU834GB4WZ3:29OYMGZED9;G9DW6I5U0162:KSA5FY237W;57A87XKI8595:VGRTL83D13;S944IWM236Y0:0S64F5G7WF;ZYJT39IU0NTF:QEA56J231J;KDXRJS6B7S1O:29OYMGZED9;733W31P7B9J5:BCRTP73603;LF852I5MONC3:0S64F5G7WF,53.526) (6FU70K4HGKRJ:29OYMGZED9;K31WEUG60UUF:VGRTL83D13;2WU834GB4WZ3:BCRTP73603;G9DW6I5U0162:0S64F5G7WF;57A87XKI8595:BPA8AF1R61;S944IWM236Y0:3NDZVY6684;ZYJT39IU0NTF:QEA56J231J;KDXRJS6B7S1O:BCRTP73603;733W31P7B9J5:QEA56J231J;LF852I5MONC3:KSA5FY237W,54.156) (6FU70K4HGKRJ:BCRTP73603;K31WEUG60UUF:QEA56J231J;2WU834GB4WZ3:VGRTL83D13;G9DW6I5U0162:KSA5FY237W;57A87XKI8595:BPA8AF1R61;S944IWM236Y0:KSA5FY237W;ZYJT39IU0NTF:0S64F5G7WF;KDXRJS6B7S1O:29OYMGZED9;733W31P7B9J5:3NDZVY6684;LF852I5MONC3:KSA5FY237W,52.779) (6FU70K4HGKRJ:BPA8AF1R61;K31WEUG60UUF:KSA5FY237W;2WU834GB4WZ3:VGRTL83D13;G9DW6I5U0162:KSA5FY237W;57A87XKI8595:29OYMGZED9;S944IWM236Y0:QEA56J231J;ZYJT39IU0NTF:0S64F5G7WF;KDXRJS6B7S1O:BCRTP73603;733W31P7B9J5:QEA56J231J;LF852I5MONC3:QEA56J231J,52.122) (6FU70K4HGKRJ:3NDZVY6684;K31WEUG60UUF:VGRTL83D13;2WU834GB4WZ3:29OYMGZED9;G9DW6I5U0162:KSA5FY237W;57A87XKI8595:3NDZVY6684;S944IWM236Y0:KSA5FY237W;ZYJT39IU0NTF:29OYMGZED9;KDXRJS6B7S1O:3NDZVY6684;733W31P7B9J5:QEA56J231J;LF852I5MONC3:0S64F5G7WF,58.810) (6FU70K4HGKRJ:29OYMGZED9;K31WEUG60UUF:29OYMGZED9;2WU834GB4WZ3:VGRTL83D13;G9DW6I5U0162:KSA5FY237W;57A87XKI8595:BPA8AF1R61;S944IWM236Y0:BCRTP73603;ZYJT39IU0NTF:0S64F5G7WF;KDXRJS6B7S1O:29OYMGZED9;733W31P7B9J5:29OYMGZED9;LF852I5MONC3:0S64F5G7WF,54.386) (6FU70K4HGKRJ:BCRTP73603;K31WEUG60UUF:0S64F5G7WF;2WU834GB4WZ3:3NDZVY6684;G9DW6I5U0162:KSA5FY237W;57A87XKI8595:VGRTL83D13;S944IWM236Y0:KSA5FY237W;ZYJT39IU0NTF:QEA56J231J;KDXRJS6B7S1O:BCRTP73603;733W31P7B9J5:29OYMGZED9;LF852I5MONC3:BPA8AF1R61,54.455) (6FU70K4HGKRJ:BPA8AF1R61;K31WEUG60UUF:0S64F5G7WF;2WU834GB4WZ3:29OYMGZED9;G9DW6I5U0162:KSA5FY237W;57A87XKI8595:3NDZVY6684;S944IWM236Y0:0S64F5G7WF;ZYJT39IU0NTF:VGRTL83D13;KDXRJS6B7S1O:BCRTP73603;733W31P7B9J5:29OYMGZED9;LF852I5MONC3:QEA56J231J,53.969) number of cost increases:1421 average cost increase:8.748 esimated initial temperature:39.203
Among the 8 solutions, the 4th one seems to be the best with a cost of 52.122. There is no guarantee that this is the globally optimum solution. Actually it’s not, because I have seen cost around 49 in some others runs.
Each solution has 10 assignments, because there are 10 projects. Each assignment e.g K31WEUG60UUF:QEA56J231J has project ID and person ID separated by :.
There is a second phase of the algorithm where search is made around the best solution found so far, looking for an even better solution. I have not used that option for the output shown.
The last 2 lines of the output requires some explanation. The implementation always estimates the initial temperature based on the average increase in cost as the algorithm explores the search space moving from one solution to the next and the probability of accepting a higher cost solution, which is defined as a configuration parameter.
Selecting the right initial temperature is crucial for striking the right balance between exploration and exploitation. With a higher initial temperature, the algorithm will explore more i.e. it will be more likely to explore worst solution with the hope of jumping off one local minimum to another more promising one. You can start with an initial guess of initial temperature. It can refined based on recommendation received after running the algorithm.
The cooling rate dictates how quickly the algorithm transitions from exploration mode to exploitation mode. If the cooling rate is low, more iterations will be required.
As mentioned earlier configuration tuning is not automated. If we did automated tuning then all the configuration parameter value combinations could be made part of the RDD. Spark parallel processing comes really handy when we automate parameter tuning.
Here is an example. Let’s say we try 5 possible initial temperatures, 5 possible cooling rates and 2 possible iteration counts. There will be 50 possible combinations of these parameters. Each of the 8 initial solutions will have to be tried for the 50 configuration parameter combinations. So the RDD will contain 400 records, each one yielding an optimum solution.
The same approach for automated parameter tuning can be used in Machine Learning also. Parameter tuning in Machine Learning requires a search through the parameters space with the goal of finding parameter value combination that gives lowest generalization error for the trained model.
We have used a stochastic optimization technique called Simulated Annealing to solve a complex project assignment problem. This tutorial contains the steps to execute this use case.
For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,