Monte Carlo Simulation Library in Python with Project Cost Estimation as an Example


I was working on a solution for change point detection in time series, which led me to a certain two-sample statistic for which critical values didn’t exist. The only option was to simulate the statistic and estimate critical values from the resulting distribution. Looking for a solution, I found only some ad hoc Python implementations of Monte Carlo simulation which were not very reusable. That prompted me to digress from my original project and instead work on a reusable Python implementation of Monte Carlo simulation. The implementation is available in my open source GitHub project avenir.

Monte Carlo simulation has applications for a wide range of problems in science and engineering. Instead of my esoteric two-sample statistic, I decided to use the well-known and ubiquitous problem of project cost estimation as the example.

Monte Carlo Simulation

When do you need Monte Carlo (MC) simulation? It’s needed in the following scenario that involves a function of several random variables:

  • You have several random variables with known distributions
  • You have a complex, non-linear function of those variables, and you want to estimate some statistical properties of the function

An analytical solution exists only under simple conditions and assumptions. For most real world cases the problem is intractable, for the following reasons:

  • Some of the distributions may be parametric but intractable, or non-parametric
  • The function is non-linear and complex, which is typically the case for real world problems

The idea behind Monte Carlo simulation is simple and intuitive. It generates sample values for the input variables, uses them to calculate the function value, and thereby generates samples of the function value. The function value samples are then used to derive various statistical properties. This is how my reusable Monte Carlo simulation implementation is structured:

  1. You provide probability distributions for all the random variables
  2. The simulator goes into a loop, and in each iteration it samples all the distributions to generate a list of values for all the variables
  3. The simulator calls a callback function provided by the user, passing the sampled values
  4. The callback function calculates the function value and returns it to the simulator
  5. Once the loop ends, you can take all the generated function values and perform whatever statistical operation you want on them
  6. In my implementation you can get some of the common statistical properties of the function directly from the simulator; if that does not suffice, you have the option of getting the raw function values and performing whatever statistical operation you need

The callback function implements the application logic, i.e. the calculation of the function. The callback enables the MC simulation implementation to be decoupled from any application logic.
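To make this structure concrete, here is a minimal, self-contained sketch of such a callback-driven simulator loop. The class and method names below are illustrative, not the actual avenir API, and the cost function with its $100 hourly rate is a made-up assumption.

```python
import random
import statistics

class MonteCarloSimulator:
    """Minimal sketch of a callback-driven MC simulator (illustrative, not the avenir API)."""

    def __init__(self, num_iter, callback):
        self.num_iter = num_iter
        self.callback = callback
        self.samplers = []
        self.values = []

    def register_sampler(self, sampler):
        # any object with a no-argument sample() method can be registered
        self.samplers.append(sampler)

    def run(self):
        # the simulation loop: sample every distribution, call the callback,
        # collect the returned function value
        for _ in range(self.num_iter):
            args = [s.sample() for s in self.samplers]
            self.values.append(self.callback(args))

class NormalSampler:
    def __init__(self, mean, sd):
        self.mean, self.sd = mean, sd

    def sample(self):
        return random.gauss(self.mean, self.sd)

# hypothetical application logic: cost = task hours * $100 hourly rate
def cost(args):
    return args[0] * 100.0

sim = MonteCarloSimulator(5000, cost)
sim.register_sampler(NormalSampler(200.0, 20.0))  # uncertain task hours
sim.run()
print(round(statistics.mean(sim.values), 2))
```

The only coupling between the simulator and the application is the list of sampled values passed to the callback, which is what makes the simulator reusable.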

Features of the Simulation Library

A wide array of distribution samplers should be supported for a simulator to be useful. The following probability distribution samplers are supported:

  1. Bernoulli trial sampler
  2. Uniform distribution sampler
  3. Normal distribution sampler
  4. Log normal distribution sampler
  5. Poisson distribution sampler
  6. Pareto distribution sampler
  7. Non parametric distribution sampler
  8. Triangular distribution sampler
  9. Multivariate normal distribution sampler
  10. Multivariate non parametric distribution sampler

You can also have your own custom sampler, as long as the sampler Python class implements a sample() method without arguments. Shortly, we will see how some of these distributions have been used for the variables of our project cost estimation use case. All the samplers are registered with the simulator by calling one of the registerXyz….Sampler() family of methods.
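As a sketch of what a custom sampler might look like (the class name and parameters are my own, not part of the library), here is a bimodal mixture sampler of the kind used later for the ML task hours:

```python
import random

class BimodalSampler:
    """Custom sampler: a mixture of two normals, e.g. task hours that may
    need either a simple or a more complex model. Any class with a
    no-argument sample() method can be registered the same way."""

    def __init__(self, mean1, sd1, mean2, sd2, prob1):
        self.mean1, self.sd1 = mean1, sd1
        self.mean2, self.sd2 = mean2, sd2
        self.prob1 = prob1  # probability of drawing from the first mode

    def sample(self):
        if random.random() < self.prob1:
            return random.gauss(self.mean1, self.sd1)
        return random.gauss(self.mean2, self.sd2)

# made-up figures: 60% chance the simple model suffices (~100 hours),
# otherwise the complex model is needed (~180 hours)
sampler = BimodalSampler(100.0, 10.0, 180.0, 15.0, 0.6)
draws = [sampler.sample() for _ in range(10000)]
```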

The simulator will sample each of the samplers in the order they are registered and create a list of sampled values. You can also register some context objects with the simulator. They get appended to the list. Finally, the simulator calls the callback function passing the list as an argument.

As alluded to earlier, the MC simulator provides various methods to get several common statistical properties of the simulator output, i.e. the function value, which for our use case is the project cost:

  1. Average
  2. Std deviation
  3. Median
  4. Histogram
  5. Max
  6. Min
  7. Integral
  8. Lower tail statistic (percentiles)
  9. Upper tail statistic (percentiles)
  10. Raw function values

If they don’t meet your requirements, you can always get the raw function values and save them in a file for later analysis.
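For instance, given the raw function values, most of the properties above can be computed with the standard library alone. The cost figures below are synthetic, generated just for illustration:

```python
import random
import statistics

random.seed(42)
# stand-in for raw simulated cost values retrieved from the simulator
values = [random.gauss(23600.0, 4900.0) for _ in range(5000)]

mean = statistics.mean(values)
sd = statistics.stdev(values)
median = statistics.median(values)
lo, hi = min(values), max(values)
# upper tail statistic: the 95th percentile of the simulated cost
p95 = sorted(values)[int(0.95 * len(values))]
print(round(mean, 2), round(p95, 2))
```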

The min and max properties might make you ponder whether MC simulation could be used for optimization problems. In theory it could be, but it’s not a good idea: for that kind of point solution, search based optimization algorithms are more appropriate.

Data Science Project Cost

The project is a fictitious data science project for which an offshore company wants to bid. Instead of doing back of the envelope estimates with some arbitrary margin to mitigate uncertainties, they decided to put their estimates on a better statistical ground and wanted an estimate with a 95% confidence limit.

The project definition contains the following. Details of this example can be viewed in the Python implementation, which could also serve as a template if you want to implement a different use case.

  1. Hourly rates of team members
  2. For each task, project members and percentage participation

Here are the input variables along with the type of distribution used for each. Most of the variables are hours needed to complete a task or activity. Since there are uncertainties associated with them, they are specified probabilistically:

  1. Front end development task hours : Normal distribution
  2. Machine Learning (ML) development task hours : Bi modal non parametric distribution.
  3. Percentage participation of lead ML developer: Triangular distribution.
  4. Deployment task hours : Normal distribution
  5. Project management task hours : Normal distribution
  6. Unexpected event of a team member quitting or taking time off : Bernoulli trial
  7. Additional hours for the unexpected event : Normal distribution
  8. Number of interruptions in a day (e.g. phone, email, meeting) : Poisson distribution (6 of them, 1 for each team member)

Some of the choices may require clarification. For the ML task, it was anticipated that the first algorithm tried might fail and a second, more complex model might have to be used. The bi modal distribution reflects this reality.

Regarding the ML lead engineer’s participation: since the other member of the team is a junior without a lot of experience, it’s anticipated that the lead may have to spend extra hours to cover for the situation. Essentially, the participation percentage is treated as a random variable.

The unexpected event of quitting or taking time off is modeled with a Bernoulli trial. Only if the sampled trial returns true are the additional hours for the unexpected event added to the total cost.
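A sketch of that conditional logic inside the callback might look like the following; the event probability and the extra-hours figures are made-up assumptions:

```python
import random

def unexpected_hours(p_event=0.1, extra_mean=80.0, extra_sd=15.0):
    # Bernoulli trial: extra hours are added only when the event occurs
    if random.random() < p_event:
        return random.gauss(extra_mean, extra_sd)
    return 0.0

draws = [unexpected_hours() for _ in range(10000)]
```

Roughly one draw in ten contributes extra hours; the rest contribute nothing to the cost.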

The number of interruptions in a day, e.g. phone calls, emails and meetings, is modeled as a Poisson process. Each interruption event is assumed to consume some fixed amount of time. All of this translates to time spent by team members attending to interruptions and, eventually, to additional cost.

To make the estimates better, you could have a separate Poisson distribution for each type of interruption, along with a distribution for the time spent on that type of interruption.
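Since the Python standard library has no Poisson sampler, here is a sketch using Knuth’s algorithm; the interruption rate, duration and hourly rate figures are assumptions for illustration only:

```python
import math
import random

def poisson_sample(lam):
    # Knuth's algorithm for sampling a Poisson-distributed count
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

# assumed figures: 4 interruptions a day on average, 10 minutes each,
# 20 working days in the period, $100 hourly rate
def interruption_cost(days=20, minutes_each=10.0, rate=100.0):
    n = sum(poisson_sample(4.0) for _ in range(days))
    return n * (minutes_each / 60.0) * rate

counts = [poisson_sample(4.0) for _ in range(5000)]
```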

Running Monte Carlo Simulator

You can browse the Python implementation of the MC simulator if curious. The driver code for the project cost example is run as follows, with the number of iterations set to 5000; the result will improve with a higher number of iterations. The output is also shown, which includes the mean, standard deviation and upper tail statistics.

./pccb.py 5000

mean 23608.28 std dev 4906.31 upper critical values 0.850 29140 0.860 29327 0.870 29479 0.880 29667 0.890 29901 0.900 30057 0.910 30341 0.920 30634 0.930 30882 0.940 31208 0.950 31536 0.960 31950 0.970 32401 0.980 32963 0.990 33782 1.000 37104

Here is the distribution of the cost:

The distribution has two modes, caused by the bi modal distribution we used for the ML task hours. From the output above, the project cost estimate at 95% confidence is $31536. The confidence level chosen should be based on risk aversion.

Project Elapsed Time

You may also be interested in a probabilistic estimate of the elapsed time for a project, which will enable you to find the confidence level of meeting a deadline. The following changes are necessary in the example code. You may want to try it out, if interested.

  1. Provide a distribution for the overlap between each pair of tasks. Some tasks may be completely independent of (complete overlap) or completely dependent on (no overlap) other tasks; those don’t need a distribution in the MC simulation.
  2. Make the necessary changes in the callback function. Each pair of tasks should be considered along with the sampled overlap between them to compute the elapsed time for the whole project.
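A sketch of how the callback might combine two task durations with a sampled overlap fraction (a toy two-task case; the interpretation of overlap follows the point above, where complete overlap means fully independent tasks):

```python
def elapsed_time(hours_a, hours_b, overlap):
    # overlap is the fraction of the shorter task running concurrently
    # with the other: 1.0 means fully independent (parallel) tasks,
    # 0.0 means fully dependent (sequential) tasks
    return hours_a + hours_b - overlap * min(hours_a, hours_b)

# fully sequential: 180.0; fully parallel: 100.0; half overlap: 140.0
print(elapsed_time(100.0, 80.0, 0.5))
```

In a full version, the overlap for each task pair would itself come from a registered sampler, and the callback would fold all pairs into one project-level elapsed time.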

Summing Up

Monte Carlo simulation is useful for finding statistical properties of a quantity that is a complex function of several random variables. The quality of the MC simulation output will depend on a prudent choice of probability distributions for the independent variables. We have seen how it can be used for a probabilistic estimate of a project cost. The tutorial document has details on how to execute this example use case.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solutions.
