Monte Carlo Simulation Library in Python with Project Cost Estimation as an Example


I was working on a solution for change point detection in time series, which led me to a certain two-sample statistic for which critical values didn’t exist. The only option was to simulate the statistic and estimate critical values from the resulting distribution. Looking for a solution, I found only some ad hoc Python implementations of Monte Carlo simulation which were not very reusable. That prompted me to digress from my original project and instead work on a reusable Python implementation of Monte Carlo simulation. The implementation is available in my open source GitHub project avenir.

Monte Carlo simulation has applications for a wide range of problems in science and engineering. Instead of my esoteric two-sample statistic, I decided to use the well-known and ubiquitous problem of project cost estimation as the example.

Monte Carlo Simulation

When do you need Monte Carlo (MC) simulation? It’s needed in the following scenario that involves a function of several random variables:

  • You have several random variables with known distributions
  • You have a complex, non-linear function of those variables, and you want to estimate some statistical properties of the function

An analytical solution exists only under simple conditions and assumptions. For most real world cases the problem is intractable, for the following reasons:

  • Some of the distributions may be parametric but intractable, or non-parametric
  • The function is non-linear and complex, which is typically the case for real world problems

The idea behind Monte Carlo simulation is simple and intuitive. It generates sample values for the input variables, uses them to calculate the function value, and thereby generates samples of the function value. The function value samples are then used to derive various statistical properties. This is how my reusable Monte Carlo simulation implementation is structured:

  1. You provide probability distributions for all the random variables
  2. The simulator goes into a loop, and in each iteration it samples all the distributions to generate a list of values for all the variables
  3. The simulator calls a callback function provided by the user, passing the sampled values
  4. The callback function calculates the function value and returns it to the simulator
  5. Once the loop ends, you can take all the generated function values and perform whatever statistical operation you want on them
  6. In my implementation you can get some of the common statistical properties of the function directly from the simulator; if that does not suffice, you have the option of getting the raw function values and performing whatever statistical operation you need

The callback function implements the application logic, i.e. the calculation of the function. The callback enables the MC simulation implementation to be decoupled from any application logic.
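To make this structure concrete, here is a minimal, self-contained sketch of such a callback-driven simulator loop. The class and method names below are illustrative, not the actual avenir API, and the cost function with its $100 hourly rate is a made-up assumption.

```python
import random
import statistics

class MonteCarloSimulator:
    """Minimal sketch of a callback-driven MC simulator (illustrative, not the avenir API)."""

    def __init__(self, num_iter, callback):
        self.num_iter = num_iter
        self.callback = callback
        self.samplers = []
        self.values = []

    def register_sampler(self, sampler):
        # any object with a no-argument sample() method can be registered
        self.samplers.append(sampler)

    def run(self):
        # the simulation loop: sample every distribution, call the callback,
        # collect the returned function value
        for _ in range(self.num_iter):
            args = [s.sample() for s in self.samplers]
            self.values.append(self.callback(args))

class NormalSampler:
    def __init__(self, mean, sd):
        self.mean, self.sd = mean, sd

    def sample(self):
        return random.gauss(self.mean, self.sd)

# hypothetical application logic: cost = task hours * $100 hourly rate
def cost(args):
    return args[0] * 100.0

sim = MonteCarloSimulator(5000, cost)
sim.register_sampler(NormalSampler(200.0, 20.0))  # uncertain task hours
sim.run()
print(round(statistics.mean(sim.values), 2))
```

The only coupling between the simulator and the application is the list of sampled values passed to the callback, which is what makes the simulator reusable.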

Features of the Simulation Library

A wide array of distribution samplers should be supported for a simulator to be useful. The following probability distribution samplers are supported:

  1. Bernoulli trial sampler
  2. Uniform distribution sampler
  3. Normal distribution sampler
  4. Log normal distribution sampler
  5. Poisson distribution sampler
  6. Pareto distribution sampler
  7. Non parametric distribution sampler
  8. Triangular distribution sampler
  9. Multivariate normal distribution sampler
  10. Multivariate non parametric distribution sampler

You can also have your own custom sampler, as long as the sampler Python class implements a sample() method without arguments. Shortly, we will see how some of these distributions have been used for the variables of our project cost estimation use case. All the samplers are registered with the simulator by calling one of the registerXyz….Sampler() family of methods.
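As a sketch of what a custom sampler might look like (the class name and parameters are my own, not part of the library), here is a bimodal mixture sampler of the kind used later for the ML task hours:

```python
import random

class BimodalSampler:
    """Custom sampler: a mixture of two normals, e.g. task hours that may
    need either a simple or a more complex model. Any class with a
    no-argument sample() method can be registered the same way."""

    def __init__(self, mean1, sd1, mean2, sd2, prob1):
        self.mean1, self.sd1 = mean1, sd1
        self.mean2, self.sd2 = mean2, sd2
        self.prob1 = prob1  # probability of drawing from the first mode

    def sample(self):
        if random.random() < self.prob1:
            return random.gauss(self.mean1, self.sd1)
        return random.gauss(self.mean2, self.sd2)

# made-up figures: 60% chance the simple model suffices (~100 hours),
# otherwise the complex model is needed (~180 hours)
sampler = BimodalSampler(100.0, 10.0, 180.0, 15.0, 0.6)
draws = [sampler.sample() for _ in range(10000)]
```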

The simulator will sample each of the samplers in the order they are registered and create a list of sampled values. You can also register some context objects with the simulator. They get appended to the list. Finally, the simulator calls the callback function passing the list as an argument.

As alluded to earlier, the MC simulator provides various methods to get several common statistical properties of the simulator output, i.e. the function value, which for our use case is the project cost:

  1. Average
  2. Std deviation
  3. Median
  4. Histogram
  5. Max
  6. Min
  7. Integral
  8. Lower tail statistic (percentiles)
  9. Upper tail statistic (percentiles)
  10. Raw function values

If they don’t meet your requirements, you can always get the raw function values and save them in a file for later analysis.
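For instance, given the raw function values, most of the properties above can be computed with the standard library alone. The cost figures below are synthetic, generated just for illustration:

```python
import random
import statistics

random.seed(42)
# stand-in for raw simulated cost values retrieved from the simulator
values = [random.gauss(23600.0, 4900.0) for _ in range(5000)]

mean = statistics.mean(values)
sd = statistics.stdev(values)
median = statistics.median(values)
lo, hi = min(values), max(values)
# upper tail statistic: the 95th percentile of the simulated cost
p95 = sorted(values)[int(0.95 * len(values))]
print(round(mean, 2), round(p95, 2))
```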

The min and max properties might make you ponder whether MC simulation could be used for optimization problems. In theory it could be, but it’s not a good idea: for that kind of point solution, search based optimization algorithms are more appropriate.

Data Science Project Cost

The project is a fictitious data science project for which an offshore company wants to bid. Instead of doing back of the envelope estimates with some arbitrary margin to mitigate uncertainties, they decided to put their estimates on a better statistical ground and wanted an estimate with a 95% confidence limit.

The project definition contains the following. Details of this example can be viewed in the Python implementation, which could also serve as a template if you want to implement a different use case.

  1. Hourly rates of team members
  2. For each task, project members and percentage participation

Here are the input variables along with the type of distribution used for each. Most of the variables are hours needed to complete a task or activity. Since there are uncertainties associated with them, they are specified probabilistically:

  1. Front end development task hours : Normal distribution
  2. Machine Learning (ML) development task hours : Bi modal non parametric distribution.
  3. Percentage participation of lead ML developer: Triangular distribution.
  4. Deployment task hours : Normal distribution
  5. Project management task hours : Normal distribution
  6. Unexpected event of a team member quitting or taking time off : Bernoulli trial
  7. Additional hours for the unexpected event : Normal distribution
  8. Number of interruptions in a day (e.g. phone, email, meeting) : Poisson distribution (6 of them, 1 for each team member)

Some of the choices may require clarification. For the ML task, it was anticipated that the first algorithm tried might fail and a second, more complex model might have to be used. The bi modal distribution reflects this reality.

Regarding the ML lead engineer’s participation: since the other member of the team is a junior without a lot of experience, it’s anticipated that the lead may have to spend extra hours to cover for the situation. Essentially, the participation percentage is treated as a random variable.

The unexpected event of quitting or taking time off is modeled with a Bernoulli trial. Only if the sampled trial returns true are the additional hours for the unexpected event added to the total cost.
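A sketch of that conditional logic inside the callback might look like the following; the event probability and the extra-hours figures are made-up assumptions:

```python
import random

def unexpected_hours(p_event=0.1, extra_mean=80.0, extra_sd=15.0):
    # Bernoulli trial: extra hours are added only when the event occurs
    if random.random() < p_event:
        return random.gauss(extra_mean, extra_sd)
    return 0.0

draws = [unexpected_hours() for _ in range(10000)]
```

Roughly one draw in ten contributes extra hours; the rest contribute nothing to the cost.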

The number of interruptions in a day, e.g. phone calls, emails and meetings, is modeled as a Poisson process. Each interruption event is assumed to consume some fixed amount of time. All of this translates to time spent by team members attending to interruptions and, eventually, to additional cost.

To make the estimates better, you could have a separate Poisson distribution for each type of interruption, along with a distribution for the time spent on that type of interruption.
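Since the Python standard library has no Poisson sampler, here is a sketch using Knuth’s algorithm; the interruption rate, duration and hourly rate figures are assumptions for illustration only:

```python
import math
import random

def poisson_sample(lam):
    # Knuth's algorithm for sampling a Poisson-distributed count
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

# assumed figures: 4 interruptions a day on average, 10 minutes each,
# 20 working days in the period, $100 hourly rate
def interruption_cost(days=20, minutes_each=10.0, rate=100.0):
    n = sum(poisson_sample(4.0) for _ in range(days))
    return n * (minutes_each / 60.0) * rate

counts = [poisson_sample(4.0) for _ in range(5000)]
```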

Running Monte Carlo Simulator

You can browse the Python implementation of the MC simulator if curious. The driver code for the project cost example is run as follows, with the number of iterations set to 5000; the result will improve with a higher number of iterations. The output is also shown, which includes the mean, standard deviation and upper tail statistics.

./pccb.py 5000

mean 23608.28 std dev 4906.31 upper critical values 0.850 29140 0.860 29327 0.870 29479 0.880 29667 0.890 29901 0.900 30057 0.910 30341 0.920 30634 0.930 30882 0.940 31208 0.950 31536 0.960 31950 0.970 32401 0.980 32963 0.990 33782 1.000 37104

Here is the distribution of the cost:

The distribution has two modes, caused by the bi modal distribution we used for the ML task hours. From the output above, the project cost estimate at 95% confidence is $31536. The confidence level chosen should be based on risk aversion.

Project Elapsed Time

You may also be interested in a probabilistic estimate of the elapsed time for a project, which will enable you to find the confidence level of meeting a deadline. The following changes are necessary in the example code. You may want to try it out, if interested.

  1. Provide a distribution for the overlap between each pair of tasks. Some tasks may be completely independent of (complete overlap) or completely dependent on (no overlap) other tasks; those don’t need a distribution in the MC simulation.
  2. Make the necessary changes in the callback function. Each pair of tasks should be considered along with the sampled overlap between them to compute the elapsed time for the whole project.
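A sketch of how the callback might combine two task durations with a sampled overlap fraction (a toy two-task case; the interpretation of overlap follows the point above, where complete overlap means fully independent tasks):

```python
def elapsed_time(hours_a, hours_b, overlap):
    # overlap is the fraction of the shorter task running concurrently
    # with the other: 1.0 means fully independent (parallel) tasks,
    # 0.0 means fully dependent (sequential) tasks
    return hours_a + hours_b - overlap * min(hours_a, hours_b)

# fully sequential: 180.0; fully parallel: 100.0; half overlap: 140.0
print(elapsed_time(100.0, 80.0, 0.5))
```

In a full version, the overlap for each task pair would itself come from a registered sampler, and the callback would fold all pairs into one project-level elapsed time.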

Summing Up

Monte Carlo simulation is useful for finding statistical properties of a quantity that is a complex function of several random variables. The quality of the MC simulation output will depend on a prudent choice of probability distributions for the independent variables. We have seen how it can be used for a probabilistic estimate of a project cost. The tutorial document has details on how to execute this example use case.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solutions.
