Friday 9 November 2012

Windows 8 Pro Release...

...has a great uninstall application. Despite the Windows 8 Upgrade Assistant saying everything was hunky-dory, I spent the last 5 hours downloading and installing Windows 8 only to end up with the two screens shown below.

These appeared after the 'usual' Win8 ':( something has gone wrong' screen that I also got, whatever I tried, with the CTP on a VirtualBox VM:

fig 1 - First error screen in the sequence


fig 2 - Second error screen in the sequence

It would be interesting to hear from people who managed to install Win8 Pro on hardware that's a year or two old, as I don't know anyone who hasn't had any problems at all.

Once installed, most people report a good experience (or at least some good experiences), but it is now 1am, I have wasted an unbelievable amount of time (plus the cost of the software) that I won't get back, and I am not in the mood to try to fix this now.

If Microsoft wants to compete in the tablet and phone markets with the likes of Apple, things have got to just work! With the diversity of hardware platforms they will typically have to support in that arena, that isn't easy at the best of times, but this certainly isn't the way to do it on a desktop platform they have dominated for a few decades.

Maybe I will try it on a different box tomorrow, depending on whether I can transfer the licence across. Otherwise, dummy out of the pram, I'm sulking!

The Working Update

UPDATE: I finally managed to get it installed and working the day after. However, I had to reinstall all my applications (the majority of which were not on the Upgrade Assistant's list) and have still got some to do.

I had to choose to keep only my files, not my apps. The BSOD that was happening previously came up with error code 0xC000021A. A quick Google suggested too many possible fixes at that point, mostly pointing at a problem with winlogon.exe (something that has cropped up a lot throughout Windows' history, including XP dying by itself), so I just thought blue-word thoughts and installed the thing keeping only my files.

Once installed, I am actually quite happy with it. It is very fast compared to Win7 on this box, though I don't know if that is because I am not running some of the services I used to. Apart from that, it is very responsive on my SSD-based 3.6GHz quad-core AMD Phenom II X4 975 Black Edition.

The lack of a Start menu was confusing, especially when I instinctively hit the Windows key on the keyboard. The Metro interface does seem very simplified, and closing apps in Metro would be extremely long-winded if I didn't know Alt+F4 existed. It requires a mouse to pick up and drag a window to the bottom of the screen (think of sending it to its grave), or you can move the mouse to the top left-hand corner of the screen to bring up the running-apps bar, right-click and select 'Close' (akin to right-clicking an icon in the taskbar on Win7 and selecting 'Close window').

The same is true of shutting Windows down. If you are on the desktop, Alt+F4 brings up the usual Windows shut-down dialog box. Otherwise it is Win+C (or move the mouse to the top right) to bring up the charms, then the 'Settings' cogwheel, then the power button, then 'Shut down' from the resulting menu, then breathe!

I will continue to play and see where it takes me. There are a couple of annoying elements about Metro so far, but I hope this old dog will learn new tricks with time.

Sunday 4 November 2012

Chaining risks, the Markov way (Part 1)

This one is a bit of a musing, as it isn't currently an established norm in the world of software development.

I started this blog post with the aim of going end-to-end: translating risk information stored in a risk log or an FMEA into a Markov chain, modelling that chain as an adjacency table, and then reasoning with it, for example finding the shortest path through the risks and impacts to identify the least risky way through the development or operational stages. However, this proved to be a much longer task when run through step by step.

So I have decided to split this down into a couple of blog posts. The first will deal with modelling and visualising the risk log as a directed graph. The second will then build an adjacency table to reason with computationally and so deal with the optimising of those risks.

Risk? What risk?

During the development of a piece of software, there are a number of risks which can cause the project to fail. These can be broadly categorized into development and operational risks.

Remembering that the lifetime of a piece of software includes the time after its deployment into the production environment, we shouldn't neglect the risks posed in running the code. Operation generally accounts for around 85% of a project's total lifetime, yet we often pay that proportion of the risks only lip service.

In either case, we have to be aware of these risks and how they will impact the software at every stage. Most 'new age' companies essentially are their software products, and therefore the risks associated with those products put the company as a whole at significant risk.

I was thinking the other day about the use of FMEAs and their role in the communication process. I tailed off my use of FMEA-like processes years ago, but picked them up again in 2011 after a contract in Nottingham. The process is pretty easy and harks back to the yesteryear of House of Quality (HoQ) analyses, which I used a lot and still use to some degree in the form of weighted-factor models or multivariate analyses. People familiar with statistical segmentation or quants will know this from their work with balanced scorecards.

What struck me about the FMEA, even in its renaissance in my world, is that its presentation, just like that of any risk log, is inherently tabular. Whilst that makes it easy to read, it doesn't adequately highlight the knock-on effects those risks will have.

FMEAs and Risk Logs

An FMEA (Failure Mode and Effects Analysis) is a technique which expands a standard risk log to include quantitative scores, allowing you to prioritise the mitigation of risks automatically, not just on the probability of occurrence and the impact, but also on the effect of the mitigation (i.e. how acceptable the residual risk is).

Now, risks often don't stand alone. One risk, once it becomes an issue, can kick off a whole set of other causes (each carrying risks of their own), and those will have effects of their own, and so on.

Consider, for example, a situation where a key technical person (bus factor 1) responsible for the technical storage solution leaves the company, and an enterprise system's disk storage array then fails or loses connectivity. This will cause errors across the entire enterprise application catalogue wherever data storage is a critical part of the system, which in turn takes away customer service agents' ability to handle customer data, which consequently costs the company money both in lost earnings and in reputation, plus the further opportunity costs caused by such damage to the brand.

A risk log, or even an FMEA, will list these as separate rows in a table, which is inadequate for visualising the risks. Indeed, this form of categorization has side-effects: if the risks above are entered at different times, or the log is sorted by effect, the related items may not sit near each other, so the connection between them is not immediately obvious.

What are you thinking, Markov?

I started thinking about better ways to visualise these related risks in a sea of other project risks. One way that came to mind was to use a probability lattice/tree to expand the risks, but then it dawned on me that risks can split early in a chain and converge again later on.

OK, easy enough to cover off. I will use a directed graph. No problem. But then this felt a bit like deja vu.

The deja vu was because this is effectively what a Markov chain is.

A Markov chain is effectively a directed graph (think of a state chart) where the edges carry the probabilities of the system's state moving from one risk to the next.

This was a particularly important realisation. Any directed graph can be represented as an adjacency matrix, and as such it can be reasoned about computationally. For example, a shortest-path algorithm (Dijkstra's, say) can then be used to find the shortest path through this adjacency table and thus through these risks.
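As a minimal sketch of that idea (a made-up three-state example in Python, not the risk log that follows), an adjacency matrix is just a grid where each cell holds the weight of the edge between two states:

# A hypothetical three-state graph held as an adjacency matrix.
# Cell [i][j] is the probability of moving from state i to state j;
# 0.0 means there is no edge. The states and numbers are made up purely
# to show the representation.
states = ["Disk failure", "Data unavailable", "Revenue loss"]

adjacency = [
    #  to: Disk failure, Data unavailable, Revenue loss
    [0.0, 0.75, 0.0],   # from Disk failure
    [0.0, 0.0,  0.5],   # from Data unavailable
    [0.0, 0.0,  0.0],   # from Revenue loss
]

# Once in this form the graph can be reasoned about computationally, e.g.
# the probability of the path Disk failure -> Data unavailable -> Revenue
# loss is just the product of the edge weights along it.
print(adjacency[0][1] * adjacency[1][2])  # 0.375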

I have deliberately used the words 'cause' and 'effect' to better illustrate how the risk log could be linked to the Markov chain. Let's consider the risk log elements defined in the following table for the purpose of illustration:

Risk No | Cause (Risk) | Effect (Impact) | Unmitigated risk (Risk, Impact) | Mitigation | Residual risk (Risk, Impact)
1 | DB disk failure | Data cannot be retrieved or persisted | L, H | Introduce SAN cluster | L, L
2 | DB disk full without notification | Data cannot be persisted | M, L | Set up instrumentation alerts | L, L
3 | Cannot retrieve customer data | Customer purchases cannot be completed automatically | M, H | Set up hot standby systems to fail over onto | L, L
4 | Cannot process payments through PCI-DSS payment processor | Customer purchases cannot be completed automatically | M, H | Have a secondary connection to the payment gateway to fail over onto | L, L
5 | Customer purchases cannot be completed automatically | Net revenue is down at a rate of £1 million a day | M, H | Have a manual BAU process | L, M


I have not included monitoring tasks in this, and it is an example of an operational risk profile. However, if you look carefully, you'll note that the risks play into one another. In particular, there are several ways to end up at the 'Customer purchases cannot be completed automatically' or 'Data cannot be persisted' effects, yet it is not immediately obvious from the table that these risks are related.
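As a sketch of what the first half of this exercise looks like in code (Python with networkx and matplotlib, which are my choice of tools rather than anything prescribed), each row of the log becomes an edge from cause to effect:

# Sketch: turning the risk log above into a directed graph for visualisation.
import matplotlib.pyplot as plt
import networkx as nx

# Each entry is (cause, effect, unmitigated risk as L/M/H).
# Note: the chains only join up where the cause and effect wording is
# consistent between rows - exactly the relationship the table hides.
risk_log = [
    ("DB disk failure", "Data cannot be retrieved or persisted", "L"),
    ("DB disk full without notification", "Data cannot be persisted", "M"),
    ("Cannot retrieve customer data",
     "Customer purchases cannot be completed automatically", "M"),
    ("Cannot process payments through PCI-DSS payment processor",
     "Customer purchases cannot be completed automatically", "M"),
    ("Customer purchases cannot be completed automatically",
     "Net revenue is down at a rate of £1 million a day", "M"),
]

G = nx.DiGraph()
for cause, effect, risk in risk_log:
    G.add_edge(cause, effect, risk=risk)

pos = nx.spring_layout(G, seed=1)
nx.draw(G, pos, with_labels=True, node_size=1500, font_size=7)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "risk"))
plt.show()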

We can model each risk as the pair (r, s), where r is the probability of the issue occurring and s is the impact if it does (i.e. the sensitivity to that risk).

The values of these pairs are the L, M, H ratings of risk and impact in a risk log or FMEA (in the latter case, it is possible to use the RPN - Risk Priority Number - to define the weighting of the edge, which simplifies the process somewhat).

Taking the risk component alone, this will eventually form the elements of an adjacency table. But first, an introduction to the Markov chain. Obviously, if you are already familiar with Markov chains, you can skip to the next section.

Markov Chains. Linking Risks.

Markov chains are a graphical representation of the probability of events occurring, with each node/vertex representing a state and each edge the probability of the corresponding transition occurring. For each node in the chain, the sum of the probabilities on its outgoing edges must equal one. Consider it the same as a state transition diagram, where the edges are the probabilities of events occurring.

Because every node's outgoing probabilities must total 1, you also have to show the transitions which do not result in a change of state, where applicable. If a probability is not covered by the risk log, then it is not a failure transition (i.e. there is no issue), so you include it as 1 minus the sum of all the outgoing failure probabilities, effectively letting success at a node loop back on itself.
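A tiny sketch of that rule for a single node (the failure probabilities here are made up purely for illustration):

# Completing a node's outgoing probabilities so they sum to 1.
failure_transitions = {"risk A becomes an issue": 0.2, "risk B becomes an issue": 0.3}

# Whatever probability is left over is the 'no issue' case, modelled as a
# transition from the node back to itself.
self_loop = 1.0 - sum(failure_transitions.values())

transitions = {**failure_transitions, "no issue (stay put)": self_loop}
assert abs(sum(transitions.values()) - 1.0) < 1e-9
print(transitions)
# {'risk A becomes an issue': 0.2, 'risk B becomes an issue': 0.3, 'no issue (stay put)': 0.5}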

If we set low risk to be 0.25, medium 0.5 and high 0.75, with critical risks at anything from 0.76 to 1.00, then the following diagram shows the above risk log modelled as a Markov chain:
Fig 1 - Markov Chain of above risk log

To explain what is going on here, you need to understand what a Markov chain is, and a little time spent reading the wiki link would be useful. Basically, by combining all the state effects together, we have built a chain which shows how these effects interplay: at each effect there is a further chance of something happening, which then leads to the next potential effect. From the above network, it is immediately clear that some risks play into one another. Often the risks with the most lines coming into them are the ones that most need mitigating, as any of those incoming lines could cause that state to be entered.

The results can be analysed straight from this. Given that each risk is an event independent of the others, the probabilities can simply be multiplied along the chain to the target. We can ask questions such as:

Q: What is the chance we lose 1 million GBP or more?
A: Every node in this particular chain has only two types of event emanating from it, and the loss state can be reached from any of the working states. There are two ways to work the probability out: the long-winded way, which is to follow all of the chains through and add up the paths, or the short-winded way, which is to take the situation where everything keeps working and subtract it from 1, giving the chance of losing £1 million a day. Because I am lazy, I prefer the latter. The latter way also copes with nodes that have more than two outgoing transitions, which is particularly important when there may be three or more risks that could occur at each step of the chain.
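For illustration only, here is the shape of that 'short winded' arithmetic in Python. The probabilities are the L/M values of the initiating risks from the log (0.25, 0.5, 0.5); the chain in the figure may combine them differently, so treat this as a demonstration of the method rather than the answer for this particular system:

# 'Short winded' calculation: 1 minus the probability that every failure
# transition is avoided. Numbers are assumed from the risk log's L/M ratings.
initiating_risks = {
    "DB disk failure": 0.25,
    "DB disk full without notification": 0.5,
    "PCI-DSS payment processor connection fails": 0.5,
}

p_everything_works = 1.0
for p_fail in initiating_risks.values():
    p_everything_works *= 1.0 - p_fail

print(1.0 - p_everything_works)  # 0.8125 - the chance at least one of these fires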


Q: What is the effect of a failure on the DB disk?
A: By following the chain through and expanding a probability tree (the wiki entry really needs someone to expand it, it's rubbish!), assuming the disk has failed, we get:

chance of missing customer data = 100%
chance of lost purchases = 50%
chance of loss of £1 million or more = 25%

The reason for the latter figure is that, given the customer data is unavailable, the chance that purchases cannot be completed is 50% (risk 3 is rated M), and given that, the chance of losing £1 million or more is again 50% (risk 5 is rated M), so 0.5 × 0.5 = 25%.
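A tiny sketch of that multiplication, reading the chain structure off the risk log rows (the figure above is the authoritative version):

# Chain multiplication behind the three figures, using the M = 0.5 mapping.
p_disk_failed = 1.0                          # given: the disk has already failed
p_data_unavailable = p_disk_failed * 1.0     # disk failure means data cannot be retrieved
p_purchases_lost = p_data_unavailable * 0.5  # risk 3 is rated M
p_million_loss = p_purchases_lost * 0.5      # risk 5 is rated M

print(p_data_unavailable, p_purchases_lost, p_million_loss)  # 1.0 0.5 0.25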

Summary

Although I have not used these in earnest, I am keen to look at the use of Markov chains and will be exploring the use of them when transformed into adjacency tables for computational purposes using linear algebra in the next blog entry. 

Markov chains are widely used in analytics and operations research circles, so it will be useful to see how they apply here. Already, though, you can see how the effects interplay and what sort of reasoning can be accomplished with them. This shouldn't be too new to anyone who has studied PERT, Six Sigma or network analysis techniques in project management or process optimisation courses, as those are effectively practical applications of this very same technique. Indeed, a blog I did a while back on availability is a practical example of this at the system level.

To be continued :-)