Data Visualization: The Quest for Patterns

Recently I was trying to present a way for individuals to find patterns in large amounts of data. The end result was considered successful and I thought I would share at a high level what was done.

Firstly, the data looked a tad like this…

Note 1: this is NOT the actual data.
Note 2: C1 and C2 are where I wish to draw attention in this example, i.e. the final outcome should be a clean visual of all the data with an intuitive focus on where C1 and C2 get and send their data.

Certainly not terribly daunting, interesting, nor unfamiliar for the average office/information worker/victim 🙂
Running this through Excel Services really did not produce anything useful. It hinted at associations between the data points but that was it. See for yourself…

Not the best presentation of data I’m sure you’d agree… (Even forgiving the quality of the image.)
So I ran the data around a different set of axes with a bit of custom code and got a dramatically clearer view of the columns and their content associations. At least I think it is.

Colour coding C1 and C2 made things even more clear.

A successful solution!

Euler’s identity: a mental segue

Working with code as a creator, configuror (is this really a word?), or troubleshooter has rewards and costs. One of the rewards is seeing / discovering crisp and novel logical patterns used to translate business processes from physical to digital.
Recently I was discussing with somebody how best to present large amounts of somewhat abstract data related to individuals with a goal of ‘perceiving’ subject matter experts in as fluid a fashion as possible. (Clearly I struggled with this sentence.) Musing for a few days on the topic I started to visualize a way to map each person’s data points onto a sphere and what the values of those data points could, and should, do to the sphere and its surroundings. Then I remembered something I had not thought about or heard anybody mention in a long long time. Euler’s identity.
Euler’s identity is not something most people think about or even know of. It is considered by many to be exceptionally remarkable for its mathematical beauty, and it is often mentioned as the pinnacle of mathematical elegance. And it looks like this:
e^{i\pi} + 1 = 0
It comprises three basic arithmetic operations, each occurring exactly once:
  • addition
  • multiplication
  • exponentiation.
The identity also links five fundamental mathematical constants:
  • The number 0.
  • The number 1.
  • The number π, which is ubiquitous in trigonometry, geometry of Euclidean space, and mathematical analysis (π ≈ 3.14159).
  • The number e, the base of natural logarithms, which also occurs widely in mathematical analysis (e ≈ 2.71828).
  • The number i, the imaginary unit of the complex numbers, which contain the roots of all nonconstant polynomials and lead to deeper insight into many operators, such as integration.

So what does that mean? The identity is a special case of Euler’s formula, e^{ix} = cos x + i sin x, which demonstrates that there is an intrinsic connection between complex exponential functions and trigonometric functions.

So… what does that really mean? Sometimes other people put it best. “The true beauty of Euler’s Identity comes from the fact that, while the true nature of several of these constants continues to remain a mystery to mathematicians (though it is clear that they possess many real-world manifestations), within the confines of this equation they all work together in such a way that they interlock like pieces of a mathematical jigsaw puzzle, the end result of which has the mathematical traveler ending up right back where he began – at the journey’s origin.”
Okay…
If it does not ‘make sense’ don’t worry. The mathematician Carl Friedrich Gauss supposedly said that if this formula was not immediately apparent to a student upon being shown it, that student would never become a first-class mathematician. Few people are. I know I’m not. But that still does not help if you cannot “see” it, so here’s a picture. A moving one that hopefully will act to shed some light on what Euler’s identity tells us. (I got the picture and the text below straight from Wikipedia.)

The exponential function e^z can be defined as the limit of (1 + z/N)^N as N approaches infinity, and thus e^{iπ} is the limit of (1 + iπ/N)^N. In this animation N takes various increasing values from 1 to 100. The computation of (1 + iπ/N)^N is displayed as the combined effect of N repeated multiplications in the complex plane, with the final point being the actual value of (1 + iπ/N)^N. It can be seen that as N gets larger, (1 + iπ/N)^N approaches a limit of −1.
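For anyone who would rather see numbers than an animation, the limit described above can be checked with a few lines of Python. This is only a minimal sketch: the function name `euler_limit` and the chosen values of N are my own, not something from the Wikipedia text.

```python
import cmath
import math


def euler_limit(n: int) -> complex:
    """Return (1 + i*pi/n)**n, the n-th approximation to e**(i*pi)."""
    return (1 + 1j * math.pi / n) ** n


# Watch the approximation creep towards -1 as N grows.
for n in (1, 10, 100, 1000, 100000):
    value = euler_limit(n)
    print(f"N={n:>6}: {value.real:+.6f} {value.imag:+.6f}i")

# The value computed directly from the exponential, for comparison.
print("exp(i*pi) =", cmath.exp(1j * math.pi))
```

The real part settles near −1 and the imaginary part shrinks towards 0, which is all the identity claims.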


If you’re still not sure what it is, Midnight tutor has a good video here: http://www.midnighttutor.com/EulerFormula.html

In case you were wondering. No, it did not help solve the original business need. But it did help to continue to keep alive the awareness that everyday people I work with produce clever, and sometimes stunning, solutions with logical blocks. It’s like working with artists who craft in ether.

It’s a different type of Flash…

Sencha (previously known as Ext JS) has released Sencha Touch, an HTML5 mobile application framework that allows you to develop web applications that look and feel native on Apple iOS and Google Android touchscreen devices.
It uses HTML5 features such as audio/video and localStorage, and CSS3 for styling such as rounded corners, background gradients, and shadows.
The code created is resolution independent. It uses a method which allows developers to change the overall scale of their interfaces on the fly with no pixellation.

Sencha Touch has a powerful animation system that makes flexible animations between screens and views possible.

Slide, pop, and fade animations are included with the library, each with a robust set of options to change attributes like direction and masking style.
And, as they are created with CSS, building custom animations is a joy.

It also includes a set of common icons for use in toolbars and tab bars.

Compatibility: Apple iOS and Google Android
Website: http://www.sencha.com/products/touch/
Demo: http://www.sencha.com/products/touch/demos.php

Free SEO Toolkit From Microsoft

Search Engine Optimization (SEO) Toolkit is a free tool from Microsoft for improving a website’s relevance in search results.

Features:

  • full-featured crawler engine
  • query builder interface that allows you to build custom reports
  • display of detailed information for each URL
  • ability to manage robots.txt file
  • ability to manage sitemap.xml file

It requires an IIS7-enabled computer to run, which basically means Vista, Windows 7, or Server 2008. After that it can analyze any local or remote website.

SEO Toolkit can be installed easily using the Microsoft Web Platform.

Consulting: It’s just Common sense

Consulting is difficult. You have to know a couple of things to get it all to work properly.

  • You have to know your “stuff”
  • You have to know how to read people’s expectations
  • You have to know how to communicate your stuff to people
  • You have to know in what way you can trust those around you
  • You have to know your own limitations
  • You have to know when to call it a day on a “bad” project

The last one can be the toughest… Typically you get to this juncture from one of three paths:

  1. You can’t get them, the client, to their goal.
  2. Focus is on cost rather than net results.
  3. Nobody can see anything other than failure.

Seeing those points written down starts to make it all seem easy, but real life really does make the water much more murky. It is no coincidence that each sentence starts with “You have to know”. It is also no coincidence that these points apply equally to real life…

Success is measured equally by inward and outward focus.

So, it is not easy, but when any, or all, of these stars start to align it is best to bring the issue to the front, acknowledge the possible mismatch, and if appropriate refer them to somebody else.

140 Interview Questions used by Google

Recently I found these and thought they were very interesting… Apparently they have not been used in quite some time but it’s rather interesting to see what some people have experienced whilst job seeking. The full article on them is here: http://blog.seattleinterviewcoach.com/2009/02/140-google-interview-questions.html

  • Why do you want to join Google?
  • What do you know about Google’s product and technology?
  • If you are the Product Manager for Google’s AdWords, how do you plan to market this?
  • What would you say during an AdWords or AdSense product seminar?
  • Who are Google’s competitors, and how does Google compete with them?
  • Have you ever used Google’s products? Gmail?
  • What’s a creative way of marketing Google’s brand name and product?
  • If you are the product marketing manager for Google’s Gmail product, how do you plan to market it so as to achieve 100 million customers in 6 months?
  • How would you boost the Gmail subscription base?
  • What is the most efficient way to sort a million integers?
  • How would you re-position Google’s offerings to counteract competitive threats from Microsoft?
  • How many golf balls can fit in a school bus?
  • You are shrunk to the height of a nickel and your mass is proportionally reduced so as to maintain your original density. You are then thrown into an empty glass blender. The blades will start moving in 60 seconds. What do you do?
  • How much should you charge to wash all the windows in Seattle?
  • How would you find out if a machine’s stack grows up or down in memory?
  • Explain a database in three sentences to your eight-year-old nephew.
  • How many times a day do a clock’s hands overlap?
  • You have to get from point A to point B. You don’t know if you can get there. What would you do?
  • Imagine you have a closet full of shirts. It’s very hard to find a shirt. So what can you do to organize your shirts for easy retrieval?
  • Every man in a village of 100 married couples has cheated on his wife. Every wife in the village instantly knows when a man other than her husband has cheated, but does not know when her own husband has. The village has a law that does not allow for adultery. Any wife who can prove that her husband is unfaithful must kill him that very day. The women of the village would never disobey this law. One day, the queen of the village visits and announces that at least one husband has been unfaithful. What happens?
  • In a country in which people only want boys, every family continues to have children until they have a boy. If they have a girl, they have another child. If they have a boy, they stop. What is the proportion of boys to girls in the country?
  • If the probability of observing a car in 30 minutes on a highway is 0.95, what is the probability of observing a car in 10 minutes (assuming constant default probability)?
  • If you look at a clock and the time is 3:15, what is the angle between the hour and the minute hands? (The answer to this is not zero!)
  • Four people need to cross a rickety rope bridge to get back to their camp at night. Unfortunately, they only have one flashlight and it only has enough light left for seventeen minutes. The bridge is too dangerous to cross without a flashlight, and it’s only strong enough to support two people at any given time. Each of the campers walks at a different speed. One can cross the bridge in 1 minute, another in 2 minutes, the third in 5 minutes, and the slow poke takes 10 minutes to cross. How do the campers make it across in 17 minutes?
  • You are at a party with a friend and 10 people are present including you and the friend. Your friend makes you a wager that for every person you find that has the same birthday as you, you get $1; for every person he finds that does not have the same birthday as you, he gets $2. Would you accept the wager?
  • How many piano tuners are there in the entire world?
  • You have eight balls all of the same size. 7 of them weigh the same, and one of them weighs slightly more. How can you find the ball that is heavier by using a balance and only two weighings?
  • You have five pirates, ranked from 5 to 1 in descending order. The top pirate has the right to propose how 100 gold coins should be divided among them. But the others get to vote on his plan, and if fewer than half agree with him, he gets killed. How should he allocate the gold in order to maximize his share but live to enjoy it? (Hint: One pirate ends up with 98 percent of the gold.)
  • You are given 2 eggs. You have access to a 100-story building. Eggs can be very hard or very fragile, which means an egg may break if dropped from the first floor or may not break even if dropped from the 100th floor. Both eggs are identical. You need to figure out the highest floor of the 100-story building from which an egg can be dropped without breaking. The question is how many drops you need to make. You are allowed to break 2 eggs in the process.
  • Describe a technical problem you had and how you solved it.
  • How would you design a simple search engine?
  • Design an evacuation plan for San Francisco.
  • There’s a latency problem in South Africa. Diagnose it.
  • What are three long-term challenges facing Google?
  • Why are manhole covers round?
  • What is the difference between a mutex and a semaphore? Which one would you use to protect access to an increment operation?
  • A man pushed his car to a hotel and lost his fortune. What happened?
  • Explain the significance of “dead beef”.
  • Write a C program which measures the speed of a context switch on a UNIX/Linux system.
  • Given a function which produces a random integer in the range 1 to 5, write a function which produces a random integer in the range 1 to 7.
  • Describe the algorithm for a depth-first graph traversal.
  • Design a class library for writing card games.
  • You need to check that your friend, Bob, has your correct phone number, but you cannot ask him directly. You must write the question on a card and give it to Eve, who will take the card to Bob and return the answer to you. What must you write on the card, besides the question, to ensure Bob can encode the message so that Eve cannot read your phone number?
  • How are cookies passed in the HTTP protocol?
  • Design the SQL database tables for a car rental database.
  • Write a regular expression which matches an email address.
  • Write a function f(a, b) which takes two character string arguments and returns a string containing only the characters found in both strings in the order of a. Write a version which is order N-squared and one which is order N.
  • You are given the source to an application which is crashing when run. After running it 10 times in a debugger, you find it never crashes in the same place. The application is single threaded, and uses only the C standard library. What programming errors could be causing this crash? How would you test each one?
  • Explain how congestion control works in the TCP protocol.
  • In Java, what is the difference between final, finally, and finalize?
  • What is multithreaded programming? What is a deadlock?
  • Write a function (with helper functions if needed) called to Excel that takes an Excel column value (A,B,C,D…AA,AB,AC,… AAA..) and returns a corresponding integer value (A=1,B=2,… AA=27..).
  • You have a stream of infinite queries (ie: real time Google search queries that people are entering). Describe how you would go about finding a good estimate of 1000 samples from this never ending set of data and then write code for it.
  • Tree search algorithms. Write BFS and DFS code, explain run time and space requirements. Modify the code to handle trees with weighted edges and loops with BFS and DFS, make the code print out path to goal state.
  • You are given a list of numbers. When you reach the end of the list you will come back to the beginning of the list (a circular list). Write the most efficient algorithm to find the minimum # in this list. Find any given # in the list. The numbers in the list are always increasing but you don’t know where the circular list begins, ie: 38, 40, 55, 89, 6, 13, 20, 23, 36.
  • Describe the data structure that is used to manage memory. (stack)
  • What’s the difference between local and global variables?
  • If you have 1 million integers, how would you sort them efficiently? (modify a specific sorting algorithm to solve this)
  • In Java, what is the difference between static, final, and const? (If you don’t know Java they will ask something similar for C or C++.)
  • Talk about your class projects or work projects (pick something easy)… then describe how you could make them more efficient (in terms of algorithms).
  • Suppose you have an NxN matrix of positive and negative integers. Write some code that finds the sub-matrix with the maximum sum of its elements.
  • Write some code to reverse a string.
  • Implement division (without using the divide operator, obviously).
  • Write some code to find all permutations of the letters in a particular string.
  • What method would you use to look up a word in a dictionary?
  • What is the C-language command for opening a connection with a foreign host over the internet?
  • Design and describe a system/application that will most efficiently produce a report of the top 1 million Google search requests. These are the particulars: 1) You are given 12 servers to work with. They are all dual-processor machines with 4 GB of RAM and 4x400 GB hard drives, and are networked together. (Basically, nothing more than high-end PCs.) 2) The log data has already been cleaned for you. It consists of 100 billion log lines, broken down into 12 files of 320 GB each, each line holding a 40-byte search term. 3) You can use only custom-written applications or available free open-source software.
  • There is an array A[N] of N numbers. You have to compose an array Output[N] such that Output[i] will be equal to multiplication of all the elements of A[N] except A[i]. For example Output[0] will be multiplication of A[1] to A[N-1] and Output[1] will be multiplication of A[0] and from A[2] to A[N-1]. Solve it without division operator and in O(n).
  • There is a linked list of numbers of length N. N is very large and you don’t know N. You have to write a function that will return k random numbers from the list. Numbers should be completely random. Hint: 1. Use random function rand() (returns a number between 0 and 1) and irand() (return either 0 or 1) 2. It should be done in O(n).
  • Find or determine non existence of a number in a sorted list of N numbers where the numbers range over M, M>> N and N large enough to span multiple disks. Algorithm to beat O(log n) bonus points for constant time algorithm.
  • You are given a game of Tic Tac Toe. You have to write a function in which you pass the whole game and the name of a player. The function will return whether the player has won the game or not. First you have to decide which data structure you will use for the game. You need to explain the algorithm first and then write the code. Note: some positions may be blank in the game, so your data structure should consider this condition also.
  • You are given an array [a1 To an] and we have to construct another array [b1 To bn] where bi = a1*a2*…*an/ai. you are allowed to use only constant space and the time complexity is O(n). No divisions are allowed.
  • How do you put a Binary Search Tree in an array in an efficient manner? Hint: if the node is stored at the ith position and its children are at 2i and 2i+1 (i.e. level order), it’s not the most efficient way.
  • How do you find the fifth maximum element in a Binary Search Tree in an efficient manner? Note: you should not use any extra space, i.e. no sorting the Binary Search Tree, storing the results in an array, and listing out the fifth element.
  • Given a data structure having first n integers and next n chars, A = i1 i2 i3 … iN c1 c2 c3 … cN, write an in-place algorithm to rearrange the elements of the array as A = i1 c1 i2 c2 … iN cN.
  • Given two sequences of items, find the items whose absolute number increases or decreases the most when comparing one sequence with the other by reading the sequence only once.
  • Given that one of the strings is very, very long and the other could be of various sizes, windowing will result in an O(N+M) solution, but could it be better? Maybe N log M or even better?
  • How many lines can be drawn in a 2D plane such that they are equidistant from 3 non-collinear points?
  • Let’s say you have to construct Google maps from scratch and guide a person standing on Gateway of India (Mumbai) to India Gate(Delhi). How do you do the same?
  • Given that you have one string of length N and M small strings of length L. How do you efficiently find the occurrence of each small string in the larger one?
  • Given a binary tree, programmatically you need to prove it is a binary search tree.
  • You are given a small sorted list of numbers, and a very very long sorted list of numbers – so long that it had to be put on a disk in different blocks. How would you find those short list numbers in the bigger one?
  • Suppose you are given N companies, and we want to eventually merge them into one big company. How many ways are there to merge?
  • Given a file of 4 billion 32-bit integers, how to find one that appears at least twice?
  • Write a program for displaying the ten most frequent words in a file such that your program should be efficient in all complexity measures.
  • Design a stack. We want to push, pop, and also, retrieve the minimum element in constant time.
  • Given a set of coin denominators, find the minimum number of coins to give a certain amount of change.
  • Given an array, i) find the longest continuous increasing subsequence. ii) find the longest increasing subsequence.
  • Suppose we have N companies, and we want to eventually merge them into one big company. How many ways are there to merge?
  • Write a function to find the middle node of a single link list.
  • Given two binary trees, write a compare function to check if they are equal or not. Being equal means that they have the same value and same structure.
  • Implement put/get methods of a fixed size cache with LRU replacement algorithm.
  • You are given three sorted arrays (in ascending order), and you are required to find a triplet (one element from each array) such that the distance is minimum. Distance is defined like this: if a[i], b[j] and c[k] are three elements then distance = max(abs(a[i]-b[j]), abs(a[i]-c[k]), abs(b[j]-c[k])). Please give a solution in O(n) time complexity.
  • How does C++ deal with constructors and destructors of a class and its child class?
  • Write a function that flips the bits inside a byte (either in C++ or Java). Write an algorithm that take a list of n words, and an integer m, and retrieves the mth most frequent word in that list.
  • What’s 2 to the power of 64?
  • How do you find the fifth maximum element in a Binary Search Tree in an efficient manner?
  • There is a linked list of millions of nodes and you do not know the length of it. Write a function which will return a random number from the list.
  • How long would it take to sort 1 trillion numbers? Come up with a good estimate.
  • Order the functions in order of their asymptotic performance: 1) 2^n 2) n^100 3) n! 4) n^n
  • There are some data represented by(x,y,z). Now we want to find the Kth least data. We say (x1, y1, z1) > (x2, y2, z2) when value(x1, y1, z1) > value(x2, y2, z2) where value(x,y,z) = (2^x)*(3^y)*(5^z). Now we can not get it by calculating value(x,y,z) or through other indirect calculations as lg(value(x,y,z)). How to solve it?
  • How many degrees are there in the angle between the hour and minute hands of a clock when the time is a quarter past three?
  • Given an array whose elements are sorted, return the index of the first occurrence of a specific integer. Do this in sub-linear time, i.e. do not just go through each element searching for that element.
  • Given two linked lists, return the intersection of the two lists: i.e. return a list containing only the elements that occur in both of the input lists.
  • What’s the difference between a hashtable and a hashmap?
  • If a person dials a sequence of numbers on the telephone, what possible words/strings can be formed from the letters associated with those numbers?
  • How would you reverse the image on an n by n matrix where each pixel is represented by a bit?
  • Create a fast cached storage mechanism that, given a limitation on the amount of cache memory, will ensure that only the least recently used items are discarded when the cache memory is reached when inserting a new item. It supports 2 functions: String get(T t) and void put(String k, T t).
  • Create a cost model that allows Google to make purchasing decisions by comparing the cost of purchasing more RAM for their servers vs. buying more disk space.
  • Design an algorithm to play a game of Frogger and then code the solution. The object of the game is to direct a frog to avoid cars while crossing a busy road. You may represent a road lane via an array. Generalize the solution for an N-lane road.
  • What sort would you use if you had a large data set on disk and a small amount of ram to work with?
  • What sort would you use if you required tight max time bounds and wanted highly regular performance?
  • How would you store 1 million phone numbers?
  • Design a 2D dungeon crawling game. It must allow for various items in the maze – walls, objects, and computer-controlled characters. (The focus was on the class structures, and how to optimize the experience for the user as s/he travels through the dungeon.)
  • What is the size of the C structure below on a 32-bit system? On a 64-bit?

struct foo {
    char a;
    char* b;
};
  • Efficiently implement 3 stacks in a single array.
  • Given an array of integers which is circularly sorted, how do you find a given integer?
  • Write a program to find depth of binary search tree without using recursion.
  • Find the maximum rectangle (in terms of area) under a histogram in linear time.
  • Most phones now have full keyboards. Before that, there were three letters mapped to a number button. Describe how you would go about implementing spelling and word suggestions as people type.
  • Describe recursive mergesort and its runtime. Write an iterative version in C++/Java/Python.
  • How would you determine if someone has won a game of tic-tac-toe on a board of any size?
  • Given an array of numbers, replace each number with the product of all the numbers in the array except the number itself *without* using division.
  • Create a cache with fast look up that only stores the N most recently accessed items.
  • How to design a search engine? If each document contains a set of keywords, and is associated with a numeric attribute, how to build indices?
  • Given two files that each have a list of words (one per line), write a program to show the intersection.
  • What kind of data structure would you use to index anagrams of words? E.g. if there exists the word “top” in the database, the query for “pot” should list that.
  • What is the yearly standard deviation of a stock given the monthly standard deviation?
  • How many resumes does Google receive each year for software engineering?
  • Anywhere in the world, where would you open up a new Google office and how would you figure out compensation for all the employees at this new office?
  • What is the probability of breaking a stick into 3 pieces and forming a triangle?
  • You’re the captain of a pirate ship, and your crew gets to vote on how the gold is divided up. If fewer than half of the pirates agree with you, you die. How do you recommend apportioning the gold in such a way that you get a good share of the booty, but still survive?
  • How would you work with an advertiser who was not seeing the benefits of the AdWords relationship due to poor conversions?
  • How would you deal with an angry or frustrated advertiser on the phone?


Scaling your SharePoint application

Very often there is a moment in a project where one or more people go “uh oh.” What happens next usually defines the final result. Typically it is one arm of the overly dreaded triangle that is the immediate root cause of concern.

  • Time
  • Money/Cost/Budget
  • Scope

It is important to note that quality is not an input. It is essentially the output of the sum of the three. The quality level you’re shooting for is not absolute — it’s really part of scope. You have to articulate and repeatedly verify the bar needed to satisfy your customers. If a system performs the functions you say you wanted and you still don’t like it, then you got the requirements wrong. If you update the requirements to address your objections, you’ll discover that the scope is greater than you identified. That confusion is a telltale sign of a team with immature design capability. (It should never be forgotten that Project management is a subset of Program management.)

Enterprise deployments of environments such as SharePoint require careful pre-planning, expertise, and thoughtful consideration for “future proofing” the final deliverable. A little now, or more later really does apply. Knowing the difference between scaling up or out is not sufficient. Sometimes you need to step back and redesign and redeploy everything along with updating the project initiating processes as well. SharePoint is a nebulous product which requires understanding of SQL, WSS, HIGs, people, and more. 2010 is going to make things even more “difficult” so take your time, talk to everybody involved, and have a process.

New technology and common sense

When new technology is released it can be very painful. In more ways than one…

It can be painful to use if it is still in a Beta-like form. Bugs, lack of features, changing road-maps, and other nuances of maturing code can be irritating for end users and those who support the solution. And with Google taking it upon themselves to redefine the public’s perception of “Beta” the waters are more clouded than ever as to what to expect from Beta software.

Rabid supporters and detractors serve to polarize and create opinions/sides that apparently must be held at all cost. Microsoft/Windows Vs Apple/OS X, Google/Android Vs Apple/iPhone, etc. Having an opinion is great, but if it blinds you then maybe it’s not so great…

New software frequently becomes the “in” thing. But just because it’s new and shiny does not mean that you should use it. Just because you feel that you need to learn it does not mean that it needs to be jammed into your current project(s).

SharePoint 2010 has some rather interesting social networking features that are fraught with hazards without sufficient governance. Take it slow and really consider the consequences of feature deployment.

A Thought on Monte Carlo Simulation Using Parallel Asynchronous Web Services with .NET and SharePoint

Monte Carlo Simulation is a technique used to estimate the likely range of outcomes of a complex process by simulating that process a large number of times, each time with randomly selected input conditions that are true to the process model. (In fact, the more runs you do, the better your estimate.) The Monte Carlo method is best applied when a deterministic solution would be too computationally intensive or does not exist at all.
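To make the "more runs, better estimate" point concrete, here is a minimal sketch using the classic toy problem of estimating pi from random points in a square. The function name and the sample sizes are my own choices for illustration, and pi obviously has far better deterministic solutions; the point is only to watch the estimate settle as the number of runs grows.

```python
import random

def estimate_pi(num_samples, seed=0):
    """Estimate pi from the share of random points that land inside a quarter circle."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()  # random point in the unit square
        if x * x + y * y <= 1.0:           # inside the quarter circle of radius 1
            inside += 1
    # The quarter circle covers pi/4 of the square, so scale the hit ratio by 4.
    return 4.0 * inside / num_samples

# The estimate should generally tighten as the sample count grows.
for n in (100, 10_000, 1_000_000):
    print(n, estimate_pi(n))
```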

Monte Carlo Simulation is used in/with

  • Physical sciences
  • Design and visuals
  • Finance and business
  • Telecommunications
  • Games

Monte Carlo Simulation is not a “what if” process. What-ifs require single point estimates and use deterministic modeling: basically you are using best case, worst case, and so on. With Monte Carlo you draw large random samples from probability distribution functions to produce a large range of outputs, which in turn lets you state, with greater confidence, a narrower range of likely outcomes. In other words you are not using equal weights for each scenario.
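The difference between the two approaches can be sketched with a made-up project plan. Everything here is hypothetical: the three tasks, their day estimates, and the choice of a triangular distribution are assumptions for illustration only. The "what if" version adds up three single-point scenarios, while the Monte Carlo version samples each task many times and reports where most of the totals actually fall.

```python
import random
import statistics

# Hypothetical task estimates in days: (best case, most likely, worst case).
TASKS = [(2, 4, 10), (5, 8, 20), (1, 2, 6)]

def what_if_totals(tasks):
    """Deterministic 'what if': one total per single-point scenario."""
    best = sum(t[0] for t in tasks)
    likely = sum(t[1] for t in tasks)
    worst = sum(t[2] for t in tasks)
    return best, likely, worst

def simulate_totals(tasks, runs=20_000, seed=1):
    """Monte Carlo: draw each task from a triangular distribution and sum them."""
    rng = random.Random(seed)
    return [
        sum(rng.triangular(low, high, mode) for low, mode, high in tasks)
        for _ in range(runs)
    ]

totals = simulate_totals(TASKS)
cuts = statistics.quantiles(totals, n=20)  # cut points at 5%, 10%, ..., 95%

print("what-if (best, likely, worst):", what_if_totals(TASKS))
print("simulated 5th to 95th percentile:", round(cuts[0], 1), "to", round(cuts[-1], 1))
```

The simulated range sits well inside the best-case/worst-case extremes, because all three tasks hitting their extreme at the same time is very unlikely.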

Why is this pertinent? Well, stay with me on this one. Markov chain methods are extremely useful for generating sequences of random numbers that accurately reflect rather complicated probability distributions, a process called Markov chain Monte Carlo (MCMC). In other words, a tool that is used to generate simulations from a probability distribution…

The Google PageRank of a webpage is defined by a Markov chain.
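As a rough illustration of that claim, the sketch below treats a "random surfer" as a Markov chain over a tiny, made-up web of four pages and repeatedly applies the transition step until the ranks settle. The link graph, the damping factor of 0.85, and the iteration count are assumptions chosen for the example, and every page is given at least one outbound link to keep the code short. This is not Google's implementation.

```python
# Hypothetical four-page web: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iterations=100):
    """Approximate the stationary distribution of the random-surfer Markov chain."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal probability everywhere
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        # Otherwise the surfer follows one of the current page's links at random.
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```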

And the penny drops…

Now, back to the point.

Depending on the degree of accuracy ultimately required, millions or billions of points may need to be tried. Distributing billions of point calculations across multiple servers running Monte Carlo Simulations via web services would parallelize the process and generate results VERY quickly. Good in concept, but how to do it?

As defined by the W3C a web service is “a software system designed to support interoperable machine-to-machine interaction over a network.” Running web services on IIS has advantages not limited to:

  • You can grow your “cluster” by just deploying the web service to new nodes.
  • Each web service call in IIS is handled on its own thread, which should have obvious and positive performance implications.
  • Web services provide a relatively simple and straightforward method of distributing parallel problems across multiple compute platforms.
  • Because web services are written like traditional functions, they are easily parallelized without hand-coding a multi-threaded application, custom writing a message passing interface, or using other high performance computing management software.

Needless to say, unless your requirements can be served by parallel computations, which would have no dependency on others in the pipe, this is going to become very difficult or rather “challenging” 🙂 very, very quickly.
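The shape of the idea can be sketched without any servers at all. In the snippet below a local thread pool stands in for the web service nodes, and `simulate_chunk` stands in for a single web service call; both names, the chunk sizes, and the seeds are hypothetical. In pure Python the threads will not actually speed up this CPU-bound loop, so read it as a picture of the split-and-combine pattern rather than a benchmark. What matters is that each chunk is fully independent, which is exactly the condition described above.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate_chunk(job):
    """Stand-in for one web service call: count hits for an independent batch."""
    seed, num_points = job
    rng = random.Random(seed)  # own generator, so calls share no state
    hits = 0
    for _ in range(num_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def distributed_estimate(total_points=400_000, workers=4):
    """Split the points into independent chunks and combine the partial counts."""
    chunk = total_points // workers
    jobs = [(seed, chunk) for seed in range(workers)]  # one job per "node"
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_hits = list(pool.map(simulate_chunk, jobs))
    return 4.0 * sum(partial_hits) / (chunk * workers)

print(distributed_estimate())
```

Swapping the thread pool for real HTTP calls to a service hosted on several machines would keep the same structure: send out jobs, collect partial results, combine.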

So how could SharePoint fit in? SharePoint is perfect for acting as a landing point for your data. In and out. Companies benefit by building intelligence into their document libraries and lists with workflows. With workflow, SharePoint can act as a central hub for the data, sending it out to a queue which distributes to nodes on the network. Upon return, the data could be used to populate lists, document libraries, notify people/groups, and more. Search, BDC, Security, and all the other features in SharePoint make this concept a compelling one.

A random tidbit on non random data

I was recently talking with somebody who felt that TrueCrypt hidden volumes were the bee’s knees. The scenario they used, and which I myself have read ‘musings’ about, involved a laptop carrying sensitive corporate data being seized by customs. Laptop drive gets “reviewed”, secret container is not seen, and laptop passes as normal and uninteresting. Big deal. Bigger deal is if you have 007 style data and that guy in the uniform is pretty certain you have it as well. My colleague’s version of the story ends with an almost Hollywood-style exhalation of breath and a cinematic zoom out to the hero walking out the door. That’s probably not how it would pan out…

TrueCrypt volumes, which are essentially files, have certain characteristics that allow programs such as TCHunt to detect them with a high *probability*. The most significant, in mathematical terms, is that their file size modulo 512 is 0. Now it is certainly true that TrueCrypt volumes do not contain known file headers and that their content is indistinguishable from random, so it is difficult to definitively prove that certain files are TrueCrypt volumes. However, their very presence can provide reasonable suspicion that they contain encrypted data.

The actual math behind this is interesting. TrueCrypt volume files have file sizes that are evenly divisible by 512 and their content passes chi-square randomness tests. A chi-square test is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-square distribution* when the null hypothesis is true, or any in which this is asymptotically true. That is, the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough.
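Those two checks can be sketched in a few lines. To be clear, this is not TCHunt's actual code: the function names are mine, the data is generated in memory rather than read from disk, and the cutoff of 350 is an illustrative assumption. For truly random bytes the statistic has 255 degrees of freedom, so it averages about 255, and 350 sits roughly four standard deviations above that.

```python
import random

SECTOR = 512  # the size granularity mentioned above

def chi_square_bytes(data):
    """Pearson chi-square statistic of byte frequencies against a uniform spread."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    expected = len(data) / 256  # each byte value equally likely if data is random
    return sum((c - expected) ** 2 / expected for c in counts)

def looks_like_encrypted_volume(data, threshold=350.0):
    """Flag data whose size is a multiple of 512 and whose bytes look uniform.

    The threshold is an assumed, illustrative cutoff, not a value taken from TCHunt.
    """
    return len(data) % SECTOR == 0 and chi_square_bytes(data) < threshold

rng = random.Random(7)
random_blob = bytes(rng.getrandbits(8) for _ in range(SECTOR * 64))
text_blob = (b"The quick brown fox jumps over the lazy dog. " * 1000)[: SECTOR * 64]

print("random bytes:", looks_like_encrypted_volume(random_blob))
print("plain text:  ", looks_like_encrypted_volume(text_blob))
```

Note that a compressed archive padded to the right size could also pass both checks, which is why the result is a suspicion and not a proof.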

So what does this all mean? Really nothing for us normal people. For those for whom I have built custom STSADM containers for securing your backups and exports, your data is still secure and will stay that way indefinitely. For those running across the border, a forensic analysis will reveal the presence of encrypted data, TrueCrypt volumes or otherwise, but not much more. Sometimes that’s enough to start asking questions or poking further. With the forensic tools, not the dentistry kit.

* A skewed distribution whose shape depends on the number of degrees of freedom. As the number of degrees of freedom increases, the distribution becomes more symmetrical.

http://www.truecrypt.org/
http://16systems.com/TCHunt/