The value of distributed computing: The return of Markov chain Monte Carlo methods

A while back I wrote something on doing Monte Carlo simulations with Web Services and SharePoint. Halfway through I mentioned that Google PageRank is defined by a Markov chain, which in turn ties it to the family of techniques called Markov chain Monte Carlo methods. Not that it concerned me, but only one person picked up on this, and even then it was only a vague mention. Huh…

This actually is a big deal. In fact, a very big deal. A multibillion-dollar deal, as in the case of Google PageRank. Distributed computing has the power to help us solve many things if applied correctly. The “cloud” does not. (A topic for later.) Probably the greatest hurdle in getting people back on track is that this technology has uses beyond the scope of most people’s daily lives. For example…

A paper was published in PLoS last week, September 4th 2009, called “Can an Eigenvector Measure Species’ Importance for Coextinctions?” In it the authors state that “PageRank” can be applied to the study of food webs. Food webs are the complex networks of who eats whom in an ecosystem. Typically we’re at the top, unless Hollywood or very bad planning is involved. Essentially, the scientists are saying that their particular version of PageRank could be a simple way of working out which extinctions would lead to ecosystem collapse. A relatively handy thing to have these days… As every species is embedded in a complex network of relationships with others, even a single extinction can rapidly cascade into the loss of seemingly unrelated species. Investigating when this might happen using more conventional methods is complicated because, even in simple ecosystems, the number of combinations exceeds the number of atoms in the universe… E.g. a typical lottery with 8 numbers, each ranging between 1 and 50, has 39,062,500,000,000 different possible draws (that is 50 to the power of 8, counting order and allowing repeats)…
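As a quick sanity check on the lottery figure, here is a minimal Python sketch. The function names are mine, purely for illustration. It shows that 39,062,500,000,000 is 50 to the power of 8, which counts ordered draws where a number may repeat. If order does not matter and numbers cannot repeat, the count is far smaller, but the point about how fast these numbers grow still stands.

```python
import math


def ordered_draws(pool: int, picks: int) -> int:
    """Count draws where order matters and a number may repeat."""
    return pool ** picks


def unordered_draws(pool: int, picks: int) -> int:
    """Count draws where order is ignored and numbers cannot repeat."""
    return math.comb(pool, picks)


if __name__ == "__main__":
    print(f"{ordered_draws(50, 8):,}")    # 50 ** 8
    print(f"{unordered_draws(50, 8):,}")  # 50 choose 8
```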

The researchers had to tweak PageRank to adapt it for their ecology-focused purposes.

“First of all we had to reverse the definition of the algorithm.” “In PageRank, a web page is important if important pages point to it. In our approach a species is important if it points to important species.”

They also tested against algorithms that were already in use in computational biology to find a solution to the same problem. PageRank, in its adjusted form, gave them exactly the same solution as these much more complicated algorithms.
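To make the reversal concrete, here is a rough Python sketch. It is not the authors’ actual algorithm: it is plain textbook PageRank computed by power iteration, with a damping factor of 0.85 that I have assumed, run on a made-up five-species food web whose links point from prey to predator. Reversing the links before ranking turns “important if important things point to you” into “important if you point to important things”.

```python
def reverse_edges(links):
    """Return a copy of the graph with every link flipped."""
    reversed_links = {node: [] for node in links}
    for source, targets in links.items():
        for target in targets:
            reversed_links.setdefault(target, []).append(source)
    return reversed_links


def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration (damping value is an assumption)."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Nodes with no outgoing links spread their rank evenly over everyone.
        dangling = sum(rank[node] for node in nodes if not links[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for source in nodes:
            targets = links[source]
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy food web: each link points from prey to predator.
food_web = {
    "grass": ["rabbit", "mouse"],
    "rabbit": ["fox"],
    "mouse": ["fox", "owl"],
    "fox": [],
    "owl": [],
}

# Rank on the reversed web, so a species scores highly when
# important species depend on it.
scores = pagerank(reverse_edges(food_web))

if __name__ == "__main__":
    for species, score in sorted(scores.items(), key=lambda item: -item[1]):
        print(f"{species:8s} {score:.3f}")
```

In this toy web the grass comes out on top, since every other species depends on it directly or indirectly, which is the kind of answer you would hope for from a “whose extinction hurts most” ranking.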

With the right design, SharePoint can be an extremely useful, and totally appropriate, interface for accessing and disseminating the inputs and outputs of such an effort. It can store and present this data with all of the requisite benefits one would expect from a collaborative platform. Certainly there’s a world of work involved in doing something like this, but the key point is that the “right tool for the right job” mantra works here. “All” you need is:

  • IIS
  • .NET
  • SharePoint
  • PowerShell
  • Visual Studio
  • SQL
  • Skill

Opera Unite – a perspective change from the centralized model used by SharePoint?

Opera Unite, a web browser melded with a web server. Now there’s a novel concept.

Opera Unite allows you to share your files, stream music, host sites, and communicate in real time with people. The suite of services, and that’s literally what they are, is comprehensive.

  • File Sharing
  • Photo Sharing
  • The Lounge
  • Fridge
  • Media Player
  • Web Server
  • and more…

But there’s a problem with it. A very big problem that I suspect Opera Marketing are all too aware of. Although Opera Unite claims to “directly link people’s personal computers together,” to use it you must have an account on Opera’s servers. Once you have that, all of your exchanges pass through Opera’s servers first. Sure, that’s an effective way to get around technical difficulties such as NAT, firewalls, etc., but the big issue is that it makes Opera the intermediary in your social interactions: not Facebook, not MySpace, but Opera. Think it through. Stepping past all the hype, the benchmarks*, etc., you have just another lock-in scenario. If Opera is up, you’re up. Sure, your stuff is on your machine, but it can only be accessed via Opera, the domain.

Is there a way around this? Do we need a way around this? Yes, it would be possible to create a swarm and find your friends, but what happens when your computer is down and somebody wants to access your content? Nothing.

*Benchmarks

Excerpt from http://unitehowto.com/Performance below. Take them in context.

Opera Unite uses very smart file I/O! Even if you save data to file each request (simplest, but stupidest way to do it) – it still can push out very impressive 744 requests/second! (It probably means that this data is saved to memory and dumped only sometimes, smart move!)

It seems like Opera uses 13 threads (seems like a soft limit, but unchangeable). 13 concurrent connections max out @ 810req/s, 1.23ms processing time.

For comparison:

PHP+Apache(+MySQL) is almost 2 times faster than peak Unite performance.

Compiled C++ web server (MadFish WebToolkit) is only 6 times faster than Opera Unite, but that is compiled raw C++.

nginx (one of the fastest Web Servers available) is only 5 times faster than Opera Unite (clocked at 4900 req/s in raw C++) “Welcome to nginx” cycle (no I/O or scripting).