We just ended the founding meeting of a Machine Learning Study Group here in Copenhagen. We're planning to meet approximately once a month to discuss machine learning algorithms. Between meetings there's a Facebook group where we hang out and discuss our current pet machine learning problems, and a Google Docs repository of machine learning resources (still to come).
Here's how we're going to run the group: The goal is to get smarter about machine learning. We're narrowing down on a couple of stacks we care about. Big data using databases - we're going to have a look at MADlib for PostgreSQL, or something based on Map/Reduce. Some of us occasionally work in Processing using OpenCV - which is neat, because it's also a good choice on the iPhone, in Xcode/openFrameworks.
Some of us are probably going to be looking at Microsoft Solver Foundation in C# at some point. Some of us might be looking at implementing algorithms for one of these environments if no one else has.
We'll be building out our knowledge on configuring and setting up these stacks. The idea is for everyone to work on a pet problem with a machine learning element. We'll share our ideas on how to proceed, share how to approach the stack of software.
The typical approach to a machine learning problem involves collecting and cleaning data, extracting features, choosing and training a model, and evaluating the result against held-out data.
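To make that concrete, here is a toy end-to-end pass in plain Python - featurize, train, evaluate - using a nearest-centroid classifier. The data and names are made up for illustration; a real project would swap in one of the stacks mentioned above.

```python
# Toy machine learning pipeline: train a nearest-centroid classifier
# on labelled feature vectors, then evaluate it. Purely illustrative.

def train(samples):
    """Average the feature vectors per label -> one centroid per class."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Pick the label whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, features))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Tiny made-up dataset: two obvious clusters.
train_data = [([0.0, 0.1], "a"), ([0.2, 0.0], "a"),
              ([1.0, 0.9], "b"), ([0.9, 1.1], "b")]
model = train(train_data)
accuracy = sum(predict(model, f) == l
               for f, l in train_data) / len(train_data)
```

On real problems the evaluation step would of course use data held back from training; here the point is just the shape of the loop.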
Just to give a little landscape: The founding members are primarily interested in the following spaces of problems: Image analysis - with a little time series analysis of sensor data thrown in for interactive fun - and "big data", running machine learning algorithms on web-scale data.
If you're into this problem space, and this style of working - project driven, collaborative and conversational - feel free to join. If there's simply something you'd like to be able to do with a computer, feel free to join as well - maybe one of the more technical hangers on will take an interest in your idea and start working on it.
It worked wonders for the ARToolkit to be connected to the Flash developer community. Maybe it will work for Arduino and hardware hacking as well - using the Netlab toolkit.
On Twitter, @tveskov asks: how many layers of abstraction does it take to run Twitter?
This is a really hard question, probably intractable - if you want to include the physics and signalling of lasers through the fiberoptic cables that send Twitter's data to my home. There are simply too many different places along the route where the platform relies on some level of abstraction for a full enumeration to be feasible.
Instead, let's tackle a simpler question: how many technologies/abstractions are directly visible in Twitter's source code? I did a view source and tried to find all the technology that, judging from the source, must be in use to run Twitter.
The rule here is that there has to be text in the source file that does not make sense unless you know the abstraction/standard/software/API it refers to.
Here's my best shot at the list (in order of discovery by me, while reading):
Notably absent: The email standards. While obviously employed by Twitter (as are many other standards if the API is considered), I found no evidence of email in the source of the logged-in front page.
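The same view-source game can be sketched mechanically: scan the markup for strings that only make sense if you know the technology behind them. The signature patterns and sample HTML below are made up for illustration - Twitter's actual markup differed.

```python
import re

# Illustrative signature patterns one might grep for when playing
# the "view source" game. Both the patterns and the sample page
# are invented; they only show the shape of the exercise.
SIGNATURES = {
    "JavaScript": r"<script",
    "CSS":        r"<link[^>]+stylesheet|<style",
    "jQuery":     r"jquery",
    "UTF-8":      r"charset=utf-8",
    "favicon":    r"rel=\"?shortcut icon",
}

def visible_tech(html):
    """Return the names of every signature found in the page source."""
    html = html.lower()
    return sorted(name for name, pattern in SIGNATURES.items()
                  if re.search(pattern, html))

sample = ('<html><head><meta charset=utf-8>'
          '<link rel=stylesheet href=site.css>'
          '<script src="jquery.min.js"></script></head></html>')
```

Calling `visible_tech(sample)` on the invented page above flags JavaScript, CSS, jQuery and UTF-8 - the manual version of this is exactly what the list records.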
Amazon's utility computing platform just added official MySQL Enterprise and Oracle support to the offering. One nice consequence for Amazon is that their pricing will now seem dirt cheap compared to the Oracle licenses, whereas previously the psychology around the pricing was that you paid more for the convenience of scalability and the elastic supply.
Google App Engine launched - it's quite a different beast from Amazon's cloud services. The three most notable differences from Amazon's offering:
Free to try - not just during the beta, but apparently forever.
During the beta the allowances are: Apps that use less than 500MB of space, 200 teraclockcycles of CPU per day (that's about 1 single-core 2.3 GHz CPU running at max, continuously, if TechCrunch is reporting the number correctly) and 10 GB of traffic daily are free. If they keep that up they could destroy the hosting industry completely. The only plausible deal breaker above - other than Google lock-in - is the 500 MB limit, which is on the small side.
Integrated stack, not a bundle of opportunities - Amazon's cloud services are basic by design and it's up to clients to perform the integration. Tons of stacks have sprung up for that, but you have to choose and configure yourself. Google's is one integrated solution, with the restrictions and opportunities that provides.
Actually, the integrated stack of Google App Engine makes one feel bad for the now defunct Zimki. Zimki was a nice idea, but took way too many risks with an alien development environment and an unclear above-toy-level road map. Google's massive infrastructure, however, is perfect marketing for App Engine as viable above play level. Who wouldn't want Google's uptime and scalability?
Integrated with Google's user base
"Eat shit and die, Facebook"? That could be the end goal, at least. With the Google App Engine you can access Google's user account system, so you don't have to design your own. One can easily imagine Open Social extensions to the App Engine and/or integration with e.g. Google Talk.
Whether it's going to be a toy or not will be down to pricing above the free level, I guess. The functionality of the Python stack in the SDK seems to cover the basics well, with integrated object storage and email alongside the Python CGI.
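As a sanity check on the quota figure quoted earlier - does 200 teraclockcycles per day really equal one 2.3 GHz core running flat out? The arithmetic:

```python
# Back-of-the-envelope check of the App Engine CPU quota:
# 200 teracycles/day expressed as a sustained clock rate.
cycles_per_day = 200e12
seconds_per_day = 24 * 60 * 60          # 86,400 seconds
sustained_hz = cycles_per_day / seconds_per_day
# comes out at roughly 2.3e9 Hz, i.e. one 2.3 GHz core
# running continuously - so TechCrunch's number checks out
```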
[UPDATE: Nice Google App Engine promotion: Jaiku is moving in]
Sun's open source software stack (everything but Java) is extended through purchase with MySQL - for a price of $1 Billion.
Amazon's continued roll out of large scale grid services now also has a database. It's less interesting than it ought to have been: They haven't done unlimited scale through parallelization, but scalability in terms of number of clients accessing the server is there. What I would consider interesting, though, would have been the ability to store very large tables and query those. Since you can't do any joins, it seems even a pretty naive implementation of that would work out.
Plus points for extreme simplicity - it's a basic object/attribute store with robust but simple search functionality on top.
My previous job was at a domain registrar - where we ran something that looks a lot like a domain registry. This would have been perfect for that kind of application.
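A minimal sketch of that object/attribute model, in plain Python: items keyed by id, each holding a bag of attribute-to-values mappings, with simple equality queries on top. This mimics the shape of the service, not its actual API, and the registry-style data is invented.

```python
# Sketch of an object/attribute store with simple search - the
# shape of the service, not its API. Items map attribute names
# to sets of values; queries match on attribute equality.
class AttributeStore:
    def __init__(self):
        self.items = {}

    def put(self, item_id, **attrs):
        """Add attribute values to an item, creating it if needed."""
        bag = self.items.setdefault(item_id, {})
        for name, value in attrs.items():
            bag.setdefault(name, set()).add(value)

    def query(self, **attrs):
        """Return ids of items matching every given attribute=value."""
        return sorted(item_id for item_id, bag in self.items.items()
                      if all(value in bag.get(name, set())
                             for name, value in attrs.items()))

# Invented registry-style data, echoing the registrar use case.
store = AttributeStore()
store.put("example.org", status="active", registrar="acme")
store.put("example.net", status="expired", registrar="acme")
```

`store.query(registrar="acme")` finds both domains; `store.query(status="active")` narrows to one - which is about all the search power a registry lookup needs.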
Everybody knows that if they don't control the domain where their blog lives, or the domain where their email lives, they have no control. Which of course is as true for IM and for the up-and-coming market for cross-media presence as it is for email and blogs. Some combination of Atom/APP and Jabber (with SMS, RSS, etc. gateways) should solve this issue - but it needs to be packaged and included in WordPress and Movable Type, and hosted by commonly available webhosts, simply part of the standard makeup of a webhost. Including the option to download a usable archive of everything in one file.
The next world's biggest scientific computer will yield 500 teraflops and consume 3 MW of power, which comes out at approximately 6 kW per TFlop. That is much worse than the IBM BlueGene (if they delivered the performance envisioned). We need a Moore's law for power per operation, but it seems we're going nowhere as far as power use is concerned.
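The 6 kW per TFlop figure follows directly from the two quoted numbers:

```python
# Power efficiency of the announced machine: 3 MW for 500 TFlops.
power_watts = 3e6                       # 3 MW
teraflops = 500
watts_per_tflop = power_watts / teraflops
# 6,000 W, i.e. 6 kW per TFlop, as stated above
```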
Seagate is set to start shipping 1TB disks. For those keeping score, that's about three times the disk space in the first public edition of Google. Note: I'm pretty sure the bandwidth to the disk is not keeping up - disks like this new 1TB disk get their relevance from large files, not lots of files.
(Bonus: The 1998 Salon story where I first heard of Google)
(Bonus II: Carrying one of these home from the office would provide a sneakernet bandwidth of approx 4 Gbit/s)
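The sneakernet figure works out if you assume a commute of roughly half an hour (the trip length is my assumption; the 1TB capacity is from the announcement):

```python
# Sneakernet bandwidth of carrying a 1TB disk home, assuming an
# (invented) ~33 minute trip from office to home.
disk_bits = 1e12 * 8                 # 1 TB expressed in bits
trip_seconds = 33 * 60               # assumed commute length
gbit_per_s = disk_bits / trip_seconds / 1e9
# about 4 Gbit/s, matching the bonus claim above
```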
Facebook's memcached cluster has over 3TB of RAM on 200 boxes. Obviously if you did a full memory count for Google's entire cluster you would reach enormous numbers, but still - as single installations of addressable RAM go, 3TB is a big number.
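Spreading keys over 200 cache boxes is typically the client library's job. One common technique is consistent hashing, sketched below - this is an illustration of the general idea, not Facebook's actual code or memcached's own implementation.

```python
import hashlib
from bisect import bisect

# Consistent hashing sketch: each box gets many points on a hash
# ring; a key belongs to the first box point at or after the key's
# hash. Adding or removing a box then only remaps a small fraction
# of the keys, unlike naive modulo hashing.
class HashRing:
    def __init__(self, nodes, replicas=100):
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(replicas))
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise from the key's hash to the next box point."""
        idx = bisect(self.points, self._hash(key)) % len(self.points)
        return self.ring[idx][1]

# 200 invented box names, echoing the cluster size quoted above.
ring = HashRing([f"cache{n:03d}" for n in range(200)])
```

Every key deterministically lands on one of the 200 boxes, and the same key always lands on the same box - which is what lets independent web servers agree on where a cached value lives.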
How convenient for the previous post that DBpedia just launched. DBpedia is a full-on RDF-based semantic reworking of the knowledge in Wikipedia. That is interesting. And yes, there's a TON of knowledge I have been looking around Wikipedia for but lacked the query tools to find. The debate on the quality of Wikipedia is a separate debate. For now, let's see Britannica answer this one with their closed model - have they got a product with similarly rich semantics?
P.S. For a less research oriented and more consumer oriented approach to Wikipedia browsing, see Wikiseek.
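The kind of query RDF enables is easy to illustrate with a toy triple store: facts as (subject, predicate, object) triples, queried by pattern matching and joined on shared subjects. The facts below are invented stand-ins; DBpedia holds millions of real ones.

```python
# Toy triple store illustrating RDF-style querying. The data is
# a made-up miniature; the point is the query shape, which is
# what SPARQL provides over DBpedia at scale.
triples = [
    ("Copenhagen", "country", "Denmark"),
    ("Copenhagen", "type", "City"),
    ("Aarhus", "country", "Denmark"),
    ("Aarhus", "type", "City"),
]

def match(pattern):
    """Return triples fitting the pattern; None is a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "All cities in Denmark" is a join of two patterns on the subject:
cities = {s for s, _, _ in match((None, "type", "City"))}
danish = {s for s, _, _ in match((None, "country", "Denmark"))}
```

Intersecting `cities` and `danish` answers the joined question - exactly the sort of structured query plain Wikipedia search can't do.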
Pretty good call to arms here - except it's maybe a little misnamed, since it's mainly a talk about a technological fix for open search and not so much about the economics of search. I too would find it profoundly cool if all the infrastructure to build personal micro-Googles were just standard on webhosting accounts (it almost is, actually) and if these accounts provided seamless scalability (not so much yet, certainly not without thinking on the part of the webhostee) - but we would still have the problem of connecting the search to an audience, and huge problems with a fragmented question space. The two are the same, of course. I added some quick notes on this over on the Wiki.