We just ended the founding meeting of a Machine Learning Study Group here in Copenhagen. We're planning to meet approximately once a month to discuss machine learning algorithms. Between meetings there's a Facebook group where we hang out and discuss our current pet machine learning problems, and a Google Docs repository of machine learning resources (still to come).
Here's how we're going to run the group: the goal is to get smarter about machine learning. We're narrowing down on a couple of stacks we care about. Big data using databases - we're going to have a look at MADlib for PostgreSQL, or something based on Map/Reduce. Some of us occasionally work in Processing using OpenCV - which is neat, because it's also a good choice on the iPhone, in Xcode, and in openFrameworks.
Some of us are probably going to be looking at Microsoft Solver Foundation in C# at some point. Some of us might be looking at implementing algorithms for one of these environments if no one else has.
We'll be building out our knowledge of configuring and setting up these stacks. The idea is for everyone to work on a pet problem with a machine learning element. We'll share our ideas on how to proceed, and how to approach the software stack.
The typical approach to a machine learning problem involves collecting data, extracting features, training a model, and evaluating how well it performs.
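As a toy illustration of such a pipeline (the data is made up, and a one-nearest-neighbour rule stands in for a real model):

```ruby
# A toy end-to-end pipeline: data -> features -> model -> evaluation.
# The data and the 1-nearest-neighbour "model" are illustrative stand-ins.

# 1. Collect data: points labelled by which axis they sit closer to.
training = [
  [[1.0, 0.1], :x_axis], [[2.0, 0.3], :x_axis],
  [[0.2, 1.5], :y_axis], [[0.1, 2.2], :y_axis]
]

# 2. "Train": 1-NN simply memorises the training set.
# 3. Predict: return the label of the nearest training example.
def classify(training, point)
  training.min_by { |feats, _label|
    feats.zip(point).sum { |a, b| (a - b)**2 }   # squared Euclidean distance
  }.last
end

# 4. Evaluate on held-out examples.
test = [[[3.0, 0.2], :x_axis], [[0.3, 3.0], :y_axis]]
correct = test.count { |feats, label| classify(training, feats) == label }
puts "accuracy: #{correct.to_f / test.size}"
```

Each real project will swap in its own data, features and model, but the shape of the loop stays the same.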
Just to give a little landscape: the founding members are primarily interested in the following problem spaces: image analysis - with a little time series analysis of sensor data thrown in for interactive fun - and "big data", running machine learning algorithms on web-scale data.
If you're into this problem space, and this style of working - project-driven, collaborative and conversational - feel free to join. If there's simply something you'd like to be able to do with a computer, feel free to join as well - maybe one of the more technical hangers-on will take an interest in your idea and start working on it.
Mike Migurski is doing something quite interesting: dumping the database engine as a middleman and just serving DB indexes directly over HTTP. Fullest, but preliminary, detail here. Let's let Mike explain:
There's a short list of reasons to do this:
- A "database" that offers nothing but static file downloads will likely be more scalable than one that needs to do work internally. This architecture is even more shared-nothing than systems with multiple database slaves.
- Not needing a running process to serve requests makes publishing less of a headache.
- I'm using Amazon Web Services to do the hosting, and their pricing plans make it clear that bandwidth and storage are cheap, while processing is expensive. Indexes served over HTTP optimize for the former and make the latter unnecessary. It's interesting to note that the forthcoming S3 pricing change is geared toward encouraging chunkier blocks of data.
- The particular data involved is well-suited to this method. A lot of current web services are optimized for heavy reads and infrequent writes. Often, they use a MySQL master/slave setup where the occasional write happens on one master database server, and a small army of slaves along with liberal use of caching makes it possible for large numbers of concurrent users to read. Here, we've got infrequently-updated information from a single source, and no user input whatsoever. It makes sense for the expensive processing of uploading and indexing to happen in one place, about once per day.
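The mechanics can be sketched with a fixed-width, sorted index searched by binary search, reading only a small byte range per probe. A local temp file stands in here for what would be HTTP Range requests against a static file on S3; the record layout and names are my invention, not Migurski's:

```ruby
require 'tempfile'

RECORD = 24  # fixed-width records: 8-byte key, 16-byte value

# Build a sorted, fixed-width index file, as the once-a-day publisher would.
def write_index(io, pairs)
  pairs.sort.each { |k, v| io.write(k.ljust(8) + v.ljust(16)) }
  io.flush
end

# Binary search, fetching one record per probe. Swap the seek/read pair
# for an HTTP Range request and this runs against a static file on S3,
# with no database process anywhere.
def lookup(io, size, key)
  lo, hi = 0, size / RECORD - 1
  while lo <= hi
    mid = (lo + hi) / 2
    io.seek(mid * RECORD)
    rec = io.read(RECORD)
    k, v = rec[0, 8].rstrip, rec[8, 16].rstrip
    return v if k == key
    if k < key then lo = mid + 1 else hi = mid - 1 end
  end
  nil
end

index = Tempfile.new('idx')
write_index(index, [['oakland',  '37.80,-122.27'],
                    ['berkeley', '37.87,-122.27'],
                    ['alameda',  '37.77,-122.24']])
puts lookup(index, index.size, 'oakland')
```

A lookup touches O(log n) records, and every byte served is a dumb static read - exactly the work profile the cheap-bandwidth, expensive-CPU pricing rewards.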
Brilliant joke programming language article from Wikipedia: HQ9+ does everything you need to do for your first programming (or recursive logic) class with an absolutely minimal footprint.
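The joke is that the whole language fits in one case statement - H prints "Hello, world!", Q prints the program's own source, 9 prints the "99 Bottles of Beer" lyrics, and + increments an accumulator nothing can read. A minimal interpreter:

```ruby
# A complete HQ9+ interpreter.
def hq9plus(src)
  acc = 0
  out = ''
  src.each_char do |c|
    case c.upcase
    when 'H' then out << "Hello, world!\n"
    when 'Q' then out << src << "\n"    # the quine command: print the program itself
    when '9'
      99.downto(1) do |n|
        out << "#{n} bottles of beer on the wall, #{n} bottles of beer.\n"
        out << "Take one down and pass it around, #{n - 1} bottles of beer on the wall.\n"
      end
    when '+' then acc += 1              # increments the accumulator, which is write-only
    end
  end
  out
end

puts hq9plus('HQ')
```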
Bonus brilliance: the creator of HQ9+, Cliff Biffle, also created the Beatnik language:
A Beatnik program consists of any sequence of English words, separated by any sort of punctuation from spaces to hyphens to blank pages. Thus, "Hello, aunts! Swim around brains!" is a valid Beatnik program, despite not making much sense.
(If you're wondering, that reads a character from the user, adds seven to it [i.e. A -> H], and prints it out.)
The function of a particular word--say, brains, or aunts--is determined by the score one would receive for playing that word in Scrabble. Thus, "hello" gets us 8 points, and so on.
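The scoring itself is just the standard English Scrabble letter values summed over the word - a quick sketch to check the "hello" claim:

```ruby
# Standard English Scrabble letter values.
SCRABBLE = {
  1 => 'aeilnorstu', 2 => 'dg', 3 => 'bcmp', 4 => 'fhvwy',
  5 => 'k', 8 => 'jx', 10 => 'qz'
}.flat_map { |pts, letters| letters.chars.map { |l| [l, pts] } }.to_h

# Score a word the way Beatnik does; non-letters score nothing.
def beatnik_score(word)
  word.downcase.chars.sum { |c| SCRABBLE.fetch(c, 0) }
end

puts beatnik_score('hello')   # 8, as quoted above
```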
[...]
By this point, you're probably wondering why it's called Beatnik. Well, you're about to find out. Here is the source for a program that simply prints "Hi" to the screen.
Baa, badassed areas!
Jarheads' arses
queasy nude adverbs!
Dare address abase adder? *bares baser dadas* HA!
Equalize, add bezique, bra emblaze.
He (quezal), aeons liable. Label lilac "bulla," ocean sauce!
Ends, addends,
duodena sounded amends.
See?
The trends reported here are really, really important.
I recently outfitted my small-format laptop with Haskell, simply because the software neighbourhood I currently inhabit isn't beautiful or powerful enough. The small machine isn't good for text-heavy "lots and lots of writing" software anyway, but it's much better for single-focus quality time with something beautiful and brief. Brevity and reach in a language are extremely important.
My personal guide to these kinds of things has been Lambda the Ultimate for a while - as well as the clever Perl hackers Torkington mentions in the story above.
Extra special comp sci bonus: Abelson and Sussman's video lectures covering their own textbook, Structure and Interpretation of Computer Programs.
Other instructors, same material, from Archive.org and ArsDigita.
At my work we found this great but little-known project called Sequoia. It's an open source project that can increase the performance and uptime of your database system. The remarkable thing about Sequoia is that you can use it without changing your application or database schema, and you can use it with any database server you like. It almost sounds too good.
The software implements a RAID-like solution at the database connector level. In other words, the thing you change in your existing system is the database connector (JDBC, ODBC, or whatever you use). The Sequoia connector will RAIDb your databases across several servers, a bit like RAID controllers do with disks. If one of the nodes behind a RAIDb-1 controller fails, the Sequoia connector keeps a transaction log that you can replay on the node when you get it back online.
In theory, Sequoia and enough hardware will let you scale your database to any level of redundancy and performance.
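The RAIDb-1 idea itself - mirror every write to all nodes, balance reads across them, and keep a log to replay onto a recovered node - is simple enough to sketch. This is my own illustration with hashes standing in for database connections, not Sequoia's actual API:

```ruby
# A toy RAIDb-1 mirror: writes go to every backend and into a recovery log;
# reads are load-balanced across backends. Backends here are plain hashes
# standing in for real database connections.
class Raidb1
  def initialize(backends)
    @backends = backends
    @log = []            # replayed onto a node when it comes back online
  end

  def write(key, value)
    @log << [key, value]
    @backends.each { |b| b[key] = value }   # mirror the write everywhere
  end

  def read(key)
    @backends.sample[key]                   # any mirror can serve a read
  end

  def recover(node)
    @log.each { |k, v| node[k] = v }        # replay the transaction log
  end
end

cluster = Raidb1.new([{}, {}])
cluster.write('user:1', 'alice')
puts cluster.read('user:1')

late_node = {}            # a node that was offline during the write
cluster.recover(late_node)
puts late_node['user:1']
```

The point of putting this in the connector is exactly what the docs claim: the application above it just sees an ordinary read/write interface.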
Disclaimer: I have not tried the Sequoia software yet. All the facts above are from their own documentation. I will post my test results later.
I worry when my computer languages exhibit the following kind of behaviour:
This example fails! The block variable, far from being a local parameter of the block, actually exists in the same scope as the array a. To evaluate the block, the interpreter assigns to the variable a, destroying the array.
# example 1 does not work
a = [2,3,4,5]
puts a.max {|a,b| a <=> b}
puts a.max {|a,b| a <=> b} #runtime error - a is now an integer
# example 2 works
def max(a)
  a.max {|a,b| a <=> b}
end

a = [2,3,4,5]
puts(max(a))
puts(max(a))
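A third fix, under the same scoping rules, is simply to stop reusing the outer variable's name as a block parameter:

```ruby
# example 3 works: block parameters that don't shadow the outer variable
a = [2,3,4,5]
puts a.max {|x,y| x <=> y}
puts a.max {|x,y| x <=> y}   # a is still the array
```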
Bug or quirk? I think bug.
"I started keeping a list of these annoyances but it got too long and depressing so I just learned to live with them again. We really are using a 1970s era operating system well past its sell-by date. We get a lot done, and we have fun, but let's face it, the fundamental design of Unix is older than many of the readers of Slashdot, while lots of different, great ideas about computing and networks have been developed in the last 30 years. Using Unix is the computing equivalent of listening only to music by David Cassidy."
He's right and he is wrong. It seems entirely likely to me that we'll find, further down the road, that software works much like genetic development in nature. Nature never throws out old designs. In fact most of our basic human design is the same as the basic design in fish and plants and bacteria, and it hasn't changed in billions of years. However, the interest, the competitive edge, moves away from the old designs once they win and on to greater things. So I'm not sure we'll ever have new file systems, or new anything, really. I find it entirely likely that inside the massively parallel billion-CPU-core machine of 2050 we'll find a million Linux 2.6 cores with ext3 filesystems...
I think we can already see this as OSes get commoditized and the interest moves from scaling up to scaling out. Scaling out is a developer's way of saying "I'm not going to fix the I/O DNA or the process DNA of computing, I'll just add sophistication on top".
The only real reason this isn't truly plausible on a 200 year scale is energy consumption. It's quite possible that in a truly parallelized world we'd really much rather have a much simpler operating system able to function on much less power, but robust and distributable.
[UPDATE: I should have read the whole thing, and a minimum of stuff about Plan 9, which answers some of the questions - but the failure of Plan 9 to catch on underscores the point, and it's clear from the interview that Pike is aware of this]
The question that then comes to mind: suppose we wanted to build the multi-concurrent, internet-ready super machine of the future, programmed entirely in a fantastic functional language able to hide complexity and concurrency in an efficient way - what would we keep around?
Some ideas on design points:
(I think I need to start a blog specifically for spaced out posts)
Eye-opening (well, I'm unsure if it is - the failure of acceleration it points out has been apparent for a while) piece on a fundamental sea change in computing technology, forced by the breakdown of the previously available "free lunch" of exponential hardware improvement.
Improvements in dealing with concurrency (from functional programming come tons of ways to do concurrency without thinking explicitly about threads) are definitely something to watch.
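One small illustration of that style, hiding threads behind an ordinary map interface (the helper name is mine, not a standard library function):

```ruby
# A parallel map: callers write an ordinary map over a collection and never
# touch a Thread object directly. Each element is processed concurrently,
# and Thread#value joins the thread and returns its block's result.
def pmap(enum, &block)
  enum.map { |x| Thread.new { block.call(x) } }.map(&:value)
end

puts pmap([1, 2, 3]) { |n| n * n }.inspect
```

The caller's mental model stays "map a function over a collection"; the threads are an implementation detail, which is exactly the promise of the functional approaches.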
The benefits, by the way, are already appreciable, as concurrency is already a design problem to be reckoned with in distributed computing - and with everything moving to the web, who isn't doing distributed computing projects?
For the ultimate in concurrency we need to go to quantum computing of course.
Here's a page I've been looking for for a long time: the "what's feature X from language Y called in language Z" page. This is a real lifesaver when you're trying to learn your 15th language and you just bloody want to convert that bloody integer to a string in your bloody hello world program.