We just ended the founding meeting of a Machine Learning Study Group here in Copenhagen. We're planning to meet approximately once a month to discuss machine learning algorithms. Between meetings there's a Facebook group where we hang out and discuss our current pet machine learning problems, and a Google Docs repository of machine learning resources (still to come).
Here's how we're going to run the group: the goal is to get smarter about machine learning. We're narrowing down on a couple of stacks we care about. Big data using databases - we're going to have a look at MADlib for PostgreSQL, or something based on Map/Reduce. Some of us occasionally work in Processing using OpenCV - which is neat, because it's also a good choice on the iPhone, in Xcode, and in openFrameworks.
Some of us are probably going to be looking at Microsoft Solver Foundation in C# at some point. Some of us might be looking at implementing algorithms for one of these environments if no one else has.
We'll be building out our knowledge of configuring and setting up these stacks. The idea is for everyone to work on a pet problem with a machine learning element. We'll share our ideas on how to proceed, and how to approach the software stack.
The typical approach to a machine learning problem involves collecting data, extracting features, training a model, and evaluating how well it performs.
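As a toy illustration of such a pipeline (the data is made up, and a one-nearest-neighbour rule stands in for a real model):

```ruby
# A toy end-to-end pipeline: data -> features -> model -> evaluation.
# The data and the 1-nearest-neighbour "model" are illustrative stand-ins.

# 1. Collect data: points labelled by which axis they sit closer to.
training = [
  [[1.0, 0.1], :x_axis], [[2.0, 0.3], :x_axis],
  [[0.2, 1.5], :y_axis], [[0.1, 2.2], :y_axis]
]

# 2. "Train": 1-NN simply memorises the training set.
# 3. Predict: return the label of the nearest training example.
def classify(training, point)
  training.min_by { |feats, _label|
    feats.zip(point).sum { |a, b| (a - b)**2 }   # squared Euclidean distance
  }.last
end

# 4. Evaluate on held-out examples.
test = [[[3.0, 0.2], :x_axis], [[0.3, 3.0], :y_axis]]
correct = test.count { |feats, label| classify(training, feats) == label }
puts "accuracy: #{correct.to_f / test.size}"
```

Each real project will swap in its own data, features and model, but the shape of the loop stays the same.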
Just to give a little landscape: the founding members are primarily interested in the following problem spaces: image analysis - with a little time series analysis of sensor data thrown in for interactive fun - and "big data", running machine learning algorithms on web-scale data.
If you're into this problem space, and this style of working - project-driven, collaborative and conversational - feel free to join. If there's simply something you'd like to be able to do with a computer, feel free to join as well - maybe one of the more technical hangers-on will take an interest in your idea and start working on it.
Mike Migurski is doing something quite interesting: dumping the database engine as a middleman and just serving DB indexes directly over HTTP. Fullest, but preliminary, detail here. Let's let Mike explain:
There's a short list of reasons to do this:
- A "database" that offers nothing but static file downloads will likely be more scalable than one that needs to do work internally. This architecture is even more shared-nothing than systems with multiple database slaves.
- Not needing a running process to serve requests makes publishing less of a headache.
- I'm using Amazon Web Services to do the hosting, and their pricing plans make it clear that bandwidth and storage are cheap, while processing is expensive. Indexes served over HTTP optimize for the former and make the latter unnecessary. It's interesting to note that the forthcoming S3 pricing change is geared toward encouraging chunkier blocks of data.
- The particular data involved is well-suited to this method. A lot of current web services are optimized for heavy reads and infrequent writes. Often, they use a MySQL master/slave setup where the occasional write happens on one master database server, and a small army of slaves along with liberal use of caching makes it possible for large numbers of concurrent users to read. Here, we've got infrequently-updated information from a single source, and no user input whatsoever. It makes sense for the expensive processing of uploading and indexing to happen in one place, about once per day.
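The mechanics can be sketched with a fixed-width, sorted index searched by binary search, reading only a small byte range per probe. A local temp file stands in here for what would be HTTP Range requests against a static file on S3; the record layout and names are my invention, not Migurski's:

```ruby
require 'tempfile'

RECORD = 24  # fixed-width records: 8-byte key, 16-byte value

# Build a sorted, fixed-width index file, as the once-a-day publisher would.
def write_index(io, pairs)
  pairs.sort.each { |k, v| io.write(k.ljust(8) + v.ljust(16)) }
  io.flush
end

# Binary search, fetching one record per probe. Swap the seek/read pair
# for an HTTP Range request and this runs against a static file on S3,
# with no database process anywhere.
def lookup(io, size, key)
  lo, hi = 0, size / RECORD - 1
  while lo <= hi
    mid = (lo + hi) / 2
    io.seek(mid * RECORD)
    rec = io.read(RECORD)
    k, v = rec[0, 8].rstrip, rec[8, 16].rstrip
    return v if k == key
    if k < key then lo = mid + 1 else hi = mid - 1 end
  end
  nil
end

index = Tempfile.new('idx')
write_index(index, [['oakland',  '37.80,-122.27'],
                    ['berkeley', '37.87,-122.27'],
                    ['alameda',  '37.77,-122.24']])
puts lookup(index, index.size, 'oakland')
```

A lookup touches O(log n) records, and every byte served is a dumb static read - exactly the work profile the cheap-bandwidth, expensive-CPU pricing rewards.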
Brilliant joke programming language article from Wikipedia: HQ9+ does everything you need to do for your first programming (or recursive logic) class with an absolutely minimal footprint.
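The joke is that the whole language fits in one case statement - H prints "Hello, world!", Q prints the program's own source, 9 prints the "99 Bottles of Beer" lyrics, and + increments an accumulator nothing can read. A minimal interpreter:

```ruby
# A complete HQ9+ interpreter.
def hq9plus(src)
  acc = 0
  out = ''
  src.each_char do |c|
    case c.upcase
    when 'H' then out << "Hello, world!\n"
    when 'Q' then out << src << "\n"    # the quine command: print the program itself
    when '9'
      99.downto(1) do |n|
        out << "#{n} bottles of beer on the wall, #{n} bottles of beer.\n"
        out << "Take one down and pass it around, #{n - 1} bottles of beer on the wall.\n"
      end
    when '+' then acc += 1              # increments the accumulator, which is write-only
    end
  end
  out
end

puts hq9plus('HQ')
```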
Bonus brilliance: the creator of HQ9+, Cliff Biffle, also created the Beatnik language:
A Beatnik program consists of any sequence of English words, separated by any sort of punctuation from spaces to hyphens to blank pages. Thus, "Hello, aunts! Swim around brains!" is a valid Beatnik program, despite not making much sense.
(If you're wondering, that reads a character from the user, adds seven to it [i.e. A -> H], and prints it out.)
The function of a particular word--say, brains, or aunts--is determined by the score one would receive for playing that word in Scrabble. Thus, "hello" gets us 8 points, and so on.
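The scoring itself is just the standard English Scrabble letter values summed over the word - a quick sketch to check the "hello" claim:

```ruby
# Standard English Scrabble letter values.
SCRABBLE = {
  1 => 'aeilnorstu', 2 => 'dg', 3 => 'bcmp', 4 => 'fhvwy',
  5 => 'k', 8 => 'jx', 10 => 'qz'
}.flat_map { |pts, letters| letters.chars.map { |l| [l, pts] } }.to_h

# Score a word the way Beatnik does; non-letters score nothing.
def beatnik_score(word)
  word.downcase.chars.sum { |c| SCRABBLE.fetch(c, 0) }
end

puts beatnik_score('hello')   # 8, as quoted above
```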
[...]
By this point, you're probably wondering why it's called Beatnik. Well, you're about to find out. Here is the source for a program that simply prints "Hi" to the screen.
Baa, badassed areas!
Jarheads' arses
queasy nude adverbs!
Dare address abase adder? *bares baser dadas* HA!
Equalize, add bezique, bra emblaze.
He (quezal), aeons liable. Label lilac "bulla," ocean sauce!
Ends, addends,
duodena sounded amends.
See?
The trends reported here are really, really important.
I recently outfitted my small-format laptop with Haskell, simply because the software neighbourhood I currently inhabit isn't beautiful or powerful enough. The small machine isn't good for text-heavy "lots and lots of writing" software anyway, but it's much better for single-focus quality time with something beautiful and brief. Brevity and reach in a language are extremely important.
My personal guide to these kinds of things has been Lambda the Ultimate for a while - as well as the clever Perl hackers Torkington mentions in the story above.
Extra special comp sci bonus: Abelson and Sussman's video lectures covering their own textbook, Structure and Interpretation of Computer Programs.
Other instructors, same material, from Archive.org and ArsDigita.
At my work we found this great but little-known project called Sequoia. It's an open source project that can increase the performance and uptime of your database system. The remarkable thing about Sequoia is that you can use it without changing your application or database schema, and you can use it with any database server you like. It almost sounds too good.
The software implements a RAID-like solution at the database connector level. In other words, the thing you change in your existing system is the database connector (JDBC, ODBC, or whatever you use). The Sequoia connector will RAIDb your databases across several servers, a bit like RAID controllers do with disks. If one of the nodes behind a RAIDb-1 controller fails, the Sequoia connector keeps a transaction log that you can replay on the node when you get it back online.
In theory, Sequoia and enough hardware will let you scale your database to any level of redundancy and performance.
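The RAIDb-1 idea itself - mirror every write to all nodes, balance reads across them, and keep a log to replay onto a recovered node - is simple enough to sketch. This is my own illustration with hashes standing in for database connections, not Sequoia's actual API:

```ruby
# A toy RAIDb-1 mirror: writes go to every backend and into a recovery log;
# reads are load-balanced across backends. Backends here are plain hashes
# standing in for real database connections.
class Raidb1
  def initialize(backends)
    @backends = backends
    @log = []            # replayed onto a node when it comes back online
  end

  def write(key, value)
    @log << [key, value]
    @backends.each { |b| b[key] = value }   # mirror the write everywhere
  end

  def read(key)
    @backends.sample[key]                   # any mirror can serve a read
  end

  def recover(node)
    @log.each { |k, v| node[k] = v }        # replay the transaction log
  end
end

cluster = Raidb1.new([{}, {}])
cluster.write('user:1', 'alice')
puts cluster.read('user:1')

late_node = {}            # a node that was offline during the write
cluster.recover(late_node)
puts late_node['user:1']
```

The point of putting this in the connector is exactly what the docs claim: the application above it just sees an ordinary read/write interface.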
Disclaimer: I have not tried the Sequoia software yet. All the facts above are from their own documentation. I will post my test results later.
I worry when my computer languages exhibit the following kind of behaviour:
This example fails! The block variable, far from being a local parameter of the block, actually exists in the same scope as the array a. To evaluate the block, the interpreter assigns to the variable a, destroying the array.
# example 1 does not work
a = [2,3,4,5]
puts a.max {|a,b| a <=> b}
puts a.max {|a,b| a <=> b} #runtime error - a is now an integer
# example 2 works
def max(a)
  a.max {|a,b| a <=> b}
end

a = [2,3,4,5]
puts(max(a))
puts(max(a))
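A third fix, under the same scoping rules, is simply to stop reusing the outer variable's name as a block parameter:

```ruby
# example 3 works: block parameters that don't shadow the outer variable
a = [2,3,4,5]
puts a.max {|x,y| x <=> y}
puts a.max {|x,y| x <=> y}   # a is still the array
```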
Bug or quirk? I think bug.
"I started keeping a list of these annoyances but it got too long and depressing so I just learned to live with them again. We really are using a 1970s era operating system well past its sell-by date. We get a lot done, and we have fun, but let's face it, the fundamental design of Unix is older than many of the readers of Slashdot, while lots of different, great ideas about computing and networks have been developed in the last 30 years. Using Unix is the computing equivalent of listening only to music by David Cassidy."
He's right and he is wrong. It seems entirely likely to me that we'll find, further down the road, that software works much like genetic development in nature. Nature never throws out old designs. In fact most of our basic human design is the same as the basic design in fish and plants and bacteria, and it hasn't changed in billions of years. However, the interest, the competitive edge, moves away from the old designs once they win and on to greater things. So I'm not sure we'll ever have new file systems, or new anything, really. I find it entirely likely that inside the massively parallel billion-CPU-core machine of 2050 we'll find a million Linux 2.6 cores with ext3 filesystems...
I think we can already see this as OSes get commoditized and the interest moves from scaling up to scaling out. Scaling out is a developer's way of saying "I'm not going to fix the I/O DNA or the process DNA of computing, I'll just add sophistication on top".
The only real reason this isn't truly plausible on a 200 year scale is energy consumption. It's quite possible that in a truly parallelized world we'd really much rather have a much simpler operating system able to function on much less power, but robust and distributable.
[UPDATE: I should have read the whole thing, and a minimum of stuff about Plan 9, which answers some of the questions - but the failure of Plan 9 to catch on underscores the point, and it's clear from the interview that Pike is aware of this]
The question that then comes to mind: suppose we wanted to build the multi-concurrent, internet-ready super machine of the future, programmed entirely in a fantastic functional language able to hide complexity and concurrency in an efficient way - what would we keep around?
Some ideas on design points:
(I think I need to start a blog specifically for spaced out posts)
Eye-opening (well, I'm unsure if it is - the failure of acceleration it points out has been apparent for a while) piece on a fundamental sea change in computing technology, forced by the breakdown of the previously available "free lunch" of exponential hardware improvement.
Improvements in dealing with concurrency (from functional programming come tons of ways to do concurrency without thinking explicitly about threads) are definitely something to watch.
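One small illustration of that style, hiding threads behind an ordinary map interface (the helper name is mine, not a standard library function):

```ruby
# A parallel map: callers write an ordinary map over a collection and never
# touch a Thread object directly. Each element is processed concurrently,
# and Thread#value joins the thread and returns its block's result.
def pmap(enum, &block)
  enum.map { |x| Thread.new { block.call(x) } }.map(&:value)
end

puts pmap([1, 2, 3]) { |n| n * n }.inspect
```

The caller's mental model stays "map a function over a collection"; the threads are an implementation detail, which is exactly the promise of the functional approaches.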
The benefits, by the way, are already appreciable, as concurrency is already a design problem to be reckoned with in distributed computing - and with everything moving to the web, who isn't doing distributed computing projects?
For the ultimate in concurrency we need to go to quantum computing of course.
Here's a page I've been looking for for a long time: the "what's feature X from language Y called in language Z" page. This is a real lifesaver when you're trying to learn your 15th language and you just bloody want to convert that bloody integer to a string in your bloody hello world program.