Week 10 moved on from classification to clustering. Although the methods discussed were new, they are conceptually close to topics covered in Natural Computation. Again, Euclidean distance serves as a fundamental measure of similarity (or dissimilarity).
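As a quick refresher for myself (my own sketch, not course material), the Euclidean distance between two feature vectors is just the square root of the summed squared differences:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean([0, 0], [3, 4]))  # 5.0, the classic 3-4-5 triangle
```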
The first method introduced was Hierarchical Clustering. This introduction was very brief, and reference to the text will be needed for issues such as linkage criteria.
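To fix the idea before I get to the reading, here is a rough sketch of my own (not from the lecture) of agglomerative hierarchical clustering with single linkage, i.e. the distance between two clusters is taken as the distance between their closest members. Other linkages (complete, average, Ward) differ only in how that between-cluster distance is defined:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, k):
    """Agglomerative clustering, single linkage: start with each point
    in its own cluster, then repeatedly merge the pair of clusters
    whose closest members are nearest, until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(single_linkage(pts, 2))  # [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

Cutting the merge process at different values of k yields the different levels of the dendrogram.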
The next method was K-Means clustering.
I find the limitation of assuming the number of clusters (K) in advance close to invalidating this methodology in its basic form. Of course, the algorithm can be extended to an exhaustive or stochastic search where multiple values of K are compared and contrasted. The idea of clustering is to simplify data sets, in essence reducing dimensionality. With this in mind, any extended K-means algorithm must penalise the number of clusters; otherwise the best clustering would always be K = the number of unique instances. MML, MDL and BIC are examples of criteria that incorporate these penalties. Interestingly, I came across MDL when looking for an effective method for discretizing continuous variables. It now seems obvious that discretization is a form of clustering, where there likewise need to be penalties for an increasing number of clusters. For more information on using MDL to discretize continuous variables see:
Fayyad, U., Irani, K., 1993, Multi-interval discretization of continuous-valued attributes for classification learning, Thirteenth International Joint Conference on Artificial Intelligence, 1022-
Interestingly, Usama Fayyad is now Chief Data Officer and Executive Vice President, Yahoo! Inc… for next time anyone says research in this field is pointless for a career.
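To make the penalised search over K concrete, here is a rough sketch of my own (not course material): basic K-means with random restarts, scored across candidate values of K by a simplified BIC-style criterion. The exact MML/MDL/BIC formulations differ; the point is only that the fit term improves as K grows while the penalty term makes each extra cluster pay for itself:

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Component-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n
                 for d in range(len(cluster[0])))

def kmeans(points, k, iters=20, seed=0):
    """Basic K-means (Lloyd's algorithm). Returns (centroids, SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i]  # update step
                     for i, c in enumerate(clusters)]
    sse = sum(sq_dist(p, centroids[i])
              for i, c in enumerate(clusters) for p in c)
    return centroids, sse

def penalised_score(points, k, sse):
    """BIC-style score (lower is better): a fit term that shrinks as
    clusters tighten, plus a complexity penalty that grows with k."""
    n, d = len(points), len(points[0])
    return n * math.log(sse / n + 1e-12) + k * d * math.log(n)

points = [(0, 0), (0.1, 0.2), (0.2, 0.1),
          (5, 5), (5.1, 5.2), (5.2, 4.9)]
scores = {}
for k in range(1, 5):
    # best of a few random restarts, since K-means is init-sensitive
    sse = min(kmeans(points, k, seed=s)[1] for s in range(5))
    scores[k] = penalised_score(points, k, sse)
best_k = min(scores, key=scores.get)
print(best_k)  # two well-separated blobs, so the search settles on 2
```

Without the `k * d * log(n)` penalty term the score would keep improving all the way to one cluster per unique point, which is exactly the degenerate outcome noted above.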
The lecture went on to introduce issues and algorithms that require a great deal of reading and writing to do justice (which I am yet to complete).