Supervised learning was covered in week 9. Happily, I found some overlap here with FIT5167 Natural Computation. The introduction defined machine learning as a parent of data mining. Supervised and unsupervised learning were skimmed over, along with definitions of what learning actually is (i.e. pattern classification, time series forecasting, clustering). The concept of splitting data into training and test sets was also brought up.
Getting into the points that were different from Natural Computation (neural networks): decision trees! Decision trees are a fairly simple concept, though constructing them can involve some calculation. For example, given a set of 8 attributes and one classifying attribute, which attribute do we split the tree on first when building the decision tree? The answer to this question can be found in Shannon's information theory. The mathematics for calculating information gain, as explained on the lecture slides, was not really intuitive for me. In the context of decision trees, the information gain of a split is the initial entropy (the 'unsortedness' of the values) minus the entropy after the data is split on that attribute. The split that subtracts the least (i.e. leaves the lowest entropy) therefore gives the largest 'information gain'. Put simply, if we have a bag with 20 hats, 10 white and 10 black, entropy equals 1. If splitting the bag on, say, brand left us with 2 bags, one with 10 black hats and one with 10 white hats, entropy would then equal 0. Everything in between can be calculated using the following:
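(For reference, the quantity being computed here is the standard Shannon entropy: for a set where class i occurs with probability p_i,

    H = -(p_1*log2(p_1) + p_2*log2(p_2) + ... + p_n*log2(p_n))

For the 20-hat bag that is -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1 bit, while a bag containing only one colour gives 0.)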
I found and slightly modified a Python implementation (http://mchost/sourcecode/fit5047/shannonsEntropyCalc.py):
    ## initial source: http://code.activestate.com/recipes/577476/
    import math
    from sets import Set

    tmp = raw_input("Enter a set:")
    st = []
    st.extend(tmp)  # input string
    print 'Input string:'
    print st

    stList = list(st)
    alphabet = list(Set(stList))  # list of symbols in the string
    print 'Alphabet of symbols in the string:'
    print alphabet

    # calculate the frequency of each symbol in the string
    freqList = []
    for symbol in alphabet:
        ctr = 0
        for sym in stList:
            if sym == symbol:
                ctr += 1
        freqList.append(float(ctr) / len(stList))
    print 'Frequencies of alphabet symbols:'
    print freqList

    # Shannon entropy: H = -sum(p * log2(p)) over the symbol frequencies
    ent = 0.0
    for freq in freqList:
        ent = ent + freq * math.log(freq, 2)
    ent = -ent
    print 'Shannon entropy:'
    print ent
    print 'Minimum number of bits required to encode each symbol:'
    print int(math.ceil(ent))
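To connect this back to decision trees, here is a quick sanity check of the hat example. This is my own sketch rather than anything from the unit, and the helper names (entropy, information_gain) are just illustrative:

    import math

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = float(len(labels))
        probs = [labels.count(c) / n for c in set(labels)]
        return -sum(p * math.log(p, 2) for p in probs)

    def information_gain(labels, groups):
        """Initial entropy minus the size-weighted entropy after a split."""
        n = float(len(labels))
        after = sum(len(g) / n * entropy(g) for g in groups)
        return entropy(labels) - after

    # 20 hats, 10 black and 10 white: entropy is 1 bit
    hats = ['black'] * 10 + ['white'] * 10
    print(entropy(hats))                           # 1.0

    # splitting on 'brand' happens to separate the colours perfectly,
    # so the weighted entropy after the split is 0 and the gain is 1
    split_on_brand = [['black'] * 10, ['white'] * 10]
    print(information_gain(hats, split_on_brand))  # 1.0

When building a tree, the same calculation is repeated for each candidate attribute and the one with the largest gain is chosen for the split.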