FIT5047 – Intelligent Systems Week 11

Week 11 moved into recommender systems, perhaps one of the most popular and commonly used forms of AI. Sites such as Google and Amazon built their success on the effectiveness of their recommender systems (now I guess their brands can carry them for a while). The first topic of the lecture was association mining, given a large dataset, how do we find useful associations between attributes.

Support and confidence were proposed as useful metrics to drive this process. Unfortunately we found some conflicting definitions among Data Mining, Weka and R&N texts. When in doubt check wikipedia..:

The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset.

supp(Z) = P(Z)

The confidence of a rule is defined – confidence

The lift of a rule is defined as – lift

The leverage of a rule is defined as –

leverage(X -> Y) = P(X and Y) – (P(X)P(Y))

The source listed in wikipedia for these definitions is:  http://michael.hahsler.net/research/association_rules/measures.html

In our lecture notes we had support for rule a -> b as the union of A and B, this confused me as I still think that support is the intersection of A and B.

The rules described above are quite intuitive when work through in an example. Lift feel like and extension of confidence taking into account independence. An increase in lift implies dependence.

No leverage implies independence between attributes and vice versa.

This topic was closed with the conclusion that it is in fact bad practice as variance and standard deviation are completely ignored.

A quick review of collaborative and content based filtering were covered next. Content Based Filtering [CBF] (haha) can be implemented using an array of machine learning  techniques already covered. Naive Bayes, Neural Networks and Decision trees are classification methods that can be applied to CBF. The pre-processing involved with CBF seems to be the most limiting factor. Term frequency and Inverse Document term frequency can be compile into tables allowing for effective searching. Considering the vast size of the data sets that these systems would be applied to, this can seem a bit daunting.

Collaborative filtering [CF] seems a bit easier to implement but does then rely on user participation. The introduction in the lecture felt very similar to the basics of Self Organising Maps. Vectors are created to represent instances (in  this case users). Euclidean distances (or some spin-off of this) are used to measure instance ‘likeness’ then missing values for instance vectors can be predicted based on the instances that are considered ‘like’. There was quite a bit of mathematical methodology described on the lecture slides which would be required when implementing a CF system.

Collaborative Filtering v Content Based Filtering
Collaborative Filtering v Content Based Filtering (soure: Week 11 lecture notes)
leverage(X -> Y) = P(X and Y) – (P(X)P(Y))

FIT5047 – Intelligent Systems Week 10

Week 10 moved on from classification to clustering. Although, conceptually, there was close relation to topics covered in Natural Computation the methods discussed were new. Again, Euclidean distance is a fundamental measure of similarity/uniquness.

The first method introduced was Heirarchical Clustering. This introduction was very bried and reference to the text would need to be made for issues such as linkages.

The next method was K-Means clustering.

 

kmeans
As cabn be seen with K = 3 we can move the center but the number of clusters is static

 

 

I find the limitation of assuming the number of clusters [K] to go close to invalidating this methodology in its basic form. Of course, however the algorithm can be extended to an exhaustive or stoichastic search were multiple K values are compared and contrasted. The idea of clustering is to simplify data sets, in essence reducing dimensianality. With this in mind there must be a penalty for extended K-means algorithms for the number of clusters. Otherwise the best clustering would always result in K = number of unique instances. MML, MDL and BIC are examples of algoriths that incorporate these penalities. Interestingly, I came across MDL when looking for effective method for discretizing continuous variables. It now seems obvious that discretization is a form of clustering where there need to be penalties for an increasing number of clusters. For more info on using MDL to discretize continuos variables see:

Fayyad, U., Irani, K., 1993, Multi-interval discretization of continuous valued attributes for
classification learning, Thirteenth International Joint Conference on Articial Intelligence, 1022-
1027

Interstingly Usama Fayyad is now Chief Data Officer and Executive Vice President, Yahoo! Inc… for next time anyone says research in this field is pointless for a career.

The lecture continued to introduce issues and algorithms which require a great deal of reading and writing to do justice (which I am yet to complete).

Chief Data Officer and Executive Vice President, Yahoo! Inc.

FIT5047 – Intelligent Systems Week 9

Supervised learning was covered in week 9. Happily I found that there was some overlap found here with FIT5167 Natural Computation. The introduction defined machine learning as a parent of data mining. Supervised and unsupervised learning were skimmed along with definitions of what learning actually is (ie: pattern classification/time series forecasting/clustering). The concept of splitting data for training and testing was also brought up.

Getting into the points that were different from Natural Computation (neural networks). Decision trees! Decision trees are a fairly simple concept. Constructing them can involve some calculation though. For example, give a set of 8 attributes and one classifying attribute. When building a decision tree, which attributes do we split the tree on first? The answer to this question can be found in Shannon’s information theory. Calculating information gain, when explained on the lecture slides displayed some mathematics that was not really intuitive for me. In the context of decision trees, determining the information gain can be calculated as the initial entropy (‘unsortedness’ of values) minus the entropy after the split of data on that attribute. The split that subtracted the lowest amount (having the lowest entropy) there for left the largest ‘information gain’. Simply is means that if we have back with 20 hats, 10 white, 10 black, entropy would equal 1.  If a split of the bag on, say brand, left us with 2 bags, 1 with 10 black hats and 1 with 10 white hats, entropy would then be equal to 0. Everything in between can be calculated using the following:

Entropy
The mathematical notation for calculating entropy

I found and slightly modified a Python implementation (http://mchost/sourcecode/fit5047/shannonsEntropyCalc.py):
[cc lang=”Python”]
## initial source: http://code.activestate.com/recipes/577476/ }}}
##
from sets import Set

tmp = raw_input(“Enter a set:”)
st = []
st.extend(tmp) # input string

print ‘Input string:’
print st
print
stList = list(st)
alphabet = list(Set(stList)) # list of symbols in the string
print ‘Alphabet of symbols in the string:’
print alphabet
print
# calculate the frequency of each symbol in the string
freqList = []
for symbol in alphabet:
ctr = 0
for sym in stList:
if sym == symbol:
ctr += 1
freqList.append(float(ctr) / len(stList))
print ‘Frequencies of alphabet symbols:’
print freqList
print
# Shannon entropy
ent = 0.0
for freq in freqList:
ent = ent + freq * math.log(freq, 2)
ent = -ent
print ‘Shannon entropy:’
print ent
print ‘Minimum number of bits required to encode each symbol:’
print int(math.ceil(ent))[/cc]

FIT5047 – Intelligent Systems Week 8

Intelligent decision support was the topic of week 8’s lecture. This is a topic that has been built on the ‘Fundamental Preference Assumption’ (given choices A, B either A > B, B> A or A ~ B). This topic is closely related to our previous lectures which centered around reasoning with uncertainty.

Rational preferences are a prerequisite for intelligent decisions. Characteristics of rational preferences are:

  • Orderability
  • Transitivity
  • Continuity
  • Substitutability
  • Monotonicity

Mapping of preferences that may not have a readily comparable outcome is achieved through utility values. We did not cover any material on the development of utility values. I believe that to increase the success rate of intelligent decision systems, collaboration between end users and implementors must be conducted. A perfect systems with poorly represented utility values will fail.

Principle of Maximum Expected Utility (MEU) – An agent is rational iff it makes decisions that reflect MEU (I would argue rather that ‘An agent can’t be rational if it does not make decision based on MEU). Rationality should encompass consideration of the source of utility values.

Using Bayesian networks with decision and utility nodes, Dynamic Utility Networks can be developed. Depending on the information available, the maximum expected utility of decision can be calculated. This is a key concept for rational planning in uncertain environments. The value of new information can also be calculated using Shannon’s utility gain, a topic to be discussed next lecture.

ExpectedUtility
Expected utility in uncertain environments is linked with Bayes theorm

FIT5047 – Intelligent Systems Week 7

This week’s lecture delved further into probabilistic systems, specifically Bayesian networks. We explored the rationale behind the explosion in probabilistic systems in the past 15 years, namely the computational shortcuts achieved via the markov property and real world usefulness of  refining probabilities as known information increases.

We got much more detail about Bayesian network this week, including the components:

  • Random Variables [nodes, represented as binary, ordered, integral or continuous states]
  • Node links
  • Conditional Probability table [Quantifying the affect of parent on a  node]
  • The network is a direct, acyclic graph

Nice simple example of a Bayesian network

With the network seen above, causalities are indicated by the node links. If a node state changes from unknown to know, the Bayesian network can efficiently update the probabilities for all the other nodes. Note that the efficiency of a network relies on finding marginally independent nodes. If in this example all of the nodes were linked, the network would not be very effective.  The Markov property as defined by the lecture notes:

There are no direct dependencies in the system being modeled which are not explicitly shown via arcs

In essences when re-propagating probabilities after information is introduced the Markov model allows for encapsulation when completing calculations, speeding up the process.

With a network we can conduct belief updating, a form of inference. Essential to this process is identifying conditional independence of nodes, again this is closely associated to the Markov property. I will need to do some reading before going into more detail about that one. The lecture came to a close with an introduction to Netica, followed up in the tut with some practical experimentation.

 

FIT5047 – Intelligent Systems Week 6

Intelligent systems’ sixth week took a swing into soft computing and probabilistic systems. We had a quick introduction to probability theory which had the usual intuition breaking outcomes. The use of venn diagrams to explain parts of Kolmogorov’s Axioms were particularly useful. The definition of conditional probability did strike me a little of guard however:

 

Conditional Probability
Conditional Probability

Although in review this does seem much clearer. Given the knowledge of B [yellow] what is the probability of A [red].  As per the diagram and axiom, the answer is the intersection of A and B [green].

A revision of elementary probability reminded me that although at first glance it seems a trivial subject probability requires some use of the brain and calculator:

Suppose a “once in a century” flood has probability 0.01 of occuring in a year. How long do we expect to wait for one?

The answer:

(1 – p)^n = .5

log(1 – p)^n = log (.5)

n log(1 – p) = log (.5)

n = log .5 / log(1 – p)

= approx 69 years

Next came some discussion over notation and then, more importantly, and introduction to Bayes’ Theorm. A simple explination of Bayes’ theorm can be seen here:

 

Discussion then continued to some of the silly mistakes in probability theory that litter its past. I’m sure in 20  years many of the financial tools used in present day will appear on the lecture slides in the same category.

Kevin also made the time to add some material about the nature of probability. The suggestion made in Russell and Norvig is that probability is simply used to represent and agents believe state. If this is kept in mind then it is understandable why Bayesian networks have been such a boon over the past 15 years.

FIT5047 – Intelligent Systems Week 5

Taking a turn away from first order logic, search algorithms and planning, week 5 introduced the key issues around natural language processing [NLP] and the programming language Prolog.
The logic programming paradigm use by Prolog is something I have not learned about before. The development of axioms and problems solving by querying the axioms is the foundation of languages such as prolog. The engine of Prolog is a backward chaining theorem prover. The axioms in logic programming need to be Horn clauses: disjunctions of literals with exactly one positive literal

An example:

[cc lang=”prolog”]king(X) & greedy(X) → evil(X).

king(john).

greedy(john).

?evil(john).[/cc]

In the  tutorial we were able to do some basic playing with a toy implementation by Lloyd Allison:

http://www.csse.monash.edu.au/˜lloyd/tildeLogic/Prolog.toy/

Prolog relies very heavily on unification, a process that we were actually unable to correctly re-inact in the tutorial.

Prolog Answer:

After reading the tutorial solution, I am not really much clearer on the proves for each of these outcomes. I will have to follow up in the lecture.

NLP

We discussed the surface level methodologies for NLP:

  • Lexical analysis
  • Syntactic analysis
  • Semantic analysis
  • Pragmatic analysis

The focus of the lecture was however on the limitations of NLP. How ambiguity of words, their meaning and context makes effective NLP very difficult. Implications were another issue for NLP covered for some time.

Next came some approaches for overcoming the challenges of NLP. Statistical approaches such as N-Gram analysis. This veered the lecture into information retrieval , discussing the techniques used by search engines such as google to interpret searches.

On the topic of NLP I wondered if there were any large knowledge bases being assembled to try and assist in the task. Yahoo have a cluster of computers working on this:

Another interesting project:

 

 

FIT5047 – Intelligent Systems Week 4

The fourth week of Intelligent systems provided an introduction to Planning. The question that stands out in my mind from the tutorial is: what is the difference between problem solving via search and problem solving via planning? After running through the weeks material I hope to be able to answer that question.

First off the two learning objectives for the week (hmm seems to simple):

  • Goal stack planning
  • Partial order planning

What do we need to do goal based planning?

  • a world model
  • an action model
  • a problem solving strategy

Planning uses a divide and conquer strategy, allowing for sub goals to be identified. One must be careful to ensure that interrelation between sub goals is identified. This can reduce branching factors and assist in problems where heuristics are difficult to define.

Planning algorithms can work forwards or backwards, it seems that in most situations working backwards from the goal proves to be more efficient and further reduces branching factors.

Here is an example of a Goal stack planner from the lecture:

World model: Objects, states and goals

Action model: Operators

Problem solving strategy: Goal-stack planning

States are defined using first order logic, ie: AT(home) ΛHAVE(milk) ΛHAVE(bananas)

At this point we were introduced to the frame problem

The frame problem is that specifying only which conditions are changed by the actions do not allow, in logic, to conclude that all other conditions are not changed.

I don’t really understand why this is such an issue at present, the human brain does not reconfirm every fact after an action and it can still function effectively. Perhaps the reasoning this is considered such a big issue will become more apparent as I learn a bit more on the topic.

So, with an action in the goal stack planning systems [ using STRIPS] would appear as such:

>ACTION: GO(market)

>PRECONDITION: AT(home) ΛCAR-AT(home)

>ADD: AT(market), CAR-AT(market)

>DELETE: AT(home), CAR-AT(home)

From the goal state, one can define the preconditions, identifying which action is required to generate those conditions and work back until the initial state is reached. I do have some questions on how actions are selected and backtracking occurs. As in box world if B is picked up, why would it not be placed back on C (see lecture notes) unless there is an explored set.

After the box world example we moved onto Partial order Planning:

 

POPvsTOP
Partial order planning allows for concurrent actions

Partial order planning allows for less commitment in the search, reducing backtracking. I am still a little fuzzy on how this would be implemented so I will have to review the text book.

So, planning is clearly different from searching which simply applies valid operations to an initial state until it stumbles onto the goal state (this stumbling can be guided by heuristics).

FIT5047 – Intelligent Systems Week 3

Week number 3, titled ‘Knowledge Representation’ ran through propositional and first-order logic. Specific attention was paid to Inference, the prime reason for creating Knowledge Bases.

I am a little bit cloudy on the methods for attaining inferences in Knowledge Bases. The lecture slides describe ‘Inference – Resolution Refutation’. To my understanding resolution can be attained through refutation or simply by applying the inference to clauses and determining the resolvent.

There is a clear example of ‘General Resolution’ and also ‘Proof by refutation’. I understand the difference in the processes; however I do not understand why one would use resolution by refutation as opposed to general resolution.

Considering our assignment which is due next week requires this understanding and info on this seems scarce in the text book, I shall inquire at next week’s lecture.

generalresolution
General resolution - source week 3 lecture notes

resolutionbyrefutation
Resolution by refutation - source week 3 lecture notes

FIT5047 – Intelligent Systems Week 2

Week 2 – search algorithms.I implemented A Breadth first search in Python for application to the Missionaries and cannibals problem.

—–>>>>      Python script:  http://mchost/sourcecode/MandC.py <<<<-—-

To run install python 2.7 then open terminal/cmd and $ python -i MandC.py

The game is not that simple: Play Missionaries and Cannibals game

The other search algorithms that we discussed in the week 2 lecture were:

  • Breadth first search
  • Depth First search
  • Hill climbing
  • Simulated annelaing
  • Graph search (A*)

The most important assessment criteria for search algorithms are:

  • Completeness [If there is a solution will it be found?]
  • Optimality [Will the algorithm find the best solution (lowest path cost)]
  • Time/Space complexity [Processing power and memory requirements]

Also introduced were heuristics [warmer colder hints for the search], two key characteristics:

  • Admissibility [the cost of reaching the goal is never overestimated]
  • Monotonicity [The estimated cost to goal from N1 is not greater than the step cost to N2 plus estimate to goal from N2 (aka consistency]

 

some search algorithms (source: week 2 lecture notes)

<a href=”http://www.novelgames.com/flashgames/game.php?id=54&l=e”>Missionaries and Cannibals</a>