
101 Artificial Intelligence and Machine Learning

I would appreciate more depth on neural networks. I have written some simple ones in the past but getting straight information from an actual AI researcher would be great. :techman:
 
Alright, it may take me a while as Christmas is just around the corner, I will be spending time with the family, and I was not exactly prepared to dig deeper. I will need to think about how best to present the rest of the neural network stuff. It can be quite a huge topic so I need to do some trimming and organizing before I can present something. Is there any specific thing you want to know or ask?

For now, this is what I will say...

In the past decade there has been a revival of a form of Neural Networks called Deep Learning. I'm in the school that says Neural Networks and Deep Learning are 90% overhyped, rebranded old technology and 10% genuine scientific advancement. I do appreciate the fact that the terms Neural Network and Deep Learning have captured the public's fascination with Artificial Intelligence and Machine Learning. However, I don't think there has been as big a jump in our understanding of AI as many other researchers have claimed.

If anyone could be thought of as the Machine Learning equivalent of Prof. Stephen Hawking in Physics, Prof. Michael Jordan would probably be one of the top few contenders. So the fact that he feels the same way as I do is quite telling.
 
So far, the type of Neural Network I have talked about was developed in the 1970s to 1980s. I am going to take a slight detour into newer stuff that was developed in the 1990s and early 2000s before returning to Neural Networks in about two or three posts. The reason is that the material I'm going to cover will serve as a foundation for talking about the more advanced stuff in Neural Networks.

As a research field we didn't intentionally set out to do this, but looking back in hindsight, it turns out that AI research in the 1990s and early 2000s was in large part focused on creating better decision boundaries. I will focus on three critically important ideas that came out of this period:

1) Ensemble Learning
2) Large Margin Classifiers / The Maximum Entropy Principle
3) The No Free Lunch Theorem

Ensemble Learning http://en.wikipedia.org/wiki/Ensemble_learning
I could be wrong about the year, but I believe it was in either 1995 or 1996 when a notable researcher, Leo Breiman, came up with a crazy notion. Instead of spending a huge amount of time training the perfect classifier, how about we just use a bunch of quick-to-train but bad classifiers? This idea, now called Ensemble Learning, didn't turn out to be crazy after all. Today, just about every AI system with state-of-the-art prediction accuracy uses this idea in one form or another.

To explain why this counter-intuitive idea works, let's consider the parable of the blind men describing an elephant. One blind man touches the elephant's ear and declares, "An elephant is like a piece of leather. Large, rough and floppy." Another blind man touches the tail and says, "No, an elephant is like a thick rope. Sinewy and tough." A third blind man touches the elephant's leg and says, "An elephant is like a tree trunk!" Clearly, they are all simultaneously right and wrong. More specifically, each had a different perspective, which resulted in each having a different description of what an elephant is. If we had a large number of blind men, each with their own perspective, and took a vote, then we start to have the wisdom of the crowds. A good number of them may individually be wrong, but together a hundred blind men can probably quite accurately agree on whether the object in front of them is indeed an elephant.

The simplest "Ensemble" classifier, which we call a Bootstrap Aggregator or Bagging classifier, takes a large number of simple but different classifiers, and have them vote. Just like in a democracy, the selected answer is the one with the majority vote. As long as each individual classifier is sufficiently different from the rest, this method usually works quite well. There are slightly more advanced ways of creating an ensemble of classifiers, such as Boosting but in a nutshell all ensemble classifiers rely on the fact that the whole is greater than the sum of its parts. This idea works surprisingly well in practice. As mentioned, it has been used to create some of the most highly advanced AI being used today.

As an example, in 2006 Netflix started a competition to determine whether anyone could improve upon Netflix's own internally developed movie recommendation system. Thousands of teams participated, thanks to the one million dollar prize. In the first year, each competitor developed their own recommendation system, and after a lot of hard work the best system outperformed Netflix's internal system by about 1% in terms of accuracy. In the second year, the best performing system outperformed Netflix's by a massive 8% (compared to 1%, that's an eight-fold increase in the improvement). The secret? It used Ensemble Learning to combine 107 previously separate recommendation systems into one single system. In the third and final year of the competition, practically all the competing teams had merged their systems, using Ensemble Learning, into a handful of mega systems. The top two mega-teams, both using ensembles of around 500-700 recommenders, had a nearly identical improvement of 10% upon Netflix's original recommendation system.
 
That's really fascinating. It sounds a little bit similar to genetic programming, in which you have variations of an algorithm try to solve the same problem and then propagate and mutate whichever one(s) come closest, then do it over again.
 
I'll de-lurk for a moment, just to let you all know that I'm interested in the thread and following the convo, even if I don't chime in. :techman:
 
That's really fascinating. It sounds a little bit similar to genetic programming, in which you have variations of an algorithm try to solve the same problem and then propagate and mutate whichever one(s) come closest, then do it over again.

I hadn't thought about it this way but I suppose on a cursory examination they do look similar.

However, Genetic Algorithms are modeled on "survival of the fittest" whereas Ensemble Learning is modeled on "wisdom of the crowds". They are actually two very different ideas.

Genetic algorithms are designed to start with a large number of random "solutions", then whittle them down to a few good solutions or even a single one. The goal of ensembles, on the other hand, is to end up with a large number of diverse, partially-good classifiers. Without this diversity, ensemble learning can actually perform worse than many standard classifiers.
 
That's really fascinating. It sounds a little bit similar to genetic programming, in which you have variations of an algorithm try to solve the same problem and then propagate and mutate whichever one(s) come closest, then do it over again.

I hadn't thought about it this way but I suppose on a cursory examination they do look similar.

However, Genetic Algorithms are modeled on "survival of the fittest" whereas Ensemble Learning is modeled on "wisdom of the crowds". They are actually two very different ideas.

Genetic algorithms are designed to start with a large number of random "solutions", then whittle them down to a few good solutions or even a single one. The goal of ensembles, on the other hand, is to end up with a large number of diverse, partially-good classifiers. Without this diversity, ensemble learning can actually perform worse than many standard classifiers.

Right, I just made note of both as involving taking a large number of approaches, but where genetic programming ultimately chooses one or two of the "most fit" solutions, ensemble considers the aggregate answers of all classifiers as a combined solution. Very fascinating!
 
Time for a pop quiz. :)

The image below shows a simple dataset with two classes, depicted by red and green crosses. Also in the image are two decision boundaries, one with a blue line and the other with a purple line. Both decision boundaries accurately split the dataset into two parts with 100% accuracy, but one decision boundary is better than the other. Can you tell me which decision boundary, blue or purple, is the better decision boundary? And why is it better?

[Image: a two-class dataset of red and green crosses, separated by two candidate decision boundaries, one blue and one purple]
 
Purple. The blue line has two data points so close to the boundary that they touch it. That, and purple is always superior.
 
Yeah, I'm going with purple, because it seems to represent a "best fit" line between the two sets--it never draws too close to either.
 
There is not enough information to decide which is better. Reasonable criteria have been given for why the purple line is better. However, if the goal were to provide a boundary that is a function of x, the horizontal coordinate, then the blue line is a much better approximation of a vertical boundary. It depends on what one means by "better"; what is it exactly that you are trying to maximize?
 
Purple. The blue line has two data points so close to the boundary that they touch it. That, and purple is always superior.

Yeah, I'm going with purple, because it seems to represent a "best fit" line between the two sets--it never draws too close to either.

There is not enough information to decide which is better. Reasonable criteria have been given for why the purple line is better. However, if the goal were to provide a boundary that is a function of x, the horizontal coordinate, then the blue line is a much better approximation of a vertical boundary. It depends on what one means by "better"; what is it exactly that you are trying to maximize?

All three of you are correct. In AI, we usually refer to the blue decision boundary as being biased.

Although in truth, all decision boundaries we try to create in Machine Learning are biased in some way, as we have no idea what the "true" decision boundary is. Who knows? Maybe the blue decision boundary IS the "true" decision boundary. The only way to discover the "true" decision boundary is to collect so much data that we can fill the entire image with red and green crosses. And that assumes our data collection process is flawlessly perfect and doesn't introduce any "noise" into our dataset. However, as a "best guess" based on the limited amount of data we have, there's a much better chance that the purple decision boundary is less biased than the blue one.

Alright, let's try a slightly harder question. What about the image below? Which is the less biased decision boundary, purple or orange?

[Image: the same dataset, now with a straight purple decision boundary and a curved orange decision boundary]
 
I'm resisting my instinct to say that the curved line is better because it rests on a more complex formula. One thing I seem to have gotten from your posts on all this is that the simpler solutions tend to be more reliable. I would say there is either no difference, or the curve is only slightly better, or possibly even worse. On balance, I'd probably stick with the purple line.
 
Robert, I will address your concerns about the simplicity/complexity of decision boundaries in the next post, as you have stumbled upon another important Machine Learning idea which I initially was not going to talk about. For now, here's the next topic.

Support Vector Machines http://en.wikipedia.org/wiki/Support_vector_machine
I had a very difficult time trying to write this post. The reason is that the topic is an advanced classification algorithm called the Support Vector Machine (SVM for short). The creation of the SVM in 1995/1996 is one of the major milestone events in Machine Learning. It is so important that every Machine Learning textbook spends at least a chapter on it, and it is covered in every undergraduate and postgraduate Machine Learning course. In a way, SVMs heralded the decline of Neural Networks in the 1990s, because SVMs routinely outperformed Neural Nets by anywhere between 5% and 20% accuracy in just about every classification task we tried. Optical Character Recognition, Speech Recognition, Face Detection, Natural Language Processing, Fraud Prediction, Microsoft Kinect, Self Driving Vehicles and a whole host of other technologies all suddenly became achievable thanks in part to SVMs and the lessons and theories we have learned from them.

So... how does it work? Well, I won't be covering that, as it uses highly sophisticated "rocket science" mathematics that, frankly, I'll need to brush up on before I can speak with any authority. Describing SVMs without the use of math is going to be very hard (now I know how Physicists feel when trying to describe Quantum Mechanics without math). What I will do is use an analogy to try and explain how the SVM does the magic it does. It would be kind of like how Q showed Janeway the Q Continuum: a version of the truth that is nowhere close to the real deal.

Maximum Margin Classifier
Older classification algorithms, such as Neural Networks and Decision Trees, do not impose any special conditions on the type of decision boundary that can be generated. Once the algorithm finds a decision boundary that results in high prediction accuracy, it considers its job done. This means that any of the decision boundaries in the image below are equally good. In fact, a Neural Network would probably have found the blue decision boundary first and stopped right there, because it gave 100% prediction accuracy. It would not go on to find the purple or the orange decision boundary.

[Image: the dataset with three decision boundaries, blue, purple and orange, each separating the two classes perfectly]


The SVM is different in this respect. Not only must the decision boundary give high accuracy, the SVM also requires that the decision boundary have an equally wide gap separating the datapoints on both sides. This is why the SVM is sometimes also referred to as a maximum margin classifier. To create such a decision boundary, it roughly uses these steps:

1. Pair up every red datapoint with the closest green datapoint. Out of these red-green pairs, select the pairs with the shortest distances between them. These will be the support vectors (pillars) used to define the decision boundary.

[Image: the closest red-green pairs of datapoints highlighted; these are the support vectors]


2. Now to create the decision boundary, pick the midpoint between these support vectors. That is where the decision boundary lies.

[Image: the maximum margin decision boundary drawn midway between the support vectors]


In a separate line of research that occurred at roughly the same time as SVMs were first developed, another group of researchers worked on a more theory-based approach to finding the best unbiased decision boundary. They used information theory and proved that, for a given dataset, the best and most unbiased decision boundary is the one that maximizes information entropy. These days, we call this the Maximum Entropy Principle or Maximum Entropy Discrimination. It took several more years to conclusively work out all the math, but it turned out that the decision boundary conditions used by the SVM are exactly the ones that satisfy the Maximum Entropy Principle. So it was a happy coincidence that SVMs turned out to have a very good theoretical grounding as well.

So why is Maximum Margin / Maximum Entropy such an important idea? Well, if you think about it, most classification errors happen close to the decision boundary. By having an unbiased decision boundary that is provably more likely to be similar to the "true" decision boundary, we can seriously reduce the number of misclassifications.
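For the curious, here is a hedged little sketch of the maximum margin idea in code, using scikit-learn's linear-kernel SVC on a made-up pair of clusters (the data and parameters are placeholders, not anything from a real application). The support vectors it reports are the "pillar" points described above, and the margin width falls straight out of the learned weights.

[code]
# A linear SVM finds the separating line with the widest possible margin.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters standing in for the red and green crosses.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=7)

clf = SVC(kernel="linear", C=1000)   # large C: insist on separating the classes cleanly
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("decision boundary: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("margin width:", 2.0 / np.linalg.norm(w))   # distance between the two margin lines
print("support vectors (the 'pillars'):")
print(clf.support_vectors_)
[/code]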
 
You know, when I was looking at that curve earlier, it did look to me like it was purposely made to lie exactly between data points in the two sets, but I wasn't sure if that meant anything. Glad to know my suspicion wasn't totally crazy. :techman:

I've honestly never heard of SVM before. Makes a lot of sense, though. I can see how this would be important for a lot of classification scenarios and, as you say, is the most unbiased way to do it.
 
I was going to write about Occam's Razor in Machine Learning, but it is turning out to be a much bigger topic. There are a lot of sub-topics interlinked with the idea of keeping algorithms simple in Machine Learning, and I am having difficulty explaining it all in a simple fashion. I don't know if I'll ever complete this section. I suspect it will be quite dry and possibly uninteresting as it involves a lot of technical and mathematical details. So for now, I am posting the next part (which is also quite theoretical)...

No Free Lunch Theorem http://en.wikipedia.org/wiki/No_free_lunch_theorem
Many scientific fields have their own versions of an impossibility theorem. In Mathematics, there is Gödel's Incompleteness Theorem, which states roughly that any mathematical system beyond the simplest toy systems has facts that can never be proven or disproven. In Computer Science, Turing's Halting Theorem states that there are computer programs that will never stop running (in other words, they are unable to solve certain tasks as they won't ever stop running or find a solution). The No Free Lunch Theorem is the far less well known and much more recent version of the impossibility theorem for Machine Learning and AI in general. I won't go into the theorem's proofs, but the implications are that:

1) There will never be a single general purpose algorithm that is best at every single task we throw at it. In other words, every general purpose algorithm will have some form of Achilles heel, resulting in it performing badly on certain tasks. A simple example would be the Logistic Regression (straight line decision boundary) algorithm, which is easily defeated when the true decision boundary of the task isn't a straight line. It doesn't matter which algorithm we are talking about, Decision Trees, Ensembles, Neural Networks, Support Vector Machines, some future uninvented algorithm or even the human brain. There will always be some task where one algorithm will outshine all the others.

2) For a specific task, we can always design a specialized version of an algorithm that takes into account the unique nature of the task. Such specialized algorithms will outperform general purpose algorithms. In fact, over the past several decades a lot of doctorates have been given out to researchers who were able to come up with specialized algorithms for problems such as speech recognition, language translation, robot navigation, face/object recognition, etc...

The practical implication of No Free Lunch is simple: try as many algorithms as you can, for you may be surprised at which algorithm works best. I see many newcomers to Machine Learning who come in with the preconceived notion that algorithm X is the best. Typically, X is neural networks, as that's the algorithm most people have heard about. And they will stick to using their favorite algorithm for every single assignment. To be honest, this was me over a decade ago. These days, when I am starting on a new prediction task, I run as many algorithms as possible on the same dataset. Coupled with my knowledge of how each algorithm works, I find that this approach actually gives me insight into the "shape" of the dataset and how to further specialize the best performing algorithm so it can be even more accurate.
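In code, "try as many algorithms as you can" looks roughly like the sketch below: cross-validate a handful of very different classifiers on the same dataset and let the data pick the winner. The dataset and the shortlist of models are just placeholders for the illustration, not a recommendation of any particular lineup.

[code]
# No Free Lunch in practice: benchmark several very different classifiers
# on the same dataset and see which one the data happens to favour.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree":       DecisionTreeClassifier(max_depth=5),
    "random forest":       RandomForestClassifier(n_estimators=100),
    "SVM (RBF kernel)":    SVC(kernel="rbf"),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    print("%-20s mean accuracy: %.3f" % (name, scores.mean()))
[/code]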

The next few posts will be something more practical. How handwriting recognition actually works, how face recognition actually works and the Neural Networks revival since the mid 2000s.
 
Handwriting Digit Recognition, Part 1
So you want to build a handwritten number recognition app. That was one of the first machine learning applications I learnt to build with Neural Networks back in the 1990s, and here is a video demoing a similar piece of software.

[yt]https://www.youtube.com/watch?v=ocB8uDYXtt0[/yt]

MNIST
http://en.wikipedia.org/wiki/MNIST_database
http://yann.lecun.com/exdb/mnist/
As with any Machine Learning project, we need data. Lots of data. Thankfully, there is a curated collection of people's handwriting called MNIST, and MNIST is the dataset we use as a standard test for determining which algorithm does best at handwritten digit recognition. The MNIST dataset contains 70,000 samples of handwritten digits, stored as images 28 by 28 pixels in size. Here are a few examples from the dataset.

[Image: sample handwritten digits from the MNIST dataset]


Unfortunately, Machine Learning algorithms only work on numbers and cannot work directly with images, so we have to convert these images into numbers. To be precise, each pixel has to be converted into its greyscale level, a number representing how black or white the pixel is. So, for example, black is 0 and white is 255, with the numbers between 1 and 254 representing various levels of grey. Since each image's dimensions are 28x28, there are 784 pixels, and thus we end up with 784 greyscale numbers. Here's an example of a handwritten "1" converted into numbers (apologies for the distorted image, I don't know what happened there).

[Images: a handwritten "1" and its 28x28 grid of greyscale pixel values]
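As a rough illustration of that conversion step (the digit drawn below is a made-up placeholder, not an actual MNIST sample), flattening a 28x28 grid of greyscale values into the 784 numbers a classifier expects is essentially a one-liner with numpy:

[code]
# Turn a 28x28 greyscale digit image into a flat vector of 784 numbers.
import numpy as np

digit_image = np.zeros((28, 28), dtype=np.uint8)  # placeholder canvas, every pixel at greyscale 0
digit_image[4:24, 13:15] = 255                    # a crude vertical stroke, roughly a "1"

features = digit_image.reshape(-1)                # flatten to shape (784,)
print(features.shape)                             # -> (784,)
print(features.min(), features.max())             # greyscale range 0..255
[/code]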


Now that all the images have been converted into numbers, we can start the training process to define decision boundaries. We can basically use Neural Networks, Support Vector Machines, or practically any number of algorithms. Here are a few handwriting recognition systems developed over the years and how well they did, followed by a small training sketch.

1998 - Single Layer Neural Network, 92.4% accuracy
2002 - Support Vector Machines, 99.44% accuracy
2012 - Ensemble Convolutional Neural Networks (Ensemble learning with 35 Neural Networks specialized for images), 99.77% accuracy
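For a feel of how little code sits between the raw pixel vectors and accuracy numbers like those above, here is a hedged sketch that trains a kernel SVM on MNIST with scikit-learn. fetch_openml downloads the dataset on first use, the subsample size and parameters are arbitrary choices to keep the demo fast, and the resulting accuracy will not match the published results listed above.

[code]
# Train a digit classifier on MNIST: 784 greyscale numbers in, a digit 0-9 out.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Each row of X is one 28x28 image flattened into 784 greyscale values.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Subsample and rescale to keep training time reasonable for a demo.
X_train, X_test, y_train, y_test = train_test_split(
    X[:10000] / 255.0, y[:10000], test_size=0.2, random_state=0)

clf = SVC(kernel="rbf", gamma="scale")   # a kernel SVM, in the spirit of the 2002 entry above
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
[/code]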

Image Recognition is Solved, Right?
I found a website that showcases a piece of handwriting recognition AI called HeroSVM. These are the digits it misclassified (the correct classification is on the left and the incorrect guess by the AI is on the right).

[Image: digits misclassified by HeroSVM, with the correct label on the left and the system's guess on the right]


As you can see, some of these are pretty hard even for human beings. So is handwriting recognition a solved problem? Not at all. Consider this: a typical sentence has an average of 75 to 100 characters, and the very best AI right now has an accuracy of 99.77%. That is roughly one wrong character every 434 characters, or one wrong character every 4-5 sentences. Essentially, one wrong character in every paragraph. Human beings can do better than this.

There was a moment in the 1990s when we thought that, because we had achieved near-human accuracy on handwriting recognition, we had solved all image recognition problems. It turns out that handwriting recognition was one of the simplest image recognition tasks. When researchers tried using the techniques I described above on more complicated tasks such as face recognition, these algorithms did extremely poorly. I will explain why this is so in the next post.
 
Many scientific fields have their own versions of an impossibility theorem. In Mathematics, there is Gödel's Incompleteness Theorem, which states roughly that any mathematical system beyond the simplest toy systems has facts that can never be proven or disproven. In Computer Science, Turing's Halting Theorem states that there are computer programs that will never stop running (in other words, they are unable to solve certain tasks as they won't ever stop running or find a solution).

Just chiming in to point out that that's an incorrect statement of Turing's Halting Theorem. The Halting Theorem is a statement that a very specific kind of computer program does not exist, and the precise statement is perhaps somewhat technical. The Halting Theorem states that there is no Turing machine H operating on ordered pairs (F,X), such that for all inputs (F,X) we have that H(F,X) halts and computes whether the Turing machine encoded by F (under a fixed but standard encoding) halts on input X. The Halting Theorem (as the negative solution to the Halting Problem) is intimately related to Gödel's Incompleteness Theorem.
 