Facilitating Machine Learning Ep 1: An Introduction to Computer Vision

BITS Goa Women In Tech
8 min read · Oct 27, 2020

This article was written as part of the Facilitating Machine Learning series by the BITS Goa Women in Tech community. The goal of the series is to foster an interactive community of learners in the field of Machine Learning, through interactive talks and discussions. The series hopes to spark interest and provide a platform for shared ownership towards learning.

What is Computer Vision?

Computer Vision is the subfield of artificial intelligence that tries to imitate human vision. This includes not only the ability to see, but also the ability to perceive the environment and make decisions based on it. Computer Vision is a broad field: it encompasses not just the machine learning models that make the logical decisions, but also the hardware required to “view” the environment, and the encoding and signal processing needed to construct the image the model takes as input. The main aim of computer vision is to understand and automate tasks that the human visual system can do. It has applications in several fields, and is widely used for defect detection, metrology, facial recognition and robotics. Thanks to the vast amount of visual data we generate today, Computer Vision has grown by leaps and bounds in recent times.

A ridiculously brief history

In the late 1960s, computer vision began at universities which were pioneering artificial intelligence. It was meant to mimic the human visual system, as a stepping stone to endowing robots with intelligent behaviour. In 1966, it was believed that this could be achieved through a summer project, by attaching a camera to a computer and having it “describe what it saw”.

Studies in the 1970s formed the foundations for many of the computer vision algorithms that exist today, such as extraction of edges from images, labelling of lines, non-polyhedral and polyhedral modelling and representation of objects as interconnections of smaller structures.

The 1990s marked the first time that statistical learning techniques were used in practice to recognise faces in images. Despite this progress, computer reasoning was still far from achieved, and the field had already endured periods of reduced funding and interest in computer vision and artificial intelligence research, termed “AI Winters”. It wasn’t until AlexNet won the ImageNet challenge in 2012 that AI witnessed the massive boom that it currently enjoys. However, this doesn’t take away from the fact that Computer Vision continues to remain a difficult field to master, and there still might be a long way to go.

Why is Computer Vision so hard? Moravec’s Paradox

Moravec’s Paradox states that “It is comparatively easy to make computers exhibit adult level performance on intelligence tests, playing checkers or calculating pi to a billion digits, but difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility… The mental abilities of a child that we take for granted — recognising a face, lifting a pencil, or walking across a room — in fact solve some of the hardest engineering problems ever conceived… Encoded in the large, highly evolved sensory and motor portions of the human brain is a billion years of experience about the nature of the world and how to survive in it.” In short, the hard problems are easy, and the easy problems are hard. This explains why research into computer vision and robotics had made so little progress by the 1970s. Moravec’s Paradox is also why a lot of “AI research” lost its AI label: understandably so, for how can something be termed “intelligent” if it cannot even replicate the behaviour of a one-year-old? The invisible complexity of the “simple skills” that we so often take for granted is extremely difficult for computers to master. Since computer vision attempts to stand in for human sight, it has to efficiently replicate all the evolutionary learning that our eyes and brains inherited for the purpose of ‘seeing’. Computer vision, therefore, always has an uphill battle in front of it.

To put it succinctly, challenges in computer vision are:

  1. Computers do not have built-in mechanisms to extract higher-level information from images like humans do.
  2. Representing image data in a form that computers can reason about is itself a challenge.

Applications of Computer Vision

  • Object Classification
  • Object Identification
  • Object Verification
  • Object Detection
  • Object Landmark Detection
  • Object Segmentation
  • Object Recognition

The backbone of Computer Vision: Convolutional Neural Networks

To overcome the challenges faced in computer vision, feedforward neural networks just won’t cut it. Feedforward neural networks are fully connected, so for large input vectors the number of learnable parameters is enormous. This leads to long training times and also makes the model prone to overfitting on the training dataset.

As an example, look at the feedforward neural network above. All 10 inputs of the input layer are connected to all 8 neurons of the hidden layer. That is 10*8 = 80 parameters — just for the first hidden layer! In total, this extremely small feedforward neural network has around 6000 learnable parameters! Now imagine using a feedforward neural network on an image dataset. Even the smallest common image sizes are 28x28 or 30x30 pixels, which requires a feedforward neural network with 784 (or 900) neurons in the input layer. As the image size increases and the model becomes more complex, the number of learnable parameters for a feedforward neural network reaches the order of millions!
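The blow-up in parameter counts is easy to see with a few lines of Python. This is a minimal sketch: the layer widths below are illustrative, not taken from a particular model.

```python
def dense_params(sizes):
    """Count weights + biases for a fully connected network
    whose layer widths are given in `sizes`."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# The tiny 10-input, 8-neuron layer from the text: 10*8 = 80 weights (+ 8 biases)
print(dense_params([10, 8]))         # 88

# A 28x28 grayscale image flattened into 784 inputs, one modest hidden layer:
print(dense_params([784, 128, 10]))  # 101770 — already six figures
```

Scaling the input to a realistic photo size pushes this count into the millions, which is exactly the problem convolutions address.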

This problem begs the question, can we have Deep Neural Networks which are complex yet have fewer parameters?

The convolution operation

Our challenge at hand can be solved beautifully by a very simple mathematical tool called the convolution operation. Consider pizza waiting times at a busy Italian restaurant.

At t0, the waiting time for the pizza is X0.

At t1, the waiting time for the pizza is X1.

At t2, the waiting time for the pizza is X2.

Now, if you were asked to estimate the waiting time for a pizza ordered at t3, you would probably average out the neighbouring waiting times.

Estimated waiting time X3 at t3 = ⅓ * (X0 + X1 + X2)

Now suppose you want to introduce a prior belief into your estimation: a neighbour closer to t3 should have greater influence on the estimated waiting time than the neighbours before it. You could come up with an increasing weight system W0 < W1 < W2 such that:

Weighted estimate waiting time = W0*X0 + W1*X1 + W2*X2.

In the above equation, you have essentially assumed that the information gained from the closest neighbour is greater than that from a neighbour farther away. A different set of weights could be chosen, depending on what information the neighbouring points carry in a given setting.

What we have done above is known as a convolution. A convolution calculates a weighted average of the neighbouring values to estimate the value at the current point. It is applied when information is gained not just from the current point of interest, but also from the neighbouring points.

1D convolution

The table below is an example of a 1D convolution.

Example of 1D Convolution

We say that the W vector convolves over the X vector and the result is 1.80 (the sum of Wi*Xi over the window).
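The sliding-window computation behind the table can be sketched in plain Python. (Like most deep-learning libraries, this actually computes cross-correlation — the kernel is not flipped — and the input and weight values below are illustrative.)

```python
def conv1d(x, w):
    """'Valid'-mode 1D convolution: slide w over x, taking a dot
    product sum(Wi * Xi) at each position."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

# A 3-tap averaging kernel smooths the input sequence
print(conv1d([1, 2, 3, 4, 5], [0.25, 0.5, 0.25]))  # [2.0, 3.0, 4.0]
```

Each output value summarises one window of the input, which is the “information from the neighbours” idea in code.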

2D Convolution

In a 2D convolution, the information gain is across neighbours in both rows and columns. A 2D convolution operation is particularly useful in the context of computer vision, since images are typically represented as matrices. The 2D convolution is a useful tool for analysing images, since a single pixel by itself gives little information. We need to look at neighbours along both rows and columns to gain some. Blue-coloured pixels could be part of the sea or the sky, and we can only be sure which category they belong to once we have looked at the neighbouring pixels. Thus, the convolution operation becomes the foundation for our computer vision problem. We call the image matrix the input, and the weight matrix the kernel or the filter. The kernel extracts specific features depending on what its weights are.
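Extending the 1D version to two dimensions just means sliding the kernel along both rows and columns. Below is a minimal NumPy sketch (again cross-correlation in “valid” mode, with a toy 3x3 image and a 2x2 averaging kernel as assumed examples):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid'-mode 2D convolution of `kernel` over `image`."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum over the kh x kw neighbourhood at (i, j)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])
kernel = np.full((2, 2), 0.25)   # averages each 2x2 neighbourhood
print(conv2d(image, kernel))     # [[3. 4.]
                                 #  [6. 7.]]
```

Swapping the averaging kernel for, say, an edge-detecting one changes which feature gets extracted, without changing the mechanics above.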

The Convolutional Neural Network

We can now use the convolution operation to build our network. The filters become analogous to weights in the feedforward neural network, and the entire convolutional output corresponds to a whole layer of neurons in the hidden layer. In feedforward neural networks, we consider all input values from the previous layer multiplied by their weights, while in CNNs we consider only a small number of input values multiplied by the filter values. Thus, CNNs bring about a huge reduction in the number of parameters by making use of two very important concepts — sparse connectivity and weight sharing. Sparse connectivity refers to the reduced number of connections between two consecutive layers, while weight sharing refers to the reduction in parameters that comes from the same filter convolving over the entire image. The end result is a neural network that is much better tuned to handling images, and it is one of the most widely used architectures for computer vision problems.
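The savings from sparse connectivity and weight sharing can be made concrete with back-of-the-envelope arithmetic. The image size, channel counts and 3x3 filter below are arbitrary illustrative choices:

```python
# One layer over a 224x224 RGB image, producing 64 outputs either way
h, w, c = 224, 224, 3

# Fully connected: every input pixel connects to each of 64 hidden units
dense = (h * w * c) * 64 + 64      # weights + biases

# Convolutional: one shared 3x3xC filter per output channel,
# reused at every spatial position of the image
conv = (3 * 3 * c) * 64 + 64

print(dense)  # 9633856
print(conv)   # 1792
```

A reduction of over 5000x in learnable parameters, while the convolutional layer still looks at every pixel — just through a small shared window at a time.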


About the author: Arshika Lalan

I am a pre-final-year M.Sc. Economics and B.E. Computer Science student. I am a Deep Learning and Machine Learning enthusiast who loves to experiment and explore new ideas. I love maths, econometrics and statistics, and I also enjoy web development using React.js. In my free time I write poetry and tech articles. You can stay connected with me via LinkedIn and check out my articles on Medium.

Thank you for reading!

Check out more from BITS Goa Women in Tech on our Website, Instagram, LinkedIn and Medium!
