A brief overview of big data, machine learning and feature selection

I am sure that lately you have all been hearing a lot about Machine Learning, and if you are here, I guess you are curious to find out what it is. I will attempt to give you a brief and modest overview of Machine Learning as I see it.

Like most things, Machine Learning was invented because there was a problem that needed solving. So the first question I am going to look into is: what is the problem we are trying to solve? Answering this matters, because it underpins the fundamentals behind Machine Learning.

I am assuming you have all heard of the Big Data problem. It sounds a bit grand, and at first it doesn't really look like much of a problem, because we usually tend to think: more data must be good, right? How is that a problem?

Well, the four experiments currently running at the Large Hadron Collider at CERN together produce around 25 GB of data per second. That is an astonishing amount, and without processing it properly and extracting only the most useful information quickly and efficiently, you can imagine that working with it soon becomes impossible. Picture a fictional person at CERN in charge of finding the one interesting collision in those gigabytes of data: he or she is going to have a very bad day at work.

The Large Hadron Collider (Image: CERN, Creative Commons licence)

But it is not only physicists who have this problem. Geneticists have been dealing with a similar one since the completion of the Human Genome Project. With the rapidly dropping price of whole-genome sequencing, it is now possible to sequence your genome for around 1,000 USD. For reference, sequencing the first human genome took over a decade and about 2.7 billion USD. This is an astonishing drop in price, and it means vast amounts of data are available for geneticists to exploit when searching for innovative therapies for cancer, heart failure and many other diseases. But looking for those few special genes that determine whether someone may develop cancer, among the roughly 20,000 genes that make up our genome, is no easy task.

It is not only scientists, either. Supermarket owners, for example, would like to analyse all their sales data for, say, the previous year and understand the trends: what people bought and when, so that they can stock their shelves appropriately.

So it looks like Big Data is all around us, and in order to use it we need ways to actually analyse and understand it. Alongside statistics, the field of Machine Learning explores how we can analyse data and use it to predict future trends. One particular area is feature selection: it gives us tools to filter out unnecessary or redundant data and helps us select a subset of features that describes our data well.

But what is a feature?

To put it simply, a feature can be anything, as long as it helps you describe some concept. For example, when you look for reasons someone might develop heart problems, one of the factors you consider is their age. Age here is a feature.

Even simpler than that: when you go to the store to buy a new TV, you might look at various things that describe the device, such as display size, resolution, colour and price. All of those are features describing a TV. You can now begin to see that there is usually quite a large number of features that can describe something, but not all of them may be relevant to you. You might not care much about the material of the display's polarising filter. It is certainly a key part of how an LCD display works, but not a major factor in your decision to buy that particular TV. Similarly, not all particle collisions at CERN are of interest to scientists, and not all genes in the genome are linked to cancer. So we can begin to see that extracting only the most relevant features is quite important if we want to judge quickly which TV to buy, or which drug treatment might help a particular individual.
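To make the TV example concrete, here is a toy sketch of one simple, filter-style approach to feature selection. Everything in it is made up for illustration: the TVs, the feature names and the "did a shopper buy it" target are all hypothetical, and real feature-selection methods are considerably more sophisticated. The idea is just to score each feature by how strongly it relates to the outcome we care about, and keep only the strongest ones.

```python
# Toy sketch of filter-style feature selection (all data is hypothetical):
# score each feature by its absolute Pearson correlation with the target,
# then keep only the top-k features.

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(rows, target, k=2):
    """Rank features by |correlation| with the target and keep the top k."""
    names = rows[0].keys()
    scores = {f: abs(pearson([r[f] for r in rows], target)) for f in names}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical TVs: display size (inches), resolution (vertical pixels), price
tvs = [
    {"size": 32, "resolution": 720,  "price": 180},
    {"size": 43, "resolution": 1080, "price": 320},
    {"size": 55, "resolution": 2160, "price": 650},
    {"size": 65, "resolution": 2160, "price": 900},
]
bought = [0, 0, 1, 1]  # toy target: did a shopper buy this TV?

print(select_features(tvs, bought))  # → ['resolution', 'price']
```

On this made-up data, resolution and price track the purchase decision most closely, so they survive the filter while size is dropped. That is the essence of the idea: many features describe the data, but only a few of them matter for the question at hand.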

This is where Feature Selection comes in handy. Stay tuned: we are going to dive into what Feature Selection is in the next post.


Veneta Haralampieva

Veneta is a Computer Science with Industrial Experience student, currently completing her placement year at Accenture in London. Her particular fields of interest are Machine Learning and Computer Graphics.
