The rise of increasingly complex computational systems has brought about a particularly challenging problem: how to control and classify the monumental amounts of data that are created and stored every day? Joseph Kearney, Blockchain and DLT Specialist for the Research and Development department at Atlas City, explains how Random Forests enter the story.
While the challenge of storage is simplified by the wide availability of cheap computer memory, the effective utilisation of this resource remains overlooked. It is frequently stated that data is the new oil; however, this asset is for the most part neglected, with vast amounts of potentially useful information going to waste. Most data lies redundant, or requires painstaking manual checks of large datasets that often yield low-quality results. This is primarily due to the data's impenetrable nature and sheer scale.
Random Forests are a powerful, yet relatively simple, data mining and machine learning technique, allowing quick and automatic identification of relevant information from extremely large data sets. With such a tool, the information can be extracted without the need to manually trawl through it.
Random Forests rely on the consensus of many predictions rather than trusting a single guess. Using data mining and machine learning techniques like Random Forests, data sets can be manipulated to form highly accurate models of what the data is telling us and to inform best business practice.
Classification and Regression
The Random Forest algorithm allows classifications and regressions to be made from the copious amounts of data generated by companies every day.
Classification: an object is assigned to a data subset, based on information known about the object. For example, households can be classified into those that have pets and those that don't.
Regression: an actual number is output, for example the average number of pets per household.
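As a minimal sketch of the difference, using invented household data: classification outputs a discrete label, while regression outputs a number.

```python
from statistics import mean

# Hypothetical households, each described by a few known features.
households = [
    {"income": 40_000, "garden": True,  "pets": 2},
    {"income": 25_000, "garden": False, "pets": 0},
    {"income": 60_000, "garden": True,  "pets": 1},
]

def classify_has_pets(household):
    # Classification: assign each household to a discrete subset.
    return "has pets" if household["pets"] > 0 else "no pets"

def average_pets(households):
    # Regression: output an actual number.
    return mean(h["pets"] for h in households)

labels = [classify_has_pets(h) for h in households]
avg = average_pets(households)  # 1.0 for the data above
```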
Building your Random Forest
Decision trees are the base component of Random Forests. Decision trees are very simple by nature. They operate in a similar way to the game 21 questions, where a series of binary yes-or-no questions is asked until you arrive at an answer, or a classification, at the end. Datasets are formed from a series of objects characterised by specific features.
Assume, for the sake of explanation, a dataset of cars. Each car is defined by specific features such as its colour, make, model, registration plate, etc. The tree sorts the objects by asking a series of questions about each one. Depending on the answer to a question, the object moves down a particular branch of the tree. Eventually, the object is classified according to its answers into one of the leaf nodes sitting at the bottom of the tree. Entire data sets can be run through the tree in this way, turning unwieldy data into easily readable classifications from which valuable information can be extracted.
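The car example can be sketched as a tiny hand-built decision tree. The features, thresholds and class names below are invented purely for illustration:

```python
def classify_car(car):
    """Walk a toy decision tree: each question sends the car down a branch
    until it reaches a leaf classification."""
    # Question 1: is it a recent registration?
    if car["year"] >= 2015:
        # Question 2: does it have low mileage?
        if car["mileage"] < 50_000:
            return "premium"
        return "standard"
    # Older cars fall straight into a third leaf.
    return "budget"

cars = [
    {"make": "Ford", "year": 2018, "mileage": 30_000},
    {"make": "Fiat", "year": 2016, "mileage": 80_000},
    {"make": "Mini", "year": 2009, "mileage": 120_000},
]
results = [classify_car(c) for c in cars]
print(results)  # ['premium', 'standard', 'budget']
```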
Decision trees do not, however, provide completely accurate classifications. As with any statistics-based problem, the more data collected and tested, the more accurate the predictions.
Random Forests provide deeper classification and better predictions
Instead of creating a single decision tree, the Random Forest algorithm creates many individual trees from randomly selected subsets of the dataset. Each of these individual trees generates a classification of the objects within its subset of data. These multiple classifications can then be combined to get a more accurate prediction, or, in the case of regression, a more accurate mean value. This resampling technique is called bootstrapping.
Assume, for instance, that you have a very large dataset that you wish to classify. This dataset is randomly split into many smaller datasets, and for each of these a decision tree is created. Each subset of data has no knowledge of what is in the other subsets; it is therefore free to make its own predictions solely according to the data presented to it. The decision tree created for each subset is also different, meaning that the questions asked of each object differ from tree to tree, and certain questions can be given more weight than others. Each random subset predicts its own classifications based on its own data, generating what it believes to be the correct prediction. Any individual bootstrapped dataset may get its classification, or regression, slightly wrong; however, the algorithm relies on the global average of all the bootstrapped sets to get the correct outputs.
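A toy sketch of this bootstrapping idea, using invented data and one-question "stump" trees: each stump trains on its own random resample, and the forest answers by majority vote.

```python
import random
from collections import Counter

random.seed(0)

# Each point: (feature value, true label). The labelling rule is invented.
data = [(x, "high" if x > 5 else "low") for x in range(10)]

def train_stump(sample):
    # "Train" a one-question tree: pick a split threshold from the sample.
    threshold = random.choice(sample)[0]
    return lambda x: "high" if x > threshold else "low"

# Bootstrap: each tree sees its own random resample (with replacement).
forest = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]

def predict(x):
    # The forest's answer is the majority vote across all trees.
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]
```

Any single stump may pick a poor threshold, but the vote across 25 bootstrapped stumps washes out most individual mistakes.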
Enhanced classification and reduced errors
Alternatively, Random Forest can utilise training sets, that is, a subset of the global dataset where the correct output or classification is already known. The rest of the data is split into subsets, for which individual decision trees are then created. As the correct output is already known, the quality of a random tree built from a subset can be assessed.
The training runs are not considered in the final voting. However, they do allow us to form our decision trees in a way that enhances classification quality. This technique prevents a statistical error called overfitting. Overfitting is where a model fits the original dataset so closely that its classifications are weighted towards it; new information added to the dataset would then be falsely classified. This matters because the data being created is ever-changing. The use of training sets prevents overfitting because many different data subsets are tested, giving a wide variance in the types of data the model is checked against. These training sets can also be constantly improved and updated.
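A minimal sketch of assessing quality against held-out data whose correct labels are known. The data and the single-threshold "tree" here are invented; the point is that a model can look perfect on the data it was built from while its real quality only shows on unseen points.

```python
import random

random.seed(1)

# Label each point by a simple rule; jitter the feature slightly.
points = [(x + random.uniform(-0.4, 0.4), x > 10) for x in range(21)]
random.shuffle(points)
train, test = points[:14], points[14:]

# "Train": place a split threshold midway between the two classes seen.
neg = [x for x, label in train if not label]
pos = [x for x, label in train if label]
threshold = (max(neg) + min(pos)) / 2

def model(x):
    return x > threshold

train_acc = sum(model(x) == label for x, label in train) / len(train)
test_acc = sum(model(x) == label for x, label in test) / len(test)
# train_acc is 1.0 by construction; test_acc on the held-out points is
# the honest measure, and any gap between the two signals overfitting.
```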