Data science and machine learning are often conflated, and for good reason.
This article covers the context in which data science emerged, a general definition of data science, and how machine learning and data science can be understood as roles in a business. First, what is the difference between machine learning and data science? We have already discussed the definition of machine learning in another two-part article: what is machine learning part 1 & part 2.
The emergence of data science is a story of two disciplines: the more mature discipline of statistics and the baby, computer science. What you and I think of as "data science" as a field, and more distinctly "data scientist" as a role, has only recently emerged as the profession of making sense of vast amounts of data.
About ten years before machine learning became dominant, big data was all the rage. All aspects of our lives started to be recorded and digitized, from the number of steps we walked to the number of hours spent playing video games. These digital records grew into very large data sets that were analyzed using computers: the birth of big data. Big data has become a source of information for research, and with it came the development of the data science we know today. While big data and data science are not interchangeable terms (big data is an aspect of data science), one could argue that the so-called big data revolution provided the impetus for the field. Read more about data cleaning and data wrangling here.
Before we jump into the definition, it’s important to note that there currently isn’t a generally accepted definition of data science. As a field in search of a definition, it is no surprise that there have been multiple attempts to define it. The definition below (from Wikipedia) is purely to give you a better intuition of the field.
Data science is a ‘concept to unify statistics, data analysis, machine learning and their related methods’ in order to ‘understand and analyze actual phenomena’ with data.
Data science is understood as an interdisciplinary field. What makes it worthy of being its own field is the breadth of required skills, including but not limited to computer science, machine learning, maths and statistics, traditional research, databases, and data processing. In short, it means applying any scientific method, process, algorithm, or system to extract knowledge and insights from both structured and unstructured data, and then convincingly communicating the results.
This field in search of a definition has been drawn as a Venn diagram many times over. Early versions oversimplified what the field includes (seen in the first two diagrams below). The updated diagram by Stephan Kolassa (bottom right) takes communication into account, because all the insights you derive won’t make a bit of a difference unless you can communicate them to people who may not have that unique blend of knowledge. It is interesting to note that this diagram was not without some hearty commentary. Stephan Kolassa explained his reasoning on the Data Science Stack Exchange:
I still think that Hacking Skills, Math & Statistics Knowledge and Substantive Expertise (shortened to "Programming", "Statistics" and "Business" for legibility) are important... but I think that the role of Communication is important, too. All the insights you derive by leveraging your hacking, stats and business expertise won't make a bit of a difference unless you can communicate them to people who may not have that unique blend of knowledge. You may need to explain your statistical insights to a business manager who needs to be convinced to spend money or change processes. Or to a programmer who doesn't think statistically.
Whilst the definition remains in a state of flux (and is likely to remain so in a rapidly evolving field), what can be agreed upon is that the field of data science and the evolution of the data scientist are spurred on by the very real possibility that data may become the pivot point on which the economy spins.
In 1962, John W. Tukey wrote in "The Future of Data Analysis":
“For a long time I thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and doubt… I have come to feel that my central interest is in data analysis… Data analysis, and the parts of statistics which adhere to it, must…take on the characteristics of science rather than those of mathematics… data analysis is intrinsically an empirical science… How vital and how important… is the rise of the stored-program electronic computer? In many instances the answer may surprise many by being ‘important but not vital,’ although in others there is no doubt but what the computer has been ‘vital.’”
Jim Gray, the Turing Award winner recognized for his work on databases and transaction processing, imagined data science as a “fourth paradigm” of science, following the empirical, theoretical, and computational paradigms. Gray claimed that “everything about science is changing because of the impact of information technology” and the data deluge.
In functional terms, a data scientist (as opposed to the field of data science) often defines the problem, identifies key sources of information, designs the framework for collecting and manipulating the needed data, and communicates the results.
The information extracted through data science applications is used to guide business processes and reach organizational goals, and to offer new perspectives on solving existing problems. Importantly, it also has the potential to revolutionize our understanding of fundamental questions.
Data science uses many techniques to perform data-driven research, and machine learning is one of the techniques you can apply to your data. Machine learning is an application of artificial intelligence that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed. In practice, this means that a model developed through data science can be deployed to production and updated continuously as new data arrives.
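To make "learning from experience without being explicitly programmed" a little more concrete, here is a minimal sketch in Python. It is an illustration only: the function names `fit_line` and `predict` and the four data points are made up for this example, and ordinary least squares stands in for the many techniques that fall under machine learning. The point is that the rule relating inputs to outputs is never written into the program; it is estimated from the data.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept of y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up "experience": the program is never told that y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Apply the learned rule to an input the model has not seen."""
    return slope * x + intercept


print(predict(5))  # 11.0
```

Refitting with more observations would update `slope` and `intercept`, which is, in miniature, what "improving from experience" means here.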
Broadly (very broadly), we’re talking about scientists and engineers. There is of course significant overlap, and indeed a plethora of related fields. You only need to check out this intense data science schema to get the drift (courtesy of a great Quora answer).
While a scientist needs to fully understand the science behind their work, an engineer is tasked with building something. If we consider the data scientist and the machine learning engineer as two roles on the same team, the data scientist does the statistical analysis required to work out the best machine learning approach, then models the algorithm and prototypes it for testing. Machine learning engineers create data funnels and deliver software solutions. They typically require strong statistics and programming skills, as well as a knowledge of software engineering. In addition to designing and building machine learning systems, they are also responsible, together with the data scientists, for running tests and experiments to monitor the performance and functionality of those systems.
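As a rough illustration of what "working out the best machine learning approach" can look like at the prototype stage, the sketch below compares two candidate models on held-out data and keeps the one with the lower error. Everything in it is an assumption made for the example: the helper names (`fit_mean`, `fit_line`, `pick_best`), the two candidates, and the small made-up data set. Real projects would typically use a dedicated library and far more data.

```python
def mean_squared_error(model, xs, ys):
    """Average squared gap between the model's predictions and the true values."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_mean(xs, ys):
    """Baseline candidate: always predict the average of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Second candidate: a straight line fitted by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def pick_best(candidates, train, test):
    """Fit each candidate on the training data and score it on the test data."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mean_squared_error(model, *test)
    best = min(scores, key=scores.get)
    return best, scores


# Made-up measurements, split into a training part and a held-out test part
train = ([1, 2, 3, 4, 5, 6], [2.9, 5.2, 6.8, 9.1, 11.0, 13.1])
test = ([7, 8], [15.0, 16.9])

candidates = {"mean baseline": fit_mean, "linear fit": fit_line}
best, scores = pick_best(candidates, train, test)
print(best)
print(scores)
```

In the division of labour described above, a data scientist would own this kind of comparison, while a machine learning engineer would turn the chosen prototype into a tested, monitored system.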
Can the two be separated? Not really. They go hand in hand: machines cannot learn without data, and data science is better done with machine learning. This is because machine learning helps analyze large amounts of data, reducing the workload of data scientists through automated processes. It has changed the way data extraction and interpretation work by introducing sets of generic, automatic methods. This means data scientists can spend more time on the other critical elements of their role (check back in on the Venn diagrams).
The division of roles is not so clear-cut, just like the definitions of data science and data scientist. Increasingly, people with a background in neither data science nor ML engineering use data to make decisions and enhance products. The availability of vast amounts of data and the rise of affordable storage technologies have triggered a meteoric rise in the need for self-serve analytics and machine learning platforms that cater to the non-expert user.
What this means is that a growing number of non-technical people are using data science methods (including ML) to extract insight from data, all without a data science or ML engineering background. Self-serve analytics, business intelligence (BI), and machine learning providers aim to assist in this area. This includes our own AI & Analytics Engine, which simplifies and accelerates the data science process, allowing expert and non-expert users to reduce the time to value of their data.
These emerging capabilities will make BI, analytics, and data-driven decision-making much more accessible, understandable, and actionable for non-technical business users. And now more than ever, this means a significant competitive edge.
So, where does machine learning fit into data science? We can split the application of machine learning broadly into data science and artificial intelligence. Whilst data science is still being defined as a domain, it includes machine learning as a principal technology. Machine learning refers to a group of techniques (used by data scientists and, increasingly, non-expert users) that allow computers to learn from data. Both depend on data, which is the indispensable fuel for each. And, with ML technologies fast becoming an integral part of most industries, ML is becoming just as integral to data science.
Looking for a guided machine learning tool, so you can get on with extracting insights from data? Trial The AI & Analytics Engine.