IEEE Talks Big Data: David Belanger & José Moura
Dr. David Belanger and Dr. José M. F. Moura are the co-leaders of the IEEE Big Data Initiative. Belanger is senior research fellow in the Business Intelligence and Analysis Program at Stevens Institute of Technology. Moura is the Philip L. and Marsha Dowd University Professor at CMU, with interests in signal processing and data science, and the 2016 IEEE VP for Technical Activities. Both are big data subject matter experts. Here they discuss the technologies that are developing to help realize opportunities in big data, where those opportunities might occur and what frameworks are needed.
Question: What comprises big data from a technological perspective?
David Belanger: Three things make up big data, in my mind. It’s the data itself. It’s technology. And it’s good applications. Those three things coming together have led us to where we are now, but there are several pieces of technology that have changed dramatically over the last few years, moving us further along. Networking, including broadband, 4G and 5G, has allowed people to get data, consume data and generate data 24/7 in all sorts of modes. There is also a variety of computer science technologies that have changed dramatically thanks to the Web community, such as NoSQL, Hadoop and Spark. Finally, there’s a whole collection of new analytics and visualization technologies that were impractical or didn’t exist in the past with smaller data.
José Moura: There certainly is more hardware and infrastructure related to big data acquisition, such as sensors for gathering it, storage technologies for housing it and protocols for transferring it across networks. But in addition, we will need to develop methodologies for extracting from the data what’s relevant and methods and algorithms that are efficient enough to handle large amounts of diverse data from different sources.
Question: Some of these technologies have existed for some time and some are relatively new. Why is big data getting so much attention now?
Moura: It is a perfect storm, a confluence of technologies that have allowed for huge amounts of data to be produced, stored, and processed. Technology is very different today from what it was just 10, or even five, years ago. If we want to time stamp it, it may be that the iPhone epitomizes the start of this revolution in that it gave us mobility: not just the ability to communicate from anywhere to anyone in the world, but smartphones as platforms for sensing, displaying, visualizing, processing, accessing and interacting with information. This also means that we have all sorts of infrastructure to support this access and communication and data flowing around. There are all of these technologies that suddenly converged, creating this glut of data.
Question: Why is it now becoming possible to actually begin to make use of this large data?
Belanger: There’s been a sea change in the technological approach to hardware from large symmetric multi-processing machines, which are fairly expensive, to very high degrees of parallelization on commodity hardware, which has cut the cost of doing these big data applications. It’s cut it so dramatically that it’s now within the reach of many users rather than just a few who could afford huge machines.
Moura: The technologies are there, and it is qualitatively different. Computing power was there in the supercomputers, but you could not carry them around, and now you can, in your pocket. Where we still have a way to go, and what we possibly don’t yet know, is how to extract useful, timely knowledge from this data. It’s a big problem, but when I say it’s a challenge, I see it as an opportunity. The great opportunity now is to extract actionable knowledge out of the data.
Question: How do we come up with a framework to take advantage of the confluence of available technologies and growth in data?
Belanger: There are many different approaches. Some of them are quite different; some of them are just subtly different. Anyone doing large-scale data analysis is not only looking at one data set but at integrating many pieces of data together from a variety of different sources. Big data is often complex data. Standards have emerged in some areas. But we’re still in the process of shaking out what will be the dominant technologies and systems for storing data, for example. Without a collection of standards, which will evolve over a period of time, we won’t be able to progress as much as we’d like. There’s a large collection of tools for data quality, integrity and security, but they are very idiosyncratic at this point. We will see both de facto and formal standards start to emerge in many cases. I don’t think we know exactly what they’re going to look like right now.
Moura: Sorting out the hardware and software is essential, but the analytics side is particularly important and a big challenge. Traditional algorithms will not be able to process data within the time window of interest in most cases, so we need to develop new, very fast algorithms. And, we need methodologies to handle the distributed nature of the data. For example, if you have data available from hundreds of video cameras or hundreds of high data rate sensors, you may not want to transfer all of that data to a centralized location. You would want to spread out the processing to the periphery where the data is being acquired and only extract relevant information at a much more reduced data rate. If you are interacting with the cloud, for example, you want to be able to do it at a high level. You don’t want to be sending raw data, you want to send a filtered version of the data.
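Editor's illustration: the idea of processing at the periphery and sending only a filtered version of the data can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a method described by either speaker. The names `WindowSummary` and `summarize_at_edge`, the sensor identifier, and the alert threshold are all hypothetical, and a real deployment would choose its own summary statistics.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, List


@dataclass
class WindowSummary:
    """Compact record an edge node would send instead of raw readings (hypothetical)."""
    sensor_id: str
    count: int
    mean: float
    maximum: float
    alerts: List[float]  # only the readings that exceeded the threshold


def summarize_at_edge(sensor_id: str,
                      readings: Iterable[float],
                      alert_threshold: float) -> WindowSummary:
    """Reduce one window of raw readings to a summary, keeping only exceptional values."""
    values = list(readings)
    if not values:
        raise ValueError("window contains no readings")
    return WindowSummary(
        sensor_id=sensor_id,
        count=len(values),
        mean=mean(values),
        maximum=max(values),
        alerts=[v for v in values if v > alert_threshold],
    )


if __name__ == "__main__":
    # Made-up readings for one window; only the summary would leave the device.
    raw_window = [1.0, 2.0, 3.0, 10.0]
    print(summarize_at_edge("sensor-01", raw_window, alert_threshold=5.0))
```

The design choice here is that the raw window never leaves the device: the central system receives a handful of numbers per window plus any readings that crossed the threshold, which is one simple way to trade detail for a much lower data rate.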
Question: What else besides technology and standards do we need to think about when creating a framework for big data?
Moura: The other big issue is the questions that are asked, or to which we can give answers. Although we have a glut of data, the questions of interest remain very simple. Sometimes we will be looking for a needle in a haystack and we know what that needle is; we simply have a hard time finding it. But we also need to think in terms of not just finding out what most people are doing based on this data, but the exceptions that may be a precursor to something important that may happen. But we may not even know what the interesting or important question to ask is. Traditional thinking suggests we already know what questions to ask. The data is going to inform us of what relevant questions we should be asking. We need to be open to seeing patterns that we are not necessarily looking for. The anomalies are just as important as the trends.
Question: Are there sectors that tend to be more proactive in moving forward with big data initiatives?
Belanger: If you look at the Web industry, particularly those who develop apps, these people are interacting in various ways with each other and with machinery, and by their very nature have had to be drivers of big data because of the type of data that they work with and because of their business model. It’s completely dependent on being able to analyze lots of data and derive things such as marketing plans and recommender systems, which use big data. Other industries such as telecom and finance have been handling massive amounts of varying kinds of data at very high speeds for more than a decade. But they have often been using proprietary tools, thus not leading to standards. There are a number of scientific endeavours that have also been using big data for quite a while; the various genome activities fit there. Cyber security is another. There are many different players involved and they all have different motivations.
Moura: It goes without saying that medical and health are areas where the availability of data has led to the development of new tools and technologies to analyze it all. Another area is the process industry, such as manufacturing and its instrumentation of factories and plants with lots of sensors. A lot of data is being processed to better manage industrial manufacturing processes. Another area is urban science, where big data can be used to make cities more liveable from an environmental, sustainability, and traffic or flows point of view. Some industries have standards, but additional standards will need to be developed.
Belanger: There are plenty of examples of employing many sensors to collect huge amounts of data, the so-called Internet of Things, and doing analysis on the data. There are also examples of completing the control chain by not only doing the analysis but also feeding the data back, to control the devices for complete automation of the analytical and decision results – self-driving cars, for example. This will become more important over time.