[Review] Data Science: Challenges and Directions


Data Science: Challenges and Directions (Communications of the ACM, Volume 60 Issue 8, August 2017)


With the amount of data exploding every day, what sense can we make of it? How can we make effective use of all this data and gain wisdom for decision making? How can we analyze large data sets more effectively than we could before? Data science seeks to answer these questions by finding the hidden complexities and intelligence in vast sets of data.

By definition, data science is an interdisciplinary field of scientific methods, processes, and systems for extracting knowledge and insight from data in various forms. More generally, it can be seen as a transformation from the unknown to the known, achieved by overcoming both the immaturity of our capacity and capability and the limited visibility of the data and physical worlds. For this reason, the discussion of data science covers not only data-related domains like statistics, computing, and informatics but also traditionally less data-related fields like social sciences and business management.

To discuss how to solve a problem, we must first address the problem itself. Data science problems carry many inherent complexities and forms of intelligence. These are generalized as X-complexities and X-intelligence, where X is replaced by a word naming the specific category.

Uncovering X-complexities is the core objective of data science. Data science problems can be viewed as a complex entanglement of many X-complexities that we wish to uncover. The types are data, behavior, domain, social, environment, learning, and deliverable. Three of the seven seem worthy of a brief overview in this summary. Behavior complexity refers to the challenges involved in understanding what actually takes place in business activities, where the data obtained often simplifies or ignores the physical world. Another is learning (process and system) complexity, which must be addressed to achieve the goal of data analytics. Data scientists should be able to design appropriate experiments and mechanisms for managing data, and to learn from multiple heterogeneous and non-iid data sources, sometimes in environments with poor computational resources. Lastly, deliverable complexity sits slightly outside data science itself. When the goal of solving a problem is to obtain actionable insight and assist decision making, data science should not only identify and evaluate the outcomes but also communicate the results effectively to nonprofessionals.

On the other hand, X-intelligence is what we use to address complex data science problems. In the attempt to transform data into knowledge and the unknown into the known, X-intelligence helps us deal better with the underlying complexities and challenges. The types are data, behavior, domain, human, network, organizational, social, and environmental, of which three are discussed in this summary. Qualitative domain intelligence helps identify which domain of knowledge can be used to analyze the data. Human intelligence, such as imaginative thinking, emotional intelligence, or inspiration, is important because the acquisition of data always involves human interaction. Human intelligence extends naturally to social intelligence, since people take part in social interactions and networks using their own human intelligence. The collective actions of social systems and artificial social networks must be understood in terms of social intelligence.

Data science problems are given to us as complex data systems, which we are to render into an open complex X-intelligence system. In such problems, many forms of X-intelligence and X-complexity are woven together, and they should be solved following the “meta-synthetic engineering” theory, which integrates ubiquitous intelligence from a systems point of view.

In which direction, then, does data science seek to advance? There are two main goals: non-iid data learning and human-like intelligence.

Until recently, mathematical, statistical, and analytical methods were built on relatively narrow assumptions. However, as we move into the era of big data, these assumptions are increasingly violated, making model outcomes inaccurate and misleading. One of the major violations concerns the ‘iid assumption’.

IID stands for ‘independent and identically distributed’. A set of random variables is independent and identically distributed if each variable has the same probability distribution as the others and all are mutually independent. Big, complex data are essentially non-iid, whereas most existing analytical methods assume iid data. In order to reinvent existing data science theories and develop new ones for non-iidness and non-iid data learning, the following are needed: a deep understanding of non-iid data characteristics; non-iid feature analysis and construction; non-iid learning theories, algorithms, and models; and non-iid similarity and evaluation.
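To make the iid definition concrete, here is a minimal sketch that is not taken from the article. It assumes a standard normal distribution for the iid case and uses a first-order autoregressive sequence (the hypothetical `ar1_sample` helper, with an arbitrarily chosen coefficient `phi=0.9`) as one simple example of non-iid data. It then compares the lag-1 autocorrelation of the two samples. A value near zero is only consistent with independence and does not prove it, but a value far from zero is a clear sign that the independence assumption is violated.

```python
import random


def lag1_autocorrelation(xs):
    """Sample correlation between each value and its successor."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


def iid_sample(n, rng):
    # Every draw comes from the same N(0, 1) and ignores earlier draws.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def ar1_sample(n, rng, phi=0.9):
    # Hypothetical non-iid example: each value depends on the previous one.
    xs = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        xs.append(phi * xs[-1] + rng.gauss(0.0, 1.0))
    return xs


rng = random.Random(0)  # fixed seed so the run is repeatable
iid_r = lag1_autocorrelation(iid_sample(5000, rng))
ar1_r = lag1_autocorrelation(ar1_sample(5000, rng))

print(f"iid lag-1 autocorrelation:   {iid_r:.3f}")
print(f"AR(1) lag-1 autocorrelation: {ar1_r:.3f}")
```

In a run of this sketch the first number should be close to zero and the second close to `phi`. A method that assumes iid inputs would treat both samples the same way, which is the kind of mismatch the article warns about.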

With a focus on big data analysis, data science is now driving a technological revolution from logical-thinking machines to creative-thinking machine intelligence. Imagination, creativity, and curiosity have traditionally been regarded as critical human capabilities. To render machines creative, existing data and computer architectures, theories, and science must be fundamentally reformed.

In conclusion, data science seeks to reveal the X-complexities and X-intelligence that characterize complex data science problems, which reflect the gaps between the world of hidden data and the immaturity of existing data science. Filling these gaps requires a cross-disciplinary effort to refactor existing methodologies from a complex-system perspective. The emerging evolution of data science also hints at opportunities for breakthrough innovation. The socio-economic and cultural impact of data science will be unprecedented indeed.
