Data analysis
Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.[1] Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains.[2] In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.[3]
Data mining is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information.[4] In statistical applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA).[5] EDA focuses on discovering new features in the data while CDA focuses on confirming or falsifying existing hypotheses.[6][7] Predictive analytics focuses on the application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All of the above are varieties of data analysis.[8]
Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination.[9]
Author Jonathan Koomey has recommended a series of best practices for understanding quantitative data.[60] These include:
For the variables under examination, analysts typically obtain descriptive statistics, such as the mean (average), median, and standard deviation.[61] They may also analyze the distribution of the key variables to see how the individual values cluster around the mean.[62]
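As a minimal illustration, the descriptive statistics named above can be computed with the pandas library; the DataFrame and the "income" column below are hypothetical placeholders, not data from any referenced study.

import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"income": [42000, 51000, 38500, 60250, 45800, 39900, 72100]})

print(df["income"].mean())    # mean (average)
print(df["income"].median())  # median
print(df["income"].std())     # sample standard deviation

# describe() summarizes count, mean, std, min, quartiles, and max,
# giving a quick view of how the values cluster around the mean.
print(df["income"].describe())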
Consultants at McKinsey & Company named the technique of breaking a quantitative problem down into its component parts the MECE principle.[63] Each layer can be broken down into its components; each of the sub-components must be mutually exclusive of each other and must collectively add up to the layer above them.[64] The relationship is referred to as "Mutually Exclusive and Collectively Exhaustive", or MECE. For example, profit by definition can be broken down into total revenue and total cost.[65] In turn, total revenue can be analyzed by its components, such as the revenue of divisions A, B, and C, which are mutually exclusive of each other and should add up to the total revenue (collectively exhaustive).[66]
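The revenue example above can be expressed as a short consistency check; the division figures below are hypothetical and serve only to illustrate that mutually exclusive components should sum to the layer above them.

# Hypothetical figures illustrating the MECE decomposition described above.
revenue_by_division = {"A": 120.0, "B": 80.0, "C": 50.0}  # mutually exclusive components
reported_total_revenue = 250.0
total_cost = 190.0

# Collectively exhaustive: the division revenues should add up to the layer above them.
assert abs(sum(revenue_by_division.values()) - reported_total_revenue) < 1e-9

# Profit, by definition, breaks down into total revenue and total cost.
profit = reported_total_revenue - total_cost
print(profit)  # 60.0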
Analysts may use robust statistical measurements to solve certain analytical problems.[67] Hypothesis testing is used when the analyst makes a particular hypothesis about the true state of affairs and gathers data to determine whether that hypothesis is true or false.[68][69] For example, the hypothesis might be that "Unemployment has no effect on inflation", which relates to an economics concept called the Phillips Curve.[70] Hypothesis testing involves considering the likelihood of Type I errors (false positives) and Type II errors (false negatives), which relate to whether the data supports accepting or rejecting the hypothesis.[71][72]
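A minimal sketch of such a test uses SciPy's linear regression to test the null hypothesis that unemployment has no effect on inflation; the unemployment and inflation figures below are invented purely for illustration and do not come from any cited source.

from scipy import stats

# Hypothetical illustration data (percent values).
unemployment = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5]   # X
inflation    = [5.1, 4.6, 4.4, 3.9, 3.2, 3.0, 2.4, 2.1]   # Y

# Null hypothesis: unemployment has no effect on inflation (slope = 0).
result = stats.linregress(unemployment, inflation)
print(result.slope, result.pvalue)

# Rejecting a true null hypothesis would be a Type I error (false positive);
# failing to reject a false null hypothesis would be a Type II error (false negative).
if result.pvalue < 0.05:
    print("Reject the null hypothesis of no effect")
else:
    print("Fail to reject the null hypothesis")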
Regression analysis may be used when the analyst is trying to determine the extent to which independent variable X affects dependent variable Y (e.g., "To what extent do changes in the unemployment rate (X) affect the inflation rate (Y)?").[73] This is an attempt to model or fit an equation, line, or curve to the data, such that Y is a function of X.[74][75]
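A minimal sketch of fitting a straight line Y = aX + b can be done with NumPy's least-squares polynomial fit; the data points below are hypothetical and reused from the illustration above.

import numpy as np

# Hypothetical illustration data.
x = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5])  # unemployment rate (X)
y = np.array([5.1, 4.6, 4.4, 3.9, 3.2, 3.0, 2.4, 2.1])  # inflation rate (Y)

# Least-squares fit of a first-degree polynomial (a straight line) to the data.
slope, intercept = np.polyfit(x, y, 1)
print(f"Y = {slope:.2f} * X + {intercept:.2f}")

# The fitted line can then be used to estimate Y for a new value of X.
print(slope * 5.2 + intercept)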
Necessary condition analysis (NCA) may be used when the analyst is trying to determine the extent to which independent variable X allows variable Y (e.g., "To what extent is a certain unemployment rate (X) necessary for a certain inflation rate (Y)?").[73] Whereas (multiple) regression analysis uses additive logic, in which each X-variable can produce the outcome and the X-variables can compensate for each other (they are sufficient but not necessary),[76] NCA uses necessity logic, in which one or more X-variables allow the outcome to exist but may not produce it (they are necessary but not sufficient). Each single necessary condition must be present, and compensation is not possible.[77]
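NCA is normally carried out with dedicated tooling and ceiling-line techniques; the sketch below is only a simplified illustration of necessity logic, showing how one might look for the lowest level of X at which a given outcome level has been observed. The data points are hypothetical.

import numpy as np

# Hypothetical illustration data: condition X and outcome Y.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0.5, 1.8, 1.2, 3.5, 2.9, 5.1, 4.4, 6.9])

# Step-function "ceiling": the highest Y observed at or below each level of X.
order = np.argsort(x)
ceiling = np.maximum.accumulate(y[order])
print(dict(zip(x[order], ceiling)))

# If an outcome level y0 is only ever observed when X >= some threshold,
# that threshold is the level of X "necessary" for y0 in this sample.
y0 = 4.0
necessary_x = x[order][np.argmax(ceiling >= y0)] if (ceiling >= y0).any() else None
print(necessary_x)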
Other topics[edit]
Smart buildings[edit]
A data analytics approach can be used to predict energy consumption in buildings.[103] The steps of the data analysis process are carried out to realise smart buildings, in which building management and control operations, including heating, ventilation, air conditioning, lighting, and security, are performed automatically by mimicking the needs of the building users and optimising resources such as energy and time.[104]
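As an illustrative sketch rather than a specific published model, building energy consumption could be predicted from a few assumed features, such as outdoor temperature, occupancy, and hour of day, using scikit-learn; the feature choice and figures below are hypothetical.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature columns: outdoor temperature (deg C), occupancy (people), hour of day.
X = np.array([[ 5, 20,  9],
              [10, 35, 11],
              [18, 40, 14],
              [25, 30, 16],
              [30, 10, 19],
              [ 2, 25,  8]])
y = np.array([310, 280, 240, 260, 300, 330])  # energy consumption (kWh), hypothetical

model = LinearRegression().fit(X, y)

# Predict consumption for a new hour: 12 deg C outside, 28 occupants, 10:00.
print(model.predict([[12, 28, 10]]))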
Notable free software for data analysis includes:
Reproducible analysis[edit]
The typical data analysis workflow involves collecting data, running analyses through various scripts, creating visualizations, and writing reports. However, this workflow presents challenges, including a separation between analysis scripts and data, as well as a gap between analysis and documentation. Often, the correct order of running scripts is only described informally or resides in the data scientist's memory. The potential for losing this information creates issues for reproducibility. To address these challenges, it is essential to have analysis scripts written for automated, reproducible workflows. Additionally, dynamic documentation is crucial, providing reports that are understandable by both machines and humans, ensuring accurate representation of the analysis workflow even as scripts evolve.[150]
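One minimal way to make the run order explicit is to encode the whole workflow in a single script that collects the data, runs the analysis, and writes a machine- and human-readable report; the step names and output file below are hypothetical placeholders rather than a prescribed tool.

import json
from datetime import datetime, timezone

def collect_data():
    # Stand-in for reading raw data from disk or a database.
    return [1, 2, 3, 4, 5]

def analyze(data):
    return {"n": len(data), "mean": sum(data) / len(data)}

def report(results, path="report.json"):
    # A machine- and human-readable record of what was computed and when.
    results["generated_at"] = datetime.now(timezone.utc).isoformat()
    with open(path, "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    # Running this one script re-executes every step in the documented order.
    data = collect_data()
    results = analyze(data)
    report(results)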