Data analysis
Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.[1] Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains.[2] In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.[3]
Data mining is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information.[4] In statistical applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA).[5] EDA focuses on discovering new features in the data while CDA focuses on confirming or falsifying existing hypotheses.[6][7] Predictive analytics focuses on the application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All of the above are varieties of data analysis.[8]
Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination.[9]
Author Jonathan Koomey has recommended a series of best practices for understanding quantitative data.[60] These include:
For the variables under examination, analysts typically obtain descriptive statistics, such as the mean (average), median, and standard deviation.[61] They may also analyze the distribution of the key variables to see how the individual values cluster around the mean.[62]
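As a minimal illustration, the descriptive statistics named above can be computed with the pandas library; the DataFrame and the "income" column below are hypothetical placeholders, not data from any referenced study.

import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"income": [42000, 51000, 38500, 60250, 45800, 39900, 72100]})

print(df["income"].mean())    # mean (average)
print(df["income"].median())  # median
print(df["income"].std())     # sample standard deviation

# describe() summarizes count, mean, std, min, quartiles, and max,
# giving a quick view of how the values cluster around the mean.
print(df["income"].describe())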
Consultants at McKinsey & Company named the technique of breaking a quantitative problem down into its component parts the MECE principle.[63] Each layer can be broken down into its components; each of the sub-components must be mutually exclusive of each other and must collectively add up to the layer above them.[64] The relationship is referred to as "Mutually Exclusive and Collectively Exhaustive", or MECE. For example, profit by definition can be broken down into total revenue and total cost.[65] In turn, total revenue can be analyzed by its components, such as the revenue of divisions A, B, and C, which are mutually exclusive of each other and should add up to the total revenue (collectively exhaustive).[66]
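The revenue example above can be expressed as a short consistency check; the division figures below are hypothetical and serve only to illustrate that mutually exclusive components should sum to the layer above them.

# Hypothetical figures illustrating the MECE decomposition described above.
revenue_by_division = {"A": 120.0, "B": 80.0, "C": 50.0}  # mutually exclusive components
reported_total_revenue = 250.0
total_cost = 190.0

# Collectively exhaustive: the division revenues should add up to the layer above them.
assert abs(sum(revenue_by_division.values()) - reported_total_revenue) < 1e-9

# Profit, by definition, breaks down into total revenue and total cost.
profit = reported_total_revenue - total_cost
print(profit)  # 60.0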
Analysts may use robust statistical measurements to solve certain analytical problems.[67] Hypothesis testing is used when the analyst makes a particular hypothesis about the true state of affairs and gathers data to determine whether that hypothesis is true or false.[68][69] For example, the hypothesis might be that "Unemployment has no effect on inflation", which relates to an economics concept called the Phillips Curve.[70] Hypothesis testing involves considering the likelihood of Type I errors (false positives) and Type II errors (false negatives), which relate to whether the data supports accepting or rejecting the hypothesis.[71][72]
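A minimal sketch of such a test uses SciPy's linear regression to test the null hypothesis that unemployment has no effect on inflation; the unemployment and inflation figures below are invented purely for illustration and do not come from any cited source.

from scipy import stats

# Hypothetical illustration data (percent values).
unemployment = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5]   # X
inflation    = [5.1, 4.6, 4.4, 3.9, 3.2, 3.0, 2.4, 2.1]   # Y

# Null hypothesis: unemployment has no effect on inflation (slope = 0).
result = stats.linregress(unemployment, inflation)
print(result.slope, result.pvalue)

# Rejecting a true null hypothesis would be a Type I error (false positive);
# failing to reject a false null hypothesis would be a Type II error (false negative).
if result.pvalue < 0.05:
    print("Reject the null hypothesis of no effect")
else:
    print("Fail to reject the null hypothesis")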
Regression analysis may be used when the analyst is trying to determine the extent to which independent variable X affects dependent variable Y (e.g., "To what extent do changes in the unemployment rate (X) affect the inflation rate (Y)?").[73] This is an attempt to model or fit an equation, line, or curve to the data, such that Y is a function of X.[74][75]
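A minimal sketch of fitting a straight line Y = aX + b can be done with NumPy's least-squares polynomial fit; the data points below are hypothetical and reused from the illustration above.

import numpy as np

# Hypothetical illustration data.
x = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5])  # unemployment rate (X)
y = np.array([5.1, 4.6, 4.4, 3.9, 3.2, 3.0, 2.4, 2.1])  # inflation rate (Y)

# Least-squares fit of a first-degree polynomial (a straight line) to the data.
slope, intercept = np.polyfit(x, y, 1)
print(f"Y = {slope:.2f} * X + {intercept:.2f}")

# The fitted line can then be used to estimate Y for a new value of X.
print(slope * 5.2 + intercept)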
Necessary condition analysis (NCA) may be used when the analyst is trying to determine the extent to which independent variable X allows variable Y (e.g., "To what extent is a certain unemployment rate (X) necessary for a certain inflation rate (Y)?").[73] Whereas (multiple) regression analysis uses additive logic, in which each X-variable can produce the outcome and the X-variables can compensate for each other (they are sufficient but not necessary),[76] NCA uses necessity logic, in which one or more X-variables allow the outcome to exist but may not produce it (they are necessary but not sufficient). Each single necessary condition must be present, and compensation is not possible.[77]
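NCA is normally carried out with dedicated tooling and ceiling-line techniques; the sketch below is only a simplified illustration of necessity logic, showing how one might look for the lowest level of X at which a given outcome level has been observed. The data points are hypothetical.

import numpy as np

# Hypothetical illustration data: condition X and outcome Y.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0.5, 1.8, 1.2, 3.5, 2.9, 5.1, 4.4, 6.9])

# Step-function "ceiling": the highest Y observed at or below each level of X.
order = np.argsort(x)
ceiling = np.maximum.accumulate(y[order])
print(dict(zip(x[order], ceiling)))

# If an outcome level y0 is only ever observed when X >= some threshold,
# that threshold is the level of X "necessary" for y0 in this sample.
y0 = 4.0
necessary_x = x[order][np.argmax(ceiling >= y0)] if (ceiling >= y0).any() else None
print(necessary_x)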
Other topics[edit]
Smart buildings[edit]
A data analytics approach can be used to predict energy consumption in buildings.[103] The steps of the data analysis process are carried out to realise smart buildings, in which building management and control operations, including heating, ventilation, air conditioning, lighting, and security, are performed automatically by mimicking the needs of the building users and optimising resources such as energy and time.[104]
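As an illustrative sketch rather than a specific published model, building energy consumption could be predicted from a few assumed features, such as outdoor temperature, occupancy, and hour of day, using scikit-learn; the feature choice and figures below are hypothetical.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature columns: outdoor temperature (deg C), occupancy (people), hour of day.
X = np.array([[ 5, 20,  9],
              [10, 35, 11],
              [18, 40, 14],
              [25, 30, 16],
              [30, 10, 19],
              [ 2, 25,  8]])
y = np.array([310, 280, 240, 260, 300, 330])  # energy consumption (kWh), hypothetical

model = LinearRegression().fit(X, y)

# Predict consumption for a new hour: 12 deg C outside, 28 occupants, 10:00.
print(model.predict([[12, 28, 10]]))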
Notable free software for data analysis includes:
Reproducible analysis[edit]
The typical data analysis workflow involves collecting data, running analyses through various scripts, creating visualizations, and writing reports. However, this workflow presents challenges, including a separation between analysis scripts and data, as well as a gap between analysis and documentation. Often, the correct order of running scripts is only described informally or resides in the data scientist's memory. The potential for losing this information creates issues for reproducibility. To address these challenges, it is essential to have analysis scripts written for automated, reproducible workflows. Additionally, dynamic documentation is crucial, providing reports that are understandable by both machines and humans, ensuring accurate representation of the analysis workflow even as scripts evolve.[150]
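One minimal way to make the run order explicit is to encode the whole workflow in a single script that collects the data, runs the analysis, and writes a machine- and human-readable report; the step names and output file below are hypothetical placeholders rather than a prescribed tool.

import json
from datetime import datetime, timezone

def collect_data():
    # Stand-in for reading raw data from disk or a database.
    return [1, 2, 3, 4, 5]

def analyze(data):
    return {"n": len(data), "mean": sum(data) / len(data)}

def report(results, path="report.json"):
    # A machine- and human-readable record of what was computed and when.
    results["generated_at"] = datetime.now(timezone.utc).isoformat()
    with open(path, "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    # Running this one script re-executes every step in the documented order.
    data = collect_data()
    results = analyze(data)
    report(results)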