Statistical analysis with Python

Statistical Data Analysis in Python

fonnesbeck/statistical-analysis-python-tutorial

Introductory Tutorial, SciPy 2013, 25 June 2013

Christopher Fonnesbeck — Vanderbilt University School of Medicine

Chris Fonnesbeck is an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine. He specializes in computational statistics, Bayesian methods, meta-analysis, and applied decision analysis. He originally hails from Vancouver, BC and received his Ph.D. from the University of Georgia.

This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course consists of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in NumPy, SciPy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to bootstrapping methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

The target audience for the tutorial includes all new Python users, though we recommend that users also attend the NumPy and IPython session in the introductory track.

Students familiar with Git may simply clone this repository to obtain all the materials (IPython notebooks and data) for the tutorial. Alternatively, you may download a zip file containing the materials. A third option is to simply view static notebooks by clicking on the titles of each section below.

  • Importing data
  • Series and DataFrame objects
  • Indexing, data selection and subsetting
  • Hierarchical indexing
  • Reading and writing files
  • Sorting and ranking
  • Missing data
  • Data summarization
  • Date/time types
  • Merging and joining DataFrame objects
  • Concatenation
  • Reshaping DataFrame objects
  • Pivoting
  • Data transformation
  • Permutation and sampling
  • Data aggregation and GroupBy operations
  • Plotting in Pandas vs Matplotlib
  • Bar plots
  • Histograms
  • Box plots
  • Grouped plots
  • Scatterplots
  • Trellis plots
  • Statistical modeling
  • Fitting data to probability distributions
  • Fitting regression models
  • Model selection
  • Bootstrapping

Required packages:

  • Python 2.7 or higher (including Python 3)
  • pandas >= 0.11.1 and its dependencies
  • NumPy >= 1.6.1
  • matplotlib >= 1.0.0
  • pytz
  • IPython >= 0.12
  • pyzmq
  • tornado

Optional: statsmodels, xlrd and openpyxl

For students running the latest version of Mac OS X (10.8), the easiest way to obtain all the packages is to install the SciPy Superpack, which works with the Python 2.7.2 that ships with OS X.

Otherwise, another easy way to install all the necessary packages is to use Continuum Analytics’ Anaconda.

Though targeted to ecologists, Mangel and Hilborn identify key methods that scientists can use to build useful and credible models for their data. They don’t shy away from the math, but the book is very readable and example-laden.

The go-to reference for applied hierarchical modeling.

A comprehensive machine learning guide for statisticians.

An excellent, approachable book to get started with Bayesian methods.

Frank Harrell’s bag of tricks for regression modeling. I pull this off the shelf every week.

Statistical Data Analysis in Python by Christopher Fonnesbeck is licensed under a Creative Commons Attribution 4.0 International License.

Statistical Analysis in Python

Statistical analysis of data is the extraction of useful knowledge from vague or complex data. Python is widely used for this work, typically with the data held in pandas DataFrame objects. Much of the effort goes into importing, cleaning, and transforming the data in preparation for analysis. A dataset, often loaded from a CSV file, is processed by Python libraries at every stage from preprocessing to the final result. Commonly used libraries include pandas, statsmodels, and seaborn. Together they cover data representation, comparison, visualization and plotting, statistical testing, indexing, alignment, and the handling of missing data. Python can also combine statistics with other kinds of analysis, such as image analysis or text mining.

How to Perform Statistical Analysis?

Statistical analysis of data in Python can be broken into the following parts:

1. Data Collection/ Representation

The data can relate to business, politics, education, or any other field, as long as it can be viewed as a 2D table (a matrix) in which the columns give the different attributes of the data and the rows give the observations. A dataset is often a mixture of numerical and categorical values. Python can read data in CSV format using the pandas library, which is built on NumPy, a library for array data structures. Each column of a dataset can be extracted as an array for further processing or analysis. The data can even be an image, which is converted into a 2D matrix and stored in an array for further processing.
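
As a minimal sketch of this step, the snippet below loads a small CSV table with pandas and pulls one column out as a NumPy array. The CSV content, column names, and values are invented for illustration, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """name,department,age,salary
Ana,sales,34,52000
Ben,engineering,41,88000
Chloe,engineering,29,79000
Dev,sales,52,61000
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Rows are observations, columns are attributes.
n_rows, n_cols = df.shape

# Numerical and categorical columns are stored with different dtypes.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# A single column can be extracted as a NumPy array for further processing.
ages = df["age"].to_numpy()

print(n_rows, n_cols, numeric_cols, ages)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file's path.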

2. Descriptive Statistics

Descriptive statistics summarize the data and help reveal patterns in it. They only describe the data; they do not make any predictions about it. Common descriptive statistics include the mean, median, mode, variance, and standard deviation. In Python they can be computed with the standard library module statistics, which provides functions for each of these measures. This kind of analysis gives the user basic summary figures about the data. The mean, median, and mode are measures of central tendency, which locate the center or middle of the data. The standard deviation measures the spread of the data around its mean. The variance also measures how far the individual values in a group are spread out; it is the square of the standard deviation.
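
The measures above can be sketched with the standard library's statistics module. The data values are made up for illustration. Note that the module distinguishes population measures (pvariance, pstdev) from sample measures (variance, stdev).

```python
import statistics

# Hypothetical data values chosen for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of spread, treating the data as the whole population.
pop_variance = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

# Sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(data)

print(mean, median, mode, pop_variance, pop_sd, sample_variance)
```

Here the population variance is the square of the population standard deviation, as described above.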

3. Inferential Statistics

Inferential statistics draws conclusions about a large population from a sample of its data. Predictions about the population are made from random samples. Predicting a dependent variable from one or more independent variables also falls under inferential statistics. To make such predictions, a model is fitted to training samples and learns the relationship between the dependent and independent variables. Based on what it has learned and on the type of model, it can then make predictions for new data. Some of the technical terms used in this setting are listed below:

  • Z score: The z score expresses how many standard deviations a data value lies from the mean of the data. To compute it, subtract the mean from each data value and divide the result by the standard deviation. Z scores are typically computed for a column of the dataset, and they tell whether a value is typical for that dataset. When the data follow a normal distribution, a z score can be converted into a probability, which helps us decide whether to reject the null hypothesis. The null hypothesis states that there is no effect or pattern in the data beyond chance. Z scores can be computed with the zscore function in the stats module of the “scipy” library.
  • Z test: The z test checks whether the means of two samples of data differ, assuming their variances (and hence standard deviations) are known. It is a hypothesis test whose test statistic follows a normal distribution, and it is used for large samples. The null hypothesis is that the two means are equal. A significance level (say 5%) is set in advance: if the p-value is greater than the significance level, the null hypothesis is not rejected and the two datasets are taken not to differ significantly; if it is smaller, the null hypothesis is rejected. The z test can be carried out using the “statsmodels” library in Python.
  • T-test: The t-test also determines whether the means of two datasets differ. It is similar to the z test, but it does not require the population variance to be known and is the usual choice for small samples (a common rule of thumb is fewer than 30 observations). The t-test can be carried out using the “scipy” library, with the data held in numpy arrays or pandas objects.
  • F test: The F test uses the F-distribution. It determines whether two samples of data have equal variances by comparing the ratio of their sample variances. Under the null hypothesis the variances are equal, so the ratio should be close to one; the null hypothesis is rejected when the ratio is far enough from one. A significance level sets how much difference between the two samples is tolerated before it is considered significant. The test can be carried out using the F-distribution provided by the “scipy” library.
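
To make the arithmetic behind these terms concrete, here is a hand-written sketch using only the standard library. It is not the scipy or statsmodels implementation. The function names, sample values, and assumed known standard deviations are all hypothetical. It computes z scores, a two-sided two-sample z test, and the variance ratio used as the F statistic.

```python
import math
import statistics
from statistics import NormalDist


def z_scores(values):
    """Return (x - mean) / sd for every x, using the population sd."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def two_sample_z_test(a, b, sigma_a, sigma_b):
    """Two-sided z test for equal means with known population sds."""
    standard_error = math.sqrt(sigma_a**2 / len(a) + sigma_b**2 / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def variance_ratio(a, b):
    """F statistic: ratio of the two sample variances."""
    return statistics.variance(a) / statistics.variance(b)


# Hypothetical samples; the known sd of 0.2 is an assumption.
scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
group_a = [5.1, 4.9, 5.0, 5.2, 4.8]
group_b = [6.0, 6.1, 5.9, 6.2, 5.8]

z, p_value = two_sample_z_test(group_a, group_b, sigma_a=0.2, sigma_b=0.2)
reject_equal_means = p_value < 0.05  # 5% significance level

f_stat = variance_ratio(group_a, group_b)

print(scores, z, p_value, reject_equal_means, f_stat)
```

Turning the F statistic or a t statistic into a p-value needs the F or t distribution, which the standard library does not provide. In practice that part is usually delegated to scipy.stats (for example ttest_ind for the two-sample t-test).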

4. Correlation Matrix

The correlation matrix is used to reveal patterns in a dataset. It is a table showing the correlation coefficient between each pair of variables in the dataset. It describes how the variables are related and helps us understand how changes in one variable are associated with changes in another. It is often used when building linear or multiple regression models. Correlation is a function of covariance: the correlation coefficient of two variables is the ratio of their covariance to the product of their standard deviations. It measures the strength of the linear dependency between the two variables.
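
The ratio described above can be sketched directly with the standard library. The helper names and the three columns are hypothetical and chosen so the result is easy to check by eye. With pandas, the DataFrame.corr method produces the same kind of table.

```python
import statistics


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    products = ((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sum(products) / (len(x) - 1)


def correlation(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    return covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {
        row: {col: correlation(columns[row], columns[col]) for col in names}
        for row in names
    }


# Hypothetical dataset: score rises with hours, errors fall with hours.
columns = {
    "hours": [1, 2, 3, 4, 5],
    "score": [2, 4, 6, 8, 10],
    "errors": [10, 8, 6, 4, 2],
}
matrix = correlation_matrix(columns)

for name, row in matrix.items():
    print(name, {col: round(value, 3) for col, value in row.items()})
```

Each variable has a coefficient of about 1 with itself. A perfectly increasing linear relationship also gives about 1, and a perfectly decreasing one gives about -1.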

Importance of Statistical Analysis of Data

Statistical analysis of data is important because it saves time and helps in finding good solutions to a problem. It can be carried out efficiently in Python, where libraries are available for every stage of the analysis. These libraries handle details such as the scaling of data, and they replace complex mathematical expressions with ready-made functions. The analysis is fast and gives accurate knowledge about the data, which can then be used for tasks such as prediction or classification. Statistical analysis supports good decisions about data: it helps us focus on the data that matters and choose an efficient path for accessing and processing it.

Conclusion

Statistical analysis of data is the acquisition of knowledge about data in order to simplify complex data so that it can be processed further. Python's libraries do this job effectively and in little time. The goal of data analysis is to make sense of complex data and to help us make sound decisions based on it.
