Big Data Analytics with Python

worldbank/BDA-with-Python

Big Data Analytics with Python

This repository contains content for the Big Data Analytics with Python course. In its latest iteration, the course was taught at the African Institute for Mathematical Sciences (AIMS), Rwanda, in 2022 and 2023 as part of the Master of Science in Mathematical Sciences (Data Science stream) programme. For more details about this master's programme, see the AIMS website.

This course can be thought of as a practical guide to working with large-scale datasets. Its principal aim is to introduce participants to the ecosystem of technologies for working with such datasets, including technologies for data storage, data processing, and building machine learning models, in the most practical way possible, using Python as the programming language. For more details about the course content, refer to this outline; the main modules taught in the course are listed below.

  • Module 1: Big data basics. The core message See the lecture slides here.
  • Module 2: Functional programming and distributed data processing. See the lecture slides here and the corresponding notebook here.
  • Module 3: Data gathering from the Web. See the lecture slides here and the corresponding notebooks here and here.
  • Module 4: The Hadoop ecosystem. See the lecture slides here.
  • Module 5: Introduction to Apache Spark. See the lecture slides here and the corresponding notebook here.
  • Module 6: Data wrangling with Spark’s structured APIs. See the lecture slides here and the corresponding notebook here.
  • Module 7: Machine Learning with Apache Spark. See the lecture slides here and the corresponding notebook here.
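
As a small taste of Module 2, the functional style that distributed engines build on can be sketched in plain Python. This is an illustrative example and is not taken from the course notebooks; the sample records and the helper name `total_large_sales` are assumptions made for the sketch.

```python
from functools import reduce

# Hypothetical sample records: (region, sale amount).
sales = [("north", 120.0), ("south", 35.5), ("north", 80.0), ("east", 15.0)]


def total_large_sales(records, threshold):
    """Sum the amounts above `threshold` using filter, map and reduce."""
    large = filter(lambda rec: rec[1] > threshold, records)  # keep large sales
    amounts = map(lambda rec: rec[1], large)                 # project the amount
    return reduce(lambda acc, x: acc + x, amounts, 0.0)      # fold into a total


print(total_large_sales(sales, 50.0))
```

Each step is a side-effect-free function applied to an iterable, which is what makes this style easy to split across partitions of a dataset. Spark's RDD API exposes similarly named `map`, `filter` and `reduce` operations.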

The repository contains the following folders:

  • SLIDES: This folder has all the PowerPoint and Google Slides presentations with lecture notes. Because the presentations are large, they are not uploaded here and the folder will mostly be empty. However, the presentations can be found at the link.
  • DOCS: This folder contains miscellaneous documents for the course, for instance the course outline.
  • NOTEBOOKS: This folder has all the source code for the tutorials. This includes the notebooks and Python files.
  • DATASETS: As the name suggests, this folder has the datasets used in the course. Again, because of their size, these datasets are not uploaded here.
  • RESOURCES: This folder contains learning resources such as PDF books and articles.
  • SOFTWARE: This folder has all the packages required for the course. As some of the installation files are large, they are not available here, but they can be found in the linked Google Drive folder.

To follow this material, the recommended approach is to tackle the modules in the order presented in the outline above. For each topic, go through the slides first and then move on to the tutorials in the notebooks. It's worth mentioning that since the course was delivered in person, the material isn't necessarily ideal for self-paced learning, but a person with reasonable prerequisite knowledge can still follow the course and grasp the concepts.

For any questions regarding the course content, you can contact me through the two email addresses below:

Combine Spark and Python to process large datasets and unlock the power of parallel computing and machine learning

TrainingByPackt/Big-Data-Analysis-with-Python

Big Data Analysis with Python

Processing big data in real time is challenging because of scalability, information inconsistency, and fault tolerance. Big Data Analysis with Python teaches you how to use tools that can control this data avalanche for you. With this book, you’ll learn effective techniques to aggregate data into useful dimensions for later analysis, extract statistical measurements, and transform datasets into features for other systems.

The book begins with an introduction to data manipulation in Python using Pandas. You’ll then get familiar with statistical analysis and plotting techniques. Through multiple hands-on activities, you’ll learn to analyze data that is distributed across several computers by using Dask. As you progress, you’ll study how to aggregate data for plots when the full dataset does not fit into memory. You’ll also explore Hadoop (HDFS and YARN), which will help you tackle larger datasets. The book further covers Spark and its interaction with other tools.

By the end of this book, you’ll be able to bootstrap your own Python environment, process large files, and manipulate data to generate statistics, metrics, and graphs.

  • Use Python to read and transform data into different formats
  • Generate basic statistics and metrics using data on the disk
  • Work with computing tasks distributed over a cluster
  • Convert data from different sources into storage or querying formats
  • Prepare data for statistical analysis, visualization, and machine learning
  • Present data in the form of effective visuals
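
The idea of generating statistics from data on disk without loading it all into memory can be sketched with the standard library alone. This is a minimal illustration and not the book's code; the function name `streaming_mean_by_key`, the column names, and the sample rows are assumptions. Tools such as Dask, or pandas `read_csv` with `chunksize`, apply the same running-aggregate idea at larger scale.

```python
import csv
import io
from collections import defaultdict


def streaming_mean_by_key(rows, key_field, value_field):
    """Compute per-key means in one pass, keeping only running totals in memory."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:  # rows are consumed lazily, one record at a time
        totals[row[key_field]] += float(row[value_field])
        counts[row[key_field]] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Stand-in for a large file on disk; with a real file, pass open(path, newline="").
sample = io.StringIO("city,temp\nKigali,21\nAccra,30\nKigali,23\n")
means = streaming_mean_by_key(csv.DictReader(sample), "city", "temp")
print(means)
```

Memory use grows with the number of distinct keys, not with the number of rows, which is why the approach still works when the file is far larger than RAM.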

For an optimal experience, we recommend the following operating system and software configuration:

  • Windows 7 SP1 32/64-bit, Windows 8.1 32/64-bit or Windows 10 32/64-bit
  • Ubuntu 14.04 or later
  • macOS Sierra or later
  • Browser: Google Chrome or Mozilla Firefox
  • Conda
  • JupyterLab

Python for Big Data Analytics: Solving Challenges with Distributed Computing

Explore Python’s role in Big Data analytics, its essential libraries, the concept of distributed computing, and real-world use cases.

Table of Contents
1. Introduction to Python and Big Data Analytics
2. The Role of Python in Big Data: A Deep Dive
3. Python Libraries for Big Data Processing
4. Understanding Distributed Computing: An Overview
5. Python in Distributed Computing: Tools and Frameworks
6. Case Studies: Solving Big Data Challenges with Python and Distributed Computing
7. Conclusion: The Future of Python in Big Data and Distributed Computing

1. Introduction to Python and Big Data Analytics

In the realm of Big Data Analytics, Python has emerged as a preferred language due to its simplicity and vast library support. Its easy-to-understand syntax makes it a go-to choice for data professionals globally.

With Python, data extraction, cleaning, analysis, and visualization processes become highly efficient, enabling organizations to draw valuable insights from complex datasets. This, in turn, significantly aids in decision making.

Moreover, Python’s compatibility with other technologies, such as Hadoop and Spark, further facilitates Big Data analytics operations.
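
To make the Hadoop connection more concrete, the MapReduce model that Hadoop popularized can be illustrated in a few lines of plain Python. This is a single-process sketch under stated assumptions: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and a real Hadoop or Spark job would distribute these phases across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data with Python", "Python and big clusters"])
print(counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a framework is free to run them in parallel on different nodes.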

In this blog, we will explore Python’s role in Big Data Analytics and its usage in solving various Big Data challenges through distributed computing.

2. The Role of Python in Big Data: A Deep Dive

Python’s versatility makes it an essential tool in Big Data analytics. With its extensive set of libraries, it can handle tasks ranging from web scraping to machine learning.

For instance, libraries like Pandas and NumPy simplify data manipulation and mathematical computations, while scikit-learn and TensorFlow cater to machine learning applications.
