Wathon/data_engineering_with_python-track-datacamp
Data Engineer with Python lecture notes from #datacamp.
Data Engineer with Python
In this track, you’ll discover how to build an effective data architecture, streamline data processing, and maintain large-scale data systems. In addition to working with Python, you’ll also grow your language skills as you work with Shell, SQL, and Scala to create data engineering pipelines, automate common file system tasks, and build a high-performance database.
Through hands-on exercises, you’ll add cloud and big data tools such as AWS Boto, PySpark, Spark SQL, and MongoDB to your data engineering toolkit to help you create and query databases, wrangle data, and configure schedules to run your pipelines. By the end of this track, you’ll have mastered the critical database, scripting, and process skills you need to progress your career.
- Data Engineering for Everyone
- Introduction to Data Engineering
- Streamlined Data Ingestion with pandas
- Writing Efficient Python Code
- Writing Functions in Python
- Introduction to Shell
- Data Processing in Shell
- Introduction to Bash Scripting
- Unit Testing for Data Science in Python
- Object-Oriented Programming in Python
- Introduction to Airflow in Python
- Introduction to PySpark
- Building Data Engineering Pipelines in Python
- Introduction to AWS Boto in Python
- Introduction to Relational Databases in SQL
- Database Design
- Introduction to Scala
- Big Data Fundamentals with PySpark
- Cleaning Data with PySpark
- Introduction to Spark SQL in Python
- Cleaning Data in SQL Server databases
- Transactions and Error Handling in SQL Server
- Building and Optimizing Triggers in SQL Server
- Improving Query Performance in SQL Server
- Introduction to MongoDB in Python
Data Engineer with Python
Data Engineers use Python for data analysis and for building data pipelines, where it supports wrangling activities such as aggregation, joining data from several sources, reshaping, and ETL. Python offers many tools for data analysis, and its libraries let you complete much of the analytic process with little code. Knowledge of database tools is also important for a data engineer, both to manage data well and to understand the analytic process. Combining these skills in a single role makes the analytics process easier to manage, and complex analytical problems become much more tractable with Python.
What is Data Engineer with Python?
Programming skills are essential for a Data Engineer, and because Python is easy to write, most Data Engineers are happy to use it in pipelines and data analytics. Data Engineers understand data architecture and how databases work, so they can begin implementation and database development quickly. Those databases must then be linked to applications, and Python knowledge is indispensable there. Machine learning also matters for Data Engineers, and it too can be handled with a working knowledge of Python.
Top 5 Python Packages Used in Data Engineering
These are the most widely used Python packages in Data Engineering.
1. SciPy
This module offers a wide range of scientific and numerical methods that Data Engineers can use to solve complex problems: optimization routines, linear algebra, integration and interpolation functions, many special functions, and even signal and image processing are all available through SciPy.
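For instance, interpolation and numerical integration, two of the SciPy features mentioned above, combine in a few lines. The sample data below is illustrative, not from any real dataset:

```python
import numpy as np
from scipy import integrate, interpolate

# Sample points taken from a known curve (y = x**2)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2

# Build a quadratic interpolant through the sample points
f = interpolate.interp1d(x, y, kind="quadratic")
print(float(f(2.5)))  # ≈ 6.25, since the interpolant recovers x**2

# Numerically integrate the interpolant over [0, 4]
area, _ = integrate.quad(f, 0, 4)
print(round(area, 2))  # ≈ 21.33 (the exact integral of x**2 is 64/3)
```

Because a quadratic spline can represent a quadratic exactly, the interpolant here reproduces the original curve; on real data the fit is only approximate.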
2. Pandas
pandas offers simple, easy-to-understand data structures with high performance on the data provided. The package excels at data wrangling and manipulation, and data can be visualized and handled faster than with many other Python modules.
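A few lines of pandas show the wrangling style described above, joining two sources and aggregating, on made-up data:

```python
import pandas as pd

# Two illustrative sources: transactions and a customer lookup table
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [10.0, 15.0, 7.5, 22.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
})

# Join the two sources, then aggregate revenue per region
merged = orders.merge(customers, on="customer_id", how="left")
revenue = merged.groupby("region")["amount"].sum()
print(revenue)  # EU 47.0, US 7.5
```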
3. Beautiful Soup
This module supports data extraction through web scraping and parsing. Because it treats documents, including web pages, as hierarchically ordered data, almost any markup can be parsed easily; it helps data engineers pull structured data out of HTML and other web pages.
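A short Beautiful Soup sketch, using an invented HTML snippet, shows how a page is navigated as a hierarchy:

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for a scraped web page
html = """
<html><body>
  <table id="prices">
    <tr><td>apple</td><td>1.20</td></tr>
    <tr><td>pear</td><td>0.95</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk the table hierarchically: table -> rows -> cells
rows = [
    [cell.get_text() for cell in row.find_all("td")]
    for row in soup.find("table", id="prices").find_all("tr")
]
print(rows)  # [['apple', '1.20'], ['pear', '0.95']]
```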
4. Petl
This module is dedicated to extracting, manipulating, and loading data tables. Tables can be converted with a few lines of code, and data export is supported as well, which makes it easy to move data between SQL, CSV, and other formats. The name petl comes from its purpose: a Python module for Extracting, Transforming, and Loading tables.
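petl wraps this workflow in a lazy table API; the extract-transform-load pattern it streamlines looks roughly like the following standard-library sketch (field names and values are invented, and no petl dependency is needed to run it):

```python
import csv
import io

# Extract: a small in-memory CSV "source" standing in for a real file
src = io.StringIO("name,price\napple,1.20\npear,0.95\n")
rows = list(csv.DictReader(src))

# Transform: convert price to float and derive a taxed-price column
for row in rows:
    row["price"] = float(row["price"])
    row["price_with_tax"] = round(row["price"] * 1.1, 2)

# Load: write the transformed table back out as CSV
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "price", "price_with_tax"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

With petl itself, the extract and load steps collapse into calls like `etl.fromcsv(...)` and `etl.tocsv(...)`, with transforms chained in between.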
5. Pygrametl
ETL workflows can be created easily with pygrametl, since it provides the full range of ETL functionality. It is fast, and all the code is available directly in the module. A dimension in the ETL flow is represented by a Dimension object connected to one or more tables in the dataflow, and pygrametl itself handles the usual ETL activities such as lookups, insertion and deletion of data, and copying data from one source to another.
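At its core, what a pygrametl Dimension object automates is a lookup-or-insert against a dimension table. A minimal sketch of that idea using only the stdlib sqlite3 module, with invented table names, so it runs without pygrametl:

```python
import sqlite3

# In-memory warehouse standing in for a real target database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")

def ensure_dimension(name):
    """Look up a dimension row, inserting it if missing -- the
    behaviour a pygrametl Dimension object provides out of the box."""
    row = conn.execute(
        "SELECT id FROM dim_product WHERE name = ?", (name,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO dim_product (name) VALUES (?)", (name,))
    return cur.lastrowid

# Load three fact rows; 'apple' appears twice but gets one dimension row
for name, amount in [("apple", 10.0), ("pear", 5.0), ("apple", 7.0)]:
    conn.execute(
        "INSERT INTO fact_sales VALUES (?, ?)",
        (ensure_dimension(name), amount),
    )

print(conn.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0])  # 2
```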
Use Cases of Data Engineer with Python
1. Data Acquisition: Data acquisition means contacting a source, which might be an API or any web application, and getting the data in the required format. Python helps here with the code and packages needed to build a pipeline tailored to the source and collect the information; ETL jobs used for acquisition typically involve Python as well.
2. Data Manipulation: Python has many libraries, and pandas is the best known for shaping data to a user’s requirements: data in almost any format can be read and manipulated. For large datasets, the PySpark library can manage the data instead.
3. Data Modelling: Python connects Data Engineers to machine-learning work, since frameworks such as TensorFlow, Keras, scikit-learn, and PyTorch are used for data modelling; it therefore helps show how the data will serve downstream models.
4. Data Surfacing: Python can expose data through APIs, typically built with the Flask or Django frameworks, so that data can be viewed easily; this includes ordinary report creation as well.
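As an illustration of the data surfacing case above, here is a minimal Flask endpoint that serves a report as JSON. The route name and report rows are invented for the example:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Illustrative report data; in practice this would come from a database
REPORT = [
    {"region": "EU", "revenue": 1200},
    {"region": "US", "revenue": 3400},
]

@app.route("/report")
def report():
    # Surface the report rows as JSON for downstream consumers
    return jsonify(REPORT)

# app.run() would serve this at http://localhost:5000/report
```

During development the endpoint can be exercised without starting a server, via `app.test_client().get("/report")`.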
Role of Data Engineer with Python
- Working on data architecture is important for Data Engineers: they should understand how the system works and plan their work around the organization’s requirements. Python is used less at this stage, which relies mostly on visualization and design tools.
- Data collection is another important process in Data Engineering: collecting data from different sources and manipulating it. Python is used here to pull data from sources through pipelines and to manipulate it, often within Databricks or another analytics platform.
- Data Engineers should research the data and how it has performed over past years. Graphs can be drawn easily with Python to understand data performance, which makes the work quicker and more efficient.
- Data Engineers should not rely on a single Python library, as other libraries offer different approaches and faster solutions to the same problem. A Data Engineer should keep learning and change approach whenever a more efficient method is found.
- After data is stored, it is important to identify patterns in it. Python is useful here through its visualization libraries, and any data anomalies can be investigated and resolved; an IDE such as Jupyter can be used for this data engineering work.
- Plenty of automation is needed when building data pipelines, and Python comes in handy because it can handle all of that scripting efficiently.
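The automation point can be made concrete: a pipeline is often just small, composable step functions chained in order, which a scheduler such as Airflow would then orchestrate and retry. All names and data below are illustrative:

```python
def extract():
    # Stand-in for reading from an API, file, or database
    return ["10", "15", "7"]

def transform(rows):
    # Stand-in for cleaning/reshaping: parse and double each value
    return [int(r) * 2 for r in rows]

def load(rows, sink):
    # Stand-in for writing to a warehouse table
    sink.extend(rows)

# Run the steps in order; a scheduler would normally own this loop
sink = []
load(transform(extract()), sink)
print(sink)  # [20, 30, 14]
```

Keeping each step a plain function makes the pipeline easy to unit-test and to automate.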
Conclusion
For beginners new to Data Engineering, Python is easy to learn and well suited to data analysis. Data from all over the world is processed by Data Engineers, and then by Data Scientists, so the Data Engineer with Python profile will remain in demand in the years to come.
jeantardelli/data-engineering-with-python
Here I will be exploring various tools and methods used in the data engineering process with Python.
Data Engineering w/ Python
This repo contains my code and the pipelines explained in the Data Engineering with Python book.
Software and Hardware List
| Software required | OS used |
| --- | --- |
| Python 3.x, Spark 3.x, Nifi 1.x, MySQL 8.0.x, Elasticsearch 7.x, Kibana 7.x, Apache Kafka 2.x | Linux (any distro) |
- airflow-dag: this directory contains the airflow DAG modules used in this repo
- great_expectations: contains all the important components of a local Great Expectations deployment
- kafka-producer-consumer: contains modules that produce and consume Kafka topics in Python
- load-database: this directory contains modules that load and query data from MySQL
- load-nosql: this directory contains modules that load and query data from Elasticsearch
- nifi-datalake: this directory contains Nifi Pipelines to simulate reading data from the data lake
- nifi-files: this directory contains the files derived from the Nifi template pipelines
- nifi-scanfiles: this directory contains dictionary files read by ScanContent processor (e.g. VIP)
- nifi-scripts: this directory contains shell scripts that are used with ExecuteStreamCommand in Nifi
- nifi-templates: this directory contains different Apache Nifi pipeline templates
- nifi-versioning: this directory contains Nifi pipelines with version control (NiFi Registry)
- pyspark: this directory contains Jupyter Notebooks that connect to PySpark data processor
- scooter-data: this directory contains the scooter dataset and wrangling data modules (pandas)
- sql-user: this directory contains the query to create a user and its credentials data
- writing-reading-data: this directory contains modules that create and read fake data
Setup working environment
To set up the working environment, run the command:

```
$ source start-working-environment
```
If you want to stop/kill the working environment, run the command:
To create the MySQL user, run the following statement as the root user:

```
$ mysql -u root -p -e "SOURCE sql-user/create-user.sql"
```

This will also grant access to the databases used in this repo.