Wathon/data_engineering_with_python-track-datacamp
Data Engineer with Python lecture notes from #datacamp.
Data Engineer with Python
In this track, you’ll discover how to build an effective data architecture, streamline data processing, and maintain large-scale data systems. In addition to working with Python, you’ll also grow your language skills as you work with Shell, SQL, and Scala to create data engineering pipelines, automate common file system tasks, and build a high-performance database.
Through hands-on exercises, you’ll add cloud and big data tools such as AWS Boto, PySpark, Spark SQL, and MongoDB to your data engineering toolkit to help you create and query databases, wrangle data, and configure schedules to run your pipelines. By the end of this track, you’ll have mastered the critical database, scripting, and process skills you need to progress your career.
- Data Engineering for Everyone
- Introduction to Data Engineering
- Streamlined Data Ingestion with pandas
- Writing Efficient Python Code
- Writing Functions in Python
- Introduction to Shell
- Data Processing in Shell
- Introduction to Bash Scripting
- Unit Testing for Data Science in Python
- Object-Oriented Programming in Python
- Introduction to Airflow in Python
- Introduction to PySpark
- Building Data Engineering Pipelines in Python
- Introduction to AWS Boto in Python
- Introduction to Relational Databases in SQL
- Database Design
- Introduction to Scala
- Big Data Fundamentals with PySpark
- Cleaning Data with PySpark
- Introduction to Spark SQL in Python
- Cleaning Data in SQL Server databases
- Transactions and Error Handling in SQL Server
- Building and Optimizing Triggers in SQL Server
- Improving Query Performance in SQL Server
- Introduction to MongoDB in Python
Data Engineer with Python
Data Engineers use Python for data analysis and for building data pipelines, where it supports wrangling activities such as aggregation, joining data from several sources, reshaping, and ETL. Python offers many tools for data analysis, and its libraries let you complete much of the analytic process with little code. Knowledge of database tools is also important for a data engineer, both to manage data well and to understand the analytic process. Combining these skills in a single role makes the analytics process easier to manage, and complex analytical problems become much more tractable with Python.
What is Data Engineer with Python?
Programming skills are essential for a Data Engineer, and because Python is easy to write, most Data Engineers are happy to use it in pipelines and data analytics. Data Engineers understand data architecture and how databases work, so they can begin implementation and database development quickly. Those databases must then be linked to applications, and Python knowledge is indispensable there. Machine learning also matters for Data Engineers, and it too can be handled with a working knowledge of Python.
Top 5 Python Packages Used in Data Engineering
These are the most widely used Python packages in Data Engineering.
1. SciPy
This module offers a wide range of scientific and numerical methods that Data Engineers can use to solve complex problems: optimization routines, linear algebra, integration and interpolation functions, many special functions, and even signal and image processing are all available through SciPy.
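For instance, interpolation and numerical integration, two of the SciPy features mentioned above, combine in a few lines. The sample data below is illustrative, not from any real dataset:

```python
import numpy as np
from scipy import integrate, interpolate

# Sample points taken from a known curve (y = x**2)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2

# Build a quadratic interpolant through the sample points
f = interpolate.interp1d(x, y, kind="quadratic")
print(float(f(2.5)))  # ≈ 6.25, since the interpolant recovers x**2

# Numerically integrate the interpolant over [0, 4]
area, _ = integrate.quad(f, 0, 4)
print(round(area, 2))  # ≈ 21.33 (the exact integral of x**2 is 64/3)
```

Because a quadratic spline can represent a quadratic exactly, the interpolant here reproduces the original curve; on real data the fit is only approximate.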
2. Pandas
pandas offers simple, easy-to-understand data structures with high performance on the data provided. The package excels at data wrangling and manipulation, and data can be visualized and handled faster than with many other Python modules.
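A few lines of pandas show the wrangling style described above, joining two sources and aggregating, on made-up data:

```python
import pandas as pd

# Two illustrative sources: transactions and a customer lookup table
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [10.0, 15.0, 7.5, 22.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
})

# Join the two sources, then aggregate revenue per region
merged = orders.merge(customers, on="customer_id", how="left")
revenue = merged.groupby("region")["amount"].sum()
print(revenue)  # EU 47.0, US 7.5
```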
3. Beautiful Soup
This module supports data extraction through web scraping and parsing. Because it treats documents, including web pages, as hierarchically ordered data, almost any markup can be parsed easily; it helps data engineers pull structured data out of HTML and other web pages.
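A short Beautiful Soup sketch, using an invented HTML snippet, shows how a page is navigated as a hierarchy:

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for a scraped web page
html = """
<html><body>
  <table id="prices">
    <tr><td>apple</td><td>1.20</td></tr>
    <tr><td>pear</td><td>0.95</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk the table hierarchically: table -> rows -> cells
rows = [
    [cell.get_text() for cell in row.find_all("td")]
    for row in soup.find("table", id="prices").find_all("tr")
]
print(rows)  # [['apple', '1.20'], ['pear', '0.95']]
```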
4. Petl
This module is dedicated to extracting, manipulating, and loading data tables. Tables can be converted with a few lines of code, and data export is supported as well, which makes it easy to move data between SQL, CSV, and other formats. The name petl comes from its purpose: a Python module for Extracting, Transforming, and Loading tables.
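petl wraps this workflow in a lazy table API; the extract-transform-load pattern it streamlines looks roughly like the following standard-library sketch (field names and values are invented, and no petl dependency is needed to run it):

```python
import csv
import io

# Extract: a small in-memory CSV "source" standing in for a real file
src = io.StringIO("name,price\napple,1.20\npear,0.95\n")
rows = list(csv.DictReader(src))

# Transform: convert price to float and derive a taxed-price column
for row in rows:
    row["price"] = float(row["price"])
    row["price_with_tax"] = round(row["price"] * 1.1, 2)

# Load: write the transformed table back out as CSV
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "price", "price_with_tax"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

With petl itself, the extract and load steps collapse into calls like `etl.fromcsv(...)` and `etl.tocsv(...)`, with transforms chained in between.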
5. Pygrametl
ETL workflows can be created easily with pygrametl, since it provides the full range of ETL functionality. It is fast, and all the code is available directly in the module. A dimension in the ETL flow is represented by a Dimension object connected to one or more tables in the dataflow, and pygrametl itself handles the usual ETL activities such as lookups, insertion and deletion of data, and copying data from one source to another.
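At its core, what a pygrametl Dimension object automates is a lookup-or-insert against a dimension table. A minimal sketch of that idea using only the stdlib sqlite3 module, with invented table names, so it runs without pygrametl:

```python
import sqlite3

# In-memory warehouse standing in for a real target database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")

def ensure_dimension(name):
    """Look up a dimension row, inserting it if missing -- the
    behaviour a pygrametl Dimension object provides out of the box."""
    row = conn.execute(
        "SELECT id FROM dim_product WHERE name = ?", (name,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO dim_product (name) VALUES (?)", (name,))
    return cur.lastrowid

# Load three fact rows; 'apple' appears twice but gets one dimension row
for name, amount in [("apple", 10.0), ("pear", 5.0), ("apple", 7.0)]:
    conn.execute(
        "INSERT INTO fact_sales VALUES (?, ?)",
        (ensure_dimension(name), amount),
    )

print(conn.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0])  # 2
```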
Use Cases of Data Engineer with Python
1. Data Acquisition: Data acquisition means contacting a source, which might be an API or any web application, and getting the data in the required format. Python helps here with the code and packages needed to build a pipeline tailored to the source and collect the information; ETL jobs used for acquisition typically involve Python as well.
2. Data Manipulation: Python has many libraries, and pandas is the best known for shaping data to a user’s requirements: data in almost any format can be read and manipulated. For large datasets, the PySpark library can manage the data instead.
3. Data Modelling: Python connects Data Engineers to machine-learning work, since frameworks such as TensorFlow, Keras, scikit-learn, and PyTorch are used for data modelling; it therefore helps show how the data will serve downstream models.
4. Data Surfacing: Python can expose data through APIs, typically built with the Flask or Django frameworks, so that data can be viewed easily; this includes ordinary report creation as well.
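As an illustration of the data surfacing case above, here is a minimal Flask endpoint that serves a report as JSON. The route name and report rows are invented for the example:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Illustrative report data; in practice this would come from a database
REPORT = [
    {"region": "EU", "revenue": 1200},
    {"region": "US", "revenue": 3400},
]

@app.route("/report")
def report():
    # Surface the report rows as JSON for downstream consumers
    return jsonify(REPORT)

# app.run() would serve this at http://localhost:5000/report
```

During development the endpoint can be exercised without starting a server, via `app.test_client().get("/report")`.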
Role of Data Engineer with Python
- Working on data architecture is important for Data Engineers: they should understand how the system works and plan their work around the organization’s requirements. Python is used less at this stage, which relies mostly on visualization and design tools.
- Data collection is another important process in Data Engineering: collecting data from different sources and manipulating it. Python is used here to pull data from sources through pipelines and to manipulate it, often within Databricks or another analytics platform.
- Data Engineers should research the data and how it has performed over past years. Graphs can be drawn easily with Python to understand data performance, which makes the work quicker and more efficient.
- Data Engineers should not rely on a single Python library, as other libraries offer different approaches and faster solutions to the same problem. A Data Engineer should keep learning and change approach whenever a more efficient method is found.
- After data is stored, it is important to identify patterns in it. Python is useful here through its visualization libraries, and any data anomalies can be investigated and resolved; an IDE such as Jupyter can be used for this data engineering work.
- Plenty of automation is needed when building data pipelines, and Python comes in handy because it can handle all of that scripting efficiently.
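The automation point can be made concrete: a pipeline is often just small, composable step functions chained in order, which a scheduler such as Airflow would then orchestrate and retry. All names and data below are illustrative:

```python
def extract():
    # Stand-in for reading from an API, file, or database
    return ["10", "15", "7"]

def transform(rows):
    # Stand-in for cleaning/reshaping: parse and double each value
    return [int(r) * 2 for r in rows]

def load(rows, sink):
    # Stand-in for writing to a warehouse table
    sink.extend(rows)

# Run the steps in order; a scheduler would normally own this loop
sink = []
load(transform(extract()), sink)
print(sink)  # [20, 30, 14]
```

Keeping each step a plain function makes the pipeline easy to unit-test and to automate.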
Conclusion
For beginners new to Data Engineering, Python is easy to learn and well suited to data analysis. Data from all over the world is processed by Data Engineers, and then by Data Scientists, so the Data Engineer with Python profile will remain in demand in the years to come.
jeantardelli/data-engineering-with-python
Here I will be exploring various tools and methods used in the data engineering process with Python.
Data Engineering w/ Python
This repo contains my code and the pipelines explained in the Data Engineering with Python book.
Software and Hardware List
| Software required | OS used |
| --- | --- |
| Python 3.x, Spark 3.x, Nifi 1.x, MySQL 8.0.x, Elasticsearch 7.x, Kibana 7.x, Apache Kafka 2.x | Linux (any distro) |
- airflow-dag: this directory contains the airflow DAG modules used in this repo
- great_expectations: contains all the important components of a local Great Expectations deployment
- kafka-producer-consumer: contains modules that produce and consume Kafka topics in Python
- load-database: this directory contains modules that load and query data from MySQL
- load-nosql: this directory contains modules that load and query data from Elasticsearch
- nifi-datalake: this directory contains Nifi Pipelines to simulate reading data from the data lake
- nifi-files: this directory contains the files derived from the Nifi template pipelines
- nifi-scanfiles: this directory contains dictionary files read by ScanContent processor (e.g. VIP)
- nifi-scripts: this directory contains shell scripts that are used with ExecuteStreamCommand in Nifi
- nifi-templates: this directory contains different Apache Nifi pipeline templates
- nifi-versioning: this directory contains Nifi pipelines with version control (NiFi Registry)
- pyspark: this directory contains Jupyter Notebooks that connect to PySpark data processor
- scooter-data: this directory contains the scooter dataset and wrangling data modules (pandas)
- sql-user: this directory contains the query to create a user and its credentials data
- writing-reading-data: this directory contains modules that create and read fake data
Setup working environment
To set up the working environment, run the command:

```
$ source start-working-environment
```
If you want to stop/kill the working environment, run the command:
To create the MySQL user, run the following statement as the root user:

```
$ mysql -u root -p -e "SOURCE sql-user/create-user.sql"
```

This will also grant access to the databases used in this repo.