Parsing XML in Python: an example

Parsing XML with BeautifulSoup in Python

Extensible Markup Language (XML) is a markup language that’s popular because of the way it structures data. It is widely used for data transmission (representing serialized objects) and for configuration files.

Despite JSON’s rising popularity, you can still find XML in Android manifest files, Java/Maven build configuration, and SOAP APIs on the web. Parsing XML is therefore still a common task for developers.

In Python, we can read and parse XML by leveraging two libraries: BeautifulSoup and LXML.

In this guide, we’ll take a look at extracting and parsing data from XML files with BeautifulSoup and LXML, and store the results using Pandas.

Setting up LXML and BeautifulSoup

We first need to install both libraries. We’ll create a new folder in our workspace, set up a virtual environment, and install the libraries:

$ mkdir xml_parsing_tutorial
$ cd xml_parsing_tutorial
$ python3 -m venv env # Create a virtual environment for this project
$ . env/bin/activate # Activate the virtual environment
$ pip install lxml beautifulsoup4 # Install both Python packages

Now that we have everything set up, let’s do some parsing!

Parsing XML with lxml and BeautifulSoup

Parsing always depends on the underlying file and the structure it uses, so there’s no single silver bullet for all files. BeautifulSoup builds the parse tree automatically, but which elements you extract from it depends on the task.

Thus, it’s best to learn parsing with a hands-on approach. Save the following XML into a file named teachers.xml in your working directory:

<?xml version="1.0" encoding="UTF-8"?>
<teachers>
    <teacher>
        <name>Sam Davies</name>
        <age>35</age>
        <subject>Maths</subject>
    </teacher>
    <teacher>
        <name>Cassie Stone</name>
        <age>24</age>
        <subject>Science</subject>
    </teacher>
    <teacher>
        <name>Derek Brandon</name>
        <age>32</age>
        <subject>History</subject>
    </teacher>
</teachers>

The <teachers> tag indicates the root of the XML document. The <teacher> tag is a child, or sub-element, of <teachers> and holds information about a single person. The <name>, <age>, and <subject> tags are children of the <teacher> tag, and grandchildren of the <teachers> tag.
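The root, child, and grandchild relationships can be checked with a few lines of Python. The sketch below uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, only so that it runs without any installed packages, and it works on a shortened copy of the document with a single teacher.

```python
import xml.etree.ElementTree as ET

# Shortened copy of teachers.xml with one teacher (for illustration only).
xml_text = """<teachers>
    <teacher>
        <name>Sam Davies</name>
        <age>35</age>
        <subject>Maths</subject>
    </teacher>
</teachers>"""

root = ET.fromstring(xml_text)  # the root element

# Children of the root: one <teacher> element per person.
child_tags = [child.tag for child in root]

# Grandchildren of the root: the fields inside each <teacher>.
grandchild_tags = [field.tag for child in root for field in child]

print(root.tag)         # teachers
print(child_tags)       # ['teacher']
print(grandchild_tags)  # ['name', 'age', 'subject']
```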

The first line, <?xml version="1.0" encoding="UTF-8"?>, in the sample document above is called an XML prolog. When present, it must come at the very beginning of the XML file, but including it is optional.

The XML Prolog shown above indicates the version of XML used and the type of character encoding. In this case, the characters in the XML document are encoded in UTF-8.
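To see what the prolog carries, we can read its fields back after parsing. This is a small sketch using the standard library's xml.dom.minidom, chosen here only because it exposes the declared version and encoding directly; it is not part of the BeautifulSoup workflow used in the rest of this guide. The two byte strings are minimal stand-ins for a real file.

```python
from xml.dom import minidom

# Minimal stand-in documents: one with a prolog, one without.
with_prolog = b'<?xml version="1.0" encoding="UTF-8"?><teachers></teachers>'
without_prolog = b'<teachers></teachers>'

doc = minidom.parseString(with_prolog)
print(doc.version, doc.encoding)  # 1.0 UTF-8

# The prolog is optional: the document still parses, the fields are just unset.
doc_no_prolog = minidom.parseString(without_prolog)
print(doc_no_prolog.encoding)                 # None
print(doc_no_prolog.documentElement.tagName)  # teachers
```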

Now that we understand the structure of the XML file, we can parse it. Create a new file called teachers.py in your working directory, and import the BeautifulSoup library:

from bs4 import BeautifulSoup 

Note: As you may have noticed, we didn’t import lxml! BeautifulSoup uses lxml internally when we ask for the 'xml' parser, so importing it separately isn’t necessary. It isn’t installed as part of BeautifulSoup, though, which is why we installed it with pip earlier.

Now let’s read the contents of the XML file we created and store it in a variable called soup so we can begin parsing:

with open('teachers.xml', 'r') as f:
    file = f.read()

# 'xml' is the parser used. For html files, which BeautifulSoup is typically used for, it would be 'html.parser'.
soup = BeautifulSoup(file, 'xml')

The soup variable now has the parsed contents of our XML file. We can use this variable and the methods attached to it to retrieve the XML information with Python code.

Let’s say we want to view only the names of the teachers from the XML document. We can get that information with a few lines of code:

names = soup.find_all('name')
for name in names:
    print(name.text)

Running python teachers.py would give us:

Sam Davies
Cassie Stone
Derek Brandon

The find_all() method returns a list of all the tags that match the argument passed into it. As shown in the code above, soup.find_all('name') returns all the <name> tags in the XML file. We then iterate over these tags and print their text property, which contains the tags’ values.
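As a quick comparison, here is a minimal sketch of find_all() next to its sibling find(), which returns only the first match. It parses a shortened inline copy of the teachers document instead of reading the file, and the <salary> tag it searches for is deliberately one that does not exist.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of teachers.xml (two teachers, for illustration).
xml_text = """<teachers>
    <teacher><name>Sam Davies</name><age>35</age></teacher>
    <teacher><name>Cassie Stone</name><age>24</age></teacher>
</teachers>"""

soup = BeautifulSoup(xml_text, 'xml')  # the 'xml' parser requires lxml

# find_all() returns every matching tag.
names = [tag.text for tag in soup.find_all('name')]

# find() returns only the first match, or None when nothing matches.
first_name = soup.find('name').text
missing = soup.find('salary')

print(names)       # ['Sam Davies', 'Cassie Stone']
print(first_name)  # Sam Davies
print(missing)     # None
```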

Display Parsed Data in a Table

Let’s take things one step further: we’ll parse all the contents of the XML file and display them in a tabular format.

Let’s rewrite the teachers.py file with:

from bs4 import BeautifulSoup

# Opens and reads the xml file we saved earlier
with open('teachers.xml', 'r') as f:
    file = f.read()

# Initializing soup variable
soup = BeautifulSoup(file, 'xml')

# Storing <name> tags and elements in 'names' variable
names = soup.find_all('name')

# Storing <age> tags and elements in 'ages' variable
ages = soup.find_all('age')

# Storing <subject> tags and elements in 'subjects' variable
subjects = soup.find_all('subject')

# Displaying data in tabular format
print('-'.center(35, '-'))
print('|' + 'Name'.center(15) + '|' + ' Age ' + '|' + 'Subject'.center(11) + '|')
for i in range(0, len(names)):
    print('-'.center(35, '-'))
    print(
        f'|{names[i].get_text().center(15)}|{ages[i].get_text().center(5)}|{subjects[i].get_text().center(11)}|')
print('-'.center(35, '-'))

The output of the code above would look like this:

-----------------------------------
|      Name     | Age |  Subject  |
-----------------------------------
|   Sam Davies  |  35 |   Maths   |
-----------------------------------
|  Cassie Stone |  24 |  Science  |
-----------------------------------
| Derek Brandon |  32 |  History  |
-----------------------------------
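The widths in the script are not arbitrary: 15, 5, and 11 characters of content plus four '|' separators add up to 35, the length of the dashed rule. The sketch below makes that relationship explicit with a small helper; WIDTHS and format_row are names introduced here for illustration and are not part of the script above.

```python
# Column widths used by the table: name, age, subject.
WIDTHS = (15, 5, 11)


def format_row(*cells):
    """Center each cell in its column and join the cells with '|'."""
    padded = [str(cell).center(width) for cell, width in zip(cells, WIDTHS)]
    return '|' + '|'.join(padded) + '|'


# One separator before each column plus one closing separator.
rule = '-' * (sum(WIDTHS) + len(WIDTHS) + 1)

print(rule)
print(format_row('Name', 'Age', 'Subject'))
print(rule)
print(format_row('Sam Davies', 35, 'Maths'))
print(rule)
```

Deriving the rule length from the widths means a column can be widened in one place without the table falling out of alignment.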

Congrats! You just parsed your first XML file with BeautifulSoup and LXML! Now that you’re more comfortable with the theory and the process, let’s try a more real-world example.

We’ve formatted the data as a table as a precursor to storing it in a versatile data structure. In the upcoming mini-project, we’ll store the data in a Pandas DataFrame.

Parsing an RSS Feed and Storing the Data to a CSV

In this section, we’ll parse an RSS feed of The New York Times News, and store that data in a CSV file.

RSS is short for Really Simple Syndication. An RSS feed is a file that contains a summary of updates from a website and is written in XML. In this case, the RSS feed of The New York Times contains a summary of daily news updates on their website. This summary contains links to news releases, links to article images, descriptions of news items, and more. Website owners also publish RSS feeds as a courtesy, so that people can get the data without scraping the site.
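To make the structure concrete, here is a minimal sketch of an RSS 2.0 document and how BeautifulSoup sees it. The feed is hypothetical: the channel name, the single item, and the example.com URLs are invented for illustration and are not taken from The New York Times. It is stored as bytes because that is how it would arrive over HTTP.

```python
from bs4 import BeautifulSoup

# Hypothetical feed: every value below is invented for illustration.
rss_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Hypothetical feed used for illustration</description>
    <item>
      <title>Example headline</title>
      <link>https://example.com/articles/1</link>
      <guid>https://example.com/articles/1</guid>
      <pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate>
      <description>One-sentence summary of the article.</description>
    </item>
  </channel>
</rss>"""

soup = BeautifulSoup(rss_bytes, 'xml')

# Searching the whole document finds the channel's <title> first ...
feed_title = soup.find('title').text

# ... so article fields should be looked up inside each <item>.
item = soup.find('item')
item_title = item.find('title').text

print(feed_title)  # Example News
print(item_title)  # Example headline
```

Because the channel and its items share tag names such as <title> and <description>, scoping each lookup to an <item> keeps feed-level metadata out of the article data.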

[Figure: snapshot of an RSS feed from The New York Times]

The New York Times publishes separate RSS feeds for different continents, countries, regions, topics and other criteria.

It’s important to see and understand the structure of the data before you can begin parsing it. The data we would like to extract from the RSS feed about each news article is its globally unique identifier (guid), title, publication date (pubDate), and description.

Now that we’re familiar with the structure and have clear goals, let’s kick off our program! We’ll need the requests library and the pandas library to retrieve the data and easily convert it to a CSV file.

With requests, we can make HTTP requests to websites and read the responses. In this case, we can use it to retrieve their RSS feeds (in XML) so BeautifulSoup can parse them. With pandas, we will be able to format the parsed data in a table, and finally store the table’s contents in a CSV file.

In the same working directory, install requests and pandas (your virtual environment should still be active):

$ pip install requests pandas 

In a new file, nyt_rss_feed.py , let’s import our libraries:

import requests
from bs4 import BeautifulSoup
import pandas as pd

Then, let’s make an HTTP request to The New York Times’ server to get their RSS feed and retrieve its contents:

url = 'https://rss.nytimes.com/services/xml/rss/nyt/US.xml'
xml_data = requests.get(url).content

With the code above, we have been able to get a response from the HTTP request and store its contents in the xml_data variable. The content attribute of a requests response holds the body as bytes.
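The one-line request works when everything goes well, but it waits indefinitely on a stalled server and silently accepts error pages. A slightly more defensive sketch is shown below; fetch_feed is a hypothetical helper name and the 10-second timeout is an arbitrary choice, not something the feed requires.

```python
import requests


def fetch_feed(url, timeout=10):
    """Return the raw bytes of the feed at `url`, raising on HTTP errors."""
    # Without a timeout, requests.get() can block forever on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    # .content is bytes; .text would be the decoded str.
    return response.content
```

With this helper in place, the assignment above could be written as xml_data = fetch_feed(url).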

Now, create the following function to parse the XML data into a table in Pandas, with the help of BeautifulSoup:

def parse_xml(xml_data):
    # Initializing soup variable
    soup = BeautifulSoup(xml_data, 'xml')

    # Creating columns for the table
    df = pd.DataFrame(columns=['guid', 'title', 'pubDate', 'description'])

    # Iterating through item tags and extracting elements
    all_items = soup.find_all('item')
    items_length = len(all_items)

    for index, item in enumerate(all_items):
        guid = item.find('guid').text
        title = item.find('title').text
        pub_date = item.find('pubDate').text
        description = item.find('description').text

        # Adding extracted elements to rows in table.
        # DataFrame.append() was removed in pandas 2.0, so the row is
        # assigned by label instead; the value order matches the columns.
        df.loc[len(df)] = [guid, title, pub_date, description]

        print(f'Appending row {index + 1} of {items_length}')

    return df

The function above parses XML data from an HTTP request with BeautifulSoup, storing its contents in a soup variable. The Pandas DataFrame with rows and columns for the data we would like to parse is referenced via the df variable.

We then iterate through the XML file to find all <item> tags. By iterating through each <item> tag we are able to extract its child tags: <guid>, <title>, <pubDate>, and <description>. Note how we use the find() method to get only one object. We append the values of each child tag to the Pandas table.
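One caveat with find(): it returns None when a tag is absent, so calling .text on the result raises an AttributeError if an item lacks one of the four tags. The sketch below shows one way to guard against that. The child_text helper and the sample item, which deliberately has no <description>, are hypothetical and not part of the function above.

```python
from bs4 import BeautifulSoup


def child_text(item, tag_name, default=''):
    """Return the text of the first `tag_name` tag inside `item`, or `default`."""
    tag = item.find(tag_name)
    return tag.text if tag is not None else default


# Hypothetical item that is missing its <description> tag.
xml_bytes = b"""<rss version="2.0"><channel><item>
<guid>https://example.com/articles/1</guid>
<title>Example headline</title>
<pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate>
</item></channel></rss>"""

item = BeautifulSoup(xml_bytes, 'xml').find('item')
fields = ('guid', 'title', 'pubDate', 'description')
row = {name: child_text(item, name) for name in fields}

print(row['title'])              # Example headline
print(repr(row['description']))  # ''
```

Whether an empty string, None, or skipping the item entirely is the right fallback depends on how the CSV will be used.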

Now, at the end of the file after the function, add these two lines of code to call the function and create a CSV file:

df = parse_xml(xml_data)
df.to_csv('news.csv')

Run python nyt_rss_feed.py to create a new CSV file in your present working directory:

Appending row 1 of 24
Appending row 2 of 24
...
Appending row 24 of 24
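By default, to_csv() also writes the DataFrame's row index as an unnamed first column. If that column is not wanted in news.csv, passing index=False leaves it out. The sketch below shows the difference on a hypothetical one-row table, writing to in-memory buffers instead of files so the header lines can be compared directly.

```python
import io

import pandas as pd

# Hypothetical one-row table with the same columns as the feed DataFrame.
df = pd.DataFrame([{
    'guid': 'https://example.com/articles/1',
    'title': 'Example headline',
    'pubDate': 'Mon, 01 Jan 2024 12:00:00 GMT',
    'description': 'One-sentence summary.',
}])

# Default behaviour: the row index becomes an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False writes only the four data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]

print(header_with_index)     # ,guid,title,pubDate,description
print(header_without_index)  # guid,title,pubDate,description
```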
