Selenium python parse table

Contents

Parsing table data using Selenium with Python in a generalized way
Processing HTML tables with Python and Selenium
A basic HTML table

Parsing table data using Selenium with Python in a generalized way

Edit: Just realized you weren’t using BeautifulSoup (or any other html parser). When scraping webpages, using string manipulation is just asking for trouble when you could just parse the HTML and use it. Check it out: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. Try to incorporate it in your workflow. It will help you immensely.

from bs4 import Tag


def scrape_table(table: Tag) -> list:
    # Collect the stripped text of every <td> cell, one list per <tr> row
    rows = []
    for row in table.find_all('tr'):
        cells = [cell.text.strip() for cell in row.find_all('td')]
        rows.append(cells)
    return rows

You don't have to bring out the big guns like Selenium when requests with a few headers suffices. Many websites set up a basic barrier that blocks requests without a User-Agent header; adding one lets us scrape this page just fine. And not having to launch a browser speeds up the process quite a bit.
If you have a list of key-value pairs, you can use the dict function to package them into a dictionary. It works for this page because every table row contains only a stat name and a number.
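
As a small illustration of the dict step, here is a sketch with made-up stat names and values (they are not taken from the page). One caveat it guards against: a row without td cells, such as a header row built from th cells, comes back from scrape_table as an empty list, and dict raises a ValueError on such an element. The helper name rows_to_dict is hypothetical.

```python
def rows_to_dict(rows: list) -> dict:
    """Turn [name, value] rows into a dict, skipping rows of any other length."""
    return {row[0]: row[1] for row in rows if len(row) == 2}


# Made-up sample rows in the shape scrape_table returns;
# the leading empty list mimics a header row without <td> cells
sample_rows = [[], ["Corners", "12"], ["Crossing", "15"]]
stats = rows_to_dict(sample_rows)
print(stats)
```

If every row is known to have exactly two cells, plain dict(rows) does the same job.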

Here I’ve purposefully duplicated some code but you can easily refactor table search via title into a find_table_by_title function, for instance.

from pprint import pprint

import requests
from bs4 import BeautifulSoup, Tag


def scrape_table(table: Tag) -> list:
    rows = []
    for row in table.find_all('tr'):
        cells = [cell.text.strip() for cell in row.find_all('td')]
        rows.append(cells)
    return rows


def scrape_technical(soup: BeautifulSoup) -> dict:
    # find table by column title
    # (string= is the current name of the older text= argument)
    col_title_el = soup.find('h3', string='TECHNICAL')
    # go up the parents until we find one that
    # contains both column title and the table, but separate for all columns.
    # .panel seems to fit our criteria
    panel_el = col_title_el.find_parent(class_='panel')
    # now we can find the table
    table_el = panel_el.find('table')
    rows = scrape_table(table_el)
    return dict(rows)


def scrape_mental(soup: BeautifulSoup) -> dict:
    col_title_el = soup.find('h3', string='MENTAL')
    panel_el = col_title_el.find_parent(class_='panel')
    table_el = panel_el.find('table')
    rows = scrape_table(table_el)
    return dict(rows)


def scrape_physical(soup: BeautifulSoup) -> dict:
    col_title_el = soup.find('h3', string='PHYSICAL')
    panel_el = col_title_el.find_parent(class_='panel')
    table_el = panel_el.find('table')
    rows = scrape_table(table_el)
    return dict(rows)


def scrape_profile_page(url) -> dict:
    res = requests.get(
        url=url,
        headers={
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'
        })
    res.raise_for_status()
    soup: BeautifulSoup = BeautifulSoup(res.text, 'html.parser')
    technical = scrape_technical(soup)
    mental = scrape_mental(soup)
    physical = scrape_physical(soup)
    return {
        'technical': technical,
        'mental': mental,
        'physical': physical,
    }


if __name__ == "__main__":
    stats = scrape_profile_page('https://fmdataba.com/19/p/621/toni-kroos/')
    pprint(stats)
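
The refactoring mentioned above could look roughly like the sketch below. It assumes the same page layout the answer relies on (an h3 title inside an element with class panel that also holds the table). The name find_table_by_title comes from the answer; scrape_stats and the error messages are hypothetical additions.

```python
from bs4 import BeautifulSoup, Tag


def find_table_by_title(soup: BeautifulSoup, title: str) -> Tag:
    """Locate the table that shares a .panel ancestor with the given <h3> title."""
    heading = soup.find('h3', string=title)
    if heading is None:
        raise ValueError(f'no <h3> with text {title!r}')
    panel = heading.find_parent(class_='panel')
    if panel is None:
        raise ValueError(f'no .panel ancestor for {title!r}')
    table = panel.find('table')
    if table is None:
        raise ValueError(f'no table in the panel for {title!r}')
    return table


def scrape_stats(soup: BeautifulSoup, title: str) -> dict:
    """Return {stat name: value} for the table under the given title."""
    table = find_table_by_title(soup, title)
    rows = [[cell.text.strip() for cell in row.find_all('td')]
            for row in table.find_all('tr')]
    # Keep only name/value rows so dict-building cannot fail on odd rows
    return {row[0]: row[1] for row in rows if len(row) == 2}
```

With this in place, the three scrape_* functions reduce to calls such as scrape_stats(soup, 'MENTAL').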

abdusco 8515

Processing HTML tables with Python and Selenium

Hello! In today's article we will look at how to parse an HTML table using Python and Selenium WebDriver. First, let's create an HTML file with an example table.

A basic HTML table

Language	Rating
Python	10
JavaScript	6

If everything was done correctly, the table should appear in the browser.
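
The markup of the example file did not survive on this page, so here is a minimal sketch that writes an equivalent file from Python. The HTML below is an assumption reconstructed from the rendered table. The header row uses td cells because the article's script counts columns with tr[1]/td, and no tbody is written because the browser adds that element itself when it parses a table. The function name write_example_table is hypothetical.

```python
from pathlib import Path

# Assumed markup: reconstructed from the rendered table, not the author's file
TABLE_HTML = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>A basic HTML table</title>
</head>
<body>
  <table border="1">
    <tr><td>Language</td><td>Rating</td></tr>
    <tr><td>Python</td><td>10</td></tr>
    <tr><td>JavaScript</td><td>6</td></tr>
  </table>
</body>
</html>
"""


def write_example_table(path: Path) -> str:
    """Write the example page and return a file:// URL that a browser can open."""
    path.write_text(TABLE_HTML, encoding="utf-8")
    return path.resolve().as_uri()
```

The returned string already carries the file:/// prefix, so it can be passed straight to driver.get.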

Next, download the Selenium web driver for Firefox from https://github.com/mozilla/geckodriver/releases/. It is called geckodriver. Download the archive and unpack it.

# Import the driver modules
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the driver instance. The original line, which passed the path to the
# webdriver exe file, was lost. Assumption: geckodriver is on PATH (otherwise
# pass its path via selenium.webdriver.firefox.service.Service(executable_path=...))
driver = webdriver.Firefox()

# link to the HTML table;
# the file:/// prefix is required
link2 = "file:///C:/Users/УЗИ/Desktop/Таблица.html"

try:
    # open the link in the browser
    driver.get(link2)

    # count the rows in the table
    rows = len(driver.find_elements(by=By.XPATH, value='/html/body/table/tbody/tr'))
    # count the columns
    cols = len(driver.find_elements(by=By.XPATH, value='/html/body/table/tbody/tr[1]/td'))

    # iterate over the table rows and columns
    # (XPath indices are 1-based; starting at row 2 skips the header row)
    for r in range(2, rows + 1):
        for c in range(1, cols + 1):
            value = driver.find_element(by=By.XPATH, value='/html/body/table/tbody/tr[' + str(r) + ']/td[' + str(c) + ']').text
            print(value, end=' \n')
finally:
    # keep the browser open for 30 seconds before closing
    time.sleep(30)
    # always quit the webdriver
    driver.quit()

Our example uses locators such as /html/body/table/tbody/tr. To obtain one, open the browser's developer tools, select the table cell element, then right-click it and copy its XPath.

The webdriver uses these locators to find the elements. The find_elements method finds all elements matching the given locator and returns them as a list. The length of that list is then computed with the len function.

To parse the table, the computed values are passed to the for loops, where they are substituted into the locator as tr['+str(r)+']/td['+str(c)+']. The text property then returns the text contained in that table cell.
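
The locator substitution can be tried without a browser. The sketch below builds the same XPath strings with an f-string; the names TABLE_BODY_XPATH, cell_xpath and data_cell_xpaths are hypothetical, and the default of starting at row 2 mirrors the script's choice to skip the header row.

```python
TABLE_BODY_XPATH = "/html/body/table/tbody"


def cell_xpath(row: int, col: int, body_xpath: str = TABLE_BODY_XPATH) -> str:
    """Build the locator for one cell; XPath positions start at 1, not 0."""
    return f"{body_xpath}/tr[{row}]/td[{col}]"


def data_cell_xpaths(rows: int, cols: int, first_data_row: int = 2) -> list:
    """Locators for every data cell, row by row, skipping rows before first_data_row."""
    return [
        [cell_xpath(r, c) for c in range(1, cols + 1)]
        for r in range(first_data_row, rows + 1)
    ]


print(cell_xpath(2, 1))
```

Each resulting string can be passed as the value argument of driver.find_element(by=By.XPATH, value=...).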

In this way, you can step through even a fairly large table from any website, cell by cell.

Created 22.11.2022 12:41:13

Михаил Русаков

Copying of these materials is permitted only with attribution to the author (Михаил Русаков) and an indexable direct link to the site (http://myrusakov.ru)!
