Multithreaded Python Selenium scraper

wooddar / multiprocess_selenium.py


"""
This is an adaptable example script for using selenium across multiple web browsers simultaneously. This makes use of
two queues: one to store idle web workers and another to store data to pass to any idle web worker in a selenium function
"""
from multiprocessing import Queue, cpu_count
from threading import Thread
from selenium import webdriver
from time import sleep
from numpy.random import randint
import logging

logger = logging.getLogger(__name__)

# Some example data to pass to the selenium processes; each item i just causes a sleep of i seconds
# This data can be a list of any datatype that can be pickled
selenium_data = [4, 2, 3, 3, 4, 3, 4, 3, 1, 2, 3, 2, 'STOP']

# Create the two queues to hold the data and the IDs for the selenium workers
selenium_data_queue = Queue()
worker_queue = Queue()

# Create Selenium workers and assign them a worker ID
# This ID is what needs to be put on the queue, as Selenium workers cannot be pickled
# By default, make one selenium worker per cpu core with cpu_count
# TODO: Change the worker creation code to be your webworker of choice e.g. PhantomJS
worker_ids = list(range(cpu_count()))
selenium_workers = {i: webdriver.PhantomJS() for i in worker_ids}
for worker_id in worker_ids:
    worker_queue.put(worker_id)


def selenium_task(worker, data):
    """
    This is a demonstration selenium function that takes a worker and data and then does something with the worker and
    data.
    TODO: change the below code to be whatever it is you want your worker to do e.g. scrape webpages or run browser tests
    :param worker: A selenium web worker NOT a worker ID
    :type worker: webdriver.XXX
    :param data: Any data for your selenium function (must be pickleable)
    :rtype: None
    """
    worker.set_window_size(randint(100, 200), randint(200, 400))
    logger.info("Getting page")
    worker.get('https://ytroulette.com')
    logger.info("Sleeping")
    sleep(data)


def selenium_queue_listener(data_queue, worker_queue):
    """
    Monitor a data queue and assign new pieces of data to any available web workers to action
    :param data_queue: The python FIFO queue containing the data to run on the web worker
    :type data_queue: Queue
    :param worker_queue: The queue that holds the IDs of any idle workers
    :type worker_queue: Queue
    :rtype: None
    """
    logger.info("Selenium func worker started")
    while True:
        current_data = data_queue.get()
        if current_data == 'STOP':
            # If a stop is encountered then kill the current worker and put the stop back onto the queue
            # to poison other workers listening on the queue
            logger.warning("STOP encountered, killing worker thread")
            data_queue.put(current_data)
            break
        else:
            logger.info(f"Got the item {current_data} on the data queue")
            # Get the ID of any currently free worker from the worker queue
            worker_id = worker_queue.get()
            worker = selenium_workers[worker_id]
            # Assign the current worker and current data to your selenium function
            selenium_task(worker, current_data)
            # Put the worker back into the worker queue as it has completed its task
            worker_queue.put(worker_id)
    return


# Create one new queue listener thread per selenium worker and start them
logger.info("Starting selenium background processes")
selenium_processes = [Thread(target=selenium_queue_listener,
                             args=(selenium_data_queue, worker_queue)) for _ in worker_ids]
for p in selenium_processes:
    p.daemon = True
    p.start()

# Add each item of data to the data queue; this could be done over time so long as the selenium
# queue listening threads are still running
logger.info("Adding data to data queue")
for d in selenium_data:
    selenium_data_queue.put(d)

# Wait for all selenium queue listening threads to complete; this happens when the queue listener returns
logger.info("Waiting for Queue listener threads to complete")
for p in selenium_processes:
    p.join()

# Quit all the web workers elegantly in the background
logger.info("Tearing down web workers")
for b in selenium_workers.values():
    b.quit()

Source

How do I run Selenium scraping in multithreaded mode?

I can't figure out how to run webdriver in several threads.

I want to scrape two sites at the same time. I have a list of EAN codes; the single-threaded version looks like this:

import re
import time
from datetime import datetime
from selenium import webdriver

# getFileEan, labirintBookState, startTime and the *_xpath variables
# are defined elsewhere in the asker's project

def labirint(eanlist):
    pricelist = []
    for ean in eanlist:
        try:
            driver.get("http://www.labirint.ru/search/" + ean + "/?labsearch=1")
            time.sleep(1)
            labirintBookState(driver)
            if driver.find_element_by_xpath(labirint_xpath_state).is_displayed():
                x = driver.find_element_by_xpath(labirint_xpath)
                price_int = int(x.text)
                pricelist.append(price_int)
            else:
                pricelist.append("")
        except Exception:
            pricelist.append("")
    return pricelist

def chitayGorod(eanlist):
    chitay_gorod_pricelist = []
    for ean in eanlist:
        try:
            driver.get("https://www.chitai-gorod.ru/search/result/?q=" + ean + "&page=1")
            time.sleep(1)
            if driver.find_element_by_xpath(chitay_gorod_xpath).is_displayed():
                price = driver.find_element_by_xpath(chitay_gorod_xpath)
                price_int = int(re.search(r'\d+', price.text).group())
                chitay_gorod_pricelist.append(price_int)
            else:
                chitay_gorod_pricelist.append("")
        except Exception:
            chitay_gorod_pricelist.append("")
    return chitay_gorod_pricelist

option = webdriver.ChromeOptions()
chrome_prefs = {}
option.experimental_options["prefs"] = chrome_prefs
# the values of the next two settings were lost when the post was copied
chrome_prefs["profile.default_content_settings"] = {}
chrome_prefs["profile.managed_default_content_settings"] = {}
driver = webdriver.Chrome(executable_path=r'C:\priceUpdater\ChromeDriver\chromedriver.exe',
                          chrome_options=option)

if __name__ == "__main__":
    list_ean = getFileEan()
    labirint = labirint(list_ean)
    chitayGorod = chitayGorod(list_ean)
    print(datetime.now() - startTime)
    print("ok!")

The functions loop over the sites and collect the data.

How can I make two driver processes run simultaneously, one scraping chitayGorod and the other labirint?
In the end I would like the following: two drivers started in parallel, each scraping its own site. I want to start two processes, each hosted on its own CPU core. Is that possible? And if so, how?
I tried what I could find online on this topic and found nothing suitable.
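A hedged sketch of one way to do what the question asks: run each scraper in its own process with multiprocessing. The key constraint is that a webdriver instance cannot be shared across (or pickled into) processes, so each scraper function must create and quit its own driver inside its body instead of using the module-level `driver`. The `run_parallel` helper and its names are illustrative, not from the original post:

```python
from multiprocessing import Process, Queue

def _run_scraper(name, scraper, eanlist, result_queue):
    # Runs in the child process. The scraper must build its own
    # webdriver internally, since driver objects cannot be pickled.
    result_queue.put((name, scraper(eanlist)))

def run_parallel(scrapers, eanlist):
    """Run each scraper function in its own process; return {name: result}."""
    result_queue = Queue()
    procs = []
    for name, scraper in scrapers.items():
        p = Process(target=_run_scraper, args=(name, scraper, eanlist, result_queue))
        p.start()
        procs.append(p)
    # Drain the queue before joining so a child never blocks on a full pipe
    results = dict(result_queue.get() for _ in procs)
    for p in procs:
        p.join()
    return results
```

With the question's functions this would be called as `run_parallel({"labirint": labirint, "chitai_gorod": chitayGorod}, list_ean)`. Whether each process lands on its own core is up to the OS scheduler, but two busy processes on a multi-core machine will normally run on separate cores.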

Source

Multithreading or Multiprocessing with Python and Selenium

Multithreading and multiprocessing are two popular approaches for improving the performance of a program by allowing it to run tasks in parallel. These approaches can be particularly useful when working with Python and Selenium, as they allow you to perform multiple actions simultaneously, such as automating the testing of a web application. In this blog, we will discuss the differences between multithreading and multiprocessing, and provide examples of how to implement these approaches using Python and Selenium.

Key Terms in Multithreading or Multiprocessing

Before diving into the specifics of multithreading and multiprocessing with Python and Selenium, it is important to understand the fundamental differences between these approaches.

  • Multithreading: Multithreading is the ability of a central processing unit (CPU) (or a single core in a multi-core processor) to provide multiple threads of execution concurrently, supported by the operating system. This allows a program to run multiple threads concurrently, with each thread running a separate task. In Python, the threading module provides support for multithreading.
  • Multiprocessing: Multiprocessing is the ability to execute multiple concurrent processes within a system. Unlike multithreading, which allows multiple threads to run on a single CPU, multiprocessing allows a program to run multiple processes concurrently, each on a separate CPU or core. In Python, the multiprocessing module provides support for multiprocessing.
    It is important to note that multithreading and multiprocessing are not mutually exclusive, and it is possible to use both approaches in a single program. However, there are some key differences to consider when deciding which approach is best for a given task.
  • Performance: In Python, multiprocessing is generally the better choice for CPU-bound work, as it lets a program take full advantage of multiple CPU cores. Multithreading can still be the right choice when a program is I/O bound (i.e., spends most of its time waiting for input/output operations such as network requests to complete) rather than CPU bound.
  • Shared state: One of the major differences between multithreading and multiprocessing is the way they handle shared state. In multithreading, threads share the same memory space, which means they can access and modify shared variables directly. In contrast, processes in multiprocessing do not share memory and must communicate through interprocess communication (IPC) mechanisms such as pipes, queues, or shared memory.
  • Concurrency: Both multithreading and multiprocessing allow a program to execute tasks concurrently, but they differ in how much true parallelism they deliver. In CPython, the global interpreter lock (GIL) allows only one thread to execute Python bytecode at a time, so threads run concurrently but not in parallel on multiple cores. Multiprocessing sidesteps the GIL by running separate interpreter processes, each with its own GIL, which can execute in parallel on separate cores.
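The shared-state difference above is easy to demonstrate without Selenium. In this small sketch, threads append into a list they share with the main thread, while processes have to send their results back through a `multiprocessing.Queue`:

```python
from multiprocessing import Process, Queue
from threading import Thread

def square_with_threads(numbers):
    # Threads share the process's memory, so they can all append
    # into the same results list directly.
    results = []
    threads = [Thread(target=lambda n=n: results.append(n * n)) for n in numbers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

def _square_worker(n, queue):
    queue.put(n * n)  # no shared memory: send the result back over IPC

def square_with_processes(numbers):
    # Processes each get their own memory space; results travel
    # back to the parent through a Queue (a pipe under the hood).
    queue = Queue()
    procs = [Process(target=_square_worker, args=(n, queue)) for n in numbers]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]  # drain before joining
    for p in procs:
        p.join()
    return sorted(results)

if __name__ == "__main__":
    print(square_with_threads([1, 2, 3, 4]))    # [1, 4, 9, 16]
    print(square_with_processes([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

Both functions compute the same answer; the difference is purely in how the results get back to the caller.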

Difference between thread and processes

In computer programming, a process is an instance of a program that is executed on a computer. It has its own memory space and runs independently of other processes. A thread, on the other hand, is a small unit of execution within a process. A process can contain multiple threads, which can run concurrently, allowing the process to perform multiple tasks at the same time.

One key difference between processes and threads is that each process has its own memory space, while threads share the memory space of the process in which they are running. This means that threads can access and modify data in the shared memory space, while processes cannot access the memory of other processes.

Another difference is that creating a new process requires the operating system to allocate additional resources, such as memory and processing power, while creating a new thread is less resource-intensive.

In the context of web scraping, processes may be used to perform tasks that are independent of each other, such as scraping data from different websites. Threads, on the other hand, may be used to perform tasks that are related to each other within a single process, such as making multiple requests to a single website.

Here are a couple of examples to illustrate the difference between processes and threads:

Example 1: A web scraper that needs to scrape data from multiple websites could use a separate process for each website. This would allow the scraper to run multiple processes concurrently, making it more efficient.

Example 2: A web scraper that needs to scrape data from a single website could use threads to make multiple requests to the website concurrently. This would allow the scraper to scrape the data more quickly, as the threads can work in parallel.

Overall, the choice between using processes or threads will depend on the specific needs of the web scraping project and the resources available on the machine.

Steps needed

Now that we have a basic understanding of the differences between multithreading and multiprocessing, let’s take a look at how to implement these approaches using Python and Selenium.

To get started, you will need to install Python and Selenium. If you don’t already have these tools installed, you can follow the instructions on the Python and Selenium tutorials.

Once you have Python and Selenium installed, you can start using these tools to implement multithreading or multiprocessing in your program.

Multithreading

To implement multithreading with Python and Selenium, we can use the Thread class from the threading module.

Here is an example of how to use multithreading to scrape a list of URLs using Selenium:
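The example itself is missing from this copy of the post, so what follows is a plausible reconstruction under stated assumptions: a fixed pool of worker threads pulling URLs from a queue, each owning its own driver (webdriver instances are not thread-safe and must never be shared between threads). The `driver_factory` parameter and the headless Chrome options are additions for illustration, not from the original post:

```python
from queue import Queue, Empty
from threading import Thread

def make_chrome_driver():
    # Assumed default factory: one headless Chrome per worker thread.
    from selenium import webdriver
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)

def scrape_titles(urls, num_workers=2, driver_factory=make_chrome_driver):
    """Scrape the <title> of every URL using num_workers threads."""
    url_queue = Queue()
    for url in urls:
        url_queue.put(url)
    results = {}  # per-key writes from threads are safe under the GIL

    def worker():
        driver = driver_factory()  # each thread owns exactly one driver
        try:
            while True:
                try:
                    url = url_queue.get_nowait()
                except Empty:
                    break  # no URLs left, let this worker finish
                driver.get(url)
                results[url] = driver.title
        finally:
            driver.quit()

    threads = [Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the heavy work here is network and browser I/O, threads give a real speedup despite the GIL; for CPU-heavy post-processing of the scraped pages, the multiprocessing approach discussed earlier would be the better fit.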

Source
