Selenium сохранить страницу html

Save a Web Page with Python Selenium

At this point, how can I send a ‘Save Page As’ command to the browser?

Note: It is not the web-page source that I am interested in. I would like to save the page using the actual ‘Save Page As’ Firefox command, which yields different results than saving the web-page source.

3 Answers 3

Unfortunately you can’t do what you would like to do with Selenium. You can use page_source to get the html but that is all that you would get.

Selenium unfortunately can’t interact with the Dialog that is given to you when you do save as.

You can do the following to get the dialog up but then you will need something like AutoIT to finish it off

from selenium.webdriver.common.action_chains import ActionChains saveas = ActionChains(driver).key_down(Keys.CONTROL)\ .send_keys('s').key_up(Keys.CONTROL) saveas.perform() 

I had a similar problem and solved it recently:

@AutomatedTester gave a decent answer, but his answer did not solve the problem all the way, you still need to press Enter one more time yourself to finish the job.

Therefore, we need Python to do press that one more Enter for us.

Follow @NoctisSkytower ‘s answer in the following thread:

copy his definition for classes and then add the following to @AutomatedTester ‘s answer:

SendInput(Keyboard(VK_RETURN)) time.sleep(0.2) SendInput(Keyboard(VK_RETURN, KEYEVENTF_KEYUP)) 

you may also want to check out the following link:

You may encounter pop-up window and this thread will tell you want to do.

Источник

Save a Web Page with Python Selenium

We can save a webpage with Selenium webdriver in Python. To save a page we shall first obtain the page source behind the webpage with the help of the page_source method.

We shall open a file with a particular encoding with the codecs.open method. The file has to be opened in the write mode represented by w and encoding type as utf−8. Then use the write method to write the content obtained from the page_source method.

Syntax

n = os.path.join("C:\Users\ghs6kor\Downloads\Test", "PageSave.html") f = codecs.open(n, "w", "utf−8") h = driver.page_source f.write(h)

Let us make an attempt to save the below webpage −

Example

from selenium import webdriver import codecs #set chromedriver.exe path driver = webdriver.Chrome(executable_path="C:\chromedriver.exe") driver.implicitly_wait(0.5) #maximize browser driver.maximize_window() #launch URL driver.get("https://www.tutorialspoint.com/index.htm") #get file path to save page n=os.path.join("C:\Users\ghs6kor\Downloads\Test","Page.html") #open file in write mode with encoding f = codecs.open(n, "w", "utf−8") #obtain page source h = driver.page_source #write page source content to file file.write(h) #close browser driver.quit()

Output

On opening the Page.html file in a browser.

Читайте также:  Несколько фоновых изображений css

Источник

Using Selenium in Python to save a webpage on Firefox

I end up not using the SAVE AS method and to solve the html-file and csv-file writing problems, I used codecs and unicodecsv. Refer to RemcoW’s comment and this post stackoverflow.com/questions/18766955/… for details.

4 Answers 4

with open('page.html', 'w') as f: f.write(driver.page_source) 

Note that driver.page_source can crash with pages larger than 200MB in most webdrivers. For huge pages, using ActionChains is more reliable.

What you are trying to achieve is impossible to do with Selenium. The dialog that opens is not something Selenium can interact with.

The closes thing you could do is collect the page_source which gives you the entire HTML of a single page and save this to a file.

import codecs completeName = os.path.join(save_path, file_name) file_object = codecs.open(completeName, "w", "utf-8") html = browser.page_source file_object.write(html) 

If you really need to save the entire website you should look into using a tool like AutoIT. This will make it possible to interact with the save dialog.

Thank you! I am aware of this method. However, for my webpages contain characters that prompt Unicode Encode Errors, I need to save the webpages in its original format to avoid loosing important information. An example of the Unicode Encode Errors is . ‘ascii’ codec can’t encode character u’\xf8′ in position 1: ordinal not in range(128).

Yes, it happens when I try to write the page_source to a html file. Would you know if there are any solutions for me to minimize the amount of information lost in regards to those special characters? (I intentionally don’t want to use ignore)

You cannot interact with system dialogs like save file dialog. If you want to save the page html you can do something like this:

page = driver.page_source file_ = open('page.html', 'w') file_.write(page) file_.close() 

Getting the HTML can also be accomplished by using driver.page_source . This spares the need for finding the html element an getting its outerHTML manually.

This is a complete, working example of the answer RemcoW provided:

You first have to install a webdriver, e.g. pip install selenium chromedriver_installer .

#!/usr/bin/env python # -*- coding: utf-8 -*- # core modules import codecs import os # 3rd party modules from selenium import webdriver def get_browser(): """Get the browser (a "driver").""" # find the path with 'which chromedriver' path_to_chromedriver = ('/usr/local/bin/chromedriver') browser = webdriver.Chrome(executable_path=path_to_chromedriver) return browser save_path = os.path.expanduser('~') file_name = 'index.html' browser = get_browser() url = "https://martin-thoma.com/" browser.get(url) complete_name = os.path.join(save_path, file_name) file_object = codecs.open(complete_name, "w", "utf-8") html = browser.page_source file_object.write(html) browser.close() 

Источник

Save complete web page (incl css, images) using python/selenium

I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:

from selenium import webdriver URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome' SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA' CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine # open page with selenium # (first need to download Chrome webdriver, or a firefox webdriver, etc) driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION) driver.get(URL) time.sleep(5) # enter sequence into the query field and hit 'blast' button to search seq_query_field = driver.find_element_by_id("seq") seq_query_field.send_keys(SEQUENCE) blast_button = driver.find_element_by_id("b1") blast_button.click() time.sleep(60) 

At that point I have a page that I can manually click «save as,» and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this ‘save as’ function in python/selenium but haven’t found one. The code to save the page below just saves html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.

content = driver.page_source with open('webpage.html', 'w') as f: f.write(content) 

enter image description here

I’ve also found this question/answer on SO, but the accepted answer just brings up the ‘save as’ box, and does not provide a way to click it (as two commenters point out) Is there a simple way to ‘save [full page] as’ using python? Ideally I’d prefer an answer using selenium since selenium makes the crawling part so straightforward, but I’m open to using another library if there’s a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click ‘save as’ functionality? UPDATE — Follow up question for James’ answer So I ran James’ code to generate a page.html (and associated files) and compared it to the html file I got from manually clicking save-as. The page.html saved via James’ script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that’s hidden in the manually save’d page. See attached screenshot (manually saved page on the left, script-saved page with extra formatting text shown on right). This is especially surprising to me because the raw html of the page saved by James’ script seems to indicate those fields should still be hidden. See e.g. the html below, which appears the same in both files, but the text at issue only appears in the browser-rendered page on the one saved by James’ script:

Читайте также:  Java programming codes string

  

Источник

Selenium 2: How to save a HTML page including all referenced resources (css, js, images. )?

In Selenium 2, the WebDriver object only offers a method getPageSource() which saves the raw HTML page without any CSS, JS, images etc. Is there a way to also save all referenced resources in the HTML page (similar to HtmlUnit’s HtmlPage.save() )?

3 Answers 3

I know I’m royally late with my answer, but I didn’t really find an answer for this question when I was searching myself. So I did something myself, hope I can help some people still.

using system.net; string DataDirectory = "C:\\Temp\\AutoTest\\Data\\"; string PageSourceHTML = Driver.PageSource; string[] StringSeparators = new string[] < "; string[] Result = PageSourceHTML.Split(StringSeparators, StringSplitOptions.None); string CSSFile; string FileName = "filename.html"; System.IO.File.WriteAllText(DataDirectory + FileName, PageSourceHTML); foreach(string S in Result) < if(S.Contains("stylesheet")) < CSSFile = S.Substring(28); // strip off "link rel="stylesheet" href=" CSSFile = CSSFile.Substring(0,CSSFile.Length-10); // strip off characters behind, like " />" and newline, spaces until next " > 

I’m sure it could be improved to include any unforeseen situations, but for my website in test it will always work like this.

Nope. If you can, go for HtmlUnit for this particular task.

The best you could do, I think, is Robot . Press Ctrl + S simultaneously, the confirm with Enter . It’s blind, it’s imperfect, but it’s the closest thing to your need.

Читайте также:  Php operations with time

You can use the selenium interactions to handle it.

 using OpenQA.Selenium.Interactions; 

There are a few ways to do it as well. One of the ways that I handle something like this, is to find an item central to the page, or whichever area that you wish to save, and do an actions builder.

 var htmlElement = driver.FindElement(By.XPath("//your path")); Actions action = new Actions(driver); try < action.MoveToElement(htmlElement).ContextClick(htmlElement).SendKeys("p").Build().Perform(); >catch(WebDriverException)<> 

This will simply right click on the area, and then send the key «p» which is the ‘Save Page As’ hotkey in firefox when right clicking. Another way is to have the builder send the keys.

 var htmlElement = driver.FindElement(By.Xpath("//your path")); action.MoveToElement(htmlElement); try < action.KeyDown(Keys.Control).SendKeys("S").KeyUp(Keys.Control).Build().Perform(); >catch(WebDriverException)<> 

Note that in both cases, if you leave the scope of the driver, say a windows form, then you will have to switch your case / code to handle the windows form when it pops up. Selenium will also have issues with nothing being returned after the keys are sent, so the Try Catches are there for that. If anyone has a way to work around that, it would be awesome.

Источник

Оцените статью