Beautiful Soup: Python 3 and pip

2. Installing Beautiful Soup

If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

$ apt-get install python3-bs4

Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.

$ pip install beautifulsoup4

(The BeautifulSoup package is probably not what you want. That’s the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it’s still available, but if you’re writing new code you should install beautifulsoup4.)
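To check which of the two packages actually ended up installed, something like the following could help. It is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_versions is made up for this example and is not part of Beautiful Soup.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("beautifulsoup4", "BeautifulSoup")):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper: it only asks the packaging metadata and never
    imports Beautiful Soup itself.
    """
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If beautifulsoup4 shows a version and BeautifulSoup shows "not installed", you have the release this section recommends for new code.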

If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py:

$ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.

2.1. Problems after installation

Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it’s automatically converted to Python 3 code. If you don’t install the package, the code won’t be converted. There have also been reports on Windows machines of the wrong version being installed.


If you get the ImportError “No module named HTMLParser”, your problem is that you’re running the Python 2 version of the code under Python 3.

If you get the ImportError “No module named html.parser”, your problem is that you’re running the Python 3 version of the code under Python 2.

In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unzipped the tarball) and try the installation again.
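Before removing anything, it can help to see which copy of bs4 Python would actually pick up, since a leftover directory from an unzipped tarball can shadow the installed package. The sketch below uses only the standard library; locate_module is a hypothetical helper name, not something Beautiful Soup provides.

```python
import importlib.util
import sys


def locate_module(name="bs4"):
    """Return the file a top-level module would be imported from, or None.

    Hypothetical helper: find_spec() searches sys.path without running
    the module's code, so it works even when importing bs4 would fail.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("bs4 would be imported from:", locate_module("bs4"))
```

If the reported path points into your project directory or a download folder rather than site-packages, that stray copy is a likely source of the ImportError.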

If you get the SyntaxError “Invalid syntax” on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3. You can do this either by installing the package:

$ python3 setup.py install

or by manually running Python’s 2to3 conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4

2.2. Installing a parser

Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml

$ pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ pip install html5lib

This table summarizes the advantages and disadvantages of each parser library:

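Python’s html.parser, used as BeautifulSoup(markup, "html.parser")
  • Advantages: batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2.2)
  • Disadvantages: not very lenient (before Python 2.7.3 or 3.2.2)

lxml’s HTML parser, used as BeautifulSoup(markup, "lxml")
  • Advantages: very fast; lenient
  • Disadvantages: external C dependency

lxml’s XML parser, used as BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
  • Advantages: very fast; the only currently supported XML parser
  • Disadvantages: external C dependency

html5lib, used as BeautifulSoup(markup, "html5lib")
  • Advantages: extremely lenient; parses pages the same way a web browser does; creates valid HTML5
  • Disadvantages: very slow; external Python dependency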

If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib, because Python’s built-in HTML parser is just not very good in older versions.
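One way to act on that recommendation in code is to use lxml when it is installed and fall back otherwise. This is only a sketch: pick_parser is a hypothetical helper, and trying html5lib before the built-in parser is an assumption made for the example, not something the documentation prescribes.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else "html.parser".

    Hypothetical helper. The names in `preferred` double as the module
    name to look for and the string passed to BeautifulSoup().
    """
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"  # always available in the standard library


if __name__ == "__main__":
    print("Using parser:", pick_parser())
```

The result can be passed as the second argument, as in BeautifulSoup(markup, pick_parser()). Naming the parser explicitly also keeps results consistent between machines that have different parser libraries installed.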

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
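To see those differences on your own system, you could parse the same invalid fragment with every parser you have installed. The sketch below assumes Beautiful Soup 4 is importable and skips parsers that are missing; compare_parsers is a hypothetical name, and the sample markup is only an illustration.

```python
def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse `markup` with each available parser; return {parser: tree as str}.

    Hypothetical helper. Returns an empty dict if Beautiful Soup itself
    is not installed.
    """
    try:
        from bs4 import BeautifulSoup, FeatureNotFound
    except ImportError:
        return {}
    trees = {}
    for parser in parsers:
        try:
            trees[parser] = str(BeautifulSoup(markup, parser))
        except FeatureNotFound:
            continue  # this parser library is not installed
    return trees


if __name__ == "__main__":
    # "<a></p>" is invalid: the closing tag has no matching opening tag.
    for parser, tree in compare_parsers("<a></p>").items():
        print(f"{parser:12} {tree}")
```

Each line of output shows how one parser chose to repair the fragment, which makes it easier to decide which behavior your code should rely on.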


beautifulsoup4 4.12.2

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

Quick start

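>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
>>> print(soup.prettify())
<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
>>> soup.find(text="bad")
'bad'
>>> soup.i
<i>HTML</i>
#
>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
#
>>> print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<tag1>
 Some
 <tag2/>
 bad
 <tag3>
  XML
 </tag3>
</tag1>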

Note on Python 2 sunsetting

Beautiful Soup’s support for Python 2 was discontinued on December 31, 2020: one year after the sunset date for Python 2 itself. From this point onward, new Beautiful Soup development will exclusively target Python 3. The final release of Beautiful Soup 4 to support Python 2 was 4.9.3.
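If you still have to support a Python 2 environment, the version cutoff above can be turned into a requirement string. This is a minimal sketch: bs4_requirement is a hypothetical helper, and the only facts it relies on are the package name and the 4.9.3 cutoff stated in this note.

```python
import sys

LAST_PY2_RELEASE = "4.9.3"  # final release supporting Python 2, per the note above


def bs4_requirement(version_info=sys.version_info):
    """Return a pip requirement string suited to the given interpreter version.

    Hypothetical helper: caps the version on Python 2 and leaves it
    unconstrained on Python 3.
    """
    if version_info[0] < 3:
        return "beautifulsoup4<=" + LAST_PY2_RELEASE
    return "beautifulsoup4"


if __name__ == "__main__":
    print(bs4_requirement())
```

On a Python 3 interpreter this prints the unconstrained package name, so pip installs the current release.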

Supporting the project

If you use Beautiful Soup as part of your professional work, please consider a Tidelift subscription. This will support many of the free software projects your organization depends on, not just Beautiful Soup.

If you use Beautiful Soup for personal projects, the best way to say thank you is to read Tool Safety, a zine I wrote about what Beautiful Soup has taught me about software development.

Building the documentation

The bs4/doc/ directory contains full documentation in Sphinx format. Run make html in that directory to create HTML documentation.

Running the unit tests

Beautiful Soup supports unit test discovery using Pytest:

$ pytest
