JSoup Example

Contents

Convert HTML into Plain Text in Java using jsoup
Add jsoup library to your Java project
Convert HTML String into Plain Text
Convert HTML from Website into Plain Text
Convert HTML File into Plain Text
3 Examples of Parsing HTML File in Java using Jsoup
What is JSoup Library?
HTML Parsing in Java using JSoup
Java Program to parse HTML Document

Convert HTML into Plain Text in Java using jsoup

In this tutorial, we show how to use the jsoup library to convert HTML content into plain text, with all HTML tags removed, in a Java application.

Add jsoup library to your Java project

To use the jsoup library in a Gradle project, add the following dependency to the build.gradle file. Use the implementation configuration; the older compile configuration was removed in Gradle 7.

    implementation 'org.jsoup:jsoup:1.13.1'

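To use the jsoup library in a Maven project, add the following dependency to the pom.xml file.

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.13.1</version>
    </dependency>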

To download the jsoup-1.13.1.jar file directly, visit the jsoup download page at jsoup.org/download

Convert HTML String into Plain Text

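The Java application below uses the Jsoup.clean() method with an empty Whitelist to remove the HTML tags from an HTML string and return its text content. Note that newer jsoup releases (1.14.1 and later) rename Whitelist to Safelist; the code here matches version 1.13.1.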

    import org.jsoup.Jsoup;
    import org.jsoup.safety.Whitelist;

    public class ConvertHtmlToText {
        public static void main(String... args) {
            // Sample input; the exact tags are illustrative
            String htmlString = "<h1>Simple Solution</h1><p>Convert HTML to Text</p>";
            // An empty Whitelist allows no tags, so only the text survives
            String outputText = Jsoup.clean(htmlString, new Whitelist());
            System.out.println(outputText);
        }
    }

Output:

    Simple SolutionConvert HTML to Text
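Jsoup.clean() returns an HTML fragment, so a character such as & comes back escaped as &amp;. As a minimal sketch of an alternative, assuming the goal is readable text rather than sanitized markup, the string can be parsed and read back with Document.text(). The class and method names (HtmlTextExtractor, toPlainText) are hypothetical and exist only for this example.

```java
import org.jsoup.Jsoup;

class HtmlTextExtractor {

    // Parses the HTML and returns the combined, whitespace-normalized text
    // of the document, with entities such as &amp; decoded.
    static String toPlainText(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        // Illustrative input, not taken from the tutorial
        String html = "<h1>Tom &amp; Jerry</h1><p>Convert <b>HTML</b> to Text</p>";
        System.out.println(toPlainText(html));
    }
}
```

Unlike the Jsoup.clean() output above, text() puts a space between the text of adjacent block elements such as h1 and p.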

Convert HTML from Website into Plain Text

In the following example program, we combine Jsoup.clean() with the Jsoup.connect() method provided by the jsoup library to download HTML content from a URL and then remove the HTML tags.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.safety.Whitelist;

    import java.io.IOException;

    public class ConvertHtmlToTextFromUrl {
        public static void main(String... args) {
            try {
                String url = "https://simplesolution.dev/";
                // Download and parse the page, then serialize it back to HTML
                Document document = Jsoup.connect(url).get();
                String htmlString = document.html();
                // Remove every tag, keeping only the text
                String outputText = Jsoup.clean(htmlString, new Whitelist());
                System.out.println(outputText);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Convert HTML File into Plain Text

The following examples show how to read HTML content from a file and remove the HTML tags. Suppose we have a sample.html file with the following content.

    <html>
    <body>
    <span class="test">Simple Solution</span>
    </body>
    </html>

Example 1: read the file content using NIO classes.

    import org.jsoup.Jsoup;
    import org.jsoup.safety.Whitelist;

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class ConvertHtmlToTextFromFile1 {
        public static void main(String... args) {
            try {
                String fileName = "sample.html";
                Path filePath = Paths.get(fileName);
                // Read the whole file and decode it as UTF-8
                byte[] fileBytes = Files.readAllBytes(filePath);
                String htmlString = new String(fileBytes, StandardCharsets.UTF_8);
                String outputText = Jsoup.clean(htmlString, new Whitelist());
                System.out.println(outputText);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Example 2: read the HTML file using the Jsoup.parse() method.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.safety.Whitelist;

    import java.io.File;
    import java.io.IOException;

    public class ConvertHtmlToTextFromFile2 {
        public static void main(String... args) {
            try {
                String fileName = "sample.html";
                File file = new File(fileName);
                // Let jsoup read and decode the file itself
                Document document = Jsoup.parse(file, "UTF-8");
                String htmlString = document.html();
                String outputText = Jsoup.clean(htmlString, new Whitelist());
                System.out.println(outputText);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }


3 Examples of Parsing HTML File in Java using Jsoup

HTML is the core of the web: every page you see on the internet is HTML, whether it is static or generated dynamically by JavaScript, JSP, PHP, ASP, or any other web technology. Your browser parses that HTML and renders it for you. But what do you do if you need to parse an HTML document from a Java program, to find certain elements, tags, or attributes, or to check whether a particular element exists? If you have been programming in Java for some years, I am sure you have done some XML parsing with parsers like DOM and SAX, but there is a good chance you have never parsed HTML. That is because a core Java application, one that doesn’t involve Servlets and other Java web technologies, rarely needs to parse HTML documents.

To make matters worse, the core JDK has no modern, general-purpose HTML parser. It can download a page over HTTP (java.net.HttpURLConnection, or java.net.http.HttpClient since Java 11), but it gives little help with the markup itself. That’s why, when it comes to parsing an HTML file, many Java programmers have to turn to Google to find out how to get the value of an HTML tag in Java.

When I needed that, I was sure there would be an open-source library to do it for me, but I didn’t know it would be as wonderful and feature-rich as JSoup. It not only reads and parses HTML documents but also lets you extract any element from an HTML file, along with its attributes and CSS class, using jQuery-style selectors, and lets you modify them.

You can do almost anything with an HTML document using Jsoup. In this article, we will parse an HTML file and find the value of the title and heading tags. We will also see examples of loading and parsing HTML from a local file and from a URL, by parsing Google’s home page in Java.

What is JSoup Library?

Jsoup is an open-source Java library for parsing and working with HTML. Its main features:
Jsoup can scrape and parse HTML from a URL, file, or string
Jsoup can find and extract data, using DOM traversal or CSS selectors
Jsoup allows you to manipulate the HTML elements, attributes, and text
Jsoup can clean user-submitted content against a safe whitelist, to prevent XSS attacks
Jsoup also outputs tidy HTML
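As a rough illustration of the extraction and manipulation features listed above, the following sketch parses a small HTML string, pulls out data with a CSS selector, and then modifies an attribute. The markup, the class name FeatureSketch, and the helper methods are made up for this example.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FeatureSketch {

    // Hypothetical sample markup, made up for this example.
    static final String HTML =
            "<div id=\"langs\">"
            + "<a class=\"jvm\" href=\"/java\">Java</a>"
            + "<a class=\"jvm\" href=\"/scala\">Scala</a>"
            + "<a href=\"/python\">Python</a>"
            + "</div>";

    // Extraction: the text of every link that carries the CSS class "jvm".
    static List<String> jvmLanguages(Document doc) {
        return doc.select("a.jvm").eachText();
    }

    // Manipulation: add rel="nofollow" to every link and
    // return the number of links that were touched.
    static int markNoFollow(Document doc) {
        return doc.select("a[href]").attr("rel", "nofollow").size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(jvmLanguages(doc));
        System.out.println(markNoFollow(doc));
        // Serialize the modified element back to HTML
        System.out.println(doc.getElementById("langs").outerHtml());
    }
}
```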

HTML Parsing in Java using JSoup

In this Java HTML parsing tutorial, we will see three different examples of parsing and traversing HTML documents in Java using jsoup. In the first example, we will parse an HTML string whose tags are all written out as a Java String literal.

In the second example, we will download our HTML document from the web, and in the third example, we will load our own sample HTML file, login.html, for parsing. This file is a sample HTML document that contains a title tag and a div in the body, which in turn contains an HTML form.

The form has input tags to capture the username and password, plus submit and reset buttons for further action. It is proper HTML that can be validated, i.e. all tags and attributes are properly closed. Here is what our sample HTML file looks like:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    <title>Login Page</title>
    </head>
    <body>
    <div id="login" class="simple">
    <form action="login.do">
    Username : <input id="username" type="text" /><br>
    Password : <input id="password" type="password" /><br>
    <input id="submit" type="submit" />
    <input id="reset" type="reset" />
    </form>
    </div>
    </body>
    </html>

HTML parsing is very simple with Jsoup: all you need to do is call the static method Jsoup.parse() and pass your HTML String to it. JSoup provides several overloaded parse() methods to read HTML from a String (optionally with a base URI), a File, a URL, or an InputStream.

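You can also specify the character encoding, so that HTML files that are not in "UTF-8" are read correctly. The parse methods in the JSoup library include parse(String html), parse(String html, String baseUri), parse(File in, String charsetName), parse(File in, String charsetName, String baseUri), parse(InputStream in, String charsetName, String baseUri), and parse(URL url, int timeoutMillis).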
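A minimal sketch of two of these overloads, one taking a base URI and one taking an InputStream with an explicit character encoding. The class ParseOverloads, its helper methods, and the sample markup are assumptions made for illustration; only the Jsoup.parse() calls come from the library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseOverloads {

    // parse(String html, String baseUri): the base URI lets jsoup
    // resolve relative links into absolute ones.
    static String absoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("a").first().absUrl("href");
    }

    // parse(InputStream in, String charsetName, String baseUri):
    // decode the bytes as ISO-8859-1 instead of relying on a default.
    static String textFromLatin1(byte[] latin1Bytes) {
        try (InputStream in = new ByteArrayInputStream(latin1Bytes)) {
            return Jsoup.parse(in, "ISO-8859-1", "").text();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                absoluteLink("<a href=\"/download\">Download</a>", "https://jsoup.org/"));

        // \u00e9 is a single byte in ISO-8859-1 but two bytes in UTF-8
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textFromLatin1(bytes));
    }
}
```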

The parse(String html) method parses the input HTML into a new Document. In Jsoup, Document extends Element, which extends Node. TextNode also extends Node. As long as you pass in a non-null string, you’re guaranteed to have a successful, sensible parse, with a Document containing (at least) a head and a body element.

Once you have a Document, you can get the data you want by calling the appropriate methods in Document and its parent classes Element and Node.
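For instance, here is a small sketch that queries a shortened, inline copy of the login page described earlier. The class LoginPageQuery and its helper methods are hypothetical; the calls on Document and Element (title(), getElementById(), className(), select(), attr()) are part of jsoup.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LoginPageQuery {

    // Shortened inline version of the sample login page (assumption:
    // the real file may contain more markup than this).
    static final String LOGIN_HTML =
            "<html><head><title>Login Page</title></head><body>"
            + "<div id=\"login\" class=\"simple\"><form action=\"login.do\">"
            + "Username : <input id=\"username\" type=\"text\" /><br>"
            + "Password : <input id=\"password\" type=\"password\" /><br>"
            + "</form></div></body></html>";

    // Element.attr(): read the action attribute of the form inside the login div.
    static String formAction(Document doc) {
        Element form = doc.select("div#login form").first();
        return form == null ? "" : form.attr("action");
    }

    // Elements.eachAttr(): collect the id of every input inside a form.
    static List<String> inputIds(Document doc) {
        return doc.select("form input").eachAttr("id");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(LOGIN_HTML);
        System.out.println("title  : " + doc.title());
        System.out.println("class  : " + doc.getElementById("login").className());
        System.out.println("action : " + formAction(doc));
        System.out.println("inputs : " + inputIds(doc));
    }
}
```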

Java Program to parse HTML Document

Here is our complete Java program to parse an HTML String, an HTML file downloaded from the internet, and an HTML file from the local file system. To run this program, you can use the Eclipse IDE, any other IDE, or the command prompt. In Eclipse, it’s very easy: copy this code, create a new Java project, right-click on the src folder, and paste it.

Eclipse will take care of creating the proper package and a Java source file with the same name, so there is very little work to do. If you already have a sample Java project, then it’s just one step. The following Java program shows three examples of parsing and traversing an HTML file.

In the first example, we directly parse a String with HTML content; in the second example, we parse an HTML file downloaded from a URL; and in the third example, we load and parse an HTML document from the local file system.

In the first and third examples, we use the parse() method to get a Document object, which can be queried to extract any tag value or attribute value. In the second example, we use Jsoup.connect() with get(), which takes care of making a connection to the URL, downloading the HTML, and parsing it. This also returns a Document object, which can be used for further querying and for getting the value of any tag or attribute.

    import java.io.File;
    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    /**
     * Java Program to parse/read HTML documents from File using Jsoup library.
     * Jsoup is an open source library which allows Java developer to parse HTML
     * files and extract elements, manipulate data, change style using DOM, CSS and
     * JQuery like method.
     *
     * @author Javin Paul
     */
    public class HTMLParser {

        public static void main(String[] args) {

            // JSoup Example 1 - Parse HTML String using JSoup library
            String htmlString = "<!DOCTYPE html>"
                    + "<html>"
                    + "<head>"
                    + "<title>JSoup Example</title>"
                    + "</head>"
                    + "<body>"
                    + "<table><tr><td><h1>HelloWorld</h1></tr>"
                    + "</table>"
                    + "</body>"
                    + "</html>";

            Document html = Jsoup.parse(htmlString);
            String title = html.title();
            String h1 = html.body().getElementsByTag("h1").text();

            System.out.println("Input HTML String to JSoup :" + htmlString);
            System.out.println("After parsing, Title : " + title);
            System.out.println("After parsing, Heading : " + h1);

            // JSoup Example 2 - Reading HTML page from URL
            try {
                Document doc = Jsoup.connect("http://google.com/").get();
                title = doc.title();
                System.out.println("Jsoup Can read HTML page from URL, title : " + title);
            } catch (IOException e) {
                e.printStackTrace();
            }

            // JSoup Example 3 - Parsing an HTML file in Java
            // Wrong: Jsoup.parse("login.html", "ISO-8859-1") treats the first
            // argument as the HTML itself and the second as the base URI.
            // Right: pass a File, so jsoup reads it with the given charset.
            try {
                Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
                title = htmlFile.title();
                Element div = htmlFile.getElementById("login");
                String cssClass = div.className(); // getting class from HTML element

                System.out.println("Jsoup can also parse HTML file directly");
                System.out.println("title : " + title);
                System.out.println("class of div tag : " + cssClass);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Output:

    Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
    After parsing, Title : JSoup Example
    After parsing, Heading : HelloWorld
    Jsoup Can read HTML page from URL, title : Google
    Jsoup can also parse HTML file directly
    title : Login Page
    class of div tag : simple

A good thing about JSoup is that it is very robust. The Jsoup HTML parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It can handle the following mistakes:
unclosed tags (e.g. <p>Java <p>Scala to <p>Java</p> <p>Scala</p>)
implicit tags (e.g. a naked <td>Java is Great</td> is wrapped into a <table><tr><td>)
reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
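The unclosed-tag handling and the guaranteed document structure can be checked with a short sketch like the one below. The class name LenientParsing and its helper methods are hypothetical, and the input strings are chosen only to mirror the cases in the list above.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LenientParsing {

    // Returns the text of each <p> element that jsoup builds from the input,
    // even when the paragraph tags were never closed.
    static List<String> paragraphTexts(String html) {
        return Jsoup.parse(html).select("p").eachText();
    }

    // Checks that the parsed document has both a head and a body element.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        // Two unclosed paragraphs become two separate <p> elements
        System.out.println(paragraphTexts("<p>Java<p>Scala"));
        // Even an empty string yields a document with head and body
        System.out.println(hasHeadAndBody(""));
    }
}
```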

That’s all about how to parse an HTML document in Java. Jsoup is an excellent and robust open-source library that makes it extremely easy to read an HTML document, a body fragment, or an HTML string, and to parse HTML content directly from the web.

In this article, we learned how to get the value of a particular HTML tag in Java: in the first example, we extracted the title and the value of the H1 tag as text, and in the third example, we learned how to get the value of an attribute from an HTML tag by extracting the CSS class.

Apart from the powerful jQuery-style html.body().getElementsByTag("h1").text() call, which you can use to extract any HTML tag, it also provides convenience methods like Document.title() and Element.className() to quickly get the title and CSS class. Have fun with Jsoup, and we will see a couple more examples of this API soon.
