Overview

Jsoup is an open source Java library used to parse and extract data from HTML documents. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do. It can scrape and parse HTML from a URL, a file, or a String and build a DOM tree from it.

Example

Fetch the Google homepage, parse it to a DOM, and select all the anchor tags from it.

Document doc = Jsoup.connect("https://www.google.com").get();
Elements links = doc.select("a");

We will use the Spring Documentation Blog to showcase the features of the Jsoup library.

Download Jsoup Library

Jsoup is available in the Maven Central repository. Non-Maven users can download the JAR from the jsoup site and add it to the project classpath.

Loading

The loading phase comprises fetching the HTML and parsing it into a Document. A document can be loaded from a URL, a file, or a String.
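As a rough sketch of the two offline variants, the snippet below parses markup from a String and from a local file. The markup, the class name, and the helper names are made up for illustration, and the jsoup JAR is assumed to be on the classpath.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LoadFromStringOrFile {

    // Made-up markup used only for this illustration.
    static final String HTML =
            "<html><head><title>Sample page</title></head>"
          + "<body><p>Hello, jsoup</p></body></html>";

    // Parses markup that is already in memory.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // Parses a local file, decoding its bytes as UTF-8.
    static Document fromFile(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        Document fromString = fromString(HTML);
        System.out.println(fromString.title());

        // Write the same markup to a temporary file, then load it back.
        Path tmp = Files.createTempFile("sample", ".html");
        Files.write(tmp, HTML.getBytes(StandardCharsets.UTF_8));
        Document fromFile = fromFile(tmp.toFile());
        System.out.println(fromFile.select("p").text());
        Files.delete(tmp);
    }
}
```

Both paths end in the same Document type, so everything that follows applies regardless of where the HTML came from.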

Let’s load a Document from the Spring Documentation Blog URL:
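A minimal sketch of this step follows. It assumes the blog is served at https://spring.io/blog, which is a guess rather than a URL taken from the text; the class and helper names are also illustrative.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LoadFromUrl {

    // Assumed address of the Spring blog; substitute the page you want to load.
    static final String BLOG_URL = "https://spring.io/blog";

    // Prepares the connection; no request is sent at this point.
    static Connection prepare(String url) {
        return Jsoup.connect(url);
    }

    public static void main(String[] args) throws IOException {
        // get() sends an HTTP GET request and parses the response body into a Document.
        Document doc = prepare(BLOG_URL).get();
        System.out.println(doc.title());
    }
}
```

The call needs network access and throws an IOException when the page cannot be fetched.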

Here the .get() method stands for the type of request we want to make. We can also perform a POST, as well as the other request methods that HTTP supports.

Jsoup also supports setting the header parameters that a browser sends when it requests a URL.
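A POST could look roughly like the sketch below. The endpoint and the form field are placeholders, not a real service, so replace them with an actual form target before running it.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PostRequest {

    // Builds a POST request with one form field; nothing is sent until execute() is called.
    static Connection buildPost(String url, String key, String value) {
        return Jsoup.connect(url)
                .data(key, value)
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder endpoint and form field, for illustration only.
        Connection conn = buildPost("https://example.com/search", "q", "jsoup");
        // execute() sends the request; parse() turns the response body into a Document.
        Document doc = conn.execute().parse();
        System.out.println(doc.title());
    }
}
```

For a plain form submission, calling post() on the connection instead of method(...) followed by execute().parse() is a shorter equivalent.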
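The sketch below sets a few such headers before sending the request. The header values and the target URL are example values chosen for illustration, not settings the site requires.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BrowserLikeRequest {

    // Adds browser-style headers to the request; all values are examples.
    static Connection withHeaders(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")            // sets the User-Agent header
                .referrer("https://www.google.com")  // sets the Referer header
                .header("Accept-Language", "en-US")  // any other request header
                .timeout(10_000);                    // timeout in milliseconds
    }

    public static void main(String[] args) throws IOException {
        // Assumed URL of the Spring blog.
        Document doc = withHeaders("https://spring.io/blog").get();
        System.out.println(doc.title());
    }
}
```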

Extraction of Data

The Document select method receives a String representing the selector, using the same selector syntax as CSS or jQuery, and returns the list of matching Elements. This list can be empty but is never null.
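To make the empty-but-not-null behavior concrete, here is a small sketch that runs against made-up inline markup, so it needs no network access. The class name and helper are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectBasics {

    // Made-up markup so the example runs offline.
    static final String HTML =
            "<ul id=\"nav\">"
          + "<li><a href=\"/docs\">Docs</a></li>"
          + "<li><a href=\"/guides\">Guides</a></li>"
          + "</ul>";

    // Runs a CSS selector against the sample markup.
    static Elements select(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery);
    }

    public static void main(String[] args) {
        // Every <a> nested inside the element with id "nav".
        Elements links = select("#nav a");
        System.out.println(links.size());

        // There is no <table> in the markup: the result is an empty list, not null.
        Elements tables = select("table");
        System.out.println(tables.isEmpty());
    }
}
```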

Once we have the list of Elements, we can pick a specific element from it.

We can also iterate through all the Elements and handle each Element separately.

In addition, we can read the attributes of a selected element.
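The three operations can be sketched together as follows. The markup and the base URI are assumptions made for the example; the base URI is only there so that relative links can be resolved.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class WorkWithElements {

    // Made-up markup so the example runs offline.
    static final String HTML =
            "<a href=\"/docs\" title=\"Documentation\">Docs</a>"
          + "<a href=\"/guides\">Guides</a>";

    static Elements links() {
        // The second argument is the base URI used to resolve relative URLs (assumed value).
        Document doc = Jsoup.parse(HTML, "https://spring.io/");
        return doc.select("a");
    }

    public static void main(String[] args) {
        Elements links = links();

        // Pick a specific element: first() returns null when the list is empty,
        // get(index) behaves like List.get.
        Element first = links.first();
        Element second = links.get(1);
        System.out.println(first.text() + ", " + second.text());

        // Elements is a List<Element>, so a for-each loop works.
        for (Element link : links) {
            System.out.println(link.attr("href")       // raw attribute value
                    + " -> " + link.absUrl("href"));   // resolved against the base URI
        }

        // attr() returns an empty string for a missing attribute; hasAttr() tells the cases apart.
        System.out.println(second.hasAttr("title"));
    }
}
```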

Examples

1. Get All Hyperlinks
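One way this could be written is sketched below. It assumes that "Title" means the visible link text, and it parses a small stand-in snippet instead of the live page; with a real page, the Document would come from Jsoup.connect(url).get() instead.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AllHyperlinks {

    // Stand-in markup for the navigation links of the page.
    static final String HTML =
            "<a href=\"/docs\">Docs</a>"
          + "<a href=\"/guides\">Guides</a>"
          + "<a href=\"/projects\">Projects</a>"
          + "<a href=\"/blog\">Blog</a>";

    // Builds one line per anchor that carries an href attribute.
    static List<String> describeLinks(Document doc) {
        List<String> lines = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            lines.add("Link :" + link.attr("href") + " Title :" + link.text());
        }
        return lines;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        for (String line : describeLinks(doc)) {
            System.out.println(line);
        }
    }
}
```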
Output :
Link :/docs Title :Docs
Link :/guides Title :Guides
Link :/projects Title :Projects
Link :/blog Title :Blog
//Removed other outputs.
2. Get All Images
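A possible version is sketched below. It runs against a single stand-in img tag whose attribute values mirror the sample output, rather than against the live page; the class and helper names are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AllImages {

    // Stand-in markup; the attribute values mirror the sample output.
    static final String HTML =
            "<img src=\"/images/branding/googlelogo/2x/googlelogo_color_120x44dp.png\""
          + " height=\"44\" width=\"120\" alt=\"Google\">";

    // Formats the attributes of one <img> element, one per line.
    static String describe(Element img) {
        return "src : " + img.attr("src")
                + "\nheight : " + img.attr("height")
                + "\nwidth : " + img.attr("width")
                + "\nalt : " + img.attr("alt");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // Select every <img> element in the document.
        for (Element img : doc.select("img")) {
            System.out.println(describe(img));
        }
    }
}
```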
Output :
src : /images/branding/googlelogo/2x/googlelogo_color_120x44dp.png
height : 44
width : 120
alt : Google
3. Select Elements based on Id/Class Name
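The following sketch shows both lookups. The id "sidebar" and the class "widget-title" are hypothetical names invented for this example, as is the markup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectByIdOrClass {

    // Made-up markup with a hypothetical id and class name.
    static final String HTML =
            "<div id=\"sidebar\">"
          + "<h3 class=\"widget-title\">Recent Posts</h3>"
          + "<h3 class=\"widget-title\">Categories</h3>"
          + "</div>";

    // Joins the text of every element that has the given class.
    static String titlesByClass(Document doc, String className) {
        // Equivalent to doc.select("." + className).
        Elements titles = doc.getElementsByClass(className);
        StringBuilder out = new StringBuilder();
        for (Element title : titles) {
            out.append(title.text()).append(" :: ");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // By id: "#sidebar" in selector syntax, or getElementById directly.
        Element sidebar = doc.getElementById("sidebar");
        System.out.println(sidebar.children().size());

        // By class name.
        System.out.println(titlesByClass(doc, "widget-title"));
    }
}
```

getElementById returns null when no element has that id, so a null check is worth adding before using the result on a real page.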
Output :
Recent Posts :: Categories :: Get to Know Us :: Follow Us ::