JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. With JSoup's DOM traversal you can extract data, and manipulating HTML attributes and elements is just as easy.
In this post, I discuss how to extract a blog's latest posts with the JSoup API.
- Select a blog URL. Here I have selected my own blog URL to get my latest posts.
String rashmiBlog="http://rashmi9425.blogspot.in/";
// Imports required by the method below:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

1: public static List<String> JsoupGet(String rashmiBlog) {
2:     Document doc = null;
3:     List<String> list = new ArrayList<>();
4:     try {
5:         doc = Jsoup.connect(rashmiBlog).get();
6:     } catch (IOException e) {
7:         e.printStackTrace();
8:         System.out.println("not able to connect");
9:         return null;
10:     }
11:     //Extract userName
12:     String userName = doc.getElementsByClass("post-footer").select("span[itemprop]").first().text();
13:     list.add(userName);
14:     //select "posts" class
15:     Element posts = doc.getElementsByClass("posts").first();
16:     ListIterator<Element> link = posts.getElementsByTag("a").listIterator();
17:     while (link.hasNext()) {
18:         String href = link.next().getElementsByTag("a").first().attr("href");
19:         list.add(href);
20:     }
21:     return list;
22: }
- Line 5 connects to the blog URL and parses the response into a DOM.
- Lines 11-13 extract the user name, i.e. the author of the blog (see snippet (1)).
- Lines 14-16 select the latest posts. The generated document contains the following list:
<ul class="posts"> <li><a href="http://rashmi9425.blogspot.in/2015/04/programmatically-execution-of-create.html">Programmatically execution of CREATE operation in ...</a> </li> <li> . . </li> . . </ul>
(2)
Line 15 selects the first element with class 'posts', and line 16 creates a list iterator over the '<a>' tags inside it. As snippet (2) shows, each post URL sits inside an '<a>' tag, and in JSoup it is very easy to get such elements using "getElementsByTag".
- Line 18 extracts the value of the 'href' attribute. In snippet (2) that value is 'http://rashmi9425.blogspot.in/2015/04/programmatically-execution-of-create.html'.
In this scenario only one latest post is available.
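The selection of post links described above can also be tried offline, without connecting to the blog. The following is only a minimal sketch, assuming JSoup is on the classpath: it parses an inline copy of snippet (2) and uses a single CSS selector instead of the iterator. The class name PostLinksSketch, the method extractPostLinks, and the second list entry without an href are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PostLinksSketch {

    // Inline copy of snippet (2). The second <li> is a made-up entry
    // (not from the blog) that shows how a[href] skips anchors without an href.
    static final String SAMPLE_HTML =
        "<ul class=\"posts\">"
        + "<li><a href=\"http://rashmi9425.blogspot.in/2015/04/programmatically-execution-of-create.html\">"
        + "Programmatically execution of CREATE operation in ...</a></li>"
        + "<li><a>hypothetical entry without an href</a></li>"
        + "</ul>";

    // Collects the href of every <a> that has one inside <ul class="posts">.
    static List<String> extractPostLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("ul.posts a[href]")) {
            links.add(a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        for (String link : extractPostLinks(SAMPLE_HTML)) {
            System.out.println(link);
        }
    }
}
```

Working on a string like this makes it easier to experiment with selectors before pointing the code at a live page.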
<div class="post-footer">
<div class="post-footer-line post-footer-line-1">
<span class="post-author vcard"> Posted by <span class="fn" itemprop="author" itemscope itemtype="http://schema.org/Person">
<meta content="https://plus.google.com/113911429732794531347" itemprop="url">
<a class="g-profile" href="https://plus.google.com/113911429732794531347" rel="author" title="author profile">
<span itemprop="name">Rashmi Verma</span> </a> </span> </span>
</div>
</div>
(1)
This is a snippet of the generated document. To get the author, select the class "post-footer", then select the span tags that have an 'itemprop' attribute. More than one span tag with an itemprop attribute is present, so we take the first one and extract its text.
The method can then be called and its result printed:
List<String> url = JsoupGet(rashmiBlog);
int i = 0;
System.out.println("Author:::" + url.get(i)+"\n");
i++;
System.out.println("Latest Posts");
while (url.size() != i) {
System.out.println(url.get(i)+"\n");
i++;
}
Author:::Rashmi Verma
Latest Posts
http://rashmi9425.blogspot.in/2015/04/programmatically-execution-of-create.html
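The author lookup described for snippet (1) can be checked the same way against an inline string. This is a rough sketch, assuming JSoup is on the classpath. The sample markup is a shortened copy of snippet (1), and the names AuthorSketch and extractAuthor are hypothetical. Unlike the method above, it returns null instead of throwing when no matching span is found.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AuthorSketch {

    // Shortened copy of snippet (1), keeping only the nesting that matters here.
    static final String SAMPLE_HTML =
        "<div class=\"post-footer\">"
        + "<span class=\"post-author vcard\"> Posted by "
        + "<span class=\"fn\" itemprop=\"author\">"
        + "<a class=\"g-profile\" rel=\"author\">"
        + "<span itemprop=\"name\">Rashmi Verma</span></a></span></span>"
        + "</div>";

    // Returns the text of the first span with an itemprop attribute inside
    // an element of class "post-footer", or null when nothing matches.
    static String extractAuthor(String html) {
        Document doc = Jsoup.parse(html);
        Element span = doc.select(".post-footer span[itemprop]").first();
        return span == null ? null : span.text();
    }

    public static void main(String[] args) {
        System.out.println("Author:::" + extractAuthor(SAMPLE_HTML));
    }
}
```

The null check matters because first() returns null when the selection is empty, which would otherwise surface as a NullPointerException on pages that lack this footer.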
Limitation: This code is specific to Blogger URLs. Other blogs may use different markup, so the code that selects the elements may need to change.