Friday, 1 May 2015

Extract latest posts of a Blog (Blogger) using JSOUP API


JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. With JSoup's DOM traversal you can extract data from a page, and manipulating HTML attributes and elements is just as easy.

In this post, we will extract the latest posts of a blog using the JSoup API.
  • Select a blog URL. Here I have selected my own blog URL to get my latest posts.
  •  String rashmiBlog="http://rashmi9425.blogspot.in/";  
  • The JsoupGet method extracts the posts. It takes the value of "rashmiBlog" as its argument and returns a list whose first element is the author name, followed by the URLs of the latest posts.
  • // Requires imports: java.io.IOException, java.util.ArrayList, java.util.List, java.util.ListIterator,
    // org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element
    1:    public static List<String> JsoupGet(String rashmiBlog) {
    2:        Document doc = null;
    3:        List<String> list = new ArrayList<>();
    
    4:        try {
    5:            doc = Jsoup.connect(rashmiBlog).get();  // fetch the page and parse it into a DOM
    6:        } catch (IOException e) {
    7:            e.printStackTrace();
    8:            System.out.println("not able to connect");
    9:            return null;
    10:       }
    
    11:        // Extract userName (author of the blog)
    12:        String userName = doc.getElementsByClass("post-footer").select("span[itemprop]").first().text();
    13:        list.add(userName);
    
    14:        // Select the "posts" class
    15:        Element posts = doc.getElementsByClass("posts").first();
    16:        ListIterator<Element> link = posts.getElementsByTag("a").listIterator();
    
    17:        while (link.hasNext()) {
    18:            String href = link.next().attr("href");  // each element is already an <a> tag
    19:            list.add(href);
    20:        }
    21:        return list;
    22:    }
    
    1. In line 5, we connect to the blog URL and parse the response into a DOM.

    2. In lines 11-12, we extract the user name, i.e. the author of the blog.
       <div class="post-footer">
         <div class="post-footer-line post-footer-line-1">
           <span class="post-author vcard"> Posted by <span class="fn" itemprop="author" itemscope itemtype="http://schema.org/Person">
             <meta content="https://plus.google.com/113911429732794531347" itemprop="url">
             <a class="g-profile" href="https://plus.google.com/113911429732794531347" rel="author" title="author profile">
               <span itemprop="name">Rashmi Verma</span> </a> </span> </span>
         </div>
       </div>
      (1)
      This is a snippet of the fetched document. To get the author, select the class "post-footer", then select the span tags that have an 'itemprop' attribute. More than one span tag carries an itemprop attribute, so we take the first one and extract its text.

    3. In lines 14-16, we work on the list of latest posts.
       <ul class="posts">
          <li><a href="http://rashmi9425.blogspot.in/2015/04/programmatically-execution-of-create.html">Programmatically execution of CREATE operation in ...</a>
          </li>
          <li>
            .
            .
          </li>
            .
            .
        </ul>
      (2)
      We select the first element with class 'posts' and create a list iterator over the links to the latest posts. As snippet (2) shows, each post URL sits inside an '<a>' tag. In JSoup it is very easy to collect such elements using "getElementsByTag".

    4. In line 18, we extract the value of the 'href' attribute. In snippet (2), that value is 'http://rashmi9425.blogspot.in/2015/04/programmatically-execution-of-create.html'.
      In this scenario only one latest post is available.

  • Here is the code to print the list. The first value in the list is the author name, and the remaining values are the URLs of the latest posts.
         List<String> url = JsoupGet(rashmiBlog);  
         int i = 0;  
         System.out.println("Author:::" + url.get(i)+"\n");  
         i++;  
         System.out.println("Latest Posts");  
         while (url.size() != i) {  
           System.out.println(url.get(i)+"\n");  
           i++;  
         }  
    
  • Output on console.
     Author:::Rashmi Verma  
    
     Latest Posts  
     http://rashmi9425.blogspot.in/2015/04/programmatically-execution-of-create.html  
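
The printing step above relies on the convention that element 0 of the list is the author and everything after it is a post URL. As a rough sketch of the same idea, the helper below builds the output with subList instead of a manual counter and also copes with a null or empty list (JsoupGet returns null when the connection fails). The class name PrintPosts and the method name format are hypothetical, and the sample list is a hard-coded stand-in for what JsoupGet(rashmiBlog) would return.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the original post.
class PrintPosts {

    // Formats a list whose first element is the author name
    // and whose remaining elements are post URLs.
    public static String format(List<String> result) {
        if (result == null || result.isEmpty()) {
            return "No data extracted";
        }
        StringBuilder out = new StringBuilder();
        out.append("Author:::").append(result.get(0)).append("\n\n");
        out.append("Latest Posts\n");
        // subList(1, size) skips the author entry at index 0.
        for (String postUrl : result.subList(1, result.size())) {
            out.append(postUrl).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hard-coded stand-in for the value returned by JsoupGet(rashmiBlog).
        List<String> sample = Arrays.asList(
                "Rashmi Verma",
                "http://rashmi9425.blogspot.in/2015/04/programmatically-execution-of-create.html");
        System.out.print(format(sample));
    }
}
```

Keeping the formatting in one method makes the "author first, posts after" convention explicit in a single place.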
    

Limitation: This code is specific to Blogger blogs. Other blogging platforms use different markup, so the class names and selectors would need to change.
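
One way to make this dependency on Blogger's markup visible is to keep the selectors in named constants and to guard against elements that are missing. The following is only a minimal sketch under a few assumptions: the jsoup library is on the classpath, the class name BloggerSelectors and its constants are hypothetical, and the HTML is an inline string shortened from snippets (1) and (2) rather than a page fetched over the network.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class, not part of the original post.
class BloggerSelectors {

    // Selectors taken from the Blogger markup in snippets (1) and (2);
    // another blogging platform would need different values here.
    static final String AUTHOR_SELECTOR = ".post-footer span[itemprop]";
    static final String POST_LINK_SELECTOR = ".posts a[href]";

    // Returns the author text, or an empty string if nothing matches.
    public static String author(Document doc) {
        Element first = doc.select(AUTHOR_SELECTOR).first();
        return first == null ? "" : first.text();
    }

    // Returns the href of every link inside the "posts" list (possibly empty).
    public static List<String> postLinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (Element a : doc.select(POST_LINK_SELECTOR)) {
            links.add(a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Inline HTML shortened from snippets (1) and (2); no network access.
        String html = "<div class=\"post-footer\">"
                + "<span class=\"fn\" itemprop=\"author\">"
                + "<span itemprop=\"name\">Rashmi Verma</span></span></div>"
                + "<ul class=\"posts\"><li><a href=\"http://rashmi9425.blogspot.in/"
                + "2015/04/programmatically-execution-of-create.html\">Post</a></li></ul>";
        Document doc = Jsoup.parse(html);
        System.out.println(author(doc));
        System.out.println(postLinks(doc));
    }
}
```

If a page does not use these class names, author returns an empty string and postLinks returns an empty list instead of throwing a NullPointerException, which makes it easier to notice that the selectors need adjusting.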