Remove HTML tags from a file to extract only the TEXT

Using a regular expression

A regular expression strips out anything between a < and a >.
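import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class Html2TextWithRegExp {
   private Html2TextWithRegExp() {}

   public static void main (String[] args) {
     StringBuilder sb = new StringBuilder();
     // try-with-resources closes the reader even if reading fails
     try (BufferedReader br = new BufferedReader(new FileReader("java-new.html"))) {
       String line;
       while ( (line = br.readLine()) != null) {
         // readLine() drops the line terminator; put it back so that
         // words on adjacent lines are not glued together
         sb.append(line).append('\n');
       }
       // (?s) lets '.' match newlines, so tags spanning several lines are removed too
       String nohtml = sb.toString().replaceAll("(?s)<.*?>", "");
       System.out.println(nohtml);
     }
     catch (IOException e) {
       e.printStackTrace();
     }
   }
}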
However, if the page contains JavaScript, the script source is treated as text and ends up in the output. You may also need extra logic while reading so that only the content inside the <BODY> tag is processed.
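
The two caveats above can be handled with a couple of extra patterns applied before the tags are stripped. What follows is only a minimal sketch, not a robust HTML parser: the class name Html2TextBodyOnly, the method strip and the three patterns are illustrative assumptions, and regular expressions will still be fooled by unusual markup (for example a > inside an attribute value).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original how-to.
public class Html2TextBodyOnly {
    // (?is) = case-insensitive, and '.' also matches line terminators
    private static final Pattern BODY =
        Pattern.compile("(?is)<body[^>]*>(.*?)</body>");
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    private static final Pattern TAG = Pattern.compile("(?s)<.*?>");

    public static String strip(String html) {
        // Keep only what is inside <body>...</body>; fall back to the whole input
        Matcher m = BODY.matcher(html);
        String body = m.find() ? m.group(1) : html;
        // Drop script and style blocks together with their content
        String noScripts = SCRIPT_OR_STYLE.matcher(body).replaceAll("");
        // Finally remove the remaining tags
        return TAG.matcher(noScripts).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>T</title></head>"
            + "<BODY><script>var a = 1;</script><p>Hello <b>world</b></p></BODY></html>";
        System.out.println(strip(html)); // prints "Hello world"
    }
}
```

The order matters: script blocks are removed while their tags are still present, otherwise only the tags would disappear and the script source would remain as text.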

Using javax.swing.text.html.HTMLEditorKit

In most cases, HTMLEditorKit is used with a JEditorPane text component, but it can also be used directly to extract text from an HTML page.
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class Html2Text extends HTMLEditorKit.ParserCallback {
 // initialized here so getText() is safe even before parse() is called
 private StringBuilder s = new StringBuilder();

 public Html2Text() {}

 public void parse(Reader in) throws IOException {
   s = new StringBuilder();
   ParserDelegator delegator = new ParserDelegator();
   // the third parameter is true to ignore the charset directive
   delegator.parse(in, this, true);
 }

 // called by the parser for each run of text found between tags
 @Override
 public void handleText(char[] text, int pos) {
   s.append(text);
 }

 public String getText() {
   return s.toString();
 }

 public static void main (String[] args) {
   // the HTML to convert; try-with-resources closes the file even on error
   try (FileReader in = new FileReader("java-new.html")) {
     Html2Text parser = new Html2Text();
     parser.parse(in);
     System.out.println(parser.getText());
   }
   catch (IOException e) {
     e.printStackTrace();
   }
 }
}
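
Because the callback above only appends the text runs, the content of adjacent block elements (two consecutive paragraphs, for instance) is joined without any separator. One possible refinement, shown here as a sketch under assumptions, is to also override handleEndTag and handleSimpleTag and emit a newline for tags that break the text flow. The class name Html2TextWithBreaks and the static helper extract are hypothetical; HTML.Tag.breaksFlow() is the standard Swing method used to decide where a line break belongs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical variant of Html2Text, not part of the original how-to.
public class Html2TextWithBreaks extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        text.append(data);
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        // closing a flow-breaking element such as <p> or <div> ends the line
        if (tag.breaksFlow()) {
            text.append('\n');
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // tags without an end tag, such as <br>
        if (tag.breaksFlow()) {
            text.append('\n');
        }
    }

    // Parses an in-memory HTML string and returns the extracted text
    public static String extract(String html) {
        Html2TextWithBreaks callback = new Html2TextWithBreaks();
        try (StringReader in = new StringReader(html)) {
            // true = ignore the charset directive, as in the example above
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("<html><body><p>one</p><p>two</p></body></html>"));
    }
}
```

The result is trimmed because the end tags of enclosing elements such as the body also add line breaks at the end of the output.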



Written and compiled by Réal Gagnon ©1998-2008