Remove HTML tags from a file to extract only the TEXT
Tag(s): IO String/Number Networking
Using regular expression
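A regular expression is used to strip out anything between a < and the following >. This is quick but naive: the content of script and style elements and entities such as &amp; are left untouched.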
import java.io.BufferedReader;
import java.io.FileReader;

public class Html2TextWithRegExp {
    private Html2TextWithRegExp() {}

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("java-new.html"))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
                // or, to keep the original line breaks:
                // sb.append(line).append(System.lineSeparator());
            }
        }
        // "<[^>]*>" also matches a tag that spans several lines
        String nohtml = sb.toString().replaceAll("<[^>]*>", "");
        System.out.println(nohtml);
    }
}
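The plain tag-stripping expression leaves script and style content, comments and entities in the output. The following is only a minimal sketch of how those cases could be handled with a few more expressions. The class name HtmlRegexCleanup and the short entity list are assumptions made for illustration; real-world HTML usually calls for a parser instead.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original HowTo
class HtmlRegexCleanup {
    // (?is) : case-insensitive, and '.' also matches line terminators
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String html) {
        // remove whole script/style elements first, their content is not text
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // decode only a few common entities; &amp; goes last so that
        // "&amp;lt;" ends up as "&lt;" and not as "<"
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // collapse the runs of whitespace left behind by the removed tags
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
            + "<body><!-- note --><p>Tom &amp; Jerry</p>"
            + "<script>var x = 1 < 2;</script></body></html>";
        System.out.println(stripTags(html)); // prints: Tom & Jerry
    }
}
```

Each tag is replaced by a space rather than an empty string so that words separated only by a tag (for example a<br>b) are not glued together.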
Using javax.swing.text.html.HTMLEditorKit
In most cases, the HTMLEditorKit is used with a JEditorPane text component, but it can also be used directly to extract text from an HTML page.
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class HTMLUtils {
    private HTMLUtils() {}

    public static List<String> extractText(Reader reader) throws IOException {
        final List<String> list = new ArrayList<String>();
        ParserDelegator parserDelegator = new ParserDelegator();
        ParserCallback parserCallback = new ParserCallback() {
            // called for each run of text found between tags
            public void handleText(final char[] data, final int pos) {
                list.add(new String(data));
            }
            // the other callbacks are deliberately left empty
            public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) { }
            public void handleEndTag(Tag t, final int pos) { }
            public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
            public void handleComment(final char[] data, final int pos) { }
            public void handleError(final java.lang.String errMsg, final int pos) { }
        };
        // true : ignore the charset directive found in the document
        parserDelegator.parse(reader, parserCallback, true);
        return list;
    }

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the reader when done
        try (FileReader reader = new FileReader("java-new.html")) {
            List<String> lines = HTMLUtils.extractText(reader);
            for (String line : lines) {
                System.out.println(line);
            }
        }
    }
}
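The same parser can also work on HTML that is already in memory. The following is a minimal sketch, assuming the HTML is held in a String and a single line of text is wanted; the class name HtmlTextFromString and the choice to join the fragments with a space are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original HowTo
class HtmlTextFromString {
    static String toText(String html) throws IOException {
        final StringBuilder sb = new StringBuilder();
        ParserCallback callback = new ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // separate the fragments so that adjacent elements do not run together
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(data);
            }
        };
        // true : ignore the charset directive, the String is already decoded
        new ParserDelegator().parse(new StringReader(html), callback, true);
        // collapse any doubled whitespace introduced by the joining
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello <b>World</b></p></body></html>";
        System.out.println(toText(html));
    }
}
```

Joining with a space is a simplification: a word split by inline markup, such as H<b>ello</b>, comes out as two fragments.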
Using an HTML parser
This is probably the best solution (if the chosen parser is good!). There are many parsers available on the net. In this HowTo, I will use the open source package Jsoup.
Jsoup is entirely self-contained and has no dependencies, which is a good thing.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.jsoup.Jsoup;

public class HTMLUtils {
    private HTMLUtils() {}

    public static String extractText(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(reader);
        String line;
        while ((line = br.readLine()) != null) {
            // keep a line break so words on consecutive lines are not glued together
            sb.append(line).append('\n');
        }
        // text() returns the combined text of all elements, whitespace normalized
        String textOnly = Jsoup.parse(sb.toString()).text();
        return textOnly;
    }

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the reader when done
        try (FileReader reader = new FileReader("C:/RealHowTo/topics/java-language.html")) {
            System.out.println(HTMLUtils.extractText(reader));
        }
    }
}
Using Apache Tika
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ParseHTMLWithTika {
    public static void main(String[] args) throws Exception {
        // try-with-resources closes the stream, even when parsing fails
        try (InputStream is = new FileInputStream("C:/Temp/java-x.html")) {
            // collects the text of the document body; the default handler stops
            // at 100000 characters, new BodyContentHandler(-1) removes that limit
            ContentHandler contenthandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            // detects the document type and delegates to the matching parser
            Parser parser = new AutoDetectParser();
            parser.parse(is, contenthandler, metadata, new ParseContext());
            System.out.println(contenthandler.toString());
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
See also Extract links from an HTML page and Remove XML tags from a string to keep only text
Send comment, question or suggestion to howto@rgagnon.com