A piece of my code: HTML Parser with Java

I wanted to share some sample code showing how to parse an HTML page using the open-source library HTML Parser. (Google Translate renders the verb "parse" into Spanish as "analizar gramaticalmente"; in Spanish I stick with "parsear", which is incorrect but far better known.) I used this library in the past on a project where we needed to extract every link from a site, and now I am using it again on another project where I need to validate some coding rules on certain HTML pages.
So let's assume we need to find all the absolute links on one page of a site. First we create our parser class:

import java.io.IOException;
import java.net.URL;
import org.htmlparser.Node;
import org.htmlparser.Tag;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.util.ParserException;

/**
 * Parses the HTML code of the page specified by its URL.
 * @author gabriel.solano
 */
public class URLHTMLParser {

    /*
     * Tag handler that will be used to process the tags.
     * (This could be improved by implementing an observer
     * pattern to be able to add more than one TagHandler.)
     */
    private final TagHandler tagHandler;

    /**
     * Constructor.
     * @param tagHandler handler that receives every tag found
     */
    public URLHTMLParser(TagHandler tagHandler) {
        this.tagHandler = tagHandler;
    }

    /**
     * Scans the specified URL.
     * @param url address of the page to scan
     * @throws ParserException if the page cannot be parsed
     * @throws IOException if the connection cannot be opened
     */
    public void scanURL(URL url) throws ParserException, IOException {
        Lexer lexer = new Lexer(url.openConnection());
        extractHTMLNodes(lexer);
    }

    /**
     * Extracts the HTML nodes and lets the TagHandler do something
     * with the tags.
     * @param lexer lexer positioned at the start of the page
     * @throws ParserException if a node cannot be read
     */
    private void extractHTMLNodes(Lexer lexer) throws ParserException {
        Node node;

        // nextNode returns null once the end of the page is reached.
        while (null != (node = lexer.nextNode(false))) {
            if (node instanceof Tag) {
                Tag tag = (Tag) node;
                tagHandler.handleTag(tag);
            }
        }
    }
}

As you can see, the last method of this class is the one that walks through all the HTML nodes of the page. It simply lets the TagHandler do whatever it needs to do with each tag it finds.
This is the TagHandler interface:

import org.htmlparser.Tag;

/**
 * Defines the interface for a TagHandler.
 * @author gabriel.solano
 */
public interface TagHandler {

    /**
     * Handles the processing of an HTML tag.
     * @param tag the tag that was found
     */
    void handleTag(Tag tag);

}
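
The comment in URLHTMLParser mentions that an observer pattern could allow more than one handler. Here is a minimal sketch of that idea, under some assumptions: to keep it runnable without the HTML Parser jar, the hypothetical `Handler<T>` interface stands in for TagHandler and plain tag-name strings stand in for `org.htmlparser.Tag`. None of these names come from the library.

```java
import java.util.ArrayList;
import java.util.List;

public class CompositeHandlerSketch {

    /** Hypothetical stand-in for TagHandler; T plays the role of Tag. */
    interface Handler<T> {
        void handle(T item);
    }

    /** Forwards each item to every registered handler, in registration order. */
    static class CompositeHandler<T> implements Handler<T> {
        private final List<Handler<T>> handlers = new ArrayList<>();

        CompositeHandler<T> add(Handler<T> handler) {
            handlers.add(handler);
            return this;
        }

        @Override
        public void handle(T item) {
            for (Handler<T> handler : handlers) {
                handler.handle(item);
            }
        }
    }

    public static void main(String[] args) {
        List<String> anchors = new ArrayList<>();
        List<String> all = new ArrayList<>();

        // One handler keeps only anchor tags, the other records every tag.
        CompositeHandler<String> composite = new CompositeHandler<String>()
                .add(name -> { if (name.equalsIgnoreCase("a")) anchors.add(name); })
                .add(all::add);

        for (String name : new String[] {"HTML", "A", "IMG", "a"}) {
            composite.handle(name);
        }

        // prints "2 anchors out of 4 tags"
        System.out.println(anchors.size() + " anchors out of " + all.size() + " tags");
    }
}
```

With the real types, the same shape would probably be a TagHandler implementation that holds a list of TagHandler instances, so URLHTMLParser itself would not need to change.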

And here is the implementation that processes anchor tags:

import java.util.HashSet;
import java.util.Set;
import org.htmlparser.Tag;

/**
 * Handles the event when an anchor tag is found while parsing
 * the HTML code of a page.
 * This class collects all absolute URLs found during
 * the parsing process.
 * @author gabriel.solano
 */
public class AnchorTagHandler implements TagHandler {

    private final Set<String> absoluteURLs; // All URLs found.

    /**
     * Constructor.
     */
    public AnchorTagHandler() {
        absoluteURLs = new HashSet<String>();
    }

    /**
     * Gets the absolute URLs found.
     * The collection is filled only during the scanning process
     * of an HTML page.
     * @return the absolute URLs found so far
     */
    public Set<String> getAbsoluteURLs() {
        return absoluteURLs;
    }

    /**
     * Handles the tag only if it is an anchor tag.
     */
    @Override
    public void handleTag(Tag tag) {
        if (tag.getTagName().equalsIgnoreCase("a")) {
            // Process only if it's an anchor tag.
            processTag(tag);
        }
    }

    /**
     * Processes the anchor tag. In this case it
     * adds every absolute URL found.
     * @param tag the anchor tag to inspect
     */
    private void processTag(Tag tag) {
        String href = tag.getAttribute("href");

        if (href != null) {
            // Note: this lowercases the whole URL, path included,
            // so the stored value may differ in case from the page.
            href = href.toLowerCase();
            if (href.startsWith("http://") || href.startsWith("https://")) {
                // Keep only URLs that use the HTTP or HTTPS protocol.
                absoluteURLs.add(href);
            }
        }
    }
}
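
Since processTag lowercases the whole href before storing it, a URL whose path is case-sensitive would be saved in altered form. Here is a small sketch of an alternative check that ignores case only while comparing the scheme and leaves the original string untouched. The class and method names (`AbsoluteUrlCheck`, `isAbsoluteHttpUrl`) are made up for this illustration, and it uses only the JDK.

```java
public class AbsoluteUrlCheck {

    /**
     * Returns true when href starts with http:// or https://,
     * comparing the prefix case-insensitively.
     */
    static boolean isAbsoluteHttpUrl(String href) {
        if (href == null) {
            return false;
        }
        String trimmed = href.trim();
        return trimmed.regionMatches(true, 0, "http://", 0, 7)
            || trimmed.regionMatches(true, 0, "https://", 0, 8);
    }

    public static void main(String[] args) {
        String[] samples = {
            "HTTP://Example.com/Some/Path",
            "https://example.com/",
            "/relative/link.html",
            "mailto:someone@example.com"
        };
        for (String href : samples) {
            // The href is printed unchanged; only the check ignores case.
            System.out.println(href + " -> " + isAbsoluteHttpUrl(href));
        }
    }
}
```

Inside processTag, the value added to the set could then be the original href rather than its lowercased copy, if preserving case matters for your use.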

The "processTag" method simply extracts the "href" attribute and checks whether it is an absolute URL. Finally, we create the main class to run this code:

import java.net.URL;
import java.util.Set;

public class FindAbsoluteURLs {

    public static void main(String[] args) {

        AnchorTagHandler anchorTagHandler = new AnchorTagHandler();
        URLHTMLParser htmlParser = new URLHTMLParser(anchorTagHandler);

        try {
            htmlParser.scanURL(new URL("http://www.crjug.org/"));
            Set<String> urls = anchorTagHandler.getAbsoluteURLs();

            for (String url : urls) {
                System.out.println(url);
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Running it against the site of the Costa Rica Java Users Group, we get the following URLs:

http://www.facebook.com/pages/costa-rica-jug/107081646760
http://www.oreilly.com/
http://www.java.net/
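
Because AnchorTagHandler stores the URLs in a HashSet, the order in which they print is not guaranteed. If a stable, alphabetical listing is preferred, one option is to copy the set into a TreeSet before printing. This is a sketch with a made-up helper name (`sorted`), fed with the three URLs listed above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class SortedUrls {

    /** Returns a copy of the given URLs in natural (alphabetical) order. */
    static Set<String> sorted(Set<String> urls) {
        return new TreeSet<>(urls);
    }

    public static void main(String[] args) {
        Set<String> found = new HashSet<>(Arrays.asList(
                "http://www.oreilly.com/",
                "http://www.java.net/",
                "http://www.facebook.com/pages/costa-rica-jug/107081646760"));

        // Prints the facebook.com URL first, then java.net, then oreilly.com.
        for (String url : sorted(found)) {
            System.out.println(url);
        }
    }
}
```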

This is the Maven dependency, in case you need to use this handy library:

<dependency>
   <groupId>org.htmlparser</groupId>
   <artifactId>htmlparser</artifactId>
   <version>1.6</version>
</dependency>

Tuesday, September 6, 2011