org.pdfbox.util

Class PDFText2HTML


public class PDFText2HTML
extends PDFTextStripper

Wraps stripped text in simple HTML and tries to form HTML paragraphs. Paragraphs broken across pages, columns, or figures are not rejoined.
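To illustrate the general idea of wrapping plain text in HTML paragraphs, here is a minimal standalone sketch. It is not the implementation of this class and does not use PDFBox. The class name ParagraphWrapSketch, both method names, and the rule that a blank line marks a paragraph boundary are assumptions made for the example.

```java
public class ParagraphWrapSketch {

    /** Escapes the characters that are special in HTML text content. */
    public static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Wraps each blank-line-separated block of text in a p element. */
    public static String wrapParagraphs(String strippedText) {
        StringBuilder html = new StringBuilder();
        // Assumption: one or more blank lines mark a paragraph boundary.
        for (String block : strippedText.split("\\n\\s*\\n")) {
            String trimmed = block.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty blocks, e.g. from leading blank lines
            }
            html.append("<p>").append(escapeHtml(trimmed)).append("</p>\n");
        }
        return html.toString();
    }

    public static void main(String[] args) {
        String text = "First paragraph, line one.\nLine two.\n\nSecond <paragraph> & more.";
        System.out.print(wrapParagraphs(text));
    }
}
```

Like the class described here, the sketch makes no attempt to rejoin a paragraph that the input has already split.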
Version:
$Revision: 1.3 $
Author:
jjb - http://www.johnjbarton.com

Field Summary

Fields inherited from class org.pdfbox.util.PDFTextStripper

charactersByArticle, output

Constructor Summary

PDFText2HTML()
Constructor.

Method Summary

void
endDocument(PDDocument pdf)
protected void
endParagraph()
Write out the paragraph separator.
protected void
flushText()
protected String
getTitleGuess()
The guessed title of the document.
protected TextPosition
guessTitle(Iterator textIter)
This method will attempt to guess the title of the document.
boolean
isSuppressParagraphs()
void
setSuppressParagraphs(boolean shouldSuppressParagraphs)
protected void
startParagraph()
Write out the paragraph separator.
protected void
writeCharacters(TextPosition position)
protected void
writeHeader()
Write the header to the output document.

Methods inherited from class org.pdfbox.util.PDFTextStripper

endDocument, endPage, endParagraph, flushText, getCharactersByArticle, getCurrentPageNo, getEndBookmark, getEndPage, getLineSeparator, getOutput, getPageSeparator, getStartBookmark, getStartPage, getText, getText, getWordSeparator, processPage, processPages, setEndBookmark, setEndPage, setLineSeparator, setPageSeparator, setShouldSeparateByBeads, setSortByPosition, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, shouldSeparateByBeads, shouldSortByPosition, shouldSuppressDuplicateOverlappingText, showCharacter, startDocument, startPage, startParagraph, writeCharacters, writeText, writeText

Methods inherited from class org.pdfbox.util.PDFStreamEngine

getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getXObjects, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, resetEngine, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix, showCharacter, showString

Constructor Details

PDFText2HTML

public PDFText2HTML()
            throws IOException
Constructor.

Method Details

endDocument

public void endDocument(PDDocument pdf)
            throws IOException
Overrides:
endDocument in class PDFTextStripper

endParagraph

protected void endParagraph()
            throws IOException
Write out the paragraph separator.
Overrides:
endParagraph in class PDFTextStripper

flushText

protected void flushText()
            throws IOException
Overrides:
flushText in class PDFTextStripper

getTitleGuess

protected String getTitleGuess()
The guessed title of the document.
Returns:
A string holding the guessed title of this document.

guessTitle

protected TextPosition guessTitle(Iterator textIter)
This method will attempt to guess the title of the document.
Parameters:
textIter - The characters on the first page.
Returns:
The text position that is guessed to be the title.
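This page does not say which heuristic guessTitle applies. As a hedged illustration only, the sketch below picks the first fragment with the largest font size, which is one plausible way to guess a title from first-page text. The Fragment class is a hypothetical stand-in and is not PDFBox's TextPosition. The largest-font rule is an assumption, not the documented behavior of this method.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TitleGuessSketch {

    /** Hypothetical stand-in for a positioned piece of text. */
    public static final class Fragment {
        public final String text;
        public final float fontSize;

        public Fragment(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Returns the first fragment with the largest font size,
     * or null when the iterator yields nothing.
     */
    public static Fragment guessTitle(Iterator<Fragment> textIter) {
        Fragment best = null;
        while (textIter.hasNext()) {
            Fragment candidate = textIter.next();
            // Strict comparison keeps the earliest fragment on ties.
            if (best == null || candidate.fontSize > best.fontSize) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Fragment> firstPage = Arrays.asList(
                new Fragment("Technical Report", 10f),
                new Fragment("A Study of Text Extraction", 18f),
                new Fragment("Introduction", 12f));
        Fragment title = guessTitle(firstPage.iterator());
        System.out.println(title.text);
    }
}
```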

isSuppressParagraphs

public boolean isSuppressParagraphs()
Returns:
The current value of the suppressParagraphs flag.

setSuppressParagraphs

public void setSuppressParagraphs(boolean shouldSuppressParagraphs)
Parameters:
shouldSuppressParagraphs - The new value of the suppressParagraphs flag.

startParagraph

protected void startParagraph()
            throws IOException
Write out the paragraph separator.
Overrides:
startParagraph in class PDFTextStripper

writeCharacters

protected void writeCharacters(TextPosition position)
            throws IOException
Overrides:
writeCharacters in class PDFTextStripper

writeHeader

protected void writeHeader()
            throws IOException
Write the header to the output document.