org.pdfbox.searchengine.lucene

Class LucenePDFDocument


public final class LucenePDFDocument
extends java.lang.Object

This class creates a document for the Lucene search engine. It should plug easily into the IndexHTML or IndexFiles classes that come with the Lucene project. The class populates the following fields.
Lucene Field Name - Description
path - File system path if loaded from a file
url - URL to PDF document
contents - Entire contents of PDF document, indexed but not stored
summary - First 500 characters of content
modified - The modified date/time according to the url or path
uid - A unique identifier for the Lucene document
CreationDate - From PDF meta-data if available
Creator - From PDF meta-data if available
Keywords - From PDF meta-data if available
ModificationDate - From PDF meta-data if available
Producer - From PDF meta-data if available
Subject - From PDF meta-data if available
Trapped - From PDF meta-data if available
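
To make the summary field concrete, here is a minimal sketch of taking the first 500 characters of the extracted text. The class name SummarySketch, the constant SUMMARY_LENGTH, and the method summarize are hypothetical names chosen for illustration; they are not part of LucenePDFDocument, and the class may build its summary differently.

```java
// Illustrative only: none of these names belong to LucenePDFDocument.
final class SummarySketch {
    // Length taken from the field list: "First 500 characters of content".
    static final int SUMMARY_LENGTH = 500;

    // Returns the first SUMMARY_LENGTH characters, or the whole text if it is shorter.
    static String summarize(String contents) {
        int end = Math.min(contents.length(), SUMMARY_LENGTH);
        return contents.substring(0, end);
    }

    public static void main(String[] args) {
        // Build a 600-character stand-in for extracted PDF text.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 600; i++) {
            text.append('x');
        }
        System.out.println(summarize(text.toString()).length());
        System.out.println(summarize("short text"));
    }
}
```

Clamping the end index with Math.min keeps documents shorter than 500 characters from causing an out-of-range substring call.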
Version:
$Revision: 1.22 $
Author:
Ben Litchfield

Constructor Summary

LucenePDFDocument()
Constructor.

Method Summary

Document
convertDocument(File file)
This will take a reference to a PDF document and create a Lucene document.
Document
convertDocument(InputStream is)
Convert the PDF stream to a Lucene document.
Document
convertDocument(URL url)
Convert the PDF document at the given URL to a Lucene document.
DateTools.Resolution
getDateTimeResolution()
Get the Lucene date/time resolution.
static Document
getDocument(File file)
This will get a Lucene document from a PDF file.
static Document
getDocument(InputStream is)
This will get a Lucene document from a PDF input stream.
static Document
getDocument(URL url)
This will get a Lucene document from a PDF at the given URL.
static void
main(String[] args)
This will test creating a document.
void
setDateTimeResolution(DateTools.Resolution resolution)
Set the Lucene date/time resolution.
void
setTextStripper(PDFTextStripper aStripper)
Set the text stripper that will be used during extraction.

Constructor Details

LucenePDFDocument

public LucenePDFDocument()
Constructor.

Method Details

convertDocument

public Document convertDocument(File file)
            throws IOException
This will take a reference to a PDF document and create a Lucene document.
Parameters:
file - A reference to a PDF document.
Returns:
The converted Lucene document.

convertDocument

public Document convertDocument(InputStream is)
            throws IOException
Convert the PDF stream to a Lucene document.
Parameters:
is - The input stream.
Returns:
The input stream converted to a Lucene document.

convertDocument

public Document convertDocument(URL url)
            throws IOException
Convert the PDF document at the given URL to a Lucene document.
Parameters:
url - A URL to a PDF document.
Returns:
The PDF converted to a Lucene document.

getDateTimeResolution

public DateTools.Resolution getDateTimeResolution()
Get the Lucene date/time resolution.
Returns:
The current date/time resolution.

getDocument

public static Document getDocument(File file)
            throws IOException
This will get a Lucene document from a PDF file.
Parameters:
file - The file to get the document for.
Returns:
The Lucene document.

getDocument

public static Document getDocument(InputStream is)
            throws IOException
This will get a Lucene document from a PDF input stream.
Parameters:
is - The stream to read the PDF from.
Returns:
The Lucene document.

getDocument

public static Document getDocument(URL url)
            throws IOException
This will get a Lucene document from a PDF at the given URL.
Parameters:
url - The URL of the PDF to get the document for.
Returns:
The Lucene document.

main

public static void main(String[] args)
            throws IOException
This will test creating a document.
Usage: java org.pdfbox.searchengine.lucene.LucenePDFDocument <pdf-document>
Parameters:
args - Command line arguments.

setDateTimeResolution

public void setDateTimeResolution(DateTools.Resolution resolution)
Set the Lucene date/time resolution.
Parameters:
resolution - The new date/time resolution.
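
The effect of a date/time resolution can be sketched with the standard library alone. The example below assumes that the resolution decides how much of a timestamp, such as the modified field, survives when it is turned into an indexed string, so a coarser resolution makes nearby timestamps collapse to the same value. ResolutionSketch and its nested Resolution enum are hypothetical stand-ins written for this illustration; they are not Lucene's DateTools.Resolution, although the patterns follow the compact UTC format that DateTools is understood to produce.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative only: a stand-in for the idea behind DateTools.Resolution.
final class ResolutionSketch {
    // Each constant keeps one more component of the timestamp than the previous one.
    enum Resolution {
        YEAR("yyyy"), MONTH("yyyyMM"), DAY("yyyyMMdd"),
        HOUR("yyyyMMddHH"), MINUTE("yyyyMMddHHmm"), SECOND("yyyyMMddHHmmss");

        final DateTimeFormatter formatter;

        Resolution(String pattern) {
            // Format in UTC so the result does not depend on the local time zone.
            this.formatter = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        }
    }

    // Formats a millisecond timestamp, dropping everything finer than the resolution.
    static String format(long epochMillis, Resolution resolution) {
        return resolution.formatter.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long modified = 1_000_000_000_000L; // 2001-09-09T01:46:40Z
        System.out.println(format(modified, Resolution.DAY));    // 20010909
        System.out.println(format(modified, Resolution.SECOND)); // 20010909014640
    }
}
```

With DAY resolution, two files modified a minute apart on the same day yield the same string; with SECOND resolution they stay distinct. That trade-off between index size and precision is the usual reason to make the resolution configurable.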

setTextStripper

public void setTextStripper(PDFTextStripper aStripper)
Set the text stripper that will be used during extraction.
Parameters:
aStripper - The new PDF text stripper.