org.pdfbox.searchengine.lucene
Class LucenePDFDocument
java.lang.Object
org.pdfbox.searchengine.lucene.LucenePDFDocument
public final class LucenePDFDocument
extends java.lang.Object
This class creates a document for the Lucene search engine.
It should plug easily into the IndexHTML or IndexFiles classes that come with
the Lucene project. The class populates the following fields.
Lucene Field Name | Description |
---|---|
path | File system path if loaded from a file |
url | URL to PDF document |
contents | Entire contents of PDF document, indexed but not stored |
summary | First 500 characters of content |
modified | The modified date/time according to the url or path |
uid | A unique identifier for the Lucene document. |
CreationDate | From PDF meta-data if available |
Creator | From PDF meta-data if available |
Keywords | From PDF meta-data if available |
ModificationDate | From PDF meta-data if available |
Producer | From PDF meta-data if available |
Subject | From PDF meta-data if available |
Trapped | From PDF meta-data if available |
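
The summary row above can be illustrated with a small standalone sketch. It is not the PDFBox implementation: the class name `SummarySketch`, the constant `SUMMARY_LENGTH`, and the method `summarize` are hypothetical, and the only detail taken from the table is the 500-character limit.

```java
// Hypothetical helper, not part of PDFBox: shows one way a
// "first 500 characters" summary could be derived from extracted text.
final class SummarySketch {
    // Limit taken from the field table; the constant name is an assumption.
    static final int SUMMARY_LENGTH = 500;

    // Returns at most SUMMARY_LENGTH leading characters of the contents.
    static String summarize(String contents) {
        int end = Math.min(SUMMARY_LENGTH, contents.length());
        return contents.substring(0, end);
    }

    public static void main(String[] args) {
        String shortText = "A short PDF body.";
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 600; i++) {
            longText.append('x');
        }
        // Text shorter than the limit is returned unchanged.
        System.out.println(summarize(shortText));
        // Longer text is cut to the limit.
        System.out.println(summarize(longText.toString()).length());
    }
}
```

With PDFBox and Lucene on the classpath, the real value could presumably be read with `LucenePDFDocument.getDocument(file).get("summary")`, assuming the summary field is stored in the index.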

Modifier and Type | Method and Description |
---|---|
Document | convertDocument(File file) - This will take a reference to a PDF document and create a Lucene document. |
Document | convertDocument(InputStream is) - Convert the PDF stream to a Lucene document. |
Document | convertDocument(URL url) - Convert the document from a PDF to a Lucene document. |
DateTools.Resolution | getDateTimeResolution() - Get the Lucene date/time resolution. |
static Document | getDocument(File file) - This will get a Lucene document from a PDF file. |
static Document | getDocument(InputStream is) - This will get a Lucene document from a PDF file. |
static Document | getDocument(URL url) - This will get a Lucene document from a PDF file. |
static void | main(String[] args) - This will test creating a document. |
void | setDateTimeResolution(DateTools.Resolution resolution) - Set the Lucene date/time resolution. |
void | setTextStripper(PDFTextStripper aStripper) - Set the text stripper that will be used during extraction. |

LucenePDFDocument
public LucenePDFDocument()
Constructor.
convertDocument
public Document convertDocument(File file)
throws IOException
This will take a reference to a PDF document and create a Lucene document.
Parameters:
file - A reference to a PDF document.
Returns:
The converted Lucene document.
convertDocument
public Document convertDocument(InputStream is)
throws IOException
Convert the PDF stream to a Lucene document.
Returns:
The input stream converted to a Lucene document.
convertDocument
public Document convertDocument(URL url)
throws IOException
Convert the document from a PDF to a Lucene document.
Parameters:
url - A URL to a PDF document.
Returns:
The PDF converted to a Lucene document.
getDateTimeResolution
public DateTools.Resolution getDateTimeResolution()
Get the Lucene date/time resolution.
Returns:
The current date/time resolution.
getDocument
public static Document getDocument(File file)
throws IOException
This will get a Lucene document from a PDF file.
Parameters:
file - The file to get the document for.
getDocument
public static Document getDocument(InputStream is)
throws IOException
This will get a Lucene document from a PDF file.
Parameters:
is - The stream to read the PDF from.
getDocument
public static Document getDocument(URL url)
throws IOException
This will get a Lucene document from a PDF file.
Parameters:
url - The URL of the PDF to get the document for.
main
public static void main(String[] args)
throws IOException
This will test creating a document.
Usage: java org.pdfbox.searchengine.lucene.LucenePDFDocument <pdf-document>
Parameters:
args - Command line arguments.
setDateTimeResolution
public void setDateTimeResolution(DateTools.Resolution resolution)
Set the Lucene date/time resolution.
Parameters:
resolution - The new date/time resolution.
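
To show what a date/time resolution means in practice, here is a standalone sketch that does not use Lucene. Lucene's DateTools encodes a timestamp as a UTC string truncated to the chosen resolution; the enum and method below are hypothetical stand-ins that mimic that behavior for three resolutions. It is an assumption that this class applies the resolution to the modified field in the same way.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in for Lucene's DateTools; names are illustrative only.
final class ResolutionSketch {
    // Each resolution keeps only the leading part of the UTC timestamp.
    enum Resolution {
        DAY("yyyyMMdd"),
        MINUTE("yyyyMMddHHmm"),
        SECOND("yyyyMMddHHmmss");

        final DateTimeFormatter formatter;

        Resolution(String pattern) {
            this.formatter = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        }
    }

    // Encodes a timestamp (milliseconds since the epoch) at the given resolution.
    static String encode(long epochMillis, Resolution resolution) {
        return resolution.formatter.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long modified = System.currentTimeMillis();
        // A coarser resolution yields a shorter string, so documents modified
        // within the same day (or minute) share one indexed value.
        for (Resolution resolution : Resolution.values()) {
            System.out.println(resolution + " -> " + encode(modified, resolution));
        }
    }
}
```

A coarser resolution typically produces fewer distinct terms in the index, at the cost of less precise date queries.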
setTextStripper
public void setTextStripper(PDFTextStripper aStripper)
Set the text stripper that will be used during extraction.
Parameters:
aStripper - The new PDF text stripper.