org.faceless.pdf2
Class PageExtractor

java.lang.Object
  extended by org.faceless.pdf2.PageExtractor

public class PageExtractor
extends Object

This class enables the extraction of text and images from a PDFPage. You can get one by calling the PDFParser.getPageExtractor(int) method, assuming the PDF has the rights to let you extract text and/or images.

Once you've got one, you can extract the text of the page as a StringBuffer by calling getTextAsStringBuffer(). Note that extracting text from PDFs is not an exact science - the internals of a PDF allow text to be displayed in any order, and features like superscript, subscript, rotated text and so on, which are easy to display in PDF, can only be approximated in plain text.

Features like tables have to be determined using heuristics, and some PDFs are encoded in a way that makes extracting their text almost impossible (storing each letter as an image, for example).

Depending on how the font has been stored, the library may replace unknown characters with a Unicode character in the private range (U+EF00 - U+EFFF). These replacements will be consistent, so if you find that U+EF01 is in fact the letter 'A', you can easily run a String.replace() on the string to correct the letters.
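
The correction described above can be sketched in plain Java. This is an illustration only: the class PrivateUseFixer and its methods are hypothetical names, not part of the library, and the mapping from placeholder to real letter is something you would have to work out yourself by inspecting the extracted text.

```java
import java.util.Map;

/** Hypothetical helper, not part of the library. */
final class PrivateUseFixer {

    /** True if c lies in the private range U+EF00 - U+EFFF mentioned above. */
    static boolean isPlaceholder(char c) {
        return c >= '\uEF00' && c <= '\uEFFF';
    }

    /**
     * Replace every placeholder that has an entry in the mapping.
     * Unmapped placeholders and all other characters are copied unchanged.
     */
    static String fix(CharSequence extracted, Map<Character, Character> mapping) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            Character known = isPlaceholder(c) ? mapping.get(c) : null;
            out.append(known != null ? known.charValue() : c);
        }
        return out.toString();
    }
}
```

Because it accepts any CharSequence, the StringBuffer returned by getTextAsStringBuffer() could be passed in directly. Doing the replacement in one pass avoids calling String.replace() once per character when several placeholders need correcting.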

Extracting bitmap images is a much simpler process. The PageExtractor.Image class represents an image on the current page. There is one instance for each time an image is drawn, although if an image is drawn more than once, each instance may contain the same RenderedImage. You can retrieve the list of images by calling the getImages() method.

This class requires the Extended Edition plus Viewer license to operate. Although it may be freely used in the trial version of the library, the extracted text will have the letter 'e' replaced with the letter 'a'.

Since:
2.6.2

Nested Class Summary
 class PageExtractor.Image
          A class representing a bitmap image which is extracted from the PageExtractor.
 class PageExtractor.Text
          A class representing a piece of text which is extracted from the PageExtractor.
 
Method Summary
 Collection getImages()
          Return every PageExtractor.Image on the page, in the order they were added to the page.
 Collection getMatchingText(String query)
          Return a Collection of PageExtractor.Text items on this page that match the specified substring.
 Collection getMatchingText(String[] queries)
          Return a Collection of PageExtractor.Text items on this page that match one of the specified substrings.
 PDFPage getPage()
          Return the PDFPage this PageExtractor relates to.
 StringBuffer getText(PageExtractor.Text first, int firstchar, PageExtractor.Text last, int lastchar, boolean displayorder)
          Return a StringBuffer containing a contiguous range of text from this PageExtractor.
 StringBuffer getTextAsStringBuffer()
          Parse and return all the text on the page as a StringBuffer.
 StringBuffer getTextAsStringBuffer(float x1, float y1, float x2, float y2)
          Parse and return the text in the specified area on the page as a StringBuffer.
 Collection getTextInDisplayOrder()
          Return every PageExtractor.Text item on the page, in the order they are displayed on the screen - so the first item in the returned collection will be the one nearest to the top left of the page.
 Collection getTextUnordered()
          Return every PageExtractor.Text item on the page, in the order they were added to the page.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

getImages

public Collection getImages()
Return every PageExtractor.Image on the page, in the order they were added to the page. Some images may be displayed more than once, in which case the value returned by PageExtractor.Image.getImage() will be identical.

Returns:
an unmodifiable collection of PageExtractor.Image elements.
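
Since a repeated image yields the same RenderedImage object each time, duplicates can be dropped by comparing object identity. The following is a minimal sketch, assuming you have already collected the result of PageExtractor.Image.getImage() for each element of this collection; DistinctImages is a hypothetical helper, not part of the library.

```java
import java.awt.image.RenderedImage;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of the library. */
final class DistinctImages {

    /**
     * Keep the first occurrence of each image object, preserving order.
     * Images are compared by identity (==), not by pixel content.
     */
    static List<RenderedImage> distinct(Iterable<? extends RenderedImage> drawn) {
        Set<RenderedImage> seen =
            Collections.newSetFromMap(new IdentityHashMap<RenderedImage, Boolean>());
        List<RenderedImage> unique = new ArrayList<>();
        for (RenderedImage image : drawn) {
            if (seen.add(image)) {   // add() returns false if already present
                unique.add(image);
            }
        }
        return unique;
    }
}
```

This is useful when saving the images from a page to disk, so that a logo drawn many times is written only once. Two separate images that happen to have identical pixels are still treated as distinct.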

getTextUnordered

public Collection getTextUnordered()
Return every PageExtractor.Text item on the page, in the order they were added to the page. The ordering may not be consistent with the order in which the items are positioned on screen.

Returns:
an unmodifiable collection of PageExtractor.Text elements.

getTextInDisplayOrder

public Collection getTextInDisplayOrder()
Return every PageExtractor.Text item on the page, in the order they are displayed on the screen - so the first item in the returned collection will be the one nearest to the top left of the page.

Returns:
an unmodifiable collection of PageExtractor.Text elements.

getMatchingText

public Collection getMatchingText(String query)

Return a Collection of PageExtractor.Text items on this page that match the specified substring. The Text items returned from getTextInDisplayOrder() are searched, and where necessary substrings are extracted from them to create this collection. In that case the co-ordinates of the returned Text items will reflect the substring, not the original Text object.

As an example, the following method could be used to search a PDF for a specified word and add a "highlight" annotation over it. The PDF can then be rendered or saved as normal.

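 import java.util.Collection;
 import java.util.Iterator;
 import org.faceless.pdf2.*;

 // Add a "Highlight" markup annotation over every occurrence of word in the PDF
 void highlightWords(PDF pdf, String word) {
   PDFParser parser = new PDFParser(pdf);
   for (int i=0;i<pdf.getNumberOfPages();i++) {
     // One PageExtractor per page
     PageExtractor extractor = parser.getPageExtractor(i);
     Collection co = extractor.getMatchingText(word);
     for (Iterator j = co.iterator();j.hasNext();) {
       PageExtractor.Text text = (PageExtractor.Text)j.next();
       // The annotation is created from the matched Text and added to that Text's page
       AnnotationMarkup annot = text.createAnnotationMarkup("Highlight");
       text.getPage().getAnnotations().add(annot);
     }
   }
 }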
 

Parameters:
query - the String to search for
Returns:
a Collection of PageExtractor.Text objects.
Since:
2.6.12

getMatchingText

public Collection getMatchingText(String[] queries)

Return a Collection of PageExtractor.Text items on this page that match one of the specified substrings. This method behaves exactly like getMatchingText(String) but allows more than one substring to be matched.

Parameters:
queries - a list of zero or more Strings to search for
Returns:
a Collection of PageExtractor.Text objects.
Since:
2.8.1
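
To make the idea of matching several substrings concrete, here is a rough model using plain Strings in place of PageExtractor.Text items. It is not the library's implementation: SubstringMatches and Match are hypothetical names, and the choices made here (case-sensitive comparison, matches confined to a single item, overlapping matches all reported) are assumptions that this page does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model, not part of the library. */
final class SubstringMatches {

    /** One hit: the index of the item it was found in, its offset, and the matched text. */
    static final class Match {
        final int item;
        final int offset;
        final String text;

        Match(int item, int offset, String text) {
            this.item = item;
            this.offset = offset;
            this.text = text;
        }
    }

    /** Report every occurrence of every query in every item, item by item. */
    static List<Match> find(List<String> items, String[] queries) {
        List<Match> hits = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            String s = items.get(i);
            for (String q : queries) {
                if (q.isEmpty()) {
                    continue;   // an empty query would match at every position
                }
                for (int at = s.indexOf(q); at >= 0; at = s.indexOf(q, at + 1)) {
                    hits.add(new Match(i, at, q));
                }
            }
        }
        return hits;
    }
}
```

In the real method each result is a Text item whose co-ordinates cover just the matched substring; the offset recorded in this model plays the same role.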

getTextAsStringBuffer

public StringBuffer getTextAsStringBuffer()
Parse and return all the text on the page as a StringBuffer. Text will be converted back to its normalized form, and newlines and spaces will be inserted in an approximation of the original layout.


getTextAsStringBuffer

public StringBuffer getTextAsStringBuffer(float x1,
                                          float y1,
                                          float x2,
                                          float y2)
Parse and return the text in the specified area on the page as a StringBuffer. Text will be converted back to its normalized form, and newlines and spaces will be inserted in an approximation of the original layout. The co-ordinates define the start position of any phrases that are to be returned.

Parameters:
x1 - the left-most X co-ordinate of the text
y1 - the top-most Y co-ordinate of the text
x2 - the right-most X co-ordinate of the text
y2 - the bottom-most Y co-ordinate of the text
Returns:
a StringBuffer containing all the text within the specified rectangle
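
One reading of "the co-ordinates define the start position" is that a phrase is selected by where it begins, not by its full extent. The sketch below models that reading only; StartPointFilter is a hypothetical name, and treating the rectangle edges as inclusive is an assumption.

```java
/** Hypothetical model, not part of the library. */
final class StartPointFilter {

    /**
     * True if the start point (px, py) of a phrase falls inside the rectangle.
     * The corners are normalized first, so the test does not depend on
     * whether y1 is numerically larger or smaller than y2.
     */
    static boolean startsInside(float px, float py,
                                float x1, float y1, float x2, float y2) {
        float left = Math.min(x1, x2);
        float right = Math.max(x1, x2);
        float bottom = Math.min(y1, y2);
        float top = Math.max(y1, y2);
        return px >= left && px <= right && py >= bottom && py <= top;
    }
}
```

Under this model, a phrase that starts at (100, 700) is selected by the rectangle (50, 750, 200, 650) even if the phrase itself runs past x = 200.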

getText

public StringBuffer getText(PageExtractor.Text first,
                            int firstchar,
                            PageExtractor.Text last,
                            int lastchar,
                            boolean displayorder)
Return a StringBuffer containing a contiguous range of text from this PageExtractor. The range is specified by giving a starting and ending PageExtractor.Text object, and the offsets into those strings. This method is chiefly intended for use with a GUI that allows a range of text to be selected.

Parameters:
first - the first Text from this PageExtractor to be extracted
firstchar - the first character from "first" to be extracted
last - the last Text from this PageExtractor to be extracted
lastchar - the last character from "last" to be extracted (note this is inclusive).
displayorder - if true, the iteration from first to last will go in display order, right to left and top to bottom. If false, the iteration will run in the order the text items exist in the document.
Since:
2.10.3
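
The inclusive offsets can be illustrated with a small model that uses a list of plain Strings in place of PageExtractor.Text items, and list indices in place of the first and last objects. RangeModel is a hypothetical name; unlike the real method, this sketch inserts no spaces or newlines between items.

```java
import java.util.List;

/** Hypothetical model, not part of the library. */
final class RangeModel {

    /**
     * Join the characters from items[first] at offset firstchar through
     * items[last] at offset lastchar, with both ends inclusive.
     */
    static StringBuffer range(List<String> items,
                              int first, int firstchar,
                              int last, int lastchar) {
        StringBuffer out = new StringBuffer();
        for (int i = first; i <= last; i++) {
            String s = items.get(i);
            int from = (i == first) ? firstchar : 0;
            // lastchar is inclusive, so the exclusive end index is lastchar + 1
            int to = (i == last) ? lastchar + 1 : s.length();
            out.append(s, from, to);
        }
        return out;
    }
}
```

With the items "Hello", "big" and "world", the call range(items, 0, 1, 2, 2) takes "ello" from the first item, all of "big", and "wor" from the last. A GUI selection that ends on a given character would pass that character's own index as lastchar, not the index after it.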

getPage

public PDFPage getPage()
Return the PDFPage this PageExtractor relates to.

Since:
2.10.3


Copyright © 2001-2008 Big Faceless Organization