Welcome to gbCapture, which extracts text from applications that display multiple pages of text but provide no built-in way to copy or save the text of the entire document.
Here's an image of the gbCapture main window.
With gbCapture, the user displays the target application (the one that displays the text to be captured) in the center of the desktop. gbCapture can then capture an image of that application's content and use the free Tesseract library to extract the text from the image.
gbCapture has a "Mini" mode to minimize its footprint on the screen. The "Mini" toolbar button toggles between full mode and mini mode.
Tesseract can be downloaded from UB Mannheim and must be installed in its default location.
To capture multiple pages, gbCapture starts by capturing an image of the current page, extracts the text from the image and then sends a keyboard "Next Page" command to the target application. That sequence is repeated for as many pages as the user specifies or until the end of the document is reached. The text extracted from all images is combined into a single document file.
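The capture sequence described above can be sketched as a simple loop. This is only an illustration in Python, not gbCapture's actual code: the function name `capture_document` and the four callables (`capture_page`, `extract_text`, `send_next_page`, `should_stop`) are hypothetical stand-ins for the screen capture, the Tesseract call, the "Next Page" keystroke and the end-of-document check.

```python
from typing import Callable, List, Optional


def capture_document(
    max_pages: int,
    capture_page: Callable[[], bytes],
    extract_text: Callable[[bytes], str],
    send_next_page: Callable[[], None],
    should_stop: Callable[[Optional[str], str], bool] = lambda prev, cur: False,
) -> str:
    """Capture up to max_pages pages and return the combined text.

    All callables are hypothetical placeholders supplied by the caller.
    should_stop(previous_text, current_text) decides whether the end of
    the document has been reached; by default it never stops early.
    """
    texts: List[str] = []
    previous: Optional[str] = None
    for _ in range(max_pages):
        image = capture_page()       # 1. capture an image of the current page
        text = extract_text(image)   # 2. extract the text from that image
        if should_stop(previous, text):
            break                    # end of document: discard this capture
        texts.append(text)
        previous = text
        send_next_page()             # 3. ask the target application to page down
    # The text from all pages is combined into a single document.
    return "\n".join(texts)
```

Passing the steps in as callables keeps the loop itself independent of how images are captured or how keystrokes are sent, which is also what makes it easy to try out with fake pages.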
gbCapture toolbar functions:
Bitmaps and text files are named simply as "0001" to "000X" (.bmp and .txt), according to how many pages are captured.
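As an illustration of that naming scheme, the following Python sketch builds the pair of file names for a given page number. The helper name `page_filenames` and the fixed four-digit width are assumptions made for the example; how gbCapture names pages beyond 9999 is not stated here.

```python
from typing import Tuple


def page_filenames(page_number: int, width: int = 4) -> Tuple[str, str]:
    """Return the (bitmap, text) file names for a 1-based page number.

    Hypothetical helper: pads the page number with leading zeros to
    `width` digits, e.g. page 1 gives "0001.bmp" and "0001.txt".
    """
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    stem = f"{page_number:0{width}d}"
    return f"{stem}.bmp", f"{stem}.txt"
```

Zero-padded names have the practical benefit that an alphabetical file listing is also in page order.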
As shown in this next image, gbCapture can display the most recent set of saved images and their corresponding text content. A list of bitmaps is shown on the left, with the selected bitmap and its extracted text shown on the right.
The Settings toolbar button opens a window that lists all of the keyboard shortcuts and variables, along with the current value for the variables.
These keyboard shortcuts set operating variables:
And these shortcuts perform actions:
Note: "CS-B" means to press and hold the Control and Shift keys while pressing the "B" key.
Tesseract
For Tesseract to work most accurately, the text must be fully visible, meaning that there must be empty margins surrounding the text. Partially visible lines of text will be misread by Tesseract.
Some document viewers, such as Word, WordPad and Kindle for the PC, provide that margin.
Other document viewers, such as Notepad, browsers, and RichEdit controls, can display partial lines of text and are not suitable for use with gbCapture.
End of Document
gbCapture does not limit the number of pages a user can request to be captured, but it will stop automatically when it reaches the end of the document.
The end of the document is assumed to have been reached when two consecutive pages result in exactly the same extracted text.
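The rule can be illustrated with a short Python sketch. The names `reached_end` and `trim_at_repeat` are hypothetical, and dropping the repeated page from the result is an assumption made for the example; the text above only says that capturing stops.

```python
from typing import List, Optional


def reached_end(previous: Optional[str], current: str) -> bool:
    """True when the current page's text exactly matches the previous page's."""
    return previous is not None and previous == current


def trim_at_repeat(texts: List[str]) -> List[str]:
    """Keep pages up to, but not including, the first consecutive repeat.

    Assumption for this sketch: the repeated page is discarded.
    """
    kept: List[str] = []
    previous: Optional[str] = None
    for text in texts:
        if reached_end(previous, text):
            break
        kept.append(text)
        previous = text
    return kept
```

Because the comparison is exact, two captures of the same page that differ by even one character would not be treated as the end of the document.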
Partial Last Page
Some document viewers, such as Word, present the last page as a partial page of content, with blank lines filling the page below the content. This allows gbCapture to correctly capture the final page of a document.
However, some document viewers fill the display of the last page with content from the previous page in order to avoid blank lines on the last page. This will cause gbCapture to incorrectly report the content of the last captured page.
GoDo List
Here are some of the items I want to work on in future releases of gbCapture.