Welcome to gbCapture, which extracts text from applications that display multiple pages of text but provide no built-in way to copy or save the text of the entire document.



gbCapture Introduction

Here's an image of the gbCapture main window.

With gbCapture, the user displays the target application (the one that displays the text to be captured) in the center of the desktop. gbCapture can capture an image of the content of that application and use the free tesseract library to extract the text from the image.

gbCapture has a "Mini" mode to minimize its footprint on the screen. The "Mini" toolbar button toggles between full mode and mini mode.

Tesseract can be downloaded from UB Mannheim and must be installed in its default location.

To capture multiple pages, gbCapture starts by capturing an image of the current page, extracts the text from the image and then sends a keyboard "Next Page" command to the target application. That sequence is repeated for as many pages as the user specifies or until the end of the document is reached. The text extracted from all images is combined into a single document file.
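
The capture sequence described above can be sketched roughly as follows. This is only an illustration in Python, not gbCapture's own code: the three callables (capture_page, extract_text, send_next_page) are hypothetical stand-ins for the screen capture, the Tesseract step and the "Next Page" keystroke, and the stop rule is the one given under "End of Document" in the Operating Notes.

```python
from typing import Callable, List


def capture_document(
    capture_page: Callable[[], bytes],
    extract_text: Callable[[bytes], str],
    send_next_page: Callable[[], None],
    max_pages: int,
) -> List[str]:
    """Capture up to max_pages pages and return the text of each page."""
    pages: List[str] = []
    for _ in range(max_pages):
        text = extract_text(capture_page())
        # Identical text on two consecutive pages is treated as end of document.
        if pages and text == pages[-1]:
            break
        pages.append(text)
        send_next_page()
    return pages


if __name__ == "__main__":
    # Simulated viewer: three pages, and "Next Page" does nothing on the last one.
    screen = ["page one", "page two", "page three"]
    position = {"index": 0}

    def fake_capture() -> bytes:
        return screen[position["index"]].encode()

    def fake_ocr(image: bytes) -> str:
        return image.decode()

    def fake_next() -> None:
        position["index"] = min(position["index"] + 1, len(screen) - 1)

    print(capture_document(fake_capture, fake_ocr, fake_next, max_pages=10))
```

Passing the three steps in as functions keeps the loop independent of any particular capture or OCR method, which also makes it easy to try out with simulated pages as in the example.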

gbCapture toolbar functions:


Target Application

gbCapture detects the application covering the center of the desktop and uses that application as the source for text. To confirm that the user is capturing the desired text, gbCapture can display an image of the PC desktop with the target application highlighted, as shown in this image:
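
For readers curious how an application "covering the center of the desktop" might be found, here is one possible approach using the Win32 API through Python's ctypes module. It is an assumption about a workable technique, not a description of how gbCapture does it. The function names screen_center and window_title_at_center are made up for this sketch, and only the primary display is considered.

```python
import sys


def screen_center(width: int, height: int) -> tuple:
    """Return the pixel at the center of a width x height desktop."""
    return width // 2, height // 2


def window_title_at_center() -> str:
    """Return the title of the top-level window under the desktop center (Windows only)."""
    if sys.platform != "win32":
        raise OSError("this sketch uses the Win32 API and only runs on Windows")
    import ctypes
    from ctypes import wintypes

    user32 = ctypes.windll.user32
    user32.WindowFromPoint.argtypes = [wintypes.POINT]
    user32.WindowFromPoint.restype = wintypes.HWND
    user32.GetAncestor.argtypes = [wintypes.HWND, wintypes.UINT]
    user32.GetAncestor.restype = wintypes.HWND
    user32.GetWindowTextW.argtypes = [wintypes.HWND, wintypes.LPWSTR, ctypes.c_int]

    # SM_CXSCREEN = 0 and SM_CYSCREEN = 1: size of the primary display.
    x, y = screen_center(user32.GetSystemMetrics(0), user32.GetSystemMetrics(1))
    child = user32.WindowFromPoint(wintypes.POINT(x, y))
    top_level = user32.GetAncestor(child, 2)  # GA_ROOT = 2
    title = ctypes.create_unicode_buffer(256)
    user32.GetWindowTextW(top_level, title, 256)
    return title.value
```

WindowFromPoint may return a child control rather than the application window itself, which is why the sketch walks up to the root window before reading the title.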


Capture Bitmaps and Extract Text

gbCapture captures an image of the target application and saves the image to a file. Tesseract then extracts the text from the image, and that text is saved to a file as well. The text files are later combined to create the entire document.
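
As a rough illustration of the Tesseract step, the sketch below runs the Tesseract command-line program on one saved image and reads back the text file it writes. The executable path is an assumption (the usual default for the UB Mannheim installer) and the helper names are hypothetical. How gbCapture itself calls Tesseract is not documented here.

```python
import subprocess
from pathlib import Path

# Assumed default install path for the UB Mannheim build; adjust if yours differs.
TESSERACT_EXE = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def tesseract_command(image_path: str, exe: str = TESSERACT_EXE) -> list:
    """Build the command line: tesseract <image> <output base>."""
    # Tesseract appends ".txt" to the output base on its own.
    output_base = str(Path(image_path).with_suffix(""))
    return [exe, image_path, output_base]


def extract_text(image_path: str) -> str:
    """Run Tesseract on one image and return the text it wrote."""
    subprocess.run(tesseract_command(image_path), check=True)
    return Path(image_path).with_suffix(".txt").read_text(encoding="utf-8")
```

With these helpers, extract_text("0001.bmp") would leave a file named 0001.txt next to the image and return its content.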

Bitmaps and text files are named simply as "0001" to "000X" (.bmp and .txt), according to how many pages are captured.
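
The naming scheme amounts to a zero-padded, four-digit page number. A minimal sketch, with a hypothetical helper name:

```python
def page_file_names(page_number: int) -> tuple:
    """Return the (bitmap, text) file names for a 1-based page number."""
    stem = f"{page_number:04d}"  # 1 -> "0001", 12 -> "0012"
    return f"{stem}.bmp", f"{stem}.txt"


print(page_file_names(1))   # ('0001.bmp', '0001.txt')
print(page_file_names(12))  # ('0012.bmp', '0012.txt')
```

Zero padding keeps the files in page order when they are sorted by name.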

As shown in this next image, gbCapture can display the most recent set of saved images and their corresponding text content. A list of bitmaps is shown on the left, with the selected bitmap and its extracted text shown on the right.


Document Text

The text extracted from each bitmap is appended into a single document file, which gbCapture can then display as shown in this next image. gbCapture can also copy the document content to the clipboard or save the document file to a new location.
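
Combining the per-page text files into one document could look something like the sketch below. The output name "document.txt", the newline between pages and the function name are all assumptions made for the example. gbCapture's actual document file name and page separator are not stated here.

```python
from pathlib import Path


def combine_pages(folder: str, output_name: str = "document.txt") -> Path:
    """Join the numbered page files in folder, in page order, into one file."""
    folder_path = Path(folder)
    # Only numbered files such as 0001.txt count as pages; zero padding makes
    # a plain sort return them in page order.
    page_files = sorted(p for p in folder_path.glob("*.txt") if p.stem.isdigit())
    text = "\n".join(p.read_text(encoding="utf-8") for p in page_files)
    output_path = folder_path / output_name
    output_path.write_text(text, encoding="utf-8")
    return output_path
```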


Settings

gbCapture supports several keyboard shortcuts, which perform less frequently used actions. Using shortcuts helps minimize the footprint and complexity of the gbCapture main screen.

The Settings toolbar button opens a window that lists all of the keyboard shortcuts and variables, along with the current value for the variables.

These keyboard shortcuts set operating variables:

And these shortcuts perform actions:

Note: "CS-B" means to press and hold the Control and Shift keys while pressing the "B" key.


Operating Notes

Images and Files
When a capture is started, all images and text files from previous captures are deleted. It is up to the user to save/move files from a capture as needed.
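
The cleanup at the start of a capture might be pictured like this. The sketch assumes that only the numbered .bmp and .txt page files are removed. That filter, and the function name, are guesses made for illustration rather than documented gbCapture behavior.

```python
from pathlib import Path


def clear_previous_capture(folder: str) -> int:
    """Delete numbered .bmp/.txt page files in folder and return how many were removed."""
    removed = 0
    # Snapshot the listing first so files are not deleted while iterating.
    for path in list(Path(folder).iterdir()):
        if path.suffix.lower() in {".bmp", ".txt"} and path.stem.isdigit():
            path.unlink()
            removed += 1
    return removed
```

Anything worth keeping from the previous capture should be copied elsewhere before a new capture begins.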

Tesseract
For Tesseract to work accurately, every line of text must be fully visible, meaning that there must be empty margins surrounding the text. Tesseract will misread partially visible lines of text.

Some document viewers, such as Word, WordPad and Kindle for the PC, provide that margin.

Other document viewers, such as Notepad, browsers, and RichEdit controls, allow partial lines of text to be displayed and are not suitable for use with gbCapture.

End of Document
gbCapture does not limit the number of pages a user can request, but it stops automatically when it reaches the end of the document.

gbCapture assumes it has reached the end of the document when two consecutive pages yield exactly the same extracted text.

Partial Last Page
In some document viewers, such as Word, the last page is shown as a partial page of content, with blank lines filling the page below the content. This allows gbCapture to correctly capture the final page of a document.

However, some document viewers fill the display of the last page with content from the previous page in order to avoid blank lines on the last page. In that case the last captured page repeats text from the previous page, so gbCapture reports the content of the last page incorrectly.

GoDo List
Here are some of the items I want to work on in future releases of gbCapture.


Comments and suggestions are welcome. Send to Gary Beene at gbeene@airmail.net.