System for searching a corpus of document images by user specified document layout components
United States Patent 5999664
A document search system provides a user with a programming interface for dynamically specifying features of documents recorded in a corpus of documents. The programming interface operates at a high level that is suitable for interactive user specification of layout components and structures of documents. In operation, a bitmap image of a document is analyzed by the document search system to identify layout objects such as text blocks or graphics. Subsequently, the document search system computes a set of attributes for each of the identified layout objects. The set of attributes which are identified are used to describe the layout structure of a page image of a document in terms of the spatial relations that layout objects have to frames of reference that are defined by other layout objects. After computing attributes for each layout object, a user can operate the programming interface to define unique document features. Each document feature is a routine defined by a sequence of selection operations which consume a first set of layout objects and produce a second set of layout objects. The second set of layout objects constitutes the feature in a page image of a document. Using the programming interface, a user flexibly defines a genre of document using the user-specified document features.
Inventors:
Mahoney, James V. (Los Angeles, CA)
Blomberg, Jeanette L. (Portola Valley, CA)
Trigg, Randall H. (Palo Alto, CA)
Shin, Christian K. (Fairport, NY)
Publication Date:
12/07/1999
Filing Date:
11/14/1997
Assignee:
Xerox Corporation (Stamford, CT)
Primary Class:
Other Classes:
707/E17.023,
707/E17.026
International Classes:
G06F17/30; G06K9/20; G06T1/00; G06T7/40; G06T11/60; (IPC1-7): G06K9/54
Field of Search:
382/180, 382/305, 382/293, 382/295, 382/173, 382/176, 707/517, 707/520, 707/523
US Patent References:
5848186    Wang et al.          382/180
5848184    Taylor et al.        382/180
5841900    Rahgozar et al.      382/180
5832118    Kim                  382/262
5701500    Ikeo et al.          382/180
5598507    Kimber et al.        395/2.55
5539841    Huttenlocher et al.  382/218
5537491    Mahoney et al.       382/218
5524066    Kaplan et al.        382/229
5491760    Withgott et al.      382/203
5442778    Pedersen et al.      395/600
5434953    Bloomberg            395/139
5390259    Withgott et al.      382/9
5384863    Huttenlocher et al.  382/9
5369714    Withgott et al.      382/9
5335088    Fan                  358/429
5325444    Cass et al.          382/9
5321770    Huttenlocher et al.  382/22
Other References:
Ashley, Jonathan et al. "Automatic and Semi-Automatic Methods for Image Annotation and Retrieval in QBIC," in Storage and Retrieval for Image and Video Databases III, Proceedings SPIE 2420, Feb. 9-10, 1995, pp.24-35.
Belongie, Serge et al. "Recognition of Images in Large Databases Using a Learning Framework," U.C. Berkeley C.S. Technical Report 97-939.
Blomberg et al. "Reflections on a Work-Oriented Design Project," pdc '94: Proceedings of the Participatory Design Conference, Oct. 27-28, 1994: pp. 99-109. Revised publication in Human-Computer Interaction in 1996, at vol. 11, pp. 237-265.
Carson, Chad et al. "Region-Based Image Querying," IEEE Proceedings of CAIVL '97, Puerto Rico, Jun. 20, 1997.
Carson, Chad and Virginia E. Ogle. "Storage and Retrieval of Feature Data for a Very Large Online Image Collection," IEEE Computer Society Bulletin of the Technical Committee on Data Engineering, Dec. 1996, vol. 19, No. 4.
Fernandes et al. "Coding of Numerical Data in JBIG-2," published by ISO/IEC JTC 1/SC 29/WG 1 (ITU-T SG8) standards for Coding of Still Pictures (JBIG/JPEG), Aug. 18, 1997.
Haralick, R. "Document Image Understanding Geometric and Logical Layout," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1994: pp. 385-390.
Niblack, W. et al. "The QBIC Project: Querying Images By Content Using Color, Texture, and Shape," SPIE Proc. Storage and Retrieval for Image and Video Databases, 1993, pp. 173-187.
Rucklidge, William. 1996. Efficient Visual Recognition Using The Hausdorff Distance, Lecture Notes in Computer Science vol. 1173, G. Goos et al. ed., Santa Clara, Springer.
Syeda-Mahmood, Tanveer. "Indexing of Handwritten Document Images," Proceedings of IEEE Document Image Analysis Workshop, Puerto Rico, Jun. 20, 1997.
TextBridge Pro 98 User's Guide, by ScanSoft Inc., a Xerox Company, 1997. (Available on the internet at: /scansoft/tbpro98win/tbpro98windocumentation.htm) With specific reference to "Zoning the Page" on pp. 2-18 through 2-20.
Primary Examiner:
Couso, Yon J.
Parent Case Data:
CROSS-REFERENCE TO RELATED APPLICATIONS
Cross-reference is made to U.S. patent application Ser. Nos. 08/971,210, entitled "System For Summarizing A Corpus Of Documents By Assembling User Specified Layout Components" (Attorney Docket No. D/97493), 08/970,507, entitled "System For Sorting Document Images By Shape Comparisons Among Corresponding Layout Components" (Attorney Docket No. D/97494), and 08/971,020, entitled "System For Progressively Transmitting And Displaying Layout Components Of Document Images" (Attorney Docket No. D/97495), which are assigned to the same assignee as the present invention.
1. A method for searching a corpus of document images stored in a memory, comprising the steps of:
segmenting each document image in the corpus of document images into a first set of layout objects; each layout object in the first set of layout objects being one of a plurality of layout object types; each of the plurality of layout object types identifying a structural element;
for each segmented document image, computing attributes for each layout object in the first set of layout objects; the computed attributes of each layout object having values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects;
providing a program interface for composing a routine that includes a sequence of selection operations for identifying certain of the layout objects in the first set of layout objects of an example document image selected from the corpus of document images; the certain layout objects defining a feature of the example document image;
executing the sequence of selection operations of the routine for identifying the feature of the example document image in ones of the document images in the corpus of document images; for each segmented document image, the sequence of selection operations receiving as input the first set of layout objects and the computed attributes to produce as output a second set of layout objects; said executing step identifying the ones of the document images in the corpus of document images that include at least one layout object in the second set of layout objects as having the feature of the example document image; and
displaying at the program interface the ones of the document images in the corpus of document images identified by said executing step.
2. The method according to claim 1, further comprising the step of highlighting in the example document image the feature identified by the routine composed at a program interface.
3. The method according to claim 1, wherein one of the selection operations executed by said executing step is a filtering operation that produces a second set of layout objects having attribute values between a minimum threshold value and a maximum threshold value.
4. The method according to claim 1, wherein one of the selection operations executed by said executing step is a gating operation that produces a second set of layout objects equal to the first set of layout objects when each of the layout objects in a document image have attribute values between a minimum threshold value and a maximum threshold value.
5. The method according to claim 1, wherein one of the selection operations executed by said executing step is an accumulation operation that identifies an attribute value for the layout objects in the first set of layout objects.
6. The method according to claim 1, further comprising the step of forming the corpus of document images in the memory by scanning hardcopy documents.
7. The method according to claim 1, wherein said segmenting step defines one of the plurality of layout objects to be a text block of text.
8. The method according to claim 1, further comprising the step of defining a structural model for identifying a genre of document images; the structural model defining a set of features absent and a set of features present in document images to specify a class of document images which express a common communicative purpose that is independent of document content.
9. A program storage device readable by a machine, embodying a program of instructions executable by the machine to perform method steps for searching a corpus of document images stored in a memory of a document management system, said method steps comprising:
segmenting each document image in the corpus of document images into a first set of layout objects; each layout object in the first set of layout objects being one of a plurality of layout object types; each of the plurality of layout object types identifying a structural element;
for each segmented document image, computing attributes for each layout object in the first set of layout objects; the computed attributes of each layout object having values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects;
providing a program interface for composing a routine that includes a sequence of selection operations for identifying certain of the layout objects in the first set of layout objects of an example document image selected from the corpus of document images; the certain layout objects defining a feature of the example document image;
executing the sequence of selection operations of the routine for identifying the feature of the example document image in ones of the document images in the corpus of document images; for each segmented document image, the sequence of selection operations receiving as input the first set of layout objects and the computed attributes to produce as output a second set of layout objects; said executing step identifying the ones of the document images in the corpus of document images that include at least one layout object in the second set of layout objects as having the feature of the example document image; and
displaying at the program interface the ones of the document images in the corpus of document images identified by said executing step.
10. The program storage device as recited in claim 9, wherein said method steps further comprise the step of defining a structural model for identifying a genre of document images; the structural model defining a set of features absent and a set of features present in document images to specify a class of document images which express a common communicative purpose that is independent of document content.
11. A document management system for searching a corpus of document images, comprising:
a memory for storing the corpus of document images and image processing instructions of the document management system;
a display;
a processor coupled to the memory and the display for executing the document image processing instructions of the document management system; the processor in executing the document image processing instructions:
segmenting each document image in the corpus of document images into a first set of layout objects; each layout object in the first set of layout objects being one of a plurality of layout object types; each of the plurality of layout object types identifying a structural element;
for each segmented document image, computing attributes for each layout object in the first set of layout objects; the computed attributes of each layout object having values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects;
providing a program interface on the display for composing a routine that includes a sequence of selection operations for identifying certain of the layout objects in the first set of layout objects of an example document image selected from the corpus of document images; the certain layout objects defining a feature of the example document image;
executing the sequence of selection operations of the routine for identifying the feature of the example document image in ones of the document images in the corpus of document images; for each segmented document image, the sequence of selection operations receiving as input the first set of layout objects and the computed attributes to produce as output a second set of layout objects;
identifying the ones of the document images in the corpus of document images that include at least one layout object in the second set of layout objects to have the feature of the example document image; and
displaying at the program interface on the display the ones of the document images in the corpus of document images identified as having the feature of the example document image.
12. The document management system according to claim 11, further comprising means for displaying at the program interface on the display an example document with the feature highlighted.
13. The document management system according to claim 11, wherein the program interface on the display comprises:
a first display area for specifying a
a second display area for specifying a set of images which exemplify the feature;
a third display area for specifying a set of input layout objects;
the set of input layout objects limiting the first set of layout objects of the document image consumed by the sequence of selection operations; and
a fourth display area for defining the sequence of selection operations of the feature.
14. The document management system according to claim 11, further comprising a program interface on the display for defining a model of a genre of document.
15. The document management system according to claim 14, wherein the program interface on the display comprises:
means for specifying that a feature is present in the genre of document; and
means for specifying that a feature is absent from the genre of document.
16. The document management system according to claim 14, wherein the genre of document specified is a letter.
17. The document management system according to claim 14, wherein the genre of document specified is a memo.
18. The document management system according to claim 11, wherein the document image is a bitmap image.
19. The document management system according to claim 11, wherein one of the sequence of selection operations is a filtering operation given by:
F(L,A,u,v,N) = {l ∈ L : uN ≤ A(l) < vN}, where:
L is a set of layout objects to which the filtering operation is applied;
u and v are threshold values; and
N is a normalization argument.
20. The document management system according to claim 11, wherein one of the sequence of selection operations is a gate operation given by: G(L,A,u,v,N) = L if uN ≤ A(L) < vN, and the empty set (φ) otherwise, where:
L is a set of layout objects to which the gate operation is applied;
u and v are threshold values; and
N is a normalization argument.
Description:
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to a system for managing and searching a large corpus of documents, and more particularly, to a system for a user to dynamically specify layout components of documents recorded in the large corpus of documents.

2. Description of Related Art

Searching for a document in a large heterogeneous corpus of documents stored in an electronic database is often difficult because of the sheer size of the corpus (e.g., 750,000 documents). Many of the documents that make up the corpus are documents that cannot be identified by simply performing text based searches. In some instances, some documents in the corpus may, for example, be scanned images of hardcopy documents, or images derived using PDF (Portable Document Format) or PostScript(R). In other instances, simply searching the text of documents may not narrow a search sufficiently to locate a particular document in the corpus.

Techniques for searching the text of a document in a large corpus of documents exist. U.S. Pat. No. 5,442,778 discloses a scatter-gather browsing method, which is a cluster-based method for browsing a large corpus of documents. This system addresses the extreme case in which there is no specific query, but rather a need to get an idea of what exists in a large corpus of documents. Scatter-gather relies on document clustering to present to a user descriptions of large document groups. Document clustering is based on the general assumption that mutually similar documents tend to be relevant to the same queries. Based on the descriptions of the document groups, the user selects one or more of the document groups for further study. These selected groups are gathered together to form a sub-collection. This process repeats and bottoms out when individual documents are viewed.

Also, techniques exist that analyze the machine readable text of a document for identifying the genre of documents. The genre of text relates to a type of text or type of document. An example of a method for identifying the genre of machine readable text is disclosed in U.S. Provisional Application Ser. No. 60/051,558, entitled "Article And Method Of Automatically Determining Text Genre Using Surface Features Of Untagged Texts" (Attorney Docket No. D/95465P). Initially, machine readable text is analyzed to formulate a cue vector. The cue vector represents occurrences in the text of a set of non-structural, surface cues, which are easily computable. A genre of the text is then determined by weighing the elements making up the cue vector.

Besides text found in a document, often the layout of a particular document contains a significant amount of information that can be used to identify a document stored in a large corpus of documents. Using the layout structure of documents to search a large corpus of documents is particularly advantageous when documents in the corpus have not been tagged with a high level definition. Hardcopy documents which are scanned are recorded as bitmap images that have no structural definition that is immediately perceivable by a computer. A bitmap image generally consists of a sequence of image data or pixels. To become searchable, the structure of a bitmap image is analyzed to identify its layout structure.
By examining different work practices, it has been found that a work process (i.e., manner of working) can be supported with a system that is capable of searching and retrieving documents in a corpus by their type or genre (i.e., functional category). Where some genres of documents are general in the sense that they recur across different organizations and work processes, other genre of documents are idiosyncratic to a particular organization, task, or even user. For example, a business letter and a memo are examples of a general genre. A set of documents with an individual's private stamp in the upper right corner of each document is an example of a genre that is idiosyncratic to a particular user. It has also been found that many different genre of documents have a predefined form or a standard set of components that depict a unique spatial arrangement. For example, business letters are divided into a main body, author and recipient addresses, and signature. Unlike specific text based identifiers, which are used to identify the genre of a document, the layout structure of documents can apply across different classes of documents. A number of different techniques have been developed for analyzing the layout structure of a bitmap image. Generally, page layout analysis has been divided into two broad categories: geometric layout analysis and logical structure analysis. Geometric layout analysis extracts whatever structure can be inferred without reference to models of particular kinds of pages--e.g., letter, memo, title page, table, etc. Logical structure analysis classifies a given page within a repertoire of known layouts, and assigns functional interpretations to components of the page based on this classification. Geometric analysis is generally preliminary to logical structure analysis. (For further background on image layout analysis see U.S. patent application Ser. No. 08/565,181, entitled "Method For Classifying Non-Running Text In An Image" and its references). The present invention concerns a method and apparatus for defining user-specified layout structures of documents (i.e., the visual appearance) to facilitate the search and retrieval of a document stored in a multi-genre database of documents. This method of searching documents focuses a search according to the manner in which the layout structure of a document is defined. Unlike many techniques for searching the text within a document, searching documents according to their layout structure is based on the appearance and not the textual content found in a document. The general premise for searching documents based on their layout structure is that the layout structure of text documents often reflect its genre. For example, business letters are in many ways more visually similar to one another than they are to magazine articles. Thus, a user searching for a particular document while knowing the class of documents is able to more effectively narrow the group of documents being searched. One problem addressed by this invention is how to best manage a large corpus of scanned documents. Many document search and retrieval systems rely entirely on the results of applying OCR (Optical Character Recognition) to every scanned document image. Generally, OCR techniques involve segmenting an image into individual characters which are then decoded and matched to characters in a library. 
Typically, such OCR techniques require extensive computational effort, generally have a non-trivial degree of recognition error, and often require significant amounts of time for image processing. In operation, OCR techniques distinguish each bitmap of a character from its neighbor, analyze its appearance, and distinguish it from other characters in a predetermined set of characters. A disadvantage of OCR techniques is that they are often an insufficient means for capturing information in scanned documents because the quality of OCR results may be unacceptably poor. For example, the OCR results for a scanned document may be poor in quality because the original document was a heavily used original, a facsimile of an original, or a copy of an original. In each of these examples, the scanned results of an original document may provide insufficient information for an OCR program to accurately identify the text within the scanned image. In some instances, some scanned documents may be handwritten in whole or in part, thereby making those portions of the original document unintelligible to an OCR program. Another disadvantage of OCR techniques is that the layout or formatting of the document is typically not preserved by an OCR program. As recognized by Blomberg et al. in "Reflections on a Work-Oriented Design Project" (published in PDC'94: Proceedings of the Participatory Design Conference, p. 99-109, on Oct. 27-28, 1994), users searching for a particular document in a large corpus of documents tend to rely on clues about the form and structure of the documents. Such clues, which could be gained from either the original bitmap image or reduced scale images (i.e., thumbnails), tend to be lost in ASCII text renderings of images. Thus, the layout or formatting of a document, which is usually not captured or preserved when a scanned image is reduced to text using an OCR program, is crucial information that can be used for identifying that document in a large corpus of documents. Improved OCR programs such as TextBridge(R), which is produced by Xerox ScanSoft, Inc., are capable of converting scanned images into formatted documents (e.g. HTML (hypertext markup language)) with tables and pictures as opposed to a simple ASCII text document (more information can be found on the Internet at /xis/textbridge/). An alternative technique for identifying information contained in electronic documents without having to decode a document using OCR techniques is disclosed in U.S. Pat. No. 5,491,760 and its references. This alternative technique segments an undecoded document image into word image units without decoding the document image or referencing decoded image data. Once segmented, word image units are evaluated in accordance with morphological image properties of the word image units, such as word shape. (These morphological image properties do not take into account the structure of a document. That is, the word image units do not take into account where the shape appeared in a document.) Those word image units which are identified as semantically significant are used to create an ancillary document image of content which is reflective of the subject matter in the original document. Besides image summarization, segmenting a document into word image units has many other applications which are disclosed in related U.S. Pat. Nos. 5,539,841; 5,321,770; 5,325,444; 5,390,259; 5,384,863; and 5,369,714. For instance, U.S. Patent No. 
discloses a method for identifying when similar tokens (e.g., character, symbol, glyph, string of components) are present U.S. Pat. No. 5,324,444 discloses a method for determining the frequency of words in a document, and U.S. Pat. No. 5,369,714 discloses a method for determining the frequency of phrases found in a document. Another alternative to performing OCR analysis on bitmap images are systems that perform content-based searches on bitmap images. An example of such a system is IBM's Query by Image Content (QBIC) system. The QBIC system is disclosed in articles by Niblack et al., entitled "The QBIC project: querying images by content using color, texture and shape," in SPIE Proc. Storage and Retrieval for Image and Video Databases, 1993, and by Ashley et al., entitled "Automatic and semiautomatic methods for image annotation and retrieval in QBIC," in SPIE Proc. Storage and Retrieval for Image and Video Databases, pages 24-35, 1995. A demo of a QBIC search engine is available on the internet at "http://wwwqbic./.about.qbic/qbic.html". Using the QBIC(TM) system, bitmap images in a large database of images can be queried by image properties such as color percentages, color layouts, and textures. The image-based queries offered by the QBIC system are combined with text or keyword for more focused searching. Another system for performing content-based queries is being developed as part of the UC Berkeley Digital Library Project. Unlike the QBIC system which relies on low-level image properties to perform searches, the Berkeley system groups properties and relationships of low level regions to define high-level objects. The premise of the Berkeley system is that high-level objects can be defined by meaningful arrangements of color and texture. Aspects of the Berkeley system are disclosed in the following articles and their references: Chad Carson et al., "Region-Based Image Querying," CVPR '97 Workshop on Content-Based Access of Image and Video L Serge Belongie et al., "Recognition of Images in Large Databases Using a Learning Framework," UC Berkeley CS Tech Report 97-939; and Chad Carson et al., "Storage and Retrieval of Feature Data for a Very Large Online Image Collection," IEEE Computer Society Bulletin of the Technical Committee on Data Engineering, December 1996, Vol. 19 No. 4. In addition to using OCR programs or the like to decipher the content of scanned documents, it is also common to record document metadata (i.e., document information) at the time a hardcopy document is scanned. This document metadata, which is searchable as text, may include the subject of the document, the author of the document, keywords found in the document, the title of the document, and the genre or type of document. A disadvantage of using document metadata to identify documents is that the genre specified for a particular corpus of documents is not static. Instead, the number of different genre of documents in a corpus can vary as the corpus grows. A further disadvantage of document metadata is that it is time consuming for a user to input into a system. As a result, a system for managing and searching scanned documents should be robust enough to provide a mechanism for defining categories and sub-categories of document formats as new documents are added to the corpus. Another method for locating documents in a large corpus of documents is by searching and reviewing human-supplied summaries. 
In the absence of human-supplied summaries, systems can be used that automatically generate documents summaries. One advantage for using summaries in document search and retrieval systems is that they reduce the amount of visual information that a user must examine in the course of searching for a particular document. By being presented on a display or the like with summaries of documents instead of the entire document, a user is better able to evaluate a larger number of documents in a given amount of time. Most systems that automatically summarize the contents of documents create summaries by analyzing the ASCII text that makes up the documents. One approach locates a subset of sentences that are indicative of document content. For example, U.S. Pat. No. 5,778,397, assigned to the same assignee as the present invention, discloses a method for generating feature probabilities that allow later generation of document extracts. Alternatively, U.S. Pat. No. 5,491,760 discloses a method for summarizing a document without decoding the textual contents of a bitmap image. The summarization technique disclosed in the '760 Patent uses automatic or interactive morphological image recognition techniques to produce documents summaries. Accordingly, it would be desirable to provide a system for managing and searching a large corpus of scanned documents in which not only are text identified using an OCR program and inputted document metadata searchable but also the visual representations of scanned documents can be identified. Such a system would advantageously search, summarize, sort, and transmit documents using information that defines the structure and format of a document. It would also be desirable in such a system to provide an interface for a user to flexibly specify the genre of document by the particular layout format of documents. One reason this is desirable is that genre of documents tend to change and emerge over the course of using and adding document to a corpus. Consequently, an ideal system would give users the flexibility to specify either a new genre or a specific class of genre that is of interest to a single user or group of users. SUMMARY OF THE INVENTION In accordance with the invention there is provided a system, and method and article of manufacture therefor, for identifying a portion of a document stored in a memory of a document management system. In accordance with one aspect of the invention, the document image is segmented into a first set of layout objects. Each layout object in the first set of layout objects is one of a plurality of layout object types, and each of the plurality of layout object types identifies a structural element of a document. Attributes for each layout object in the first set of layout objects are computed. The computed attributes of each layout object are assigned values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects in the document image. A routine is executed by the system for identifying a feature of the document image. The routine is defined by a sequence of selection operations that consumes the first set of layout objects and uses the computed attributes to produce a second set of layout objects. Once the routine is executed, the feature of the document image is identified by the second set of layout objects.
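To make the summarized flow concrete, the short sketch below restates it as runnable Python pseudocode: each page is segmented into layout objects, attributes are computed per object, and a routine (a sequence of selection operations) consumes the first set of layout objects to produce a second set whose non-emptiness signals that the feature is present. All names here are illustrative assumptions; the patent does not prescribe an implementation language or API.

```python
# Hedged sketch of the claimed flow. The corpus is assumed to be already
# segmented, with attributes computed for every layout object.
from typing import Callable, Dict, List

LayoutObject = Dict[str, float]                      # attribute name -> value
Routine = Callable[[List[LayoutObject]], List[LayoutObject]]

def feature_present(objects: List[LayoutObject], routine: Routine) -> bool:
    """A feature is present when the routine's output set is non-empty."""
    return len(routine(objects)) > 0

def find_pages_with_feature(corpus: Dict[str, List[LayoutObject]],
                            routine: Routine) -> List[str]:
    """Return ids of page images whose layout objects exhibit the feature."""
    return [page_id for page_id, objects in corpus.items()
            if feature_present(objects, routine)]
```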
BRIEF DESCRIPTION OF THE DRAWINGS These and other aspects of the invention will become apparent from the following description read in conjunction with the accompanying drawings wherein the same reference numerals have been applied to like parts and in which: FIG. 1 is a block diagram of the general components used to practice t FIG. 2 illustrates a detailed block diagram of the document corpus management and search system shown in FIG. 1; FIG. 3 illustrates the manner in which document image data is arrange FIG. 4 is a flow diagram of an interaction cycle for defining a feature using sequences of
FIG. 5 is a flow diagram which sets forth the steps for specifying one or more selection operations or accumulation operations for the set of layout objects defined at step 408 in FIG. 4; FIG. 6 illustrates an example of a feature programmed using the interaction cycle set forth in FIGS. 4-5; and FIG. 7 illustrates in greater detail the genre model program interface 219 shown in FIG. 2; FIG. 8 illustrates examples of three different high level configurations of documents which can be defined by specifying either the absence or presence of attributes and features using the genre model program interface shown in FIG. 7; FIG. 9 illustrates an example of a search engine interface for searching the corpus of documents s FIG. 10 illustrates a summarization display profile, which can be used to define the output format of a composite summary image of user- FIG. 11 is a flow diagram which sets forth the steps in for generating user-crafted s FIGS. 12, 13, and 14 illustrate three different examples of summary images created using the steps outlined in FIG. 10; FIG. 15 is a flow diagram which sets forth the steps for sorting document images according to the similarities between layout objects segmented
FIG. 16 is a flow diagram which sets forth one embodiment for sorting the set of image segments at step 1508 shown in FIG. 15; FIG. 17 illustrates a grouping of image segments that is formed using the method set forth in FIGS. 15 and 16; FIG. 18 is a flow diagram which sets forth an embodiment for sorting layout objects segmented from document images by their similarity to a spe FIG. 19 illustrates an example in which features of document images are sorted according to the similarity of a feature in a spec FIG. 20 is a flow diagram setting forth the steps for performing progressive transmission of document images from the perspective of a server workstation running the document search
FIG. 21 illustrates a progressive display profile for defining the order in which features and attributes of a document image are to be transmitted and/ FIG. 22 illustrates an example page image after completing the first stage where selected features letter-date, letter-recipient, and letter-signature are displayed at a high FIG. 23 illustrates a page image after completing the first stage where layout objects which do not have the selected features are displayed using bounding polygons, unlike FIG. 22 where the same features are displayed at a seco FIG. 24 illustrates a page image after completing the first stage where layout objects having a selected attribute are displayed at the first or high resolution and those layout objects which do not have the selected attribute are displayed at a seco and FIG. 25 illustrates the page images shown in FIGS. 22-24 after completing the second stage of display where the entire image is displayed at the first or high resolution.
DETAILED DESCRIPTIONA. System Overview Referring now to the drawings where the showings are for the purpose of describing the invention and not for limiting same, FIG. 1 illustrates a computer system 110 for carrying out the present invention. The computer system 110 includes a central processing unit 114 (i.e., processor) for running various operating programs stored in memory 116 which may include ROM, RAM, or another form of volatile or non-volatile storage. User data files and operating program files are stored on file storage device 117 which may include RAM, flash memory, floppy disk, or another form of optical or magnetic storage. The computer system 110 is coupled to various I/O (input/output) components 119 through bus interface 115. The I/O components include a facsimile 126, printer 127, scanner 128, and network 130. The processor 114 is adapted to receive and send data from bus interface 115 which couples the various I/O components 119 to processor 114 via bus 124. In response to one or more programs running in memory 116, the processor 114 receives signals from, and outputs signals to the various I/O components 119. Since computer system 110 can be linked to the internet via network 130, processor 114 can receive image data from other scanners, facsimiles, and memory storage devices located on the intemet. Operating in memory 116 is a document corpus search system 140 which includes the present invention. The system 140 may be associated with an article of manufacture that is packaged as a software product in a portable storage medium 142 which can be read by the computer system 110 through access device such as CD ROM reader 118. The storage medium 142 may, for example, be a magnetic medium such as floppy disk or an optical medium such as a CD ROM, or any other appropriate medium for storing data. Display 132 is provided for displaying user interfaces for relaying information to a user operating the system 140. User input devices 134 which may include but are not limited to, a mouse, a keyboard, a touch screen, are provided for the input of commands by the user. In one instance, the display 132 and the input devices 134 are used to operate a user interface for directing file storage 117 to record images of documents from scanner 128, facsimile 126, or network 130. Also, the user interface can be used for directing file storage 117 to transmit images of documents to facsimile 126, printer 127, or network 130. In one embodiment, the system 140 is operated on computer system 110 through commands received from a browser operating on the internet.B. Overview of Document Corpus Management and Search System FIG. 2 illustrates a detailed block diagram of the document corpus management and search system 140 for searching a corpus of documents in accordance with the present invention. The document corpus search system 140 includes four operating components: a corpus manager 210, an image segmentor and text identifier 211, a search engine 212, and a program manager 214. Input from a user to the document corpus search system 140 is made in response to either document input interface 216, search interface 218, genre model program interface 219, or feature program interface 220. Each of the interfaces 216, 218, 219, and 220, which are displayed on display 132, correspond to different services provided by the document corpus search system 140, which are each discussed below. 
In one embodiment, each of the interfaces 216, 218, 219, and 220 operate over the internet through network 130 through a conventional internet browser such as Microsoft's Explorer(TM) or Netscape's Navigator(TM). In accordance with the present invention, the document corpus management and search system 140 develops a structural description of scanned documents using geometric layout analysis. The structural description of a document is based on the document's configuration or layout format. In developing a structural description of a document, the image segmentor 211 identifies layout objects 238 which are structural descriptions of parts of a document. In addition, the image segmentor 211 computes attributes 240 for the identified layout objects. The attributes of a layout object either quantify a property of the layout object or identify a spatial relationship with respect to other layout objects. Advantageously, geometric layout analysis can be performed to identify structural similarities among documents of a given genre of documents (e.g., memos). The spatial arrangements of segmented layout objects in the page images of document images (also referred to herein as simply documents) can either be defined using attributes 240 or features 242. In defining spatial arrangements of objects in a page image, the image segmentor 211 examines the structure of text and graphics found in the page image. The text structure of a page image is described in terms of the spatial relations that blocks of text in a page image have to frames of reference that are defined by other blocks of text. Text blocks that are detected by the image segmentor 211 identify a structural element such as a paragraph of text. Unlike text on a page image which may be spatially related, the graphics structure of a page image may involve ad hoc graphical relationships. The system 140 operates on the general assumption that the genre (i.e., type) of a document image is reflected in the spatial arrangement of at least some of the objects on the page images of the document image. Using the feature program interface, features 242 are defined by a user. In addition to deriving features, a user can specify genre models 244 using genre model program interface 219. Each genre model 244 identifies a spatial arrangement of objects in page images of a document image that are shared between a collection of document images. By defining a genre model, a user is capable of defining a class of document images which express a common communicative purpose that is independent of document content.C. Classifying a Corpus of Documents The service made available through the document input interface 216, provides a facility for populating a database (or collection) of document images 237. The database of document images is populated with either scanned hardcopy documents or electronically generated documents. For example, the scanner 128 can be used to create bitmap images that represent hardcopy documents, whereas the input devices 134 can be used to create electronic documents. In addition, the database of document images can be populated by receiving both scanned hardcopy documents and electronically generated documents over network 130. The document collection which populates file system 117 is arranged hierarchically. It will be understood by those skilled in the art, that for the purposes of the present invention, the operations set forth herein may be performed on the entire document collection or some subset of the document collection. 
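The genre models mentioned above specify a class of documents by which features must be present and which must be absent. As a hedged illustration only (the function, model, and feature names below are assumptions, not the patent's data model, except that letter-date and letter-signature appear later in the text as example features), such a model could be checked like this:

```python
# Illustrative sketch of a genre model: a page matches the genre when every
# required feature is present and every excluded feature is absent. Feature
# values follow the convention used in the text: a non-empty set of layout
# objects means the feature is present on the page.
from typing import Dict, List, Set

def matches_genre(feature_values: Dict[str, Set[int]],
                  features_present: List[str],
                  features_absent: List[str]) -> bool:
    def has(name: str) -> bool:
        return len(feature_values.get(name, set())) > 0
    return all(has(f) for f in features_present) and not any(has(f) for f in features_absent)

# Hypothetical "business letter" model: requires letter-date and
# letter-signature, excludes a (made-up) memo-header feature.
letter_model = {
    "features_present": ["letter-date", "letter-signature"],
    "features_absent": ["memo-header"],
}
# is_letter = matches_genre(page_feature_values,
#                           letter_model["features_present"],
#                           letter_model["features_absent"])
```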
As part of the file system's hierarchy, each document image 237 is associated with a document data structure which includes an array of one or more pages, a pointer to one or more genre values 244, and a pointer to document metadata 224. Each page in the array of pages is associated with a page data structure which includes a pointer to a page image 226, and can include a pointer to one or more reduced scale images 228, a pointer to one or more structural images 230, a pointer to layout objects 238, a pointer to attributes 240, a pointer to OCRed text 236, or a pointer to feature values 242. In accordance with the hierarchical arrangement, each document image 237 consists in part of one or more page images 226. A page image 226 is defined herein as one page of a scanned hardcopy or electronically generated document. Responsive to commands from a user, corpus manager 210 records document images 237 in file system 117. Using document input interface 216, a user can manually specify properties of document images which are recorded in file system 117 as document metadata 224. The document metadata 224 may be specified by a user at the time, or some time after, a document image is scanned or otherwise added to the file system 117. More specifically, document metadata 224 for a document image stored in file system 117 may have recorded therein a document type identifier, a document creation date, a document title, and document keywords. In addition to storing document metadata 224 and page images 226, corpus manager generates reduced scale images 228 and structural images 230. Depending on the preferences of a user, a particular resolution can be selected by a user for viewing the recorded page images. In accordance with user preferences, reduced scale images with varying degrees of resolution are generated for each of the page images 226. In one embodiment, reduced scale images are generated using the method set forth in U.S. Pat. No. 5,434,953, which is incorporated herein by reference. Generally, reduced scale images are used as a visual index into a higher resolution page image. Similar to the reduced scale images, structural images 230 have varying degrees of resolution that can be specified by a user. However, unlike reduced scale images, structural images 230 highlight particular layout objects in page images. In one embodiment, corpus manager 210 generates reduced scale images and structural images on demand to conserve disk space.C.1 Layout Object Segmentation After recording page images 226 of document images 237, image segmentor 211 segments the pages images of each document image into one or more layout objects 238. Each segmented layout object of a page image is identified by image segmentor 211 as one of the primitive layout object types (or "layout objects") listed in Table 1. Layout objects are defined herein as primitive elements which are structural descriptions of abstract parts of a document image. (As defined herein, a document image implicitly refers to its page images.) One skilled in the art, however, will appreciate that the list of primitive layout object types in Table 1 is illustrative and could be modified to include other layout object types. For example, Table 1 could include a layout object for halftone regions.
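The hierarchy just described maps naturally onto a pair of small records. The sketch below mirrors the pointers named in the text (page image, reduced scale images, structural images, layout objects, attributes, OCRed text, feature values, genre values, and document metadata); the class names and field types are assumptions made for illustration, not the patent's data structures.

```python
# Hedged sketch of the per-document and per-page records described above.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class PageData:
    page_image: Any                                   # scanned page bitmap
    reduced_scale_images: List[Any] = field(default_factory=list)
    structural_images: List[Any] = field(default_factory=list)
    layout_objects: List[Dict[str, Any]] = field(default_factory=list)
    attributes: List[Dict[str, float]] = field(default_factory=list)
    ocr_text: Optional[str] = None
    feature_values: Dict[str, List[int]] = field(default_factory=dict)

@dataclass
class DocumentImage:
    pages: List[PageData]
    genre_values: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)  # type, date, title, keywords
```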
______________________________________
Table 1: Layout Object Types
OBJECT        EXPLANATION
______________________________________
Text-Blocks   paragraph-level blocks of text
Page          image region occupied by the page
Graphics      connected components of salient width and height
H-Lines       horizontal straight line segments of graphics
V-Lines       vertical straight line segments of graphics
H-Rules       horizontal straight lines of salient length
V-Rules       vertical straight lines of salient length
H-Fragments   horizontal straight line segments of non-salient length
V-Fragments   vertical straight line segments of non-salient length
______________________________________
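Table 1's primitive types correspond to the numerical encoding carried by each layout object's type attribute. A hypothetical encoding (the specific integer values are an assumption) might look like:

```python
# Hypothetical numerical encoding of the primitive layout object types of Table 1.
from enum import IntEnum

class LayoutObjectType(IntEnum):
    TEXT_BLOCK = 1   # paragraph-level blocks of text
    PAGE = 2         # image region occupied by the page
    GRAPHICS = 3     # connected components of salient width and height
    H_LINE = 4       # horizontal straight line segments of graphics
    V_LINE = 5       # vertical straight line segments of graphics
    H_RULE = 6       # horizontal straight lines of salient length
    V_RULE = 7       # vertical straight lines of salient length
    H_FRAGMENT = 8   # horizontal segments of non-salient length
    V_FRAGMENT = 9   # vertical segments of non-salient length
```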
In one embodiment, the image segmentor 211 performs text block segmentation that is based on standard mathematical morphology methods used for finding text blocks in optical character recognition systems, as discussed by R. Haralick in "Document Image Understanding: Geometric and Logical Layout," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1994, pp. 385-390. In another embodiment, the image segmentor 211 may perform a text block segmentation process that is similar to that employed in the software product TextBridge(R) produced by Xerox ScanSoft, Inc. Alternate methods of text block segmentation are disclosed in U.S. Pat. No. 5,889,886 and patent application Ser. No. 08/565,181.

C.2 Defining Layout Structure

After segmenting the page images of a document image into one or more layout objects 238, image segmentor 211 computes image attributes 240 that correspond to each segmented layout object. The advantage of defining image attributes of layout objects, as compared with other image analysis techniques which operate on the textual content of documents, is that analyzing a page image to identify its image attributes does not rely on character recognition. Furthermore, in certain situations, layout objects of documents offer more information about the genre of a document (e.g., letter, memo, etc.) than the textual content in the page image of a document image. A further advantage, therefore, of the present invention is that it operates regardless of whether there exists any understanding of the textual content of a layout object of a document image. Instead of using textual information to identify the content of layout objects, the present invention develops an understanding of the visual appearance of a document image by analyzing the attributes of layout objects and their relationship to one another.

Different techniques are used to compute the attributes set forth in Tables 2-6. Many of the attributes which are defined in Tables 2-6 specify the layout structure of a page image in terms of spatial relations that certain blocks of text have in relation to other blocks of text. Two fundamental attributes of layout objects set forth in Table 2 include attributes that distinguish between running and non-running text blocks (e.g., running, non-running), and attributes that define grouping relations (or alignment) among text blocks (e.g., top-nr, mid-nr, and bot-nr). U.S. Pat. No. 5,889,886 and patent application Ser. No. 08/565,181, which are assigned to the same assignee as the present invention and incorporated herein by reference, disclose a method for detecting and classifying non-running text in a page image. Once identified, non-running text blocks are labeled as having either a top, bottom, or middle position in a page image based on their relative degrees of overlap with the top/bottom and left/right borders of the image, using the method disclosed in U.S. Pat. No. 5,537,491, which is incorporated herein by reference. In addition, non-running text blocks are labeled as having either a left, right, or center vertical alignment. To label a non-running text block as left-aligned, for example, it must belong to a left-x group to which a single-column running text block also belongs (that is, the left-x value is the same for both the non-running and running text block). This requires that the sufficient stability method set forth in U.S. Pat. No. 5,889,886 is applied independently to the left-x, right-x, and center-x coordinates of all text blocks.
In addition, non-running text blocks are labeled as being either a horizontal sequence of text blocks, a vertical sequence of text blocks, or a table using the method disclosed in U.S. patent application Ser. No. 08/565,181. These operations can be combined to define other more specific attributes (e.g., a top-left-aligned non-running text-block). Also, these operations can be combined with additional operations to impose further geometric constraints on image attributes (e.g., a top-left-aligned non-running text-block which is normalized relative to the total text-block area in a top non-running text region). The attribute types for layout objects are divided into generic attribute types and specific attribute types and stored in file system 117 as attributes 240. Generic attribute types are attributes that are defined for every primitive layout object. Table 2 illustrates generic attributes of each layout object (i.e., I/o) listed in Table 1. Specific attribute types are attributes that are defined specifically for a specific type of layout object. For example, Table 3 lists type specific attributes for text objects, Table 4 lists type specific attributes for graphic objects, and Table 5 lists type specific attributes for page objects. In addition, generic and specific attribute types of a layout object can be used to define composite attributes. Table 6 illustrates composite attributes that are defined using generic types of objects.
______________________________________
Table 2: Type Generic Attributes For All Objects
ATTRIBUTE     EXPLANATION
______________________________________
running       I/o is a running text region
non-running   I/o is a non-running text region
top-r         I/o is a running text region adjacent to the top image border
mid-r         I/o is a running text region not adjacent to the top image border
bot-r         I/o is in a running text region adjacent to the bottom image border
top-nr        I/o is a non-running text region adjacent to the top image border
mid-nr        I/o is a non-running text region not adjacent to the top or bottom image border
bot-nr        I/o is a non-running text region adjacent to the bottom image border
type          a numerical encoding of the type of I/o (e.g., text, graphics, etc.)
left-x        the minimum x-coordinate in I/o
top-y         the minimum y-coordinate in I/o
right-x       the maximum x-coordinate in I/o
bot-y         the maximum y-coordinate in I/o
x-span        bounding box width of I/o
y-span        bounding box height of I/o
girth         the maximum of all shortest cross-sections of I/o
area          the area of I/o in pixels
box-area      the area of the bounding box of I/o in pixels
______________________________________
______________________________________
Table 3: Type Specific Attributes For Text Objects
ATTRIBUTE       EXPLANATION
______________________________________
left-aligned    I/o is left-aligned with the running text
center-aligned  I/o is center-aligned with the running text
right-aligned   I/o is right-aligned with the running text
single-column   I/o is a single-column running text
multi-column    I/o is multi-column running text
two-column      I/o is two-column running text
three-column    I/o is three-column running text
four-column     I/o is four-column running text
tables          I/o is in a three-or-more column tabular structure
pairings        I/o is a two-column tabular structure
b-internal      I/o is inside the bounding box of a Graphic Object
h-internal      I/o is bounded above and below by H-Rule Objects
v-internal      I/o is bounded left and right by V-Rule Objects
cavity-area     the area of top and bottom cavities of I/o in pixels
table-row       the row-index of I/o in a tabular structure, if any
table-col       the column-index of I/o in a tabular structure, if any
______________________________________
______________________________________
Table 4: Type Specific Attributes For Graphics Objects
OBJECT       ATTRIBUTE       EXPLANATION
______________________________________
Graphics     occupancy       text pixel count inside the bounding box of I/o
V-Rules      h-occupancy     text pixel count between I/o and the V-Rule immediately right of it
V-Rules      h-index         horizontal index of I/o relative to the set of V-Rules
H-Lines      h-occupancy     text pixel count between I/o and the H-Rule immediately below it
H-Lines      h-index         horizontal index of I/o relative to the set of H-Rules
V-Lines      h-occupancy     text pixel count between I/o and the V-Line immediately right of it
V-Lines      h-index         horizontal index of I/o relative to the set of V-Lines
H-Fragments  v-occupancy     text pixel count between I/o and the H-Rule immediately below it
H-Fragments  v-index         vertical index of I/o relative to the set of H-Rules
H-Fragments  text-adjacency  count of adjacent Text-Block pixels
V-Fragments  v-occupancy     text pixel count between I/o and the V-Fragment immediately right of it
V-Fragments  v-index         horizontal index of I/o relative to the set of V-Fragments
V-Fragments  text-adjacency  count of adjacent Text-Block pixels
______________________________________
______________________________________
Table 5: Type Specific Attributes For Page Objects
ATTRIBUTE          EXPLANATION
______________________________________
contracted-width   the width of a set of objects, ignoring white space
contracted-height  the height of a set of objects, ignoring white space
aspect-ratio       x-span divided by y-span
______________________________________
______________________________________
Table 6: Composite Attributes
ATTRIBUTE     EXPLANATION
______________________________________
top-r-or-nr   conjunction of top-r and top-nr
bot-r-or-nr   conjunction of bot-r and bot-nr
aspect-ratio  x-span divided by y-span
______________________________________
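Many of the bounding-box attributes in Table 2 follow directly from an object's pixel coordinates. The sketch below is an illustrative reading of those table entries (girth and the running/non-running labels are omitted because they depend on the segmentation methods cited above); the function and variable names are assumptions, not the patent's implementation.

```python
# Hedged sketch: computing the bounding-box derived generic attributes of
# Table 2 for one layout object, given the (x, y) pixels it occupies.
# Assumes the pixel list is non-empty and free of duplicates.
from typing import Dict, Iterable, Tuple

def bounding_box_attributes(pixels: Iterable[Tuple[int, int]],
                            type_code: int) -> Dict[str, float]:
    pts = list(pixels)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    left_x, right_x = min(xs), max(xs)
    top_y, bot_y = min(ys), max(ys)
    x_span = right_x - left_x + 1          # bounding box width of I/o
    y_span = bot_y - top_y + 1             # bounding box height of I/o
    return {
        "type": type_code,                 # numerical encoding of the object type
        "left-x": left_x, "top-y": top_y,
        "right-x": right_x, "bot-y": bot_y,
        "x-span": x_span, "y-span": y_span,
        "area": len(pts),                  # area of I/o in pixels
        "box-area": x_span * y_span,       # area of the bounding box in pixels
        "aspect-ratio": x_span / y_span,   # composite attribute from Table 6
    }
```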
Attributes set forth in each of the Tables 2-6 can be binary-valued (i.e., true/false) or numerical-valued (i.e., integer or real). Those attribute types listed in the Tables 2-6 in italic font have boolean values. Binary valued attributes typically represent set membership relations among layout objects. For instance, the generic attribute types that are binary valued attributes such as "running" and "non-running" define grouping relations among layout objects. Numerical valued attributes typically represent intrinsic geometric properties of objects, or indices into sets with respect to ordinal relations. Although the values of the type attributes are represented as symbols in the Tables 2-6 for clarity, it will be understood by those skilled in the art that the values of the attributes, which are absolute (i.e., not normalized), are represented numerically. After identifying layout objects 238 for each page image 226, those layout objects identified as text blocks can be further processed by a text identifier which forms part of image segmentor 211. In one embodiment, each layout object identified as a text block is processed by text identifier 211 using an optical character recognition technique or a suitable alternative technique to recognize text located therein. It will be appreciated by those skilled in the art, however, that for the purposes of the present invention, there is no requirement to perform OCR on layout objects identified as text blocks. There exists, however, certain advantages for recognizing the text within layout objects identified as text blocks as will become evident from the teachings discussed below. Text that is recognized within a text-block layout object is stored in file system 117 as text 236, and may be searched using text based searches with search engine interface 218.C.3 Overview of Image Data FIG. 3 illustrates the organization of data that is associated with each of the page images 226 of a document image 237 stored in the file system 117. Initially, a user populates file system 117 with for example scanned images received from document scanner 128. Document metadata 224 for a document image can be entered by a user as type, date, title, and keyword information. Corpus manager 210 sub-samples page images 226 to form a set of reduced scale images 228. The reduced scale image with the lowest resolution is defined herein to be a thumbnail image. Other page images in descending resolution are defined herein to be large, mid, and small images. In addition, structural images 230 can be computed for each segmented layout object 238. As set forth above, image segmentor 211 segments the page images 226 of a document image into layout objects 238. For each of the layout objects that are segmented from the page images 226, the image segmentor further computes and stores in a compact form image attributes 240. The image attributes 240 can either be type-generic or type-specific attributes. In addition to attributes, each layout object 238 of a page image can be associated with one or more features 242 or genre models 244. The features 242 are defined using attributes 240 as described below in Section D. The genre models 244 are defined using either attributes 240 or the features 242 as set forth in Section E below.D. Defining the Layout Format of Documents Using Features Using the feature program interface 220, a user is able to specify a layout format that is unique to a particular genre of document by constructing a routine for detecting a feature. 
For example, a routine of a feature of a page image can be used to identify document images with a unique letterhead. In general, each feature 242 is defined by a routine and a value. The routine of a feature is a straight-line program having a sequence of one or more steps with no explicit branch operations. Each step of a routine is a selection operation that either gates or filters a set or a subset of the layout objects of a page image 226. Each selection operation of a routine is programmed by the user with the feature program interface 220. A routine takes as input a set or subset of layout objects of a page image. Depending on the selection operation(s) of a routine and the layout objects being evaluated, the output of the routine is a set of all, some, or none of the layout objects input into the routine.

Once a user programs a feature at the feature program interface 220, the program manager 214 records the routine of the feature with the other features 242 in file system 117. In addition, the program manager 214 performs, at some user-specified time, the selection operations specified in the routine on each page image 226 in file system 117, one page image at a time. In other words, selection operations are performed by the program manager with respect to the layout objects of a single page image, irrespective of the number of page images forming a document image. At each step of a routine, a determination is made by the program manager 214 as to whether the computed attributes of layout objects (see Tables 2-6 for examples of attributes) meet the specified constraints. The end result, after making a determination for each step in a routine, is a value for the page image. If the value of a feature for a page image is an empty (or null) set of layout objects, then the feature is not present in the page image. In contrast, if the value of a feature is a non-empty set of layout objects, then the feature is present in the page image. In one embodiment, a feature is recorded in file system 117 with a list of the page images that have layout objects which satisfy the selection operations of the feature. For quick retrieval, an index of those layout objects which satisfy the selection operations of the feature is stored along with each page image in file system 117. In effect, a feature 242 is used to identify page images 226 with layout objects 238 having attributes 240 that satisfy the programmed selection operation(s) of the feature. As additional page images 226 are added to the corpus of page images, layout objects 238, attributes 240, and features 242 can be computed for those additional page images. In general, this computation need only be done once; this ensures that invoking search engine 212 does not involve run-time image analysis of page images.

D.1 Programming Routines

After a set of image attributes has been computed for the segmented layout objects of a given corpus of document images, features can be defined using those attributes. Furthermore, after defining one or more features, new features can be defined using both attributes and any existing features. In this manner, features can be defined using previously defined features. Features, for example, can be defined using one or more routines (or functions) to perform selection operations over regions that have a particular structural layout in a page image.
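As an illustration of this evaluation model, the following is a minimal Python sketch, reusing the hypothetical LayoutObject and PageImage records sketched earlier: a feature's routine is a straight-line sequence of selection operations applied to a page's layout objects, and the feature is present on a page exactly when the resulting set is non-empty. The function names here are illustrative only.

```python
from typing import Callable

# A selection operation consumes a set of layout objects and produces a subset
# (filter) or all-or-nothing (gate); modeled here as a function on lists.
SelectionOp = Callable[[list[LayoutObject]], list[LayoutObject]]

def evaluate_feature(page: PageImage, steps: list[SelectionOp]) -> list[LayoutObject]:
    """Apply the routine's steps, in order, to the page's layout objects."""
    objects = page.layout_objects
    for step in steps:
        objects = step(objects)
    return objects          # empty list => feature absent from this page

def feature_present(page: PageImage, steps: list[SelectionOp]) -> bool:
    return len(evaluate_feature(page, steps)) > 0
```

Because each step only consumes and produces sets of layout objects, the same routine can be run over every page image in the corpus without branching logic.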
In its simplest form, a routine is defined so that when it is applied to a page image, the output of the routine is the set of layout objects in the page image which satisfy the definition of the routine. In effect, the layout format of a page image may be programmed using routines that operate on sets of layout objects 238. A user programs routines using a program composition language which requires the user only to define sequences of primitive operations or other previously defined routines. These sequences of primitive operations can be applied either to the entire corpus of documents or to a subset of the corpus of documents stored in file system 117.

When the corpus is populated as set forth in Section C above, there is defined for each page image 226 a set of layout objects Li which specifies the set of all layout objects defined for that page image. When executed, each routine consumes a set of layout objects Li and produces a new set of layout objects Lo, where Lo is a subset of Li. Some routines R that are programmed using the program composition language perform compositions of filter operations and/or gate operations. A filter operation F(L,A,u,v,N) produces the subset of layout objects in L whose value of the attribute argument A is not less than the threshold uN but less than the threshold vN. A gate operation G(L,A,u,v,N) produces the set of layout objects L itself if the value of the attribute argument A of L is not less than the threshold uN but less than the threshold vN; otherwise, it produces an empty set (i.e., φ). The gate operation thus provides a certain capacity for conditional behavior. Once defined, each selection operation of a routine, whether a gate operation or a filter operation, can be applied to the layout objects of each of the page images 226 stored in file system 117. The filter and gate selection operations can be defined mathematically as follows:
F(L,A,u,v,N) = {l ∈ L : uN ≤ A(l) < vN}; and

G(L,A,u,v,N) = L if uN ≤ A(L) < vN, and G(L,A,u,v,N) = φ otherwise;

where:

L is an input argument that specifies the set of layout objects to which each selection operation is applied;

A is an attribute argument that may be specified as either an attribute type (see Tables 2-6) or a previously defined routine R (in the event the attribute argument A is defined by a routine R, the attribute argument A is interpreted as a new binary-valued attribute A with A(l) = 1 if l ∈ R(L), and A(l) = 0 otherwise);

u and v are threshold arguments that may be either integer or real-valued constants; and

N is a normalization argument that is a numerical value.

Other routines R that are programmed using the program composition language consume a set of layout objects L and produce a scalar numerical value. The scalar numerical value represents a global value of the layout objects which can be used in any of the selection operations to specify one of the threshold arguments u or v, or to specify the attribute argument A of a gate operation. Such routines, which produce a scalar numerical value, are defined herein as accumulation operations. The feature composition language provides a user with the ability to define routines using the following three accumulation operations:
max, max(L,A), produces the maximum value of A for any l ∈ L;
min, min(L,A), produces the minimum value of A for any l ∈ L; and
sum, Σ(L,A), produces the sum of the values of A for all l ∈ L.

These accumulation operations can compose with the filter and gate selection operations, in that L may be the result of a sequence of operations (a sketch illustrating this composition appears below).

D.2 The Feature Program Interface

FIG. 4 is a flow diagram of an interaction cycle for defining a feature using sequences of selection operations.
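To illustrate the filter, gate, and accumulation operations defined in Section D.1 above, the following is a minimal Python sketch. It is not the patent's implementation: the function names, the use of plain lists and dicts, the attribute names "y-position" and "width", the 612-unit page width, and the gate taking a precomputed scalar rather than the full G(L,A,u,v,N) attribute argument are all simplifying assumptions. It reuses the hypothetical LayoutObject and PageImage records sketched earlier.

```python
# Filter: keep the layout objects whose attribute value satisfies uN <= A(l) < vN.
def filter_op(objects: list[LayoutObject], attr: str,
              u: float, v: float, n: float = 1.0) -> list[LayoutObject]:
    return [o for o in objects if u * n <= o.attributes.get(attr, 0) < v * n]

# Gate: pass the whole set through if a scalar derived from it satisfies
# uN <= value < vN; otherwise produce the empty set.
def gate_op(objects: list[LayoutObject], value: float,
            u: float, v: float, n: float = 1.0) -> list[LayoutObject]:
    return objects if u * n <= value < v * n else []

# Accumulation operations: consume a set of layout objects, produce a scalar.
def acc_max(objects, attr): return max(o.attributes.get(attr, 0) for o in objects)
def acc_min(objects, attr): return min(o.attributes.get(attr, 0) for o in objects)
def acc_sum(objects, attr): return sum(o.attributes.get(attr, 0) for o in objects)

# Composition example: keep layout objects in the top 20% of the page, but only
# if the widest of them spans at least half of a 612-unit page width.
def example_feature(page: PageImage) -> list[LayoutObject]:
    top_blocks = filter_op(page.layout_objects, "y-position", 0.0, 0.2)
    widest = acc_max(top_blocks, "width") if top_blocks else 0
    return gate_op(top_blocks, widest, 0.5, 1e9, n=612)
```

Here the gate's scalar is supplied by an accumulation over the result of a filter, matching the statement above that L may itself be the result of a sequence of operations.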
