Parsing in Lemur
Guest on 7th February 2023 02:25:35 AM


  1. Parsing in Lemur
  2. Overview
  3. This document discusses the parsing utilities provided by the Lemur toolkit.  They have been designed with flexibility and extensibility in mind.  If the required functionality is not currently implemented by the toolkit, it should be easy to add it and plug it into the parser framework.  The first section describes the parser applications and their options.  The second section describes the parser architecture (API) for developers.
  4.  
  5. The Parser Applications
  6. There are three parser applications provided in the toolkit.  PushIndexer builds a database, ParseToFile writes parsed text to a file, and ParseQuery parses queries and writes the output to a file.  Each application reads its parsing parameters from a parameter file.  The format of the file is:
  7.  
  8.  
  9.  
  10.       parameter = value; /* comment */
  11.  
  12.                
  13.  
  14. The first command line argument must be the parameter file.  The remaining command line arguments specify the data files for the application to parse.
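
To make the `parameter = value; /* comment */` format concrete, here is a minimal C++ sketch of a reader for it. This is not Lemur's own parameter reader; `parseParams` and `trim` are hypothetical names, and the sketch assumes that comments never nest and that every entry ends with a semicolon.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Strip leading and trailing white space from a string.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Hypothetical helper (not part of Lemur): parse "parameter = value;"
// entries, ignoring /* ... */ comments, into a name -> value map.
std::map<std::string, std::string> parseParams(std::string text) {
    // Remove /* ... */ comments first (assumed not to nest).
    for (std::size_t open = text.find("/*"); open != std::string::npos;
         open = text.find("/*", open)) {
        std::size_t close = text.find("*/", open + 2);
        if (close == std::string::npos) {  // unterminated comment: drop the rest
            text.erase(open);
            break;
        }
        text.erase(open, close + 2 - open);
    }

    std::map<std::string, std::string> params;
    std::size_t start = 0;
    while (start < text.size()) {
        std::size_t semi = text.find(';', start);
        if (semi == std::string::npos) break;  // ignore an unterminated tail
        std::string entry = text.substr(start, semi - start);
        std::size_t eq = entry.find('=');
        if (eq != std::string::npos) {
            std::string key = trim(entry.substr(0, eq));
            if (!key.empty()) params[key] = trim(entry.substr(eq + 1));
        }
        start = semi + 1;
    }
    return params;
}
```

Stripping comments before splitting on semicolons keeps the two concerns separate, so a semicolon inside a comment cannot end an entry early.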
  15.  
  16. PushIndexer
  17. PushIndexer builds a database using either the TrecParser or WebParser class and InvFPPushIndex.
  18.  
  19.  
  20.  
  21. Usage:    PushIndexer paramfile datfile1 datfile2 ...
  22.  
  23.  
  24.  
  25. Summary of parameters in paramfile:
  26.  
  27. ·         index – Name of the index (without the .ifp extension).
  28.  
  29. ·         memory – Memory (in bytes) available to InvFPPushIndex (default = 96000000).
  30.  
  31. ·         stopwords – Name of file containing stopword list.  Words in this file should be one per line. If this parameter is not specified, all words are indexed.
  32.  
  33. ·         acronyms – Name of file containing acronym list (one word per line).  Uppercase words recognized as acronyms (e.g. USA, U.S.A., USAs, USA's) are left uppercase if in the acronym list.  If no acronym list is specified, acronyms will not be recognized.
  34.  
  35. ·         docFormat – Specify “trec” for standard TREC formatted documents or “web” for web TREC formatted documents. The default is “trec”.
  36.  
  37. ·         stemmer – Specify “porter” to use Porter's stemmer.  If no stemmer is specified, no stemmer will be used.
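
As an illustration, a PushIndexer parameter file that uses the parameters listed above might look like the fragment below. The index name and file names are hypothetical, and writing the values without quotes is an assumption based on the `parameter = value;` format shown earlier.

```
index     = myindex;        /* hypothetical index name, without the .ifp extension */
memory    = 96000000;       /* the documented default, in bytes */
stopwords = stoplist.txt;   /* hypothetical file, one word per line */
acronyms  = acronyms.txt;   /* hypothetical file, one word per line */
docFormat = trec;           /* or web */
stemmer   = porter;         /* omit this entry to disable stemming */
```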
  38.  
  39. ParseToFile
  40. ParseToFile parses documents and writes output compatible with BuildBasicIndex. The program uses either the TrecParser class or WebParser class to parse.
  41.  
  42.  
  43.  
  44. Usage:    ParseToFile paramfile datfile1 datfile2 ...
  45.  
  46.  
  47.  
  48. Summary of parameters in paramfile:
  49.  
  50. ·         outputFile – Name of file to output parsed documents to.
  51.  
  52. ·         stopwords – Name of file containing stopword list.  Words in this file should be one per line.  If this parameter is not specified, all words are output to the file.
  53.  
  54. ·         acronyms – Name of file containing acronym list (one word per line).  Uppercase words recognized as acronyms (e.g. USA, U.S.A., USAs, USA's) are left uppercase if in the acronym list.  If no acronym list is specified, acronyms will not be recognized.
  55.  
  56. ·         docFormat – Specify “trec” for standard TREC formatted documents or “web” for web TREC formatted documents. The default is “trec”.
  57.  
  58. ·         stemmer – Specify “porter” to use Porter's stemmer.  If no stemmer is specified, no stemmer will be used.
  59.  
  60. ParseQuery
  61. ParseQuery parses queries using either the TrecParser or WebParser class and an Index.
  62.  
  63.  
  64.  
  65. Usage:    ParseQuery paramfile datfile1 datfile2 ...
  66.  
  67.  
  68.  
  69. Summary of parameters in paramfile:
  70.  
  71.  
  72.  
  73. ·         queryOutFile – The name of the file to write the parsed queries to.
  74.  
  75. ·         index – Name of the index (with the .ifp or .bsc extension).
  76.  
  77. ·         stopwords – Name of file containing stopword list.  Words in this file should be one per line.  If this parameter is not specified, all words are left in the query.
  78.  
  79. ·         acronyms – Name of file containing acronym list (one word per line).  Uppercase words recognized as acronyms (e.g. USA, U.S.A., USAs, USA's) are left uppercase as USA if USA is in the acronym list.  If no acronym list is specified, acronyms will not be recognized.
  80.  
  81. ·         docFormat – Specify “trec” for standard TREC formatted documents or “web” for web TREC formatted documents. The default is “trec”.
  82.  
  83. ·         stemmer – Specify “porter” to use Porter's stemmer.  If no stemmer is specified, no stemmer will be used.
  84.  
  85. The Parser Architecture for Lemur
  86.  
  87. The Lemur parser architecture revolves around one class, TextHandler, that allows for the chaining or pipelining of common parser components.  A TextHandler may be a stop-word list, stemmer, indexer, or parser.  Information is passed from a source, through TextHandlers that modify the information and pass it on, to a destination TextHandler.  An example of a source TextHandler would be a parser.  A stemmer would modify text and pass the information on to other TextHandlers.  A destination TextHandler might write parsed data to a file or push it into an index.  The diagram below is an example of how TextHandlers might be chained.
  88.  
  89.  
  90.  
  91.  
  92.  
  93.  
  94.  
  95.  
  96. The TextHandler class enforces chaining through its interface.  A diagram summarizing the functions of the TextHandler class is given below.  The next TextHandler in a chain is set using the setTextHandler function.  For example, calling the Parser’s setTextHandler function with the Stop-word list as its argument would cause information to be passed from the Parser to the Stop-word list.  A TextHandler may modify the information it receives before passing it on to the next TextHandler.  To do this, provide an implementation of the handleDoc function, the handleWord function, or both.  For example, a stemmer would stem the word in the handleWord function.  An Indexer would need to implement both the handleDoc and handleWord functions.  Inside those functions, the Indexer would push the words and documents into an index.
  97.  
  98.  
  99.  
  100.  
  101.  
  102.  
  103. The foundDoc and foundWord functions enforce the chaining of the calls.  When either is called, the corresponding handleDoc/handleWord function is called with the document number or word as its argument.  The foundDoc or foundWord function of the object's textHandler (the next handler in the chain) is then called with the return value of handleDoc/handleWord as its argument.  The TextHandler class provides base implementations of all functions, so a subclass only needs to override the functions it needs.  In general, subclasses should only override the handleDoc and handleWord functions.  Classes that act as sources of information should call the foundWord and foundDoc functions of their textHandler.
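
The chaining described above can be sketched in C++ roughly as follows. This is a simplified, hypothetical stand-in rather than Lemur's actual TextHandler: the class names `SketchHandler`, `Lowercaser`, and `Collector` are invented for illustration, and the `const char*` signatures are an assumption.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the TextHandler interface described
// above; the real Lemur class may differ in signatures and details.
class SketchHandler {
public:
    virtual ~SketchHandler() = default;

    // Set the next handler in the chain.
    void setTextHandler(SketchHandler* next) { next_ = next; }

    // Process the word locally, then forward the result down the chain.
    void foundWord(const char* word) {
        const char* out = handleWord(word);
        if (next_ != nullptr) next_->foundWord(out);
    }

    // Same pattern for document identifiers.
    void foundDoc(const char* docno) {
        const char* out = handleDoc(docno);
        if (next_ != nullptr) next_->foundDoc(out);
    }

protected:
    // Base implementations pass their input through unchanged.
    virtual const char* handleWord(const char* word) { return word; }
    virtual const char* handleDoc(const char* docno) { return docno; }

private:
    SketchHandler* next_ = nullptr;
};

// A modifying handler: lowercases each word before passing it on.
class Lowercaser : public SketchHandler {
protected:
    const char* handleWord(const char* word) override {
        if (word == nullptr) return nullptr;  // an earlier stage removed the word
        buffer_.assign(word);
        for (char& c : buffer_)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return buffer_.c_str();  // valid until the next call on this object
    }

private:
    std::string buffer_;
};

// A destination handler: records everything that reaches the end of the chain.
class Collector : public SketchHandler {
public:
    std::vector<std::string> docs;
    std::vector<std::string> words;

protected:
    const char* handleDoc(const char* docno) override {
        if (docno != nullptr) docs.emplace_back(docno);
        return docno;
    }
    const char* handleWord(const char* word) override {
        if (word != nullptr) words.emplace_back(word);
        return word;
    }
};
```

With a `Lowercaser lower` wired to a `Collector sink` through `lower.setTextHandler(&sink)`, calling `lower.foundWord("LEMUR")` leaves `"lemur"` in `sink.words`. Because `foundWord` and `foundDoc` are non-virtual, subclasses in this sketch can change what happens to a word but not whether it is forwarded, which is the sense in which the interface enforces chaining.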
  104.  
  105.  
  106.  
  107. The TextHandler class provides the basis for most of the classes used by Lemur for parsing.  The hope is that this class will provide a flexible base for extending parser functionality.  The following subsections discuss classes used by the parser applications.  The only one of these classes that does not extend the TextHandler class is the WordSet class.
  108.  
  109. WordSet
  110. The WordSet class is a simple wrapper to a set.  It is useful for stop-word lists or acronym lists.  It can load a list from a file.  The file format is one word per line.  WordSet does NOT remove white space on either side of the word – be careful when editing these files.  The contains function is used to check the presence of a word in the set.
  111.  
  112. Parser
  113. The Parser class is a generic interface for the parsers in the toolkit.  It assumes subclasses implement a parse function, which takes a filename.  The acronym list is a WordSet, and both parsers check uppercase words and recognized acronyms against this list.  If the word is in the acronym list, it is left uppercase.  Otherwise, the word is converted to lowercase.  If you do not wish to support the acronym list when you design your parser, you can simply ignore it.
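
The case rule above can be sketched as a small C++ function. `normalizeCase` is a hypothetical helper, not a Lemur function, and the acronym list is represented as a plain `std::set` rather than a WordSet. The sketch covers only the list lookup; recognizing variants such as U.S.A. as USA is left out.

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper, not part of Lemur: keep a word as written when it
// appears in the acronym list; otherwise convert it to lowercase.
std::string normalizeCase(const std::string& word,
                          const std::set<std::string>& acronyms) {
    if (acronyms.count(word) > 0) return word;
    std::string lower = word;
    for (char& c : lower)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return lower;
}
```

Passing an empty set lowercases every word, which matches the behavior of a parser that ignores the acronym list.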
  114.  
  115.  
  116.  
  117. Both the TrecParser and the WebParser remove contractions and possessives, have a simple acronym recognizer, and convert words to lowercase.
  118.  
  119. TrecParser
  120. The TrecParser provides a simple but effective parser for NIST’s TREC document format.  It recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields.
  121.  
  122. WebParser
  123. The WebParser behaves very similarly to the TrecParser.  It parses HTML documents in the NIST TREC format used for the Web Tracks.  The parser removes HTML tags.  Text within SCRIPT tags is removed, as is text in HTML comments.
  124.  
  125. Stemmer
  126. The Stemmer class provides an interface for stemmers.   All that is required of a subclass is that it implement the stemWord function.  The stemWord function should overwrite the current word.  Currently, the toolkit provides one subclass, PorterStemmer.
  127.  
  128. PorterStemmer
  129. PorterStemmer uses Porter’s official stemmer (in C) to stem words.  The PorterStemmer class does not stem words beginning with an uppercase letter.  This is to prevent stemming of acronyms or names.
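
The uppercase guard can be sketched as follows. Both function names are hypothetical, and `toyStem` is a deliberately trivial stand-in that strips one trailing "s"; it is not the Porter algorithm. The word is modified in place, in line with the requirement that stemWord overwrite the current word.

```cpp
#include <cctype>
#include <functional>
#include <string>

// Hypothetical wrapper: apply `stem` in place unless the word begins with
// an uppercase letter (assumed to mark an acronym or a name).
void stemUnlessCapitalized(std::string& word,
                           const std::function<void(std::string&)>& stem) {
    if (word.empty()) return;
    if (std::isupper(static_cast<unsigned char>(word[0]))) return;
    stem(word);
}

// Toy stand-in for a real stemmer: strips one trailing 's'.
// This is NOT the Porter algorithm.
void toyStem(std::string& word) {
    if (word.size() > 1 && word.back() == 's') word.pop_back();
}
```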
  130.  
  131. Stopper
  132. The Stopper class is a subclass of the WordSet class and the TextHandler class.  It replaces words in the stop-word list with a NULL pointer.
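
A minimal sketch of that word handling, under the assumption that words travel through the chain as `const char*`: `SketchStopper` is a hypothetical class that holds its stop-word list as a member instead of inheriting from WordSet and TextHandler.

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for the Stopper's word handling; not Lemur's class.
class SketchStopper {
public:
    explicit SketchStopper(std::set<std::string> stopwords)
        : stopwords_(std::move(stopwords)) {}

    // Return a null pointer for stop words; pass every other word through.
    const char* handleWord(const char* word) const {
        if (word == nullptr) return nullptr;  // already removed upstream
        return stopwords_.count(word) > 0 ? nullptr : word;
    }

private:
    std::set<std::string> stopwords_;
};
```

Because a removed word becomes a null pointer instead of being dropped from the stream, handlers placed after a stopper need to check for null before using the word.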
  133.  
  134. QueryTextHandler
  135. The QueryTextHandler checks whether a word in the query occurs more often in an Index in its uppercase form than in its original form.  If the uppercase form is more common than the original form, the uppercase word is added to the query.  This is to handle cases where acronyms are not capitalized in the query.
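
The comparison can be sketched in C++ as below. The `TermCounts` map is a hypothetical stand-in for an Index, and `countOf` and `expandQuery` are invented names. Keeping the original word and appending the uppercase form after it is an assumption about what "added to the query" means.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an Index: term -> number of occurrences.
using TermCounts = std::map<std::string, long>;

// Look up a term's count, treating unknown terms as zero.
long countOf(const TermCounts& index, const std::string& term) {
    auto it = index.find(term);
    return it == index.end() ? 0 : it->second;
}

// For each query word, append its uppercase form when that form is more
// frequent in the index than the word as written.
std::vector<std::string> expandQuery(const std::vector<std::string>& query,
                                     const TermCounts& index) {
    std::vector<std::string> out;
    for (const std::string& word : query) {
        out.push_back(word);
        std::string upper = word;
        for (char& c : upper)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        if (upper != word && countOf(index, upper) > countOf(index, word))
            out.push_back(upper);
    }
    return out;
}
```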
  136.  
  137. WriterTextHandler
  138. The WriterTextHandler class writes information from a TextHandler chain to a file.  This file is in a format compatible with BuildBasicIndex.
  139.  
  140. InvFPTextHandler
  141. The InvFPTextHandler takes information from a TextHandler chain
