StreamTokenizer

Bruce Eckel’s Thinking in Java Contents | Prev | Next

Consider
a program to count the occurrence of words in a text file:

//: SortedWordCount.java
// Counts words in a file, outputs
// results in sorted form.
import java.io.*;
import java.util.*;
import c08.*; // Contains StrSortVector
 
class Counter {
  private int i = 1;
  int read() { return i; }
  void increment() { i++; }
}
 
public class SortedWordCount {
  private FileInputStream file;
  private StreamTokenizer st;
  private Hashtable counts = new Hashtable();
  SortedWordCount(String filename)
    throws FileNotFoundException {
    try {
      file = new FileInputStream(filename);
      st = new StreamTokenizer(file);
      st.ordinaryChar('.');
      st.ordinaryChar('-');
    } catch(FileNotFoundException e) {
      System.out.println(
        "Could not open " + filename);
      throw e;
    }
  }
  void cleanup() {
    try {
      file.close();
    } catch(IOException e) {
      System.out.println(
        "file.close() unsuccessful");
    }
  }
  void countWords() {
    try {
      while(st.nextToken() !=
        StreamTokenizer.TT_EOF) {
        String s;
        switch(st.ttype) {
          case StreamTokenizer.TT_EOL:
            s = new String("EOL");
            break;
          case StreamTokenizer.TT_NUMBER:
            s = Double.toString(st.nval);
            break;
          case StreamTokenizer.TT_WORD:
            s = st.sval; // Already a String
            break;
          default: // single character in ttype
            s = String.valueOf((char)st.ttype);
        }
        if(counts.containsKey(s))
          ((Counter)counts.get(s)).increment();
        else
          counts.put(s, new Counter());
      }
    } catch(IOException e) {
      System.out.println(
        "st.nextToken() unsuccessful");
    }
  }
  Enumeration values() {
    return counts.elements();
  }
  Enumeration keys() { return counts.keys(); }
  Counter getCounter(String s) {
    return (Counter)counts.get(s);
  }
  Enumeration sortedKeys() {
    Enumeration e = counts.keys();
    StrSortVector sv = new StrSortVector();
    while(e.hasMoreElements())
      sv.addElement((String)e.nextElement());
    // This call forces a sort:
    return sv.elements();
  }
  public static void main(String[] args) {
    try {
      SortedWordCount wc =
        new SortedWordCount(args[0]);
      wc.countWords();
      Enumeration keys = wc.sortedKeys();
      while(keys.hasMoreElements()) {
        String key = (String)keys.nextElement();
        System.out.println(key + ": "
                 + wc.getCounter(key).read());
      }
      wc.cleanup();
    } catch(Exception e) {
      e.printStackTrace();
    }
  }
} ///:~ 

In
countWords( ),
the tokens are pulled one at a time from the stream, and the
ttype
information is used to determine what to do with each token, since a token can
be an end-of-line, a number, a string, or a single character.

In
main( )
you can see the use of a
SortedWordCount
to open and count the words in a file – it just takes two lines of code.
Then an enumeration to a sorted list of keys (words) is extracted, and this is
used to pull out each key and associated
Count.
Note that the call to
cleanup( )
is necessary to ensure that the file is closed.

StringTokenizer

Although
it isn’t part of the IO library, the
StringTokenizer
has sufficiently similar functionality to
StreamTokenizer
that it will be described here.

You
ask a
StringTokenizer
object for the next token in the string using the
nextToken( )
method, which either returns the token or an empty string to indicate that no
tokens remain.

As
an example, the following program performs a limited analysis of a sentence,
looking for key phrase sequences to indicate whether happiness or sadness is
implied.

//: AnalyzeSentence.java
// Look for particular sequences
// within sentences.
import java.util.*;
 
public class AnalyzeSentence {
  public static void main(String[] args) {
    analyze("I am happy about this");
    analyze("I am not happy about this");
    analyze("I am not! I am happy");
    analyze("I am sad about this");
    analyze("I am not sad about this");
    analyze("I am not! I am sad");
    analyze("Are you happy about this?");
    analyze("Are you sad about this?");
    analyze("It's you! I am happy");
    analyze("It's you! I am sad");
  }
  static StringTokenizer st;
  static void analyze(String s) {
    prt("nnew sentence >> " + s);
    boolean sad = false;
    st = new StringTokenizer(s);
    while (st.hasMoreTokens()) {
      String token = next();
      // Look until you find one of the
      // two starting tokens:
      if(!token.equals("I") &&
         !token.equals("Are"))
        continue; // Top of while loop
      if(token.equals("I")) {
        String tk2 = next();
        if(!tk2.equals("am")) // Must be after I
          break; // Out of while loop
        else {
          String tk3 = next();
          if(tk3.equals("sad")) {
            sad = true;
            break; // Out of while loop
          }
          if (tk3.equals("not")) {
            String tk4 = next();
            if(tk4.equals("sad"))
              break; // Leave sad false
            if(tk4.equals("happy")) {
              sad = true;
              break;
            }
          }
        }
      }
      if(token.equals("Are")) {
        String tk2 = next();
        if(!tk2.equals("you"))
          break; // Must be after Are
        String tk3 = next();
        if(tk3.equals("sad"))
          sad = true;
        break; // Out of while loop
      }
    }
    if(sad) prt("Sad detected");
  }
  static String next() {
    if(st.hasMoreTokens()) {
      String s = st.nextToken();
      prt(s);
      return s;
    }
    else
      return "";
  }
  static void prt(String s) {
    System.out.println(s);
  }
} ///:~ 

For
each string being analyzed, a
while
loop is entered and tokens are pulled off the string. Notice the first
if
statement, which says to
continue
(go back to the beginning of the loop and start again) if the token is neither
an “I” nor an “Are.”
This
means that it will get tokens until an “I” or an “Are”
is found. You might think to use the
==
instead of the equals( )
method, but that won’t work correctly, since
==
compares handle values while
equals( )
compares contents.

The
logic of the rest of the
analyze( )
method is that the pattern that’s being searched for is “I am
sad,” “I am not happy,” or “Are you sad?” Without
the
break
statement, the code for this would be even messier than it is. You should be
aware that a typical parser (this is a primitive example of one) normally has a
table of these tokens and a piece of code that moves through the states in the
table as new tokens are read.

More by Author

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Must Read