sem.tokeniser
Class Tokeniser

java.lang.Object
  extended by sem.tokeniser.Tokeniser

public class Tokeniser
extends java.lang.Object

A very simple tokeniser implementation.

Separates text into tokens and sentences using heuristic rules. It handles sentences enclosed in quotation marks poorly, so it is best to remove quotation marks from the input beforehand.
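
One way to remove quotation marks before tokenising is sketched below. The class QuoteStripSketch and the method stripDoubleQuotes are hypothetical helpers and not part of sem.tokeniser; the choice to drop only double quotes (straight and curly) and to keep apostrophes is an assumption made so that contractions are not damaged.

```java
// Hypothetical pre-processing helper; not part of sem.tokeniser.Tokeniser.
class QuoteStripSketch {

    // Removes straight (") and curly double quotes. Apostrophes are kept
    // on purpose so that contractions such as "don't" stay intact.
    static String stripDoubleQuotes(String text) {
        return text.replaceAll("[\"\u201C\u201D]", "");
    }

    public static void main(String[] args) {
        String raw = "\"It works.\" \u201CSo does this.\u201D";
        // prints: It works. So does this.
        System.out.println(stripDoubleQuotes(raw));
    }
}
```

The cleaned string could then be passed to Tokeniser.tokeniseAndSplit(String).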


Constructor Summary
Tokeniser()
           
 
Method Summary
static void main(java.lang.String[] args)
           
static java.util.ArrayList<java.lang.String> sentenceSplit(java.lang.String tokenisedText)
          Take tokenised text and split it into separate sentences.
static java.lang.String tokenise(java.lang.String text)
          Split the text into tokens and sentences.
static java.util.ArrayList<java.lang.String> tokeniseAndSplit(java.lang.String text)
          First tokenise, then sentence-split the text.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Tokeniser

public Tokeniser()

Method Detail

tokenise

public static java.lang.String tokenise(java.lang.String text)
Split the text into tokens and sentences. Tokens are split on non-alphanumeric characters, except '-'.

Parameters:
text - Text to be tokenised.
Returns:
Tokenised text, with tokens separated by whitespace.
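
The splitting rule described above can be illustrated with a minimal sketch. TokeniseSketch is a hypothetical stand-in written for this example and is not the source of Tokeniser.tokenise; it assumes that each non-alphanumeric character other than '-' becomes a token of its own and that tokens are joined by single spaces. The real method may differ, for example in how it marks sentence boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the documented rule; the actual
// Tokeniser.tokenise implementation may behave differently.
class TokeniseSketch {

    static String tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '-') {
                // Letters, digits and hyphens extend the current token.
                current.append(c);
            } else {
                // Any other character closes the current token.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                // Assumption: punctuation is kept as a separate token,
                // whitespace is discarded.
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c));
                }
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // prints: A well-known rule , isn ' t it ?
        System.out.println(tokenise("A well-known rule, isn't it?"));
    }
}
```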

sentenceSplit

public static java.util.ArrayList<java.lang.String> sentenceSplit(java.lang.String tokenisedText)
Take tokenised text and split it into separate sentences.

Parameters:
tokenisedText - Tokenised text.
Returns:
ArrayList of sentences.
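
The heuristics used for sentence splitting are not documented here. As a rough sketch only, the hypothetical class SentenceSplitSketch below assumes that a sentence ends at a token equal to ".", "!" or "?" and that the input tokens are separated by whitespace; the actual rules in Tokeniser.sentenceSplit are likely to be more elaborate.

```java
import java.util.ArrayList;

// Hypothetical stand-in; not the implementation of Tokeniser.sentenceSplit.
class SentenceSplitSketch {

    static ArrayList<String> sentenceSplit(String tokenisedText) {
        ArrayList<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : tokenisedText.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for empty input
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            // Assumption: these three tokens always close a sentence.
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep trailing tokens that lack a closing punctuation mark.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // prints: [Hello world ., How are you ?]
        System.out.println(sentenceSplit("Hello world . How are you ?"));
    }
}
```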

tokeniseAndSplit

public static java.util.ArrayList<java.lang.String> tokeniseAndSplit(java.lang.String text)
First tokenise, then sentence-split the text.

Parameters:
text - Input text.
Returns:
ArrayList of tokenised sentences.
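
The two-step pipeline can be sketched end to end as follows. PipelineSketch and its regular expressions are assumptions made for illustration (ASCII alphanumerics only, sentences ending at '.', '!' or '?'); they are not taken from the Tokeniser source. The point is only the order of operations: tokenise first, then split the tokenised string into sentences.

```java
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical regex-based stand-ins showing the composition only.
class PipelineSketch {

    // Pads every character that is not ASCII-alphanumeric, whitespace or '-'
    // with spaces, then collapses runs of whitespace.
    static String tokenise(String text) {
        return text.replaceAll("([^\\p{Alnum}\\s-])", " $1 ")
                   .trim()
                   .replaceAll("\\s+", " ");
    }

    // Splits at whitespace that follows '.', '!' or '?'.
    static ArrayList<String> sentenceSplit(String tokenisedText) {
        if (tokenisedText.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(tokenisedText.split("(?<=[.!?])\\s+")));
    }

    // First tokenise, then sentence-split.
    static ArrayList<String> tokeniseAndSplit(String text) {
        return sentenceSplit(tokenise(text));
    }

    public static void main(String[] args) {
        // prints: [Hello , world ., Bye !]
        System.out.println(tokeniseAndSplit("Hello, world. Bye!"));
    }
}
```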

main

public static void main(java.lang.String[] args)