Tokeniser

sem.tokeniser
Class Tokeniser

java.lang.Object
  sem.tokeniser.Tokeniser

public class Tokeniser
extends java.lang.Object

A very simple tokeniser implementation.

Separates text into tokens and sentences using heuristic rules. It handles sentences enclosed in quotation marks poorly, so it is best to strip quotation marks from the input beforehand.
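
The following is a minimal sketch of the suggested pre-processing step, not part of `sem.tokeniser`. The class `QuoteStripSketch` and the method `stripQuotes` are hypothetical names, and the set of characters treated as quotes (straight and curly double quotes) is an assumption.

```java
public class QuoteStripSketch {
    // Hypothetical helper (not part of sem.tokeniser): deletes straight and
    // curly double quotes. Single quotes are left alone because they are
    // indistinguishable from apostrophes without more context.
    static String stripQuotes(String text) {
        return text.replaceAll("[\"\u201C\u201D]", "");
    }

    public static void main(String[] args) {
        // Prints: Stop. She left.
        System.out.println(stripQuotes("\"Stop.\" She left."));
    }
}
```

The cleaned string could then be passed to `Tokeniser.tokeniseAndSplit`.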

Constructor Summary
`Tokeniser()`

Method Summary
`static void`	`main(java.lang.String[] args)`
`static java.util.ArrayList<java.lang.String>`	`sentenceSplit(java.lang.String tokenisedText)` Take tokenised text and split it into separate sentences.
`static java.lang.String`	`tokenise(java.lang.String text)` Split the text into tokens and sentences.
`static java.util.ArrayList<java.lang.String>`	`tokeniseAndSplit(java.lang.String text)` First tokenise, then sentence-split the text.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

Tokeniser

public Tokeniser()

Method Detail

tokenise

public static java.lang.String tokenise(java.lang.String text)

Split the text into tokens and sentences. Tokens are split on non-alphanumeric characters, except '-'.

Parameters: text - Text to be tokenised.
Returns: Tokenised text, with tokens separated by whitespace.
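
The splitting rule described above can be approximated as follows. This is an illustrative sketch under stated assumptions, not the actual `Tokeniser` source: the class `TokeniseSketch` is hypothetical, "alphanumeric" is taken to mean ASCII letters and digits, and punctuation is assumed to be kept as separate tokens rather than discarded.

```java
public class TokeniseSketch {
    // Assumption: every character that is not an ASCII letter, digit,
    // whitespace, or '-' becomes its own token; hyphenated words stay whole.
    static String tokenise(String text) {
        // Surround each such character with spaces.
        String padded = text.replaceAll("([^\\p{Alnum}\\s-])", " $1 ");
        // Collapse runs of whitespace so tokens are separated by one space.
        return padded.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Prints: A well-known fact , isn ' t it ?
        System.out.println(tokenise("A well-known fact, isn't it?"));
    }
}
```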

sentenceSplit

public static java.util.ArrayList<java.lang.String> sentenceSplit(java.lang.String tokenisedText)

Take tokenised text and split it into separate sentences.

Parameters: tokenisedText - Tokenised text.
Returns: ArrayList of sentences.
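
The documentation does not state which heuristics decide where a sentence ends. As a rough sketch only, the hypothetical class `SentenceSplitSketch` below assumes that a sentence ends at a standalone `.`, `!`, or `?` token; the real method may use different or additional rules.

```java
import java.util.ArrayList;

public class SentenceSplitSketch {
    // Assumption: the input is whitespace-separated tokens, and a sentence
    // ends at a token that is exactly ".", "!" or "?".
    static ArrayList<String> sentenceSplit(String tokenisedText) {
        ArrayList<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : tokenisedText.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep trailing tokens that are not closed by end punctuation.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Prints: [It works ., Does it ?, Yes]
        System.out.println(sentenceSplit("It works . Does it ? Yes"));
    }
}
```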

tokeniseAndSplit

public static java.util.ArrayList<java.lang.String> tokeniseAndSplit(java.lang.String text)

First tokenise, then sentence-split the text.

Parameters: text - Input text.
Returns: Tokenised and split sentences.
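
Going by the description, this method appears to be the composition of the two methods above. The sketch below shows that composition with deliberately simplified stand-ins; the class `PipelineSketch` and both stand-in bodies are hypothetical and handle only `.`, `!`, and `?`.

```java
import java.util.ArrayList;
import java.util.Arrays;

public class PipelineSketch {
    // Simplified stand-in for tokenise: separates only '.', '!' and '?'.
    static String tokenise(String text) {
        return text.replaceAll("([.!?])", " $1 ").trim().replaceAll("\\s+", " ");
    }

    // Simplified stand-in for sentenceSplit: breaks at the space that
    // follows a standalone end-of-sentence token.
    static ArrayList<String> sentenceSplit(String tokenisedText) {
        return new ArrayList<>(Arrays.asList(tokenisedText.split("(?<= [.!?]) ")));
    }

    // The composition itself: tokenise first, then split the result.
    static ArrayList<String> tokeniseAndSplit(String text) {
        return sentenceSplit(tokenise(text));
    }

    public static void main(String[] args) {
        // Prints: [Hi there ., Bye .]
        System.out.println(tokeniseAndSplit("Hi there. Bye."));
    }
}
```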

main

public static void main(java.lang.String[] args)
