N-Gram sentence splitting

2019/06/29

2 minute read

The Idea

one of the biggest annoyences in adding items to reminder app using voice is that the agent can’t figure out if there are multiple items in that utterance. Here is an example:

“add bananas oranges and lemons to my list”

on the current (12) version of iOS this will result in a single item with the content “bananas oranges and lemons” to be added. This is pretty annoying. If we assume we have some magical NLP that figures out the content it should not be that hard to figure out there are multiple items to be added to the list in this utterance. Here is very simple way of achieving this:

  1. assume we have a dicitionary of all the words/items without stop words.
  2. create all N-grams of the input string, starting with the number of words in the sentence for the N in the first N-gram
  3. search the dictionary for the words in the N-gram list, if found add it to the result list and remove from input string.
  4. decrease the number N and repeat the search until N = 0

For the example above let’s say our dictionary is:

bananas
banana
oranges
orange
lemons
lemon

Our first n-gram will be

["bananas oranges and lemons"]

This is not in the dictionary so nothing to do.

["bananas oranges and" , "oranges and lemons"]

Nothing from this list in the dictionary so continue.

["bananas oranges", "oranges and", "and lemons"]

["bananas", "oranges", "and", "lemons"]

3 of the items in the list our in our dictionary so we can add them as items in our result list.

How is this different from just splitting based on a space character? Well there are some words like “orange juice” that have to be a single item in the result list. If we did have this in the input sentence then we would have detected it in the 2-gram phase and added it to the result list.

Some implementation optimization thoughts

First thing is that when creating the N-grams we are using a sliding window of decreasing lengths. So instead of creating the N-grams for each index from scratch each time, adding the next word to the window and removing the first will yield a linear time algorithm for constructing the list.

Second thing is that there is no need to keep the whole dictionary in memory. Using a bloom filter will be enough because we can make our decision based on failure to be contained in the set.