ASpell-based Language Detection – basic, reasonably quick language identification

I recently published an old prototype language detection library on GitHub. The library uses open-source ASpell dictionary files to identify the language of the text passed to it.

The code is available at: https://github.com/DavidCostello42/LanguageDetection-CSharp/

The code was originally written for .NET 2.0. The demo console application that ships with the library uses .NET 2.0 threading methods, which have since been superseded by better functionality in newer frameworks. The library itself is not threaded, but it is thread safe.

The language detection is relatively simple: you pass in a string and an ASpell *.dic file, and the library searches the dictionary file to see how many words in the input string it contains. You repeat this for each dictionary you wish to check. On average a single check takes about 0.2-0.6 seconds, depending on the length of the input string and the size of the dictionary file. The GitHub project comes with a threaded demo console application that loops through a number of dictionaries and identifies the language in about 3-10 seconds, depending on the number of threads, the text length and the number of dictionaries.
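To make the idea concrete, here is a minimal sketch of the counting approach in Python. It is not the library's C# code or API: the function names are made up for illustration, and it assumes each dictionary is a plain word list with one entry per line, optionally followed by "/FLAGS" and possibly preceded by a word-count line.

```python
import re
from pathlib import Path


def parse_dictionary_lines(lines):
    """Build a lowercase word set from dictionary lines.

    Assumption: one entry per line, optionally followed by "/FLAGS".
    """
    words = set()
    for line in lines:
        entry = line.strip().split("/", 1)[0].lower()
        # Skip blank lines and purely numeric lines (e.g. a word-count header).
        if entry and not entry.isdigit():
            words.add(entry)
    return words


def load_dictionary(path, encoding="utf-8"):
    """Read a word-list file from disk into a set (hypothetical helper)."""
    with Path(path).open(encoding=encoding, errors="replace") as handle:
        return parse_dictionary_lines(handle)


def tokenize(text):
    """Split text into lowercase runs of letters, dropping digits and punctuation."""
    return re.findall(r"[^\W\d_]+", text.lower())


def count_matches(text, words):
    """Count how many words of the input text appear in the dictionary."""
    return sum(1 for token in tokenize(text) if token in words)


def detect_language(text, dictionaries):
    """Score the text against every dictionary and return the best match.

    `dictionaries` maps a language name to its word set and must not be empty.
    """
    scores = {name: count_matches(text, words) for name, words in dictionaries.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Calling `detect_language(text, {"en": load_dictionary("en.dic"), "fr": load_dictionary("fr.dic")})` returns the name with the highest match count together with all the scores. The file names here are placeholders, and ties simply go to whichever dictionary was listed first.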

If you plan on implementing the library, consider a few factors:

  • The dictionaries are stored on disk, so identification is faster when they are read from an SSD.
  • If you only expect to be detecting a select group of languages, you only need the *.dic files for the languages you wish to identify.
  • The reliability of detection increases with the length of the input text.
  • Although the library itself is a .NET 2.0 project, I would actively encourage any implementation to be done in the most recent framework version, as the threading methods in .NET 3 and up can be used to decrease processing time.
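
Two of the points above, disk access and threading, can be illustrated with a short sketch. Again this is Python rather than the library's C#, and every name in it is hypothetical. It assumes one-word-per-line dictionary files, caches each file in memory after the first read, and scores the dictionaries on a thread pool. In Python the threads mainly overlap the file reads, so treat it as a picture of the structure rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def cached_words(path):
    """Read a word list once and keep it in memory for later calls.

    Assumption: one entry per line, optionally followed by "/FLAGS".
    """
    lines = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    return frozenset(
        line.split("/", 1)[0].strip().lower() for line in lines if line.strip()
    )


def score(path, tokens):
    """Count how many tokens appear in the dictionary stored at `path`."""
    words = cached_words(path)
    return sum(1 for token in tokens if token in words)


def detect_parallel(text, dictionary_paths, max_workers=4):
    """Score the text against each dictionary on a thread pool.

    `dictionary_paths` maps a language name to a file path and must not be empty.
    """
    tokens = text.lower().split()  # naive whitespace tokenisation, for brevity
    names = list(dictionary_paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda name: score(dictionary_paths[name], tokens), names)
        scores = dict(zip(names, results))
    return max(scores, key=scores.get), scores
```

Because the word sets are cached, a second call with the same dictionaries never touches the disk, which matters more than the storage medium once the process is warm.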
