Code Scraps: Word list tidying

I have recently been messing around with dictionary lists for a little personal project. My goal has to been to make an anagram/word finder, where you provide up to 9 letters and the program returns potential words that can be made from the letters you provide. This is a code scrap that takes in the dictionary and outputs a ‘clean’ word list.

NB: if you play with Dictionary Aspel/Hunspell .dic files (often found in most open source programs – I took this one from OpenOffice), the markup in the file also contains some meta data. I removed the meta data first using a few macros in notepad before writing this simple algorithm to remove any words that were capitalised, contained punctuation or were longer than 9 letters…

Code: C#.NET

String source = “C:\\TEMP\\en_GB.dic”;
String target = “C:\\TEMP\\wordslist.txt”;
String[] lines = File.ReadAllLines(source);
StringBuilder result = new StringBuilder();
foreach (String line in lines) {
char c = line[0];
if (!c.ToString().ToUpper().Equals(c.ToString())) {
if (line.Length < 10) {
if (!line.Contains(“-“) && !line.Contains(“‘”)) {
result.Append(line + “\r\n”);
}
}
}
}
File.WriteAllText(target, result.ToString(), Encoding.UTF8);

I then took my cleaned text file one step further to eliminate all accented characters (such as words with ‘é’). This extra pass also helped as I hadn’t noticed some of the words in my dictionary  had abbrevations using full stops/periods. (eg: “abbrv.”).

String validchars = “abcdefghijklmnopqrstuvwxyz”;
String source = “C:\\TEMP\\wordslist.txt”;
String target = “C:\\TEMP\\wordslist2.txt”;
String[] lines = File.ReadAllLines(source);
StringBuilder result = new StringBuilder();
foreach (String line in lines) {
bool skip = false;
for (int i = 0; i < line.Length; i++) {
if (!validchars.Contains(line[i].ToString().ToLower())) { skip = true; break; }
}
if (!skip) {
result.Append(line + “\r\n”);
}
}
File.WriteAllText(target, result.ToString(), Encoding.UTF8);

So what can you do with a clean word list like this? Well, I was using this list to build a little application to return the words made up from a set of up to 9 letters.  I was applying this to a program with no access to a database, so my solution to this was to produce two words lists – one cleaned (as produced by these code scraps), then a second list which was a copy of the first, but with all letters in the words arranged alphabetically (I call it my ‘alpha’ list). To find matches for a given set of letters, you then need to recursively search the alpha word list for matches, and from those index positions of matches you can get the English word from the clean list.

For example:

1aardvarkaaadkrrv
2abackaabck
3abacusaabcsu
4abaftaabft
5abaloneaabelno
6abandonaabdnno
7abandonedaabddenno
8abandoneraabdennor
9abaseaabes
10abasementaabeemnst

If you were given the letters: [“a,”d”,”b”,”n”,”e”,”d”,”n”,”a”,”o”], you first sort them alphabetically to form: [“a”,”a”,”b”,”d”,”d”,”e”,”n”,”n”,”o”], you can then search the ‘alpha’ list for matches and see there is a direct match on row 7. If you check the ‘clean’ word list you see this equates to “abandoned”.
You can then make use of some recursion, first trying every letter with ‘a’ at the start, followed by each subsequent letter in the string, then moving onto the second letter (in this case ‘a’ again), and trying the word with each subsequent letter. You continue the process to find all possible words.

Each recursion moves your starting letter one letter right in the alphabetical string. This is because you are searching the alpha list – so when you recursively search for ‘b’ words, you wouldn’t look for ‘ba’ or ‘baa’ as the alpha list doesn’t contain any ‘baa’ words, as it’s already pre-sorted to read ‘aab’ in that case, and would have been found on earlier recursions.

Recursion 1:

aabddenno
abddenno
addenno
adenno
aenno
anno
ano
ao

Recursion 2: (Unfortunately in this scenario this has wasted cycles as ‘a’ was already done in first pass)

abddenno
addenno
adenno
aenno
anno
ano
ao

Recursion 3:

bddenno
bdenno
benno
bnno
bno
bo

Recursion 4:

ddenno
denno
dnno
dno
do

… actually typing out for this blog post has just made me realise that this doesn’t handle duplicate letter words well. The word ‘node’ would not be found in this process, as the combination of letters is never locked for. I could fix this by adding on a filter so that if there are duplicate letters in a word then there will be an extra recursion of just unique letters. That would find words that do not contain duplicates.

Leave a Reply

Your email address will not be published. Required fields are marked *