WeightedLetter Name Generator
This name generator takes a seed of similar sounding names (or any words), and generates a new set of names based on the likelihood that one letter follows another in the original set. It picks the first and last letter, generates the letters in between, best-matches the second-to-last letter, and runs post-generation triple-letter and Damerau–Levenshtein checks to ensure uniformity with the seed.
Originally named the DamerauLevenshteinNamegen, in honor of its last filter step, this generator was first written in PHP, ported to AS3 for use in Dance of Death, and now to Objective C for Crossword Dungeon. The source code is available on GitHub as well as Google Code.
Let’s break it down, step-by-step:
Step 1: Parse the Letter Distribution
This is the largest part and core of this generator: we step through each name in the seed, then through each letter in that name, and keep track of how likely that letter is to follow its prior letter. We also keep track of the sizes of the names, to ensure the generated names are consistently sized. We only need to do this parsing once (or any time the seed changes).
After we’ve got the letter distribution recorded, it’s time to generate some names. We begin by picking a size and a starting letter, then pick following letters based on how likely they are to follow the previous letter in the seed names, until we’ve reached the length of the selected size (let’s call this process the name body generation loop). This is what we end up with (click Generate to generate a new batch of names, or click on the list of seed names to edit them):
Step 2: The Beginning and the End
The generated names are acceptable, but a bit weak, largely because there is no post-processing done to them, but mostly because the first and last letter do not follow the seed distribution. Even though the “Cambridge study” was bogus, there is still an intrinsic importance to the way a word starts and ends. To improve the algorithm, let’s keep track of the likelihood that a letter appears first and the likelihood that it appears last.
Picking the first letter is straightforward: we track it in a weighted array and pick from that; picking the last one is not so trivial. We can pick the last letter before the name body generation loop then best-fit the penultimate letter at the end of the loop (this is the solution I went with), or we can pick the last letter within the name body generation loop by prioritizing letters with higher “lastLetter” weight when picking the last letter (you are welcome to experiment with this approach!).
Thus we achieve:
Step 3: Postprocessing
This is an entirely optional step, since the generator at this point is pretty solid. For some additional refinement, however, we can look to two post-process filters: triple-letter checks, and maximum Damerau–Levenshtein distance from seed names.
The triple letter check is straightforward: step through the word, and ensure no three characters in a row are the same. This is a surprisingly likely occurrence if you have names in your seed with same-letter pairs, since a letter is allowed to follow itself and can do so recursively. In this implementation, this check is being done as a postprocess step, but it can also be integrated into the name body generation loop.
The Damerau–Levenshtein check ensures that the generated word is close enough to at least one of the names in the seed by measuring the Damerau–Levenshtein distance of the generated name to all of the names in the seed until it finds one whose distance is below the threshold (half of the name length works well). What this means is that we want to keep the general structure of half of the generated name, with some room for letter additions, subtractions, replacements, and swaps. This check ends up filtering out names that are very off-the-wall, which you might very well want to keep; it can also be somewhat costly if you have a large number of seed names. I’ve skipped it in the Objective C version without any regrets!
Here is the final algorithm at work:
Grab the source code from GitHub and Google Code. Feel free to tweak any of the values and modify the algorithm; attribution’s always appreciated, but last time I released code under GPLv3 it was met with some discontent, so have fun with it and let me know in what ways you improve the system!
I noticed that a lot of names end with gr even though none in the seed does. Whats up with that?
Name terminations that do not match the seed names are caused by a failed attempt to fit a penultimate letter, after which the current implementation of this algorithm gives up and ignores the placement of the last two letters.
This can be ameliorated in a few ways:
1) A more robust (in other words, longer) set of seed names should provide better candidates for the penultimate letter fitting.
2) Allowing multiple attempts at the best fit for the penultimate letter.
3) Having the last letter in the name body generation loop determine what the last letter of the name will be, instead of attempting a best-fit between the last and antepenultimate letters.
I have gotten good results with #1, mostly through feeding some of the generated names I like back into the seed. I have not experimented with the other two options, but if you fork/port the code and end up making such modifications, let me know!
Any chance of seeing the php version as well?
The AS3 version has progressed a lot farther than the initial PHP incarnation, so I’ll have to leave it as an exercise to the reader to port the logic from AS3 to PHP. 🙂
It should be pretty straightforward to understand what’s going on in AS3 even if you’re not familiar with the language, but let me know if there are any tricky bits that could use clarification, and I would be happy to point you in the right direction.