I'm interested in generating example sentences myself, but in a way that chooses sentences that are simple, easy to understand, and that support the word they are supposed to exemplify.
For example "She got a car for her birthday, while she was traveling in Italy eating pizza" does not tell the reader anything about what a car is, or how the word should be used. However "He drives his car to work", is a much better example of what a car is, what is a common associated verb and how it fits in a sentence.
How do you optimise selection for sentences like the latter?
There's the linguistic principle that "you shall know a word by the company it keeps": for any particular word, you can identify which other words are most specifically related to it. The simplest measure for that is freq(both words together) / freq(that other word in general).
That would let you prioritize sentences containing "driving a car" over "getting a car": even if "getting a car" is more frequent overall, "driving" is more specifically tied to "car" according to such a measure.
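A minimal sketch of that measure, assuming the corpus is already tokenized into lowercase word lists (the function name and toy corpus are mine, just for illustration):

    from collections import Counter

    def specificity_scores(sentences, target):
        # Score each word w by freq(w together with target) / freq(w overall).
        # `sentences` is assumed to be a list of lists of lowercase tokens.
        word_freq = Counter()
        cooc_freq = Counter()
        for tokens in sentences:
            word_freq.update(tokens)
            if target in tokens:
                cooc_freq.update(w for w in tokens if w != target)
        return {w: cooc_freq[w] / word_freq[w] for w in cooc_freq}

    corpus = [
        "he drives his car to work".split(),
        "she got a car for her birthday".split(),
        "he got a letter yesterday".split(),
    ]
    # "drives" only ever appears near "car" (score 1.0), while "got" is
    # split between the "car" and "letter" sentences (score 0.5).
    print(specificity_scores(corpus, "car"))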
Hmm, maybe I've been overcomplicating the problem in my mind. You've given me some good ideas.
Bigrams, as your own example shows, are too simple: in both examples, "car" gets related to "a" rather than to "getting" or "driving".
Maybe I could parse all sentences with a dependency parser, build dependency bigrams, and score sentences by frequency/inverse frequency plus sentence length (shorter sentences are better).
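Something like this could work as a rough sketch, e.g. with spaCy as the dependency parser (assuming `en_core_web_sm` is installed; any parser exposing head/child links would do):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def dependency_bigrams(sentence):
        # (head lemma, child lemma) pairs instead of surface-order bigrams,
        # so "car" links to "drive" rather than to the article "a".
        doc = nlp(sentence)
        return [(tok.head.lemma_, tok.lemma_) for tok in doc if tok.head is not tok]

    print(dependency_bigrams("He drives his car to work"))
    # includes ('drive', 'car') -- the pair the specificity measure should see

The freq/inverse-freq ratio from upthread could then be computed over these pairs instead of adjacent words, with a length penalty on top.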
That's a great question. Optimizing sentence selection is important for teaching. For now, I have a simple check that filters out sentences longer than 160 characters.
Also, I believe this is one thing humans can do better, so I plan to add upvote & downvote buttons to rate the quality of sentences.
I wonder if you might get a bit of a head start if you combine the shorter-sentence idea with selection based on higher n-gram counts. For instance, if the keyword plus the words on either side match a common n-gram, you could expect that the sentence is reasonably representative and boost it in the initial rankings, compared to a sentence whose n-gram has a much lower count.
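As a sketch of that boost (the `ngram_counts` table here is hypothetical; it would come from whatever corpus you already have):

    def rank_score(tokens, keyword, ngram_counts):
        # Boost sentences whose (previous word, keyword, next word) trigram is
        # a common n-gram; shorter sentences win among similar boosts.
        i = tokens.index(keyword)  # assumes the keyword occurs in the sentence
        window = tuple(tokens[max(0, i - 1):i + 2])
        return ngram_counts.get(window, 0) / (1 + len(tokens))

    ngram_counts = {("his", "car", "to"): 42, ("a", "car", "for"): 3}
    short = "he drives his car to work".split()
    long_ = "she got a car for her birthday while traveling in italy".split()
    print(rank_score(short, "car", ngram_counts))  # higher: common trigram, short
    print(rank_score(long_, "car", ngram_counts))  # lower: rare trigram, long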
You could check whether some verbs are used predominantly with the word you're trying to generate sentences for. For example, I would expect "drive" appearing in a sentence to carry a higher-than-average probability of "car" also appearing in the sentence. Or "wind" for "watch", or "sit" for "chair" and "couch".
Then, I think sentences containing "car" that also contain the verb "drive" would probably give better clues to the meaning of "car" than sentences with less specific verbs like "bought".
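A sketch of that check, again leaning on spaCy for POS tags and lemmas (the function name and toy corpus are made up for illustration):

    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")

    def verb_affinity(sentences, noun):
        # For each verb lemma: P(noun appears in sentence | verb appears).
        # High values flag verbs like "drive" for "car" or "wind" for "watch".
        verb_total, verb_with_noun = Counter(), Counter()
        for doc in nlp.pipe(sentences):
            verbs = {t.lemma_ for t in doc if t.pos_ == "VERB"}
            verb_total.update(verbs)
            if any(t.lemma_ == noun for t in doc):
                verb_with_noun.update(verbs)
        return {v: verb_with_noun[v] / verb_total[v] for v in verb_total}

    corpus = ["He drives his car to work.",
              "She bought a car.",
              "He bought a house."]
    print(verb_affinity(corpus, "car"))  # drive -> 1.0, buy -> 0.5 here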
That's the geek inside you talking. But the hard part was the word stats. Now that you have your 1K words, writing a thousand sentences by hand to illustrate them is not really hard. It's one day of manual work, less than the work needed to figure out the automation, and with far better results.