I'm interested in generating example sentences myself, but in a way that chooses sentences that are simple, easy to understand, and that support the word they are supposed to exemplify.
For example "She got a car for her birthday, while she was traveling in Italy eating pizza" does not tell the reader anything about what a car is, or how the word should be used. However "He drives his car to work", is a much better example of what a car is, what is a common associated verb and how it fits in a sentence.
How do you optimise selection for sentences like the latter?
There's the linguistic principle that "you shall know a word by the company it keeps": for any particular word, you can identify which other words are most specifically related to it. The simplest measure for that is freq(both words together) / freq(that other word in general).
That would let you prioritize sentences containing "driving a car" over "getting a car": even if "getting a car" is more frequent overall, "driving" is more specifically tied to "car" according to such a measure.
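A minimal sketch of that measure, assuming the corpus is already tokenized into lowercase word lists (the function name and toy corpus are mine, just for illustration):

    from collections import Counter

    def specificity_scores(sentences, target):
        # Score each word w by freq(w together with target) / freq(w overall).
        # `sentences` is assumed to be a list of lists of lowercase tokens.
        word_freq = Counter()
        cooc_freq = Counter()
        for tokens in sentences:
            word_freq.update(tokens)
            if target in tokens:
                cooc_freq.update(w for w in tokens if w != target)
        return {w: cooc_freq[w] / word_freq[w] for w in cooc_freq}

    corpus = [
        "he drives his car to work".split(),
        "she got a car for her birthday".split(),
        "he got a letter yesterday".split(),
    ]
    # "drives" only ever appears near "car" (score 1.0), while "got" is
    # split between the "car" and "letter" sentences (score 0.5).
    print(specificity_scores(corpus, "car"))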
Hmm, maybe I've been overcomplicating the problem in my mind. You've given me some good ideas.
Bigrams, as your own example shows, are too simple: in both examples, "car" gets related to "a" rather than to "getting" or "driving".
Maybe I could parse all sentences with a dependency parser, build dependency bigrams, and score sentences by frequency/inverse frequency plus sentence length (shorter sentences are better).
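Something like this could work as a rough sketch, e.g. with spaCy as the dependency parser (assuming `en_core_web_sm` is installed; any parser exposing head/child links would do):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def dependency_bigrams(sentence):
        # (head lemma, child lemma) pairs instead of surface-order bigrams,
        # so "car" links to "drive" rather than to the article "a".
        doc = nlp(sentence)
        return [(tok.head.lemma_, tok.lemma_) for tok in doc if tok.head is not tok]

    print(dependency_bigrams("He drives his car to work"))
    # includes ('drive', 'car') -- the pair the specificity measure should see

The freq/inverse-freq ratio from upthread could then be computed over these pairs instead of adjacent words, with a length penalty on top.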
That's a great question. Optimizing sentence selection is important for teaching. For now, I have a simple check that filters out sentences longer than 160 characters.
Also, I believe this is one thing humans can do better, so I plan to add upvote & downvote buttons to rate the quality of sentences.
I wonder if you might get a bit of a head start if you combine the shorter-sentence idea with selection based on higher n-gram counts. For instance, if the keyword plus the words on either side match a common n-gram, you could expect that the sentence is reasonably representative and boost it in the initial rankings, compared to a sentence whose n-gram has a much lower count.
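As a sketch of that boost (the `ngram_counts` table here is hypothetical; it would come from whatever corpus you already have):

    def rank_score(tokens, keyword, ngram_counts):
        # Boost sentences whose (previous word, keyword, next word) trigram is
        # a common n-gram; shorter sentences win among similar boosts.
        i = tokens.index(keyword)  # assumes the keyword occurs in the sentence
        window = tuple(tokens[max(0, i - 1):i + 2])
        return ngram_counts.get(window, 0) / (1 + len(tokens))

    ngram_counts = {("his", "car", "to"): 42, ("a", "car", "for"): 3}
    short = "he drives his car to work".split()
    long_ = "she got a car for her birthday while traveling in italy".split()
    print(rank_score(short, "car", ngram_counts))  # higher: common trigram, short
    print(rank_score(long_, "car", ngram_counts))  # lower: rare trigram, long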
You could check whether some verbs are used predominantly with the word you're trying to generate sentences for. For example, I would expect "drive" appearing in a sentence to carry a higher-than-average probability of "car" also appearing in the sentence. Or "wind" for "watch", or "sit" for "chair" and "couch".
Then, I think sentences containing "car" that also contain the verb "drive" would probably give better clues to the meaning of "car" than sentences with less specific verbs like "bought".
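A sketch of that check, again leaning on spaCy for POS tags and lemmas (the function name and toy corpus are made up for illustration):

    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")

    def verb_affinity(sentences, noun):
        # For each verb lemma: P(noun appears in sentence | verb appears).
        # High values flag verbs like "drive" for "car" or "wind" for "watch".
        verb_total, verb_with_noun = Counter(), Counter()
        for doc in nlp.pipe(sentences):
            verbs = {t.lemma_ for t in doc if t.pos_ == "VERB"}
            verb_total.update(verbs)
            if any(t.lemma_ == noun for t in doc):
                verb_with_noun.update(verbs)
        return {v: verb_with_noun[v] / verb_total[v] for v in verb_total}

    corpus = ["He drives his car to work.",
              "She bought a car.",
              "He bought a house."]
    print(verb_affinity(corpus, "car"))  # drive -> 1.0, buy -> 0.5 here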
That's the geek inside you talking. But the hard part was the word stats. Now that you have your 1K words, writing a thousand sentences by hand to illustrate them is not really hard. It's one day of manual work, less than the work needed to figure out the automation, and with far better results.