This looks like a cool project. I like fully data-driven systems. Unfortunately, it will be very hard to design such a system without the source dataset you mention. Give me a heads-up when you have it, though, and I'll definitely give this a bash.