[Solved]: Automatically generate meaningful queries for a data table

Problem Detail: My field of research is neither databases nor AI, but I have a problem to solve and would like to know which field it belongs to and what results exist. The main question is: given a table of data, many queries are possible, but only some of them are meaningful. Are there algorithms that automatically generate a ranked set of meaningful queries from the relationships inside the table? Here is an example:

ArticleID   Price   Quantity    Sale
A01         10      3           30
A01         10      3           30
A02         20      4           80
A02         20      5           100
A03         15      3           45
A03         15      4           60
A03         15      5           75
A04         20      2           40
A04         20      3           60
A04         20      4           80

There are two relationships in this table, which may be given or inferred (detecting the relationships is not the issue here): a) each ArticleID maps to one Price; b) Sale = Price * Quantity. The first issue is then how to automatically generate queries. For instance:

1) Sum of Quantity by ArticleID
2) Sum of Sale by ArticleID
3) Sum of Price by ArticleID
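
To make the three candidate queries concrete, here is a minimal sketch that runs them against the sample rows with Python's built-in sqlite3 module. The table name `sales` and the helper `sum_by_article` are assumptions introduced only for illustration; the question does not name the table.

```python
import sqlite3

# Sample rows from the question: (ArticleID, Price, Quantity, Sale).
ROWS = [
    ("A01", 10, 3, 30), ("A01", 10, 3, 30),
    ("A02", 20, 4, 80), ("A02", 20, 5, 100),
    ("A03", 15, 3, 45), ("A03", 15, 4, 60), ("A03", 15, 5, 75),
    ("A04", 20, 2, 40), ("A04", 20, 3, 60), ("A04", 20, 4, 80),
]

def sum_by_article(column):
    """Return {ArticleID: SUM(column)} for one of the numeric columns."""
    if column not in ("Price", "Quantity", "Sale"):
        raise ValueError(f"unknown column: {column}")
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        # 'sales' is a hypothetical table name.
        conn.execute(
            "CREATE TABLE sales "
            "(ArticleID TEXT, Price INTEGER, Quantity INTEGER, Sale INTEGER)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", ROWS)
        query = (
            f"SELECT ArticleID, SUM({column}) FROM sales "
            "GROUP BY ArticleID ORDER BY ArticleID"
        )
        return dict(conn.execute(query).fetchall())
    finally:
        conn.close()

# Queries 1, 2 and 3 from the list above.
for col in ("Quantity", "Sale", "Price"):
    print(col, sum_by_article(col))
```

All three queries execute without error. Query 3 returns, for instance, 45 for A03, which is just the unit price 15 counted once per row; the database engine has no way to tell that this number means nothing.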

The second issue is, how to rank the meaningful queries? For example, intuitively, Query 1 and Query 2 make more sense than Query 3, and this conclusion can more or less be inferred from the two given relationships. Certainly, the problem becomes more complex when there are relationships among several tables. The tables and the relationships that I study are not very complicated. Could anyone tell me which field this problem belongs to? Are there good results or proposals that are easy to understand?

Asked By : SoftTimur

Answered By : babou

I have no idea whether this has been addressed before. I have never seen it (but I am not in databases), and I would be surprised if no one had addressed it. Since there are no answers yet, I will attempt to give a view that may help. It is very elementary, and it would require more study of how this kind of view combines with the various types of queries one can make. What I write below was strongly influenced by your example and the queries that go with it, but there is more to a relational database.

As remarked by Raphael, generating queries is relatively easy. But that is not your problem: what you want is meaningful queries. That is a semantic issue that depends heavily on the semantics you attach to your columns. I suspect that you have a typing problem, which I tend to see also as a dimensional problem in the physics sense. It may not be easy or even meaningful to attempt to deal with it in a purely automated way. I am not thinking of elementary types as found in most programming languages, but of abstract types as found in object-oriented languages and in all languages with data abstraction.

Hence, first of all, you should look at whether there is something to be learned from object-oriented databases. I also know that physicists use “databases” to store their experimental data before exploiting it with specialized software. If information can be found on these tools (usually very proprietary), they may have addressed this typing issue, though a physicist told me that is not the case.

You should also make clear what you intend to provide as input to your generator, so that your queries make sense. Short of providing some kind of semantic information, a column of numbers is just a column of numbers, and any idea you may have to fold, spindle or mutilate them with SQL queries is fair game.
You have to say what data and what semantic information you provide to your generator if you hope for an answer (a variant of the GIGO principle).

I will attempt to illustrate a possible type view with physics. Physics uses many scalar quantities: weight, distance, speed, date, duration, temperature, heat. Such quantities could qualify the columns in a physicist's database. But not all operations between them make sense. Testing equality of a weight and a date probably does not. In fact, testing equality between two quantities that have different physical dimensions is likely to be meaningless, unless there is some implicit relation that only the database creator knows about. The same goes for adding quantities with different dimensions.

Some quantities are additive, such as weight or volume, while others are not, such as temperature or density, though some of them can be averaged. Physicists distinguish the two kinds by talking of extensive and intensive properties. Extensive properties are proportional to the amount of material, and they can be added. I suspect that these concepts also exist for your store management database, but may have to be refined (I am not sure). The price per unit of a good is clearly an intensive property. Adding these prices makes no sense (whether it concerns the same article or different ones). But taking the average, for example, may make sense if you do it over distinct articles.

Such a classification can lead to more knowledge. For example, the ratio of two extensive properties is an intensive one. This is why the price per unit is intensive. But these concepts apply to properties of materials, and I suspect things may be more complex than that in general. A ratio could be additive in a context where intensive/extensive does not apply.
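
As a rough illustration of the dimensional and extensive/intensive reasoning above, applied to the store example, here is a minimal Python sketch. The `ColumnType` annotation, the base units `currency` and `item`, and the helper names are all assumptions made up for this example; they are one possible encoding, not an established scheme.

```python
from dataclasses import dataclass

def dim(**exponents):
    """Canonical dimension: sorted (base_unit, exponent) pairs, zeros dropped."""
    return tuple(sorted((unit, exp) for unit, exp in exponents.items() if exp != 0))

@dataclass(frozen=True)
class ColumnType:
    """Hypothetical column annotation supplied by the database designer."""
    dimension: tuple   # e.g. dim(currency=1, item=-1) for a price per item
    extensive: bool    # True if the value scales with the amount of goods

def comparable(a, b):
    """Equality tests and additions only make sense within one dimension."""
    return a.dimension == b.dimension

def can_sum(col):
    """Only extensive quantities are treated as additive here."""
    return col.extensive

def ratio_of_extensive(a, b):
    """Type of a / b when both are extensive: exponents subtract, result is intensive."""
    if not (a.extensive and b.extensive):
        raise ValueError("rule only covers the ratio of two extensive columns")
    exps = dict(a.dimension)
    for unit, exp in b.dimension:
        exps[unit] = exps.get(unit, 0) - exp
    return ColumnType(dim(**exps), extensive=False)

# Assumed annotations for the columns of the example table.
QUANTITY = ColumnType(dim(item=1), extensive=True)
SALE = ColumnType(dim(currency=1), extensive=True)
PRICE = ColumnType(dim(currency=1, item=-1), extensive=False)

print(ratio_of_extensive(SALE, QUANTITY) == PRICE)       # Sale / Quantity is typed like Price
print(can_sum(QUANTITY), can_sum(SALE), can_sum(PRICE))  # which columns may be summed
print(comparable(SALE, PRICE))                           # may Sale be compared with Price?
```

With these annotations, summing Quantity or Sale is accepted, summing Price is rejected, and comparing Sale with Price is flagged as dimensionally inconsistent. The annotations themselves still have to come from someone who knows what the columns mean.
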
Still, though speeds can be added in physics, if you have a column of speeds in your database, it is very likely that adding them makes no sense, while taking the average makes a lot of sense. More generally, dimensional equations may give you hints about what does or does not make sense. Maybe there are other concepts like intensive and extensive properties.

Then, in the same spirit as abstract typing in programming languages, where a specification states the allowed primitive operations, you could have a specification of your database that gives you information about columns (such as dimension or allowed operations), from which you can infer whether a query is meaningful. But I do not believe this can be inferred from the database alone, without any semantic knowledge provided by the designer. This knowledge should be part of the design process anyway.

Another point is about ranking. I am not sure that making sense can be rated on a subtler scale than (true, false). If queries make sense at all, they can be very important to some people and of no interest to others. This can probably be measured by actual database use. But again, it also depends on what you want the ranking for, and on the kind of semantic information you can provide.
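
Returning to the idea of a designer-written specification of allowed operations per column, the following Python sketch enumerates aggregate queries and splits them by such a specification. The `SPEC` contents, the table name `sales`, and the function `generate_queries` are hypothetical. The choice of which aggregates to allow is an assumption that follows the intensive/extensive discussion, and the result is a plain true/false classification, not a ranking.

```python
from itertools import product

# Hypothetical designer-supplied specification: aggregates that make sense per column.
SPEC = {
    "Price": {"AVG"},              # intensive: averaging only
    "Quantity": {"SUM", "AVG"},    # extensive: additive
    "Sale": {"SUM", "AVG"},        # extensive: additive
}
AGGREGATES = ("SUM", "AVG")

def generate_queries(spec, group_by, table="sales"):
    """Enumerate every aggregate/column pair and split the SQL by the specification."""
    meaningful, rejected = [], []
    for column, agg in product(spec, AGGREGATES):
        sql = f"SELECT {group_by}, {agg}({column}) FROM {table} GROUP BY {group_by}"
        (meaningful if agg in spec[column] else rejected).append(sql)
    return meaningful, rejected

meaningful, rejected = generate_queries(SPEC, "ArticleID")
print(len(meaningful), "meaningful queries")
print("rejected:", rejected)
```

Here the generator itself is trivial, and all of the judgment lives in `SPEC`, which is exactly the semantic knowledge the designer has to provide.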

Question Source : http://cs.stackexchange.com/questions/29093