Antti Rauhala
Co-founder
March 16, 2020 • 5 min read
We are proud to release the concept learning update. This update includes major improvements not only to Aito's statistical reasoning, but also to its explanation formats and recommendation probabilities. The update also bundles a large number of smaller improvements, optimizations and fixes.
The changes will be rolled into production starting March 23rd. The documentation on the aito.ai website will be updated during the preceding week.
If you have questions or comments, or if you wish to delay the update on your specific instance, please send an email to antti@aito.ai, or join the discussion in our public Slack channel.
The 3 main change categories are:
Reasoning improvements
Bug fixes and optimizations
API changes
The API changes include two incompatibilities: in the $why field and in the relate end point explanations, the proposition strings were replaced with proper proposition objects, because the strings were difficult to use programmatically. You can find the migration guide here.
In essence, the technique turns individual features such as 'limited' and 'edition' into a bigger feature like 'limited edition'. This helps Aito differentiate the 'limited edition' concept from separate concepts like 'limited liability' and 'second edition'. Aito's concept learning method is a representation learning technique based on the minimum description length principle, and it forms higher-level concepts from the facts/knowns presented in the where clause.
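The intuition behind merging co-occurring features can be illustrated with a toy sketch. This is an illustration only, not Aito's actual MDL-based algorithm: it merges token pairs whose co-occurrence rate clearly exceeds what independence would predict, and all names and thresholds here are hypothetical.

```python
from collections import Counter
from itertools import combinations

def find_concepts(documents, min_lift=1.5, min_count=2):
    """Toy concept finder: pair tokens that co-occur across documents far
    more often than independence would predict. Illustration only; Aito's
    actual concept learner is based on minimum description length."""
    n = len(documents)
    token_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        tokens = set(doc.lower().split())
        token_counts.update(tokens)
        pair_counts.update(frozenset(p) for p in combinations(sorted(tokens), 2))
    concepts = []
    for pair, joint in pair_counts.items():
        a, b = sorted(pair)
        # Expected co-occurrence count if the two tokens were independent.
        expected = (token_counts[a] / n) * (token_counts[b] / n) * n
        if joint >= min_count and joint / expected >= min_lift:
            concepts.append((a, b))
    return concepts

docs = [
    "limited edition comic", "limited edition print", "limited edition vinyl",
    "limited liability company", "second edition book",
    "handmade ceramic mug", "organic coffee beans", "wireless phone charger",
    "cotton tote bag", "wooden chess set",
]
print(find_concepts(docs))  # [('edition', 'limited')]
```

Note how 'limited liability' and 'second edition' each occur only once and are not merged, while 'limited edition' survives both the count and lift thresholds.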
This technique will be most visible in the $why field contents. Let's consider an example with the Kickstarter data used in the funding demo. In the example, the aim is to predict the Amazing Bean Comic campaign's success:
{
"from" : "project_status",
"where" : {
"project_id.name" : "Amazing Bean Comic",
"project_id.blurb" : "Once funded a limited edition comic will be produced. This book will contain five never before published strips."
},
"predict" : "reached_goal",
"select" : ["$p", "feature", "$why"]
}
Among the prediction's $why explanations, you can find the following proposition:
{
"type" : "relatedPropositionLift",
"proposition" : {
"$and" : [ {
"project_id.name" : {
"$has" : "comic"
}
}, {
"project_id.blurb" : {
"$has" : "book"
}
}, {
"project_id.blurb" : {
"$has" : "comic"
}
} ]
},
"value" : 0.5720473645702654
}
Aito has found the 'comic book' concept describing the 'Amazing Bean Comic' campaign, and determined that such projects have an altogether 43% lower chance of succeeding. The concept is formed from propositions found across several fields.
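The 43% figure follows directly from the lift value in the explanation above: a lift below 1 scales the probability down, and the relative decrease is one minus the lift.

```python
lift = 0.5720473645702654  # the "value" field of the explanation above

# A lift below 1.0 scales the goal probability down by that factor,
# so the relative decrease is 1 - lift.
decrease = 1 - lift
print(f"{decrease:.0%} lower chance of succeeding")  # 43% lower chance of succeeding
```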
The technique has three major benefits. First, it improves prediction accuracy. Second, it provides better explanations, as Aito does the statistical reasoning and explaining on a higher conceptual level, and informative features are no longer removed as redundant. Third, concept learning can make the probability estimates significantly more accurate. In the Kickstarter dataset, the information gain metric (used to measure probability correctness) improved by 0.184 bits, which is a significant improvement in a prediction setting where the predicted variable contains only 1.038 bits of information.
While concept learning is used for the propositions in the where clause of predict, match, recommend and the generic query, it is not yet applied within the matched or recommended content.
A separate blog post will describe concept learning in greater detail.
Aito's recommend functionality was originally only partially implemented. While the recommendation order was sound, the returned probabilities were not strictly goal probabilities. The biggest improvement in recommend is that the returned probability now is the goal probability. Consider the following recommendation:
{
"from" : "impressions",
"where" : {
"context.user":"veronica",
"context.weekday":"Friday"
},
"recommend":"product",
"goal": {"purchase":true},
"select": ["$p", "name", "$why"],
"limit" : 1
}
Aito will respond with a result of the following format:
{
"offset" : 0,
"total" : 42,
"hits" : [ {
"$p" : 0.35926037903011704,
"name" : "Juhla Mokka coffee 500g sj",
"$why" : {
"type" : "product",
"factors" : [ {
"type" : "baseP",
"value" : 0.04970128598913154,
"proposition" : {
"purchase" : {
"$has" : true
}
}
}, {
"type" : "hitLinkPropositionLift",
"proposition" : {
"product" : {
"$has" : "6411300000494"
}
},
"value" : 0.6923121244195961
}, {
"type" : "hitPropositionLift",
"proposition" : {
"tags" : {
"$has" : "coffee"
}
},
"value" : 4.721811785468995,
"factors" : [ {
"type" : "relatedPropositionLift",
"proposition" : {
"$and" : [ {
"context.user" : {
"$has" : "veronica"
}
}, {
"context.weekday" : {
"$has" : "Friday"
}
} ]
},
"value" : 4.721811785468995
} ]
}, ... ]
}
} ]
}
The format explains that the average click-through rate is 5% (from 'baseP'), and while this specific product has an approximately 30% lower than average CTR (from 'hitLinkPropositionLift'), the user veronica has a 4.72 times higher probability of buying coffee on Fridays (from 'hitPropositionLift' and 'relatedPropositionLift'), which is her weekly shopping day. As a consequence of this (and other factors), the CTR is estimated to be 36%. Previously, the $p value didn't reflect the click probability / CTR.
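To a first approximation the factors compose multiplicatively: the base probability times each lift. Multiplying only the three factors shown above gives roughly 0.16 rather than the returned 0.359, which is consistent with the response listing further factors behind the "..." elision; this is a simplified reading, not Aito's exact computation.

```python
base_p = 0.04970128598913154        # baseP: average purchase rate
product_lift = 0.6923121244195961   # hitLinkPropositionLift: this product vs. average
context_lift = 4.721811785468995    # hitPropositionLift: veronica + Friday context

# Lifts act as multiplicative adjustments on top of the base probability.
p = base_p * product_lift * context_lift
print(round(p, 3))  # 0.162 -- the gap to the returned 0.359 comes from the
                    # factors elided ("...") in the listing above
```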
Write speed is improved in all situations, but especially when the table contains nullable fields.
Matching performance should now be better in situations where base probabilities are used with large datasets, and where matching is done against millions or tens of millions of propositions. This affects both the _match end point and situations where a link to a table is ordered by $p or $lift.
As a performance optimization, Aito now uses sampled partial database statistics whenever this is possible without a significant increase in measurement error. This reduces the need to calculate expensive full database statistics for dense/common features.
Sorting is now much faster. This is noticeable when scoring hundreds of thousands or millions of predicted, matched or recommended items.
Similarity now accepts arbitrary propositions for measuring the similarity score. Previously you couldn't use operators like $knn or $gte to add components to the similarity comparison.
Similarity used to apply a normalizer factor based on the entire row's feature count instead of the corresponding field's feature count. This led to suboptimal results, so the norm was removed.
Relation discovery no longer has a minimum threshold for filtering out weak relations. This means that you can now use relate to find relations between statistically independent variables.
The prediction base probabilities used to be too high, which was visible with small samples and $on propositions. This is now fixed.
Evaluate can now be used to measure prediction accuracy in situations where the query yields no results. This allows the user to compare e.g. hard filters against inference-based solutions.
{
"test" : {
"$index" : {"$mod" : [10, 1]}
},
"evaluate" : {
"from" : {
"from": "impressions",
"where": { "purchase" : true }
},
"where" : {
"product.name" : {"$match": {"$get": "context.query" } }
},
"get":"product"
},
"select" : ["accuracy", "n"]
}
This method will successfully return the correct evaluation result:
{
"accuracy" : 0.3753943217665615,
"n" : 317
}
Still, while the accuracy measurement is correct and sensible, other metrics may not behave that way. For example, the information-theoretic values can only be computed in situations where the correct value was within the result. As a consequence, such estimates may provide inaccurate or even misleading values. Also, ranks are meaningless when the correct result cannot be found.
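The test clause in the evaluate example above, {"$index": {"$mod": [10, 1]}}, selects every row whose index leaves remainder 1 when divided by 10, i.e. roughly a 10% holdout. A toy sketch of that split logic (not Aito's internals, and whether $index is 0- or 1-based is an assumption here):

```python
def mod_split(rows, divisor=10, remainder=1):
    """Mimic the {"$index": {"$mod": [divisor, remainder]}} test clause:
    rows whose index % divisor == remainder form the test set."""
    test = [r for i, r in enumerate(rows) if i % divisor == remainder]
    train = [r for i, r in enumerate(rows) if i % divisor != remainder]
    return train, test

rows = list(range(25))
train, test = mod_split(rows)
print(test)  # [1, 11, 21] -> roughly a 10% holdout
```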
The $value field has been added to contain the information in the 'feature' field. Consider the query:
{
"from" : "products",
"where" : {
"title" : "apple iphone"
},
"predict": "tags",
"select" : ["$p", "$value"],
"limit":3
}
The query will have the following result:
{
"offset" : 0,
"total" : 10,
"hits" : [ {
"$p" : 0.3656914544001758,
"$value" : "premium"
}, {
"$p" : 0.1546922568903658,
"$value" : "cover"
}, {
"$p" : 0.09493670104339776,
"$value" : "macosx"
} ]
}
This change is motivated by the need to have a single $value field that can be used consistently in predict, match and recommend, independent of whether the returned entity is a proposition, a field value or a linked table row. $value was chosen to avoid namespace clashes with the linked row's features. The aim is to deprecate both the 'field' and 'feature' fields.
$value is currently hidden, but it can be selected with the select clause in every situation where 'feature' or 'field' is returned.
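With a single $value field, client code needs only one code path regardless of what kind of entity was returned. A minimal sketch of consuming the hits above (the response is hard-coded from the example; a real client would receive it over HTTP):

```python
# Response from the predict query above, as returned by Aito.
response = {
    "offset": 0,
    "total": 10,
    "hits": [
        {"$p": 0.3656914544001758, "$value": "premium"},
        {"$p": 0.1546922568903658, "$value": "cover"},
        {"$p": 0.09493670104339776, "$value": "macosx"},
    ],
}

# One loop, whether the hit was a feature, a field value or a linked row.
for hit in response["hits"]:
    print(f"{hit['$value']}: {hit['$p']:.1%}")  # e.g. "premium: 36.6%"
```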
$proposition was added to make it easier to reuse the returned proposition in Aito queries. You can select $proposition in the select clause.
{
"from": "products",
"where": {
"title": "Apple"
},
"predict": {
"$on": [
{ "$exists": "tags" },
{ "$and": [
{ "tags": { "$match": "phone" } },
{ "$not": { "tags": { "$match": "laptop" } } }
] }
]
},
"select": ["$p", "$value", "$proposition"],
"limit": 1
}
This provides the following results:
{
"offset" : 0,
"total" : 10,
"hits" : [ {
"$p" : 0.22622976807854914,
"$value" : "phone",
"$proposition" : {
"$on" : [ {
"tags" : {
"$has" : "phone"
}
}, {
"$and" : [ {
"tags" : {
"$has" : "phone"
}
}, {
"$not" : {
"tags" : {
"$has" : "laptop"
}
}
} ]
} ]
}
} ]
}
The returned proposition format is the same as in relate and $why. It can be used directly in the where clause.
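Because the returned object uses the same format as the where clause, it can be embedded in a follow-up query without any translation layer. A sketch using the $proposition from the example above (the follow-up query's shape is an assumption based on this post's examples):

```python
# The $proposition object returned by the predict query above.
proposition = {
    "$on": [
        {"tags": {"$has": "phone"}},
        {"$and": [
            {"tags": {"$has": "phone"}},
            {"$not": {"tags": {"$has": "laptop"}}},
        ]},
    ]
}

# The object drops straight into a where clause of a follow-up query.
followup = {
    "from": "products",
    "where": proposition,
    "limit": 10,
}
print(followup["where"] is proposition)  # True: no string parsing needed
```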
One can now select the medianNs, medianUs and medianMs metrics to access the median query times. There are also allNs, allUs and allMs, which contain an array of time measurements for each evaluated query.
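The relationship between the two kinds of metrics is simple: medianMs is the median of the allMs array. A small sketch recomputing it client-side from a hypothetical allMs array (the measurements here are made up):

```python
def median(values):
    """Median of a list of numbers: middle element, or mean of the two
    middle elements for even-length input."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

all_ms = [12, 8, 31, 9, 15]  # hypothetical per-query times in milliseconds
print(median(all_ms))         # 12 -- what medianMs would report
```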
New types of explanations have been added to $why, replacing most of the existing ones. Instead of variable factors like relatedVariableLift, you'll get proposition factors like relatedPropositionLift. Proposition factors don't include a string field called 'variable', but instead have an object field called 'proposition' with the same structure as query propositions. If you have been parsing the 'variable' fields, you'll have to adapt your code to the new format; see details below.
{
"type" : "relatedPropositionLift",
"proposition" : {
"$and" : [ {
"project_id.name" : {
"$has" : "comic"
}
}, {
"project_id.blurb" : {
"$has" : "book"
}
}, {
"project_id.blurb" : {
"$has" : "comic"
}
} ]
},
"value" : 1.3066457238264617
}
This change has two major benefits. First, the proposition format is now easier to parse programmatically. Second, it can be reused in the where clause; you can, for example, use it in the relate end point to request additional statistics for the explanation.
The obvious disadvantage is that the new format is incompatible with the old one. Still, this breaking API change was introduced because the old design was considered broken: the textual proposition format was difficult to parse and inconsistent with the where clause format.
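To see why the object format is easier to handle than the old strings, here is a sketch of a recursive walker that flattens any proposition into (field, value) pairs. The function name is my own; the structure it handles ($and, $on, $not, $has) follows the examples in this post.

```python
def flatten(proposition):
    """Recursively collect (field, value) pairs from a proposition object,
    handling the $and/$on/$not/$has structures shown in this post."""
    pairs = []
    for key, value in proposition.items():
        if key in ("$and", "$on"):
            for sub in value:           # list of sub-propositions
                pairs.extend(flatten(sub))
        elif key == "$not":
            pairs.extend(flatten(value))
        elif isinstance(value, dict) and "$has" in value:
            pairs.append((key, value["$has"]))
    return pairs

# The 'comic book' concept proposition from earlier in this post.
example = {
    "$and": [
        {"project_id.name": {"$has": "comic"}},
        {"project_id.blurb": {"$has": "book"}},
        {"project_id.blurb": {"$has": "comic"}},
    ]
}
print(flatten(example))
# [('project_id.name', 'comic'), ('project_id.blurb', 'book'), ('project_id.blurb', 'comic')]
```

With the old string format this would have required ad hoc parsing; with objects it is a few lines of recursion.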
A similar change has been made to the relate end point's results. The old textual proposition identifier format has been replaced with the new object-based proposition format:
{
"related" : {
"project_id.blurb" : {
"$has" : "app"
}
},
"lift" : 0.24061375073833635,
"condition" : {
"$on" : [ {
"reached_goal" : {
"$has" : true
}
}, {
"days_until_deadline" : {
"$has" : 0
}
} ]
}
}
Both the related and condition fields now contain 'where'-clause propositions in their full form, including all $on, $has and other substructures.