dssg · thcrock · Dec 14, 2018 · Feb 21, 2019 · Feb 21, 2019 · Feb 27, 2019
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -75,13 +75,17 @@ nav:
     - Defining an Experiment: experiments/defining.md
     - Testing Feature Configuration: experiments/feature-testing.md
     - Running an Experiment: experiments/running.md
-    - Upgrading an Experiment: experiments/upgrading.md
+    - Upgrading an Experiment:
+        to v5: experiments/upgrade-to-v5.md
+        to v6: experiments/upgrade-to-v6.md
+        to v7: experiments/upgrade-to-v7.md
     - Temporal Validation Deep Dive: experiments/temporal-validation.md
     - Cohort and Label Deep Dive: experiments/cohort-labels.md
     - Prediction Ranking: experiments/prediction-ranking.md
     - Feature Generation Recipe Book: experiments/features.md
     - Experiment Algorithm: experiments/algorithm.md
     - Experiment Architecture: experiments/architecture.md
+    - Extending Experiment Features: experiments/extending-features.md
   - Model selection: dirtyduck/docs/audition.md
   - Postmodeling: postmodeling/index.md
   - Model governance:  dirtyduck/docs/ml_governance.md

diff --git a/docs/sources/experiments/extending-features.md b/docs/sources/experiments/extending-features.md
@@ -0,0 +1,145 @@
+# Extending Feature Generation
+
+This document describes how to extend Triage's feature generation capabilities by writing new FeatureBlock classes and incorporating them into Experiments.
+
+## What is a FeatureBlock?
+
+A FeatureBlock represents a single feature table in the database and how to generate it. If you're familiar with `collate` parlance, a `SpacetimeAggregation` is similar in scope to a FeatureBlock. A `FeatureBlock` class can be instantiated with whatever arguments it needs, and from there can provide queries to produce its output feature table. Full-size Triage experiments tend to contain multiple feature blocks. These all live in a collection, available as the `experiment.feature_blocks` property of the Experiment.
+
+## What existing FeatureBlock classes can I use?
+
+Class name | Experiment config key | Use
+------------ | ------------- | ------------
+triage.component.collate.SpacetimeAggregation | spacetime_aggregation  | Temporal aggregations of event-based data
+
+## Writing a new FeatureBlock class
+
+The `FeatureBlock` base class defines a set of abstract methods that any child class must implement, as well as a number of initialization arguments that it must accept and honor in order to fulfill the expectations Triage users have of feature generators. Triage expects these classes to define the queries they need to run, as opposed to generating the tables themselves, so that Triage can implement scaling by parallelization.
+
+### Abstract methods
+
+Any method here without parentheses afterwards is expected to be a property.
+
+Method | Task | Return Type
+------------ | ------------- | -------------
+feature_columns | The list of feature columns in the final, postimputation table. Should exclude any index columns (e.g. entity id, date) | list
+preinsert_queries | Return all queries that should be run before inserting any data. The creation of your feature table should happen here, and is expected to have `entity_id(integer)` and `as_of_date(timestamp)` columns. | list
+insert_queries | Return all inserts to populate this data. Each query in this list should be parallelizable, and should be valid after all `preinsert_queries` are run. | list
+postinsert_queries | Return all queries that should be run after inserting all data | list
+imputation_queries | Return all queries that should be run to fill in missing data with imputed values. | list
+
+Any of the query list properties can be empty: for instance, if your implementation doesn't have inserts separate from table creation and is just one big query (e.g. a `CREATE TABLE AS`), you could just define `preinsert_queries` to be that one mega-query and leave the other properties as empty lists.
+
+### Properties Provided by Base Class
+
+There are several attributes/properties that can be used within subclass implementations that the base class provides. Triage experiments take care of providing this data during runtime: if you want to instantiate a FeatureBlock object on your own, you'll have to provide them in the constructor.
+
+Name | Type | Purpose
+------------ | ------------- | -------------
+as_of_dates | list | Features are created "as of" specific dates, and Triage expects that each of these dates will be populated with a row for each member of the cohort on that date.
+cohort_table | string | The final shape of the feature table should at least include every entity id/date pair in this cohort table.
+final_feature_table_name | string | The name of the final table with all features filled in (no missing values). This is provided by the user in feature config, as the key that corresponds to the configuration section that instantiates the feature block
+db_engine | sqlalchemy.engine | The engine to use to access the database. Although these instances are mostly returning queries, the engine may be useful for implementing imputation.
+features_schema_name | string | The database schema where all feature tables should reside. Defaults to None, which ends up in the public schema.
+feature_start_time | string/datetime | A time before which no data should be considered for features. This is generally only applicable if your FeatureBlock is doing temporal aggregations. Defaults to None, which means no data will be excluded.
+features_ignore_cohort | bool | If False (the default), features are only computed for members of the cohort. If True, the cohort is ignored and the final feature table could include more rows.
+
+
+`FeatureBlock` child classes can, and in almost all cases will, accept more configuration at initialization time that is specific to them. They will probably also define many more methods to use internally. But as long as they adhere to this interface, they'll work with Triage.
+
+### Making the new FeatureBlock available to experiments
+
+Triage Experiments run on serializable configuration, and although it's possible to take fully generated `FeatureBlock` instances and bypass this (e.g. `experiment.feature_blocks = <my_collection_of_feature_blocks>`), it's not recommended. The last step is to pick a config key for use within the `features` section of experiment configs, add it to `triage.component.architect.feature_block_generators.FEATURE_BLOCK_GENERATOR_LOOKUP`, and point it to a function that instantiates your objects based on config.
+
+## Example
+
+That's a lot of information! Let's see this in action. Let's say that we want to create a very flexible type of feature that simply runs a configured query with a parametrized as-of-date and returns its result as a feature.
+
+```python
+from triage.component.architect.feature_block import FeatureBlock
+
+
+class SimpleQueryFeature(FeatureBlock):
+    def __init__(self, query, *args, **kwargs):
+        self.query = query
+        super().__init__(*args, **kwargs)
+
+    @property
+    def feature_columns(self):
+        return ['myfeature']
+
+    @property
+    def preinsert_queries(self):
+        return [f"create table {self.final_feature_table_name}" "(entity_id bigint, as_of_date timestamp, myfeature float)"]
+
+    @property
+    def insert_queries(self):
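+        # {as_of_date} in the user's query is left as a placeholder and filled in per date below.
+        if self.features_ignore_cohort:
+            select_query = self.query
+        else:
+            # Restrict rows to the entity/date pairs present in the cohort table.
+            select_query = f"""
+                select raw.* from ({self.query}) raw
+                join {self.cohort_table} using (entity_id, as_of_date)
+            """
+        # Each returned statement must actually insert into the feature table.
+        final_query = f"insert into {self.final_feature_table_name} {select_query}"
+        return [
+            final_query.format(as_of_date=date)
+            for date in self.as_of_dates
+        ]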
+
+    @property
+    def postinsert_queries(self):
+        return [f"create index on {self.final_feature_table_name} (entity_id, as_of_date)"]
+
+    @property
+    def imputation_queries(self):
+        return [f"update {self.final_feature_table_name} set myfeature = 0.0 where myfeature is null"]
+```
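+
+To make the templating step concrete, here is a minimal standalone sketch of how one query template expands into one insert statement per as-of-date. The `build_insert_queries` helper and its arguments are hypothetical names used only for illustration; they are not part of Triage.
+
+```python
+def build_insert_queries(query, as_of_dates, table_name, cohort_table=None):
+    """Expand a query template into one insert statement per as-of-date.
+
+    Hypothetical helper: `query` is expected to contain an {as_of_date} placeholder.
+    """
+    if cohort_table is None:
+        select_query = query
+    else:
+        # Keep only the entity/date pairs that appear in the cohort table.
+        select_query = (
+            f"select raw.* from ({query}) raw "
+            f"join {cohort_table} using (entity_id, as_of_date)"
+        )
+    template = f"insert into {table_name} {select_query}"
+    # The placeholder survives the f-strings above and is filled in here.
+    return [template.format(as_of_date=date) for date in as_of_dates]
+
+
+example_query = (
+    "select entity_id, '{as_of_date}'::timestamp as as_of_date, quantity "
+    "from source_table where date < '{as_of_date}'"
+)
+for statement in build_insert_queries(
+    example_query, ["2016-01-01", "2016-02-01"], "my_features"
+):
+    print(statement)
+```
+
+Because each date yields its own statement, the statements can be run independently of one another, which is what makes them parallelizable.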
+
+This class would allow many different uses: basically any query a user can come up with would be a feature. To instantiate this class outside of triage with a simple query, you could:
+
+```python
+feature_block = SimpleQueryFeature(
+    query="select entity_id, as_of_date, quantity from source_table where date < '{as_of_date}'",
+    as_of_dates=["2016-01-01"],
+    final_feature_table_name="my_features",
+    cohort_table="my_cohort_table",
+    db_engine=triage.create_engine(<..mydbinfo..>)
+)
+
+feature_block.run_preimputation()
+feature_block.run_imputation()
+```
+
+To use it from a Triage experiment, modify `triage.component.architect.feature_block_generators.py` and submit a pull request:
+
+Before:
+
+```python
+FEATURE_BLOCK_GENERATOR_LOOKUP = {
+    'spacetime_aggregation': generate_spacetime_aggregations
+}
+```
+
+After:
+
+```python
+FEATURE_BLOCK_GENERATOR_LOOKUP = {
+    'spacetime_aggregation': generate_spacetime_aggregations,
+    'simple_query': SimpleQueryFeature,
+}
+```
+
+At this point, you could use it in an experiment configuration by adding a feature table section and specifying the `feature_generator_type` key to be the name you just put in the lookup, `simple_query`. All other keys/values in that config block will be passed to the constructor to your class. Since the class you defined only takes in one extra keyword argument (the query), the only other key you need to specify in config is that query.
+
+An example:
+
+```yaml
+
+features:
+    my_feature_table:
+        feature_generator_type: "simple_query" 
+        query: "select entity_id, as_of_date, quantity from source_table where date < '{as_of_date}'"
+    my_other_feature_table:
+        feature_generator_type: "simple_query"
+        query: "select entity_id, as_of_date, other_quantity from other_source_table where date < '{as_of_date}'"
+```
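+
+The dispatch described above (look up `feature_generator_type`, then pass the remaining keys to the constructor) can be pictured roughly as follows. This is a sketch under assumptions, not Triage's actual implementation: `blocks_from_features_config` and `RecordingBlock` are hypothetical names, and the real `feature_blocks_from_config` may differ in signature and behavior.
+
+```python
+def blocks_from_features_config(features_config, lookup, **shared_kwargs):
+    """Build one object per entry in the `features` section of a config.
+
+    Hypothetical sketch: `lookup` maps a feature_generator_type to a callable.
+    """
+    blocks = []
+    for table_name, block_config in features_config.items():
+        options = dict(block_config)  # copy so the caller's config is untouched
+        generator_type = options.pop("feature_generator_type")
+        if generator_type not in lookup:
+            raise ValueError(f"Unknown feature_generator_type: {generator_type!r}")
+        # The config section's key becomes the output table name.
+        blocks.append(
+            lookup[generator_type](
+                final_feature_table_name=table_name, **options, **shared_kwargs
+            )
+        )
+    return blocks
+
+
+class RecordingBlock:
+    """Hypothetical stand-in for a FeatureBlock that only stores its arguments."""
+
+    def __init__(self, **kwargs):
+        self.kwargs = kwargs
+
+
+features = {
+    "my_feature_table": {
+        "feature_generator_type": "simple_query",
+        "query": "select entity_id, as_of_date, quantity from source_table",
+    }
+}
+blocks = blocks_from_features_config(
+    features, {"simple_query": RecordingBlock}, as_of_dates=["2016-01-01"]
+)
+print(blocks[0].kwargs["final_feature_table_name"])
+```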
diff --git a/docs/sources/experiments/feature-testing.md b/docs/sources/experiments/feature-testing.md
@@ -2,26 +2,27 @@
 
 Developing features for Triage experiments can be a daunting task. There are a lot of things to configure, a small amount of configuration can result in a ton of SQL, and it can take a long time to validate your feature configuration in the context of an Experiment being run on real data.
 
-To speed up the process of iterating on features, you can run a list of feature aggregations, without imputation, on just one as-of-date. This functionality can be accessed through the `triage` command line tool or called directly from code (say, in a Jupyter notebook) using the `FeatureGenerator` component.
+To speed up the process of iterating on features, you can run a list of feature aggregations, without imputation, on just one as-of-date. This functionality can be accessed through the `triage` command line tool or called directly from code (say, in a Jupyter notebook) using the `feature_blocks_from_config` utility.
 
 ## Using Triage CLI
-![triage featuretest cli help screen](featuretest-cli.png)
 
 The command-line interface for testing features takes in two arguments:
-	- A feature config file. Refer to [example_feature_config.yaml](https://github.com/dssg/triage/blob/master/example/config/feature.yaml). Essentially this is the content of the [example_experiment_config.yaml](https://github.com/dssg/triage/blob/master/example/config/experiment.yaml)'s `feature_aggregations` section. It consists of a YAML list, with one or more feature_aggregation rows present.
-	- An as-of-date. This should be in the format `2016-01-01`.
 
-Example: `triage experiment featuretest example/config/feature.yaml 2016-01-01`
+- An experiment config file. It should have at least a `features` section, and if a `cohort_config` section is present, it will use that to limit the number of feature rows it creates to the cohort at the given date. Other keys can be present but are ignored. In other words, you can use your experiment config file either before or after it's fully completed.
+- An as-of-date. This should be in the format `2016-01-01`.
+
+Example: `triage experiment featuretest example/config/experiment.yaml 2016-01-01`
 
 All given feature aggregations will be processed for the given date. You will see a bunch of queries pass by in your terminal, populating tables in the `features_test` schema which you can inspect afterwards.
 
 ![triage feature test result](featuretest-result.png)
 
 ## Using Python Code
-If you'd like to call this from a notebook or from any other Python code, the arguments look similar but are a bit different. You have to supply your own sqlalchemy database engine to create a 'FeatureGenerator' object, and then call the `create_features_before_imputation` method with your feature config as a list of dictionaries, along with an as-of-date as a string. Make sure your logging level is set to INFO if you want to see all of the queries.
+If you'd like to call this from a notebook or from any other Python code, the arguments look similar but are a bit different. You have to supply the same arguments plus a few others to the `feature_blocks_from_config` function to create a set of feature blocks, and then call the `run_preimputation` method on each feature block. Make sure your logging level is set to INFO if you want to see all of the queries.
+
 
 ```
-from triage.component.architect.feature_generators import FeatureGenerator
+from triage.component.architect.feature_block_generators import feature_blocks_from_config
 from triage.util.db import create_engine
 import logging
 import yaml
@@ -32,28 +33,37 @@ logging.basicConfig(level=logging.INFO)
 db_url = 'your db url here'
 db_engine = create_engine(db_url)
 
-feature_config = [{
-	'prefix': 'aprefix',
-	'aggregates': [
-		{
-		'quantity': 'quantity_one',
-		'metrics': ['sum', 'count'],
-	],
-	'categoricals': [
-		{
-			'column': 'cat_one',
-			'choices': ['good', 'bad'],
-			'metrics': ['sum']
-		},
-	],
-	'groups': ['entity_id', 'zip_code'],
-	'intervals': ['all'],
-	'knowledge_date_column': 'knowledge_date',
-	'from_obj': 'data'
-}]
-
-FeatureGenerator(db_engine, 'features_test').create_features_before_imputation(
-	feature_aggregation_config=feature_config,
-	feature_dates=['2016-01-01']
+feature_config = {
+    'myfeaturetable': {
+        'feature_generator_type': 'spacetime_aggregation',
+        'prefix': 'aprefix',
+        'aggregates': [
+            {
+                'quantity': 'quantity_one',
+                'metrics': ['sum', 'count'],
+            }
+        ],
+        'categoricals': [
+            {
+                'column': 'cat_one',
+                'choices': ['good', 'bad'],
+                'metrics': ['sum']
+            },
+        ],
+        'groups': ['entity_id', 'zip_code'],
+        'intervals': ['all'],
+        'knowledge_date_column': 'knowledge_date',
+        'from_obj': 'data'
+    }
+}
+
+feature_blocks = feature_blocks_from_config(
+    feature_config,
+    as_of_dates=['2016-01-01'],
+    cohort_table=None,
+    db_engine=db_engine,
+    features_schema_name="features_test",
 )
+for feature_block in feature_blocks:
+    feature_block.run_preimputation(verbose=True)
 ```
diff --git a/docs/sources/experiments/upgrade-to-v7.md b/docs/sources/experiments/upgrade-to-v7.md
@@ -0,0 +1,71 @@
+# Upgrading your experiment configuration to v7
+
+
+This document details the steps needed to update a triage v6 configuration to
+v7, mimicking the old behavior.
+
+Experiment configuration v7 includes only one change from v6: the features are given at a different key. Instead of `feature_aggregations`, there is now a more generic `features` key, which makes space for non-collate features to be added in the future. The value of this key is a dictionary. Each of its keys is the desired output table name for a feature table, and each value is the same as the configuration for a feature aggregation from before, with one addition: a new key called `feature_generator_type` specifies which method is used to generate this feature table. Since non-collate features have not been added yet, there is only one valid value for this key: `spacetime_aggregation`.
+
+Since the output feature table name is now configurable, there are two things to note:
+
+- Final tables won't necessarily be suffixed with `_aggregation_imputed` as they were before. If you would like to use the old naming system, for instance to avoid having to change postmodeling code that reads features from the database, you can add that suffix to your table name. The example below does set the table name to match what it was before, but there's no reason you have to follow this if you don't want! You can call the table whatever you want.
+- The `prefix` key is no longer used to construct the table name. It is still used to prefix column names, if present. If not present, the name of the feature table will be used.
+
+
+
+Old:
+
+```
+feature_aggregations:
+    -
+        prefix: 'prefix'
+        from_obj: 'cool_stuff'
+        knowledge_date_column: 'open_date'
+        aggregates_imputation:
+            all:
+                type: 'constant'
+                value: 0
+        aggregates:
+            -
+                quantity: 'homeless::INT'
+                metrics: ['count', 'sum']
+        intervals: ['1 year', '2 year']
+        groups: ['entity_id']
+```
+
+New:
+
+```
+features:
+    prefix_aggregation_imputed:
+        feature_generator_type: 'spacetime_aggregation'
+        prefix: 'prefix'
+        from_obj: 'cool_stuff'
+        knowledge_date_column: 'open_date'
+        aggregates_imputation:
+            all:
+                type: 'constant'
+                value: 0
+        aggregates:
+            -
+                quantity: 'homeless::INT'
+                metrics: ['count', 'sum']
+        intervals: ['1 year', '2 year']
+        groups: ['entity_id']
+```
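+
+If you have many aggregations to convert, the same transformation can be scripted. The following is a minimal sketch, assuming the config has already been loaded into a dictionary and that every aggregation has a `prefix`; `upgrade_features_v6_to_v7` is a hypothetical helper, not something Triage provides. It names each table `<prefix>_aggregation_imputed` to mimic the old table names.
+
+```python
+import copy
+
+
+def upgrade_features_v6_to_v7(config):
+    """Return a copy of `config` with `feature_aggregations` moved to `features`.
+
+    Hypothetical helper: assumes each aggregation dict contains a 'prefix' key.
+    """
+    upgraded = copy.deepcopy(config)
+    features = {}
+    for aggregation in upgraded.pop("feature_aggregations", []):
+        # Reuse the old naming scheme so downstream code keeps working.
+        table_name = f"{aggregation['prefix']}_aggregation_imputed"
+        features[table_name] = {
+            "feature_generator_type": "spacetime_aggregation",
+            **aggregation,
+        }
+    upgraded["features"] = features
+    return upgraded
+
+
+old_config = {
+    "feature_aggregations": [
+        {"prefix": "prefix", "from_obj": "cool_stuff", "groups": ["entity_id"]}
+    ]
+}
+new_config = upgrade_features_v6_to_v7(old_config)
+print(list(new_config["features"]))
+```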
+
+## Upgrading the experiment config version
+
+At this point, you should be able to bump the top-level experiment config version to v7:
+
+Old:
+
+```
+config_version: 'v6'
+```
+
+New:
+
+```
+config_version: 'v7'
+```
+
diff --git a/docs/sources/experiments/upgrading.md b/docs/sources/experiments/upgrading.md