https://github.com/ccao-data/model-sales-val

Heuristics for detecting outlier and non-arms-length sales

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Keywords

aws-glue aws-s3 model python
Last synced: 5 months ago

Repository

Heuristics for detecting outlier and non-arms-length sales

Basic Info
  • Host: GitHub
  • Owner: ccao-data
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 3.86 MB
Statistics
  • Stars: 2
  • Watchers: 0
  • Forks: 1
  • Open Issues: 17
  • Releases: 0
Topics
aws-glue aws-s3 model python
Created over 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License

README.md

Table of Contents

Model Overview

This repository contains code to identify and flag sales that may be non-arms-length transactions. A non-arms-length sale occurs when the buyer and seller have a relationship that influences the transaction price, leading to a sale that doesn't reflect the true market value of the property.

The sales validation model (hereafter referred to as "the model") uses simple statistics and heuristics to identify such sales. For example, it calculates the standard deviation in (log) sale price by area and property type, then flags any sales beyond a certain number of standard deviations from the mean. It also uses a variety of common heuristics, such as matching last names, foreclosure information, etc.
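The standard-deviation approach described above can be sketched in a few lines of pandas. This is a minimal illustration, not the repository's actual implementation; the column names (`sale_price`, `township`, `class`) and the grouping choice are assumptions:

```python
import numpy as np
import pandas as pd

def flag_price_outliers(df, group_cols=("township", "class"), threshold=2.0):
    """Flag sales whose log sale price is more than `threshold` standard
    deviations from the mean log price of their group."""
    log_price = np.log(df["sale_price"])
    groups = [df[c] for c in group_cols]
    z = (log_price - log_price.groupby(groups).transform("mean")) / (
        log_price.groupby(groups).transform("std")
    )
    return df.assign(
        sv_is_outlier=z.abs() > threshold,
        sv_outlier_direction=np.where(z > 0, "high", "low"),
    )

sales = pd.DataFrame({
    "township": ["A"] * 6,
    "class": ["202"] * 6,
    "sale_price": [100_000, 110_000, 105_000, 95_000, 102_000, 1_000_000],
})
flagged = flag_price_outliers(sales)
```

Working in log prices keeps the threshold meaningful despite the heavy right skew of sale prices; in this toy group only the $1M sale exceeds two standard deviations.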

Non-arms-length transactions can affect any process that uses sales data. As such, we currently use the output of this model to exclude flagged transactions from:

In the future, it is likely that the flags produced by this model will also be used in public-facing ratio studies and reporting, as well as internally by the office.

What Gets Flagged

Sales from 2014 through the present are flagged by this model. New sales are flagged on an ad-hoc basis as they are collected by the Department of Revenue and made available to the Data Department. See Model run modes for more information.

Commercial, industrial, and land-only property sales are not flagged by this model. Residential and condominium sales are flagged with the following specifications:

Residential

  • Includes classes:
    • 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 218, 219, 234, 278, 295
  • Excludes:
    • Multi-card sales

Condos

  • Includes classes:
    • 297, 299, 399
  • Excludes:
    • Parking spaces
    • Building common areas

Common exclusions for both residential and condos

  • Excludes any sales with a duplicate price within 365 days
  • Excludes sales less than $10,000
  • Excludes multi-PIN sales
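These shared exclusions amount to three simple filters. The sketch below is illustrative only: the column names (`sale_price`, `sale_date`, `is_multi_pin`) are assumed, and the duplicate-price rule is read here as "drop a sale if another sale at the same price occurred within the prior 365 days":

```python
import pandas as pd

def apply_common_exclusions(df):
    """Apply the shared residential/condo exclusions. Column names are
    assumed, not taken from the repository."""
    df = df[df["sale_price"] >= 10_000]   # drop sales under $10,000
    df = df[~df["is_multi_pin"]]          # drop multi-PIN sales
    # Drop a sale if another sale at the same price occurred within the
    # prior 365 days (one reading of the duplicate-price rule)
    df = df.sort_values("sale_date")
    gap = df.groupby("sale_price")["sale_date"].diff()
    return df[~(gap <= pd.Timedelta(days=365))]

sales = pd.DataFrame({
    "sale_price": [5_000, 200_000, 200_000, 300_000],
    "sale_date": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-06-01", "2020-03-01"]
    ),
    "is_multi_pin": [False, False, False, True],
})
kept = apply_common_exclusions(sales)
```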

Outlier Types

Outlier flags are broken out into two types: statistical outliers and heuristic outliers.

  • Statistical outliers are sales that are a set number of standard deviations (usually 2) away from the mean of a group of similar properties (e.g. within the same township, class, timeframe, etc.).
  • Heuristic outliers use some sort of existing flag or rule to identify potentially non-arms-length sales. Heuristic outliers are always combined with a statistical threshold, i.e. a sale with a matching last name must also be N standard deviations from a group mean in order to be flagged. Examples of heuristic outlier flags include:
    • PTAX flag: The PTAX-203 form is required by the Illinois Department of Revenue for most property transfers. Certain fields on this form are highly indicative of a non-arms-length transaction, i.e. Question 10 indicating a short sale.
    • Non-person sale: Flagged keyword suggests the sale involves a non-person legal entity (industrial buyer, bank, real estate firm, construction, etc.).
    • Flip sale: Flagged when the owner of the home owned the property for less than 1 year.
    • Anomaly: Flagged via an unsupervised machine learning model (isolation forest).
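As a minimal sketch of the combination rule above (the column names are hypothetical, not the repository's):

```python
import pandas as pd

# Hypothetical inputs: a heuristic indicator (matching last names) and a
# per-sale z-score produced by the statistical step.
sales = pd.DataFrame({
    "last_name_match": [True, True, False, False],
    "price_z_score": [2.7, 0.4, 3.1, 1.5],
})
THRESHOLD = 2.0  # standard deviations from the group mean

# The heuristic flag fires only when the statistical threshold is also met
sales["sv_family_sale"] = (
    sales["last_name_match"] & (sales["price_z_score"].abs() > THRESHOLD)
)
```

The third row shows the key property: a large statistical deviation alone does not produce a heuristic flag, and a heuristic match alone (second row) does not either.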

The following is a list of all current flag types:

High Price

| Indicator               | Criteria                                                      |
|-------------------------|---------------------------------------------------------------|
| PTAX outlier (high)     | PTAX flag & [1 high statistical outlier type]                 |
| Home flip sale (high)   | Short-term owner < 1 year & [1 high statistical outlier type] |
| Family sale (high)      | Last name match & [1 high statistical outlier type]           |
| Non-person sale (high)  | Legal / corporate entity & [1 high statistical outlier type]  |
| Anomaly (high)          | Anomaly algorithm (high) & [1 high statistical outlier type]  |
| High price (raw & sqft) | High price & high price per sq. ft.                           |
| High price swing        | Large swing away from mean & high price outlier               |
| High price (raw)        | High price                                                    |
| High price (per sqft)   | High price per sq. ft.                                        |

Low Price

| Indicator              | Criteria                                                     |
|------------------------|--------------------------------------------------------------|
| PTAX outlier (low)     | PTAX flag & [1 low statistical outlier type]                 |
| Home flip sale (low)   | Short-term owner < 1 year & [1 low statistical outlier type] |
| Family sale (low)      | Last name match & [1 low statistical outlier type]           |
| Non-person sale (low)  | Legal / corporate entity & [1 low statistical outlier type]  |
| Anomaly (low)          | Anomaly algorithm (low) & [1 low statistical outlier type]   |
| Low price (raw & sqft) | Low price & low price per sq. ft.                            |
| Low price swing        | Large swing away from mean & low price outlier               |
| Low price (raw)        | Low price (or under $10k)                                    |
| Low price (per sqft)   | Low price per sq. ft.                                        |

Distribution of Outlier Types

As of 2024-03-15, around 6.9% of the total sales have some sort of outlier classification. Within that 6.9%, the proportion of each outlier type is:

Flagging Details

Model run modes

The model can be executed in three distinct run modes, depending on the state of the sales data and the specific requirements for flagging:

  1. Initial Run: This mode is triggered when no sales have been flagged; it instantiates the flagging tables and flags all sales for the first time.
  2. Manual Update: This mode is used when sales need to be re-flagged, either due to errors or methodology updates. This allows for the selective re-flagging of sales. It also assigns flags to unflagged sales.
  3. Manual Update (New Sales Only): This mode borrows much of the same logic as the normal 'Manual Update' mode, but is used only to flag sales that do not have a current sales-val model determination. It will not re-flag any sales like the normal 'Manual Update' would.

```mermaid
graph TD
subgraph "Manual Update Mode"
    A3{{"Sales must be re-flagged"}}
    B3[Specify subset in yaml]
    C3[Run manual_update.py]
    D3[Increment version if sale already flagged]
    E3[Assign Version = 1 if sale unflagged]
    F3[Update flags in default.vw_pin_sale]
    G3[Save results to S3 with new run ID]

    A3 -->|Manual selection| B3
    B3 -->|Run update| C3
    C3 -->|Version check| D3
    D3 -->|Update process| F3
    C3 -->|New flag| E3
    E3 -->|Update process| F3
    F3 -->|Persist results| G3
end

subgraph "Manual Update (New Sales Only) Mode"
    A4{{"Flag only new sales"}}
    B4[Identify sales with no current model determination]
    C4[Run manual_update.py]
    E4[Assign Version = 1 if sale unflagged]
    F4[Update flags in default.vw_pin_sale]
    G4[Save results to S3 with new run ID]

    A4 -->|Filter new sales| B4
    B4 -->|Run update| C4
    C4 -->|New flag only| E4
    E4 -->|Update process| F4
    F4 -->|Persist results| G4
end

subgraph "Initial Run Mode"
    A1{{"No sales are flagged"}}
    B1[Run initial_flagging.py]
    C1[Flag sales as outliers or non-outliers<br>with Version = 1]
    D1[Join flags to<br>default.vw_pin_sale]
    E1[Save results to S3 with unique run ID]

    A1 -->|Initial setup| B1
    B1 -->|Flag sales| C1
    C1 -->|Store flags| D1
    D1 -->|Persist results| E1
end

style A1 fill:#bbf,stroke:#333,stroke-width:2px,color:#000;
style A4 fill:#bbf,stroke:#333,stroke-width:2px,color:#000;
style A3 fill:#bbf,stroke:#333,stroke-width:2px,color:#000;

```

Rolling window

The flagging model uses group means to determine the statistical deviation of sales, and flags them beyond a certain threshold. Group means are constructed using a rolling window strategy.

The current implementation uses a 12-month rolling window: for any sale, the "group" contains all sales from the same month, along with all sales from the previous 11 months. This 12-month window can be changed by editing the configuration files manual_flagging/yaml/ and main.tf. Additional notes on the rolling window implementation:

  • For each sale, the window contains every sale from the same month as the sale date, along with all sales from the previous N months, covering roughly one year of data.
  • This process starts with an .explode() call. Example here.
  • It ends by subsetting to the original_observation data. Example here.
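The explode-based construction might look roughly like the following sketch (not the repository's code; column names and helper logic are assumptions). Each sale is duplicated into the 12 monthly windows it contributes to, and an `original_observation` flag recovers the one row per sale whose window matches its own month:

```python
import pandas as pd

def add_rolling_window_groups(df, months=12):
    """Duplicate each sale into every monthly window it belongs to, so a
    window's group covers its own month plus the previous `months - 1`
    months. Column names are assumed, not taken from the repository."""
    sale_month = df["sale_date"].dt.to_period("M")
    df = df.assign(
        original_month=sale_month,
        # A sale in month m contributes to windows m, m+1, ..., m+11
        rolling_window=[
            list(pd.period_range(m, periods=months, freq="M"))
            for m in sale_month
        ],
    ).explode("rolling_window")
    # Recover the one row per sale whose window is its own month
    df["original_observation"] = df["rolling_window"] == df["original_month"]
    return df

sales = pd.DataFrame({
    "doc_num": ["a", "b"],
    "sale_date": pd.to_datetime(["2020-01-15", "2020-06-20"]),
})
exploded = add_rolling_window_groups(sales)
```

After grouping by `rolling_window` to compute window statistics, subsetting back to `original_observation` rows leaves one flagged record per sale.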

Structure of Data

All flagging runs populate 3 Athena tables with metadata, flag results, and other information. These tables can be used to determine why an individual sale was flagged as an outlier. The structure of the tables is:

```mermaid
erDiagram
    flag }|--|| metadata : describes
    flag }|--|{ parameter : describes
    flag }|--|{ group_mean : describes

flag {
    string meta_sale_document_num PK
    bigint meta_sale_price_original
    date rolling_window
    boolean sv_is_outlier
    boolean sv_is_ptax_outlier
    boolean ptax_flag_original
    boolean sv_is_heuristic_outlier
    string sv_outlier_reason1
    string sv_outlier_reason2
    string sv_outlier_reason3
    string group
    string run_id FK
    bigint version PK
}

metadata {
    string run_id PK
    string long_commit_sha
    string short_commit_sha
    string run_timestamp
    string run_type
    string run_note 
}

parameter {
    string run_id PK
    bigint sales_flagged
    timestamp earliest_data_ingest
    timestamp latest_data_ingest
    string run_filter
    string iso_forest_cols
    string stat_groups
    string sales_to_write_filter
    arraydouble dev_bounds
    arraydouble ptax_sd
    bigint rolling_window
    string time_frame
    bigint short_term_owner_threshold
    bigint min_group_threshold
}

group_mean {
    string group PK
    double group_mean
    double group_std
    double group_sqft_std
    double group_sqft_mean
    bigint group_size
    string run_id PK
}

```
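To trace why an individual sale was flagged, the three tables can be joined on `run_id` (and `group` for the group statistics). The query below is a sketch: the `sale` schema matches the `sale.flag` table mentioned elsewhere in this README, but the `metadata` and `group_mean` table names are taken from the diagram and are assumptions; running it requires an Athena connection (e.g. via pyathena), not shown here.

```python
# Diagnostic query joining flag results to run metadata and group stats.
# "group" is double-quoted because it is a reserved word in Athena SQL.
QUERY = """
SELECT
    flag.meta_sale_document_num,
    flag.sv_is_outlier,
    flag.sv_outlier_reason1,
    gm.group_mean,
    gm.group_std,
    meta.short_commit_sha,
    meta.run_timestamp
FROM sale.flag AS flag
LEFT JOIN sale.metadata AS meta
    ON flag.run_id = meta.run_id
LEFT JOIN sale.group_mean AS gm
    ON flag.run_id = gm.run_id
    AND flag."group" = gm."group"
WHERE flag.meta_sale_document_num = '<document-number>'
"""
```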

AWS Glue Job Documentation

This repository manages the configurations, scripts, and details for an AWS Glue Job. It's essential to maintain consistency and version control for all changes related to the job. Therefore, specific procedures have been established.

⚠️ Important guidelines

  1. DO NOT modify the Glue job script, its associated flagging Python script, or any of its job details directly via the AWS Console.
  2. All changes to these components should originate from this repository. This ensures that every modification is tracked and version-controlled.
  3. The only advisable actions in the AWS Console concerning this Glue job are:
    • Running the job
  4. To test a change to the Glue job script or the flagging script, make an edit on a branch and open a pull request. Our GitHub Actions configuration will deploy a staging version of your job, named z_ci_<your-branch-name>_sales_val_flagging, that you can run to test your changes. See the Modifying the Glue job section below for more details.

Modifying the Glue job, its flagging script, or its settings

The Glue job and its flagging script are written in Python, while the job details and settings are defined in a Terraform configuration file. These files can be edited to modify the Glue job script, its flagging script, or its job settings.

  1. Locate the desired files to edit:
    • Glue script: glue/sales_val_flagging.py
    • Flagging script: glue/flagging_script_glue/flagging.py
    • Job details/settings: main.tf, under the resource block aws_glue_job.sales_val_flagging (see the Terraform AWS provider docs for details)
  2. Any changes to these files should be made in the following sequence:
    • Make a new git branch for your changes.
    • Edit the files as necessary.
    • Open a pull request for your changes against the main branch. A GitHub Actions workflow called deploy-terraform will deploy a staging version of your job named z_ci_<your-branch-name>_sales_val_flagging that you can run to test your changes.
      • By default, this configuration will deploy an empty version of the sale.flag table, which simulates an environment in which there are no preexisting flags prior to a run.
      • If you would like to test your job against a subset of the production data, copy your data subset from the production job bucket to the bucket created by Terraform for your job (or leave the new bucket empty to simulate running the job when no flags exist). Then, run the crawler created by Terraform for your PR in order to populate the staging version of the sale.flag database that your staging job uses. If you're having trouble finding your staging bucket, job, or crawler, check the GitHub Actions output for the first successful run of your PR and look for the Terraform output displaying the IDs of these resources.
    • If you need to make further changes, push commits to your branch and GitHub Actions will deploy the changes to the staging job and its associated resources.
    • Once you're happy with your changes, request review on your PR.
    • Once your PR is approved, merge it into main. A GitHub Actions workflow called cleanup-terraform will delete the staging resources that were created for your branch, while a separate deploy-terraform run will deploy your changes to the production job and its associated resources.

Exporting Flags to iasWorld

Use the scripts/export.py script to generate a CSV that can be uploaded to iasWorld to save new flags.

Example use:

```
python3 scripts/export.py > sales_val_flags.csv
```

The resulting sales_val_flags.csv file can then be uploaded to iasWorld.

Owner

  • Name: Cook County Assessor's Office
  • Login: ccao-data
  • Kind: organization
  • Email: assessor.data@cookcountyil.gov

GitHub Events

Total
  • Issues event: 8
  • Delete event: 5
  • Issue comment event: 5
  • Push event: 33
  • Pull request review event: 22
  • Pull request review comment event: 17
  • Pull request event: 13
  • Create event: 9
Last Year
  • Issues event: 8
  • Delete event: 5
  • Issue comment event: 5
  • Push event: 33
  • Pull request review event: 22
  • Pull request review comment event: 17
  • Pull request event: 13
  • Create event: 9

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 70
  • Total pull requests: 79
  • Average time to close issues: 26 days
  • Average time to close pull requests: 10 days
  • Total issue authors: 5
  • Total pull request authors: 4
  • Average comments per issue: 0.64
  • Average comments per pull request: 0.42
  • Merged pull requests: 67
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 6
  • Pull requests: 11
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 0.17
  • Average comments per pull request: 0.18
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • wagnerlmichael (36)
  • jeancochrane (17)
  • dfsnow (8)
  • wrridgeway (1)
  • ccao-jardine (1)
Pull Request Authors
  • wagnerlmichael (76)
  • jeancochrane (16)
  • Damonamajor (5)
  • dfsnow (3)
Top Labels
Issue Labels
enhancement (7) bug (2) documentation (1) blocked (1)
Pull Request Labels
documentation (1)

Dependencies

.github/workflows/pre-commit.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
manual_flagging/requirements.txt pypi
  • PyYAML ==6.0.1
  • Random-Word ==1.0.11
  • awswrangler ==3.3.0
  • boto3 ==1.28.26
  • botocore ==1.31.26
  • certifi ==2023.7.22
  • charset-normalizer ==3.2.0
  • exceptiongroup ==1.1.3
  • fsspec ==2023.6.0
  • idna ==3.4
  • iniconfig ==2.0.0
  • jmespath ==1.0.1
  • joblib ==1.3.2
  • numpy ==1.24.3
  • packaging ==23.1
  • pandas ==2.0.2
  • pluggy ==1.2.0
  • pyarrow ==12.0.1
  • pyathena ==3.0.6
  • pytest ==7.4.0
  • python-dateutil ==2.8.2
  • pytz ==2022.1
  • requests ==2.31.0
  • s3transfer ==0.6.1
  • scikit-learn ==1.3.0
  • scipy ==1.11.1
  • six ==1.16.0
  • tenacity ==8.2.3
  • threadpoolctl ==3.2.0
  • tomli ==2.0.1
  • typing_extensions ==4.7.1
  • tzdata ==2023.3
  • urllib3 ==1.26.16
.github/actions/setup-terraform/action.yaml actions
  • aws-actions/configure-aws-credentials v4 composite
  • hashicorp/setup-terraform v2 composite
.github/workflows/cleanup-terraform.yaml actions
  • ./.github/actions/setup-terraform * composite
  • actions/checkout v3 composite
.github/workflows/deploy-terraform.yaml actions
  • ./.github/actions/setup-terraform * composite
  • actions/checkout v3 composite