
opm

Office of Policy and Management dataset

The original database is available from CT's Office of Policy and Management (OPM):

https://portal.ct.gov/OPM/IGPP/Publications/Real-Estate-Sales-Listing

The data contains 985,862 observations of up to 14 variables.

fetch_opm(*, data_home=None, download_if_missing=True, return_X_y=False, include_numeric=True, include_categorical=False)

Fetch the OPM dataset from the Connecticut Open Data portal.

Parameters:

    data_home : str or pathlib.Path, default=None
        Specify another download and cache folder for the datasets. By default
        all spotriver data is stored in '~/spotriver_data' subfolders.

    download_if_missing : bool, default=True
        If False, raise an IOError if the data is not locally available rather
        than trying to download it from the source site.

    return_X_y : bool, default=False
        If True, return (X, y) instead of a Bunch object. See sklearn.utils.Bunch
        for more information.

    include_numeric : bool, default=True
        If True, include the numeric columns 'List Year', 'Assessed Value',
        'Sale Amount', 'Sales Ratio', 'lat', 'lon', and 'timestamp_rec' in the
        output.

    include_categorical : bool, default=False
        If True, include the categorical columns 'Town', 'Address',
        'Property Type', 'Residential Type', 'Non Use Code', 'Assessor Remarks',
        and 'OPM remarks' in the output. Columns with fewer than 200 unique
        values are stored using the pandas 'category' dtype to save space.
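These flags combine independently. A minimal sketch of controlling the cache location and offline behavior; the cache directory below is a hypothetical example, not a spotriver default:

from pathlib import Path
from spotriver.data import fetch_opm

cache_dir = Path.home() / "my_spotriver_cache"  # hypothetical cache folder

# The first call downloads the CSV into `cache_dir`; later calls reuse it.
opm = fetch_opm(data_home=cache_dir)

# With download_if_missing=False, a missing cache raises IOError instead
# of triggering a download.
try:
    fetch_opm(data_home=cache_dir, download_if_missing=False)
except IOError:
    print("OPM data not cached locally")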

Returns:

    Bunch or Tuple[pd.DataFrame, pd.Series]:
        If return_X_y is False, return a Bunch object with the attributes
        data (a pd.DataFrame of shape (n_samples, n_features) holding the
        feature matrix) and target (a pd.Series of shape (n_samples,) holding
        the target vector, i.e. the 'Sale Amount' column). If return_X_y is
        True, return a tuple (X, y) where X is the feature matrix and y is
        the target vector.
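A short sketch of consuming both return shapes; the attribute names follow the Bunch contract described above:

from spotriver.data import fetch_opm

# Default: a Bunch whose data/target attributes hold features and target.
opm = fetch_opm()
print(opm.data.shape, opm.target.shape)

# return_X_y=True: unpack the feature matrix and target directly.
X, y = fetch_opm(return_X_y=True)
assert len(X) == len(y)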

Examples:

>>> from spotriver.data import fetch_opm
>>> # Fetch the OPM dataset with the default settings (returns a Bunch)
>>> opm_data = fetch_opm()
>>> # Include only categorical columns and return a Bunch object
>>> opm_cat = fetch_opm(include_numeric=False, include_categorical=True, return_X_y=False)
>>> # Include numeric and categorical columns and return (X, y)
>>> X, y = fetch_opm(include_categorical=True, return_X_y=True)
Source code in spotriver/data/opm.py
def fetch_opm(
    *,
    data_home: Union[str, Path] = None,
    download_if_missing: bool = True,
    return_X_y: bool = False,
    include_numeric: bool = True,
    include_categorical: bool = False,
) -> Union[Tuple[pd.DataFrame, pd.Series], Bunch]:
    """Fetch the OPM dataset from the Connecticut Open Data portal.

    Args:
        data_home : str or pathlib.Path, default=None
            Specify another download and cache folder for the datasets. By default
            all spotriver data is stored in '~/spotriver_data' subfolders.
        download_if_missing : bool, default=True
            If False, raise an IOError if the data is not locally available
            rather than trying to download the data from the source site.
        return_X_y : bool, default=False
            If True, return `(X, y)` instead of a `Bunch` object. See
            `sklearn.utils.Bunch` for more information.
        include_numeric : bool, default=True
            If True, include numeric columns in the output. Numeric columns include
            'List Year', 'Assessed Value', 'Sale Amount', 'Sales Ratio', 'lat', 'lon',
            and 'timestamp_rec'.
        include_categorical : bool, default=False
            If True, include categorical columns in the output. Categorical columns
            include 'Town', 'Address', 'Property Type', 'Residential Type',
            'Non Use Code', 'Assessor Remarks', and 'OPM remarks'. Columns with
            fewer than 200 unique values are stored using the pandas 'category'
            dtype to save space.

    Returns:
        Bunch or Tuple[pd.DataFrame, pd.Series]:
            If `return_X_y` is False, return a `Bunch` object with the attributes
            `data` (a pd.DataFrame of shape (n_samples, n_features) holding the
            feature matrix) and `target` (a pd.Series of shape (n_samples,)
            holding the target vector, i.e. the 'Sale Amount' column).
            If `return_X_y` is True, return a tuple `(X, y)` where `X` is the
            feature matrix and `y` is the target vector.

    Examples:
        >>> from spotriver.data import fetch_opm
        >>> # Fetch the OPM dataset with the default settings (returns a Bunch)
        >>> opm_data = fetch_opm()
        >>> # Include only categorical columns and return a Bunch object
        >>> opm_cat = fetch_opm(include_numeric=False, include_categorical=True, return_X_y=False)
        >>> # Include numeric and categorical columns and return (X, y)
        >>> X, y = fetch_opm(include_categorical=True, return_X_y=True)
    """
    filename = get_data_home(data_home=data_home) / "opm_2001-2020.csv"
    if not filename.is_file():
        if not download_if_missing:
            raise IOError("Data not found and `download_if_missing` is False")
        logger.info(f"Downloading OPM dataset to '{filename}'.")
        urlretrieve(url=OPM_URL, filename=filename)
    # FIXME: Add hash check for download.

    df = pd.read_csv(filename, dtype=OPM_DTYPE, parse_dates=["Date Recorded"])

    # Collect rows (observations) we want to keep and subset only once.
    #
    # This might look kind of ugly but is much more efficient than making copy
    # after copy of the (largish) data frame `df`.
    idx = (
        (df["Date Recorded"] >= "2001-09-30")
        & (df["Assessed Value"] >= 2000)
        & (df["Assessed Value"] <= 1e8)
        & (df["Sale Amount"] >= 2000)
        & (df["Sale Amount"] <= 2e8)
    )
    logger.debug(f"Removing {len(idx) - idx.sum()} rows for constraint violations.")

    # Now keep only those rows which we selected with `idx`, sort the values by
    # the date on which they were recorded and then reset the index.
    df = df.loc[idx].sort_values(by="Date Recorded").reset_index(drop=True)

    cols = []
    if include_numeric:
        # Extract latitude and longitude from the Location field.
        # Converting to float32 loses precision, so keep float64.
        df[["lon", "lat"]] = df["Location"].str.extract(r"POINT \((-?\d+\.\d+) (-?\d+\.\d+)\)").astype("float")

        # Check if points are inside the bounding box for CT.
        # Bounding box taken from https://anthonylouisdagostino.com/bounding-boxes-for-all-us-states/
        outside_bbox = (
            (df["lon"] < -73.727775) | (df["lon"] > -71.786994) | (df["lat"] < 40.980144) | (df["lat"] > 42.050587)
        )
        df.loc[outside_bbox, ["lon", "lat"]] = np.nan
        logger.debug(f"Found {outside_bbox.sum()} locations outside of CT's bounding box.")

        # Convert to smaller dtypes to save some space.
        df["List Year"] = df["List Year"].astype("int16")

        # Add a timestamp column by converting `Date Recorded` to nanoseconds
        # since the epoch and integer-dividing by 10**9 to get seconds.
        df["timestamp_rec"] = df["Date Recorded"].astype("int64") // 10**9

        # Converting `Assessed Value` to float32 would change 159 values, so
        # downcast to int32 instead.
        df["Assessed Value"] = df["Assessed Value"].astype("int32")

        # Converting `Sale Amount` to int32/float32 changes 175/222 values,
        # so leave its dtype unchanged.
        # df["Sale Amount"] = df["Sale Amount"].astype("int32")

        cols.extend(["List Year", "Assessed Value", "Sale Amount", "Sales Ratio", "lat", "lon", "timestamp_rec"])

    # FIXME: We probably want to invest some time into deriving more meaningful
    # categorical variables from some of these. Especially the remarks columns
    # would benefit from a BoW approach, and the Address column really carries
    # very little information.
    if include_categorical:
        categorical_columns = [
            "Town",
            "Address",
            "Property Type",
            "Residential Type",
            "Non Use Code",
            "Assessor Remarks",
            "OPM remarks",
        ]
        cols.extend(categorical_columns)

        for cat_col in categorical_columns:
            df[cat_col] = df[cat_col].fillna("Unknown")
            # If there are fewer than 200 unique values, convert to "category"
            # instead of storing as strings to save space.
            if df[cat_col].nunique() < 200:
                df[cat_col] = df[cat_col].astype("category")

    if len(cols) == 0:
        raise ValueError("No columns selected. Did you set both `include_numeric` and `include_categorical` to False?")

    X = df[cols].copy()
    # `Sale Amount` is the regression target; drop it from the features when
    # numeric columns are included (it is absent from `cols` otherwise).
    y = df["Sale Amount"]
    if "Sale Amount" in X.columns:
        X = X.drop(columns="Sale Amount")

    if return_X_y:
        return (X, y)
    return Bunch(data=X, target=y)
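The FIXME in the download step above notes a missing integrity check. A minimal sketch of what a SHA-256 verification after urlretrieve could look like; the expected digest below is a placeholder, not the real checksum of the OPM CSV:

import hashlib
from pathlib import Path

# Placeholder digest -- replace with the published checksum of the CSV.
EXPECTED_SHA256 = "0" * 64

def verify_sha256(path: Path, expected: str) -> None:
    """Raise IOError if the SHA-256 digest of `path` does not match `expected`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected:
        raise IOError(f"Checksum mismatch for {path}: got {digest.hexdigest()}")

# Usage right after the download in fetch_opm:
# verify_sha256(filename, EXPECTED_SHA256)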