Conventions
===========

**finagg** has a number of conventions around package organization, data
organization, and data normalization. Understanding these conventions makes
**finagg** a bit more ergonomic. This page covers those conventions.

Import Conventions
------------------

Following **finagg**'s import conventions guarantees updates to **finagg**
won't break your code. Although definitions may shift around or
**finagg**'s organization may change slightly, your code won't be affected
so long as you follow the import conventions. On top of this benefit, the
import conventions simply make **finagg** easier to use.

**finagg** is designed to be imported once at the highest module:

>>> import finagg  # doctest: +SKIP

Subpackages and submodules are usually accessed through their fully
qualified names from the top-level module:

>>> finagg.bea.api  # doctest: +SKIP
>>> finagg.fred.api  # doctest: +SKIP
>>> finagg.sec.api  # doctest: +SKIP

It's also common for subpackages to be imported using their name as an alias:

>>> import finagg.bea as bea
>>> import finagg.fred as fred
>>> import finagg.sec as sec

Package Organization
--------------------

**finagg** is organized according to API implementations, SQL table
definitions, and feature definitions.
As such, each subpackage has up to three submodules within it:

* an ``api`` module that implements the subpackage's API (if one exists)
* a ``sql`` module that defines SQL tables for organizing aggregated data
  along with utility functions for common SQL queries (if there's enough
  data to warrant SQL tables)
* a ``feat`` module that defines features aggregated from the ``api`` and
  ``sql`` submodules (if the data would benefit from special queries or
  normalization)

As an example, :mod:`finagg.sec` contains all three submodules because of
the complexity of its API, its data, and its features:

* :mod:`finagg.sec.api` for implementing the SEC EDGAR API
* :mod:`finagg.sec.sql` for defining SQL tables around raw SEC EDGAR API
  data, refined SEC EDGAR API data, and helper functions that replicate
  some SEC EDGAR API methods using data from the SQL tables
* :mod:`finagg.sec.feat` for defining features and helper methods for
  constructing those features

API Implementations
-------------------

APIs are implemented as singleton class instances within ``api`` submodules.
Each singleton has a ``get`` method for accessing data from API endpoints.
Some API implementations include class attributes that define API metadata
(such as URLs or endpoint names), while other API implementations include
helper methods for navigating the APIs. The design of each API
implementation is based on the reference API that's being implemented.

As an example, the BEA API, implemented by :mod:`finagg.bea.api`, contains
a singleton :data:`finagg.bea.api.gdp_by_industry` with an attribute
:attr:`~finagg.bea.api.API.name` that describes the BEA API database that
the singleton refers to.
In addition, the singleton has methods
:meth:`~finagg.bea.api.API.get_parameter_list` and
:meth:`~finagg.bea.api.API.get_parameter_values` for getting API parameters
and API parameter value options, respectively, while
:meth:`~finagg.bea.api.GDPByIndustry.get` is the actual method for
retrieving data from the API.

>>> finagg.bea.api.gdp_by_industry.name
'GdpByIndustry'
>>> finagg.bea.api.gdp_by_industry.get_parameter_list()  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
  ParameterName ParameterDataType                               ParameterDescription ...
0     Frequency            string                            A - Annual, Q-Quarterly ...
1      Industry            string       List of industries to retrieve (ALL for All) ...
2       TableID           integer  The unique GDP by Industry table identifier (A... ...
3          Year           integer  List of year(s) of data to retrieve (ALL for All) ...
>>> finagg.bea.api.gdp_by_industry.get_parameter_values("TableID").head(5)  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
  Key                                               Desc
0   1                    Value Added by Industry (A) (Q)
1   5  Value added by Industry as a Percentage of Gro...
2   6          Components of Value Added by Industry (A)
3   7  Components of Value Added by Industry as a Per...
4   8  Chain-Type Quantity Indexes for Value Added by...

Other implemented APIs, such as the SEC EDGAR API implemented by
:mod:`finagg.sec.api`, don't have as many helper methods and are barebone
implementations.

Almost Everything is a Dataframe
--------------------------------

Dataframes are just too convenient to not use as the fundamental type
within **finagg**. Almost all objects returned by APIs and features are
dataframes.

Helper Methods for Inspecting Available Data
--------------------------------------------

Most submodules and singletons contain helper methods for getting sets of
IDs available through other methods. These methods are useful for verifying
that data has been installed properly or for selecting a subset of data for
further refinement.
Examples of these methods include:

* :meth:`finagg.fred.feat.Series.get_id_set` returns installed economic
  data series IDs
* :meth:`finagg.sec.api.get_ticker_set` returns all the tickers that have
  at least *some* data available through the SEC EDGAR API
* :meth:`finagg.sec.feat.Quarterly.get_ticker_set` returns all the tickers
  that have quarterly features available

Data Organization
-----------------

There are only a handful of conventions regarding data organization:

* Data returned by API implementations that is used by features typically
  has its own SQL table definitions. This is convenient for querying API
  data offline and for customizing features without having to repeatedly
  get data from APIs.
* Classes within ``feat`` submodules and SQL tables within ``sql``
  submodules are named similarly to indicate their relationship.
* Unaltered data from APIs are typically referred to as "raw" data while
  features are referred to as "refined" data. Refined data SQL tables
  typically have foreign key constraints on raw data SQL tables such that
  refined rows are deleted when raw rows with the same primary key are
  deleted.

Data Normalization
------------------

Data returned by API implementations is not normalized or standardized
beyond type casting and column renaming. However, data returned by feature
implementations is normalized depending on the nature of the data. The
general rules implemented for data normalization are as follows:

* Data whose scale drifts over time or is not easily normalizable through
  other means (e.g., gross domestic product, company stock price, etc.) is
  converted to log changes. Since the log change of the first sample in a
  series cannot be computed and is NaN, it is dropped from the series.
* Data gaps and/or NaNs are forward-filled with the previous non-NaN value.
  If the series being forward-filled is a log change series, then gaps
  and/or NaNs are replaced with zeros instead (indicating that no change
  occurs).
* Inf values are replaced with NaNs and forward-filled with the same logic
  as the previous bullet.
* Dataframe indices are always based on some time unit. When an index has
  multiple levels (e.g., features returned by
  :data:`finagg.sec.feat.quarterly`), the levels are ordered from least
  granular to most granular (e.g., year -> quarter -> date). Indices are
  always sorted.

Feature Method Naming
---------------------

Features are aggregations or collections of raw and/or refined data that
are ready for ingestion by another process. Features can be aggregated from
APIs, local SQL tables, or combinations of both. Features generally can be
aggregated by more than one method, and a method's name determines where
the feature is aggregated from. A feature's aggregation sources imply
properties associated with instantiating and maintaining the feature. For
example, if a feature is aggregated directly from an API, then the feature
is likely not being stored locally, saving a bit of disk space.

Feature aggregation methods are named according to where the features are
being aggregated from to clarify the implications associated with the
methods:

* A ``from_api`` method implies the feature is aggregated directly from API
  calls. It's best to reserve ``from_api`` for experimentation.
* A ``from_raw`` method implies the feature is aggregated from local raw
  SQL tables. No extra storage space is being used to store the completely
  refined features; only already-stored raw data is being used to aggregate
  the features.
* A ``from_refined`` method implies the feature is aggregated from local
  refined SQL tables. This is likely the fastest method for accessing a
  feature, but at the cost of additional disk usage. Disk usage can be
  significant and adds up quickly depending on the number of time series
  being stored.
* A ``from_other_refined`` method implies the feature is aggregated from
  local refined SQL tables outside of the feature's subpackage. This is
  likely preferable over ``from_refined`` when it's available as it uses
  significantly less storage with little loss in speed.
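
As a rough illustration of the rules listed under Data Normalization, the
sketch below applies them to small pandas series. It is not **finagg**'s
implementation: ``to_log_change`` and ``fill_levels`` are hypothetical helper
names, the sample values are made up, and treating a zero-valued sample as a
source of Inf log changes is an assumption about how those rules combine.

```python
import numpy as np
import pandas as pd


def to_log_change(series: pd.Series) -> pd.Series:
    """Hypothetical helper (not finagg's API): convert levels to log changes."""
    s = series.sort_index()  # indices are always sorted
    # log(x_t) - log(x_{t-1}); silence NumPy warnings from log(0).
    with np.errstate(divide="ignore", invalid="ignore"):
        changes = np.log(s).diff()
    changes = changes.iloc[1:]  # the first log change is NaN, so drop it
    changes = changes.replace([np.inf, -np.inf], np.nan)  # Inf -> NaN
    return changes.fillna(0.0)  # NaN in a log change series means "no change"


def fill_levels(series: pd.Series) -> pd.Series:
    """Hypothetical helper (not finagg's API): forward-fill a level series."""
    s = series.sort_index()
    s = s.replace([np.inf, -np.inf], np.nan)  # Inf -> NaN
    return s.ffill()  # fill gaps with the previous non-NaN value


# Made-up sample data on a quarterly, time-based index.
dates = pd.to_datetime(["2020-03-31", "2020-06-30", "2020-09-30", "2020-12-31"])

prices = pd.Series([100.0, 110.0, 0.0, 121.0], index=dates)
changes = to_log_change(prices)  # the zero produces -Inf, then +Inf, changes

rates = pd.Series([1.5, np.nan, np.inf, 2.0], index=dates)
levels = fill_levels(rates)

print(changes)
print(levels)
```

In this sketch the log change series has one fewer row than its input because
the first sample is dropped, while the forward-filled level series keeps every
row.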