Postgres (Stable)

Coming soon

PostgreSQL Profiling

PostgreSQL Profiling is a feature that gathers detailed statistics at both the table and column levels within a PostgreSQL database. It is designed to provide insights into the performance and characteristics of the data stored in the database.

1. Table and Column Level Statistics: PostgreSQL Profiling collects statistics not only on tables but also on individual columns within those tables. This granularity allows you to analyze the data in a more detailed manner.

2. SQL-Based Profiler: The profiler is SQL-based: it gathers statistics by running SQL queries against the database. It is not a standalone tool; it is enabled for specific SQL-based data sources within your PostgreSQL environment.

3. Impact on Ingestion Runs: Enabling profiling comes with a trade-off. It provides valuable insights, but it can slow down ingestion runs, because the profiling queries add workload to the database and can affect the overall performance of ingestion jobs.

Caution: Profiling a large number of tables or very large datasets can result in substantial expenses. Although the profiler's queries have been optimized for resource usage, carefully consider which tables you enable profiling on and how often profiling runs.
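As a rough illustration of what SQL-based profiling involves, the sketch below issues plain aggregate queries to collect a row count and per-column null and distinct counts. It is not the profiler's actual implementation: the `profile_table` helper, the `orders` table, and the use of an in-memory SQLite database as a stand-in for PostgreSQL are all assumptions made so the example runs without a server.

```python
import sqlite3


def profile_table(conn, table, columns):
    """Collect basic table and column statistics with aggregate SQL queries.

    Hypothetical helper for illustration only. Identifiers are interpolated
    directly into the SQL text, so it must only be used with trusted names.
    """
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    profile = {"row_count": row_count, "column_count": len(columns), "columns": {}}
    for col in columns:
        # COUNT(col) skips NULLs; COUNT(DISTINCT col) counts unique non-NULL values.
        non_null, distinct = conn.execute(
            f'SELECT COUNT("{col}"), COUNT(DISTINCT "{col}") FROM "{table}"'
        ).fetchone()
        null_count = row_count - non_null
        profile["columns"][col] = {
            "null_count": null_count,
            "null_proportion": null_count / row_count if row_count else None,
            "distinct_count": distinct,
            # Assumption: the distinct proportion is taken over non-null values.
            "distinct_proportion": distinct / non_null if non_null else None,
        }
    return profile


# In-memory SQLite stands in for a PostgreSQL connection in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "new"), (2, "paid"), (3, None), (4, "paid")],
)
result = profile_table(conn, "orders", ["id", "status"])
print(result)
```

In this sketch every profiled column costs one additional query on top of the table-level count, which hints at why profiling many tables or very wide tables adds noticeable load.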

Capabilities Explained:

Row and Column Counts:
- Row Counts: The profiling tool can extract the total number of rows in each table, providing an overview of the data volume.
- Column Counts: It also captures the count of columns within each table, helping you understand the data structure.
Column-Level Information:
- Null Counts and Proportions: For each column, the tool identifies how many values are null (missing) and calculates the proportion of null values, which is crucial for data quality assessment.
- Distinct Counts and Proportions: It determines the number of unique values in each column and calculates the proportion of distinct values, offering insights into data cardinality.
- Statistical Measures: The tool computes various statistical measures for applicable columns, including the minimum, maximum, mean, median, standard deviation, and select quantile values. These statistics reveal the central tendency and variability of the data.
- Histograms or Frequencies: It can generate histograms or frequency distributions of unique values within columns. These visual representations provide a clearer understanding of data distribution, aiding in data profiling and analysis.

These capabilities enable data professionals and compliance teams to comprehensively analyze and profile their data, uncovering valuable insights such as data completeness, data quality, and the distribution of values. This information is essential for tasks like data cleansing, data modeling, and data exploration, ensuring that data-driven decisions are based on a solid understanding of the underlying data.
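The statistical measures and frequency distributions listed above can be illustrated with Python's standard library. This is only a sketch on an in-memory list with hypothetical helper names (`numeric_summary`, `value_frequencies`) and made-up sample values; a SQL-based profiler would normally push these calculations into database queries instead.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Summarize the non-null numeric values of one column.

    Hypothetical helper; expects at least two non-null values so that the
    sample standard deviation is defined.
    """
    data = sorted(v for v in values if v is not None)
    return {
        "min": data[0],
        "max": data[-1],
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
        "quartiles": statistics.quantiles(data, n=4),  # three cut points
    }


def value_frequencies(values):
    """Count how often each non-null value occurs (a simple frequency table)."""
    return Counter(v for v in values if v is not None)


# Made-up column values, including one NULL represented as None.
amounts = [10, 20, 20, None, 30, 40]
summary = numeric_summary(amounts)
freq = value_frequencies(amounts)
print(summary)
print(freq.most_common())
```

Nulls are dropped before the measures are computed, so null counts need to be reported separately, as in the column-level list above.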

Supported Sources

SQL profiling is generally supported for SQL sources. Check the individual source page to verify that a specific source supports profiling.

Field (type): Description

- host_port: host URL
- database (string): database (catalog). If set to Null, all databases will be considered for ingestion.
- database_alias (string): [Deprecated] Alias to apply to database when ingesting.
- include_table_location_lineage (boolean): If the source supports it, include table lineage to the underlying storage location. Default: True
- include_tables (boolean): Whether tables should be ingested. Default: True
- include_view_lineage (boolean): Include table lineage for views. Default: False
- include_views (boolean): Whether views should be ingested. Default: True
- initial_database (string): Initial database used to query for the list of databases, when ingesting multiple databases. Note: this is not used if database or sqlalchemy_uri are provided. Default: postgres
- options (object): Any options specified here will be passed to SQLAlchemy.create_engine as kwargs.
- password (string(password)): password
- platform_instance (string): The instance of the platform that all assets produced by this recipe belong to.
- scheme (string): database scheme. Default: postgresql+psycopg2
- sqlalchemy_uri (string): URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.
- username (string): username
- env (string): The environment that all assets produced by this connector belong to. Default: PROD
- database_pattern (AllowDenyPattern): Regex patterns for databases to filter in ingestion. Note: this is not used if database or sqlalchemy_uri are provided. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
- database_pattern.allow (array(string))
- database_pattern.deny (array(string))
- database_pattern.ignoreCase (boolean)
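To make the connection and filtering fields above more concrete, here is a minimal sketch of how such a configuration might be interpreted. Both helpers (`resolve_uri`, `database_allowed`) are hypothetical and not part of the connector. The sketch assumes that deny patterns win over allow patterns, that patterns are matched from the start of the name, and that the connection falls back to `initial_database` when no `database` is set. The host, user name, and password are placeholders.

```python
import re
from urllib.parse import quote_plus


def resolve_uri(config):
    """Build a SQLAlchemy-style database URL from the connection fields.

    sqlalchemy_uri, when present, takes precedence over the other fields.
    """
    if config.get("sqlalchemy_uri"):
        return config["sqlalchemy_uri"]
    scheme = config.get("scheme", "postgresql+psycopg2")
    # Escape special characters so they do not break the URL syntax.
    user = quote_plus(config["username"])
    password = quote_plus(config["password"])
    # Assumption: without an explicit database, connect to initial_database.
    database = config.get("database") or config.get("initial_database", "postgres")
    return f"{scheme}://{user}:{password}@{config['host_port']}/{database}"


def database_allowed(name, allow=(".*",), deny=(), ignore_case=True):
    """Apply allow/deny regex patterns to a database name.

    Assumption: a name is rejected if any deny pattern matches; otherwise it
    is accepted if any allow pattern matches.
    """
    flags = re.IGNORECASE if ignore_case else 0
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


# Placeholder values for illustration only.
config = {"host_port": "localhost:5432", "username": "reader", "password": "p@ss/word"}
uri = resolve_uri(config)
print(uri)
print(database_allowed("sales_tmp", deny=[".*_tmp"]))
```

Escaping the user name and password matters because characters such as `@` or `/` would otherwise be read as URL delimiters.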