Metadata Operations Guide ========================== MatrixOne provides powerful metadata scanning capabilities through the `metadata_scan` function, which allows you to analyze table statistics, column information, and data distribution. The MatrixOne Python client provides a convenient interface to access these capabilities. Overview -------- The metadata module provides the following key features: - **Table Metadata Scanning**: Analyze table structure and statistics - **Column Statistics**: Get detailed information about each column - **Row Count Analysis**: Count rows with optional tombstone filtering - **Size Information**: Analyze storage usage and data distribution - **Index Metadata**: Scan metadata for specific indexes - **Tombstone Analysis**: Analyze deleted data objects - **Structured Schema**: Fixed schema with 13 predefined columns - **Column Selection**: Choose specific columns to return - **Type Safety**: Strongly typed data structures for metadata rows - **Transaction Support**: Metadata operations within transactions Transaction-Aware Metadata Operations (Recommended) ---------------------------------------------------- Use ``client.session()`` for metadata analysis within transactions: Metadata Analysis in Transaction ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from matrixone import Client from matrixone.orm import Base, Column, Integer, String from sqlalchemy import select, insert, func client = Client() client.connect(database='test') # Analyze metadata within transaction with client.session() as session: # Insert test data session.execute( insert(User).values(name='Alice', email='alice@example.com') ) session.execute( insert(User).values(name='Bob', email='bob@example.com') ) # Scan metadata within same transaction metadata_rows = session.metadata.scan( dbname='test', tablename='users', columns='*' ) for row in metadata_rows: print(f"Column: {row.col_name}") print(f" Rows: {row.rows_cnt}") print(f" Nulls: {row.null_cnt}") print(f" Size: {row.origin_size}") # Get table statistics stats = session.metadata.get_table_brief_stats( dbname='test', tablename='users' ) print(f"Total rows: {stats['users']['row_cnt']}") # All operations commit together client.disconnect() Metadata-Driven Table Operations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from matrixone import Client from sqlalchemy import select, func client = Client() client.connect(database='test') # Use metadata to drive table operations with client.session() as session: # Scan table metadata metadata_rows = session.metadata.scan( dbname='test', tablename='large_table' ) # Analyze null counts high_null_columns = [ row.col_name for row in metadata_rows if row.null_cnt > row.rows_cnt * 0.5 # > 50% nulls ] if high_null_columns: print(f"Columns with high null rate: {high_null_columns}") # Optionally clean up or report for col in high_null_columns: # Could update, log, or alert pass # Get detailed statistics stats = session.metadata.get_table_brief_stats( dbname='test', tablename='large_table' ) # All analysis within transaction client.disconnect() Cross-Table Metadata Comparison ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from matrixone import Client from sqlalchemy import select client = Client() client.connect(database='test') # Compare metadata across tables with client.session() as session: # Scan multiple tables users_meta = session.metadata.scan(dbname='test', tablename='users') orders_meta = session.metadata.scan(dbname='test', tablename='orders') # Compare sizes users_size = sum(row.origin_size for row in users_meta) orders_size = sum(row.origin_size for row in orders_meta) print(f"Users table size: {users_size} bytes") print(f"Orders table size: {orders_size} bytes") # Get brief stats for both users_stats = session.metadata.get_table_brief_stats('test', 'users') orders_stats = session.metadata.get_table_brief_stats('test', 'orders') print(f"Users: {users_stats['users']['row_cnt']} rows") print(f"Orders: {orders_stats['orders']['row_cnt']} rows") # All operations atomic client.disconnect() Basic Usage ----------- Synchronous Operations ~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from matrixone import Client from matrixone.config import get_connection_kwargs # Connect to MatrixOne client = Client() client.connect(**get_connection_kwargs()) # Basic table metadata scan result = client.metadata.scan("mo_catalog", "mo_columns") rows = result.fetchall() for row in rows: print(f"Column: {row._mapping['col_name']}") print(f"Rows: {row._mapping['rows_cnt']}") print(f"Nulls: {row._mapping['null_cnt']}") print(f"Size: {row._mapping['origin_size']}") client.disconnect() Metadata Schema and Column Selection ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The metadata scan returns a fixed schema with 13 predefined columns. You can choose to return all columns as structured data or select specific columns. Available Columns ^^^^^^^^^^^^^^^^^ The metadata schema includes the following columns: .. code-block:: python from matrixone.metadata import MetadataColumn # Available columns print(MetadataColumn.COL_NAME.value) # 'col_name' print(MetadataColumn.OBJECT_NAME.value) # 'object_name' print(MetadataColumn.IS_HIDDEN.value) # 'is_hidden' print(MetadataColumn.OBJ_LOC.value) # 'obj_loc' print(MetadataColumn.CREATE_TS.value) # 'create_ts' print(MetadataColumn.DELETE_TS.value) # 'delete_ts' print(MetadataColumn.ROWS_CNT.value) # 'rows_cnt' print(MetadataColumn.NULL_CNT.value) # 'null_cnt' print(MetadataColumn.COMPRESS_SIZE.value) # 'compress_size' print(MetadataColumn.ORIGIN_SIZE.value) # 'origin_size' print(MetadataColumn.MIN.value) # 'min' print(MetadataColumn.MAX.value) # 'max' print(MetadataColumn.SUM.value) # 'sum' Structured Data with All Columns ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To get all columns as structured data objects: .. code-block:: python from matrixone.metadata import MetadataColumn # Get all columns as structured MetadataRow objects rows = client.metadata.scan("mo_catalog", "mo_columns", columns="*") for row in rows: print(f"Column: {row.col_name}") print(f"Rows: {row.rows_cnt}") print(f"Hidden: {row.is_hidden}") print(f"Size: {row.origin_size}") print(f"Min: {row.min}") print(f"Max: {row.max}") Selecting Specific Columns ^^^^^^^^^^^^^^^^^^^^^^^^^^ To get only specific columns: .. code-block:: python from matrixone.metadata import MetadataColumn # Get only column name and row count rows = client.metadata.scan("mo_catalog", "mo_columns", columns=[MetadataColumn.COL_NAME, MetadataColumn.ROWS_CNT]) for row in rows: print(f"Column: {row['col_name']}") print(f"Rows: {row['rows_cnt']}") # Or using string column names rows = client.metadata.scan("mo_catalog", "mo_columns", columns=['col_name', 'origin_size']) for row in rows: print(f"Column: {row['col_name']}") print(f"Size: {row['origin_size']}") Distinct Object Names ^^^^^^^^^^^^^^^^^^^^^ To get only distinct object names from metadata scan: .. code-block:: python from matrixone.metadata import MetadataColumn # Get distinct object names only rows = client.metadata.scan("mo_catalog", "mo_columns", distinct_object_name=True) for row in rows: print(f"Object: {row._mapping['object_name']}") # Get distinct object names with structured data rows = client.metadata.scan("mo_catalog", "mo_columns", distinct_object_name=True, columns="*") for row in rows: print(f"Object: {row.object_name}") Asynchronous Operations ~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import asyncio from matrixone import AsyncClient from matrixone.config import get_connection_kwargs async def main(): # Connect to MatrixOne client = AsyncClient() await client.connect(**get_connection_kwargs()) # Basic async table metadata scan (raw SQLAlchemy Result) result = await client.metadata.scan("mo_catalog", "mo_columns") rows = result.fetchall() for row in rows: print(f"Column: {row._mapping['col_name']}") print(f"Rows: {row._mapping['rows_cnt']}") print(f"Nulls: {row._mapping['null_cnt']}") print(f"Size: {row._mapping['origin_size']}") # Get structured data with all columns rows = await client.metadata.scan("mo_catalog", "mo_columns", columns="*") for row in rows: print(f"Column: {row.col_name}") print(f"Rows: {row.rows_cnt}") print(f"Hidden: {row.is_hidden}") print(f"Size: {row.origin_size}") # Get only specific columns rows = await client.metadata.scan("mo_catalog", "mo_columns", columns=['col_name', 'rows_cnt', 'origin_size']) for row in rows: print(f"Column: {row['col_name']}") print(f"Rows: {row['rows_cnt']}") print(f"Size: {row['origin_size']}") await client.disconnect() asyncio.run(main()) Metadata Scan Syntax -------------------- The `metadata_scan` function supports different syntax patterns for various use cases: Basic Table Scan ~~~~~~~~~~~~~~~~ .. code-block:: python # Scan all columns of a table result = client.metadata.scan("mo_catalog", "mo_columns") # Equivalent SQL: SELECT * FROM metadata_scan('mo_catalog.mo_columns', '*') Index-Specific Scan ~~~~~~~~~~~~~~~~~~~ .. code-block:: python # Scan specific index result = client.metadata.scan("mo_catalog", "mo_columns", indexname="index_name") # Equivalent SQL: SELECT * FROM metadata_scan('mo_catalog.mo_columns.?index_name', '*') Tombstone Scan ~~~~~~~~~~~~~~ .. code-block:: python # Scan tombstone objects result = client.metadata.scan("mo_catalog", "mo_columns", is_tombstone=True) # Equivalent SQL: SELECT * FROM metadata_scan('mo_catalog.mo_columns.#', '*') Index Tombstone Scan ~~~~~~~~~~~~~~~~~~~~ .. code-block:: python # Scan tombstone objects for specific index result = client.metadata.scan("mo_catalog", "mo_columns", indexname="index_name", is_tombstone=True) # Equivalent SQL: SELECT * FROM metadata_scan('mo_catalog.mo_columns.?index_name.#', '*') High-Level Methods ------------------ Table Brief Statistics ~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python # Get brief statistics for a table stats = client.metadata.get_table_brief_stats("mo_catalog", "mo_columns") for table_name, table_stats in stats.items(): print(f"Table: {table_name}") print(f" Total objects: {table_stats['total_objects']}") print(f" Total rows: {table_stats['row_cnt']}") print(f" Total nulls: {table_stats['null_cnt']}") print(f" Original size: {table_stats['original_size']}") print(f" Compressed size: {table_stats['compress_size']}") Table Detailed Statistics ~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python # Get detailed statistics for a table detail_stats = client.metadata.get_table_detail_stats("mo_catalog", "mo_columns") for table_name, table_details in detail_stats.items(): print(f"Table: {table_name}") for detail in table_details: print(f" Object: {detail['object_name']}") print(f" Created: {detail['create_ts']}") print(f" Rows: {detail['row_cnt']}, Nulls: {detail['null_cnt']}") print(f" Size: {detail['original_size']} -> {detail['compress_size']}") Transaction Operations ---------------------- Metadata operations can also be performed within transactions: .. code-block:: python with client.transaction() as tx: # Get table brief stats within transaction stats = tx.metadata.get_table_brief_stats("mo_catalog", "mo_columns") # Get table detailed stats within transaction detail_stats = tx.metadata.get_table_detail_stats("mo_catalog", "mo_columns") # Scan metadata within transaction result = tx.metadata.scan("mo_catalog", "mo_columns") Async Transaction Operations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python async with client.transaction() as tx: # Get table brief stats within async transaction stats = await tx.metadata.get_table_brief_stats("mo_catalog", "mo_columns") # Get table detailed stats within async transaction detail_stats = await tx.metadata.get_table_detail_stats("mo_catalog", "mo_columns") # Scan metadata within async transaction result = await tx.metadata.scan("mo_catalog", "mo_columns") Metadata Fields --------------- The metadata scan results contain the following fields: - **col_name**: Column name - **rows_cnt**: Number of rows in the column - **null_cnt**: Number of null values - **origin_size**: Original size in bytes - **min**: Minimum value (if applicable) - **max**: Maximum value (if applicable) - **sum**: Sum of values (if applicable) - **create_ts**: Creation timestamp - **is_tombstone**: Whether this is a tombstone object Example Use Cases ----------------- Database Analysis ~~~~~~~~~~~~~~~~~ .. code-block:: python def analyze_database(client, database_name): """Analyze all tables in a database""" # Get list of tables result = client.execute(f"SHOW TABLES FROM {database_name}") tables = [row[0] for row in result.rows] total_size = 0 total_rows = 0 for table in tables: stats = client.metadata.get_table_brief_stats(database_name, table) table_stats = stats.get(table, {}) total_size += table_stats.get('original_size', 0) total_rows += table_stats.get('row_cnt', 0) print(f"Table: {table}") print(f" Rows: {table_stats.get('row_cnt', 0)}") print(f" Size: {table_stats.get('original_size', 0)} bytes") print(f" Objects: {table_stats.get('total_objects', 0)}") print(f"\nDatabase Summary:") print(f"Total tables: {len(tables)}") print(f"Total rows: {total_rows}") print(f"Total size: {total_size} bytes") Storage Optimization ~~~~~~~~~~~~~~~~~~~~ .. code-block:: python def find_large_tables(client, database_name, size_threshold=1024*1024): """Find tables larger than threshold""" result = client.execute(f"SHOW TABLES FROM {database_name}") tables = [row[0] for row in result.rows] large_tables = [] for table in tables: stats = client.metadata.get_table_brief_stats(database_name, table) table_stats = stats.get(table, {}) table_size = table_stats.get('original_size', 0) if table_size > size_threshold: large_tables.append({ 'table': table, 'size': table_size, 'objects': table_stats.get('total_objects', 0) }) # Sort by size large_tables.sort(key=lambda x: x['size'], reverse=True) print("Large tables:") for table_info in large_tables: print(f" {table_info['table']}: {table_info['size']} bytes, {table_info['objects']} objects") Data Quality Analysis ~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python def analyze_data_quality(client, database_name, table_name): """Analyze data quality metrics""" result = client.metadata.scan(database_name, table_name) rows = result.fetchall() print(f"Data Quality Analysis for {table_name}:") for row in rows: column_name = row._mapping['col_name'] total_rows = row._mapping['rows_cnt'] null_count = row._mapping['null_cnt'] null_percentage = (null_count / total_rows * 100) if total_rows > 0 else 0 print(f" {column_name}:") print(f" Total rows: {total_rows}") print(f" Null values: {null_count} ({null_percentage:.2f}%)") print(f" Data completeness: {100 - null_percentage:.2f}%") Performance Monitoring ~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python def monitor_table_growth(client, database_name, table_name): """Monitor table growth over time""" import time while True: stats = client.metadata.get_table_brief_stats(database_name, table_name) table_stats = stats.get(table_name, {}) print(f"Table: {table_name}") print(f" Rows: {table_stats.get('row_cnt', 0)}") print(f" Size: {table_stats.get('original_size', 0)} bytes") print(f" Timestamp: {time.strftime('%Y-%m-%d %H:%M:%S')}") print("-" * 40) time.sleep(60) # Check every minute Best Practices -------------- 1. **Use Appropriate Methods**: Choose the right method for your use case: - Use `scan()` for raw metadata access and column-specific analysis - Use `get_table_brief_stats()` for quick table overview - Use `get_table_detail_stats()` for comprehensive table analysis with object details 2. **Handle Tombstone Data**: Be aware of tombstone objects when analyzing data: - Use `is_tombstone=False` to exclude deleted data - Use `is_tombstone=True` to analyze deleted data patterns 3. **Index Analysis**: Use index-specific scans to analyze index performance: - Monitor index size and usage - Identify unused or oversized indexes 4. **Async Operations**: Use async methods for better performance in concurrent scenarios: - Async methods are non-blocking - Better resource utilization - Suitable for monitoring and analysis tools 5. **Transaction Context**: Use metadata operations within transactions when needed: - Ensures consistency with other operations - Useful for data migration and analysis Error Handling -------------- .. code-block:: python try: stats = client.metadata.get_table_brief_stats("mo_catalog", "mo_columns") except Exception as e: print(f"Error getting table stats: {e}") # Handle error appropriately try: result = client.metadata.scan("mo_catalog", "nonexistent_table") except Exception as e: print(f"Table does not exist: {e}") # Handle missing table Troubleshooting --------------- Common Issues ~~~~~~~~~~~~~ 1. **Table Not Found**: Ensure the table exists and you have proper permissions 2. **Database Not Found**: Verify the database name and connection 3. **Permission Denied**: Check user permissions for metadata access 4. **Empty Results**: Some metadata operations may return empty results for new tables Performance Considerations ~~~~~~~~~~~~~~~~~~~~~~~~~~ 1. **Large Tables**: Metadata scanning can be slow for very large tables 2. **Frequent Queries**: Consider caching results for frequently accessed metadata 3. **Index Usage**: Use index-specific scans when possible for better performance 4. **Async Operations**: Use async methods for better concurrency For more examples and advanced usage, see the `examples/example_metadata.py` file.