Data Loading Guide

This guide covers comprehensive data loading operations in MatrixOne using the pandas-style interface for CSV, TSV, JSON Lines, and Parquet files.

Overview

The MatrixOne SDK provides a pandas-compatible interface for data loading:

Pandas-style API: Methods like read_csv(), read_json(), read_parquet() match pandas naming
Multiple formats: CSV, TSV, JSON Lines, Parquet
ORM model support: Type-safe loading with SQLAlchemy models
Stage integration: Load from S3, local filesystem, and cloud storage
Transaction support: Atomic multi-file loading
Parallel loading: High-performance parallel data loading
Compression support: gzip, bzip2, LZ4, and more

Basic CSV Loading

The simplest way to load CSV data (pandas-style):

from matrixone import Client
from matrixone.orm import declarative_base, Column, Integer, String

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String(100))
    email = Column(String(255))
    age = Column(Integer)

client = Client()
client.connect(database='test')

# Create table
client.create_table(User)

# Basic CSV load (pandas-style)
client.load_data.read_csv('users.csv', table=User)

# CSV with header (pandas-style)
client.load_data.read_csv('users.csv', table=User, skiprows=1)

# Custom separator (pandas-style)
client.load_data.read_csv('users.txt', table=User, sep='|')

# Tab-separated (TSV) (pandas-style)
client.load_data.read_csv('users.tsv', table=User, sep='\t')

client.disconnect()

CSV with Advanced Options

Use pandas-compatible parameters for advanced CSV loading:

from matrixone import Client

client = Client()
client.connect(database='test')

# All pandas-style parameters
client.load_data.read_csv(
    'data.csv',
    table='users',
    sep=',',              # pandas: sep
    quotechar='"',        # pandas: quotechar
    skiprows=1,           # pandas: skiprows
    names=['id', 'name'], # pandas: names
    encoding='utf-8',     # pandas: encoding
    compression='gzip',   # pandas: compression
    parallel=True         # MatrixOne: parallel loading
)

client.disconnect()

JSON Lines Loading

Load JSON Lines (JSONL) files using pandas-compatible read_json():

from matrixone import Client

client = Client()
client.connect(database='test')

# JSON Lines with objects (pandas-style)
client.load_data.read_json(
    'events.jsonl',
    table='events',
    lines=True,          # pandas: lines
    orient='records'     # pandas: orient
)

# JSON Lines with arrays (pandas-style)
client.load_data.read_json(
    'data.jsonl',
    table='users',
    lines=True,
    orient='values'      # Array format
)

# Compressed JSON Lines
client.load_data.read_json(
    'events.jsonl.gz',
    table='events',
    lines=True,
    compression='gzip'
)

client.disconnect()

Parquet Loading

Load Parquet files using pandas-compatible read_parquet():

from matrixone import Client

client = Client()
client.connect(database='test')

# Basic Parquet load (pandas-style)
client.load_data.read_parquet('data.parquet', table='users')

# With ORM model
client.load_data.read_parquet('data.parquet', table=User)

client.disconnect()

Parquet Support Notes:

✅ Fully supports: SNAPPY, GZIP, LZ4, ZSTD, Brotli compression
✅ Fully supports: Parquet 1.0 and 2.0, statistics, nullable columns
⚠️ Must disable dictionary encoding: use_dictionary=False
⚠️ VARCHAR only: Use VARCHAR in table schema, not TEXT
⚠️ UTC timestamps: Use pa.timestamp('ms', tz='UTC')

Loading from External Stages

Load data from external stages (S3, local filesystem):

Using stage:// Protocol

from matrixone import Client

client = Client()
client.connect(database='test')

# Create S3 stage
client.stage.create_s3(
    name='s3_stage',
    bucket='my-bucket',
    path='data/',
    aws_key_id='key',
    aws_secret_key='secret'
)

# Load using stage:// protocol (pandas-style)
client.load_data.read_csv('stage://s3_stage/users.csv', table='users')
client.load_data.read_json('stage://s3_stage/events.jsonl', table='events')
client.load_data.read_parquet('stage://s3_stage/data.parquet', table='users')

client.disconnect()

Using Convenience Methods

from matrixone import Client

client = Client()
client.connect(database='test')

# Create stage
client.stage.create_local('local_stage', '/data/')

# Load using convenience methods (pandas-style)
client.load_data.read_csv_stage('local_stage', 'users.csv', table='users')
client.load_data.read_json_stage('local_stage', 'events.jsonl', table='events')
client.load_data.read_parquet_stage('local_stage', 'data.parquet', table='users')

# With options
client.load_data.read_csv_stage(
    'local_stage',
    'data.csv',
    table='users',
    sep='\t',
    skiprows=1
)

client.disconnect()

Transaction-Based Loading

Load multiple files atomically within a transaction:

from matrixone import Client
from sqlalchemy import select, insert

client = Client()
client.connect(database='test')

# Atomic multi-file loading (pandas-style)
with client.session() as session:
    # Load multiple files atomically
    session.load_data.read_csv('users.csv', table=User)
    session.load_data.read_csv('orders.csv', table=Order)
    session.load_data.read_json('events.jsonl', table=Event, lines=True)

    # Mix with other operations
    session.execute(insert(User).values(name='Admin', email='admin@example.com'))

    # All operations commit together or rollback on error

client.disconnect()

Inline Data Loading

Load data from strings without creating files:

from matrixone import Client

client = Client()
client.connect(database='test')

# CSV inline (pandas-style)
csv_data = "1,Alice,alice@example.com\\n2,Bob,bob@example.com\\n"
client.load_data.read_csv(csv_data, table='users', inline=True)

# Or use explicit inline method
client.load_data.read_csv_inline(csv_data, table='users')

# JSON Lines inline (pandas-style)
json_data = '{"id":1,"name":"Alice"}\\n{"id":2,"name":"Bob"}\\n'
client.load_data.read_json(json_data, table='users', inline=True)

# Or use explicit inline method
client.load_data.read_json_inline(json_data, table='users', orient='records')

client.disconnect()

Async Data Loading

All loading methods have async versions for non-blocking operations:

import asyncio
from matrixone import AsyncClient

async def async_load_example():
    client = AsyncClient()
    await client.connect(database='test')

    # Async CSV load (pandas-style)
    await client.load_data.read_csv('users.csv', table='users', skiprows=1)

    # Async JSON load (pandas-style)
    await client.load_data.read_json('events.jsonl', table='events', lines=True)

    # Async Parquet load (pandas-style)
    await client.load_data.read_parquet('data.parquet', table='users')

    # From stage
    await client.load_data.read_csv_stage('s3_stage', 'data.csv', table='users')

    await client.disconnect()

asyncio.run(async_load_example())

Concurrent Loading

import asyncio
from matrixone import AsyncClient

async def concurrent_load():
    client = AsyncClient()
    await client.connect(database='test')

    # Load multiple files concurrently (pandas-style)
    await asyncio.gather(
        client.load_data.read_csv('users.csv', table='users'),
        client.load_data.read_csv('orders.csv', table='orders'),
        client.load_data.read_csv('products.csv', table='products')
    )

    await client.disconnect()

asyncio.run(concurrent_load())

Parameter Reference

Pandas-Compatible Parameters

All parameters match pandas naming conventions:

Parameter	Pandas Equivalent	Description
`filepath_or_buffer`	`filepath_or_buffer`	File path, stage path, or inline data
`table`	N/A	Table name (str) or SQLAlchemy model
`sep`	`sep`	Field separator (default: ‘,’)
`quotechar`	`quotechar`	Quote character (e.g., ‘”’)
`skiprows`	`skiprows`	Number of rows to skip (default: 0)
`names`	`names`	Column names to load
`encoding`	`encoding`	Character encoding (e.g., ‘utf-8’)
`compression`	`compression`	Compression format (‘gzip’, ‘bzip2’, etc.)
`lines`	`lines`	Read JSON as lines (JSONL format)
`orient`	`orient`	JSON structure (‘records’ or ‘values’)

Best Practices

Use Pandas-Style API

The new read_csv(), read_json(), read_parquet() methods are more intuitive
Use ORM Models

Pass SQLAlchemy models for type safety: read_csv('data.csv', table=User)
Use Transactions for Atomicity

Load multiple files atomically with session()
Use Stages for External Data

Load from S3 or cloud storage using read_csv_stage() or stage:// protocol
Use Parallel Loading for Large Files

Enable parallel=True for files >100MB
Handle Headers with skiprows

Use skiprows=1 to skip header rows (pandas convention)

Common Use Cases

Data Migration

with client.session() as session:
    # Migrate multiple tables atomically (pandas-style)
    session.load_data.read_csv('users.csv', table=User, skiprows=1)
    session.load_data.read_csv('orders.csv', table=Order, skiprows=1)
    session.load_data.read_csv('products.csv', table=Product, skiprows=1)

ETL Pipeline

# Extract from various sources, load into MatrixOne (pandas-style)
client.load_data.read_csv('crm_export.csv', table='customers', sep='|')
client.load_data.read_json('events.jsonl', table='events', lines=True)
client.load_data.read_parquet('analytics.parquet', table='metrics')

Log File Import

# Load log files with custom parsing (pandas-style)
client.load_data.read_csv(
    'application.log',
    table='logs',
    sep='\t',
    names=['timestamp', 'level', 'message', 'source'],
    skiprows=0
)