Data Loading Guide
This guide covers comprehensive data loading operations in MatrixOne using the pandas-style interface for CSV, TSV, JSON Lines, and Parquet files.
Overview
The MatrixOne SDK provides a pandas-compatible interface for data loading:
Pandas-style API: Methods like
read_csv(),read_json(),read_parquet()match pandas namingMultiple formats: CSV, TSV, JSON Lines, Parquet
ORM model support: Type-safe loading with SQLAlchemy models
Stage integration: Load from S3, local filesystem, and cloud storage
Transaction support: Atomic multi-file loading
Parallel loading: High-performance parallel data loading
Compression support: gzip, bzip2, LZ4, and more
Basic CSV Loading
The simplest way to load CSV data (pandas-style):
from matrixone import Client
from matrixone.orm import declarative_base, Column, Integer, String
Base = declarative_base()
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True)
name = Column(String(100))
email = Column(String(255))
age = Column(Integer)
client = Client()
client.connect(database='test')
# Create table
client.create_table(User)
# Basic CSV load (pandas-style)
client.load_data.read_csv('users.csv', table=User)
# CSV with header (pandas-style)
client.load_data.read_csv('users.csv', table=User, skiprows=1)
# Custom separator (pandas-style)
client.load_data.read_csv('users.txt', table=User, sep='|')
# Tab-separated (TSV) (pandas-style)
client.load_data.read_csv('users.tsv', table=User, sep='\t')
client.disconnect()
CSV with Advanced Options
Use pandas-compatible parameters for advanced CSV loading:
from matrixone import Client
client = Client()
client.connect(database='test')
# All pandas-style parameters
client.load_data.read_csv(
'data.csv',
table='users',
sep=',', # pandas: sep
quotechar='"', # pandas: quotechar
skiprows=1, # pandas: skiprows
names=['id', 'name'], # pandas: names
encoding='utf-8', # pandas: encoding
compression='gzip', # pandas: compression
parallel=True # MatrixOne: parallel loading
)
client.disconnect()
JSON Lines Loading
Load JSON Lines (JSONL) files using pandas-compatible read_json():
from matrixone import Client
client = Client()
client.connect(database='test')
# JSON Lines with objects (pandas-style)
client.load_data.read_json(
'events.jsonl',
table='events',
lines=True, # pandas: lines
orient='records' # pandas: orient
)
# JSON Lines with arrays (pandas-style)
client.load_data.read_json(
'data.jsonl',
table='users',
lines=True,
orient='values' # Array format
)
# Compressed JSON Lines
client.load_data.read_json(
'events.jsonl.gz',
table='events',
lines=True,
compression='gzip'
)
client.disconnect()
Parquet Loading
Load Parquet files using pandas-compatible read_parquet():
from matrixone import Client
client = Client()
client.connect(database='test')
# Basic Parquet load (pandas-style)
client.load_data.read_parquet('data.parquet', table='users')
# With ORM model
client.load_data.read_parquet('data.parquet', table=User)
client.disconnect()
Parquet Support Notes:
✅ Fully supports: SNAPPY, GZIP, LZ4, ZSTD, Brotli compression
✅ Fully supports: Parquet 1.0 and 2.0, statistics, nullable columns
⚠️ Must disable dictionary encoding:
use_dictionary=False⚠️ VARCHAR only: Use
VARCHARin table schema, notTEXT⚠️ UTC timestamps: Use
pa.timestamp('ms', tz='UTC')
Loading from External Stages
Load data from external stages (S3, local filesystem):
Using stage:// Protocol
from matrixone import Client
client = Client()
client.connect(database='test')
# Create S3 stage
client.stage.create_s3(
name='s3_stage',
bucket='my-bucket',
path='data/',
aws_key_id='key',
aws_secret_key='secret'
)
# Load using stage:// protocol (pandas-style)
client.load_data.read_csv('stage://s3_stage/users.csv', table='users')
client.load_data.read_json('stage://s3_stage/events.jsonl', table='events')
client.load_data.read_parquet('stage://s3_stage/data.parquet', table='users')
client.disconnect()
Using Convenience Methods
from matrixone import Client
client = Client()
client.connect(database='test')
# Create stage
client.stage.create_local('local_stage', '/data/')
# Load using convenience methods (pandas-style)
client.load_data.read_csv_stage('local_stage', 'users.csv', table='users')
client.load_data.read_json_stage('local_stage', 'events.jsonl', table='events')
client.load_data.read_parquet_stage('local_stage', 'data.parquet', table='users')
# With options
client.load_data.read_csv_stage(
'local_stage',
'data.csv',
table='users',
sep='\t',
skiprows=1
)
client.disconnect()
Transaction-Based Loading
Load multiple files atomically within a transaction:
from matrixone import Client
from sqlalchemy import select, insert
client = Client()
client.connect(database='test')
# Atomic multi-file loading (pandas-style)
with client.session() as session:
# Load multiple files atomically
session.load_data.read_csv('users.csv', table=User)
session.load_data.read_csv('orders.csv', table=Order)
session.load_data.read_json('events.jsonl', table=Event, lines=True)
# Mix with other operations
session.execute(insert(User).values(name='Admin', email='admin@example.com'))
# All operations commit together or rollback on error
client.disconnect()
Inline Data Loading
Load data from strings without creating files:
from matrixone import Client
client = Client()
client.connect(database='test')
# CSV inline (pandas-style)
csv_data = "1,Alice,alice@example.com\\n2,Bob,bob@example.com\\n"
client.load_data.read_csv(csv_data, table='users', inline=True)
# Or use explicit inline method
client.load_data.read_csv_inline(csv_data, table='users')
# JSON Lines inline (pandas-style)
json_data = '{"id":1,"name":"Alice"}\\n{"id":2,"name":"Bob"}\\n'
client.load_data.read_json(json_data, table='users', inline=True)
# Or use explicit inline method
client.load_data.read_json_inline(json_data, table='users', orient='records')
client.disconnect()
Async Data Loading
All loading methods have async versions for non-blocking operations:
import asyncio
from matrixone import AsyncClient
async def async_load_example():
client = AsyncClient()
await client.connect(database='test')
# Async CSV load (pandas-style)
await client.load_data.read_csv('users.csv', table='users', skiprows=1)
# Async JSON load (pandas-style)
await client.load_data.read_json('events.jsonl', table='events', lines=True)
# Async Parquet load (pandas-style)
await client.load_data.read_parquet('data.parquet', table='users')
# From stage
await client.load_data.read_csv_stage('s3_stage', 'data.csv', table='users')
await client.disconnect()
asyncio.run(async_load_example())
Concurrent Loading
import asyncio
from matrixone import AsyncClient
async def concurrent_load():
client = AsyncClient()
await client.connect(database='test')
# Load multiple files concurrently (pandas-style)
await asyncio.gather(
client.load_data.read_csv('users.csv', table='users'),
client.load_data.read_csv('orders.csv', table='orders'),
client.load_data.read_csv('products.csv', table='products')
)
await client.disconnect()
asyncio.run(concurrent_load())
Parameter Reference
Pandas-Compatible Parameters
All parameters match pandas naming conventions:
Parameter |
Pandas Equivalent |
Description |
|---|---|---|
|
|
File path, stage path, or inline data |
|
N/A |
Table name (str) or SQLAlchemy model |
|
|
Field separator (default: ‘,’) |
|
|
Quote character (e.g., ‘”’) |
|
|
Number of rows to skip (default: 0) |
|
|
Column names to load |
|
|
Character encoding (e.g., ‘utf-8’) |
|
|
Compression format (‘gzip’, ‘bzip2’, etc.) |
|
|
Read JSON as lines (JSONL format) |
|
|
JSON structure (‘records’ or ‘values’) |
Best Practices
Use Pandas-Style API
The new
read_csv(),read_json(),read_parquet()methods are more intuitiveUse ORM Models
Pass SQLAlchemy models for type safety:
read_csv('data.csv', table=User)Use Transactions for Atomicity
Load multiple files atomically with
session()Use Stages for External Data
Load from S3 or cloud storage using
read_csv_stage()orstage://protocolUse Parallel Loading for Large Files
Enable
parallel=Truefor files >100MBHandle Headers with skiprows
Use
skiprows=1to skip header rows (pandas convention)
Common Use Cases
Data Migration
with client.session() as session:
# Migrate multiple tables atomically (pandas-style)
session.load_data.read_csv('users.csv', table=User, skiprows=1)
session.load_data.read_csv('orders.csv', table=Order, skiprows=1)
session.load_data.read_csv('products.csv', table=Product, skiprows=1)
ETL Pipeline
# Extract from various sources, load into MatrixOne (pandas-style)
client.load_data.read_csv('crm_export.csv', table='customers', sep='|')
client.load_data.read_json('events.jsonl', table='events', lines=True)
client.load_data.read_parquet('analytics.parquet', table='metrics')
Log File Import
# Load log files with custom parsing (pandas-style)
client.load_data.read_csv(
'application.log',
table='logs',
sep='\t',
names=['timestamp', 'level', 'message', 'source'],
skiprows=0
)
See Also
Stage Management Guide - External stage management
Data Export Guide - Data export operations
Quick Start - Quick start guide
Load Data Manager - LoadDataManager API reference