In the previous chapter, Protocol Tests, we ensured that clients (like MySQL tools or web browsers) could talk to ClickHouse. We verified the "Front Door" of the database.
But ClickHouse isn't just a destination; it is also a traveler. It often needs to reach out and pull data from other systems like AWS S3, Kafka, PostgreSQL, or MongoDB.
If Amazon changes how S3 works, or if we break our Kafka consumer code, data ingestion stops. To prevent this, we need External Integrations Tests.
Imagine you are building a central "Smart Home" hub (ClickHouse).
The Challenge: We cannot control the third parties. AWS S3 is a massive cloud service. We cannot "restart" AWS for a test. We also don't want to pay real money every time we run a test in CI.
Central Use Case: We want to verify that ClickHouse can read a CSV file stored in Object Storage (S3). To do this without using real AWS, we will use MinIO, a tool that pretends to be S3 but runs locally in Docker.
To simulate the outside world, we rely on Docker Compose and Mock Services.
Since we can't use the real internet services in a sealed CI environment, we use local versions:
This is a tool that lets us define an "Environment File" (docker-compose.yml). It tells Docker: "Start one ClickHouse, one MinIO, and one Kafka, and connect them all on a shared network."
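A minimal sketch of such a compose file (the service names and images here are illustrative, not the exact files used by the ClickHouse test framework):

```yaml
# docker-compose.yml (illustrative sketch)
services:
  clickhouse:
    image: clickhouse/clickhouse-server
    depends_on: [minio, kafka]
  minio:
    image: minio/minio
    command: server /data
  kafka:
    image: bitnami/kafka
# All three services join the default network, so inside it
# the hostname "minio" resolves to the MinIO container.
```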
ClickHouse has special SQL functions to talk to these services directly without creating a table first.
- s3(...): Reads files from object storage.
- kafka(...): Reads streams from a message bus.
- mysql(...): Reads tables from a remote MySQL database.

We will build a test that verifies ClickHouse can download and query a file from our fake S3 (MinIO).
First, we need to ensure MinIO is running. In tests/integration, we configure the ClickHouseCluster to include MinIO.
import pytest
from helpers.cluster import ClickHouseCluster

# We start a cluster that includes a "MinIO" container
cluster = ClickHouseCluster(__file__)
node = cluster.add_instance('node', with_minio=True)

@pytest.fixture(scope="module")
def started_cluster():
    try:
        cluster.start()
        yield cluster
    finally:
        cluster.shutdown()
Explanation: The with_minio=True flag tells our test runner (from Chapter 7) to spin up a MinIO container alongside ClickHouse.
In the test function, we first act as the user uploading data. We use a standard Python library (minio) to put a file into our fake bucket.
import io

from minio import Minio

def test_s3_read(started_cluster):
    # Connect to the local MinIO instance
    minio_client = started_cluster.minio_client

    # Create a bucket named 'data-lake'
    if not minio_client.bucket_exists("data-lake"):
        minio_client.make_bucket("data-lake")

    # Upload a simple CSV file
    csv_data = b"1,ClickHouse\n2,Integrations"
    minio_client.put_object(
        "data-lake", "data.csv", io.BytesIO(csv_data), len(csv_data)
    )
Explanation: We connect to the MinIO container. We create a bucket (folder) and upload a file data.csv containing two rows.
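As a sanity check, we can confirm that those uploaded bytes really parse into two rows. This small standalone sketch uses Python's csv module the same way any CSV reader conceptually works:

```python
import csv
import io

# The same bytes we uploaded to MinIO in the test above
csv_data = b"1,ClickHouse\n2,Integrations"

# Parse the raw bytes into rows of fields
rows = list(csv.reader(io.StringIO(csv_data.decode("utf-8"))))
print(rows)  # [['1', 'ClickHouse'], ['2', 'Integrations']]
```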
Now, we verify that ClickHouse can "see" this file using the s3 table function.
# ClickHouse needs to know the address of MinIO inside the Docker network
# usually: http://minio:9000/bucket/file
s3_url = "http://minio:9000/data-lake/data.csv"
# Run the query
result = node.query(f"""
SELECT * FROM s3(
'{s3_url}',
'minio', 'minio123',
'CSV'
)
""")
Explanation: We call the s3() table function in SQL, passing the URL of the file, the credentials (minio/minio123), and the format (CSV).

Finally, we assert that the data traveled from MinIO to ClickHouse correctly.
# Expected output: 1, ClickHouse (newline) 2, Integrations
expected = "1\tClickHouse\n2\tIntegrations\n"
assert result == expected
Explanation: If the result matches, it means the entire chain (Network -> HTTP Request -> CSV Parsing) is working correctly.
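ClickHouse returns results in tab-separated (TSV) format by default, which is why the expected string uses \t and \n. If you prefer comparing structured rows instead of raw strings, a tiny helper (a convenience sketch, not part of the test framework) does the splitting:

```python
def parse_tsv(result: str) -> list[list[str]]:
    """Split ClickHouse's default TSV output into rows of fields."""
    return [line.split("\t") for line in result.strip().split("\n")]

result = "1\tClickHouse\n2\tIntegrations\n"
rows = parse_tsv(result)
print(rows)  # [['1', 'ClickHouse'], ['2', 'Integrations']]
```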
When you run an s3 query, ClickHouse acts like a web browser downloading a file, but it processes the file while it is downloading (streaming).
The flow looks like this:

1. The s3() table function parses the URL.
2. StorageS3 opens a TCP connection to the MinIO container.
3. ClickHouse sends a GET /data-lake/data.csv request.
4. The CSV format decoder turns raw bytes into columns and rows.

At the core of this flow is an abstraction called ReadBuffer.
The magic happens in the C++ layer handling Input/Output. ClickHouse uses an abstraction called ReadBufferFromHTTP.
// Simplified concept from src/IO/ReadBufferFromHTTP.cpp
class ReadBufferFromHTTP : public ReadBuffer
{
public:
bool nextImpl() override
{
// 1. If we ran out of data in the buffer, fetch more
// 2. Read from the socket connected to S3/MinIO
ssize_t bytes_read = socket.read(internal_buffer);
// 3. If bytes_read is 0, the file is finished
return bytes_read > 0;
}
};
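The same buffering loop can be sketched in Python (illustrative only; the real C++ class handles HTTP, retries, and much more):

```python
import io

class StreamingReader:
    """Toy analogue of ReadBuffer: refill a small buffer chunk by chunk."""

    def __init__(self, stream, chunk_size=8):
        self.stream = stream
        self.chunk_size = chunk_size
        self.buffer = b""

    def next_chunk(self) -> bool:
        # Mirrors nextImpl(): fetch more bytes; False means end of file
        self.buffer = self.stream.read(self.chunk_size)
        return len(self.buffer) > 0

# Simulate downloading the CSV file in small chunks
reader = StreamingReader(io.BytesIO(b"1,ClickHouse\n2,Integrations"))
consumed = b""
while reader.next_chunk():
    consumed += reader.buffer  # the format decoder would parse each chunk here
print(consumed)  # b'1,ClickHouse\n2,Integrations'
```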
Explanation: Whenever the query engine exhausts the current buffer, it calls nextImpl(), which refills it from the network socket; when the socket returns zero bytes, the file is finished.

Modern "Lakehouse" architectures use complex formats like Apache Iceberg. Testing these is similar but requires more setup.
Iceberg tables consist of metadata files (.json, .avro) that point to the actual data files (.parquet).

def test_iceberg_integration():
    # 1. Start Spark and create an Iceberg table in MinIO
    run_spark_job("create_table.py")

    # 2. ClickHouse reads the metadata file
    node.query("SELECT * FROM iceberg('http://minio.../metadata.json')")
Explanation: This proves ClickHouse can decode the complex metadata structures used by big data engines.
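To make "complex metadata structures" concrete, here is a heavily trimmed, hypothetical metadata.json and the lookup an engine performs on it. Real Iceberg metadata files contain many more fields; treat this purely as a sketch:

```python
import json

# Hypothetical, heavily simplified Iceberg-style metadata (not a full spec example)
metadata_json = """
{
  "format-version": 2,
  "location": "s3://data-lake/events",
  "current-snapshot-id": 1,
  "snapshots": [
    {"snapshot-id": 1, "manifest-list": "s3://data-lake/events/snap-1.avro"}
  ]
}
"""

meta = json.loads(metadata_json)
# An engine follows current-snapshot-id to find which data files are live
current = next(s for s in meta["snapshots"]
               if s["snapshot-id"] == meta["current-snapshot-id"])
print(current["manifest-list"])  # s3://data-lake/events/snap-1.avro
```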
External Integration Tests are vital because they catch breakages in third-party protocols (like S3's API) and in our own connector code before they reach production, and because local stand-ins like MinIO let them run in a sealed CI environment without paying for real cloud services.
In this chapter, we learned about External Integrations Tests.
We have covered almost every aspect of testing now. However, writing these tests requires a lot of repetitive code (starting clusters, creating tables). To make life easier, the framework provides a library of helpers.
In the final chapter, we will look at Integration Test Helpers to see the tools available to speed up your test writing.
Next Chapter: Integration Test Helpers
Generated by Code IQ