In the previous chapter, Python-C++ Bridge (Bindings), we learned how to start the high-performance C++ engine using Python.
Now that the engine is running, we have a problem: Where do we put the data?
You can't just throw documents into the engine randomly. You need a structured container to organize, label, and safely store them. In zvec, this container is called the Collection.
Think of a Collection as a physical Smart Filing Cabinet.
We want to build a search engine for a library. For each book, we need to store an ID (book_id), a title, and a vector.
To use a Collection, we need to understand three components:
zvec is typed: you can't put text into an integer field.
Let's define our Library Schema and create the Collection.
We use FieldSchema to define columns and CollectionSchema to wrap them up.
from zvec import CollectionSchema, FieldSchema, DataType
# Define the columns
id_field = FieldSchema(name="book_id", dtype=DataType.INT64)
title_field = FieldSchema(name="title", dtype=DataType.STRING)
# Vectors need a dimension size (e.g., 4 floats)
vec_field = FieldSchema(name="vector", dtype=DataType.VECTOR_FLOAT, dim=4)
# Create the blueprint
library_schema = CollectionSchema(name="library", fields=[id_field, title_field, vec_field])
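To make "typed" concrete, here is a minimal sketch of the kind of check a typed schema implies. It does not use zvec; `ToyField` and `validate` are hypothetical names invented for this illustration, and zvec's real validation may behave differently.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


# Hypothetical stand-in for a schema field; NOT a zvec class.
@dataclass(frozen=True)
class ToyField:
    name: str
    kind: str                  # "int", "str", or "vector"
    dim: Optional[int] = None  # required length for "vector" fields


def _is_number(x: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def validate(doc: Dict[str, Any], fields: List[ToyField]) -> List[str]:
    """Return a list of problems; an empty list means the doc fits the schema."""
    problems = []
    for f in fields:
        if f.name not in doc:
            problems.append(f"missing field: {f.name}")
            continue
        value = doc[f.name]
        if f.kind == "int":
            if not isinstance(value, int) or isinstance(value, bool):
                problems.append(f"{f.name}: expected int")
        elif f.kind == "str":
            if not isinstance(value, str):
                problems.append(f"{f.name}: expected str")
        elif f.kind == "vector":
            if not isinstance(value, list) or not all(_is_number(x) for x in value):
                problems.append(f"{f.name}: expected list of numbers")
            elif len(value) != f.dim:
                problems.append(f"{f.name}: expected {f.dim} dims, got {len(value)}")
    return problems


toy_schema = [
    ToyField("book_id", "int"),
    ToyField("title", "str"),
    ToyField("vector", "vector", dim=4),
]
good = {"book_id": 101, "title": "The Great Gatsby", "vector": [0.1, 0.9, 0.2, 0.5]}
bad = {"book_id": "101", "title": "The Great Gatsby", "vector": [0.1, 0.9]}

print(validate(good, toy_schema))
print(validate(bad, toy_schema))
```

The second document fails twice: its book_id is text rather than an integer, and its vector has two values where the schema asks for four.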
Now we create the actual collection on the disk.
from zvec import Collection
# create_and_open will create the folder "./my_library_db"
# If it already exists, it loads the data inside.
collection = Collection.create_and_open(
    path="./my_library_db",
    schema=library_schema
)
What just happened?
zvec created a directory on your hard drive. Inside, it initialized specific files to track versions and metadata.
Now we insert a document (a book).
from zvec import Doc
# Create a document
book = Doc(
    fields={
        "book_id": 101,
        "title": "The Great Gatsby",
        "vector": [0.1, 0.9, 0.2, 0.5]
    }
)
# Put it in the cabinet
collection.insert(book)
The data is now in memory and being prepared for storage. To ensure it is physically saved to the disk immediately, you can call collection.flush().
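The difference between "in memory" and "on disk" can be sketched with a toy write buffer. This is an assumption-laden illustration, not zvec's storage code: `ToyBuffer`, `count_lines`, and the file name `toy.log` are hypothetical. Only the standard-library calls (`file.flush`, `os.fsync`) are real.

```python
import json
import os
import tempfile


class ToyBuffer:
    """Hypothetical write buffer; NOT how zvec stores data."""

    def __init__(self, path):
        self.path = path
        self.pending = []  # documents held in memory only

    def insert(self, doc):
        self.pending.append(doc)

    def flush(self):
        """Append pending docs to the file and return how many were written."""
        with open(self.path, "a", encoding="utf-8") as f:
            for doc in self.pending:
                f.write(json.dumps(doc) + "\n")
            f.flush()             # push Python's buffer to the operating system
            os.fsync(f.fileno())  # ask the OS to write through to the device
        count = len(self.pending)
        self.pending.clear()
        return count


def count_lines(path):
    """Number of lines in the file, or 0 if it does not exist yet."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


log_path = os.path.join(tempfile.mkdtemp(), "toy.log")
buf = ToyBuffer(log_path)
buf.insert({"book_id": 101, "title": "The Great Gatsby"})

before = count_lines(log_path)  # the doc exists only in memory at this point
flushed = buf.flush()
after = count_lines(log_path)   # the doc has now been written to the file
```

Until `flush()` runs, a crash would lose the pending document. That is the risk an explicit flush call is meant to remove.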
How does zvec handle this internally? The Collection is actually a coordinator. It doesn't store the data itself; it delegates storage to smaller units called Segments.
Think of the Collection as the Manager.
Let's look at src/db/collection.cc. This is the brain of the operation.
CollectionImpl
The CollectionImpl class holds the state of the database.
// src/db/collection.cc
class CollectionImpl : public Collection {
 private:
  // The path on the disk (e.g., "./my_library_db")
  std::string path_;
  // The schema we defined in Python
  CollectionSchema::Ptr schema_;
  // The segment currently accepting new data
  Segment::Ptr writing_segment_;
  // A list of old segments that are sealed and read-only
  SegmentManager::Ptr segment_manager_;
};
Explanation:
writing_segment_: This is the "Open Folder" on your desk. All new inserts go here.
segment_manager_: This is the "Archive Cabinet". When the writing_segment_ gets full, it is moved here and sealed. (We will cover this in Segment & Storage Management).
When you call insert(), the Collection acts as a traffic controller.
// src/db/collection.cc
Result<WriteResults> CollectionImpl::Insert(std::vector<Doc> &docs) {
  // 1. Thread Safety: Lock the collection so two people don't write at once
  std::lock_guard write_lock(write_mtx_);
  // 2. Check if the current segment is full
  if (need_switch_to_new_segment()) {
    switch_to_new_segment_for_writing();
  }
  // 3. Hand the work over to the segment
  return writing_segment_->Insert(docs);
}
Why is this design "Beginner Friendly" for the developer?
The Collection abstracts away the complexity of file management. It automatically decides when a file is "full" and creates a new one. The user just keeps pushing data in, and the Collection manages the physical files.
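The lock, check, delegate pattern described above can be modeled in a few lines of Python. This is a toy sketch, not a port of collection.cc: `ToySegment`, `ToyCollection`, and the fixed per-segment capacity are assumptions made for illustration. How zvec actually decides that a segment is full is not covered in this chapter.

```python
import threading


class ToySegment:
    """Hypothetical segment: accepts documents until it is sealed."""

    def __init__(self, segment_id, capacity):
        self.segment_id = segment_id
        self.capacity = capacity
        self.docs = []
        self.sealed = False

    def is_full(self):
        return len(self.docs) >= self.capacity

    def insert(self, doc):
        if self.sealed:
            raise RuntimeError("segment is sealed")
        self.docs.append(doc)


class ToyCollection:
    """Hypothetical coordinator: owns one writing segment plus sealed ones."""

    def __init__(self, capacity=2):
        self._capacity = capacity
        self._write_lock = threading.Lock()
        self.sealed_segments = []  # plays the role of the "Archive Cabinet"
        self.writing_segment = ToySegment(0, capacity)

    def insert(self, doc):
        # 1. Only one writer at a time.
        with self._write_lock:
            # 2. Roll over to a fresh segment if the current one is full.
            if self.writing_segment.is_full():
                self._switch_to_new_segment()
            # 3. Delegate the actual storage to the segment.
            self.writing_segment.insert(doc)

    def _switch_to_new_segment(self):
        self.writing_segment.sealed = True
        self.sealed_segments.append(self.writing_segment)
        next_id = len(self.sealed_segments)
        self.writing_segment = ToySegment(next_id, self._capacity)


toy = ToyCollection(capacity=2)
for book_id in range(5):
    toy.insert({"book_id": book_id})
```

With a capacity of two, five inserts leave two sealed segments holding two documents each and a third writing segment holding one. The caller never touched a segment directly.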
What happens if you restart the script? The Collection::Open method (in C++) decides whether to create a new database or recover an existing one. Recovery relies on the VersionManager file (a manifest).
// src/db/collection.cc
Status CollectionImpl::Open(const CollectionOptions &options) {
  if (schema_ == nullptr) {
    // Schema is null? This means we are loading an existing DB!
    return recovery();
  } else {
    // Schema provided? Create a brand new DB!
    return create();
  }
}
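The same branch can be sketched in Python with a JSON manifest. Everything here is hypothetical: `toy_open`, the file name `manifest.json`, and the manifest layout are invented for illustration and say nothing about zvec's real on-disk format. The sketch mirrors only the "schema given means create, schema missing means recover" decision.

```python
import json
import os
import tempfile

MANIFEST = "manifest.json"  # hypothetical file name, not zvec's


def toy_open(path, schema=None):
    """Create a toy database when a schema is given, otherwise recover one."""
    manifest_path = os.path.join(path, MANIFEST)
    if schema is None:
        # No schema supplied: load the one recorded in the manifest.
        with open(manifest_path, encoding="utf-8") as f:
            manifest = json.load(f)
        return "recovered", manifest["schema"]
    # Schema supplied: create the directory and write a fresh manifest.
    os.makedirs(path, exist_ok=True)
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump({"version": 1, "schema": schema}, f)
    return "created", schema


db_path = os.path.join(tempfile.mkdtemp(), "toy_library_db")
first = toy_open(db_path, schema={"book_id": "int64", "title": "string"})
second = toy_open(db_path)  # simulates restarting the script

print(first[0], second[0])
```

The second call supplies no schema, yet it gets the same schema back because the manifest written by the first call survived on disk.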
In this chapter, we learned:
Now that we have data stored in the collection, the most exciting part begins: How do we find what we are looking for?
In the next chapter, we will explore how zvec performs searches across these documents using its powerful engine.
Next Chapter: Hybrid Query Engine
Generated by Code IQ