In the previous chapter, Hybrid Query Engine, we learned how to search through our data using complex filters.
But as you keep adding millions of documents, a new problem arises: How do we manage all this data physically? If we stored everything in one gigantic file, it would become slow to read and impossible to update.
In this chapter, we explore Segments: the strategy zvec uses to break your data into bite-sized, manageable chunks.
Imagine you are writing an encyclopedia. Rather than binding everything into one impossibly heavy book, you split it into volumes: when one volume fills up, you seal it, put it on the shelf, and start writing in the next one. In zvec, these volumes are called Segments.
Imagine a system logging user activity. New records are always appended to the current active segment; once it reaches its size limit, it is sealed (made read-only) and a fresh segment takes its place. This ensures that writing is always fast, because you are only ever dealing with a small, fresh "notebook."
In most cases, zvec manages this automatically based on your configuration. However, you can manually trigger "Housekeeping" (Optimization).
When you define your schema, you (or the system defaults) decide how big a segment should be.
# Conceptual configuration
# zvec automatically switches to a new segment
# when the current one hits this limit.
max_docs_per_segment = 1024 * 1024 # 1 Million docs
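To make the limit concrete, here is a small back-of-the-envelope helper (plain Python, not part of the zvec API) that computes how many segments a given number of documents occupies at this limit:

```python
MAX_DOCS_PER_SEGMENT = 1024 * 1024  # 1,048,576 docs, matching the config above

def segment_count(total_docs: int) -> int:
    """Sealed + active segments after inserting total_docs documents."""
    # Each full batch of MAX_DOCS_PER_SEGMENT docs fills one segment;
    # any remainder lives in the current active segment. There is always
    # at least one (possibly empty) active segment.
    return max(1, -(-total_docs // MAX_DOCS_PER_SEGMENT))  # ceiling division

print(segment_count(2_500_000))  # 3 segments: two sealed, one active
```

So inserting 2.5 million documents at this limit produces two full, sealed segments plus one partially filled active segment.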
If you have deleted many documents, your disk might be full of "holes" (data marked as deleted but still taking up space). You can force zvec to clean up.
from zvec import OptimizeOptions

# Tell the collection to merge small segments
# and remove deleted data physically.
collection.optimize(
    options=OptimizeOptions(concurrency=4)
)
What happens here? The database looks at all the sealed segments. If it finds two small ones (e.g., Vol 1 is 30% full, Vol 2 is 20% full), it merges them into a new, efficient Vol 3 and deletes the old ones.
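As a rough illustration of that merge decision, here is a hypothetical compaction planner in Python. The threshold value and function names are inventions for this sketch, not real zvec settings:

```python
MERGE_THRESHOLD = 0.5  # assumed fill ratio below which a segment is "small"

def plan_merges(fill_ratios: dict[str, float]) -> list[list[str]]:
    """Group under-filled sealed segments into merge candidates."""
    small = [sid for sid, ratio in fill_ratios.items()
             if ratio < MERGE_THRESHOLD]
    groups, current, total = [], [], 0.0
    for sid in small:
        current.append(sid)
        total += fill_ratios[sid]
        # Once the combined data would fill a reasonably sized segment,
        # close this group and start collecting the next one.
        if total >= MERGE_THRESHOLD:
            groups.append(current)
            current, total = [], 0.0
    if len(current) > 1:  # leftover partial group still worth merging
        groups.append(current)
    return groups

print(plan_merges({"vol1": 0.3, "vol2": 0.2, "vol3": 0.9}))
# vol1 (30%) and vol2 (20%) are merged into one new segment;
# vol3 (90%) is left alone.
```

The real policy in zvec is more sophisticated, but the shape is the same: find under-filled sealed segments, combine them, and drop the originals.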
How does the data flow from your Python script to the hard drive?
Let's look at how zvec handles these "Encyclopedia Volumes" in C++.
The Segment Manager (src/db/index/segment/segment_manager.cc)
The SegmentManager is the "Librarian." It keeps a list of all the finished volumes.
// src/db/index/segment/segment_manager.cc
Status SegmentManager::add_segment(Segment::Ptr segment) {
  if (!segment) {
    return Status::InvalidArgument("Segment is null");
  }

  // Store the segment in a map (Dictionary)
  // Key: Segment ID, Value: The Segment object
  segments_map_[segment->id()] = segment;
  return Status::OK();
}
Explanation: This is a simple registry. When a segment is finished, the Collection hands it over to this Manager so it can be queried later.
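For readers more comfortable with Python, the same registry idea can be sketched like this (hypothetical names; the real implementation is the C++ above):

```python
class Segment:
    """Toy stand-in for a sealed, read-only segment."""
    def __init__(self, segment_id):
        self.id = segment_id

class SegmentManager:
    """The 'Librarian': a map from segment id to segment object."""
    def __init__(self):
        self._segments = {}  # Key: Segment ID, Value: the Segment object

    def add_segment(self, segment):
        if segment is None:
            raise ValueError("Segment is null")
        self._segments[segment.id] = segment

    def get(self, segment_id):
        return self._segments.get(segment_id)
```

A dictionary keyed by segment id is all the bookkeeping needed: queries can iterate over the registered segments, and optimization can swap entries out when volumes are merged.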
The Collection's Pivot Point (src/db/collection.cc)
This logic inside CollectionImpl decides when to swap the notebooks.
// src/db/collection.cc
Status CollectionImpl::switch_to_new_segment_for_writing() {
  // 1. Save the current data to disk
  auto s = writing_segment_->dump();

  // 2. Give the old segment to the Librarian (SegmentManager)
  s = segment_manager_->add_segment(writing_segment_);

  // 3. Create a brand new segment
  auto new_segment = Segment::CreateAndOpen(
      path_, *schema_, allocate_segment_id(), ...);

  // 4. Set the new one as the active writer
  writing_segment_ = new_segment.value();
  return Status::OK();
}
Explanation: This code is the "Pivot Point." It ensures that we never stop accepting data. We dump the old one, register it for reading, and instantly spin up a new one for writing.
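The same three-step handoff can be modeled in a few lines of Python (a toy sketch with invented names, not zvec's actual API):

```python
class Segment:
    next_id = 0

    def __init__(self):
        Segment.next_id += 1
        self.id = Segment.next_id
        self.sealed = False

    def dump(self):
        # Stand-in for flushing the in-memory data to disk.
        self.sealed = True

class Collection:
    def __init__(self):
        self.sealed_segments = []         # what the SegmentManager holds
        self.writing_segment = Segment()  # the active "notebook"

    def switch_to_new_segment(self):
        self.writing_segment.dump()                        # 1. save to disk
        self.sealed_segments.append(self.writing_segment)  # 2. register for reads
        self.writing_segment = Segment()                   # 3. fresh writer
```

Because step 3 happens immediately after step 2, there is never a moment when the collection has no active segment to write into.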
The Parquet Writer (src/db/index/storage/parquet_writer.cc)
When we say "Dump to disk," we don't just write text. We write Parquet. zvec uses the Apache Arrow library to do this efficiently.
// src/db/index/storage/parquet_writer.cc
arrow::Status ParquetWriter::write_batch(const arrow::RecordBatch &batch, ...) {
  // 'batch' is a chunk of data in memory (like a spreadsheet)

  // 1. Check if we need to filter out deleted rows
  // (Code omitted for brevity...)

  // 2. Write the batch to the Parquet file
  return writer_->WriteRecordBatch(batch);
}
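The deleted-row filtering that the comment elides can be pictured with a pure-Python sketch. The real writer operates on Arrow columnar arrays, but the idea is the same:

```python
def filter_deleted(rows, deleted_ids):
    """Keep only rows whose id is not in the set of deleted ids."""
    return [row for row in rows if row["id"] not in deleted_ids]

batch = [
    {"id": 1, "text": "alpha"},
    {"id": 2, "text": "beta"},
    {"id": 3, "text": "gamma"},
]
print(filter_deleted(batch, {2}))  # rows 1 and 3 survive; row 2 is dropped
```

This is why optimization matters: until a segment is rewritten, deleted rows still occupy space on disk and must be skipped at read or merge time.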
Beginner Explanation: A RecordBatch is like an in-memory spreadsheet, a set of rows held in Arrow's columnar format. The writer drops any rows that have been marked as deleted and appends the rest to a compressed, columnar Parquet file on disk.
In this chapter, we learned:

- zvec splits your data into Segments: small, manageable chunks instead of one gigantic file.
- Writes always go to a single active segment; when it fills up, it is sealed, handed to the SegmentManager, and replaced with a fresh one.
- collection.optimize() merges small segments and physically removes deleted data.
- zvec uses Apache Arrow and Parquet to make storage incredibly efficient.

Now we have data stored efficiently on disk. But simply storing data isn't enough. We need to perform vector searches (finding the "nearest neighbor"). To do that quickly, we need specialized algorithms.
In the next chapter, we will learn about the math and logic behind Vector Indexing.
Next Chapter: Vector Indexing Algorithms
Generated by Code IQ