building a small query engine in python (lazyq internals)

Apr 12, 2026

this is not a framework, not a replacement for pandas, not even a full database idea

just an attempt to understand:

what does a query engine really need?


the starting point

most data code looks like this:

result = []

for item in data:
    if item["age"] > 18:
        result.append(item["name"])

it works, but it doesn’t scale well in your head

you keep rewriting the same patterns


first step

instead of doing the work immediately, store the steps:

q = Query(data)\
    .filter(lambda x: x["age"] > 18)\
    .map(lambda x: x["name"])

nothing runs yet

you’re just building a pipeline
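a minimal sketch of what that can look like (this `Query` is a hypothetical stand-in for illustration, not the real lazyq class):

```python
class Query:
    def __init__(self, data, operations=None):
        self.data = data
        self.operations = operations or []

    def _with(self, op):
        # return a new Query with one more step recorded, nothing executed
        return Query(self.data, self.operations + [op])

    def filter(self, fn):
        return self._with(("filter", fn))

    def map(self, fn):
        return self._with(("map", fn))

data = [{"name": "ana", "age": 22}, {"name": "bo", "age": 15}]
q = Query(data).filter(lambda x: x["age"] > 18).map(lambda x: x["name"])
# two steps recorded, zero work done
```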


representing operations

each step is stored as an object:

class Map:
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

class Filter:
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return x if self.fn(x) else SKIP

instead of running immediately, they describe behavior
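since they're plain callables, you can poke at them on their own; re-declaring them here (with the SKIP sentinel, introduced just below) so the snippet stands alone:

```python
SKIP = object()  # sentinel meaning "drop this item"

class Map:
    def __init__(self, fn):
        self.fn = fn
    def __call__(self, x):
        return self.fn(x)

class Filter:
    def __init__(self, fn):
        self.fn = fn
    def __call__(self, x):
        return x if self.fn(x) else SKIP

double = Map(lambda x: x * 2)
adult = Filter(lambda x: x >= 18)

double(3)           # 6
adult(25)           # 25: passes through unchanged
adult(15) is SKIP   # True: dropped downstream
```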


the idea of SKIP and STOP

SKIP = object()
STOP = object()

these are just signals:

SKIP drops the current item
STOP ends the stream early

this avoids extra flags or conditions everywhere
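STOP earns its keep once something like Limit (listed later as a stream op) is in play; a sketch of how it might emit the signal once its budget runs out:

```python
SKIP = object()
STOP = object()

class Limit:
    # sketch: pass items through until n have been seen, then signal STOP
    def __init__(self, n):
        self.n = n
    def __call__(self, x):
        if self.n <= 0:
            return STOP
        self.n -= 1
        return x

lim = Limit(2)
taken = []
for item in [10, 20, 30, 40]:
    out = lim(item)
    if out is STOP:
        break
    taken.append(out)
# taken == [10, 20]
```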


executing the pipeline

execution happens in one place:

def __iter__(self):
    data = self.data() if callable(self.data) else self.data
    # each stored step is a factory; calling it builds a fresh op object for this run
    ops = [f() for f, _ in self.operations]

    for op in ops:
        if hasattr(op, "run"):
            data = op.run(data)
        else:
            data = self._apply_stream_op(data, op)

    for item in data:
        yield item

two types of operations appear here.


stream vs batch operations

stream (row by row)

Map
Filter
Limit

processed like:

for item in data:
    item = op(item)
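with the sentinels honored, that loop becomes a small generator; a stand-alone sketch of what `_apply_stream_op` might do:

```python
SKIP = object()
STOP = object()

def apply_stream_op(data, op):
    # apply one row-by-row op lazily, honoring the two signals
    for item in data:
        out = op(item)
        if out is STOP:
            return    # end the whole stream early
        if out is SKIP:
            continue  # drop just this item
        yield out

evens = list(apply_stream_op(range(10), lambda x: x if x % 2 == 0 else SKIP))
# evens == [0, 2, 4, 6, 8]
```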

batch (need full data)

Sort
GroupBy
Reduce

they implement:

def run(self, data):
    ...

and transform the whole dataset at once


example: sort

class Sort:
    def __init__(self, key=None, reverse=False):
        self.key = key
        self.reverse = reverse

    def run(self, data):
        if self.key is None:
            return sorted(data, reverse=self.reverse)
        if callable(self.key):
            return sorted(data, key=self.key, reverse=self.reverse)
        return sorted(data, key=lambda x: x[self.key], reverse=self.reverse)

this supports:

no key (natural ordering)
a callable key
a string key (a dict field)
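re-declaring Sort so the snippet stands alone, the three modes look like:

```python
class Sort:
    def __init__(self, key=None, reverse=False):
        self.key = key
        self.reverse = reverse

    def run(self, data):
        if self.key is None:
            return sorted(data, reverse=self.reverse)
        if callable(self.key):
            return sorted(data, key=self.key, reverse=self.reverse)
        return sorted(data, key=lambda x: x[self.key], reverse=self.reverse)

people = [{"name": "bo", "age": 31}, {"name": "ana", "age": 22}]

Sort().run([3, 1, 2])                          # no key: [1, 2, 3]
Sort("age").run(people)                        # dict field: ana (22) first
Sort(key=len, reverse=True).run(["aa", "b"])   # callable key: ["aa", "b"]
```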


grouping

class GroupBy:
    def __init__(self, key):
        self.key = key

    def run(self, data):
        group = {}
        for item in data:
            k = item[self.key]
            if k not in group:
                group[k] = []
            group[k].append(item)
        return group.items()

returns:

(key, [items])

which later operations can consume
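re-declaring GroupBy so the snippet stands alone; for example, a later step can fold each (key, [items]) pair into a summary:

```python
class GroupBy:
    def __init__(self, key):
        self.key = key

    def run(self, data):
        group = {}
        for item in data:
            k = item[self.key]
            if k not in group:
                group[k] = []
            group[k].append(item)
        return group.items()

rows = [
    {"city": "lima", "n": 1},
    {"city": "oslo", "n": 2},
    {"city": "lima", "n": 3},
]

# a step consuming the pairs: sum one field per group
totals = {k: sum(r["n"] for r in items) for k, items in GroupBy("city").run(rows)}
# totals == {"lima": 4, "oslo": 2}
```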


aggregations are just reduce

class Reduce:
    def __init__(self, fn, initial=None):
        self.fn = fn
        self.initial = initial

everything like:

sum
count
min / max

is built on top of this
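to see it in action, here's the class again with one plausible `run` (a left fold, seeded by `initial` when given; the real lazyq implementation may differ), plus a few aggregations on top:

```python
class Reduce:
    def __init__(self, fn, initial=None):
        self.fn = fn
        self.initial = initial

    def run(self, data):
        # fold the whole dataset into a single value
        it = iter(data)
        acc = next(it) if self.initial is None else self.initial
        for item in it:
            acc = self.fn(acc, item)
        return acc

total = Reduce(lambda a, b: a + b).run([1, 2, 3, 4])       # 10
count = Reduce(lambda a, _: a + 1, initial=0).run("abc")   # 3
smallest = Reduce(min).run([5, 2, 8])                      # 2
```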


lazy execution

nothing happens until:

.collect()
.show()
list(query)

this allows:

building queries cheaply
reusing and composing them
paying the cost only when you ask for results
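the effect is the same one Python generators give for free; a quick way to watch it happen (just a demonstration, not lazyq's API):

```python
log = []

def source():
    for i in range(3):
        log.append(i)  # record that work actually happened
        yield i

pipeline = (x * 10 for x in source() if x % 2 == 0)
assert log == []         # building the pipeline did no work

result = list(pipeline)  # this is the ".collect()" moment
# only now has source() actually run
```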


why this structure works

a few small ideas:

operations are plain objects
two sentinels instead of scattered flags
one place that executes everything

no hidden magic, just controlled flow


what’s missing (for now)

but the core idea holds


closing

this is still small, still evolving

but building it clarified something:

a query system is not about data; it’s about how you move through it


real version

this idea grew into a small library: lazyq


nothing fancy, just exploring how data flows
