building a small query engine in python (lazyq internals)

Apr 12, 2026

this is not a framework, not a replacement for pandas, not even a full database idea

just an attempt to understand:

what does a query engine really need?


the starting point

most data code looks like this:

result = []

for item in data:
    if item["age"] > 18:
        result.append(item["name"])

it works, but it doesn’t scale well in your head

you keep rewriting the same patterns


first step

instead of doing the work immediately, store the steps:

q = Query(data)\
    .filter(lambda x: x["age"] > 18)\
    .map(lambda x: x["name"])

nothing runs yet

you’re just building a pipeline
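a minimal sketch of what that can look like (this `Query` is a hypothetical stand-in for illustration, not the real lazyq class):

```python
class Query:
    def __init__(self, data, operations=None):
        self.data = data
        self.operations = operations or []

    def _with(self, op):
        # return a new Query with one more step recorded, nothing executed
        return Query(self.data, self.operations + [op])

    def filter(self, fn):
        return self._with(("filter", fn))

    def map(self, fn):
        return self._with(("map", fn))

data = [{"name": "ana", "age": 22}, {"name": "bo", "age": 15}]
q = Query(data).filter(lambda x: x["age"] > 18).map(lambda x: x["name"])
# two steps recorded, zero work done
```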


representing operations

each step is stored as an object:

class Map:
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

class Filter:
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return x if self.fn(x) else SKIP

instead of running immediately, they describe behavior
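since they're plain callables, you can poke at them on their own; re-declaring them here (with the SKIP sentinel, introduced just below) so the snippet stands alone:

```python
SKIP = object()  # sentinel meaning "drop this item"

class Map:
    def __init__(self, fn):
        self.fn = fn
    def __call__(self, x):
        return self.fn(x)

class Filter:
    def __init__(self, fn):
        self.fn = fn
    def __call__(self, x):
        return x if self.fn(x) else SKIP

double = Map(lambda x: x * 2)
adult = Filter(lambda x: x >= 18)

double(3)           # 6
adult(25)           # 25: passes through unchanged
adult(15) is SKIP   # True: dropped downstream
```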


the idea of SKIP and STOP

SKIP = object()
STOP = object()

these are just signals:

SKIP drops the current item
STOP ends the stream early

this avoids extra flags or conditions everywhere
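STOP earns its keep once something like Limit (listed later as a stream op) is in play; a sketch of how it might emit the signal once its budget runs out:

```python
SKIP = object()
STOP = object()

class Limit:
    # sketch: pass items through until n have been seen, then signal STOP
    def __init__(self, n):
        self.n = n
    def __call__(self, x):
        if self.n <= 0:
            return STOP
        self.n -= 1
        return x

lim = Limit(2)
taken = []
for item in [10, 20, 30, 40]:
    out = lim(item)
    if out is STOP:
        break
    taken.append(out)
# taken == [10, 20]
```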


executing the pipeline

execution happens in one place:

def __iter__(self):
    data = self.data() if callable(self.data) else self.data
    # each stored step is a factory; calling it builds a fresh op object for this run
    ops = [f() for f, _ in self.operations]

    for op in ops:
        if hasattr(op, "run"):
            data = op.run(data)
        else:
            data = self._apply_stream_op(data, op)

    for item in data:
        yield item

two types of operations appear here.


stream vs batch operations

stream (row by row)

Map
Filter
Limit

processed like:

for item in data:
    item = op(item)
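with the sentinels honored, that loop becomes a small generator; a stand-alone sketch of what `_apply_stream_op` might do:

```python
SKIP = object()
STOP = object()

def apply_stream_op(data, op):
    # apply one row-by-row op lazily, honoring the two signals
    for item in data:
        out = op(item)
        if out is STOP:
            return    # end the whole stream early
        if out is SKIP:
            continue  # drop just this item
        yield out

evens = list(apply_stream_op(range(10), lambda x: x if x % 2 == 0 else SKIP))
# evens == [0, 2, 4, 6, 8]
```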

batch (need full data)

Sort
GroupBy
Reduce

they implement:

def run(self, data):
    ...

and transform the whole dataset at once


example: sort

class Sort:
    def __init__(self, key=None, reverse=False):
        self.key = key
        self.reverse = reverse

    def run(self, data):
        if self.key is None:
            return sorted(data, reverse=self.reverse)
        if callable(self.key):
            return sorted(data, key=self.key, reverse=self.reverse)
        return sorted(data, key=lambda x: x[self.key], reverse=self.reverse)

this supports:

no key (natural ordering)
a callable key
a string key (a dict field)
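re-declaring Sort so the snippet stands alone, the three modes look like:

```python
class Sort:
    def __init__(self, key=None, reverse=False):
        self.key = key
        self.reverse = reverse

    def run(self, data):
        if self.key is None:
            return sorted(data, reverse=self.reverse)
        if callable(self.key):
            return sorted(data, key=self.key, reverse=self.reverse)
        return sorted(data, key=lambda x: x[self.key], reverse=self.reverse)

people = [{"name": "bo", "age": 31}, {"name": "ana", "age": 22}]

Sort().run([3, 1, 2])                          # no key: [1, 2, 3]
Sort("age").run(people)                        # dict field: ana (22) first
Sort(key=len, reverse=True).run(["aa", "b"])   # callable key: ["aa", "b"]
```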


grouping

class GroupBy:
    def __init__(self, key):
        self.key = key

    def run(self, data):
        group = {}
        for item in data:
            k = item[self.key]
            if k not in group:
                group[k] = []
            group[k].append(item)
        return group.items()

returns:

(key, [items])

which later operations can consume
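re-declaring GroupBy so the snippet stands alone; for example, a later step can fold each (key, [items]) pair into a summary:

```python
class GroupBy:
    def __init__(self, key):
        self.key = key

    def run(self, data):
        group = {}
        for item in data:
            k = item[self.key]
            if k not in group:
                group[k] = []
            group[k].append(item)
        return group.items()

rows = [
    {"city": "lima", "n": 1},
    {"city": "oslo", "n": 2},
    {"city": "lima", "n": 3},
]

# a step consuming the pairs: sum one field per group
totals = {k: sum(r["n"] for r in items) for k, items in GroupBy("city").run(rows)}
# totals == {"lima": 4, "oslo": 2}
```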


aggregations are just reduce

class Reduce:
    def __init__(self, fn, initial=None):
        self.fn = fn
        self.initial = initial

everything like:

sum
count
min / max

is built on top of this
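to see it in action, here's the class again with one plausible `run` (a left fold, seeded by `initial` when given; the real lazyq implementation may differ), plus a few aggregations on top:

```python
class Reduce:
    def __init__(self, fn, initial=None):
        self.fn = fn
        self.initial = initial

    def run(self, data):
        # fold the whole dataset into a single value
        it = iter(data)
        acc = next(it) if self.initial is None else self.initial
        for item in it:
            acc = self.fn(acc, item)
        return acc

total = Reduce(lambda a, b: a + b).run([1, 2, 3, 4])       # 10
count = Reduce(lambda a, _: a + 1, initial=0).run("abc")   # 3
smallest = Reduce(min).run([5, 2, 8])                      # 2
```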


lazy execution

nothing happens until:

.collect()
.show()
list(query)

this allows:

building queries cheaply
reusing and composing them
paying the cost only when you ask for results
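the effect is the same one Python generators give for free; a quick way to watch it happen (just a demonstration, not lazyq's API):

```python
log = []

def source():
    for i in range(3):
        log.append(i)  # record that work actually happened
        yield i

pipeline = (x * 10 for x in source() if x % 2 == 0)
assert log == []         # building the pipeline did no work

result = list(pipeline)  # this is the ".collect()" moment
# only now has source() actually run
```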


why this structure works

a few small ideas:

operations are plain objects
two sentinels instead of scattered flags
one place that executes everything

no hidden magic, just controlled flow


what’s missing (for now)

but the core idea holds


closing

this is still small, still evolving

but building it clarified something:

a query system is not about data; it’s about how you move through it


real version

this idea grew into a small library: lazyq


nothing fancy, just exploring how data flows
