this is not a framework, not a replacement for pandas, not even a full database idea
just an attempt to understand:
what does a query engine really need?
most data code looks like this:
result = []
for item in data:
    if item["age"] > 18:
        result.append(item["name"])
it works, but it doesn’t scale well as a way of thinking
you keep rewriting the same patterns
instead of doing the work immediately, store the steps:
q = Query(data)\
    .filter(lambda x: x["age"] > 18)\
    .map(lambda x: x["name"])
nothing runs yet
you’re just building a pipeline
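a minimal sketch of how the builder might store steps, assuming each call appends a (factory, label) pair to a list and returns the query itself. the `_add` helper, the label strings and the sample rows are made up for illustration, and Map and Filter are reduced to bare containers here.

```python
class Map:
    def __init__(self, fn):
        self.fn = fn


class Filter:
    def __init__(self, fn):
        self.fn = fn


class Query:
    def __init__(self, data):
        self.data = data
        self.operations = []  # (factory, label) pairs; nothing is executed here

    def _add(self, factory, label):
        # hypothetical helper: record the step and return self so calls can chain
        self.operations.append((factory, label))
        return self

    def filter(self, fn):
        return self._add(lambda: Filter(fn), "filter")

    def map(self, fn):
        return self._add(lambda: Map(fn), "map")


calls = []


def is_adult(x):
    calls.append(x)  # record every time the predicate actually runs
    return x["age"] > 18


data = [{"name": "Ana", "age": 30}, {"name": "Bo", "age": 12}]
q = Query(data).filter(is_adult).map(lambda x: x["name"])

print([label for _, label in q.operations])  # ['filter', 'map']
print(calls)  # [] because the predicate has not been called yet
```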
each step is stored as an object:
class Map:
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

class Filter:
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return x if self.fn(x) else SKIP
instead of running immediately, they describe behavior
SKIP = object()
STOP = object()
these are just signals:
SKIP → ignore this item
STOP → stop the pipeline
this avoids extra flags or conditions everywhere
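a small sketch of why the signal is a unique object rather than None, assuming SKIP is compared with `is`. if None meant "skip", a map that legitimately returns None could not be told apart from a filtered item. the `run_op` helper and both sample functions are invented for illustration.

```python
SKIP = object()  # unique sentinel: no real data value can be identical to it


def run_op(data, op):
    # hypothetical helper: apply one op per item and drop anything marked SKIP
    out = []
    for item in data:
        result = op(item)
        if result is SKIP:  # identity check, not equality
            continue
        out.append(result)
    return out


def keep_non_negative(x):
    return x if x >= 0 else SKIP


def nickname(x):
    # returns None for unknown names, and None is a legitimate result here
    return {"ana": "A"}.get(x)


print(run_op([3, -1, 0], keep_non_negative))  # [3, 0]
print(run_op(["ana", "bo"], nickname))        # ['A', None]
```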
execution happens in one place:
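def __iter__(self):
    # the source can be a function that produces the data, or the data itself
    data = self.data() if callable(self.data) else self.data
    # call each stored factory to get its operation
    ops = [f() for f, _ in self.operations]
    for op in ops:
        if hasattr(op, "run"):
            # operation that takes the whole dataset
            data = op.run(data)
        else:
            # operation applied item by item
            data = self._apply_stream_op(data, op)
    for item in data:
        yield item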
two types of operations appear here.
the first type, stream operations:
Map
Filter
Limit
processed like:
for item in data:
    item = op(item)
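a sketch of how Limit might use STOP, assuming the stream executor checks both signals. the source does not show Limit or the body of `_apply_stream_op`, so both are guesses here, written as a small class and a plain function.

```python
SKIP = object()
STOP = object()


class Limit:
    def __init__(self, n):
        self.n = n
        self.seen = 0  # per-instance counter, so each run needs a fresh Limit

    def __call__(self, x):
        if self.seen >= self.n:
            return STOP
        self.seen += 1
        return x


def apply_stream_op(data, op):
    # assumed behaviour: lazily apply op to each item, honouring SKIP and STOP
    for item in data:
        result = op(item)
        if result is STOP:
            return  # end the generator, nothing further is pulled from data
        if result is SKIP:
            continue
        yield result


def endless():
    n = 0
    while True:
        yield n
        n += 1


print(list(apply_stream_op(endless(), Limit(3))))  # [0, 1, 2]
```

this version pulls one extra item from the source before it stops. the counter inside Limit may also be why `__iter__` stores factories and calls `f()` on every run: each iteration then gets its own fresh state.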
the second type:
Sort
GroupBy
Reduce
they implement:
def run(self, data):
    ...
and transform the whole dataset at once
class Sort:
    def __init__(self, key=None, reverse=False):
        self.key = key
        self.reverse = reverse

    def run(self, data):
        if self.key is None:
            return sorted(data, reverse=self.reverse)
        if callable(self.key):
            return sorted(data, key=self.key, reverse=self.reverse)
        return sorted(data, key=lambda x: x[self.key], reverse=self.reverse)
this supports:
.sort()
.sort("age")
.sort(lambda x: x["age"])
class GroupBy:
    def __init__(self, key):
        self.key = key

    def run(self, data):
        group = {}
        for item in data:
            k = item[self.key]
            if k not in group:
                group[k] = []
            group[k].append(item)
        return group.items()
returns:
(key, [items])
which later operations can consume
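a short sketch of consuming those pairs, assuming the next step simply unpacks each (key, items) tuple. GroupBy is repeated from above so the snippet runs on its own; the sample rows are made up.

```python
class GroupBy:
    def __init__(self, key):
        self.key = key

    def run(self, data):
        group = {}
        for item in data:
            k = item[self.key]
            if k not in group:
                group[k] = []
            group[k].append(item)
        return group.items()


rows = [
    {"city": "Oslo", "age": 30},
    {"city": "Rome", "age": 41},
    {"city": "Oslo", "age": 25},
]

pairs = GroupBy("city").run(rows)

# a later step sees one (key, items) tuple per group
counts = [(city, len(items)) for city, items in pairs]
print(counts)  # [('Oslo', 2), ('Rome', 1)]
```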
class Reduce:
    def __init__(self, fn, initial=None):
        self.fn = fn
        self.initial = initial
everything like:
is built on top of this
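the text shows only Reduce's constructor. a sketch of what `run` could look like, assuming it folds the dataset with `functools.reduce` and wraps the result in a list so later steps can still iterate over it. the `count` and `total` helpers are hypothetical examples of things that could sit on top, not the library's actual list.

```python
from functools import reduce


class Reduce:
    def __init__(self, fn, initial=None):
        self.fn = fn
        self.initial = initial

    def run(self, data):
        # assumption: None means "no initial value", so None cannot be used as a seed
        if self.initial is None:
            result = reduce(self.fn, data)  # raises TypeError on empty data
        else:
            result = reduce(self.fn, data, self.initial)
        return [result]  # a one-item list keeps the output iterable


def count():
    # hypothetical helper: add 1 for every item
    return Reduce(lambda acc, _: acc + 1, 0)


def total(key):
    # hypothetical helper: sum one field across all items
    return Reduce(lambda acc, x: acc + x[key], 0)


rows = [{"age": 30}, {"age": 12}, {"age": 41}]
print(count().run(rows))       # [3]
print(total("age").run(rows))  # [83]
```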
nothing happens until:
.collect()
.show()
list(query)
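a sketch of how those triggers might sit on top of `__iter__`, assuming collect returns a list and show prints each item. only the method names come from the text; the bodies and the `LazySquares` class are invented for illustration.

```python
class LazySquares:
    def __init__(self, data):
        self.data = data
        self.ran = False

    def __iter__(self):
        self.ran = True  # flips only when something actually iterates
        for x in self.data:
            yield x * x

    def collect(self):
        # assumed behaviour: run the pipeline and gather everything into a list
        return list(self)

    def show(self):
        # assumed behaviour: run the pipeline and print each item
        for item in self:
            print(item)


q = LazySquares([1, 2, 3])
print(q.ran)        # False
print(q.collect())  # [1, 4, 9]
print(q.ran)        # True
```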
this allows:
a few small ideas:
no hidden magic, just controlled flow
but the core idea holds
this is still small, still evolving
but building it clarified something:
a query system is not about data, it’s about how you move through it
this idea grew into a small library:
nothing fancy, just exploring how data flows