Show HN: Semlib – Semantic Data Processing (github.com)

48 points by anishathalye 7 hours ago | 10 comments

esafak 6 hours ago [-]

Instead of building a new data processing library, I would have offered only the novel NLP part and exposed it to existing libraries like pandas, polars, and spacy.

Does it batch requests?

anishathalye 5 hours ago [-]

Yeah, that was just a design choice I made: I wanted a library that worked with `Iterator`s, which felt more lightweight to me and fit my immediate needs better. I'm personally not a huge fan of Pandas DataFrames for certain applications.

LOTUS (by Liana Patel et al., folks from Stanford and Berkeley; https://arxiv.org/abs/2407.11418) extends Pandas DataFrames with semantic operators; you could check out their open-source library: https://github.com/lotus-data/lotus

Semlib does batch requests; that was one of the primary motivations (I wanted to solve some concrete data processing tasks, started using the OpenAI API directly, then started calling LLMs in a for-loop, then wanted concurrency...). Semlib lets you set `max_concurrency` when you construct a session, and then many of the algorithms like `map` and `sort` take advantage of I/O concurrency (e.g., see the heart of the implementation of Quicksort with I/O concurrency: https://github.com/anishathalye/semlib/blob/5fa5c4534b91aa0e...). I wrote a bit more about the origins of this library on my blog, if you are interested: https://anishathalye.com/semlib/

ETA: I interpreted “batching” as I/O concurrency. If you were referring to the batch APIs that some providers offer: Semlib does not use those. They are too slow for the kind of data processing I wanted to do / not great when you have a lot of data dependencies. For example, a semantic Quicksort would take forever if each batch is processed in 24 hours (the upper bound when using Anthropic’s batch APIs, for example).
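
To illustrate the kind of I/O concurrency described above, here is a minimal sketch of a concurrency-capped `map` using only `asyncio`. This is not Semlib's code: `fake_llm_call` and `bounded_map` are hypothetical names, and the "LLM" is a stand-in that sleeps and uppercases its input. Only the `max_concurrency` idea is taken from the comment.

```python
import asyncio


async def fake_llm_call(item: str) -> str:
    # Hypothetical stand-in for a network-bound LLM request.
    await asyncio.sleep(0.01)
    return item.upper()


async def bounded_map(items, fn, max_concurrency=4):
    # The semaphore caps how many calls are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:
            return await fn(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(item) for item in items))


results = asyncio.run(bounded_map(["a", "b", "c"], fake_llm_call, max_concurrency=2))
```

With `max_concurrency=2`, at most two of the three calls run at the same time, and the results still come back in input order.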

Y_Y 6 hours ago [-]

  >>> await sort(presidents, by="right-leaning")
  ['Jimmy Carter', 'Bill Clinton', 'George H. W. Bush', 'Ronald Reagan']

Is this supposed to be impressive? GIGO. If you want to vibe-classify your data then go right ahead, but I hope nobody serious relies on it.

anishathalye 6 hours ago [-]

That was a small self-contained example that fit above the fold in the README (and fwiw even last year’s models like GPT-4o give the right output there). That `sort` is based on pairwise comparisons, which is one of the best ways you can do it in terms of accuracy (Qin et al., 2023: https://arxiv.org/abs/2306.17563).
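
A rough sketch of what a sort built on pairwise comparisons can look like, with the comparisons issued concurrently. This is an illustration under assumptions, not Semlib's implementation: `compare` is a hypothetical stand-in for an LLM judgment (here it just compares string lengths so the example is deterministic), and `quicksort` is a plain Quicksort written for this example.

```python
import asyncio


async def compare(a: str, b: str) -> bool:
    # Hypothetical stand-in for an LLM call answering "should a come before b?".
    # String length is used here so the result is deterministic.
    await asyncio.sleep(0)
    return len(a) < len(b)


async def quicksort(items, less_than):
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    # Comparisons against the pivot are independent, so run them concurrently.
    verdicts = await asyncio.gather(*(less_than(x, pivot) for x in rest))
    left = [x for x, before in zip(rest, verdicts) if before]
    right = [x for x, before in zip(rest, verdicts) if not before]
    # The two partitions are independent as well.
    sorted_left, sorted_right = await asyncio.gather(
        quicksort(left, less_than), quicksort(right, less_than)
    )
    return sorted_left + [pivot] + sorted_right


ranked = asyncio.run(quicksort(["ccc", "a", "dddd", "bb"], compare))
```

Each level of recursion waits on one round of concurrent comparisons, which is why slow batch APIs would be a poor fit for this kind of data-dependent algorithm.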

I think there are many real use cases where you might want a semantic sort / semantic data processing in general, when there isn’t a deterministic way to do the task and there is not necessarily a single right answer, and some amount of error (due to LLMs being imperfect) is tolerable. See https://semlib.anish.io/examples/arxiv-recommendations/ for one concrete example. In my opinion, the outputs are pretty high quality, to the point where this is practically usable.

These primitives can be _composed_, and that’s where this approach really shines. As a case study, I tried automating a part of performance reviews at my company, and the Semlib+LLM approach did _better_ than me (don’t worry, I didn’t dump AI-generated outputs on people, I first did the work manually, and shared both versions with an explanation of where each version came from). See the case study in https://anishathalye.com/semlib/

There’s also some related academic work in this area that also talks about applications. One of the most compelling IMO is DocETL’s collaboration to analyze police records (https://arxiv.org/abs/2410.12189). Some others you might enjoy checking out are LOTUS (https://arxiv.org/abs/2407.11418v1), Palimpzest (https://arxiv.org/abs/2405.14696), and Aryn (https://arxiv.org/abs/2409.00847).

Y_Y 6 hours ago [-]

As you compose fuzzy operations your errors multiply! Nobody is asking for perfection, but this tool seems to me a straightforward way to launder bad data. If you want to do a quick check of an idea then it's probably great, but if you're going to be rigorous and use hard data and reproducible, understandable methods then I don't think it offers anything. The plea for citations at the end of the readme also rubs me the wrong way.
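
One way to read the "errors multiply" point, as a back-of-the-envelope sketch: assume each fuzzy step is correct with the same probability and that errors are independent (both are simplifying assumptions, and the 0.95 figure is made up for illustration). `pipeline_accuracy` is a hypothetical helper.

```python
def pipeline_accuracy(per_step_accuracy: float, steps: int) -> float:
    # Under the independence assumption, the chance that every step is
    # correct is the product of the per-step probabilities.
    return per_step_accuracy ** steps


for steps in (1, 3, 5):
    print(steps, round(pipeline_accuracy(0.95, steps), 3))
```

Under those assumptions, five composed steps at 95% each leave roughly a 77% chance that the whole chain is correct.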

anishathalye 5 hours ago [-]

I think semantic data processing in this style has a nonempty set of use cases (e.g., I find the fuzzy sorting of arXiv papers to be useful, I find the examples in the docs representative of some real-world tasks where this style of data processing makes sense, and I find many of the motivating examples and use cases in the academic work compelling). At the same time, I think there are many tasks for which this approach is not the right one to use.

Sorry you didn't like the wording in the README; that was not the intention. I like to give people a canonical form they can copy-paste if they want to cite the work. Things have been a mess for many of my other GitHub repos, which makes it hard to find who is using the work (which can be really informative for improving the software, and I often follow up with authors of papers via email etc.). For example, I heard about Amazon MemoryDB because they use Porcupine (https://dl.acm.org/doi/pdf/10.1145/3626246.3653380). Appreciate you sharing your feelings; I stripped the text from the README. If you have additional suggestions, I would appreciate your comments or a PR.

adastra22 3 hours ago [-]

FWIW it doesn't serve as a great example because the ordering is not obvious. I think that is what GP was reacting to. When I say "sort a list of presidents by how right-leaning they are" in any other context, people would probably assume the MOST right-leaning president to be listed first. It took me a moment to remember that Python's `sort` is in ascending order by default.
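
A quick illustration of the default ordering in plain Python (the names and scores are made up; this uses the built-in `sorted`, not Semlib's `sort`):

```python
# Hypothetical scores, higher meaning "more of the property".
scores = {"A": 0.2, "B": 0.9, "C": 0.5}

# Default is ascending: lowest score first.
ascending = sorted(scores, key=scores.get)

# reverse=True puts the highest score first.
descending = sorted(scores, key=scores.get, reverse=True)

print(ascending)   # ['A', 'C', 'B']
print(descending)  # ['B', 'C', 'A']
```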

Y_Y 4 hours ago [-]

Thank you for engaging with me so politely and constructively. I care probably more than I should about good science and honesty in academia, and so I feel compelled to push back in cases where I see things like blatant overstating of capabilities or artificial citation farming.

This case seems to have been a false positive. Surely people will misuse your tool, but that's not your responsibility, as long as you haven't misled them to begin with. Good luck with the project, I hope to someday need to cite the software myself.

hobofan 6 hours ago [-]

Why not?

Sorting/prioritizing lists is one of the best use cases for LLMs, especially if the metrics for it are fuzzy, e.g. "what are the 10 sales leads of this list of 1000 that I should prioritize".

One of the more interesting approaches for that is arbitron[0], which does pairwise ranking with multiple metrics/agents to provide a multi-faceted sorting.

[0]: https://github.com/davidgasquez/arbitron

rekwah 5 hours ago [-]

Asc vs Desc sort order doing a lot of lifting here.
