Metric | simonwillisonblog | tsv-utils
---|---|---
Mentions | 28 | 10
Stars | 163 | 1,396
Growth | - | 0.0%
Activity | 8.1 | 0.0
Last commit | about 15 hours ago | over 1 year ago
Language | JavaScript | D
License | Apache License 2.0 | Boost Software License 1.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
simonwillisonblog
- Sandboxing Python with Win32 App Isolation
- AI for Web Devs: Addressing Bugs, Security, & Reliability
Simon Willison has pointed out several examples of prompt injection attacks and why it may never be a solved problem:
- Where Have All the Websites Gone?
I want more people to have link blogs.
I have one in the sidebar of https://simonwillison.net/ which I've been running since November 2003. You can search through all 6,836 links here: https://simonwillison.net/search/?type=blogmark
I can post things to it with a bookmarklet. It has an Atom feed.
It's such a low-friction way of publishing. A lot of https://daringfireball.net works like this too. I also like https://waxy.org/ and https://kottke.org/ for this.
I'd love to see more of these.
- Ask HN: Is it feasible to train my own LLM?
- Moving Away from Substack
My approach is to publish to my own blog at https://simonwillison.net and then copy and paste content from that into a Substack newsletter at https://simonw.substack.com a few times a month.
It's been working really well.
Substack don't have an API, but they do support copy and paste - so I built myself a tool that assembles my blog content into rich text I can copy and paste straight into the Substack editor.
I wrote about how that works here: https://simonwillison.net/2023/Apr/4/substack-observable/
- Building a Blog in Django
Hah, yeah securing something like WordPress can be a challenge, especially if you're running a bunch of plugins.
My blog is a pretty straight-forward Django setup without many other dependencies, so it's a lot less of an attack surface: https://github.com/simonw/simonwillisonblog
- Show HN: Superfunctions – AI prompt templates as an API
That specific prompt is just an example and it's pretty bad, it was the shortest and simplest prompt I could come up with that would be easily understood.
You can set response content-types (text, html, json, etc...). If you use json it will get pretty good results because I have some logic to attempt to pick out json or json5 objects from the text output. I don't yet have logic to support json arrays, but I'm hoping to add that soon.
But client-side validation is still needed for applications with untrusted input. I don't attempt to solve prompt injection. I saw a lot of interesting posts on this topic on this blog: https://simonwillison.net/. I need to find some time to read more about it.
Try this one instead, it should be better
- Stopping at 90%
I've started to consider "commit to writing about it" as the price I have to pay for giving in to the lure of another project. It's one of the main reasons I publish so much content on https://simonwillison.net/ and https://til.simonwillison.net
A project with a published write-up unlocks so much more value than one which you complete without giving others a chance of understanding what you built.
I've maintained internal blogs (sometimes just a Slack channel or Confluence area) at previous employers for this purpose too.
- Stanford A.I. Courses
I think you are asking specifically about practical LLM engineering and not the underlying science.
Honestly this is all moving so fast you can do well by reading the news, following a few reddits/substacks, and skimming the prompt engineering papers as they come out every week (!).
https://www.latent.space/p/ai-engineer provides an early manifesto for this nascent layer of the stack.
Zvi writes a good roundup (though he is concerned mostly with alignment so skip if you don’t like that angle): https://thezvi.substack.com/p/ai-18-the-great-debate-debates
Simon W has some good writeups too: https://simonwillison.net/
I strongly recommend playing with the OpenAI APIs and working with langchain in a Colab notebook to get a feel for how these all fit together. Also, the tools here are incredibly simple and easy to understand (very new), so it is worth looking at, say, https://github.com/minimaxir/simpleaichat/tree/main/simpleai... or https://github.com/smol-ai/developer and digging in to the prompts, what goes in system vs assistant roles, how you guide the LLM, etc.
- Seeking Your Top Recommendations for Resources on ChatGPT and Generative AI
Simon Willison's Weblog
tsv-utils
- Frawk: An efficient Awk-like programming language. (2021)
If you need just csv/tsv parsing, you can also take a look at https://github.com/eBay/tsv-utils
- Tracking SQLite Database Changes in Git
You might want to look at tsv-utils, or a similar project: https://github.com/eBay/tsv-utils
For the SQL part, but maybe a lot heavier, you can use one of the projects listed on this page: https://github.com/multiprocessio/dsq (No longer maintained, but has links to lots of other projects)
- I feel like an idiot but… I need Excel help.
TSV is most often a better format than CSV. Localization, in particular, is a nightmare with CSV.
- Splitting CSV files at 3GB/s
- Modernizing AWK, a 45-year old language, by adding CSV support
For anything down and dirty, what's wrong with -F'"'? For anything fancy there are plenty of things like the below.
eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
includes csv to tsv: https://github.com/eBay/tsv-utils
HT: https://simonwillison.net/
- Dlang 2.098.0 released, now available on OpenBSD
As an example, eBay's tsv-utils took full advantage of the GC and performed better than existing programs that had been hand-optimized in C etc.
- [OC]Tidy Viewer (tv) is a cross-platform csv pretty printer that uses column styling to maximize viewer enjoyment.
tsv-utils - Command line csv data manipulation toolkit. D
- Changing Registry Key Value Based on Contents of TXT/CSV File
In the majority of cases you'll be better off with Tab Separated Values over Comma Separated Values. More info here.
- Return 1 to N results from a large (19MM line) CSV
May well be overkill for your needs, but I'm a fan of tsv-utils. It's fast and enormously flexible, and seems to me a "best of breed" toolset for data mining CSV files (that is what it was written for). https://github.com/eBay/tsv-utils
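Several of the comments above argue that TSV is easier to work with than CSV and point to tsv-utils for CSV-to-TSV conversion. The following is a minimal Python sketch of that idea, not the tsv-utils implementation: the function name `csv_to_tsv` and the choice to replace embedded tabs and newlines with a space are assumptions made for this illustration. It shows that once quoting has been resolved by a real CSV parser, each TSV line can be split on tabs alone, whereas naively splitting the CSV line on commas gives the wrong number of fields.

```python
import csv
import io


def csv_to_tsv(csv_text: str) -> str:
    """Convert CSV text to TSV (hypothetical helper for illustration).

    Assumption: tabs and line breaks inside a field are replaced by a
    single space so that every record fits on one tab-delimited line.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    lines = []
    for row in reader:
        cleaned = [
            field.replace("\r\n", " ")
            .replace("\n", " ")
            .replace("\r", " ")
            .replace("\t", " ")
            for field in row
        ]
        lines.append("\t".join(cleaned))
    return "\n".join(lines) + ("\n" if lines else "")


# A CSV record with an embedded comma and escaped quotes.
sample = 'name,comment\n"Smith, Jane","said ""hi"""\n'

tsv = csv_to_tsv(sample)

# After conversion, a plain split on tabs recovers the fields.
rows = [line.split("\t") for line in tsv.splitlines()]

# Splitting the raw CSV line on commas breaks the quoted field apart.
naive = sample.splitlines()[1].split(",")
```

Here `rows` holds two fields per record, while `naive` holds three pieces for the same record, which is the kind of ambiguity the comments are referring to.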
What are some alternatives?
pg_cjk_parser - Postgres CJK Parser pg_cjk_parser is a fts (full text search) parser derived from the default parser in PostgreSQL 11. When a postgres database uses utf-8 encoding, this parser supports all the features of the default parser while splitting CJK (Chinese, Japanese, Korean) characters into 2-gram tokens. If the database's encoding is not utf-8, the parser behaves just like the default parser.
dextool - Suite of C/C++ tooling built on LLVM/Clang
pgvector - Open-source vector similarity search for Postgres
structured-text-tools - A list of command-line tools for manipulating structured text data
awesome-personal-blogs - A delightful list of personal tech blogs
csvtk - A cross-platform, efficient and practical CSV/TSV toolkit in Golang
awesome-ml - Curated list of useful LLM / Analytics / Datascience resources
q - Quick and dirty debugging output for tired programmers. ⛺
knowledge - Everything I know
goawk - A POSIX-compliant AWK interpreter written in Go, with CSV support
zsv - zsv+lib: tabular data swiss-army knife CLI + world's fastest (simd) CSV parser