| | simonwillisonblog | xsv |
|---|---|---|
| | 28 | 64 |
| Stars | 163 | 10,089 |
| Growth | - | - |
| Activity | 8.1 | 0.0 |
| Last commit | about 16 hours ago | 2 months ago |
| Language | JavaScript | Rust |
| License | Apache License 2.0 | The Unlicense |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
simonwillisonblog
- Sandboxing Python with Win32 App Isolation
- AI for Web Devs: Addressing Bugs, Security, & Reliability
Simon Willison has pointed out several examples of prompt injection attacks and why it may never be a solved problem:
- Where Have All the Websites Gone?
I want more people to have link blogs.
I have one in the sidebar of https://simonwillison.net/ which I've been running since November 2003. You can search through all 6,836 links here: https://simonwillison.net/search/?type=blogmark
I can post things to it with a bookmarklet. It has an Atom feed.
It's such a low-friction way of publishing. A lot of https://daringfireball.net works like this too. I also like https://waxy.org/ and https://kottke.org/ for this.
I'd love to see more of these.
- Ask HN: Is it feasible to train my own LLM?
- Moving Away from Substack
My approach is to publish to my own blog at https://simonwillison.net and then copy and paste content from that into a Substack newsletter at https://simonw.substack.com a few times a month.
It's been working really well.
Substack don't have an API, but they do support copy and paste - so I built myself a tool that assembles my blog content into rich text I can copy and paste straight into the Substack editor.
I wrote about how that works here: https://simonwillison.net/2023/Apr/4/substack-observable/
- Building a Blog in Django
Hah, yeah securing something like WordPress can be a challenge, especially if you're running a bunch of plugins.
My blog is a pretty straight-forward Django setup without many other dependencies, so it's a lot less of an attack surface: https://github.com/simonw/simonwillisonblog
- Show HN: Superfunctions – AI prompt templates as an API
That specific prompt is just an example and it's pretty bad, it was the shortest and simplest prompt I could come up with that would be easily understood.
You can set response content-types (text, html, json, etc.). If you use json it will get pretty good results because I have some logic to attempt to pick out json or json5 objects from the text output. I don't yet have logic to support json arrays, but I'm hoping to add that soon.
But client-side validation is still needed for applications with untrusted input. I don't attempt to solve prompt injection. I saw a lot of interesting posts on this topic on this blog: https://simonwillison.net/. I need to find some time to read more about it.
Try this one instead, it should be better
- Stopping at 90%
I've started to consider "commit to writing about it" as the price I have to pay for giving into the lure of another project. It's one of the main reasons I publish so much content on https://simonwillison.net/ and https://til.simonwillison.net
A project with a published write-up unlocks so much more value than one which you complete without giving others a chance of understanding what you built.
I've maintained internal blogs (sometimes just a Slack channel or Confluence area) at previous employers for this purpose too.
- Stanford A.I. Courses
I think you are asking specifically about practical LLM engineering and not the underlying science.
Honestly this is all moving so fast you can do well by reading the news, following a few reddits/substacks, and skimming the prompt engineering papers as they come out every week (!).
https://www.latent.space/p/ai-engineer provides an early manifesto for this nascent layer of the stack.
Zvi writes a good roundup (though he is concerned mostly with alignment so skip if you don’t like that angle): https://thezvi.substack.com/p/ai-18-the-great-debate-debates
Simon W has some good writeups too: https://simonwillison.net/
I strongly recommend playing with the OpenAI APIs and working with langchain in a Colab notebook to get a feel for how these all fit together. Also, the tools here are incredibly simple and easy to understand (very new), so it's worth looking at, say, https://github.com/minimaxir/simpleaichat/tree/main/simpleai... or https://github.com/smol-ai/developer and digging in to the prompts, what goes in system vs assistant roles, how you guide the LLM, etc.
- Seeking Your Top Recommendations for Resources on ChatGPT and Generative AI
Simon Willison's Weblog
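The Superfunctions comment above mentions logic that tries to pick JSON objects out of free-form text output. The snippet below is only a minimal sketch of that general idea using Python's standard library; it is not the author's implementation, and the function name `extract_first_json_object` and the sample reply are hypothetical. Like the approach described in the comment, it handles objects only, not top-level arrays, and it does not handle json5.

```python
import json


def extract_first_json_object(text):
    """Return the first parseable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not valid JSON at this brace: try the next opening brace.
        start = text.find("{", start + 1)
    return None


# Hypothetical model reply with prose wrapped around the JSON payload.
reply = 'Sure! Here is the result: {"name": "xsv", "stars": 10089} Hope that helps.'
print(extract_first_json_object(reply))
```

Extraction of this kind only recovers structure. As the comment itself notes, it does nothing about untrusted input, so the parsed values still need validation on the caller's side.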
xsv
- Show HN: TextQuery – Query and Visualize Your CSV Data in Minutes
I realize it's not really that comparable since these tools don't support SQL, but a more fully featured CLI tool is https://github.com/BurntSushi/xsv
They are both fairly good
- Qsv: Efficient CSV CLI Toolkit
- Joining CSV Data Without SQL: An IP Geolocation Use Case
I have done some similar, simpler data wrangling with xsv (https://github.com/BurntSushi/xsv) and jq. It could process my 800M rows in a couple of minutes (plus the time to read it out from the database =)
- Qsv: CSVs sliced, diced and analyzed (fork of xsv)
xsv, which seems to be why qsv was created.
[1] https://github.com/BurntSushi/xsv/issues/267
- I wrote this iCalendar (.ics) command-line utility to turn common calendar exports into more broadly compatible CSV files.
CSV utilities (still haven't picked a favorite one...): https://github.com/harelba/q https://github.com/BurntSushi/xsv https://github.com/wireservice/csvkit https://github.com/johnkerl/miller
- Icsp – Command-line iCalendar (.ics) to CSV parser
- ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}
    $ git remote -v
    origin git@github.com:rust-lang/rust (fetch)
    origin git@github.com:rust-lang/rust (push)
    $ git rev-parse HEAD
    3b0d4813ab461ec81eab8980bb884691c97c5a35
    $ time grep -ri burntsushi ./
    ./src/tools/cargotest/main.rs: repo: "https://github.com/BurntSushi/ripgrep",
    ./src/tools/cargotest/main.rs: repo: "https://github.com/BurntSushi/xsv",
    grep: ./target/debug/incremental/cargotest-2dvu4f2km9e91/s-gactj3ma2j-1b10l4z-2l60ur55ixe6n/query-cache.bin: binary file matches
    grep: ./target/debug/incremental/cargotest-38cpmhhbdgdyq/s-gactj3luwq-1o12vgp-t61hd8qdyp7t/query-cache.bin: binary file matches
    grep: ./target/debug/incremental/cargotest-17632op6djxne/s-gawuq5468i-1h69nfw-4gm0s8yhhiun/query-cache.bin: binary file matches
    grep: ./target/debug/incremental/cargotest-2trm4kt5yom3r/s-gawuq53qqg-bjiezj-lo0gha8ign8w/query-cache.bin: binary file matches
    grep: ./target/debug/deps/libregex_automata-c74a6d9fd0abd77b.rmeta: binary file matches
    grep: ./target/debug/deps/libsame_file-a0e0363a2985455d.rlib: binary file matches
    grep: ./target/debug/deps/libsame_file-a0e0363a2985455d.rmeta: binary file matches
    grep: ./target/debug/deps/libsame_file-7251d8d3586a319b.rmeta: binary file matches
    grep: ./build/x86_64-unknown-linux-gnu/stage0-sysroot/lib/rustlib/x86_64-unknown-linux-gnu/lib/libaho_corasick-999a08e2b700420d.rlib: binary file matches
    grep: ./build/x86_64-unknown-linux-gnu/stage0-sysroot/lib/rustlib/x86_64-unknown-linux-gnu/lib/libregex_automata-0d168be5d25b3ac5.rlib: binary file matches
    grep: ./build/x86_64-unknown-linux-gnu/stage0-tools/x86_64-unknown-linux-gnu/release/deps/libregex_automata-7d6bec0156f15da1.rlib: binary file matches
    grep: ./build/x86_64-unknown-linux-gnu/stage0-tools/x86_64-unknown-linux-gnu/release/deps/libregex_automata-7d6bec0156f15da1.rmeta: binary file matches
    grep: ./build/x86_64-unknown-linux-gnu/stage0-tools/x86_64-unknown-linux-gnu/release/deps/libaho_corasick-07dee4514b87d99b.rmeta: binary file matches
    grep: ./build/x86_64-unknown-linux-gnu/stage0-tools/x86_64-unknown-linux-gnu/release/deps/libaho_corasick-07dee4514b87d99b.rlib: binary file matches
    grep: ./build/x86_64-unknown-linux-gnu/stage0-rustc/x86_64-unknown-linux-gnu/release/deps/libaho_corasick-999a08e2b700420d.rlib: binary file matches
    grep: ./build/x86_64-unknown-linux-gnu/stage0-rustc/x86_64-unknown-linux-gnu/release/deps/libaho_corasick-999a08e2b700420d.rmeta: binary file matches
    grep: ./build/x86_64-unknown-linux-gnu/stage0-rustc/x86_64-unknown-linux-gnu/release/deps/libregex_automata-0d168be5d25b3ac5.rlib: binary file matches
    grep: ./build/x86_64-unknown-linux-gnu/stage0-rustc/x86_64-unknown-linux-gnu/release/deps/libregex_automata-0d168be5d25b3ac5.rmeta: binary file matches
    grep: ./build/bootstrap/debug/deps/libaho_corasick-992e1ba08ef83436.rmeta: binary file matches
    grep: ./build/bootstrap/debug/deps/libignore-54d41239d2761852.rmeta: binary file matches
    grep: ./build/bootstrap/debug/deps/libsame_file-9a5e3ddd89cfe599.rlib: binary file matches
    grep: ./build/bootstrap/debug/deps/libregex_automata-8e700951c9869a66.rlib: binary file matches
    grep: ./build/bootstrap/debug/deps/libignore-54d41239d2761852.rlib: binary file matches
    grep: ./build/bootstrap/debug/deps/libaho_corasick-992e1ba08ef83436.rlib: binary file matches
    grep: ./build/bootstrap/debug/deps/libregex_automata-8e700951c9869a66.rmeta: binary file matches
    grep: ./build/bootstrap/debug/deps/libsame_file-9a5e3ddd89cfe599.rmeta: binary file matches
    real 16.683
    user 15.793
    sys 0.878
    maxmem 8 MB
    faults 0
- Any Linux admins willing to try Pygrep?
Unrelated, are you the same burntsushi that wrote xsv?
- Analyzing multi-gigabyte JSON files locally
If it could be tabular in nature, maybe convert to sqlite3 so you can make use of indexing, or CSV to make use of high-performance tools like xsv or zsv (the latter of which I'm an author).
https://github.com/BurntSushi/xsv
https://github.com/liquidaty/zsv/blob/main/docs/csv_json_sql...
- What monitoring tool do you use or recommend?
Oh and there's rad cli shit out there for CSV files too, like xsv
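The "Analyzing multi-gigabyte JSON files locally" comment above suggests converting tabular data to sqlite3 so that queries can use an index. As a rough sketch of that suggestion, the snippet below loads CSV rows into an in-memory SQLite table with Python's standard library and indexes one column. The sample data, the table name `records`, the index name, and the helper `load_csv_into_sqlite` are all hypothetical; a real multi-gigabyte file would be opened from disk and written to an on-disk database.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a large tabular export.
SAMPLE_CSV = """ip,country
1.0.0.1,AU
8.8.8.8,US
9.9.9.9,US
"""


def load_csv_into_sqlite(conn, csv_file, table="records"):
    """Create `table` from the CSV header row and bulk-insert the remaining rows."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    # executemany consumes the reader lazily, one row at a time.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(conn, io.StringIO(SAMPLE_CSV))

# Index the column used for lookups so filters on it avoid a full table scan.
conn.execute('CREATE INDEX idx_records_country ON records ("country")')

us_count = conn.execute(
    "SELECT COUNT(*) FROM records WHERE country = ?", ("US",)
).fetchone()[0]
print(us_count)
```

Every column is stored as TEXT here to keep the sketch short; numeric columns would need explicit types or casts before range queries behave as expected.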
What are some alternatives?
pg_cjk_parser - Postgres CJK Parser pg_cjk_parser is a fts (full text search) parser derived from the default parser in PostgreSQL 11. When a postgres database uses utf-8 encoding, this parser supports all the features of the default parser while splitting CJK (Chinese, Japanese, Korean) characters into 2-gram tokens. If the database's encoding is not utf-8, the parser behaves just like the default parser.
csvtk - A cross-platform, efficient and practical CSV/TSV toolkit in Golang
pgvector - Open-source vector similarity search for Postgres
miller - Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
awesome-personal-blogs - A delightful list of personal tech blogs
ripgrep - ripgrep recursively searches directories for a regex pattern while respecting your gitignore
tsv-utils - eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
Servo - Servo, the embeddable, independent, memory-safe, modular, parallel web rendering engine
awesome-ml - Curated list of useful LLM / Analytics / Datascience resources
Fractalide - Reusable Reproducible Composable Software
knowledge - Everything I know
svgcleaner - svgcleaner could help you to clean up your SVG files from the unnecessary data.