AI will enable mass spying

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • RedPajama-Data

    The RedPajama-Data repository contains code for preparing large datasets for training large language models.

  • There's a lot of speculation in the comments, so I want to talk about the technology that we have __TODAY__. I post a lot about being in ML research, and while my focus is on image generation, I'm also working with another team on a different task that I'm not going to state explicitly, for obvious reasons.

    What can AI/ML do __today__?

    We have lots of ways to track people around a building or city. The challenge is to do these tasks through multi-camera systems. This includes things like people tracking (a person gets a random ID that stays consistent across cameras), face identification (a more specific representation that is independent of clothing, which is what people tracking usually keys on), gait tracking (how one walks), and device tracking (based on Bluetooth, Wi-Fi, and cellular signals). These tools have had mixed success, but here is the part that should concern you: right now these are mostly ResNet50 models, the datasets are small, and they are not using advanced training techniques. That is changing. There are legal issues and datasets are becoming proprietary, but the size and frequency of data gathering are growing.

    I'm not going to talk about social media because the metadata problem is already well discussed, you all have already made your decisions, and we've witnessed the results of those decisions. I'm also not going to talk about China, the most surveilled country in the world, the UK, or any of that, for similar reasons. I'll keep this general, so that it holds regardless of country.

    What I will talk about is that modern ML has greatly accelerated the data gathering sector. Your threat models have changed from governments rushing to gather all the data that they can, to big companies joining the game, to now small mom and pop shops doing so. I __really__ implore you all to look at what's in that dataset[0]. There are 5B items, and this tool retrieves them based on CLIP embeddings. You might think "oh yes, Google can already do this" but the difference is that you can't download Google. Google does not give you 16.5TB of CLIP-filtered images, text, and metadata. Or look into the RedPajama dataset[1], which has >30T tokens and 5TB of storage. With 32k tokens being about 50 pages, that's about 47 billion pages. That is a stack of paper roughly 5000km tall, about 12x the altitude of the ISS and bigger than the diameter of the moon. I know we all understand that there's big data collection, but do you honestly understand how big these numbers are? I wouldn't even claim to, because I cannot accurately conceptualize the size of the moon nor the distance to the ISS. They just roll into the "big" bin in my brain.

    Today, these systems can track you with decent accuracy even if you use basic obfuscation techniques like glasses, hats, or even a surgical mask. Today, we can track you not just by image but by how you walk, and can with moderate success do this through walls (meaning there is no camera for you to spot if you want to know whether you're being tracked). Today, these systems can de-anonymize you through the unique text patterns that you use (see the Enron dataset, but at scale). Today, these machines can produce uncanny-valley replicas of your speech and text. Today, we can make images of people that are convincingly real. Today, these tools aren't exclusive to governments or trillion dollar corporations, but are available to any person who is willing to spend a few thousand dollars on compute.

    I don't want to paint this as a picture of doom and gloom. These tools are amazing and have the potential to do extraordinary good, at levels that would have been unimaginable only a few decades ago. Even many of the tools that can invade your privacy are beneficial in some ways; we just need to consider context. You cannot build a post-scarcity society when you require humans to monitor all stores.

    But like Uncle Ben says, with great power comes great responsibility. A technology that has the capacity to do tremendous good also has the power to cause tremendous horrors.

    The choice is ours, and the latter prevails when we are not open. We must always push for these tools to be used for good, because with them we can truly do amazing things. We do not need AGI to create a post-scarcity world, and I have no doubt that were this to become our primary goal, we could easily reach it within our lifetime without becoming a Sci-Fi dystopia and while tackling existential issues such as climate. To poke the bear a little, I'd argue that if your country wants to show dominance and superiority on the global stage, that is done not through military power but through technology. You will win the culture war of all culture wars, and whoever creates the post-scarcity world will be a country that will never be forgotten by time. Lift a billion people out of poverty? Try lifting 8 billion not just out of poverty, but into the lower middle class, where no child goes hungry. That is something humans will never forget. So maybe this should be our cold war, not the one in the Pacific. If you're so great, truly, truly show me how superior your country/technology/people are. This is a battle that can be won by anyone at this point, not just China vs the US; even any European power has the chance to win.

    [0] https://rom1504.github.io/clip-retrieval/

    [1] https://github.com/togethercomputer/RedPajama-Data
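
The page-count estimate in the comment above can be checked with a few lines of arithmetic. This is only a rough sketch: the token count and the "32k tokens is about 50 pages" ratio come from the comment, while the sheet thickness (0.1 mm, one page per sheet), the ISS altitude (about 400 km), and the Moon's diameter (about 3,474 km) are assumptions added here for illustration.

```python
# Back-of-the-envelope check of the estimate in the comment above.
# From the comment: >30T tokens, and 32k tokens is about 50 pages.
TOKENS = 30e12
TOKENS_PER_CHUNK = 32_000
PAGES_PER_CHUNK = 50

# Assumptions NOT stated in the comment (illustrative values only).
SHEET_THICKNESS_MM = 0.1   # typical office paper, one page per sheet
ISS_ALTITUDE_KM = 400      # approximate orbital altitude
MOON_DIAMETER_KM = 3_474   # approximate mean diameter


def estimate_pages(tokens: float) -> float:
    """Convert a token count into pages using the comment's ratio."""
    return tokens / TOKENS_PER_CHUNK * PAGES_PER_CHUNK


def stack_height_km(pages: float) -> float:
    """Height of a stack of single-page sheets, converted from mm to km."""
    return pages * SHEET_THICKNESS_MM / 1e6


pages = estimate_pages(TOKENS)
height = stack_height_km(pages)

print(f"pages:            {pages / 1e9:.1f} billion")
print(f"stack height:     {height:.1f} km")
print(f"vs ISS altitude:  {height / ISS_ALTITUDE_KM:.1f}x")
print(f"vs Moon diameter: {height / MOON_DIAMETER_KM:.2f}x")
```

Under these assumptions the stack comes out at roughly 4,700 km, which is about 12 times the assumed ISS altitude and somewhat larger than the Moon's diameter. A different sheet thickness or double-sided printing would shift the result proportionally.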


