That’s interesting.
I have predictive models that take just a headline (w/o the rest of the article and not considering the URL) and predict (a) whether it will get more than 10 votes and (b) if it does get more than 10 votes, whether the votes/comments ratio will be more than 2 (which is roughly average).
The first model gets a ROC-AUC (see https://scikit-learn.org/stable/modules/generated/sklearn.me...) in the low 60’s (not good), the second model gets in the low 70’s (actually pretty good, though it is a heat-seeking missile for clickbait headlines), and my latest content-based recommender for RSS items gets almost 80. (I saw a paper saying that one system at TikTok gets about 85.)
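For anyone who hasn't met ROC-AUC: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so 0.5 is coin-flipping and 1.0 is a perfect ranking (the "low 60's" above would be roughly 0.60 to 0.65 on that scale). Here is a minimal pure-Python sketch of that definition. The labels and scores are made-up toy data, not output from my models.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the chance a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = headline got more than 10 votes.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"ROC-AUC = {auc:.3f}")
```

The pairwise loop is quadratic, which is fine for illustration. On real data you would use a library routine instead.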
To do all that you need about 10,000 headlines, and you don’t get a lot of benefit from having more than 100,000. The ceilings on performance have more to do with the nature of the problem than with my models: the same article can get submitted twice and get 0 votes one time and 200 the other, so the prediction can never be as accurate as “is this an article about galactic astronomy?”
I had it ingest the HN comments firehose and found the number of items was overwhelming, so my YOShInOn RSS reader now ingests the “best comments” feed from
https://hnrss.github.io/
together with 110 other feeds, and I actually like the comments it picks out a lot. Now that the system is adding about 3000 items per day, it might be able to handle a big feed like the comments firehose, since those comments would now be diluted with so many quality articles. For a problem like that you might want a two-score system: (i) is it relevant? (something I like) and (ii) is it popular? (like Google’s PageRank)
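To make the two-score idea concrete, here is one way it could be wired up. This is just a sketch under my own assumptions: the function name, the 0.5 relevance cutoff, and ranking by the product of the two scores are all illustrative choices, not something the system described above actually does.

```python
def rank_items(items, min_relevance=0.5):
    """Rank (title, p_relevant, p_popular) tuples with a two-score rule.

    Assumption for illustration: drop anything below a relevance cutoff,
    then order what is left by relevance * popularity.
    """
    kept = [item for item in items if item[1] >= min_relevance]
    return sorted(kept, key=lambda item: item[1] * item[2], reverse=True)


# Hypothetical scores from two separate models.
items = [
    ("A", 0.9, 0.2),  # very relevant, not popular
    ("B", 0.6, 0.8),  # fairly relevant, popular
    ("C", 0.3, 0.9),  # popular but not relevant: filtered out
]
ranked = rank_items(items)
for title, p_rel, p_pop in ranked:
    print(title, round(p_rel * p_pop, 2))
```

Keeping the two scores separate until the last step means you can tune the relevance cutoff without retraining either model.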
I think you could make a model that compares comments in the best comments feed with other comments. I have tried formulating the problems above as regression problems, where I try to predict the actual score, and it does not work well because of the uncertainty problem. Formulated as a classification problem for a score over a threshold, though, it is easy to make a well-calibrated model that tells you “this article has a 20% chance of frontpaging”, which is about the best anyone can do.
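A rough sketch of what "classification over a threshold" and "well-calibrated" mean in practice: turn raw vote counts into 0/1 labels, then check that among items the model scored around 20%, roughly 20% really did clear the threshold. The helper names, the bin count, and the toy numbers are all hypothetical.

```python
def to_labels(vote_counts, threshold=10):
    """Turn raw scores into a binary target: 1 if votes exceed the threshold."""
    return [1 if v > threshold else 0 for v in vote_counts]


def calibration_table(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns one (mean predicted probability, observed positive rate, count)
    row per non-empty bin. For a well-calibrated model the first two
    numbers in each row are close to each other.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            rows.append((round(mean_p, 3), round(rate, 3), len(b)))
    return rows


# Toy data: five headlines all predicted at 20%, one of which took off.
votes = [200, 0, 3, 1, 7]
labels = to_labels(votes)
table = calibration_table([0.2] * 5, labels)
print(table)
```

Note that the same headline landing at 0 votes once and 200 another time is no problem for this check: a calibrated 20% just means that about one in five such headlines takes off.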