Building a README Crawler With Node.js

This page summarizes the projects mentioned and recommended in the original post on dev.to.

  • readme-crawler

    A Node.js web crawler to download README files and follow contained links. Fetch repositories from a valid GitHub URL.

  • A recent project of mine was this Node.js web crawler. Working on it led to the idea for another crawler: I wanted a way to navigate through GitHub and search for obvious typos. I had this idea after stumbling across silly typos on numerous portfolio pages. Perhaps I could help fix these errors and make these portfolio sites more presentable.

  • node-fetch

    A light-weight module that brings the Fetch API to Node.js

  • To execute the algorithm, we will use Node.js (for the JavaScript runtime) and node-fetch (for network requests), which means we will run the code locally from the command line. For this project, we will have an output folder to store all the README data, as well as a list (queue) of repository URLs to visit. Before diving into the code, it is important to plan the input and output of the algorithm: this web crawler will start at a valid GitHub repository page, which is one URL string, and after visiting each page with a README, it will export the data into a new file.

    Now let's cover the process of requesting a repository page from a URL. Here we only care about saving the README file that is displayed, and we will ignore any other links that GitHub shows (such as the navbar). We will send a URL request with node-fetch and get back an HTML string. If we convert that HTML string to a DOM tree, we can search for a specific element: GitHub stores the README file under a div with the class "markdown-body". We can use a library called 'jsdom' to apply browser API methods and return that specific node, as in the sketch below; the full queue-driven loop is sketched after the usage example further down.
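    A minimal sketch of that fetch-and-parse step, assuming node-fetch v3 and jsdom are installed (the helper name fetchReadmeNode is ours, not from the post):

      import fetch from 'node-fetch';
      import { JSDOM } from 'jsdom';

      // Fetch a repository page and return its README node, or null
      // if the page has no rendered README.
      async function fetchReadmeNode(repoUrl) {
        const response = await fetch(repoUrl); // request the page
        const html = await response.text();    // the result is an HTML string
        const dom = new JSDOM(html);           // convert the string into a DOM tree
        // GitHub renders the README inside a div with the class "markdown-body".
        return dom.window.document.querySelector('div.markdown-body');
      }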

  • node

    Node.js JavaScript runtime ✨🐢🚀✨

  • lists

    The definitive list of lists (of lists) curated on GitHub and elsewhere

  • import ReadMeCrawler from 'readme-crawler';

    const crawler = new ReadMeCrawler({
      startUrl: 'https://github.com/jnv/lists',
      followReadMeLinks: true,
      outputFolderPath: './output/'
    });

    crawler.run();
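    For illustration, the queue-driven loop described in the post could be wired up roughly as below. This is a sketch under stated assumptions, not readme-crawler's actual source: fetchReadmeNode is the illustrative helper from the earlier snippet, and the file-naming scheme is our own.

      import { mkdirSync, writeFileSync } from 'node:fs';
      import { join } from 'node:path';

      async function crawl(startUrl, outputFolderPath) {
        mkdirSync(outputFolderPath, { recursive: true });

        const queue = [startUrl];   // repository URLs still to visit
        const visited = new Set();  // skip pages we have already saved

        while (queue.length > 0) {
          const url = queue.shift();
          if (visited.has(url)) continue;
          visited.add(url);

          const readme = await fetchReadmeNode(url); // helper sketched earlier
          if (!readme) continue; // no README on this page

          // Export the README text into a new file named after the repository.
          const fileName = url.replace('https://github.com/', '').replace(/\//g, '-') + '.md';
          writeFileSync(join(outputFolderPath, fileName), readme.textContent);

          // Follow links contained in the README, ignoring everything
          // else on the page (navbar, sidebar, and so on).
          for (const anchor of readme.querySelectorAll('a[href^="https://github.com/"]')) {
            queue.push(anchor.href);
          }
        }
      }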
