
How to Scrape a List of Topics from a Subreddit with Bash



Fatmawati Achmad Zaenuri / Shutterstock.com

Reddit offers JSON feeds for every subreddit. Here's how to create a Bash script that downloads and parses a list of posts from any subreddit you like. This is just one of many things you can do with Reddit's JSON feeds.

Installing Curl and JQ

We'll use curl to retrieve the JSON feed from Reddit and jq to parse the JSON data and extract the fields we want from the results. Install these two dependencies with apt-get on Ubuntu and other Debian-based Linux distributions. On other Linux distributions, use your distribution's package management tool instead.

  sudo apt-get install curl jq 
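
If you're on a distribution that doesn't use apt, both packages are usually available under the same names; for example, on Fedora or Arch (assuming their stock repositories):

  sudo dnf install curl jq
  sudo pacman -S curl jq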

Get Some JSON Data from Reddit

Let's see what the data feed looks like. Use curl to get the latest posts from the MildlyInteresting subreddit:

  curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json

Note the options we used before the URL: -s forces curl to run in silent mode, so that we see no output except the data from Reddit's servers. The next option and its parameter, -A "reddit scraper example", set a custom user agent string that helps Reddit identify the service accessing its data. The Reddit API servers apply rate limits based on the user agent string. Setting a custom value means Reddit segments our rate limit away from other callers, reducing the chance that we hit an HTTP 429 Too Many Requests error.

The output should fill the terminal window and look something like this:

The output data contains many fields, but we're only interested in title, permalink, and url. For an exhaustive list of types and their fields, see Reddit's API documentation page: https://github.com/reddit-archive/reddit/wiki/JSON

Extracting Data from the JSON Output

We want to extract title, permalink, and url from the output data and save them in a tab-delimited file. We could use text-processing tools like sed and grep, but we have another tool at our disposal that understands JSON data structures: jq. For our first attempt, let's use it to pretty-print and color-code the output. We'll use the same call as before, but this time pipe the output through jq and instruct it to parse and print the JSON data.

  curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq .

Note the period that follows the command. This expression simply parses the input and prints it as-is. The output looks nicely formatted and color-coded:


Consider the structure of the JSON data we receive from Reddit. The root result is an object that contains two properties: kind and data. The latter holds a property called children, which contains an array of posts to this subreddit.
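
To make that concrete, here is a heavily trimmed sketch of the feed's shape; the field values are invented for illustration, and most fields are omitted:

{
  "kind": "Listing",
  "data": {
    "children": [
      {
        "kind": "t3",
        "data": {
          "title": "An example post title",
          "url": "https://i.redd.it/example.jpg",
          "permalink": "/r/MildlyInteresting/comments/abc123/an_example_post_title/",
          ...
        }
      },
      ...
    ]
  }
}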

Each element in the array is an object that also contains two fields called kind and data. The properties we want to grab are in the data object. jq expects an expression that can be applied against the input data and produces the desired output. It must describe the contents in terms of their hierarchy and membership in an array, as well as how the data should be transformed. Let's run the whole command again with the correct expression:

  curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq '.data.children | .[] | .data.title, .data.url, .data.permalink'

The output shows the title, URL, and permalink, each on its own line:


Let's dig into the jq command we called:

  jq '.data.children | .[] | .data.title, .data.url, .data.permalink'

There are three expressions in this command, separated by two pipe symbols. The results of each expression are passed to the next for further evaluation. The first expression filters everything except the array of Reddit listings. This output is piped into the second expression, .[], which unpacks the array so that each element is evaluated individually. The third expression acts on each element in the array and extracts the three properties we want. More information about jq and its expression syntax can be found in jq's official manual.
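
As an aside, jq can also build tab-separated lines itself using raw output (-r) and its @tsv filter; this one-liner is an alternative sketch, not the approach the rest of this article takes:

  curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq -r '.data.children | .[] | [.data.title, .data.url, .data.permalink] | @tsv'

The @tsv filter expects an array per post, which is why the three properties are wrapped in brackets.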

Putting It All Together in a Script

Let's put the API call and the JSON post-processing together in a script that will generate a file with the posts we want. We'll add support for fetching posts from any subreddit, not just /r/MildlyInteresting. Open your editor and copy the contents of this snippet into a file named scrape-reddit.sh:

#!/bin/bash

# Exit with an error if no subreddit name was supplied.
if [ -z "$1" ]
  then
    echo "Please specify a subreddit"
    exit 1
fi

SUBREDDIT=$1
NOW=$(date +"%m_%d_%y-%H_%M")
OUTPUT_FILE="${SUBREDDIT}_${NOW}.txt"

# Fetch the feed, extract the three fields, and write one
# tab-separated line per post with the quote marks stripped.
curl -s -A "bash-scrape-topics" https://www.reddit.com/r/${SUBREDDIT}.json | \
        jq '.data.children | .[] | .data.title, .data.url, .data.permalink' | \
        while read -r TITLE; do
                read -r URL
                read -r PERMALINK
                echo -e "${TITLE}\t${URL}\t${PERMALINK}" | tr --delete '"' >> "${OUTPUT_FILE}"
        done

This script first checks whether the user supplied a subreddit name. If not, it exits with an error message and a non-zero return code.

Next, it stores the first argument as the subreddit name and builds a date-stamped filename where the output will be saved.

The action begins when curl is called with a custom user agent header and the URL of the subreddit to scrape. The output is piped to jq, where it's parsed and reduced to the three fields we want: title, URL, and permalink. These lines are read, one at a time, and saved into a variable using the read command, all inside a while loop, which continues until there are no more lines to read. The last line of the inner while block echoes the three fields, delimited by tab characters, and then pipes it through the tr command so the double quotes can be stripped out. The output is then appended to a file.
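
The three-reads-per-iteration pattern is the subtle part: each pass through the loop consumes three consecutive lines from jq's output. A toy pipeline (with made-up sample strings) makes the behavior visible:

# Each loop iteration consumes three lines from the input stream.
printf 'title1\nurl1\nlink1\ntitle2\nurl2\nlink2\n' | \
        while read -r TITLE; do
                read -r URL
                read -r PERMALINK
                echo "got: ${TITLE} / ${URL} / ${PERMALINK}"
        done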

Before we can run this script, we have to make sure it has execute permission. Use the chmod command to apply the permission to the file:

  chmod u+x scrape-reddit.sh

Finally, run the script with a subreddit name:

  ./scrape-reddit.sh MildlyInteresting

An output file will be generated in the same directory, and its contents will look something like this:


Each line contains the three fields we're after, delimited by a tab character.
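
Because the file is plain tab-delimited text, the usual command-line tools can slice it. For instance, here's a quick sketch that prints just the titles from the most recent output file (assuming the filename pattern produced by the script above):

  cut -f 1 "$(ls -t MildlyInteresting_*.txt | head -n 1)"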

Going Further

Reddit is a goldmine of interesting content and media that is easily accessible through the JSON API. Now that you have a way to access this data and process the results, you can do the following:

  • Grab the latest headlines from /r/WorldNews and send them to your desktop with notify-send (see the sketch after this list).
  • Integrate the best jokes from /r/DadJokes into your system's Message of the Day.
  • Get today's best image from /r/aww and make it your desktop background.
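
As a taste of the first idea, here is a minimal sketch, assuming notify-send (from libnotify) is installed; the user agent string and jq expression follow the ones used earlier:

#!/bin/bash
# Fetch the newest /r/WorldNews post title and show it as a desktop notification.
HEADLINE=$(curl -s -A "bash-scrape-topics" https://www.reddit.com/r/WorldNews.json | \
        jq -r '.data.children | .[0] | .data.title')
notify-send "Latest from /r/WorldNews" "${HEADLINE}"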

All this is possible using the data provided and the tools you have on your system. Happy hacking!



