How to find all unique words in a file?

Use the grep command with a regular expression to extract the words, then pipe the result through sort and uniq to remove the duplicates.

Linux systems have a wide range of useful utilities for data processing, including searching through files. If you need a list of all unique words in a file, the grep command can produce it very quickly. Let’s have a look at how to extract the words with grep, then sort them and remove the duplicates.

grep --only-matching --extended-regexp '[a-zA-Z]+' NEWS | sort | uniq

Explanation

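We start the grep command with the --only-matching option, which prints only the part of a line that matches, one match per output line, instead of the whole line. The next step is to define what we are looking for. As we are not searching for a particular string, we use --extended-regexp to enable extended regular expression syntax, so the + quantifier can be written without a backslash. The regular expression itself, [a-zA-Z]+, matches any run of one or more lowercase or uppercase letters. Digits, punctuation, and whitespace are never part of a match, so they act as word separators. The last argument, NEWS, is the file to search.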
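To see what these options do in isolation, here is a minimal sketch that runs the same grep command on a single line of text. The sample sentence and the variable names are made up for illustration and do not come from the NEWS file.

```shell
# Hypothetical sample line, used only for illustration
line='Fixed bug #42: the log-rotation test failed twice'

# Each run of ASCII letters is printed on its own line;
# the digits, punctuation, hyphen, and spaces act as separators
words=$(printf '%s\n' "$line" | grep --only-matching --extended-regexp '[a-zA-Z]+')

printf '%s\n' "$words"
```

Note that "log-rotation" is split into two words, because the hyphen is not part of the character class.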

The output of this command is a long list of words, including duplicates. Because uniq only removes duplicate lines that are next to each other, the list is first ordered with sort. After that, uniq can filter out the repeats, so each word appears only once in the output.
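The reason for sorting before uniq can be shown with a small sketch. The four-word list and the variable names below are assumptions chosen for illustration only.

```shell
# Hypothetical word list in which "the" appears twice, but not on adjacent lines
list=$(printf '%s\n' the cat the dog)

# uniq on its own only collapses adjacent duplicates, so both "the" lines remain
unsorted=$(printf '%s\n' "$list" | uniq)

# Sorting first puts the duplicates next to each other, so uniq can drop one
sorted=$(printf '%s\n' "$list" | sort | uniq)

# sort -u combines both steps and should give the same list
short=$(printf '%s\n' "$list" | sort -u)

printf 'uniq only:\n%s\n\nsort | uniq:\n%s\n' "$unsorted" "$sorted"
```

If only the unique list is needed, sort -u saves one process. Keeping uniq as a separate step leaves room for its other options.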

Relevant commands in this article

Would you like to learn more about the commands used in this article? Have a look at the list below. For some of them there is also a cheat sheet available.

  • grep
  • sort
  • uniq

This article was written by our Linux security expert Michael Boelen. With a focus on creating high-quality articles and relevant examples, he wants to improve the field of Linux security. No more web full of copy-pasted blog posts.
