How to find all unique words in a file?
Use the grep command with a regular expression to extract the words, then sort the output and remove the duplicates.
Linux systems have a wide range of useful utilities for data processing, including searching through files. If you need to list all unique words in a file, the grep command can extract the words very quickly, and sort and uniq can then remove the duplicates. Let’s have a look at how to combine them.
grep --only-matching --extended-regexp '[a-zA-Z]+' NEWS | sort | uniq
Explanation
We start the grep command with the --only-matching option, which tells it to print only the matching part of each line, one match per line, instead of the whole line. The next step is to define what we are looking for. As we are not searching for a particular string, we use --extended-regexp so that the pattern is interpreted as an extended regular expression. The regular expression itself is [a-zA-Z]+, which matches any run of one or more lowercase or uppercase ASCII letters. Digits, punctuation, and whitespace are not part of a match, so they act as word separators. NEWS is the name of the file being searched.
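To see what the pattern treats as a word, here is a small sketch that runs the same grep options on a single line of text instead of a file. The helper name words and the sample sentence are made up for illustration; they are not part of the original command.

```shell
# Hypothetical helper: read text on stdin and print every run of
# ASCII letters on its own line, using the same options as above.
words() {
  grep --only-matching --extended-regexp '[a-zA-Z]+'
}

# Made-up sample input. The apostrophe and the digit are not letters,
# so "don't" is split into "don" and "t", and "v2" becomes just "v".
printf '%s\n' "don't stop, v2 is here" | words
```

With this pattern, contractions and words containing digits are split, which may or may not be what you want for your file.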
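The output of this command is a long list of words, one per line, including duplicates. To reduce it, we first pipe it through sort. This step is required because uniq only removes duplicate lines that are adjacent to each other; once the list is sorted, identical words sit next to each other and uniq can filter them so that each word appears only once. Note that the comparison is case-sensitive, so "The" and "the" count as two different words.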
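The role of the sort step can be illustrated with a short sketch. The function sample_words and its word list are hypothetical stand-ins for the output of grep; sort -u and uniq -c are standard options that may be useful alternatives, not something the original command uses.

```shell
# Hypothetical sample: one word per line, with duplicates that are
# not next to each other.
sample_words() {
  printf '%s\n' pear apple pear fig apple
}

# uniq alone only collapses adjacent repeats, so all five lines remain.
sample_words | uniq

# Sorting first places duplicates side by side, so uniq can drop them.
sample_words | sort | uniq

# sort -u gives the same result as sort | uniq in a single step.
sample_words | sort -u

# uniq -c prefixes each word with the number of times it occurred.
sample_words | sort | uniq -c
```

If you only need the list of unique words, sort -u is enough; keep the separate uniq step when you also want counts.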