Count words in string python

Python: Count Words in a String or File


In this tutorial, you’ll learn how to use Python to count the number of words and word frequencies in both a string and a text file. Being able to count words and word frequencies is a useful skill. For example, knowing how to do this can be important in text classification machine learning algorithms.

By the end of this tutorial, you’ll have learned:

  • How to count the number of words in a string
  • How to count the number of words in a text file
  • How to calculate word frequencies using Python

Reading a Text File in Python

The processes for counting words and calculating word frequencies shown below are the same whether you’re working with a string or an entire text file. Because of this, this section briefly describes how to read a text file in Python.

If you want a more in-depth guide on how to read a text file in Python, check out this tutorial here. Here is a quick piece of code that you can use to load the contents of a text file into a Python string:

# Reading a Text File in Python
file_path = '/Users/datagy/Desktop/sample_text.txt'
with open(file_path) as file:
    text = file.read()

I encourage you to check out the tutorial to learn why and how this approach works. However, if you’re in a hurry, just know that the process opens the file, reads its contents, and then closes the file again.
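As a minimal sketch of how the file-reading step feeds into a word count, the example below writes a small temporary file, reads it back, and counts its words with the split() method covered in the next section. The sample text, the helper name count_words_in_file, and the explicit encoding argument are assumptions made for illustration, not part of the original tutorial.

```python
import os
import tempfile


def count_words_in_file(file_path, encoding="utf-8"):
    """Return the number of whitespace-separated words in a text file."""
    with open(file_path, encoding=encoding) as file:
        text = file.read()
    return len(text.split())


# Create a throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Welcome to datagy!\nHere you will learn Python.\n")
    sample_path = tmp.name

word_total = count_words_in_file(sample_path)
print(word_total)  # 8

os.remove(sample_path)  # clean up the temporary file
```

Passing the encoding explicitly is a design choice here: it avoids depending on the platform default when the file contains non-ASCII characters.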

Count Number of Words In Python Using split()

One of the simplest ways to count the number of words in a Python string is by using the split() method. Its signature looks like this:

# Understanding the split() function
str.split(
    sep=None,     # The delimiter to split on
    maxsplit=-1   # The number of times to split
)

By default, Python will consider runs of consecutive whitespace to be a single separator. This means that if our string had multiple spaces, they’d only be considered a single delimiter. Let’s see what this method returns:

# Splitting a string with .split()
text = 'Welcome to datagy! Here you will learn Python and data science.'
print(text.split())

# Returns:
# ['Welcome', 'to', 'datagy!', 'Here', 'you', 'will', 'learn', 'Python', 'and', 'data', 'science.']

We can see that the method now returns a list of items. Because we can use the len() function to count the number of items in a list, we’re able to generate a word count. Let’s see what this looks like:

# Counting words with .split()
text = 'Welcome to datagy! Here you will learn Python and data science.'
print(len(text.split()))

# Returns: 11
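The whitespace behavior described earlier can be checked with a short sketch. The sample string below is made up for illustration; it mixes repeated spaces, a tab, and a newline to show how calling split() with no argument differs from splitting on a single space.

```python
text = "Welcome   to\tdatagy!\nHere  you will learn Python."

# No argument: runs of spaces, tabs, and newlines act as one separator
default_words = text.split()

# Explicit single space: every space is a separator, so empty strings appear
space_words = text.split(" ")

print(len(default_words))  # 8
print(len(space_words))    # 9
print(space_words)
```

For word counting, the no-argument form is usually the safer choice, because the explicit separator yields empty strings and leaves the tab and newline inside a single item.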

Count Number of Words In Python Using Regex

Another simple way to count the number of words in a Python string is to use the regular expressions library, re. The library comes with a function, findall(), which returns every non-overlapping match of a pattern in a string.


Because regular expressions search for patterns, we must first define our pattern. In this case, we want runs of word characters, with whitespace or punctuation marking where one word ends and the next begins.

For this, we can use the pattern \w+, where \w matches any word character (a letter, a digit, or an underscore) and the + denotes one or more occurrences. Once the pattern encounters a non-word character, such as a space or a punctuation mark, the match ends there.

Let’s see how we can use this method to generate a word count using the regular expressions library, re:

# Counting words with regular expressions
import re

text = 'Welcome to datagy! Here you will learn Python and data science.'
print(len(re.findall(r'\w+', text)))

# Returns: 11
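Both split() and \w+ give 11 for the sentence above, but they can disagree once a text contains apostrophes or hyphens. The sketch below uses a made-up sentence to show the difference. The third pattern is an assumed alternative, not something from the original tutorial; it keeps apostrophes and hyphens inside a word.

```python
import re

text = "Don't over-think it: it's a well-known trick."

# Whitespace splitting keeps punctuation attached to each word
split_words = text.split()

# \w+ breaks words apart at apostrophes and hyphens
regex_words = re.findall(r"\w+", text)

# Assumed alternative: allow ' or - between runs of word characters
joined_words = re.findall(r"\w+(?:['-]\w+)*", text)

print(len(split_words))   # 7
print(len(regex_words))   # 11
print(joined_words)
```

Which count is "right" depends on what should be treated as a word in a given text, so it is worth testing the pattern on a sample of real input.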

Calculating Word Frequencies in Python

In order to calculate word frequencies, we can use either the defaultdict class or the Counter class. Word frequencies represent how often a given word appears in a piece of text.

Using defaultdict To Calculate Word Frequencies in Python

Let’s see how we can use defaultdict to calculate word frequencies in Python. The defaultdict class extends the regular Python dictionary by automatically creating a default value for any missing key the first time it is accessed.

Because of this, we can loop over a piece of text and count the occurrences of each word. Let’s see how we can use it to create word frequencies for a given string:

# Creating word frequencies with defaultdict
from collections import defaultdict
import re

text = 'welcome to datagy! datagy will teach data. data is fun. data data data!'

counts = defaultdict(int)
for word in re.findall(r'\w+', text):
    counts[word] += 1

print(counts)

# Returns:
# defaultdict(<class 'int'>, {'welcome': 1, 'to': 1, 'datagy': 2, 'will': 1, 'teach': 1, 'data': 5, 'is': 1, 'fun': 1})

Let’s break down what we did here:

  1. We imported both the defaultdict class and the re library
  2. We loaded some text and instantiated a defaultdict using the int factory function
  3. We then looped over each word in the word list and added one for each time it occurred
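The steps above can be sketched next to the equivalent plain-dictionary loop, which shows what defaultdict(int) saves. The sample text and the lowercasing step are assumptions added for illustration; lowercasing makes 'Data' and 'data' count as the same word.

```python
from collections import defaultdict
import re

text = "Data is fun. data is everywhere. DATA!"
words = re.findall(r"\w+", text.lower())  # lowercase first so casing is ignored

# defaultdict(int) supplies 0 for a missing key on first access
counts = defaultdict(int)
for word in words:
    counts[word] += 1

# Equivalent loop with a plain dict and .get()
plain_counts = {}
for word in words:
    plain_counts[word] = plain_counts.get(word, 0) + 1

print(dict(counts))  # {'data': 3, 'is': 2, 'fun': 1, 'everywhere': 1}
```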

Using Counter to Create Word Frequencies in Python

Another way to do this is to use the Counter class. The benefit of this approach is that we can also easily identify the most frequent word. Let’s see how we can use this approach:

# Creating word frequencies with Counter
from collections import Counter
import re

text = 'welcome to datagy! datagy will teach data. data is fun. data data data!'

counts = Counter(re.findall(r'\w+', text))
print(counts)

# Returns:
# Counter({'data': 5, 'datagy': 2, 'welcome': 1, 'to': 1, 'will': 1, 'teach': 1, 'is': 1, 'fun': 1})

Let’s break down what we did here:

  1. We imported our required libraries and classes
  2. We passed the resulting list from the findall() function into the Counter class
  3. We printed the result of this class

One of the perks of this is that we can easily find the most common word by using the .most_common() method. The method returns a list of tuples, ordered from most common to least common. Because of this, we can simply access the 0th index to find the most common word:

# Finding the Most Common Word
from collections import Counter
import re

text = 'welcome to datagy! datagy will teach data. data is fun. data data data!'

counts = Counter(re.findall(r'\w+', text))
print(counts.most_common()[0])

# Returns:
# ('data', 5)
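As a small extension of the same example, .most_common() also accepts an optional number of items to return, which avoids building the full sorted list when only the top few words matter. The sketch below reuses the tutorial's text. The variable names top_two and total_words are chosen here for illustration.

```python
from collections import Counter
import re

text = 'welcome to datagy! datagy will teach data. data is fun. data data data!'
counts = Counter(re.findall(r'\w+', text))

# Ask for only the two most frequent words
top_two = counts.most_common(2)
print(top_two)  # [('data', 5), ('datagy', 2)]

# A word that never appeared reports 0 instead of raising KeyError
print(counts['python'])  # 0

# Summing the frequencies recovers the total word count
total_words = sum(counts.values())
print(total_words)  # 13
```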

Conclusion

In this tutorial, you learned how to generate word counts and word frequencies using Python. You learned a number of different ways to count words, including using the .split() method and the re library. Then, you learned different ways to generate word frequencies using defaultdict and Counter. Using the Counter class, you were able to find the most frequent word in a string.


Count Words in String in Python

  1. Use the split() and len() Methods to Count Words in Python String
  2. Use RegEx Module to Count Words in Python String
  3. Use sum(), strip(), and split() Methods to Count Words in Python String
  4. Use the count() Method to Count Words in Python String

This tutorial will introduce how to count words in a Python string.

Use the split() and len() Methods to Count Words in Python String

split() is a built-in string method that breaks a string into a list of substrings around a separator. It accepts up to two optional parameters:

  • separator (optional): the delimiter (e.g. a comma, semicolon, quote, or slash) that marks where the string is split. If no separator is specified, the string is split on runs of whitespace (spaces, newlines, tabs, etc.).
  • maxsplit (optional): the maximum number of splits. The default value is -1, which means there is no limit and the string is split at every occurrence of the separator.

str.split(separator, maxsplit)

len() is a Python built-in function that returns the number of items in an object, such as the characters in a string or the elements in a list. It accepts exactly one argument: a sequence (such as a string, bytes, list, or tuple) or a collection (such as a set or dictionary). It raises a TypeError if the argument is missing or has no length.

Let’s see how the split() method and the len() function count the words in a string.

Example 1: No Parameters

# initialize string
text = 'The quick brown fox jumps over the lazy dog'

# default separator: any whitespace
result = len(text.split())

print("There are " + str(result) + " words.")

Example 2: With the separator Parameter

# initialize string
bucket_list = 'Japan, Singapore, Maldives, Europe, Italy, Korea'

# comma delimiter
result = len(bucket_list.split(','))

# Prints a list of strings
print(bucket_list.split(','))

print("There are " + str(result) + " words.")

Output:

['Japan', ' Singapore', ' Maldives', ' Europe', ' Italy', ' Korea']
There are 6 words.

The split() method returns a new list of strings, and len() counts the strings in that list.

Example 3: With the separator and maxsplit Parameters

# initialize string
bucket_list = 'Japan, Singapore, Maldives, Europe, Italy, Korea'

# comma delimiter, at most three splits
result = len(bucket_list.split(',', 3))

# Prints a list of strings
print(bucket_list.split(',', 3))

print("There are " + str(result) + " words.")

Output:

['Japan', ' Singapore', ' Maldives', ' Europe, Italy, Korea']
There are 4 words.

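Here, maxsplit=3 splits only at the first three commas in bucket_list, so everything after the third comma stays together as one item. In general, when maxsplit is set, the list has at most maxsplit+1 items. With maxsplit set to 2, for example, the same string produces:

['Japan', ' Singapore', ' Maldives, Europe, Italy, Korea']
There are 3 words.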

The split() method breaks a large string into smaller ones. The resulting count therefore reflects how the separator is defined rather than the number of actual words.
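One consequence of splitting on a bare comma is that each item keeps its leading space, as the outputs above show. A minimal sketch of one way to tidy this up follows. The trailing comma in the sample string is an assumption added to show how empty pieces can be filtered out.

```python
bucket_list = "Japan, Singapore, Maldives, Europe, Italy, Korea,"

# Strip surrounding whitespace and drop empty pieces (e.g. from a trailing comma)
places = [item.strip() for item in bucket_list.split(",") if item.strip()]

print(places)       # ['Japan', 'Singapore', 'Maldives', 'Europe', 'Italy', 'Korea']
print(len(places))  # 6
```

Without the filter, the trailing comma would add an empty string to the list and inflate the count to 7.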


Use RegEx Module to Count Words in Python String

A regular expression (regex or regexp for short) is a powerful tool for searching and manipulating text strings; it can be used for data preprocessing, validation, finding a pattern in a text string, and so on. Regex can also help count words in a text string that contains punctuation marks or special characters that should not be counted. Python’s regex support lives in the built-in re module, so we just need to import re to start using it.

# import regex module
import re

# initialize string
text = 'Python !! is the be1st $$ programming language @'

# using regex findall()
result = len(re.findall(r'\w+', text))

print("There are " + str(result) + " words.")

Use sum(), strip(), and split() Methods to Count Words in Python String

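This approach counts the words without using regex. sum(), strip(), and split() are all built into Python. We’ll briefly discuss each one.

sum() adds up the items of an iterable and accepts two parameters:

  • iterable (required): a list, tuple, or other iterable whose items are added up. The items should be numbers.
  • start (optional): a value added to the total. It defaults to 0.

strip() returns a copy of a string with leading and trailing characters removed. By default it removes whitespace; if a string of characters is passed as an argument, it removes those characters instead.

Finally, the split() method was already discussed in an earlier section.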

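Now, let’s use these together to count words in a string. First, we need to import string, a Python built-in module that provides the string.punctuation constant. Because isalpha() returns True or False and Python treats True as 1, summing the results counts the tokens that consist only of letters.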

import string

# initialize string
text = 'Python !! is the be1st $$ programming language @'

# using the sum(), strip(), split() methods
result = sum([i.strip(string.punctuation).isalpha() for i in text.split()])

print("There are " + str(result) + " words.")
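The regex approach and the sum() approach do not always agree on the same input. The sketch below runs both on the example string and lists the tokens that only the regex version counts. The variable names are chosen here for illustration.

```python
import re
import string

text = 'Python !! is the be1st $$ programming language @'

# Regex: \w matches letters, digits, and underscores
regex_tokens = re.findall(r'\w+', text)

# sum()/strip()/split() logic: keep tokens that are purely alphabetic
# once surrounding punctuation is stripped
alpha_tokens = [
    token.strip(string.punctuation)
    for token in text.split()
    if token.strip(string.punctuation).isalpha()
]

only_in_regex = [token for token in regex_tokens if token not in alpha_tokens]

print(len(regex_tokens))  # 6
print(len(alpha_tokens))  # 5
print(only_in_regex)      # ['be1st']
```

The difference comes from 'be1st': it contains a digit, so \w+ accepts it while isalpha() rejects it.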

Use the count() Method to Count Words in Python String

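count() is a built-in string method that returns how many times a substring occurs in a string, without overlapping. It accepts three parameters:

  • substring (required): the text to search for in the string
  • start (optional): the index where the search starts
  • end (optional): the index where the search ends

string.count(substring, start, end)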

This method differs from the previous ones: it does not return the total number of words in the string but the number of occurrences of a given substring. Let’s see how it works in the example below:

# initialize string
text = "Python: How to count words in string Python"
substring = "Python"

total_occurrences = text.count(substring)

print("There are " + str(total_occurrences) + " occurrences.")

With this method, it doesn’t matter whether the substring is a whole word, phrase, letter, or any combination of characters or numbers.
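Because count() matches any substring, it also counts occurrences inside longer words. The sketch below uses a made-up sentence to illustrate this. The whole-word variant with the \b word-boundary anchor is a suggested alternative, not part of the original article.

```python
import re

text = "the cat sat on the theater mat"

# count() also finds "the" at the start of "theater"
substring_hits = text.count("the")

# \b anchors the match to word boundaries, so only whole words count
whole_word_hits = len(re.findall(r"\bthe\b", text))

print(substring_hits)   # 3
print(whole_word_hits)  # 2
```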

In summary, you can choose any of these approaches depending on your use case. For space-separated words, use the straightforward approach: split() together with len(). To count words while ignoring special characters, use the re module with a pattern that matches only the characters that make up a word. To do the same without regex, use the combination of sum(), strip(), and split(). Lastly, the count() method can be used to count how often a specific word or substring appears in the string.
