Unique words save to text file as a word per line

BBEng Source

[using Python 3.3.3]

I'm trying to analyse text files, clean them up, print the number of unique words, and then save the list of unique words to a text file, one word per line, together with the number of times each unique word appears in the cleaned-up list of words. So what I did was take the text file (a speech from Prime Minister Harper), clean it up by keeping only valid alphabetical characters and single spaces, and count the number of unique words. Then I needed to make a saved text file of the unique words, with each unique word on its own line and, beside the word, the number of occurrences of that word in the cleaned-up list. Here's what I have.

def uniqueFrequency(newWords):
    '''Function returns a list of unique words with amount of occurances of that
word in the text file.'''
    unique = sorted(set(newWords.split()))
    for i in unique:
        unique = str(unique) + i + " " + str(newWords.count(i)) + "\n"
    return unique

def saveUniqueList(uniqueLines, filename):
    '''Function saves result of uniqueFrequency into a text file.'''
    outFile = open(filename, "w")
    outFile.write(uniqueLines)
    outFile.close

newWords is the cleaned-up version of the text file, with only words and spaces, nothing else. So, I want each unique word in newWords to be saved to a text file, one word per line, and beside the word, the number of occurrences of that word in newWords (not in the unique words list, because there each word would occur only once). What is wrong with my functions? Thank you!

python file python-3.x io text-files

Answers

answered 3 years ago R Sahu #1

Based on

unique = sorted(set(newWords.split()))
for i in unique:
    unique = str(unique) + i + " " + str(newWords.count(i)) + "\n"

I am guessing that newWords is not a list of strings but one long string. If that's the case, newWords.count(i) counts substring matches rather than whole words, so a word such as "the" is also counted inside "then" and "them".

Try:

def uniqueFrequency(newWords):
    '''Return one line per unique word: the word, a space, and the number
    of times it occurs in newWords.'''
    wordList = newWords.split()
    unique = sorted(set(wordList))
    ret = ""
    for i in unique:
        ret = ret + i + " " + str(wordList.count(i)) + "\n"
    return ret
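
To see why the count is taken on the split list rather than on the string, here is a small sketch. The sample sentence and the variable names are made up for illustration and are not from the question.

```python
text = "the cat then sat on the mat"

# str.count counts substring matches, so "the" is also found inside "then".
substring_hits = text.count("the")

# list.count compares whole elements, so only the exact word "the" matches.
word_hits = text.split().count("the")

print(substring_hits, word_hits)
```

With this sample the string count comes out one higher than the word count, because of "then".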

answered 3 years ago Apoorv #2

unique = str(unique) + i + " " + str(newWords.count(i)) + "\n"

The line above overwrites unique, the sorted list the loop is iterating over, with a string, so the returned text begins with the printed list. If you accumulate into a different variable, such as var, the function returns the expected lines.

def uniqueFrequency(newWords):
    '''Return one line per unique word: the word, a space, and its count.'''
    words = newWords.split()
    var = ""
    # Accumulate into var so that unique words stay in their own list.
    for i in sorted(set(words)):
        # Count whole words in the list, not substrings of the string.
        var = var + i + " " + str(words.count(i)) + "\n"
    return var
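
As a rough illustration of what reusing the name does, the sketch below runs the loop from the question on a tiny input. The function name buggy_unique_frequency and the input "b a b" are made up for this example.

```python
def buggy_unique_frequency(new_words):
    # Same loop as in the question: unique starts as a list and is
    # overwritten with a string on the first pass.
    unique = sorted(set(new_words.split()))
    for i in unique:
        # The for loop keeps iterating over the original list, but the
        # name unique now refers to the growing string.
        unique = str(unique) + i + " " + str(new_words.count(i)) + "\n"
    return unique

result = buggy_unique_frequency("b a b")
print(repr(result))
```

The returned string starts with the text of the list, ['a', 'b'], followed by the word lines, which is why the saved file does not look as expected.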

answered 3 years ago Roland Smith #3

Try collections.Counter instead. It is made for situations like this.

Demonstration in IPython below:

In [1]: from collections import Counter

In [2]: txt = """I'm trying to analyse text files, clean them up, print the amount of unique words, then try to save the unique words list to a text file, one word per line with the amount of times each unique word appears in the cleaned up list of words. SO what i did was i took the text file (a speech from prime minister harper), cleaned it up by only counting valid alphabetical characters and single spaces, then i counted the amount of unique words, then i needed to make a saved text file of the unique words, with each unique word being on its own line and beside the word, the number of occurances of that word in the cleaned up list. Here's what i have."""

In [3]: Counter(txt.split())
Out[3]: Counter({'the': 10, 'of': 7, 'unique': 6, 'i': 5, 'to': 4, 'text': 4, 'word': 4, 'then': 3, 'cleaned': 3, 'up': 3, 'amount': 3, 'words,': 3, 'a': 2, 'with': 2, 'file': 2, 'in': 2, 'line': 2, 'list': 2, 'and': 2, 'each': 2, 'what': 2, 'did': 1, 'took': 1, 'from': 1, 'words.': 1, '(a': 1, 'only': 1, 'harper),': 1, 'was': 1, 'analyse': 1, 'one': 1, 'number': 1, 'them': 1, 'appears': 1, 'it': 1, 'have.': 1, 'characters': 1, 'counted': 1, 'list.': 1, 'its': 1, "I'm": 1, 'own': 1, 'by': 1, 'save': 1, 'spaces,': 1, 'being': 1, 'clean': 1, 'occurances': 1, 'alphabetical': 1, 'files,': 1, 'counting': 1, 'needed': 1, 'that': 1, 'make': 1, "Here's": 1, 'times': 1, 'print': 1, 'up,': 1, 'beside': 1, 'trying': 1, 'on': 1, 'try': 1, 'valid': 1, 'per': 1, 'minister': 1, 'file,': 1, 'saved': 1, 'single': 1, 'words': 1, 'SO': 1, 'prime': 1, 'speech': 1, 'word,': 1})

(Note that this solution isn't perfect yet: it hasn't removed the commas and other punctuation from the words. Hint: use str.replace.)
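
One possible way to do the clean-up before counting is sketched below. It follows the question's description (keep only letters and spaces) rather than str.replace. The helper name clean, the lowercasing step and the sample text are assumptions made for this example.

```python
from collections import Counter

def clean(text):
    # Drop every character that is neither a letter nor whitespace.
    kept = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
    # Lowercase (an assumption) and collapse whitespace to single spaces.
    return " ".join(kept.lower().split())

sample = "Words, words. More words (and a word)!"
cleaned = clean(sample)
counts = Counter(cleaned.split())
print(cleaned)
print(counts["words"], counts["word"])
```

Note that dropping the apostrophe turns a word like "I'm" into "im", so whether that is acceptable depends on the text being analysed.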

The Counter is a specialized dict, with a word as the key, and the count as the value. So you can use it like this:

cnts = Counter(txt.split())
with open('counts.txt', 'w') as outfile:
    for c in cnts:
        outfile.write("{} {}\n".format(c, cnts[c]))

Note that this solution uses some nice-to-know Python concepts: the with statement, which closes the file automatically, and str.format for building each output line.
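
Putting the pieces together, a minimal sketch of counting and saving might look like the following. The function name save_counts, the sorted output order and the temporary file path are choices made for this example, not part of the answer above.

```python
import os
import tempfile
from collections import Counter

def save_counts(words, filename):
    # Count each word once, then write "word count" lines in sorted order.
    counts = Counter(words)
    # The with block closes the file even if a write fails.
    with open(filename, "w") as outfile:
        for word in sorted(counts):
            outfile.write("{} {}\n".format(word, counts[word]))

# Write to a temporary directory so the example leaves no file behind
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), "counts.txt")
save_counts("the cat and the hat".split(), path)

with open(path) as infile:
    written = infile.read()
print(written)
```

Using with also sidesteps a problem in the question's saveUniqueList: outFile.close without parentheses only refers to the method and never calls it.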
