I'm trying to calculate the average number of characters per word in a list, using the following definitions and a helper function:

- Words are the strings produced by calling `split()` on a line of a file.
- A sentence ends with `!`, `?`, or `.`, or at the EOF. A sentence excludes whitespace on both ends and is not an empty string.
```python
def clean_up(s):
    """ (str) -> str

    Return a new string based on s in which all letters have been
    converted to lowercase and punctuation characters have been
    stripped from both ends. Inner punctuation is left untouched.

    >>> clean_up('Happy Birthday!!!')
    'happy birthday'
    >>> clean_up("-> It's on your left-hand side.")
    " it's on your left-hand side"
    """
    punctuation = """!"',;:.-?)(<>*#\n\t\r"""
    result = s.lower().strip(punctuation)
    return result
```
My code is:
```python
def avg_word_length(text):
    """ (list of str) -> float

    Precondition: text is non-empty. Each str in text ends with \n and
    text contains at least one word.

    Return the average length of all words in text.

    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
    >>> avg_word_length(text)
    5.142857142857143
    """
    a = ''
    for i in range(len(text)):
        a = a + clean_up(text[i])
    words = a.split()
    for word in words:
        average = sum(len(word) for word in words) / len(words)
    return average
```
I'm getting a value of 6.16666... as my answer.
I'm using Python 3
You have two significant logical errors in your code.
First, in `clean_up` you strip punctuation only from the start and end of each whole line, not from the individual words inside it, and `split()` later divides on whitespace rather than on the characters you stripped. The result is that `"Peter,"` makes it through as a word one character longer than it should be.
Secondly, you concatenate the lines after stripping, with `a = a + clean_up(text[i])`, and nothing separates the end of one line from the start of the next. This leaves you with too few, overly long words, because the last word of one line fuses with the first word of the following line; in this case, you get `"CooperPeter,"` counted as one word.
Both of these problems become obvious if you just print `words` before the second loop (a loop that has no reason to exist at all, since the generator expression inside the `sum()` call already computes the whole average in one pass).
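For instance, running your logic on the docstring's own example shows both problems at once, and explains the 6.1666... you observed:

```python
def clean_up(s):
    # Same helper as in the question, inlined so this runs standalone.
    punctuation = """!"',;:.-?)(<>*#\n\t\r"""
    return s.lower().strip(punctuation)

text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
a = ''
for line in text:
    a = a + clean_up(line)  # lines concatenate with no space between them

words = a.split()
print(words)
# ['james', 'fennimore', 'cooperpeter,', 'paul', 'and', 'mary']
print(sum(len(word) for word in words) / len(words))
# 6.166666666666667
```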
Personally, I'd probably use the `re` module to find words with a single consistent definition (such as `r"\w+"`) and tally their lengths, rather than building up a combined string of their contents.
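A sketch of that approach (the `r"\w+"` definition of a word is an assumption on my part; it treats `"it's"` as two words, so adjust the pattern if that matters for your assignment):

```python
import re

def avg_word_length(text):
    # One consistent definition of a word: a maximal run of \w characters.
    # Joining the lines as-is is safe because \w+ never matches across '\n'.
    words = re.findall(r"\w+", ''.join(text))
    return sum(len(word) for word in words) / len(words)

text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
print(avg_word_length(text))  # 5.142857142857143
```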