# Average number of characters per word in a list

asked by Ivyy

I'm trying to calculate the average number of characters per word in a list using the following definitions and a helper function `clean_up`.

Definitions:

• A token is a string that you get from calling `split()` on a line of a file
• A word is a non-empty token that is not completely made up of punctuation
• A sentence is a sequence of characters that is terminated by, but does not include, the characters `!?.` or the EOF. A sentence excludes whitespace on both ends and is not an empty string.
```python
def clean_up(s):
    """ (str) -> str

    Return a new string based on s in which all letters have been
    converted to lowercase and punctuation characters have been stripped
    from both ends. Inner punctuation is left untouched.

    >>> clean_up('Happy Birthday!!!')
    'happy birthday'
    >>> clean_up("-> It's on your left-hand side.")
    " it's on your left-hand side"
    """

    punctuation = """!"',;:.-?)([]<>*#\n\t\r"""
    result = s.lower().strip(punctuation)
    return result
```

My code is:

```python
def avg_word_length(text):
    """ (list of str) -> float

    Precondition: text is non-empty. Each str in text ends with \n and
    text contains at least one word.

    Return the average length of all words in text.

    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
    >>> avg_word_length(text)
    5.142857142857143
    """

    a = ''
    for i in range(len(text)):
        a = a + clean_up(text[i])
    words = a.split()
    for word in words:
        average = sum(len(word) for word in words)/len(words)
    return average
```

I'm getting a value of 6.16666... as my answer.
I'm using Python 3.

Tags: python, string, python-3.x

Answer #1 — Yann Vernier, answered 4 years ago

You have two significant logical errors in your code.

First, in `clean_up`, you strip separators only from the start and end of each string, not within it, and you don't split on the same separators you strip. As a result, `"Peter,"` survives as a word with one more character than it should have.

Secondly, you concatenate the lines after stripping them, with `a = a + clean_up(text[i])`. This leaves you with too few, overly long words, because the last word of one line fuses with the first word of the next; in this case, you get `"cooperpeter,"` as a single word.
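One way to fix both issues (a sketch, not the only possible approach) is to split each line into tokens first, then clean each token individually, keeping only non-empty results. The `clean_up` here is the question's helper, inlined so the example runs on its own:

```python
def clean_up(s):
    """The question's helper: lowercase, strip punctuation from both ends."""
    return s.lower().strip("""!"',;:.-?)([]<>*#\n\t\r""")

def avg_word_length(text):
    """Average word length, cleaning each token individually."""
    words = []
    for line in text:
        for token in line.split():   # split first, so line-final and
            word = clean_up(token)   # line-initial words never fuse
            if word:                 # skip tokens that were pure punctuation
                words.append(word)
    return sum(len(w) for w in words) / len(words)
```

On the docstring's sample input this returns 5.142857142857143 (that is, 36/7), matching the expected value.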

Both of these problems become obvious if you simply print `words` before the second loop (which has no reason to be a loop at all, since the generator expression inside the `sum()` call already visits every word).
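For example, running your approach on the docstring's sample input with that diagnostic print shows both bugs at once (the question's `clean_up` is inlined here so the snippet runs on its own):

```python
def clean_up(s):
    """The question's helper: lowercase, strip punctuation from both ends."""
    return s.lower().strip("""!"',;:.-?)([]<>*#\n\t\r""")

text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
a = ''
for line in text:
    a = a + clean_up(line)   # lines fuse: 'cooper' runs into 'peter,'
words = a.split()
print(words)
# ['james', 'fennimore', 'cooperpeter,', 'paul', 'and', 'mary']
print(sum(len(w) for w in words) / len(words))
# 6.166666666666667
```

The fused `'cooperpeter,'` (12 characters) and the trailing comma on it explain exactly why you get 6.1666... instead of 5.1428...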

Personally, I'd probably use the `re` module to find words with a single consistent definition (such as `r"\w+"`) and tally their lengths, rather than building up a string of their contents.
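A sketch of that approach follows. Note that `r"\w+"` defines a word slightly differently from `clean_up` — it would split `it's` and `left-hand` into separate words — so treat this as an illustration of the technique, not a drop-in replacement:

```python
import re

def avg_word_length(text):
    """Average word length using one consistent word definition."""
    # Join all lines, lowercase, and extract runs of word characters.
    words = re.findall(r"\w+", " ".join(text).lower())
    return sum(len(w) for w in words) / len(words)
```

On the sample input from the question this happens to give the same 5.142857142857143, since none of those words contain inner punctuation.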