Average number of characters per word in a list

Ivyy

I'm trying to calculate the average number of characters per word in a list using the following definitions and a helper function clean_up.

Definitions:

  • A token is a string that you get from calling split() on a line of a file
  • A word is a non-empty token that is not completely made up of punctuation
  • A sentence is a sequence of characters that is terminated by, but does not include, the characters !?. or the EOF. A sentence excludes whitespace on both ends and is not an empty string.
def clean_up(s):
    """ (str) -> str

    Return a new string based on s in which all letters have been
    converted to lowercase and punctuation characters have been stripped 
    from both ends. Inner punctuation is left untouched. 

    >>> clean_up('Happy Birthday!!!')
    'happy birthday'
    >>> clean_up("-> It's on your left-hand side.")
    " it's on your left-hand side"
    """

    punctuation = """!"',;:.-?)([]<>*#\n\t\r"""
    result = s.lower().strip(punctuation)
    return result

My code is:

def avg_word_length(text):
    """ (list of str) -> float

    Precondition: text is non-empty. Each str in text ends with \n and
    text contains at least one word.

    Return the average length of all words in text. 

    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
    >>> avg_word_length(text)
    5.142857142857143
    """

    a = ''
    for i in range(len(text)):
        a = a + clean_up(text[i])
        words = a.split()
    for word in words:
        average = sum(len(word) for word in words)/len(words)
    return average

I'm getting a value of 6.16666... as my answer.
I'm using Python 3.

Tags: python, string, python-3.x

Answers

answered 3 years ago Yann Vernier #1

You have two significant logical errors in your code.

First, in clean_up, you strip punctuation only from the two ends of an entire line, not from each token within it, and you don't split on the same characters you strip. The result is that "Peter," gets through as a word with one character more than it should have.
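The first problem is easy to demonstrate in isolation with the clean_up from the question:

```python
def clean_up(s):
    punctuation = """!"',;:.-?)([]<>*#\n\t\r"""
    return s.lower().strip(punctuation)

# strip() only removes punctuation at the ends of the whole line,
# so the comma after "Peter" survives into the token list:
line = 'Peter, Paul and Mary\n'
print(clean_up(line).split())  # ['peter,', 'paul', 'and', 'mary']
```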

Second, you concatenate lines after stripping, with a = a + clean_up(text[i]), without any separator in between. The last word of one line merges with the first word of the next, so you end up with fewer, longer words; here "Cooper" and "Peter," fuse into the single word "cooperpeter,".

Both of these problems become obvious if you simply print words before the second loop (which has no reason to be a loop at all, since the generator expression inside the sum() call already iterates over words).
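One way to fix both problems while keeping the original clean_up is to split each line into tokens first, clean each token individually, and never glue lines together. A sketch, not the only possible fix:

```python
def clean_up(s):
    punctuation = """!"',;:.-?)([]<>*#\n\t\r"""
    return s.lower().strip(punctuation)

def avg_word_length(text):
    words = []
    for line in text:
        for token in line.split():   # split into tokens first ...
            word = clean_up(token)   # ... then strip punctuation per token
            if word:                 # drop tokens that were pure punctuation
                words.append(word)
    return sum(len(word) for word in words) / len(words)

text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
print(avg_word_length(text))  # 5.142857142857143
```

This also matches the "word" definition from the question, since empty cleaned tokens (ones made entirely of punctuation) are skipped.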

Personally, I'd probably use the re module to find words with a single consistent definition (such as r"\w+") and tally their lengths, rather than building up one big string of their contents.
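That re-based approach could look like the following. Note that r"\w+" defines a word slightly differently than clean_up does; for example, it would split "it's" into "it" and "s":

```python
import re

def avg_word_length(text):
    # Find all runs of word characters in the joined, lowercased text;
    # this sidesteps both the punctuation-stripping and the
    # line-concatenation problems at once.
    words = re.findall(r"\w+", " ".join(text).lower())
    return sum(len(w) for w in words) / len(words)

text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
print(avg_word_length(text))  # 5.142857142857143
```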
