Creating an index of words

iKyriaki

I'm currently trying to create an index of words by reading each line from a text file and checking whether each word occurs in that line. If it does, the function prints the line number and continues checking. I've gotten it to work the way I want when printing each word and line number, but I'm not sure what data structure I could use to store the line numbers.

Code example:

def index(filename, wordList):
    'string, list(string) ==> string & int, returns an index of words with the line number\
    each word occurs in'
    indexDict = {}
    res = []
    infile = open(filename, 'r')
    count = 0
    line = infile.readline()
    while line != '':
        count += 1
        for word in wordList:
            if word in line:
                #indexDict[word] = [count]
                print(word, count)
        line = infile.readline()
    #return indexDict

This prints the word and whatever the count is at the time (line number), but what I'm trying to do is store the numbers so that later on I can make it print out

word linenumber

word2 linenumber, linenumber

And so on. I felt a dictionary would work for this if I put each line number inside a list so each key can contain more than one value, but the closest I got was this:

{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [120], 'evil': [106], 'demon': [122]}

When I wanted it to show up as:

{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [44, 53, 55, 64, 78, 97, 104, 111, 118, 120], 'evil': [99, 106], 'demon': [122]}

Any ideas?



answered 6 years ago tobias_k #1

Try something like this:

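import collections
def index(filename, wordList):
    # defaultdict(list) creates an empty list the first time a word is seen
    indexDict = collections.defaultdict(list)
    with open(filename) as infile:
        for (i, line) in enumerate(infile.readlines()):
            for word in wordList:
                if word in line:
                    # enumerate counts from 0, so add 1 to get the line number
                    indexDict[word].append(i + 1)
    return indexDict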

This yields the exact same results as in your example (using Poe's Raven).
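Once the dictionary is built, printing it in the "word linenumber, linenumber" layout from the question could look roughly like the sketch below. The helper name format_index and the sample data are made up for illustration, and sorting the words alphabetically is an assumption, not something the question asks for.

```python
# Hypothetical sample data in the same shape as the dictionary returned by index().
index_dict = {"mortal": [30], "raven": [44, 53, 55], "evil": [99, 106]}


def format_index(index_dict):
    """Return one 'word n1, n2, ...' string per word, sorted alphabetically."""
    return [
        "{} {}".format(word, ", ".join(str(n) for n in numbers))
        for word, numbers in sorted(index_dict.items())
    ]


formatted = format_index(index_dict)
for entry in formatted:
    print(entry)
```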

Alternatively, you might consider using a normal dict instead of a defaultdict and initializing it with all the words in the list, to make sure that indexDict contains an entry even for words that do not appear in the text.
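A minimal sketch of that plain-dict variant follows. To keep it self-contained it takes an iterable of lines instead of a filename, which is an assumption for the example; the function name index_with_all_words and the sample lines are also made up.

```python
def index_with_all_words(lines, word_list):
    """Map every word in word_list to the 1-based numbers of the lines containing it."""
    # Pre-create an empty list per word so words that never occur still get an entry.
    index_dict = {word: [] for word in word_list}
    for line_number, line in enumerate(lines, 1):
        for word in word_list:
            if word in line:  # substring test, as in the question
                index_dict[word].append(line_number)
    return index_dict


# Hypothetical input standing in for the lines of a file.
sample_lines = ["the raven sat", "a ghost appeared", "quoth the raven"]
result = index_with_all_words(sample_lines, ["raven", "ghost", "demon"])
print(result)
```

Here 'demon' never occurs, so it maps to an empty list instead of being absent from the result.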

Also, note the use of enumerate. This builtin function is very useful for iterating over both the index and the item at that index of some list (like the lines in the file).
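As a small illustration (the sample list is made up), enumerate counts from 0 by default, and its optional second argument sets the starting value, which is handy when you want line numbers that start at 1:

```python
lines = ["first line", "second line", "third line"]

# By default enumerate pairs each item with an index starting at 0.
zero_based = list(enumerate(lines))

# Passing a start value of 1 gives human-style line numbers instead.
one_based = list(enumerate(lines, 1))

print(zero_based[0])
print(one_based[0])
```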

answered 6 years ago Adam Barthelson #2

There is probably a more pythonic way to write this, but just for readability you could try this (a simple example):

dict = {1: [], 2: [], 3: []}

list = [1,2,2,2,3,3]

for k in dict.keys():
    for i in list:
        if i == k:
            dict[k].append(i)

In [7]: dict
Out[7]: {1: [1], 2: [2, 2, 2], 3: [3, 3]}

answered 6 years ago Martijn Pieters #3

You need to append your next item to the list, if the list already exists.

The easiest way to have the list already be there, even the first time you find a word, is to use the collections.defaultdict class to track your word-to-lines mapping:

from collections import defaultdict

def index(filename, wordList):
    indexDict = defaultdict(list)
    with open(filename, 'r') as infile:
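        # start counting at 1 so the numbers match the question's line numbers
        for i, line in enumerate(infile, 1):
            for word in wordList:
                if word in line:
                    indexDict[word].append(i)
                    print(word, i)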

    return indexDict

I've simplified your code a little using best practices: opening the file as a context manager so it'll close automatically when done, and using enumerate() to create line numbers on the fly.

You could speed this up a little further still (and make it more accurate) if you turned each line into a set of words, perhaps with set(line.split()), although that won't remove punctuation. You could then use set intersection tests against wordList (also converted to a set), which could be considerably faster at finding matching words.
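A rough sketch of the set-based idea is below. It takes an iterable of lines instead of a filename, and the punctuation stripping and lowercasing are assumptions added for the example (the words in word_list are assumed to be lowercase). Note that this matches whole words only, so 'raven' no longer matches 'ravenous' the way the substring test does.

```python
import string


def index_by_whole_words(lines, word_list):
    """Map each word to the 1-based numbers of the lines where it occurs as a whole word."""
    wanted = set(word_list)
    index_dict = {}
    # Translation table that deletes ASCII punctuation characters.
    strip_punct = str.maketrans("", "", string.punctuation)
    for line_number, line in enumerate(lines, 1):
        words_in_line = set(line.lower().translate(strip_punct).split())
        # The intersection holds exactly the wanted words present on this line.
        for word in wanted & words_in_line:
            index_dict.setdefault(word, []).append(line_number)
    return index_dict


# Hypothetical sample lines standing in for the file contents.
sample = ['Quoth the Raven, "Nevermore."', "The ravenous ghost", "a raven; a ghost!"]
whole_word_index = index_by_whole_words(sample, ["raven", "ghost"])
print(whole_word_index)
```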

answered 6 years ago octref #4

You are replacing the old value with this line:

indexDict[word] = [count]

Changing it to

indexDict[word] = indexDict.setdefault(word, []) + [count]

Will yield the answer you want. It gets the current value of indexDict[word] and builds a new list with count added to the end. If there is no indexDict[word] yet, setdefault creates a new empty list first, and count is added to that.
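A slightly shorter variant, shown here only as a sketch with made-up sample data, appends in place to the list that setdefault returns instead of building a new list with + on every match:

```python
index_dict = {}

# Hypothetical (word, line number) matches in the order they would be found.
for word, count in [("raven", 44), ("evil", 99), ("raven", 53)]:
    # setdefault returns the existing list, or stores and returns a new empty one.
    index_dict.setdefault(word, []).append(count)

print(index_dict)
```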
