Delete column from pandas DataFrame using del df.column_name

John Source

When deleting a column in a DataFrame I use:

del df['column_name']

And this works great. Why can't I use the following?

del df.column_name

As you can access the column/Series as df.column_name, I expect this to work.

pythonpandasdataframe

Answers

answered 6 years ago Andy Hayden #1

It's good practice to always use the [] notation. One reason is that attribute notation (df.column_name) does not work for numbered indices:

In [1]: df = DataFrame([[1, 2, 3], [4, 5, 6]])

In [2]: df[1]
Out[2]:
0    2
1    5
Name: 1

In [3]: df.1
  File "<ipython-input-3-e4803c0d1066>", line 1
    df.1
       ^
SyntaxError: invalid syntax

answered 6 years ago Wes McKinney #2

It's difficult to make del df.column_name work simply as the result of syntactic limitations in Python. del df[name] gets translated to df.__delitem__(name) under the covers by Python.

answered 5 years ago LondonRob #3

The best way to do this in pandas is to use drop:

df = df.drop('column_name', 1)

where 1 is the axis number (0 for rows and 1 for columns.)

To delete the column without having to reassign df you can do:

df.drop('column_name', axis=1, inplace=True)

Finally, to drop by column number instead of by column label, try this to delete, e.g. the 1st, 2nd and 4th columns:

df.drop(df.columns[[0, 1, 3]], axis=1)  # df.columns is zero-based pd.Index 

answered 5 years ago Krishna Sankar #4

Use:

columns = ['Col1', 'Col2', ...]
df.drop(columns, inplace=True, axis=1)

This will delete one or more columns in-place. Note that inplace=True was added in pandas v0.13 and won't work on older versions. You'd have to assign the result back in that case:

df = df.drop(columns, axis=1)

answered 3 years ago jezrael #5

Drop by index

Delete first, second and fourth columns:

df.drop(df.columns[[0,1,3]], axis=1, inplace=True)

Delete first column:

df.drop(df.columns[[0]], axis=1, inplace=True)

There is an optional parameter inplace so that the original data can be modified without creating a copy.

Popped

Column selection, addition, deletion

Delete column column-name:

df.pop('column-name')

Examples:

df = DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6]), ('C', [7,8, 9])], orient='index', columns=['one', 'two', 'three'])

print df:

   one  two  three
A    1    2      3
B    4    5      6
C    7    8      9

df.drop(df.columns[[0]], axis=1, inplace=True) print df:

   two  three
A    2      3
B    5      6
C    8      9

three = df.pop('three') print df:

   two
A    2
B    5
C    8

answered 3 years ago eiTan LaVi #6

A nice addition is the ability to drop columns only if they exist. This way you can cover more use cases, and it will only drop the existing columns from the labels passed to it:

Simply add errors='ignore', for example.:

df.drop(['col_name_1', 'col_name_2', ..., 'col_name_N'], inplace=True, axis=1, errors='ignore')
  • This is new from pandas 0.16.1 onward. Documentation is here.

answered 3 years ago Alexander #7

In pandas 0.16.1+ you can drop columns only if they exist per the solution posted by @eiTanLaVi. Prior to that version, you can achieve the same result via a conditional list comprehension:

df.drop([col for col in ['col_name_1','col_name_2',...,'col_name_N'] if col in df], 
        axis=1, inplace=True)

answered 2 years ago Doctor #8

The dot syntax works in JavaScript, but not in Python.

  • Python: del df['column_name']
  • JavaScript: del df['column_name'] or del df.column_name

answered 2 years ago sushmit #9

from version 0.16.1 you can do

df.drop(['column_name'], axis = 1, inplace = True, errors = 'ignore')

answered 2 years ago firelynx #10

The actual question posed, missed by most answers here is:

Why can't I use del df.column_name?

At first we need to understand the problem, which requires us to dive into python magic methods.

As Wes points out in his answer del df['column'] maps to the python magic method df.__delitem__('column') which is implemented in pandas to drop the column

However, as pointed out in the link above about python magic methods:

In fact, del should almost never be used because of the precarious circumstances under which it is called; use it with caution!

You could argue that del df['column_name'] should not be used or encouraged, and thereby del df.column_name should not even be considered.

However, in theory, del df.column_name could be implemeted to work in pandas using the magic method __delattr__. This does however introduce certain problems, problems which the del df['column_name'] implementation already has, but in lesser degree.

Example Problem

What if I define a column in a dataframe called "dtypes" or "columns".

Then assume I want to delete these columns.

del df.dtypes would make the __delattr__ method confused as if it should delete the "dtypes" attribute or the "dtypes" column.

Architectural questions behind this problem

  1. Is a dataframe a collection of columns?
  2. Is a dataframe a collection of rows?
  3. Is a column an attribute of a dataframe?

Pandas answers:

  1. Yes, in all ways
  2. No, but if you want it to be, you can use the .ix, .loc or .iloc methods.
  3. Maybe, do you want to read data? Then yes, unless the name of the attribute is already taken by another attribute belonging to the dataframe. Do you want to modify data? Then no.

TLDR;

You cannot do del df.column_name because pandas has a quite wildly grown architecture that needs to be reconsidered in order for this kind of cognitive dissonance not to occur to its users.

Protip:

Don't use df.column_name, It may be pretty, but it causes cognitive dissonance

Zen of Python quotes that fits in here:

There are multiple ways of deleting a column.

There should be one-- and preferably only one --obvious way to do it.

Columns are sometimes attributes but sometimes not.

Special cases aren't special enough to break the rules.

Does del df.dtypes delete the dtypes attribute or the dtypes column?

In the face of ambiguity, refuse the temptation to guess.

answered 1 year ago piRSquared #11

TL;DR

A lot of effort to find a marginally more efficient solution. Difficult to justify the added complexity while sacrificing the simplicity of df.drop(dlst, 1, errors='ignore')

df.reindex_axis(np.setdiff1d(df.columns.values, dlst), 1)

Preamble
Deleting a column is semantically the same as selecting the other columns. I'll show a few additional methods to consider.

I'll also focus on the general solution of deleting multiple columns at once and allowing for the attempt to delete columns not present.

Using these solutions are general and will work for the simple case as well.


Setup
Consider the pd.DataFrame df and list to delete dlst

df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), range(3))
dlst = list('HIJKLM')

df

   A  B  C  D  E  F  G  H  I   J
0  1  2  3  4  5  6  7  8  9  10
1  1  2  3  4  5  6  7  8  9  10
2  1  2  3  4  5  6  7  8  9  10

dlst

['H', 'I', 'J', 'K', 'L', 'M']

The result should look like:

df.drop(dlst, 1, errors='ignore')

   A  B  C  D  E  F  G
0  1  2  3  4  5  6  7
1  1  2  3  4  5  6  7
2  1  2  3  4  5  6  7

Since I'm equating deleting a column to selecting the other columns, I'll break it into two types:

  1. Label selection
  2. Boolean selection

Label Selection

We start by manufacturing the list/array of labels that represent the columns we want to keep and without the columns we want to delete.

  1. df.columns.difference(dlst)

    Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
    
  2. np.setdiff1d(df.columns.values, dlst)

    array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
    
  3. df.columns.drop(dlst, errors='ignore')

    Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
    
  4. list(set(df.columns.values.tolist()).difference(dlst))

    # does not preserve order
    ['E', 'D', 'B', 'F', 'G', 'A', 'C']
    
  5. [x for x in df.columns.values.tolist() if x not in dlst]

    ['A', 'B', 'C', 'D', 'E', 'F', 'G']
    

Columns from Labels
For the sake of comparing the selection process, assume:

 cols = [x for x in df.columns.values.tolist() if x not in dlst]

Then we can evaluate

  1. df.loc[:, cols]
  2. df[cols]
  3. df.reindex(columns=cols)
  4. df.reindex_axis(cols, 1)

Which all evaluate to:

   A  B  C  D  E  F  G
0  1  2  3  4  5  6  7
1  1  2  3  4  5  6  7
2  1  2  3  4  5  6  7

Boolean Slice

We can construct an array/list of booleans for slicing

  1. ~df.columns.isin(dlst)
  2. ~np.in1d(df.columns.values, dlst)
  3. [x not in dlst for x in df.columns.values.tolist()]
  4. (df.columns.values[:, None] != dlst).all(1)

Columns from Boolean
For the sake of comparison

bools = [x not in dlst for x in df.columns.values.tolist()]
  1. df.loc[: bools]

Which all evaluate to:

   A  B  C  D  E  F  G
0  1  2  3  4  5  6  7
1  1  2  3  4  5  6  7
2  1  2  3  4  5  6  7

Robust Timing

Functions

setdiff1d = lambda df, dlst: np.setdiff1d(df.columns.values, dlst)
difference = lambda df, dlst: df.columns.difference(dlst)
columndrop = lambda df, dlst: df.columns.drop(dlst, errors='ignore')
setdifflst = lambda df, dlst: list(set(df.columns.values.tolist()).difference(dlst))
comprehension = lambda df, dlst: [x for x in df.columns.values.tolist() if x not in dlst]

loc = lambda df, cols: df.loc[:, cols]
slc = lambda df, cols: df[cols]
ridx = lambda df, cols: df.reindex(columns=cols)
ridxa = lambda df, cols: df.reindex_axis(cols, 1)

isin = lambda df, dlst: ~df.columns.isin(dlst)
in1d = lambda df, dlst: ~np.in1d(df.columns.values, dlst)
comp = lambda df, dlst: [x not in dlst for x in df.columns.values.tolist()]
brod = lambda df, dlst: (df.columns.values[:, None] != dlst).all(1)

Testing

res1 = pd.DataFrame(
    index=pd.MultiIndex.from_product([
        'loc slc ridx ridxa'.split(),
        'setdiff1d difference columndrop setdifflst comprehension'.split(),
    ], names=['Select', 'Label']),
    columns=[10, 30, 100, 300, 1000],
    dtype=float
)

res2 = pd.DataFrame(
    index=pd.MultiIndex.from_product([
        'loc'.split(),
        'isin in1d comp brod'.split(),
    ], names=['Select', 'Label']),
    columns=[10, 30, 100, 300, 1000],
    dtype=float
)

res = res1.append(res2).sort_index()

dres = pd.Series(index=res.columns, name='drop')

for j in res.columns:
    dlst = list(range(j))
    cols = list(range(j // 2, j + j // 2))
    d = pd.DataFrame(1, range(10), cols)
    dres.at[j] = timeit('d.drop(dlst, 1, errors="ignore")', 'from __main__ import d, dlst', number=100)
    for s, l in res.index:
        stmt = '{}(d, {}(d, dlst))'.format(s, l)
        setp = 'from __main__ import d, dlst, {}, {}'.format(s, l)
        res.at[(s, l), j] = timeit(stmt, setp, number=100)

rs = res / dres

rs

                          10        30        100       300        1000
Select Label                                                           
loc    brod           0.747373  0.861979  0.891144  1.284235   3.872157
       columndrop     1.193983  1.292843  1.396841  1.484429   1.335733
       comp           0.802036  0.732326  1.149397  3.473283  25.565922
       comprehension  1.463503  1.568395  1.866441  4.421639  26.552276
       difference     1.413010  1.460863  1.587594  1.568571   1.569735
       in1d           0.818502  0.844374  0.994093  1.042360   1.076255
       isin           1.008874  0.879706  1.021712  1.001119   0.964327
       setdiff1d      1.352828  1.274061  1.483380  1.459986   1.466575
       setdifflst     1.233332  1.444521  1.714199  1.797241   1.876425
ridx   columndrop     0.903013  0.832814  0.949234  0.976366   0.982888
       comprehension  0.777445  0.827151  1.108028  3.473164  25.528879
       difference     1.086859  1.081396  1.293132  1.173044   1.237613
       setdiff1d      0.946009  0.873169  0.900185  0.908194   1.036124
       setdifflst     0.732964  0.823218  0.819748  0.990315   1.050910
ridxa  columndrop     0.835254  0.774701  0.907105  0.908006   0.932754
       comprehension  0.697749  0.762556  1.215225  3.510226  25.041832
       difference     1.055099  1.010208  1.122005  1.119575   1.383065
       setdiff1d      0.760716  0.725386  0.849949  0.879425   0.946460
       setdifflst     0.710008  0.668108  0.778060  0.871766   0.939537
slc    columndrop     1.268191  1.521264  2.646687  1.919423   1.981091
       comprehension  0.856893  0.870365  1.290730  3.564219  26.208937
       difference     1.470095  1.747211  2.886581  2.254690   2.050536
       setdiff1d      1.098427  1.133476  1.466029  2.045965   3.123452
       setdifflst     0.833700  0.846652  1.013061  1.110352   1.287831

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
for i, (n, g) in enumerate([(n, g.xs(n)) for n, g in rs.groupby('Select')]):
    ax = axes[i // 2, i % 2]
    g.plot.bar(ax=ax, title=n)
    ax.legend_.remove()
fig.tight_layout()

This is relative to the time it takes to run df.drop(dlst, 1, errors='ignore'). It seems like after all that effort, we only improve performance modestly.

enter image description here

If fact the best solutions use reindex or reindex_axis on the hack list(set(df.columns.values.tolist()).difference(dlst)). A close second and still very marginally better than drop is np.setdiff1d.

rs.idxmin().pipe(
    lambda x: pd.DataFrame(
        dict(idx=x.values, val=rs.lookup(x.values, x.index)),
        x.index
    )
)

                      idx       val
10     (ridx, setdifflst)  0.653431
30    (ridxa, setdifflst)  0.746143
100   (ridxa, setdifflst)  0.816207
300    (ridx, setdifflst)  0.780157
1000  (ridxa, setdifflst)  0.861622

answered 11 months ago Ted Petrou #12

Pandas 0.21+ answer

Pandas version 0.21 has changed the drop method slightly to include both the index and columns parameters to match the signature of the rename and reindex methods.

df.drop(columns=['column_a', 'column_c'])

Personally, I prefer using the axis parameter to denote columns or index because it is the predominant keyword parameter used in nearly all pandas methods. But, now you have some added choices in version 0.21.

answered 2 weeks ago Daksh #13

Another way of Deleting a Column in Pandas DataFrame

if you're not looking for In-Place deletion then you can create a new DataFrame by specifying the columns using DataFrame(...) function as

my_dict = { 'name' : ['a','b','c','d'], 'age' : [10,20,25,22], 'designation' : ['CEO', 'VP', 'MD', 'CEO']}

df = pd.DataFrame(my_dict)

Create a new DataFrame as

newdf = pd.DataFrame(df, columns=['name', 'age'])

You get a result as good as what you get with del / drop

comments powered by Disqus