Unstack a column into dataframe

Yang

I have some messy sensor-reading data that looks like this. The records vary in length, each one is terminated by a "----" line, and they are all stacked in a single column. Is there a way to flatten it into a dataframe in which every row is one record?

import pandas as pd

test = pd.DataFrame({"Messy":["21/12/2017 11:12:48","Port:4","Reading 1: 1","----","21/12/2017 11:13:48","Port:4","Reading 1: 2","Reading 2: 2.5","----"]})
test

    Messy
0   21/12/2017 11:12:48
1   Port:4
2   Reading 1: 1
3   ----
4   21/12/2017 11:13:48
5   Port:4
6   Reading 1: 2
7   Reading 2: 2.5
8   ----

What I want to have is something like this:

target = pd.DataFrame({"Time":["21/12/2017 11:12:48","21/12/2017 11:13:48"],"Port":["Port:4","Port:4"],"Field1":['Reading 1: 1','Reading 1: 2'],"Field2":['','Reading 2: 2.5']})
target

   Field1         Field2           Port      Time
0  Reading 1: 1                    Port:4    21/12/2017 11:12:48
1  Reading 1: 2   Reading 2: 2.5   Port:4    21/12/2017 11:13:48

Answers

answered 6 months ago jezrael #1

Obviously it is really data dependent, but you can try:

#check separator
m = test['Messy'].str.startswith('----')
#create groups
test['g'] = m.cumsum()
#filter separator rows
df = test[~m].copy()
#count groups
df['c'] = df.groupby('g').cumcount()
print (df)
                 Messy  g  c
0  21/12/2017 11:12:48  0  0
1               Port:4  0  1
2         Reading 1: 1  0  2
4  21/12/2017 11:13:48  1  0
5               Port:4  1  1
6         Reading 1: 2  1  2
7       Reading 2: 2.5  1  3

#pivoting
df = df.pivot(index='g', columns='c', values='Messy')
print (df)
c                    0       1             2               3
g                                                           
0  21/12/2017 11:12:48  Port:4  Reading 1: 1            None
1  21/12/2017 11:13:48  Port:4  Reading 1: 2  Reading 2: 2.5
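
The steps above can be wrapped into a small helper. This is only a minimal sketch, assuming a pandas version where pivot takes keyword arguments and a column with no missing values; the function name unstack_records, its parameters, and the Field1, Field2, ... labels are hypothetical and chosen to match the target in the question.

```python
import pandas as pd


def unstack_records(messy, sep="----", names=("Time", "Port")):
    """Turn a stacked Series of records into one row per record.

    Hypothetical helper: `names` labels the leading fields, and any
    remaining fields are labelled Field1, Field2, ...
    """
    is_sep = messy.str.startswith(sep)
    # Each separator closes a record, so the running count of separators
    # seen so far identifies the record a row belongs to.
    df = pd.DataFrame({"value": messy, "g": is_sep.cumsum()})[~is_sep]
    # Position of each row inside its record.
    df = df.assign(c=df.groupby("g").cumcount())
    wide = df.pivot(index="g", columns="c", values="value")
    extra = [f"Field{i + 1}" for i in range(wide.shape[1] - len(names))]
    wide.columns = (list(names) + extra)[: wide.shape[1]]
    return wide.reset_index(drop=True)


test = pd.DataFrame({"Messy": ["21/12/2017 11:12:48", "Port:4", "Reading 1: 1", "----",
                               "21/12/2017 11:13:48", "Port:4", "Reading 1: 2",
                               "Reading 2: 2.5", "----"]})
result = unstack_records(test["Messy"])
print(result)
```

Fields that a shorter record lacks come out as missing values, which can be replaced with fillna('') if empty strings are preferred.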

answered 6 months ago jpp #2

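Below is one solution. Your data is messy, and this method is fragile: it slices the column into consecutive chunks of 4 rows, so it only works when those chunks happen to line up with the records, as they do in this sample.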

import pandas as pd

test = pd.DataFrame({"Messy":["21/12/2017 11:12:48","Port:4","Reading 1: 1","----","21/12/2017 11:13:48","Port:4","Reading 1: 2","Reading 2: 2.5","----"]})

# slice the column into consecutive chunks of 4 rows, one list per chunk
lst = [test['Messy'].iloc[4*i:4*i+4].tolist() for i in range(len(test.index) // 4)]

df = pd.DataFrame(lst, columns=['Date', 'Port', 'Field1', 'Field2']).replace({'----': ''})

#                   Date    Port        Field1          Field2
# 0  21/12/2017 11:12:48  Port:4  Reading 1: 1                
# 1  21/12/2017 11:13:48  Port:4  Reading 1: 2  Reading 2: 2.5
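
When records do not line up in blocks of four rows, fixed-size slicing misaligns. A minimal sketch of a different approach (not the method above) is to close a record at each separator instead; the helper name split_records is hypothetical, and the column labels assume the first two fields are always the date and the port.

```python
import pandas as pd


def split_records(values, sep="----"):
    """Group consecutive values into records, closing one at each separator."""
    records, current = [], []
    for value in values:
        if value.startswith(sep):
            if current:  # ignore empty records from repeated separators
                records.append(current)
            current = []
        else:
            current.append(value)
    if current:  # keep a trailing record that has no closing separator
        records.append(current)
    return records


test = pd.DataFrame({"Messy": ["21/12/2017 11:12:48", "Port:4", "Reading 1: 1", "----",
                               "21/12/2017 11:13:48", "Port:4", "Reading 1: 2",
                               "Reading 2: 2.5", "----"]})
records = split_records(test["Messy"])

# pandas pads the shorter records with missing values on the right.
df = pd.DataFrame(records)
width = df.shape[1]
df.columns = (["Date", "Port"] + [f"Field{i}" for i in range(1, width - 1)])[:width]
print(df)
```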

answered 6 months ago O.Suleiman #3

Assuming each record has at most 4 fields, all records list their fields in the same order, and no value contains a comma, here is another solution using re, io and pandas:

import pandas as pd
import io
import re
d = {"Messy":["21/12/2017 11:12:48","Port:4","Reading 1: 1","----",
            "21/12/2017 11:13:48","Port:4","Reading 1: 2","Reading 2: 2.5",
            "----"]}

test = pd.read_csv(io.StringIO(re.sub(r',----,?','\n', ','.join(d['Messy']))),
                   names=['Time','Port','Field1','Field2'])


print(test)
    Time                Port    Field1          Field2
0   21/12/2017 11:12:48 Port:4  Reading 1: 1    NaN
1   21/12/2017 11:13:48 Port:4  Reading 1: 2    Reading 2: 2.5

You can scale this solution by passing more column names in the names parameter of pd.read_csv(). For example, if a record in your data can have at most 10 fields, supply 10 column names.
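
The number of names can also be derived from the data rather than hardcoded. This is a minimal sketch, assuming that no value contains a comma or a newline and that the first two fields are always the time and the port; the Field1, Field2, ... naming is an illustrative choice, not part of the answer above.

```python
import io
import re

import pandas as pd

messy = ["21/12/2017 11:12:48", "Port:4", "Reading 1: 1", "----",
         "21/12/2017 11:13:48", "Port:4", "Reading 1: 2", "Reading 2: 2.5",
         "----"]

# Same trick as above: join with commas, turn each separator into a newline.
csv_text = re.sub(r",----,?", "\n", ",".join(messy))

# The longest line decides how many column names are needed.
width = max(line.count(",") + 1 for line in csv_text.splitlines())
names = ["Time", "Port"] + [f"Field{i}" for i in range(1, width - 1)]

df = pd.read_csv(io.StringIO(csv_text), names=names)
print(df)
```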
