I'm using this code to get standard output from an external program:
>>> from subprocess import * >>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()
The communicate() method returns an array of bytes:
>>> command_stdout b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n'
However, I'd like to work with the output as a normal Python string. So that I could print it like this:
>>> print(command_stdout) -rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1 -rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2
I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:
>>> binascii.b2a_qp(command_stdout) b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n'
Does anybody know how to convert the bytes value back to string? I mean, using the "batteries" instead of doing it manually. And I'd like it to be ok with Python 3.pythonstringpython-3.x
You need to decode the bytes object to produce a string:
>>> b"abcde" b'abcde' # utf-8 is used here because it is a very common encoding, but you # need to use the encoding your data is actually in. >>> b"abcde".decode("utf-8") 'abcde'
You need to decode the byte string and turn it in to a character (unicode) string.
I think what you actually want is this:
>>> from subprocess import * >>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate() >>> command_text = command_stdout.decode(encoding='windows-1252')
Aaron's answer was correct, except that you need to know WHICH encoding to use. And I believe that Windows uses 'windows-1252'. It will only matter if you have some unusual (non-ascii) characters in your content, but then it will make a difference.
By the way, the fact that it DOES matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation (or read it here).
I think this way is easy:
bytes = [112, 52, 52] "".join(map(chr, bytes)) >> p44
To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').
Set universal_newlines to True, i.e.
command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()
If you don't know the encoding, then to read binary input into string in Python 3 and Python 2 compatible way, use ancient MS-DOS cp437 encoding:
PY3K = sys.version_info >= (3, 0) lines =  for line in stream: if not PY3K: lines.append(line) else: lines.append(line.decode('cp437'))
Because encoding is unknown, expect non-English symbols to translate to characters of
cp437 (English chars are not translated, because they match in most single byte encodings and UTF-8).
Decoding arbitrary binary input to UTF-8 is unsafe, because you may get this:
>>> b'\x00\x01\xffsd'.decode('utf-8') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte
The same applies to
latin-1, which was popular (default?) for Python 2. See the missing points in Codepage Layout - it is where Python chokes with infamous
ordinal not in range.
UPDATE 20150604: There are rumors that Python 3 has
surrogateescape error strategy for encoding stuff into binary data without data loss and crashes, but it needs conversion tests
[binary] -> [str] -> [binary] to validate both performance and reliability.
UPDATE 20170116: Thanks to comment by Nearoo - there is also a possibility to slash escape all unknown bytes with
backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions:
PY3K = sys.version_info >= (3, 0) lines =  for line in stream: if not PY3K: lines.append(line) else: lines.append(line.decode('utf-8', 'backslashreplace'))
UPDATE 20170119: I decided to implement slash escaping decode that works for both Python 2 and Python 3. It should be slower that
cp437 solution, but it should produce identical results on every Python version.
# --- preparation import codecs def slashescape(err): """ codecs error handler. err is UnicodeDecode instance. return a tuple with a replacement for the unencodable part of the input and a position where encoding should continue""" #print err, dir(err), err.start, err.end, err.object[:err.start] thebyte = err.object[err.start:err.end] repl = u'\\x'+hex(ord(thebyte))[2:] return (repl, err.end) codecs.register_error('slashescape', slashescape) # --- processing stream = [b'\x80abc'] lines =  for line in stream: lines.append(line.decode('utf-8', 'slashescape'))
I made a function to clean a list
def cleanLists(self, lista): lista = [x.strip() for x in lista] lista = [x.replace('\n', '') for x in lista] lista = [x.replace('\b', '') for x in lista] lista = [x.encode('utf8') for x in lista] lista = [x.decode('utf8') for x in lista] return lista
In Python 3 you can use directly:
which is equivalent to
here the default encoding is "utf-8", or you can check it by:
>> import sys >> sys.getdefaultencoding()
To interpret a byte sequence as a text, you have to know the corresponding character encoding:
unicode_text = bytestring.decode(character_encoding)
>>> b'\xc2\xb5'.decode('utf-8') 'µ'
ls command may produce output that can't be interpreted as text. File names
on Unix may be any sequence of bytes except slash
b'/' and zero
>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()
Trying to decode such byte soup using utf-8 encoding raises
It can be worse. The decoding may fail silently and produce mojibake if you use a wrong incompatible encoding:
>>> '—'.encode('utf-8').decode('cp1252') 'â€”'
The data is corrupted but your program remains unaware that a failure has occurred.
In general, what character encoding to use is not embedded in the byte sequence itself. You have to communicate this info out-of-band. Some outcomes are more likely than others and therefore
chardet module exists that can guess the character encoding. A single Python script may use multiple character encodings in different places.
ls output can be converted to a Python string using
function that succeeds even for undecodable
filenames (it uses
surrogateescape error handler on
import os import subprocess output = os.fsdecode(subprocess.check_output('ls'))
To get the original bytes, you could use
If you pass
universal_newlines=True parameter then
locale.getpreferredencoding(False) to decode bytes e.g., it can be
cp1252 on Windows.
Different commands may use different character encodings for their
dir internal command (
cmd) may use cp437. To decode its
output, you could pass the encoding explicitly (Python 3.6+):
output = subprocess.check_output('dir', shell=True, encoding='cp437')
The filenames may differ from
os.listdir() (which uses Windows
Unicode API) e.g.,
'\xb6' can be substituted with
cp437 codec maps
b'\x14' to control character U+0014 instead of
U+00B6 (¶). To support filenames with arbitrary Unicode characters, see Decode poweshell output possibly containing non-ascii unicode characters into a python string
For Python 3,this is a much safer and Pythonic approach to convert from
def byte_to_str(bytes_or_str): if isinstance(bytes_or_str, bytes): #check if its in bytes print(bytes_or_str.decode('utf-8')) else: print("Object not of byte type") byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n')
total 0 -rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1 -rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2
If you should get the following by trying
AttributeError: 'str' object has no attribute 'decode'
You can also specify the encoding type straight in a cast:
>>> my_byte_str b'Hello World' >>> str(my_byte_str, 'utf-8') 'Hello World'
When working with data from Windows systems (with
\r\n line endings), my answer is
String = Bytes.decode("utf-8").replace("\r\n", "\n")
Why? Try this with a multiline Input.txt:
Bytes = open("Input.txt", "rb").read() String = Bytes.decode("utf-8") open("Output.txt", "w").write(String)
All your line endings will be doubled (to
\r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only
\n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,
Bytes = open("Input.txt", "rb").read() String = Bytes.decode("utf-8").replace("\r\n", "\n") open("Output.txt", "w").write(String)
will replicate your original file.