Clean invalid characters from data held in a Spark RDD

Dave Poole

I have a PySpark RDD imported from JSON files. The data elements contain a number of values with undesirable characters. For the sake of argument, assume that only characters in string.printable should appear in those JSON files.

Given that a large number of elements contain text, I have been trying to find a way of mapping the incoming RDD through a function that cleans the data and returns a cleansed RDD as output. I can find ways of printing a cleansed element from the RDD, but not of cleansing the entire collection of elements and returning them as an RDD.

An example document might be as shown below. Undesirable characters might creep into the userAgent, pageTags and MarketingReference elements, or indeed any of the text elements.

    "documentId": "abcdef12-1234-5678-fedc-cba9876543210",
    "documentType": "contentSummary",
    "dateTimeCreated": "2017-01-01T03:00:22.478Z"
    "body": {
        "requestUrl": "",
        "requestMethod": "GET",
        "responseCode": "200",
        "userAgent": "Mozilla/5.0 etc",
        "requestHeaders": {
            "connection": "close",
            "host": "",
            "accept-language": "en-gb",
            "via": "1.1",
            "user-agent": "Mozilla/5.0 etc",
            "x-forwarded-proto": "https",
            "clientIp": "",
            "referer": "",
            "accept-encoding": "gzip, deflate",
            "incap-client-ip": ""
        "body": {
            "pageId": "/content/our-web-site/en-gb/holidays/interstitial",
            "pageVersion": "1.0",

            "pageClassification": "product-page",
            "pageTags": "spark, python, rdd, other words",
            "MarketingReference": "BUYMEPLEASE",
            "referrer": "",
            "webSessionId": "abcdef12-1234-5678-fedc-cba9876543210"


answered 1 year ago Dave Poole #1

The underlying problem was that we were trying to clean up data downstream when poor (or totally absent) data quality practices existed upstream.

Eventually it was accepted that we were addressing a symptom and not the cause. The cost of retrospectively fixing data proved to be massively greater than the cost of handling data properly in the first place.
