Converting Geospatial Coordinate Encoding in Snowflake

This series shows you the various ways you can use Python within Snowflake.

Recently, I was involved in a project that required using geospatial (basically, maps) data for an analytical application. With ESRI shapefiles having been around for more than three decades, any serious BI application like Tableau can easily read and work with these spatial files. But Snowflake has geospatial data types available, too; and when there is a need to combine geospatial data with other data or to process it in a performant manner, Snowflake has the edge over storing and referencing these files locally. Besides, Snowflake has recently made it possible to read files dynamically in a Python Snowpark function. Daria Rostovtseva from Snowflake has written a crisp article about loading shapefile archives using a Python Snowpark UDTF. So, I thought I would make use of it right away and make life (and analysis) simpler. Little did I know that I had embarked on a small adventure.

Loading the Shape Data

As the project required UK geographical data, my colleague Max provided me with some excellent publicly available datasets from the 2011 UK census. Following the above article, it is fairly straightforward to load the data. The entire shapefile ZIP archive is required to decipher the data, not just the .shp file. Simply create an internal stage, load the shape data archive there and then use the Snowpark Python table function py_load_geofile given in the blog to load the data into a table.

create stage int_stage;
put file://infuse_dist_lyr_2011.zip @int_stage/shape_files parallel = 8 auto_compress = false source_compression = auto_detect;

create or replace table local_area_districts_staged as
select 
    properties,
    to_geography(wkb, True) as geo_features
from table(
    py_load_geofile(
        build_scoped_file_url(@int_stage, 'shape_files/infuse_dist_lyr_2011.zip')
                                        , 'infuse_dist_lyr_2011.shp'
    )
);

It loaded uneventfully; but when referenced in Tableau, Tableau didn’t like it, although the data types were what they should have been.

Loading Shape Data into Snowflake

The first suspect for a data engineer is always the data. Snowflake provides a neat function, st_isvalid, to check whether the underlying geographical data is valid or not.

select st_isvalid(geo_features)
from local_area_districts_staged;

To my shock, every value in the geography column was invalid!

Geography column invalid

But in this case, the data couldn’t be doubted, as it was officially published governmental data. It left me thinking, and when I started looking for clues, I came across this statement in Daria’s blog, which I had overlooked before:

It’s important to be aware of the shapefile’s spatial reference. If geographic coordinates are unprojected (i.e., representing locations on the Earth’s surface as points on a 3D sphere using WGS84 geographic coordinate system), it is appropriate to parse the geographic coordinates into Snowflake’s geography data type. In my example, the data is in WGS84, and parsed using to_geography function. If the coordinates are projected (representing locations on the Earth’s surface using a two-dimensional Cartesian coordinate system), they should be parsed into geometry data type using one of the geometry functions. To figure out if the shapefile is projected or unprojected, you can look at the .prj file inside the zipped shapefile archive.
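This advice is easy to apply locally before loading anything. ESRI-style .prj files contain well-known text (WKT) whose root keyword is PROJCS for a projected system and GEOGCS for a geographic (unprojected) one. A minimal sketch using only the Python standard library (classify_prj is a hypothetical helper, not part of the blog's pipeline):

```python
# Classify a .prj file's contents as projected or geographic
# by inspecting the WKT root keyword:
# PROJCS = projected (parse with to_geometry),
# GEOGCS = geographic (parse with to_geography).

def classify_prj(wkt: str) -> str:
    root = wkt.lstrip().upper()
    if root.startswith("PROJCS"):
        return "projected"
    if root.startswith("GEOGCS"):
        return "geographic"
    return "unknown"

# The .prj in this archive begins with PROJCS, so the data is projected:
print(classify_prj('PROJCS["OSGB_1936_British_National_Grid",...]'))  # projected
```

In practice, one would read the WKT with `open("infuse_dist_lyr_2011.prj").read()` after unzipping the archive.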

The CTAS statement to import the data was thus updated to load a column of type geometry instead of geography as follows.

create or replace table local_area_districts_staged as
select 
    properties,
    to_geometry(wkb, False) as geo_features
from table(
    py_load_geofile(
        build_scoped_file_url(@int_stage, 'shape_files/infuse_dist_lyr_2011.zip')
                                        , 'infuse_dist_lyr_2011.shp'
    )
);

Please note the call to the to_geometry function. The value of the second parameter, <allow_invalid>, had now been set to False to perform stricter loading and reject invalid shapes. Keen readers will notice that in the first version of the CTAS statement (see above), this parameter value to the to_geography function was True, which is why Snowflake accepted the invalid geographical coordinates in the first place. However, this was far from a solution, as Tableau would still require the geographical coordinates in longitude/latitude format.

Finding and Converting the Coordinate Encoding

My knowledge of geospatial data and systems being less than elementary, this was a real puzzle for me. But a close look at some data points (for example, (291902, 196289)) told me that these are not longitude/latitude data! WGS84 meant longitude/latitude data. Now, a word (or rather, a paragraph!) on geographic coordinate systems is necessary here. WGS, or World Geodetic System, is a “standard used in cartography, geodesy and satellite navigation including GPS,” according to Wikipedia, and 84 is the current version of the standard; so the longitude/latitude numbers, in layman’s terms, become WGS84-encoded coordinates. The .prj file in the data archive had some really cryptic information, of which the following seemed interesting: “OSGB_1936_British_National_Grid.”
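A quick plausibility check makes the mismatch obvious: valid WGS84 longitudes lie in [-180, 180] and latitudes in [-90, 90], while British National Grid eastings and northings are meter offsets in the hundreds of thousands. A small sketch (looks_like_lonlat is a hypothetical helper for illustration):

```python
def looks_like_lonlat(x: float, y: float) -> bool:
    """Heuristic: could (x, y) be a WGS84 longitude/latitude pair?"""
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0

print(looks_like_lonlat(291902, 196289))  # False - BNG easting/northing in meters
print(looks_like_lonlat(-2.0, 53.5))      # True  - a plausible UK longitude/latitude
```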

Using a Built-in Snowflake Function

Fortunately, Snowflake has recently released the st_transform function into GA. With this function, it is easy to convert the spatial reference system of any coordinates into any other system. When the “from” and “to” SRIDs of the coordinate systems are known, the conversion is simply a function call on the column that contains the geospatial data.

ST_TRANSFORM( <geometry_expression> [ , <from_srid> ] , <to_srid> );

A simple search for the SRIDs of OSGB_1936_British_National_Grid and WGS84 showed that they are 27700 and 4326, respectively (WGS84 is associated with several SRIDs, but the relevant one here is 4326, as per the Snowflake documentation). That means the conversion could simply be:

select st_transform(geo_features, 27700, 4326) from local_area_districts_staged;

But there is one thing to keep in mind: the st_transform function returns the GEOMETRY data type, not the GEOGRAPHY data type. So, the final table creation statement became:

create or replace table local_area_districts
as
select properties, try_to_geography(st_aswkb(st_transform(geo_features, 27700, 4326))) as geography
from local_area_districts_staged;

A quick test showed that Tableau was happy with this data and could read it as geospatial map data (screenshot similar to the one at the end of this post).

Now, this could very well be the end of this post, as the goal was achieved, but something was still on my mind: the .prj file that came with the data, which I mentioned previously, had much more descriptive content. The following, to be precise:

PROJCS[
  "OSGB_1936_British_National_Grid",
  GEOGCS["GCS_OSGB 1936",
    DATUM[
      "D_OSGB_1936",
      SPHEROID["Airy_1830",6377563.396,299.3249646]
    ],
    PRIMEM["Greenwich",0],
    UNIT["Degree",0.017453292519943295]
  ],
  PROJECTION["Transverse_Mercator"],
  PARAMETER["latitude_of_origin",49],
  PARAMETER["central_meridian",-2],
  PARAMETER["scale_factor",0.9996012717],
  PARAMETER["false_easting",400000],
  PARAMETER["false_northing",-100000],
  UNIT["Meter",1]
]

There was a lot of information there that the st_transform function had no regard for. In all fairness, it did serve my purpose, but what if all this extra information is required to achieve projection or conversion accuracy? How could this information be used, given that there is no way to pass it to the st_transform function? That’s when it clicked: Python has a solution for everything! If I could find a Pythonic way of coordinate conversion, could Snowpark help me?

Using Snowpark Python UDF to Convert Coordinates

pyproj Transformer

The brainwave I had was not going to make me a geoinformatics expert in one day, but thankfully, these days we have very powerful assistance available: I sought the help of ChatGPT.

ChatGPT Python Help

The biggest help from ChatGPT was determining which Python package to use and the boilerplate code. Credit should be given where it is due; although it was a great help in setting me on the right course, it was by no means simple copy-and-paste work. A lot more needed to be done in order to upgrade this code into “production”-grade code. It’s also worth noting that the British Geological Survey provides a web service as well as a detailed SQL process for bulk conversion of coordinates, both of which were unsuitable for this use case due to factors like dependencies and scale. A native Python Snowpark UDF was the better choice, and the crux of this new solution therefore remained the conversion function convert_bng_to_latlon provided (and so beautifully drafted) by ChatGPT.

The transformer.transform function suggested by ChatGPT has been deprecated since version 2.6.1 of the pyproj library. The newest version (3.6.0 as of writing this article) has a class named Transformer, which is recommended for all these conversion tasks. There are multiple ways to create an object of the Transformer class: one can simply specify the EPSG string or create a pyproj.crs.CRS object from elaborate settings like those given above.

That means the following code is sufficient and would work:

from pyproj import Transformer

transformer = Transformer.from_crs(
  "EPSG:27700" # British National Grid (OSGB)
  , "EPSG:4326" # WGS84 (latitude and longitude)
)

This next way of creating a Transformer object is a much better way to preserve all the characteristics of the source CRS:

from pyproj import Transformer, CRS

crs_str = '''
PROJCS[
  "OSGB_1936_British_National_Grid"
  ,GEOGCS[
    "GCS_OSGB 1936"
    ,DATUM[
      "D_OSGB_1936"
      ,SPHEROID["Airy_1830",6377563.396,299.3249646]
    ]
    ,PRIMEM["Greenwich",0]
    ,UNIT["Degree",0.017453292519943295]
  ]
  ,PROJECTION["Transverse_Mercator"]
  ,PARAMETER["latitude_of_origin",49]
  ,PARAMETER["central_meridian",-2]
  ,PARAMETER["scale_factor",0.9996012717]
  ,PARAMETER["false_easting",400000]
  ,PARAMETER["false_northing",-100000]
  ,UNIT["Meter",1]
]
'''

# Create an explicit CRS object
in_crs = CRS(crs_str)
# Define the coordinate systems
transformer = Transformer.from_crs(
  crs_from=in_crs
  , crs_to='EPSG:4326'
  , always_xy=True
)

There is a plethora of other parameters which the from_crs function accepts, which are explained well in the pyproj Transformer documentation.

Snowflake requires geographic data in longitude/latitude order, not the latitude/longitude order that the Transformer.transform function would normally return. The always_xy parameter takes care of returning the result values in x, y (longitude, latitude) order.

Maintaining the Structure of the Shape Data

Geospatial objects in any format (WKT, WKB, GeoJSON, etc.) are collections of coordinate values for various shapes. The actual nature of the shape determines how many coordinates are required, and further rules can apply. For example, a Point shape requires only a single coordinate pair, whereas a LineString shape requires two (or more) coordinate pairs. Shapes like Polygon and MultiPolygon not only require several pairs of coordinates, they also require the first and last coordinates of each ring to be identical to close the shape, and the resulting edges should not overlap.
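The ring-closure rule above is easy to verify in code. A minimal sketch (ring_is_closed is a hypothetical helper; per the GeoJSON convention, a valid linear ring has at least four positions, with the first equal to the last):

```python
def ring_is_closed(ring):
    """Check the GeoJSON linear-ring rule: at least 4 positions,
    and the first position equals the last."""
    return len(ring) >= 4 and ring[0] == ring[-1]

closed_ring = [[0, 0], [0, 1], [1, 1], [0, 0]]
open_ring = [[0, 0], [0, 1], [1, 1]]
print(ring_is_closed(closed_ring))  # True
print(ring_is_closed(open_ring))    # False
```

Any coordinate conversion must preserve this property: since the first and last pairs are identical going in, transforming each pair independently keeps them identical coming out.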

The data in question contained mostly Polygons and MultiPolygons, which are represented as lists, or lists of lists, of tuples. For MultiPolygons, there is one more level of list nesting. This structure must be kept intact while converting the values in order to maintain the validity of the shape. In the examples below, notice the extra level of nesting for the MultiPolygon type. Consider how much of this the built-in function st_transform handles for us!

{
  "coordinates": [
    [
      [
        5.319848470000000e+05,
        1.818774940000000e+05
      ],
      [
        5.319886580000001e+05,
        1.818791160000000e+05
      ],
      ...
   ]
  ],
  "type": "Polygon"
}

{
  "coordinates": [
    [
      [
        [
          2.402629060000000e+05,
          1.271040940000000e+05
        ],
        [
          2.402652040000000e+05,
          1.271073980000000e+05
        ],
        ...
      ]
    ]
  ],
  "type": "MultiPolygon"
}
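The varying nesting depth can also be handled generically: a recursive walk applies the coordinate transform only at the innermost pairs and rebuilds every enclosing list unchanged, so the same code covers Polygon, MultiPolygon and any other GeoJSON geometry. A sketch with a placeholder transform standing in for pyproj's transformer.transform (map_coordinates is a hypothetical helper, not the blog's UDF):

```python
def map_coordinates(coords, transform):
    """Recursively apply `transform` to every (x, y) pair in a GeoJSON
    coordinates array, preserving the nesting structure."""
    if coords and isinstance(coords[0], (int, float)):
        return list(transform(coords[0], coords[1]))  # innermost coordinate pair
    return [map_coordinates(c, transform) for c in coords]

# Placeholder transform; the real one would be pyproj's transformer.transform.
swap = lambda x, y: (y, x)

polygon = [[[1, 2], [3, 4], [5, 6], [1, 2]]]          # one ring
multipolygon = [[[[1, 2], [3, 4], [5, 6], [1, 2]]]]   # one more nesting level
print(map_coordinates(polygon, swap))       # [[[2, 1], [4, 3], [6, 5], [2, 1]]]
print(map_coordinates(multipolygon, swap))  # [[[[2, 1], [4, 3], [6, 5], [2, 1]]]]
```

This also transparently handles polygons with interior rings (holes), which a depth-based heuristic can miss.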

The next thing to consider was what data types a Python UDF can accept. As per the SQL-Python data type mapping given in the Snowflake documentation, a Python UDF can natively accept the GEOGRAPHY data type as a parameter, but not GEOMETRY. This meant a conversion of the geo_features column of the table local_area_districts_staged to GeoJSON format was necessary before sending it to the Python UDF.

Python UDF for Geospatial Coordinate Conversion

Considering all the above points, the code for the final conversion Python UDF looked as follows:

create or replace function py_convert_to_wgs84(FROM_CRS string, GEOJSON object)
returns object
language python
runtime_version = '3.10'
packages = ('pyproj')
handler = 'convert_coordinates_to_lonlat'
as
$$
from pyproj import Transformer, CRS

def convert_coordinates_to_lonlat(FROM_CRS, GEOJSON):
  # Define the input CRS
  in_crs = CRS(FROM_CRS)
  
  # Output GeoJSON object
  out_geojson = {}  
  out_geojson['type'] = GEOJSON['type']
  # Define the coordinate systems
  transformer = Transformer.from_crs(
    crs_from = in_crs # Input CRS provided as parameter
    , crs_to = "EPSG:4326" # WGS84 (latitude and longitude)
    , always_xy = True # If true, the transform method will accept as input and return as output coordinates using the traditional GIS order, that is longitude, latitude for geographic CRS and easting, northing for most projected CRS.
  )
  
  # (Multi)List to store the converted coordinates
  # (Multi)list to store the converted coordinates
  wgs84_coordinates = []
  if GEOJSON['type'] == 'Polygon':
    # For Polygon, coordinates is a list of rings (lists of coordinate pairs)
    for ring in GEOJSON['coordinates']:
      # Convert each BNG coordinate to longitude and latitude
      lon_lat_coords = [transformer.transform(xycoord[0], xycoord[1]) for xycoord in ring]
      wgs84_coordinates.append(lon_lat_coords)
  else:
    # For MultiPolygon, coordinates is a list of polygons, each a list of rings
    for polygon in GEOJSON['coordinates']:
      new_polygon = []
      for ring in polygon:
        # Convert each BNG coordinate to longitude and latitude
        lon_lat_coords = [transformer.transform(xycoord[0], xycoord[1]) for xycoord in ring]
        new_polygon.append(lon_lat_coords)
      wgs84_coordinates.append(new_polygon)

  out_geojson['coordinates'] = wgs84_coordinates

  return out_geojson
$$;

Notice how the handling for Polygon and MultiPolygon types differs. As the call to this UDF requires first converting the geo_features column to something the UDF can accept, the column first needed to be converted to GeoJSON and then to the OBJECT data type. The statement also uses try_to_geography instead of the to_geography system function to avoid failures due to coordinate conversion. The SQL statement for that would be as follows:

create or replace table local_area_districts
as
select 
  properties
  , try_to_geography(
      py_convert_to_wgs84(
        'PROJCS["OSGB_1936_British_National_Grid",GEOGCS["GCS_OSGB 1936",DATUM["D_OSGB_1936",SPHEROID["Airy_1830",6377563.396,299.3249646]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",49],PARAMETER["central_meridian",-2],PARAMETER["scale_factor",0.9996012717],PARAMETER["false_easting",400000],PARAMETER["false_northing",-100000],UNIT["Meter",1]]'
        , to_object(st_asgeojson(geo_features)))
      ) as geography
from local_area_districts_staged
;

The benefit of accepting the CRS settings as a string is that one can provide a detailed, even custom, specification, which is not possible with the st_transform function. At the same time, a simple EPSG code would also suffice. That means, in my case, the following code would have achieved the same result:

create or replace table local_area_districts
as
select 
  properties
  , try_to_geography(
      py_convert_to_wgs84(
        'EPSG:27700'
        , to_object(st_asgeojson(geo_features)))
      ) as geography
from local_area_districts_staged
;

This flexibility is the power of using Python and Snowpark!

Not only did Tableau recognize the shape (GEOGRAPHY) data created by either of the methods above, it was also able to plot it flawlessly, finally making me and my colleague Max happy.

Geospatial Data Visualized in Tableau

Concluding Remarks

Snowflake is a powerful platform. Out of the box, it provides a way to convert coordinates from one geospatial reference system to another. But if you want to go further than that and support custom specifications or perform extra steps during or after conversion, you have the entire power of Python at your disposal through Snowpark. The choice is yours!

Check out the blog of my colleague, Max Giegerich, who plays with this data further to show some interesting possibilities.

More About the Author

Chaitanya Joshi

Services Lead
