Using R and Python with Vantage | Part 1: Overview


Hello, and welcome to another edition of Teradata TechBytes. My name is Tim Miller, and I work in Teradata’s Technology and Innovation Office. Today, we’re going to talk about how Teradata’s Vantage platform supports the R and Python languages for data science and analytics. This is the first of a five-part TechByte series on this topic and will provide you with an overview of all the capabilities in Teradata’s Vantage platform for R and Python. My cohost for this series is Alexander Kolovos, an analytic architect in Teradata’s product development organization.

First, let’s talk about the assumptions being made regarding the content of this TechByte, as well as the key takeaways we hope all listeners walk away with. First off, you should have a good understanding of either the R or Python language, ideally both. Second, you should understand the Teradata Vantage platform. Please see the TechByte episode, “The Next Step in Analytics, Teradata Vantage,” from my colleague, Rob Armstrong, if you have not yet been exposed to Vantage. The key takeaways from this TechByte include understanding the motivation behind Teradata’s Vantage R and Python integration; understanding the three ways to utilize R and Python on Vantage and the two ways Vantage provides performance and scalability to put you on a path to production; understanding that R or Python can operate externally to or directly on the Vantage platform, and that there are pros and cons to each approach; and finally, understanding that the Vantage libraries for R and Python leverage the best of open source and Teradata technology.

The three most important reasons for using R and Python with Vantage are performance, scalability, and operationalization. Although adoption of both languages in the data science community is staggering, they are not without their issues. R is single-threaded and memory-bound, and both languages carry the overhead of being interpreted. Complex analytics on large datasets can run very slowly, or not at all, as a result, and many times productization of R and Python analytic workflows can be very difficult. Teradata has worked with many customers on these concerns. You see two such examples here, in retail and pharma, that really illustrate the statements I just made. Most importantly, for these reasons, as well as others, many analytics teams struggle with putting these complex workloads into production, as eloquently stated by James Taylor. Because of this, oftentimes R and Python are on the outside looking in when it comes to pervasive data intelligence. With the Vantage platform, Teradata provides the foundation for the performance and scalability you need to operationalize your analytics.

There are three ways that R and Python programmers can leverage Teradata’s Vantage platform. In the first two scenarios, the language interpreters and packages are running on a client of Vantage. We call these User Experience One and User Experience Two. In the other, the language interpreters and packages are running directly on the Vantage platform. This is User Experience Three. Let’s take a look at each in a little more detail. The first user experience is the status quo, what many of you might be doing today with Teradata or other database platforms. In this case, we’re establishing a connection from R or Python through ODBC or JDBC and moving tables, or portions of tables, into R data frames or pandas DataFrames for further processing.
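To make that first user experience concrete, here is a minimal sketch in Python, assuming an ODBC data source named VantageDSN has been configured for the Teradata ODBC driver; the credentials, table, and column names are placeholders. A sample of the table is pulled across the network, and everything after that runs on the client.

```python
# Minimal sketch of User Experience 1: pull a portion of a table from
# Vantage into a client-side pandas DataFrame over ODBC.
# "VantageDSN", the credentials, table, and columns are placeholders.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=VantageDSN;UID=myuser;PWD=mypassword")
try:
    cur = conn.cursor()
    # Only a sample of the table is transferred; from here on, processing
    # is limited to the client's memory and CPU.
    cur.execute(
        "SELECT cust_id, tx_amount, tx_date "
        "FROM retail.transactions SAMPLE 100000"
    )
    rows = [tuple(r) for r in cur.fetchall()]
    columns = [col[0] for col in cur.description]
    df = pd.DataFrame.from_records(rows, columns=columns)
finally:
    conn.close()

# Any open source library can now be applied to df on the client.
print(df.describe())
```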
The pros of this user experience are that all libraries and functions are possible, package installation is seamless, it’s the environment that most data scientists are familiar with, and it’s simpler to code and debug. On the negative side, though, data transfer is required, the processing power is limited to that of the client or server, and there is no parallel processing and a general lack of scalability, so it’s not suitable for big datasets. And finally, it’s not easily operationalized.

In User Experience 2, as with the status quo, the language interpreters and packages are running outside of Vantage on a separately attached client or server. Additionally, one of two packages is installed on that same client or server. For Python, that’s the teradataml package, and for R, it’s the tdplyr package. These packages provide in-database analytics via R or Python interfaces to Vantage analytic functions and automatically generate the SQL that calls Vantage NewSQL, Machine Learning, or Graph Engine functions. Additionally, for Python, you get SQLAlchemy functionality, and for R, you get dplyr functionality. The benefits of this approach are that these libraries give you R and Python interfaces to parallel, scalable Vantage analytics; the SQL is automatically generated for you; and, because it’s already in your production environment, it has an easier path to operationalization. On the negative side, there is a limited number of analytic functions when you compare it to open source packages, and debugging might be slightly more complex.

In the third and final user experience, the language interpreters and corresponding R or Python add-on packages are installed on each NewSQL Engine node. In-database analytics are provided via R or Python scripts invoked through SQL and executing on every unit of parallelism. In this case, analytic task parallelization depends on data locality. You get out-of-the-box parallelism for row-based operations, such as model scoring, or partition-based operations, such as simultaneously building many models based on a data partition. For system-wide operations, however, the programmer must code their scripts in a MapReduce style because, as you can see, there are multiple result sets streaming out of the Vantage platform from every unit of parallelism. On the pro side, you get parallel processing for scoring and scalar operations, as well as the ability to simultaneously build many models; all libraries and functions are possible because the packages are installed directly on the NewSQL Engine nodes; and it provides you an additional path to operationalization. On the negative side, languages and libraries have to be installed on every node, global operations require MapReduce-style programming, debugging may be slightly more complex, and you must code in SQL.

To summarize, here are the three user experiences. Next, let’s dive a little deeper into User Experiences #2 and #3, starting with User Experience #2, where R and Python are running on Teradata Vantage client systems. tdplyr is a freely available R add-on library for Vantage that provides a SQL back end to connect to and push processing into the Teradata Vantage platform. You can use your favorite IDE for R to access Vantage, whether that’s RStudio, the command line, R Markdown, or even Jupyter. The end user codes in R using dplyr interfaces. It works in conjunction with the Teradata SQL driver for R or Teradata ODBC.
It contains interfaces for connection management and context as well as data management, and it exposes hundreds of Vantage Machine Learning and thousands of NewSQL Engine functions as R interfaces. tdplyr follows R data frame and dplyr conventions by using the dplyr tbl function to create a virtual R data frame that points to a table or view in the NewSQL Engine. It integrates dplyr verbs for data frame use, including base methods such as select, rename, filter, mutate, transmute, arrange, group_by, and summarise. Additionally, R expressions that are mapped to SQL fall into the following categories: math operators and functions, logical and comparison operators, conversion functions, character functions, aggregate functions, windowed functions, and bit functions. The results are stored in temporary tables in Vantage, which are cleaned up when the session context is cleared.

teradataml is a freely available Python add-on library for Vantage that, like tdplyr, also acts as a SQL back end to connect to and push processing to Teradata Vantage. Use your favorite Python IDE to access Vantage, such as Jupyter Notebook on your client, or JupyterLab, JupyterHub, et cetera. The end user codes in Python using teradataml interfaces. teradataml works in conjunction with the Teradata SQL driver for Python and contains interfaces for connection management and context as well as data management. It exposes hundreds of Vantage Machine Learning and thousands of Vantage NewSQL Engine functions as Python interfaces. It follows pandas DataFrame and SQLAlchemy conventions by providing a teradataml DataFrame which mimics a pandas DataFrame. It supports SQLAlchemy methods, aggregates, and operators, and it also supports table and data frame conversion. For example, to_sql creates a table in the NewSQL Engine based on a teradataml DataFrame, copy_to_sql creates a table in the NewSQL Engine based on a teradataml or pandas DataFrame, and to_pandas creates a pandas DataFrame from a teradataml DataFrame. As with tdplyr, results are stored in Vantage as temporary tables and are cleaned up when the context is closed.

Next, let’s talk about some of the dependencies. Both the R and Python packages support 64-bit operating systems such as Windows 7, OS X 10.9, and all the Linux variants you see listed here. For driver connectivity on the R side, the Teradata SQL Driver for R can be used as well as the ODBC driver, while on the Python side, the Teradata SQL Driver for Python must be used along with the Teradata SQL driver dialect for SQLAlchemy. Both packages require version 3.4 or later of their respective language, and as far as required packages go, on the R side you need dplyr, dbplyr, DBI, and, if you are using ODBC, the odbc package, while on the Python side, SQLAlchemy and pandas are required.

You can get the R package tdplyr from downloads.teradata.com, while teradataml is available from PyPI. All the latest software documentation is available on Teradata’s new, interactive HTML website, docs.teradata.com. As I previously indicated, by building on top of well-known open source packages, R and Python users will be able to leverage a tremendous amount of capability within the Vantage NewSQL Engine itself. I do apologize for the eye chart, and I will not read all of the functions that are available. However, the purpose of the slide is to illustrate the volume of functions that are available to both R and Python users.
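To make the teradataml workflow just described a little more concrete, here is a minimal sketch; the host, credentials, table, and column names are placeholders, and the operations shown are only a small sample of what the package exposes.

```python
# Minimal sketch of User Experience 2 with teradataml.
# Host, credentials, table, and column names are placeholders.
import pandas as pd
from teradataml import DataFrame, copy_to_sql, create_context, remove_context

# Connection management and context.
create_context(host="vantage.example.com", username="myuser", password="mypassword")

# A teradataml DataFrame is a reference to a table or view in the
# NewSQL Engine; no data is moved to the client at this point.
tdf = DataFrame("retail_transactions")

# These operations are translated into SQL and pushed into Vantage;
# intermediate results live in temporary tables.
big = tdf[tdf.tx_amount > 100.0]          # becomes a WHERE clause
summary = big.groupby("store_id").sum()   # aggregation runs in the NewSQL Engine

# Only the small, aggregated result is brought back to the client.
local_df = summary.to_pandas()
print(local_df.head())

# copy_to_sql goes the other way: it creates a table in the NewSQL Engine
# from a pandas (or teradataml) DataFrame.
copy_to_sql(df=pd.DataFrame({"store_id": [1, 2]}),
            table_name="store_subset", if_exists="replace")

# Closing the context cleans up the temporary tables.
remove_context()
```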
For R users, in addition to table operations and connection management, all the dplyr verbs are supported, along with the SQL translation you see listed for aggregation and windowed aggregation functions, math functions and operations, logical and comparison operations, as well as conversion, windowed, character, and bit functions. For Python users, in addition to the subset of pandas DataFrame methods supported, you get SQLAlchemy object-relational model methods for aggregation, assignment, description, filtering, joining, and merging. For a list of the latest available functions, please see our website.

In addition to the open source capabilities I just mentioned, both the R and Python packages support the Machine Learning and Graph Engine functions of Vantage for statistics, data transformations, path and pattern analysis, time series, association, clustering, decision trees, Naive Bayes, text analytics, geospatial analysis, and graph and visualization functions. Again, refer to Teradata’s interactive documentation website for the latest on all open source and Teradata functionality available for R and Python within our packages.

Next, let’s take a look at User Experience #3, where R and Python are running on Teradata Vantage nodes. The first method is via the SCRIPT table operator, which pipelines the database I/O to the command line of a Linux OS shell child process. Note that usage can be secured and limited to executing scripting languages of interest, such as R and Python. To execute an R or Python script with the SCRIPT table operator, the user first installs their script and registers it with Vantage via an external stored procedure call. The user then executes a SQL query which calls the SCRIPT table operator and invokes the script. The script executes in a forked Linux child process, reading from standard input and writing to standard output. The script runs on every unit of parallelism. In the NewSQL Engine, these units of parallelism are called AMPs, or Access Module Processors. So this technique provides out-of-the-box parallelization for row- or partition-based operations, as previously mentioned. These are sometimes called single-AMP operations and are akin to a Map phase. A nested call is required to combine the partial results, creating an all-AMP operation, which is akin to a Reduce phase.

The next method is offered through another table operator, specific to R, called the ExecR table operator. It provides in-database support for R program execution and runs in protected mode within a process in the NewSQL Engine called the GPL secure server. It interacts with the NewSQL Engine through a set of APIs, the same APIs used for standard Teradata user-defined functions. These R APIs are used to read and write rows of data and to read metadata from the Teradata data dictionary. In order to execute an R script using ExecR, a SQL statement must be constructed with one or more ON clauses specifying one or more input tables or objects. Note that this is an enhancement that SCRIPT does not have; it can only handle one ON clause, or one input. The R code for the contract and the operator is passed to ExecR using USING clauses. The contract defines your resulting schema; however, as you can see in the code snippet on the screen, you can optionally use a RETURNS clause instead of a contract. The operator itself is the R code that is to be run in parallel.
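Before moving on, here is a rough sketch of the kind of per-AMP Python program the SCRIPT table operator streams rows through; the tab delimiter, the column layout, and the toy scoring function are assumptions for illustration only, and the SQL query that invokes the SCRIPT table operator would name this program and declare the columns it returns.

```python
# Rough sketch of a row-at-a-time scoring script for the SCRIPT table
# operator. Each AMP runs its own copy of this program and streams its
# local rows through standard input and standard output. The delimiter,
# column order, and toy "model" are illustrative assumptions.
import math
import sys

DELIM = "\t"

def score(features):
    # Placeholder model: a numerically stable logistic function over the
    # sum of the feature values.
    z = sum(features)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    fields = line.split(DELIM)
    cust_id = fields[0]
    features = [float(value) for value in fields[1:]]
    # Emit one output row per input row: the key plus its score.
    print(cust_id, score(features), sep=DELIM)
```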
Note that ExecR has the same parallel processing considerations as the SCRIPT table operator we discussed previously. One final note: in Vantage 1.1, Teradata will release a set of packages for both R and Python that will be updated periodically and be available for download from Teradata’s website.

Next, let’s look at R and Python IDEs that are supported by Teradata Vantage. First, JupyterLab. JupyterLab is going to be available as a component of Teradata AppCenter, where it runs under Kubernetes in a containerized environment. It includes a Teradata SQL kernel, which allows seamless connectivity to Vantage as well as the use of SQL magic commands. It has the ability to browse the Vantage dictionary and get basic descriptive statistics on any data element of interest. It also has a variety of plotting capabilities. Obviously, both the tdplyr and teradataml packages are also supported, in addition to the many great features from our friends at Project Jupyter.

RStudio also supports the Vantage platform. It can run on a Teradata SUSE Linux system as a Teradata managed server. It has a Vantage dictionary browser view, as you can see in the upper right quadrant there, powerful plotting capabilities, and, of course, support for tdplyr and its dependency packages, and, in general, it has just a tremendous amount of other functionality from our friends at RStudio.

As we come to the conclusion of this TechByte episode, let’s summarize and talk about the rest of the series. We’ve discussed the three methods that R and Python programmers can use to leverage the Vantage environment. The first is local processing, where data is pulled from Vantage into a connected client or server, and we’ve talked about the pros and cons of that approach as User Experience #1. In the second user experience, in-database processing is supported in Vantage from a connected client or server through the tdplyr or teradataml packages. Again, the pros and cons were discussed, and they’re listed on the screen there. Finally, in-database processing is supported directly on the Vantage NewSQL Engine, where the R or Python interpreters and their base and add-on packages are installed on every NewSQL Engine node, with the pros and cons listed on the screen.

As previously mentioned, this is the first in a series of five TechByte episodes on Teradata’s Vantage R and Python support. TechBytes is hosted on Teradata’s YouTube channel. The other parts of the series will be mostly demonstrations of the capabilities of each of the user experiences we have discussed today, and I’ll be joined by my colleague, Alexander Kolovos. We encourage you to watch the remaining four episodes to get a more complete picture of how your R and Python needs can be met on the Teradata Vantage platform. Thank you for tuning in, and have a wonderful rest of your day.
