An RDBMS is an integral part of most analytical layers built on top of a data warehouse. Spark, in turn, is widely used across organizations to build robust ETL pipelines that perform massive in-memory computation and load data into various platforms, including an RDBMS.
We often run into friction when switching between plain Scala and Spark for database operations. When we need just a single value, we may not want Spark to build an entire DataFrame for us. Likewise, DML statements such as UPDATE and DELETE are not supported directly in Spark SQL.
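To make this concrete, here is a minimal sketch (my own illustration, not the article's original code) of going straight to JDBC from Scala for exactly these cases: running an UPDATE/DELETE and fetching a single scalar without a DataFrame. The object name `Db`, the helper names, and the connection parameters are all placeholders.

```scala
import java.sql.{Connection, DriverManager}

object Db {
  // Generic loan pattern: run `f` on the resource and always close it afterwards.
  def using[A <: AutoCloseable, B](resource: A)(f: A => B): B =
    try f(resource) finally resource.close()

  // Open a JDBC connection, hand it to `f`, and close it when done.
  // url/user/pass are placeholders for your own database settings.
  def withConnection[T](url: String, user: String, pass: String)(f: Connection => T): T =
    using(DriverManager.getConnection(url, user, pass))(f)

  // DML such as UPDATE/DELETE -- statements Spark SQL does not run directly.
  def executeUpdate(conn: Connection, sql: String): Int =
    using(conn.createStatement())(_.executeUpdate(sql))

  // Fetch one scalar value -- no need to spin up a DataFrame for this.
  def selectScalar(conn: Connection, sql: String): Option[String] =
    using(conn.createStatement()) { st =>
      val rs = st.executeQuery(sql)
      if (rs.next()) Option(rs.getString(1)) else None
    }
}
```

The loan-pattern helper `using` is the core design choice here: it keeps every statement and connection closed even when a query throws, which matters when these calls run inside long-lived wrapper processes.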
Thanks to the open-source community, Spark keeps improving, helping data/ETL engineers and scientists write powerful frameworks that process big data and load it into datastores with simplicity.
We’re going to split this potentially large topic into smaller parts so we can design the solution step by step. I will add links to the future installments on this page.
In this first installment, we discuss the critical step of writing strong yet simple methods and wrapper shell scripts for designing both batch and real-time processes, going beyond ETL alone.
Whenever I get a chance to work in Scala, I love its Python-like simplicity, Perl-like handiness, and Java-like runtime.
Scala’s implicits are a syntactic-sugar mechanism for writing compact code in your own style, making helpers easy to call from external classes or code blocks. Here are a few custom definitions I use frequently:
I. CamelCase
A camel-case implicit function is handy when storing strings such as first and last names. It can also be called easily from Spark as a UDF.
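A sketch of what such an implicit might look like, assuming the author's helper capitalized each word of a name; the names `StringImplicits` and `toCamelCase` are mine, not from the article. The Spark UDF registration is shown in comments since it requires a live SparkSession.

```scala
object StringImplicits {
  // Enrich String with a camel-case method: "john doe" -> "John Doe".
  implicit class CamelOps(val s: String) extends AnyVal {
    def toCamelCase: String =
      s.split("\\s+")             // split on runs of whitespace
        .filter(_.nonEmpty)       // drop empty tokens from leading spaces
        .map(_.toLowerCase.capitalize)
        .mkString(" ")
  }
}

// Calling it from Spark as a UDF (requires a SparkSession in scope):
//   import org.apache.spark.sql.functions.{col, udf}
//   import StringImplicits._
//   val camelUdf = udf((raw: String) => raw.toCamelCase)
//   df.withColumn("full_name", camelUdf(col("full_name")))
```

Because the implicit class extends `AnyVal`, the enrichment avoids allocating a wrapper object on each call, which keeps it cheap enough to use inside a per-row UDF.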