29 June 2023
As digital technology becomes ever more pervasive, the efficient use of data (how it is collected, stored, handled, and processed) has become an important consideration for any company, from the smallest business to the largest multinational. But for companies that use ‘big data’, this efficiency is much more than a consideration: it is business critical.
smartclip’s adtech operations involve processing several terabytes of raw data every day, collected from multiple sources including the smartx ad server. This raw data and its subsequent use in a diverse range of functions, such as reporting, analysis, and export to clients, are integral to smartclip’s business and the services it provides to broadcasters and publishers. An agile and robust data transformation tool is therefore vital to ensure that smartclip can aggregate and transform raw data effectively and maximise its value as it flows through its operations.
For many years, smartclip used a proprietary data aggregation application. Although it was specifically customised for smartclip and fulfilled its initial purpose, it struggled to handle large datasets. This issue created the need for a more powerful, adaptable, and scalable build tool for data transformation.
smartclippers are always encouraged to think outside the box and present their ideas — whether unconventional or conservative — to foster development within and beyond their teams. With this in mind, Kaya Kupferschmidt, Systems Architect at smartclip and founder of dimajix, built a new solution for data integration and transformation called Datatool in collaboration with smartclip’s Platform Data and Trading Platforms team.
“At dimajix, we work with companies of all shapes and sizes, and across a variety of industries, but they all face similar challenges when it comes to handling big data. Many struggle with complicated coding, inflexible software, and insufficient capacity. This results in increased workloads and higher development costs, and leads to ad-hoc solutions and system patches that may pose problems in the long term. So when smartclip, a company where reliable, fast, and efficient data management is imperative to its business, decided to streamline its data transformation system, it was an opportunity to draw on our combined expertise to create a new tool that would bridge the gap between the complexities of processing big data and the simplicity these companies need.”
Kaya Kupferschmidt, Software Developer at smartclip
Datatool was highly effective for smartclip, but it also had the potential to benefit data teams in other industries. This inspired Kaya to take the initiative and start another project. He reimagined the Datatool concept, removing the smartclip-specific features and replacing them with alternatives that could be applied to many types of companies and projects. This resulted in a more advanced, open source version of Datatool called Flowman. As an open source tool, it not only meets smartclip’s requirements now but can also evolve to meet its requirements in the future.
Introducing Flowman — a powerful, open source data transformation tool
Flowman is a declarative data transformation tool based on Apache Spark that simplifies the act of writing data transformation applications. Developed to replace smartclip’s Datatool, it brings a modern approach to ETL that results in robust data transformation pipelines.
As can be seen from the chart below, Flowman works by collecting raw data from multiple sources, whether collected internally or added from external sources. It then extracts, transforms, enriches, and aggregates the data according to the supplied business or transformation logic, before outputting it in the required format. For example, at smartclip, the source could be the smartx ad server logs and the output could be pre-aggregated data for financial reporting or custom raw data exports for selected partners.
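The flow described above can be sketched in a few lines of plain Python. This is only an illustration of the extract, enrich, aggregate, and output stages, not Flowman or Spark code; the log format, the `CAMPAIGN_NAMES` lookup table, and all function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw ad server log lines: timestamp, campaign id, event type
RAW_LOG = """\
2023-06-29T10:00:01,c1,impression
2023-06-29T10:00:02,c1,click
2023-06-29T10:00:03,c2,impression
2023-06-29T10:00:04,c1,impression
"""

# Hypothetical reference data from a second source, used for enrichment
CAMPAIGN_NAMES = {"c1": "Spring Sale", "c2": "Brand Awareness"}


def extract(raw_text):
    """Parse raw CSV text into one dict per event."""
    reader = csv.reader(io.StringIO(raw_text))
    return [{"ts": ts, "campaign_id": cid, "event": ev} for ts, cid, ev in reader]


def enrich(events, names):
    """Attach the campaign name from the lookup table to every event."""
    return [{**e, "campaign": names.get(e["campaign_id"], "unknown")} for e in events]


def aggregate(events):
    """Count events per (campaign, event type)."""
    counts = defaultdict(int)
    for e in events:
        counts[(e["campaign"], e["event"])] += 1
    return dict(counts)


def to_csv(counts):
    """Write the aggregated counts as CSV text, sorted for stable output."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["campaign", "event", "count"])
    for (campaign, event), n in sorted(counts.items()):
        writer.writerow([campaign, event, n])
    return out.getvalue()


report = to_csv(aggregate(enrich(extract(RAW_LOG), CAMPAIGN_NAMES)))
print(report)
```

Every stage here is written by hand, which is exactly the kind of imperative plumbing that a tool like Flowman is meant to take over at scale.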
Central to Flowman’s simplified user experience is its modern declarative approach; developers only need to focus on the higher-level details of what needs to be done — the business or transformation logic — rather than the lower-level details of how it is done. Developers simply instruct the application on what targets or jobs need to be accomplished by describing the desired data flow within declarative YAML files (any text editor that supports YAML files is compatible). Flowman takes care of the imperative coding that governs the technical details of how those tasks need to be accomplished, removing the need to write custom Spark code.
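To make the contrast between the "what" and the "how" concrete, here is a minimal Python sketch of the declarative idea. The `FLOW` specification and the `run_flow` engine are hypothetical and do not reflect Flowman's actual YAML schema or its Spark-based implementation; in Flowman the equivalent description would live in YAML files.

```python
# A hypothetical declarative description of a data flow: it states WHAT
# should happen, not HOW. (Illustrative only; this is not Flowman's schema.)
FLOW = {
    "source": "events",
    "steps": [
        {"kind": "filter", "column": "event", "equals": "impression"},
        {"kind": "aggregate", "group_by": "campaign", "sum": "revenue"},
    ],
    "target": "revenue_by_campaign",
}


def apply_filter(rows, step):
    """Keep only rows whose column matches the requested value."""
    return [r for r in rows if r[step["column"]] == step["equals"]]


def apply_aggregate(rows, step):
    """Sum one column per distinct value of the grouping column."""
    totals = {}
    for r in rows:
        key = r[step["group_by"]]
        totals[key] = totals.get(key, 0) + r[step["sum"]]
    return [{step["group_by"]: k, step["sum"]: v} for k, v in sorted(totals.items())]


# The engine owns the imperative details; each step kind maps to a handler.
HANDLERS = {"filter": apply_filter, "aggregate": apply_aggregate}


def run_flow(flow, datasets):
    """Interpret a flow specification against in-memory datasets."""
    rows = datasets[flow["source"]]
    for step in flow["steps"]:
        rows = HANDLERS[step["kind"]](rows, step)
    return {flow["target"]: rows}


datasets = {
    "events": [
        {"campaign": "a", "event": "impression", "revenue": 2},
        {"campaign": "a", "event": "click", "revenue": 0},
        {"campaign": "b", "event": "impression", "revenue": 5},
        {"campaign": "a", "event": "impression", "revenue": 3},
    ]
}
result = run_flow(FLOW, datasets)
print(result)
```

In this sketch, changing the business logic means editing `FLOW` only; the handlers stay untouched, which is the separation of concerns the declarative approach aims for.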
Flowman aligns with smartclip’s open source philosophy
Flowman stands out not just because of its powerful features and great user experience, but also because it is an open source tool. This means that responsibility for the tool’s development, maintenance, and enhancement is not limited to smartclippers alone, but is a collaborative effort involving the entire development community with access to the tool’s source code. With open source code, developers can reap the benefits of a constantly evolving system while enjoying the thrill of solving problems in the code, both complex and minor, on their own.
smartclip follows a ‘no closed source’ software philosophy, which means that we strongly advocate for the use of open source technology throughout the company. This, among other advantages, made the adoption of Flowman at smartclip relatively easy. Within weeks, the entire team was on board with the new tool and pleased with the results it produced.
“At smartclip, we love to dig into tech issues and solve problems on our own. One of the reasons why open source is so important for us is that it gives us the opportunity to dig into the code and fix errors independently.”
René Wagner, Chief Innovation Officer at smartclip
Why our platform data team loves Flowman
Flowman has now been the central element of smartclip’s data pipeline for several years, and its creation, introduction, and ongoing development have brought a wide range of benefits:
- Powerful, fast processing: Flowman is highly effective at integrating large datasets from many sources easily and quickly.
- Robust and sustainable data quality: Unique features such as automated data checks on incoming and outgoing data as well as data quality metrics ensure a consistently high level of data quality.
- Simple and easy to use: The simplicity of Flowman removes all the heavy lifting involved in handling data systems from the shoulders of the developers. By taking care of all the imperative coding as well as the technical details, Flowman enables developers and business experts to focus purely on the higher-level business and transformation logic.
- Established, reliable data export processes: With Flowman, smartclip can establish formal, reliable processes for exporting high-quality data for internal or external use. This provides stability, security, and reassurance for the company and its clients.
- Clearer documentation and management: Flowman can automatically generate comprehensive documentation, including column-level descriptions, of all datasets that the software reads or writes, removing the need to manually maintain separate documentation elsewhere.
- Faster, more flexible development cycles: With the complex coding mechanics out of the equation, smartclip is able to quickly implement data processing jobs and make adjustments for specific projects.
- Better code quality and security: Since Flowman is open source software, it offers complete transparency; thousands of external developers can access the source code and find and fix issues, resulting in better code quality and security.
- Independently developed new features and enhancements: Because Flowman is open source, the community of external developers can also add new features and enhancements to it. Companies other than smartclip and dimajix that use Flowman can likewise contribute independently to its development. This constantly adds value and scope to the tool without smartclip or dimajix incurring additional expense or development time.
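The automated data checks mentioned in the list above can be pictured with a small Python sketch. The check functions, column names, thresholds, and the `run_checks` helper are hypothetical and are not Flowman's API; they only illustrate the idea of validating data and reporting quality metrics before it is exported.

```python
def check_not_null(rows, column):
    """Return the number of rows where the column is missing or None."""
    return sum(1 for r in rows if r.get(column) is None)


def check_in_range(rows, column, low, high):
    """Return the number of rows whose non-null value falls outside [low, high]."""
    return sum(
        1 for r in rows
        if r.get(column) is not None and not (low <= r[column] <= high)
    )


def run_checks(rows):
    """Run all checks and return simple quality metrics."""
    failures = {
        "campaign_not_null": check_not_null(rows, "campaign"),
        "impressions_in_range": check_in_range(rows, "impressions", 0, 1_000_000),
    }
    return {
        "row_count": len(rows),
        "failures": failures,
        "passed": all(n == 0 for n in failures.values()),
    }


# Hypothetical outgoing records, two of which should fail a check
outgoing = [
    {"campaign": "a", "impressions": 120},
    {"campaign": None, "impressions": 40},
    {"campaign": "b", "impressions": -5},
]
metrics = run_checks(outgoing)
print(metrics)
```

A pipeline could refuse to publish an export whenever `passed` is false, and track the failure counts over time as a quality metric.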
“After a smooth transition from our old technology, we immediately noticed the many benefits of using declarative data build software that is specially tailored to our needs, and as the software matures, it continues to exceed all our expectations. Flowman has given us a strong and robust data transformation tool that not only aggregates and transforms our raw data but also exports it for a variety of purposes, giving us the ability to standardise processes for outgoing exports to our clients based on their unique requirements.”
Nadine Wagner, Director Data Products at smartclip Europe
A win-win for smartclip and smartclippers
Collaborating with Kaya to make Flowman available as a fully transparent, open source software application is just one example of how smartclip is committed to empowering its employees to contribute their unique ideas and make a real impact. This single initiative has now opened up new opportunities for data teams across the globe to benefit from a powerful data transformation and integration solution.
smartclip firmly believes that the success of individual team members is a direct reflection of the growth and development of the entire organisation. Our commitment to fostering creativity and innovation remains an integral part of our core values. By empowering our employees and embracing collaboration, we are confident in our ability to continue pushing the boundaries of what is possible and providing cutting-edge solutions that not only meet the evolving needs of our clients but also drive the adtech industry forward.