Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

>> Code as ETL was a huge industry shift

No it’s not. Try and see what banks, retail, manufacturing, various large enterprises still use. They need scale, observability, modularity, and maintainability.



Gui Etl makes a promise that you dont need a programmer to wield it. But it is a false promise.

> They need scale, observability, modularity, and maintainability.

Seems orthogonal to code-vs-gui dimension.


No, unfortunately those factors are all very related.

Once you have GUI ETL tools, in my experience, you can't modularise because the ETL tool makes assumptions about where the boundaries are that are different from what suit the domain in question. Observability falls over because you're now limited to the ETL tool instead of the domain. Scale suffers because now the ETL data model needs to be preserved and high-performance tricks might need the entire tool to be worked around, etc, etc.

Code is the highest-performance environment we have for working with huge complex systems made of if statements and loops. Giving that up to go to a tool doesn't actually yield any advantages; there needs to be an abstraction with huge practical benifits and a DAG isn't it. Modeling a DAG in a true programming language isn't hard enough to justify moving away from an IDE.

An ETL pipeline in practice is still uncomfortably close to a big spaghetti of if-thens and loops, tooling and extra models create patterns that often block a lot of the useful properties you list. The real gains come from not writing a custom scheduler, but splitting out the valuable scheduler from the ETL tool means that you have a scheduler, not an ETL tool. Sometimes there is an ecosystem of adaptors that makes a big difference, but if that doesn't meet your engineering requirements then the tool is useless (because you don't have any real levers to pull on the scale/observability/modularity and maintainability front).


I work in manufacturing (large industrial plant) and the data processes we have are honestly not great - mostly it is because there are a heap of legacy system and not a lot of commonality between our data sources we have a hideous mashup of Oracle, DB2, Microsoft SQL Server etc and different versions of the different databases. There's also more bespoke industry stuff like time series historians and SCADA systems/PLCs (ABB, Citect etc) to complicate the process.

From my experience SQL is basically the lowest common denominator everything speaks and even then the Oracle SQL dialect is subtly different to Microsoft SQL for example - things are subtly different enough it introduces frustrations.

There has been movement in last couple of years to hoist everything into a common "datalake" but my understanding has been that ingestion into this lake is not a simple process by any means and requires batch processes that need demanding compute resources and is slow (i.e. takes many hours and runs over night).


> [some process] is not a simple process by any means and requires batch processes that need demanding compute resources and is slow (i.e. takes many hours and runs over night).

Sounds like an ideal fit for on-prem/co-located systems. The big problem with on-prem is the egress costs from wherever all your data resides.

With on-prem, doubling your hardware doesn't double your ops expenses, so it makes sense, if you already have a server-room, to fill it to capacity.


I have no experience in the manufacturing domain but it fascinates me as a data engineer. I do have experience building data lakes at scale with sub-day (microbatch/“realtime”) latency and with disparate sources. I don’t think this needs to be as complicated or painful as you expect but I don’t know enough about your data or needs to be sure. If you want to discuss specifics send me an email at the domain in my profile, I’d love to know more.


I just started using sqlglot to convert Microsoft SQL Server code to Databricks SQL, and it has been able to automate 80% of the translation (assuming it's just a select statement). You might take a look.

https://github.com/tobymao/sqlglot


Are you trying to consume historical or real-time data? In my experience this greatly influences the approach.

Node-RED is a common ETL approach in the scenario you described, but I find it too limiting beyond basic examples.


You may be interested in semantic web technologies as a means of modelling your different data sources and how they relate.


Completely agree.

I would also add that in my 25 years in the industry I have never actually come across an ETL workflow that was complex enough that it required people working it in code. Those opinions seem to occur before proper analysis happens.


So you have never needed to write SQL in your ETLs? I guess it depends on whether you consider that code, but at least 50% of the time I have to use SQL for one reason or another rather than just pointing the tool at a source and target.


Honestly? If you are making the transformation in SQL you've lost the T to the database server. Does it mean we don't do it? Is water wet?

The reality is that we shouldn't be making the transformation in written data, now you have staging tables, procs, views, resulting tables procs, views, a staging database (often called staging or etl), and an unhappy DBA who yells at you every time you cross-apply incorrectly.

ETL should be done before the data lands in SQL.


I think you're describing ELT, an approach that I agree with and try to implement as often as possible.

I think that's orthogonal to the occasional need to use SQL to extract source data. And it's hard to beat the performance of RDBMS for your transformations. It's almost their whole raison d'être.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: