Cloud Composer orchestration via Cloud Build

Let's imagine a typical engineering scenario. In this article I will describe how an engineering team can manage, develop, and publish DAGs after running a full CI/CD build pipeline using Google Cloud Build.

Google Cloud Composer is a managed Apache Airflow service that helps create, schedule, monitor, and manage workflows. Cloud Composer automation helps you create Airflow environments quickly and use Airflow-native tools, such as the powerful Airflow web interface and command-line tools, so you can focus on your workflows and not your infrastructure.

This question was migrated from Stack Overflow because it can be answered on Server Fault.

We recently started using Cloud Composer for our data engineering pipelines. For even a small environment that autoscales from 1 to 3 workers (in fact using just 1 worker most of the time), it's quite expensive at ~$350/month. We don't currently have many DAGs, and each DAG runs just once daily for around 5 to 10 minutes, so our Cloud Composer environment is actually sitting idle the vast majority of the time. Since you get charged for the environment, which runs 24/7, and not just for the time tasks are running, the value proposition is not great for us at the moment.

Because of this, I am in quite a bit of contention with my teammate. I am very much in favor of using a technology that's future-proof. Despite the unfortunate cost structure at the moment, I still think Airflow/Cloud Composer is the best solution for building and managing data pipelines. Going forward we will certainly have more DAGs, running more frequently, so the value proposition will improve significantly.

However, my teammate just can't get over the fact that Cloud Composer runs 24/7 regardless of whether any task is running, and that we are charged for all those idle minutes. He thinks this is clearly a flaw in Airflow's implementation/design: nodes are running and cost is accruing while nothing is being processed. To him, the dollar cost per minute of task execution is just ridiculous. He contends that we can build a much cheaper and equally capable system ourselves, say using Cloud Scheduler and Cloud Run/Cloud Functions, and taking advantage of the background trigger functionality in Google Cloud's document store (i.e., Firestore), e.g. onCreate(), onUpdate(), etc., to trigger dependencies between tasks. In such a system the fixed cost is much lower, and vCPU/memory cost is only incurred when tasks are actually running.

My view is that reinventing the wheel and building our own data engineering tool, when there's a tried and tested solution for our need, is completely unnecessary, especially since the publicly available solution is proven to be scalable and reliable. But I am really struggling to convince him and counter his arguments. So here are some questions on which I would really appreciate your input:

1. Why is Cloud Composer using GKE under the hood instead of Cloud Run, or some other mechanism that can completely turn off between task executions and hence not run up cost?
2. Is this a flaw in Airflow/Cloud Composer's architecture that people are still willing to pay for because of the convenience and the lack of alternatives? Or was it an intentional design decision, and if so, what advantages does a managed Airflow service such as Cloud Composer offer that justify it?
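As for publishing DAGs through a Cloud Build pipeline, as mentioned in the introduction: Composer picks up DAGs from a GCS bucket attached to the environment, so a minimal `cloudbuild.yaml` only needs to test the DAG files and sync them to that bucket. This is a sketch under assumptions: the bucket name, the `dags/` and `tests/` layout, and the test command are placeholders for your own setup.

```yaml
# Hedged cloudbuild.yaml sketch: validate DAGs, then publish them to
# the Composer environment's DAG bucket. Bucket name and test layout
# are placeholders.
steps:
  # Run DAG import checks / unit tests before publishing.
  - name: 'python:3.11-slim'
    entrypoint: 'bash'
    args: ['-c', 'pip install apache-airflow && python -m pytest tests/']
  # Sync the dags/ folder to the environment's bucket.
  # Note: -d deletes bucket files absent from dags/, so removed DAGs
  # disappear from Composer too.
  - name: 'gcr.io/cloud-builders/gsutil'
    args: ['rsync', '-r', '-d', 'dags/', 'gs://YOUR_COMPOSER_BUCKET/dags/']
```

Triggering this build on merges to the main branch gives the team the review-then-publish workflow the article set out to describe.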
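To make the cost complaint concrete, here is a back-of-the-envelope calculation using the figures from the question (the ~$350/month price and the 5-10 minute runtime are from the question; a single once-daily DAG is an assumed workload for illustration):

```python
# Cost per task-minute of an always-on Composer environment,
# using the question's figures. One daily DAG is an assumption.
monthly_cost = 350.0    # ~$350/month for a small environment
runs_per_month = 30     # one DAG, running once per day
minutes_per_run = 10    # upper end of the 5-10 minute range

task_minutes = runs_per_month * minutes_per_run   # 300 task-minutes/month
cost_per_task_minute = monthly_cost / task_minutes

print(f"~${cost_per_task_minute:.2f} per task-minute")
```

At this utilization the environment costs a bit over a dollar per minute of actual work, which is the teammate's objection in a single number; at ten times the DAG volume the same fixed cost drops to roughly twelve cents per task-minute.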
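The teammate's proposal can be sketched in a few lines. This is a hedged illustration only: the task names, the `NEXT_TASK` dependency table, and the `on_task_complete` handler are all hypothetical, and the event payload merely mimics the shape of a Firestore onCreate document event; a real deployment would wire the handler up as a Cloud Function with a Firestore trigger and do the actual work in Cloud Run or another function.

```python
# Sketch of chaining tasks via Firestore onCreate triggers: each task
# writes a "completion" document when it finishes, and the trigger on
# that write starts the next task. All names here are hypothetical.

def run_task(name):
    """Placeholder for real work done in a Cloud Run job or Cloud Function."""
    return f"{name}: done"

# A tiny linear "DAG" expressed as edges: extract -> transform -> load.
NEXT_TASK = {"extract": "transform", "transform": "load", "load": None}

def on_task_complete(event):
    """Simulates a Firestore onCreate trigger on a completion document.

    `event` imitates the nested value/fields layout of Firestore
    document events; in production this function would be deployed
    with a Firestore document-create trigger.
    """
    finished = event["value"]["fields"]["task"]["stringValue"]
    nxt = NEXT_TASK.get(finished)
    if nxt is None:
        return "pipeline complete"
    # In production, running the task would write a new completion
    # document, which fires the next onCreate trigger.
    return run_task(nxt)
```

The per-invocation billing is the appeal; the catch, which is part of what a managed Airflow environment buys you, is that retries, backfills, and visibility into a failed chain all have to be built by hand on top of this.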