2024-10-06 Personal Sunday Hackathon: DV automation end to end, an idea and ChatGPT
There are a lot of automation tools for Data Vault on the market already. So why wouldn't I create another one? (the sarcasm should be evident here :-) If you can convert a Business Object Model or ERD (which is not the same thing, but for an experiment that will do) to a Data Vault model and fill in the necessary AutomateDV metadata in dbt, then model creation and loading is automated end to end. Expensive tools like WhereScape can do that, as I understand it. The point here: I want to have that without paying too much. The only challenge is the first step, as converting a Business Object Model to a Data Vault model is not trivial. But as we all know, a good developer is a lazy developer. So why wouldn't I use the OpenAI APIs to accomplish just that? It sounds like a lot of fun to make. Surprisingly, the first question to ChatGPT, "Can you convert this simple ERD model to a Data Vault model?", was answered correctly. I have to confess, I talk to ChatGPT daily and spend more time with it than with any human :-). It is like having an intelligent and very experienced developer as a colleague, readily available at any moment. Someone who can give advice on a lot of possible options, sometimes inaccurate, but in the end it is far better than doing it alone. Seriously, my productivity, in the things I really find interesting and in the things I don't, has skyrocketed!
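For the record, a minimal sketch of how that first question could be automated with the OpenAI Python client. Everything here is an assumption for illustration: the model name, the prompt wording and the two-table ERD. The interesting work is in the prompt engineering, not in this plumbing.

```python
# Sketch only: the model name, prompt wording and the toy ERD below are
# illustrative assumptions, not the actual setup.

ERD = """Customer(customer_id PK, name)
Order(order_id PK, customer_id FK, order_date)"""

def build_prompt(erd: str) -> str:
    """Wrap an ERD description in a conversion instruction for the model."""
    return (
        "Convert this ERD to a Data Vault 2.0 model. "
        "List the Hubs, Links and Satellites with their business keys:\n" + erd
    )

def convert_erd(erd: str) -> str:
    """Send the prompt to the OpenAI API (needs the openai package and an API key)."""
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{"role": "user", "content": build_prompt(erd)}],
    )
    return response.choices[0].message.content
```

`convert_erd(ERD)` would then return the model's proposed Hubs, Links and Satellites as text, to be turned into AutomateDV metadata afterwards.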
... a few hours later... no, it is even more exciting! I can do much more with AI, and I have a couple of ideas. I cannot share them yet, though.
General thoughts about AI. Development as we know it, designing, modeling, coding, testing etc., can and will be done completely by AI. That doesn't mean it will replace humans, though, for two reasons. Firstly, you have to have someone to formulate the input and validate the output; I think it is called Prompt Engineering. Secondly, someone has to be accountable. You cannot say: sorry, your business is destroyed because of our software, but AI is to blame! Or: the patient unfortunately passed away because AI misdiagnosed.
2024-09-29 Personal Sunday Hackathon: DBT & AutomateDV package for Data Vault 2.0
From my experience, you have to have some kind of automation framework in place if you are working with a Data Vault; otherwise you will have to deal with a lot of almost identical and hard to maintain code. That's where DBT with the AutomateDV package can be very useful. Although I have worked with more user friendly (and much more expensive!) visual Data Vault generators, for a developer, DBT in combination with the AutomateDV package is an excellent choice. It is a more declarative than imperative way of developing; it brings more structure and, as a consequence, spares a lot of time and effort. AutomateDV supports the Data Vault 2.0 standards: it is an insert-only, full-featured Data Vault modeling and loading implementation tool with macros supporting hash keys and hash diffs, PIT tables, Multi-Active Sats, Bridge tables, Effectivity Sats, Ref tables etc. Besides that, it supports automated testing and documentation, which is, of course, awesome. We all know how those two components are often shifted towards the very end of a sprint or project and are the first thing to be sacrificed in the face of a hard deadline.
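To give a feeling for what the hash key and hash diff macros do under the hood, here is a rough plain-Python sketch of the idea: business keys are trimmed, upcased and concatenated with a delimiter before hashing. This is my own simplification; AutomateDV's actual macros generate SQL and handle more edge cases (NULLs, column ordering, MD5 vs SHA configuration).

```python
import hashlib

def hash_key(*business_keys: str, delimiter: str = "||") -> str:
    """Data Vault style surrogate key: trim, upcase, concatenate, then MD5."""
    normalized = delimiter.join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()

def hash_diff(record: dict, delimiter: str = "||") -> str:
    """Hash over all descriptive attributes in a fixed column order,
    so any changed attribute shows up as a changed hash."""
    values = [str(record[col]).strip().upper() for col in sorted(record)]
    return hashlib.md5(delimiter.join(values).encode("utf-8")).hexdigest().upper()

# The same business key always yields the same surrogate key,
# regardless of stray whitespace or casing in the source:
assert hash_key("cust-001") == hash_key("  CUST-001 ")
```

The hash diff is what makes satellite loading cheap: compare one column instead of every attribute to detect a change.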
2024-09-15 Personal Sunday Hackathon: DBT & Databricks
DBT in the context of Databricks has been coming up quite frequently lately. Being the T in Extract, Load, Transform, this tool makes Databricks more accessible if your Data Engineering team is SQL-minded. With Data Scientists, who are often used to Python and PySpark notebooks for Machine Learning, Databricks becomes an easy choice for companies looking for a unified platform.
Connecting from DBT to a Unity Catalog in Databricks was not straightforward though, as it involved setting inbound firewall rules on the Security Group, which are overruled by a Deny Assignment that Databricks creates following its best practices. Luckily there is Microsoft Support, which has been great so far :-)
2024-09-01 Personal Sunday Hackathon: RESTful APIs in Databricks
Added a couple of Databricks notebooks to ingest CBS data using the OData RESTful APIs. Some useful information about OData and REST (sorry, it is mainly for myself :-))
- REST is an architectural style for exchanging information via the HTTP protocol, while OData builds on top of REST to define best practices for building REST APIs.
- Both REST and OData adhere to web-based architecture, statelessness, data format flexibility, resource-oriented design, and interoperability.
- OData offers enhanced query capabilities and standardized metadata, promoting discoverability and efficient data retrieval.
- REST provides simplicity, flexibility, wide adoption, and platform independence, making it suitable for a broad range of applications.
- Considerations for choosing between OData and REST include complexity, interoperability, and alignment with project requirements.
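As a note to self on what those query capabilities look like in practice: the OData options ($select, $filter, $top) are just query parameters on top of a plain REST GET. The dataset id and column names below are made-up placeholders, and the CBS root URL is written from memory, so verify it against the actual service before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# CBS OData root, from memory -- verify against the CBS documentation.
BASE = "https://opendata.cbs.nl/ODataApi/odata"

def build_odata_url(dataset: str, entity: str, **options: str) -> str:
    """Compose an OData query URL: select, filter, top become $-prefixed parameters."""
    params = {f"${name}": value for name, value in options.items()}
    return f"{BASE}/{dataset}/{entity}?{urlencode(params)}"

def fetch(url: str) -> list:
    """Plain REST GET; OData wraps the result rows in a 'value' array."""
    with urlopen(url) as response:
        return json.load(response)["value"]

# Hypothetical dataset id and columns, for illustration only:
url = build_odata_url("12345NED", "TypedDataSet",
                      select="Perioden,Waarde", filter="Waarde gt 100", top="10")
```

The standardized metadata mentioned above is the other half of the story: every OData service also exposes a `$metadata` document describing its entities, which is what makes the datasets discoverable.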
2024-07-14 Personal Saturday Hackathon: A dive into Fabric... or not...
Finding out how it all works within Fabric was quite time consuming, but in the end it all made sense. Figuring out how the different capacities (Power BI Pro, Premium and Fabric) relate to capabilities and limitations was also a bit of work, but quite interesting. By accident I stumbled upon this article, which confirmed my previous experiences with the MS Cloud data platform.
A few words about the Databricks part. As mentioned before, that's where the data from DeGiro is ingested, the model is trained and predictions are generated. These actions are part of a scheduled Databricks Workflow which runs daily. The functionality includes usage of the streaming APIs, but that is disabled, so for now 3 notebooks are integrated in the Workflow: batch ingest DeGiro data, train the model, predict values and write to ADLS2. The results of all intermediate steps, like the training set, the to-predict set and the output, are written to tables in Delta format.
What is interesting is to see how the predictions compare to the real values. Although that has never been the goal (I was primarily busy with the technologies), it is good to see that it at least shows a trend similar to reality (red line).
I made a dynamic version of the test report. Don't mind the content; functionally it is basically meaningless. The objective here was the technology: I was wondering if it could be done without Power BI Desktop. To my surprise, only CSV, XLS or a published semantic model were available as a data source. 'If you need more, please get Power BI Desktop.' Right... And I didn't want to use Fabric capacity, because it costs more money :-) So these are the steps done:
- Previously I created Databricks notebooks which ingest data, train an ML model and write the result to a mounted ADLS location. Within Synapse Analytics I defined external tables in a SQL database that look at that ADLS location. I was thinking about using the Synapse Lake Database, but then you would need a Spark pool, which I would have to manage.
- I created a Dataflow within Power BI. It doesn't do anything except connect to Synapse Analytics. I thought I would need a pipeline, but luckily none was needed. In that process a semantic model is created. Correction: the semantic model is created when I connect to the ADLS2 location, not to the external tables in Synapse Analytics. There seems to be no way to connect without Power BI Desktop or Analysis Services :-(
- I used that semantic model to create a report and published it to this web site.
2024-06-22 About certifications
With the ever accelerating tempo at which modern technology comes up with new things, it becomes more difficult to determine exactly which tools and technologies one should invest time and money in learning. If you want to achieve a decent level of mastery of a skill, it will cost time (=money). Ten years ago I would get certificates thinking that was a good investment. But now I am not so sure about that any more. For instance, Microsoft Synapse, which came a couple of years ago replacing SQL Data Warehouse, will be gradually but surely phased out in favour of Fabric. You could argue that this is an evolution of the platform whereby the core features and principles stay the same. That is partially true, but is it worth investing time and effort if a significant part of the knowledge you acquire, like concrete version-specific features, won't be relevant in one or two years? My answer is no.
2024-06-22 Personal Saturday Hackathon: Databricks, Fabric, PowerBI
Published a Power BI report to the web, powered by Fabric. Databricks notebooks are used to extract data from DeGiro, train a Machine Learning model and predict the value of the Roblox share (that one is my son's favorite). The result is written to an ADLS mount. From there it is picked up by a Power BI report created in Fabric and refreshed daily. Apart from some connectivity and authorization plumbing, it all went smoothly. Below is not a screenshot; it is a Power BI report :-)
2024-04-19 Databricks LakeHouse, Delta Tables and Medallion architecture
Just like the whole data world is transitioning from traditional data warehousing to Data Lakes and on to Lakehouses, I am moving too. Frankly speaking, otherwise it would simply be boring! So it is always exciting; it gives you energy. On the other hand, you have to learn non-stop and at an ever increasing pace.
I am currently working on an on-premise Data Lake. So the logical step forward is to combine that knowledge with the years of experience with traditional data warehouses and start building a Lakehouse. I am already working on an idea to use Tigor.nl, which I have owned for decades, as a platform for it.
2024-03-23 Saturday Hackathon: streaming data, Delta Tables and Machine Learning in Databricks notebooks found from 2 years ago.
A couple of Saturdays were dedicated to getting a streaming system going using the Structured Streaming APIs with FlightRadar24 data in Databricks. In the process I found notebooks I created two years ago. Actually, I was surprised by the things I could do :-). Using degiroapi I got data from the DeGiro site, the price of the Roblox stock, and using PySpark and a Random Forest model I predicted the stock price. And it still works! Although I had to hack degiroapi a bit.
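The notebooks themselves do this with PySpark and a Random Forest, but the core feature engineering trick is simple enough to sketch in plain Python: slice the price series into sliding windows of the last n closes, with the next close as the label. The window size and the toy prices are arbitrary here.

```python
def make_windows(prices: list[float], n: int = 3) -> tuple[list[list[float]], list[float]]:
    """Sliding windows: features are the last n prices, the label is the next one."""
    features, labels = [], []
    for i in range(len(prices) - n):
        features.append(prices[i:i + n])
        labels.append(prices[i + n])
    return features, labels

# A toy series standing in for Roblox closing prices:
closes = [40.0, 41.5, 39.8, 42.1, 43.0]
X, y = make_windows(closes, n=3)
# X[0] == [40.0, 41.5, 39.8] with label y[0] == 42.1; X and y would then
# feed a Random Forest regressor (RandomForestRegressor in Spark ML or scikit-learn).
```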
Oh yeah, and I bought Microsoft Support for my subscription. My first experience was actually great. I was immediately helped by 2 support people, and afterwards I was interviewed by their manager, who asked if I was satisfied. Yes, I was!
2024-03-10 Databricks and DBT
Lately I have been looking at Databricks again. With my recent experience with PySpark, I want to use it in a modern, scalable and easy to use framework. To my surprise, Databricks has a SQL warehouse and integration with DBT. Basically they want to make it easy for people who know SQL (which is basically everyone) to use Databricks, which totally makes sense. The only funny thing about it is that you don't need PySpark at all, at least if you are a Data Engineer. Okay. But I want something other than SQL. Otherwise, it is just like Snowflake, which is great by the way.
But I want to play with PySpark :-). Somehow, being busy with non-low-code development in Visual Studio Code and Azure Data Studio makes me feel like a developer again. And that's a good feeling. With my recent interactions with ChatGPT it is also a lot of fun. I guess you have to be a Data Scientist to use Python and PySpark. That's where Databricks is superior to Snowflake and Synapse. So be it.
I am going to go ahead and devote my Saturday Hackathon (probably more than one) to making a system which takes streaming data and predicts something.
To be continued...
2023-04-06 What if ?
(unsolicited advice of an old man to those who decide)
What if I were a Head of BI and had to choose an architecture, BI platform and tooling? Green field, so to speak. What would be the winning combination?
Having worked with a number of Dutch companies. Having hands-on experience with software and technologies like SAS, Microsoft Azure, Snowflake, Oracle, Hadoop, Spark, Informatica etc. Having done that on-prem and in the Cloud. Having built data warehouses on relational databases and Data Lakes. Having modeled with Data Vault and without it.
It's not an easy choice to make!
Goal
If I have to sum up what I (as a hypothetical Head of BI) would want to achieve:
- Maximum output: you want to have tangible results as quickly as possible
- Minimum cost: that includes the time (=money) to learn new skills, and to actually develop and deliver products
- Maintainability: you can replace people easily and you don't have to keep them just because they are the only ones who know how one app works
- Continuity: make sure that whatever you choose and buy, you don't have to reconsider within one year
Challenges
Why is that so difficult?
- There is no Silver Bullet. There is no single solution for everyone. But there is probably, objectively speaking, one optimal choice for your specific situation.
- Datawarehouse, Data Lake or something else? That's an important one. I will talk about it in a separate chapter.
- The number of tools is growing. New vendors come up continuously; little real experience with those tools is available, but they promise wonders.
- Tools are evolving. What seemed like an optimal choice yesterday may not be so obvious tomorrow.
- Information about those tools is sometimes contradictory and depends on where it comes from. Benchmarks are often biased.
Basic principles
Of course, a decision of this magnitude has to be based on a clear vision and principles and, if possible, on real-life facts and experience.
- Hire and keep a good architect :-) Errors in design are very expensive to correct once the system is already built.
- Keep it simple when it comes to structuring and modeling your Datawarehouse or Data Lake. Do you understand what a Data Vault is and what problem it solves, and do you have those problems at all?
- Keep it simple when organizing the development process. Having been a developer myself for 20+ years, I love making things. I love developing, low code or not, making code generators that make code. It is all good until someone else has to take over. For them it's often much less fun. Besides, your company is suddenly confronted with the fact that it has created critical software which it depends on and has to maintain, and which is not its primary business at all. So avoid it at all cost!
- Make or buy: definitely, if it's available, BUY! See the previous point.
- No one has canceled the basic principles if you do have to develop:
- Enforce strict rules and guidelines for development. It does not matter if you write code or use low code. Otherwise you will get the well-known spaghetti, and that can be fatal.
- Maintain reusable code in one place. Do not copy code.
- SQL is the only skill that is here to stay. Organize your development process around it and use it as much as possible, for instance SparkSQL. (Correction: that one I have personally abandoned since. PySpark is too much fun and power!)
- Open Source is only good when you pay for it, and you will, one way or another.
- You don't want to depend on a group of technical people who love what they are doing (like myself) but who are the only ones who know how things work. So go for a managed cloud service whenever possible. Yes, the Cloud is not cheap, but in the end it is cheaper! Yes, go Cloud.
to be continued...
2023-04-06 Unfortunately, Betfair is no longer available in NL, putting my MMA betting project on hold indefinitely. Instead, I am working on an automatic prediction and ordering system based on DeGiro. This has been done by an endless number of people before. What I am aiming for is to use Azure managed cloud services to the fullest extent. I wonder what that means in terms of development effort, maintainability and cost.
- Getting historical training data - Azure Function (Python). The main argument to use this technology, besides the obvious learning effect, was of course the serverless nature of Azure Functions. As the whole point was to leverage the power of Cloud managed services, this is a logical choice. My observations so far:
- If you use Python, you can't edit in the browser. Developing locally in Visual Studio Code is an excellent alternative though.
- Making things work after deployment, like importing modules, was quite painful even after reading numerous docs and blogs. I would like it to be much easier to handle. It all worked in the end, but it cost too much time. I want to get a result, and I don't care what path I have to set or what config I have to fill in. It all should be hidden and just work!
- In the case of degiroapi I had to alter the request method (otherwise the site thinks it's a bot and refuses the connection). Here too it felt like time wasted, if it wasn't for the learning effect of it.
- Raw data layer - we store the collected data in Azure Cosmos DB, JSON files as documents; later I will probably transition to a pure Data Lake solution.
- Permanent storage and historization of transformed data will go to Delta Lake - exciting to use the time travel and ACID features.
- Data preparation - Synapse, PySpark notebooks; in my opinion this can be used as an alternative to Databricks (for the purpose in mind) and Azure Data Factory, whose functionality is integrated in Synapse. Dedicated pools I won't use because of the cost. Mainly, it's the combination of files in the Data Lake as storage and Spark pools for compute.
- Training a model and generating an inference model - Azure Automated ML, as low code as possible (later maybe Databricks and the Azure ML SDK)
- Getting instances to predict - Azure Function (Python)
- Ad hoc analysis - Azure Synapse Serverless pool
- Visualization - PowerBI, integrated with Synapse
- Versioning and deployment - Azure DevOps
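The degiroapi alteration mentioned above came down to making the request look less like a bot. A sketch of the idea in plain urllib; the header values are illustrative, not the exact ones I patched into the library.

```python
import urllib.request

# Headers that make an automated request look like a regular browser.
# The exact values are illustrative, not the ones degiroapi ends up sending.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json",
}

def browser_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

# Hypothetical endpoint, for illustration only:
req = browser_request("https://example.com/api/quote")
```

Whether a site accepts this depends entirely on its bot detection; for DeGiro it was enough at the time.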
2022-01-02 An idea of how the Cloud version of my application is going to look.
- WebScraping of training data - Azure Function (Python)
- Store collected data - Azure Cosmos DB
- Data preparation - Databricks and Azure Data Factory
- Training a model and generating an inference model - Azure Automated ML, as low code as possible (later maybe Databricks and the Azure ML SDK)
- Getting instances to predict - Azure Function (Python)
- Ad hoc analysis - Azure Synapse Serverless pool
- Visualization - PowerBI
- Versioning and deployment - Azure DevOps
2022-01-01 A good day to begin something new! So I am going to write down here whatever I think is worth remembering. In the free time between two jobs I did what I find more entertaining than watching Netflix or even playing chess: I am busy converting my MMA classifier to Azure. The first step is to use serverless Azure Functions for web scraping instead of locally running the Python code. I use Visual Studio Code for development, an Azure DevOps repo for versioning and an Azure Pipeline for deployment. After quite a number of attempts, mainly struggling with installing Python dependencies at deployment, it is finally working. What a joy! The Azure Function App scrapes Sherdog.com for UFC events and loads the data into a container in Cosmos DB. Why the heck there? Why not? I must say I find loading JSON into a Cosmos DB container without worrying about defining the schema first somewhat liberating. Schema on Read is the way to go. Or maybe I am just tired of thinking before I can load data into a database like SQL Server. "Just do it!"
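The schema-on-read part is pleasantly simple: whatever dict the scraper produces can go in as a document, as long as it carries a string id. A sketch of the idea; the field names and the database and container names are made up for illustration, and the upsert itself needs the azure-cosmos package and real credentials.

```python
import hashlib

def event_to_document(event: dict) -> dict:
    """Turn a scraped UFC event into a Cosmos DB document.

    Cosmos DB only insists on a string 'id'; the rest of the JSON is
    schema-on-read. The field names here are illustrative, not the real
    scraper output.
    """
    doc = dict(event)
    # A stable id derived from name + date means re-runs upsert instead of duplicating.
    key = f"{event['name']}|{event['date']}"
    doc["id"] = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return doc

def upload(doc: dict) -> None:
    """Upsert into Cosmos DB (hypothetical account, database and container names)."""
    from azure.cosmos import CosmosClient  # lazy import: needs the azure-cosmos package

    client = CosmosClient("https://<account>.documents.azure.com", "<account-key>")
    container = client.get_database_client("ufc").get_container_client("events")
    container.upsert_item(doc)

doc = event_to_document({"name": "Some UFC event", "date": "2021-12-11",
                         "fights": [{"winner": "Fighter A", "loser": "Fighter B"}]})
```

Nested arrays like the fight list go in as-is; no schema, no ALTER TABLE, just documents.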