Welcome to the Data Engineering category of DZone, where you will find all the information you need for AI/ML, big data, data, databases, and IoT. As you determine the first steps for new systems or reevaluate existing ones, you're going to require tools and resources to gather, store, and analyze data. The Zones within our Data Engineering category contain resources that will help you expertly navigate through the SDLC Analysis stage.
Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability for a computer system to mimic human intelligence through math and logic, and ML builds off AI by developing methods that "learn" through experience and do not require instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Big data comprises datasets that are massive, varied, and complex, and that can't be handled with traditional tools. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.
Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. As databases are designed to enable easy access to data, our resources are compiled here for smooth browsing of everything you need to know from database management systems to database languages.
IoT, or the Internet of Things, is a technological field that makes it possible for users to connect devices and systems and exchange data over the internet. Through DZone's IoT resources, you'll learn about smart devices, sensors, networks, edge computing, and many other technologies — including those that are now part of the average person's daily life.
Enterprise AI
In recent years, artificial intelligence has become less of a buzzword and more of an adopted process across the enterprise. With that comes a growing need to increase operational efficiency as customer demands rise. AI platforms have become increasingly sophisticated, creating the need to establish guidelines and ownership. In DZone's 2022 Enterprise AI Trend Report, we explore MLOps, explainability, and how to select the best AI platform for your business. We also share a tutorial on how to create a machine learning service using Spring Boot and how to deploy AI with an event-driven platform. The goal of this Trend Report is to better inform the developer audience about practical tools and design paradigms, new technologies, and the overall operational impact of AI within the business. This is a technology space that's constantly shifting and evolving. As part of our December 2022 re-launch, we've added new articles pertaining to knowledge graphs, a solutions directory for popular AI tools, and more.
In January 2021, I wrote that Event-Driven Architecture (EDA) should be "Always Asynchronous." At precisely the same time, I argued that "the concept that we are actually trying to explain when we use the term 'Asynchronous' — within the context of EDA — might be better termed: 'Monologueous Communication'" (where no response is ever required). Nonetheless, 'Asynchronous' remains the dominant term in use today, and it was never certain that 'Monologueous Communication' was a good replacement. As the months have passed, however, I see more and more clearly how using the term 'Asynchronous' represents a major obstacle to our shared comprehension of EDA.

Let me give you an almost unbelievable example that I witnessed from the CTO of the global computing giant AWS at its recent re:Invent Summit in December 2022. It goes without saying that the CTO of AWS is as well placed as anyone in the world to explain to an ignorant public what 'Asynchronous' actually means, for which reason I find this example so incredibly useful (please do not take it as a pointless exercise in AWS-bashing).

Many of those attending Dr. Werner VOGELS' keynote in Las Vegas were quickly mesmerized by computer-generated images of a starling 'murmuration.' The long-time CTO of AWS went on to inform us that "sometimes the world looks synchronous, and one of those examples is bird murmurations… And it looks like a centralized controller tells all these starlings where to go and what to do [– but there isn't]. Our ancestors actually thought that there was telepathy in the birds… [However,] This is not a singular system because it's nature, and nature is asynchronous. So all the actions that the birds take are based on local observations… So what you see here is … purely driven by local decision-making. The bird is completely autonomous, even though the overall system looks like it's working in synchrony. But it's not. This is a purely asynchronous system. The world is asynchronous."

As someone who has been writing about the fundamentals of Event-Driven Architecture (EDA) for several years now, I was astounded to hear from Dr. VOGELS that "There is absolutely nothing that is synchronous in this world." Astounded by the calamity that was unfolding before my eyes, acted out by a company with almost unlimited resources.

There are many reasons to call the pronouncements mentioned above a calamity. Still, I think it's best to start with the actual dictionary definition of the term 'Asynchronous': "not happening … at the same time or speed" (Cambridge). To describe the individual movements of starlings in a single murmuration as "not happening … at the same time or speed" is about as large an error as one could possibly ever imagine hearing, but this was far from the only problem. Even if ample leeway is provided for the fact that modern usage of the term 'Asynchronous' often departs from its dictionary definition, there remain several other reasons to consider the pronouncements made by Dr. VOGELS newsworthy.

We were also told in the CTO's keynote that "if you look at an asynchronous system, that actually is driven by an event broker… There is no tight coupling between any of the components." I found this statement a little problematic, given that I could not see any magical event broker floating in the sky between the birds. Likewise, given that Dr. VOGELS had earlier stated that "it looks like there's a centralized controller that tells all these starlings where to go and what to do" – but that there probably isn't – it would appear that he was equally unable to perceive the flying event broker.

The reality is, of course, that there is no "asynchronous system … driven by an event broker" in the example of starling murmurations, any more than there is "telepathy." Instead, as Dr. VOGELS rightly notes, "all the actions that the birds take are based on local observations"; their own observations, not those of an omniscient event broker. We are even shown an image of an individual bird, with a number of lines-of-sight 'connecting' it to those birds closest to it: it has no concept whatsoever of what the individual birds/components on the far side of the flock/system are doing, despite the fact that a capacity to subscribe to every system component's published events is a fundamental characteristic of any 'Decoupled,' Event-Driven Architecture.

What we also notice in the dazzling visualization provided by AWS is that the displayed lines-of-sight emanating from the central example starling very obviously represent a 'point-to-point' architecture: one of the most obvious tells of Synchronous communication there is.

These last three paragraphs alone should be more than enough to make the case that a starling murmuration is very obviously 'Synchronous' – i.e., "happening or done at the same time or speed" (Cambridge) – but to this, I would like to add a very simple rebuttal of my own to the claim that "nature is asynchronous": nature is not stupid. Nature would never impose the unavoidable latency hit of 'Decoupled' (Asynchronous) communication upon its systems if that latency could not be supported. I'm willing to bet that a delay of even a few extra microseconds between a starling moving off to the side and its neighbor doing precisely the same would be more than enough to see this particular natural system crash to the ground in its entirety.

One thing that Dr. VOGELS would hopefully be willing to concede, despite his earlier assertions, is that the movement between the individual starlings in a murmuration is absolutely NOT 'Decoupled.' This is precisely why we need to stop describing Events as 'Asynchronous' (despite the fact that they are almost always triggered via Synchronous API calls) and instead need to describe Events – and EDA more generally – as 'Decoupled.'
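To make the distinction concrete in code, here is a minimal sketch in plain Python (illustrative only; no particular broker or framework is implied). In point-to-point communication, the caller knows and names its receiver, much like a starling reacting to a specific neighbor in its line of sight; with a decoupled event, the publisher has no idea who, if anyone, will consume it, or when.

Python

# Point-to-point: the caller knows and names its receiver (coupled),
# like a starling reacting directly to one specific neighbor.
def neighbor_turns_left():
    print("neighbor turns left")

def starling_turns_left():
    neighbor_turns_left()  # a direct call to one known component

# Decoupled: the publisher only emits an event; any number of
# subscribers (including none) may react, whenever they get to it.
subscribers = []

def subscribe(handler):
    subscribers.append(handler)

def publish(event):
    for handler in subscribers:
        handler(event)

subscribe(lambda event: print(f"some unknown component reacts to: {event}"))
publish("starling_turned_left")
starling_turns_left()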
Generative AI is transforming software development by automating tasks, enhancing collaboration, and accelerating innovation. This cutting-edge technology is poised to augment various software roles, creating diverse perspectives and opportunities for co-innovation. In this article, I will delve into the future of Generative AI in software development, discuss the ethical considerations, and summarize the potential impact on the industry.

Developers: AI-Powered Code Generation and Collaboration
Generative AI will enable developers to focus on more complex, creative tasks by automating mundane coding activities. AI-powered code generation will help developers solve intricate problems more efficiently and accurately. In addition, Generative AI will enhance collaboration among team members by suggesting code snippets or assisting with debugging, making it easier for developers to work together on large-scale projects. While AI-generated code promises increased productivity, developers must remain vigilant in reviewing and verifying its quality, ensuring adherence to best practices, and addressing potential biases or security vulnerabilities.

QA Engineers: Intelligent Test Case Generation and Failure Prediction
Quality Assurance Engineers will witness a significant shift in their role with the advent of Generative AI. AI-generated test cases, edge scenario identification, and failure prediction will allow QA engineers to focus on improving software quality, reliability, and security. The integration of Generative AI into QA processes will make testing more comprehensive and efficient, reducing human error and enhancing the overall user experience. QA engineers must ensure fairness in the AI-generated test results, mitigate biases, and maintain the integrity of the software.

UI/UX Designers: AI-Enhanced Creativity and Inclusivity
Generative AI will play a crucial role in augmenting UI/UX designers' creativity by providing design suggestions, generating UI components, and recommending user flows. This technology will enable designers to create more intuitive, visually appealing interfaces that cater to the needs and preferences of diverse user groups. AI-generated design elements can help designers experiment with various styles and layouts, fostering a more inclusive and accessible user experience. It's essential for designers to maintain a human-centric approach, address potential biases, and prioritize user well-being.

Technical Writers: Streamlined Documentation and Code Examples
Generative AI will simplify the lives of technical writers by assisting in drafting documentation, creating code examples, and keeping information up-to-date. With AI-generated content, technical writers can produce clear, concise, and comprehensive materials more efficiently, ensuring that both team members and users have access to accurate, relevant information. Technical writers must remain accountable for the content's quality, respect user privacy, and protect sensitive information.

Project Managers: Data-Driven Decision-Making and Planning
AI-generated insights for resource allocation, risk assessment, and project planning will enable project managers to make better data-driven decisions, keeping projects on track and under budget. Generative AI can help project managers monitor progress and adjust plans in real time, considering various factors like team dynamics, changing priorities, and unforeseen challenges. However, project managers should remain responsible for the final decisions, ensuring AI-generated insights align with ethical principles and account for human factors.

DevOps Engineers: Streamlined CI/CD Pipelines and Performance Monitoring
Generative AI will streamline CI/CD pipelines, monitor system performance, and automate deployments for DevOps engineers. AI-generated optimizations will help DevOps engineers identify bottlenecks, proactively address potential issues, and maintain system stability. DevOps engineers must implement robust security measures in AI-augmented pipelines and ensure that AI-generated solutions adhere to best practices and organizational standards.

Architects: Optimal System Design and Scalability
Generative AI will provide architects with insights for optimal system design, technology selection, and scalability. AI-generated architectural recommendations will help architects make informed decisions, ensuring that systems are robust, flexible, and scalable to meet future demands. Architects should consider the long-term implications of AI-generated suggestions and choose AI solutions that uphold ethical standards and align with organizational values.

Opportunities for Co-Innovation
Generative AI promises to unlock numerous co-innovation opportunities across the software development landscape. By augmenting human intelligence and creativity, Generative AI can facilitate the exploration of new ideas, techniques, and approaches that were previously unattainable or time-consuming. Collaboration between AI systems and human experts can lead to the development of groundbreaking solutions, enabling organizations to stay ahead of the competition and drive industry transformation.

Ethical Considerations
As Generative AI continues to permeate software development, ethical considerations become increasingly important. Ensuring transparency, explainability, fairness, and accountability is vital in fostering trust, creating equitable solutions, and promoting responsible AI adoption. Software professionals must be aware of potential biases, privacy concerns, and other ethical issues that may arise when integrating AI into their work and proactively address them.

Summary
The future of software development will see Generative AI augmenting various roles, streamlining processes, enhancing collaboration, and unlocking new avenues for co-innovation. As AI technology continues to advance, software professionals must adapt to these changes and embrace the opportunities they offer. By integrating Generative AI responsibly and upholding ethical principles, the software industry can harness the full potential of this transformative technology to elevate the entire development ecosystem and create a more sustainable, efficient, and innovative future.
Recommender systems have become an integral part of our daily lives, powering the personalized recommendations that we receive on social media, e-commerce platforms, and streaming services. These systems are designed to make our lives easier by suggesting products, services, and content that are relevant to our interests and preferences. However, as powerful as these systems are, they are not perfect, and there are concerns about their fairness, especially in terms of how they impact marginalized groups. In this article, we will explore the concept of fairness in recommender systems, the challenges involved in achieving fairness, and the approaches that have been proposed to address these challenges.

What Is Fairness in Recommender Systems?
Fairness is a complex concept that can be defined in many ways, depending on the context. In the case of recommender systems, fairness refers to the degree to which the recommendations generated by the system are unbiased and do not systematically favor or discriminate against certain groups of users. Fairness can be evaluated from different perspectives, including individual fairness, group fairness, and algorithmic fairness. Individual fairness refers to the idea that similar users should receive similar recommendations, while group fairness requires that the system's recommendations are equally distributed across different groups of users, regardless of their demographic characteristics. Algorithmic fairness, on the other hand, is concerned with ensuring that the underlying algorithms and data used to make recommendations do not perpetuate biases or discrimination.

Challenges in Achieving Fairness in Recommender Systems
Achieving fairness in recommender systems is not a trivial task, as there are several challenges that must be addressed. Some of these challenges include:

Data Biases
Recommender systems are trained on historical user data, which can contain biases and stereotypes. These biases can lead to recommendations that are unfair and discriminatory. For example, if a recommender system recommends mostly popular items, it may reinforce the status quo and perpetuate existing inequalities. To address this challenge, data preprocessing techniques can be used to remove or mitigate the effects of biases. Oversampling underrepresented groups, reweighting the data, or using techniques such as adversarial debiasing can help balance the data and reduce the impact of biases.

Lack of Diversity
Recommender systems can suffer from a lack of diversity, as they may recommend similar items to users with similar tastes, which can create filter bubbles and limit users' exposure to new and diverse content. To address this challenge, various techniques can be used to promote diversity, such as incorporating diversity metrics into the recommendation process or providing users with serendipitous recommendations that introduce them to new content.

Cold Start Problem
Recommender systems may struggle to provide personalized recommendations to new users who have little to no historical data, which can put them at a disadvantage compared to users with established profiles. One way to address this challenge is to use content-based recommendations that leverage the features of items to make recommendations, rather than relying solely on historical user data.

Privacy Concerns
Recommender systems require access to users' personal data to make recommendations, which can raise privacy concerns and undermine user trust in the system. To address this challenge, privacy-preserving techniques such as differential privacy can be used to protect users' data while still providing accurate recommendations.

Approaches To Achieving Fairness in Recommender Systems
Despite these challenges, several approaches have been proposed to achieve fairness in recommender systems. Some of these approaches include:

Algorithmic Modifications
One approach to achieving fairness in recommender systems is to modify the algorithms used by the system to ensure fairness. For example, one could modify the objective function to explicitly include fairness constraints or incorporate diversity metrics into the recommendation process (a minimal sketch of the latter idea appears at the end of this article).

User Feedback
User feedback can be used to improve the fairness of the system by allowing users to provide explicit feedback on the recommendations they receive. This can help the system learn from its mistakes and improve its recommendations over time.

Transparency and Accountability
Another way to promote fairness in recommender systems is to increase transparency and accountability. This can be done by providing users with more information about how the system works, including the algorithms used and the data sources, and allowing users to opt out of certain types of recommendations.

Hybrid Recommendations
A hybrid approach that combines multiple recommendation techniques, such as collaborative filtering and content-based recommendations, can be used to provide a more diverse set of recommendations that are less likely to be biased.

Conclusion
Recommender systems have the potential to provide personalized and relevant recommendations to users, but they also raise concerns about fairness and discrimination. Achieving fairness in recommender systems is a complex and ongoing challenge that requires a multi-disciplinary approach.
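Following up on the 'Algorithmic Modifications' approach above, here is a minimal, illustrative sketch of greedy diversity-aware re-ranking. Everything here (the function names, the toy scores, the 0.5 weight) is invented for the example; a production system would use learned relevance scores and item embeddings.

Python

# Greedy, MMR-style re-ranking: trade relevance off against similarity
# to items already selected, so the final list is more diverse.
def rerank(candidates, relevance, similarity, k, diversity_weight=0.5):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def adjusted(item):
            # Penalize items that look like something we already picked.
            penalty = max((similarity(item, s) for s in selected), default=0.0)
            return relevance[item] - diversity_weight * penalty
        best = max(pool, key=adjusted)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy usage: 'b' is nearly as relevant as 'a' but very similar to it,
# so the diversity penalty lets 'c' into the final list instead.
items = ["a", "b", "c"]
relevance = {"a": 0.9, "b": 0.85, "c": 0.4}
similarity = lambda x, y: 1.0 if {x, y} == {"a", "b"} else 0.0
print(rerank(items, relevance, similarity, k=2))  # ['a', 'c']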
Building a cluster of single-board mini-computers is an excellent way to explore and learn about distributed computing. With the scarcity of Raspberry Pi boards, and prices starting to get prohibitive for some projects, alternatives such as Orange Pi have gained popularity. In this article, I'll show you how to build a (surprisingly cheap) 4-node cluster packed with 16 cores and 4GB RAM to deploy a MariaDB replicated topology that includes three database servers and a database proxy, all running on a Docker Swarm cluster and automated with Ansible. This article was inspired by a member of the audience who asked my opinion about Orange Pi during a talk I gave in Colombia. I hope this completes the answer I gave you.

What Is a Cluster?
A cluster is a group of computers that work together to achieve a common goal. In the context of distributed computing, a cluster typically refers to a group of computers that are connected to each other and work together to perform computation tasks. Building a cluster allows you to harness the power of multiple computers to solve problems that a single computer cannot handle. For example, a database can be replicated on multiple nodes to achieve high availability—if one node fails, other nodes can take over. A cluster can also be used to implement read/write splitting, making one node handle writes and another handle reads, in order to achieve horizontal scalability.

What Is Orange Pi Zero2?
The Orange Pi Zero2 is a small single-board computer that runs on the ARM Cortex-A53 quad-core processor. It has 512MB or 1GB of DDR3 RAM, 100Mbps Ethernet, Wi-Fi, and Bluetooth connectivity. The Orange Pi Zero2 is an excellent choice for building a cluster due to its low cost, small size, and good performance. The only downside I found was that the Wi-Fi connection didn't seem to perform as well as with other single-board computers. From time to time, the boards disconnect from the network, so I had to place them close to a Wi-Fi repeater. This could be a problem with my setup or with the boards; I'm not entirely sure. Having said that, this is not a production environment, so it worked pretty well for my purposes.

What You Need
Here are the ingredients:
- Orange Pi Zero2: I recommend the 1GB RAM variant, and try to get at least 4 of them. I recently bought 4 of them for €30 each. Not bad at all! Give it a try!
- MicroSD cards: One per board. Try to use fast ones — it will make quite a difference in performance! I recommend at least 16GB. For reference, I used SanDisk Extreme Pro microSDXC with 32GB, which offers a write speed of 90 MB/s and reads at 170 MB/s.
- A USB power hub: To power the devices, I recommend a dedicated USB power supply. You could also just use individual chargers, but the setup will be messier and require a power strip with as many outlets as you have devices. It's better to use a USB multi-port power supply. I used an Anker PowerPort 6, but there are good and cheaper alternatives; you'll have to Google this. Check that each port can supply 5V and at least 2.4A.
- USB cables: Each board needs to be powered via a USB-C port. You need a cable with one end of type USB-C and the other of the type your power hub accepts.
- Bolts and nuts: To stack up the boards.
- Heat sinks (optional): These boards can get hot. I recommend getting heat sinks to help with heat dissipation.
Materials needed for building an Orange Pi Zero2 cluster

Assembling the Cluster
One of the fun parts of building this cluster is the physical assembly of the boards on a case or some kind of structure that makes them look like a single manageable unit. Since my objective here is to keep the budget as low as possible, I used cheap bolts and nuts to stack the boards one on top of the other. I didn't find any ready-to-use cluster cases for the Orange Pi Zero2. One alternative is to 3D-print your own case. When stacking the boards together, keep an eye on the antenna placement. Avoid crushing the cable, especially if you installed heat sinks.

An assembled Orange Pi Zero2 cluster with 4 nodes

Installing the Operating System
The second step is to install the operating system on each microSD card. I used Armbian bullseye legacy 4.9.318. Download the file and use a tool like balenaEtcher to make bootable microSD cards. Download and install this tool on your computer. Select the Armbian image file and the drive that corresponds to the microSD card. Flash the image and repeat the process for each microSD card.

Configuring the Orange Pi Wi-Fi Connection (Headless)
To configure the Wi-Fi connection, Armbian includes the /boot/armbian_first_run.txt.template file, which allows you to configure the operating system when it runs for the first time. The template includes instructions, so it's worth checking. You have to rename this file to armbian_first_run.txt. Here's what I used:

Plain Text

FR_general_delete_this_file_after_completion=1
FR_net_change_defaults=1
FR_net_ethernet_enabled=0
FR_net_wifi_enabled=1
FR_net_wifi_ssid='my_connection_id'
FR_net_wifi_key='my_password'
FR_net_wifi_countrycode='FI'
FR_net_use_static=1
FR_net_static_gateway='192.168.1.1'
FR_net_static_mask='255.255.255.0'
FR_net_static_dns='192.168.1.1 8.8.8.8'
FR_net_static_ip='192.168.1.181'

Use your own Wi-Fi details, including connection name, password, country code, gateway, mask, and DNS. I wasn't able to read the SD card from macOS; I had to use another laptop with Linux on it to make the changes to the configuration file on each SD card. To mount the SD card on Linux, run the following command before and after inserting the SD card and see what changes:

Shell

sudo fdisk -l

I created a Bash script to automate the process. The script accepts the IP address to set as a parameter. For example:

Shell

sudo ./armbian-setup.sh 192.168.1.181

I ran this command on each of the four SD cards, changing the IP address from 192.168.1.181 to 192.168.1.184.

Connecting Through SSH
Insert the flashed and configured microSD cards into each board and turn the power supply on. Be patient! Give the small devices time to boot. It can take several minutes the first time you boot them.

An Orange Pi cluster running Armbian

Use the ping command to check whether the devices are ready and connected to the network:

Shell

ping 192.168.1.181

Once they respond, connect to the mini-computers through SSH using the root user and the IP address that you configured. For example:

Shell

ssh root@192.168.1.181

The default password is:

Plain Text

1234

You'll be presented with a wizard-like tool to complete the installation. Follow the steps to finish the configuration and repeat the process for each board.

Installing Ansible
Imagine you want to update the operating system on each machine. You'd have to log into a machine, run the update command, and end the remote session, then repeat that for each machine in the cluster. A tedious job even if you have only 4 nodes.
Ansible is an automation tool that allows you to run a command on multiple machines using a single call. You can also create a playbook, a file that contains commands to be executed on a set of machines defined in an inventory (see the minimal sketch later in this section). Install Ansible on your working computer and generate a configuration file:

Shell

sudo su
ansible-config init --disabled -t all > /etc/ansible/ansible.cfg
exit

In the /etc/ansible/ansible.cfg file, set the following properties (enable them by removing the semicolon):

Plain Text

host_key_checking=False
become_allow_same_user=True
ask_pass=True

This will make the whole process easier. Never do this in a production environment! You also need an inventory. Edit the /etc/ansible/hosts file and add the Orange Pi nodes as follows:

Plain Text

##############################################################################
# 4-node Orange Pi Zero 2 cluster
##############################################################################

[opiesz]
192.168.1.181 ansible_user=orangepi hostname=opiz01
192.168.1.182 ansible_user=orangepi hostname=opiz02
192.168.1.183 ansible_user=orangepi hostname=opiz03
192.168.1.184 ansible_user=orangepi hostname=opiz04

[opiesz_manager]
opiz01.local ansible_user=orangepi

[opiesz_workers]
opiz[02:04].local ansible_user=orangepi

In the ansible_user variable, specify the username that you created during the installation of Armbian. Also, change the IP addresses if you used something different.

Setting up a Cluster With Ansible Playbooks
A key feature of a computer cluster is that the nodes should be logically interconnected in some way. Docker Swarm is a container orchestration tool that will convert your arrangement of Orange Pi devices into a real cluster. You can later deploy any kind of server software, and Docker Swarm will automatically pick one of the machines to host it. To make the process easier, I have created a set of Ansible playbooks to further configure the boards, update the packages, reboot or power off the machines, install Docker, set up Docker Swarm, and even install a MariaDB database with replication and a database proxy. Clone or download this GitHub repository:

Shell

git clone https://github.com/alejandro-du/orange-pi-zero-cluster-ansible-playbooks.git

Let's start by upgrading the Linux packages on all the boards:

Shell

ansible-playbook upgrade.yml --ask-become-pass

Now configure the nodes to have an easy-to-remember hostname with the help of Avahi, and configure the LED activity (the red LED activates on SD card activity):

Shell

ansible-playbook configure-hosts.yml --ask-become-pass

Reboot all the boards:

Shell

ansible-playbook reboot.yml --ask-become-pass

Install Docker:

Shell

ansible-playbook docker.yml --ask-become-pass

Set up Docker Swarm:

Shell

ansible-playbook docker-swarm.yml --ask-become-pass

Done! You have an Orange Pi cluster ready for fun!

Deploying MariaDB on Docker Swarm
I have to warn you here: I don't recommend running a database on container orchestration software (Docker Swarm, Kubernetes, and others) unless you are willing to put a lot of effort into it. This article is a lab, a learning exercise. Don't do this in production! Now let's get back to the fun. Run the following to deploy one MariaDB primary server, two MariaDB replica servers, and one MaxScale proxy:

Shell

ansible-playbook mariadb-stack.yml --ask-become-pass

The first time you do this, it will take some time. Be patient.
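Before moving on, here is what a playbook looks like in general. This is an illustrative sketch only, not one of the repository's actual files; the opiesz group name comes from the inventory defined earlier:

YAML

# upgrade-example.yml: illustrative sketch, not the repository's upgrade.yml
- name: Upgrade all packages on every Orange Pi node
  hosts: opiesz
  become: true
  tasks:
    - name: Update the apt cache and upgrade all packages
      ansible.builtin.apt:
        update_cache: true
        upgrade: dist

Running it with ansible-playbook executes the task on every host in the opiesz group in a single call.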
SSH into the manager node:

Shell

ssh orangepi@opiz01.local

Inspect the nodes in the Docker Swarm cluster:

Shell

docker node ls

Inspect the MariaDB stack:

Shell

docker stack ps mariadb

A cooler way to inspect the containers in the cluster is by using the Docker Swarm Visualizer. Deploy it as follows:

Shell

docker service create --name=viz --publish=9000:8080 --constraint=node.role==manager --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock alexellis2/visualizer-arm:latest

On your working computer, open a web browser and point it at port 9000 on the manager node (the port published in the previous command). You should see all the nodes in the cluster and the deployed containers.

Docker Swarm Visualizer showing MariaDB deployed

MaxScale is an intelligent database proxy with tons of features. For now, let's see how to connect to the MariaDB cluster through this proxy. Use a tool like DBeaver, DbGate, or even a database extension for your favorite IDE. Create a new database connection using the following connection details:

Host: opiz01.local
Port: 4000
Username: user
Password: password

Create a new table:

MariaDB SQL

USE demo;

CREATE TABLE messages(
    id INT PRIMARY KEY AUTO_INCREMENT,
    content TEXT NOT NULL
);

Insert some data:

MariaDB SQL

INSERT INTO messages(content)
VALUES ("It works!"), ("Hello, MariaDB"), ("Hello, Orange Pi");

When you execute this command, MaxScale sends it to the primary server. Now read the data:

MariaDB SQL

SELECT * FROM messages;

When you execute this command, MaxScale sends it to one of the replicas. This division of reads and writes is called read-write splitting.

The MaxScale UI showing a MariaDB cluster with replication and read-write splitting

You can also access the MaxScale UI. Use the following credentials:

Username: admin
Password: mariadb

Watch the following video if you want to learn more about MaxScale and its features. You won't regret it!
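If you prefer to verify the read-write splitting from code rather than from a SQL client, here is a minimal sketch. It assumes the mysql-connector-python package is installed; the connection details are the ones configured above. The @@hostname system variable reports which server actually answered the query:

Python

# Illustrative sketch: observe which server answers reads through MaxScale.
import mysql.connector

conn = mysql.connector.connect(
    host="opiz01.local", port=4000,
    user="user", password="password", database="demo"
)
cur = conn.cursor()

# Reads are routed to a replica; run this a few times and the reported
# hostname may change as MaxScale balances the load across replicas.
cur.execute("SELECT @@hostname")
print("read served by:", cur.fetchone()[0])

cur.close()
conn.close()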
Caches are very useful software components that all engineers must know. Caching is a transversal concern that applies to all tech areas and architecture layers, such as operating systems, data platforms, backend, frontend, and other components. In this article, we are going to describe what a cache is and explain specific use cases, focusing on the frontend and client side.

What Is a Cache?
A cache can be defined, in a basic way, as an intermediate memory between the data consumer and the data producer that stores and provides data that will be accessed many times by the same or different consumers. It is a layer that is transparent to the data consumer, except for the performance improvement it brings. Usually, the reusability of the data provided by the data producer is the key to taking advantage of the benefits of a cache. Performance is the other reason to use a cache system, such as in-memory databases that provide a high-performance solution with low latency, high throughput, and concurrency.

For example, how many people query the weather on a daily basis, and how many times do they repeat the same query? Let's suppose that there are 1,000 people in New York consulting the weather and 50% repeat the same query twice per day. In this scenario, if we can store the result of the first query as close as possible to the user's device, we achieve two benefits: the user experience improves because the data is provided faster, and the number of queries to the data producer/server side drops. Concretely, 1,000 first queries plus 500 repeats make 1,500 daily requests; serving the repeats from a client-side cache means only 1,000 of them reach the server. The outcome is a better user experience and a solution that will support more concurrent users on the platform.

At a high level, there are two caching strategies that we can apply in a complementary way:
- Client/Consumer Side: The cached data is stored on the consumer or user side, usually in the browser's memory when we are talking about web solutions (also called a private cache).
- Server/Producer Side: The cached data is stored in the components of the data producer's architecture.

Caches, like any other solution, have a series of advantages that we can summarize:
- Application performance: Caches provide faster response times because they can serve data more quickly.
- Reduced load on the server side: When we apply a cache in front of a system and reuse a piece of data, we avoid queries/requests to the following layer.
- Scalability and cost improvement: As cached data gets closer to the consumer, we increase the scalability and performance of the solution at a lower cost. Components closer to the client side are more scalable and cheaper for three main reasons: they are focused on performance and availability but offer weak consistency; they hold only part of the information (the data used most by the users); and, in the case of the browser's local cache, there is no cost for the data producer.

The big challenges of caching are data consistency and data freshness: how the data is kept synchronized and up-to-date across the organization. Depending on the use case, we will have more or fewer requirements and restrictions, because caching images is very different from caching inventory stock or sales data.

Client-Side Caches
Speaking about the client-side cache, there are different types of cache that we are going to analyze a little bit in this article:
- HTTP Caching: This caching type is an intermediate cache system, as it depends partially on the server.
- Cache API: This is a browser API that allows us to cache requests in the browser.
- Custom Local Cache: The front-end app controls the cache storage, expiration, invalidation, and updates.

HTTP Caching
HTTP caching stores the HTTP responses for any resource (CSS, HTML, images, video, etc.) in the browser, and the browser manages everything related to storage, expiration, validation, fetching, etc. From the application's point of view it is almost transparent, as the app makes a request in a regular way and the browser does all the "magic."

The way to control the caching is by using HTTP headers: on the server side, cache-specific headers are added to the HTTP response, for example: "Expires: Tue, 30 Jul 2023 05:30:22 GMT." The browser then knows this resource can be cached, and the next time the client (application) requests the same resource, if the request time is before the expiration date, the request will not be made; the browser will return the local copy of the resource.

HTTP caching also allows you to control how responses are distinguished, as the same URL can generate different responses (and their caches should be handled separately). For example, in an API endpoint that returns some data (i.e., http://example.com/my-data), we could use the Accept request header to specify whether we want the response in JSON or CSV. The cache should therefore be stored per response, depending on the request header(s). For that, the server should set the response header Vary: Accept to let the browser know the cache depends on that value.

There are a lot of different headers to control the cache flow and behavior, but it is not the goal of this article to go deep into them; they will probably be addressed in another article. As we mentioned before, this caching type needs the server to set the resource expiration, validation, etc. So this is not a pure frontend caching method or type, but it's one of the simplest ways to cache the resources the front-end application uses, and it is complementary to the other types we mention below. Also, as it is an intermediate cache, we can even delegate it to a "piece" between the client and the server; for example, a CDN or a reverse proxy (such as Varnish).

Cache API
The Cache API is quite similar to the HTTP caching method, but in this case, we control which requests are stored in or retrieved from the cache, and we have to manage the cache expiration ourselves (which is not easy, because those caches were designed to live "forever"). Although these APIs are available in windowed contexts, they are very much oriented toward use in a worker context.

This cache is very much oriented toward offline applications. On the first request, we can fetch and cache all the resources needed (images, CSS, JS, etc.), allowing the application to work offline. It is very useful in mobile applications, for example, with the use of maps for our GPS systems in addition to weather data. This allows us to have all the information for our hiking route even if we have no connection to the server. One example of how it works in a windowed context:

JavaScript

const url = 'https://catfact.ninja/breeds'

caches.open('v1').then((cache) => {
  cache.match(url).then((res) => {
    if (res) {
      console.log('it is in cache')
      res.json().then((data) => console.log(data))
    } else {
      console.log('it is NOT in cache')
      fetch(url).then((res) => {
        cache.put(url, res.clone())
      })
    }
  })
})

Custom Local Cache
In some cases, we will need more control over the cached data and its invalidation (not just expiration). Cache invalidation is more than just checking the max-age of a cache entry. Imagine the weather app we mentioned above.
This app allows users to update the weather to reflect the real conditions in a place. The app needs to make a request per city and transform the temperature values from ºF to ºC (this is a simple example; the calculations can be much more expensive in other use cases). To avoid making requests to the server (even if they are cached), we can make all the requests the first time, put all the data together in a data structure that is convenient for us, and store it, for example, in the browser's IndexedDB, in LocalStorage, in SessionStorage, or even in memory (not recommended). The next time we want to show the data, we can get it from the cache, including not just the resource data but also the computation we did, saving network and computation time.

We can control the expiration of the caches by storing the issue time alongside the cached data, and we can also control the cache invalidation. Imagine now that the user updates the weather in their browser. We can simply invalidate the cache and redo the requests and calculations next time, or go further and update our local cache with the new data. Or another user can change the value, and the server will send an event to notify all clients of the change (for example, using WebSockets), so our front-end application can listen for these events and invalidate or simply update the cache.

This kind of cache requires work on our side to check the caches, handle events that can invalidate or update them, etc., but it fits very well in a hexagonal architecture where the data is consumed from the API using a port adapter (repository) that can listen for domain events, react to changes, and invalidate or update some caches. This is not a generic cache solution; we need to consider whether it fits our use case, as it requires work on the front-end application side to invalidate the caches or to emit and handle data-change events. In most cases, HTTP caching is enough.

Conclusion
Having a cache solution and a good strategy should be a must in any software architecture; without them, our solution will be incomplete and probably not optimized. Caches are our best friends, especially in high-performance scenarios. The technical cache-invalidation process may seem like the main challenge, but the biggest challenge is understanding the business scenarios and use cases to identify the requirements in terms of data freshness and consistency, which allow us to design and choose the best strategy. We will talk about other cache approaches for databases, backends, and in-memory databases in upcoming articles.
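As a complement to the HTTP caching discussion above, here is a minimal server-side sketch of how those cache headers get set. Flask is an assumption here (any HTTP framework can set the same headers), and the endpoint and values are invented for the example:

Python

# Illustrative sketch: a server marking a response as cacheable.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/my-data")
def my_data():
    response = jsonify({"city": "New York", "temperature_c": 21})
    # Let browsers (and intermediaries) reuse this response for an hour.
    response.headers["Cache-Control"] = "public, max-age=3600"
    # Tell caches that the stored response depends on the Accept header.
    response.headers["Vary"] = "Accept"
    return response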
In its first week after launch, ChatGPT shattered Internet records by becoming extremely popular. As a person who works in QA automation, my initial thought when I started looking into it was how to use this wonderful platform to make the jobs of testers simpler for Web and UI automation. ChatGPT can be used to write code in a variety of programming languages and technologies. After more investigation, I decided to create some scenarios with it. I have created use cases around UI, API, and Cucumber feature file generation using ChatGPT. We can use ChatGPT to generate the code, but we can't say that the generated code is perfectly fine; we have to make minor tweaks to it.

The aim of this blog is to cover:
- How you can set up ChatGPT
- Whether ChatGPT is really helpful in generating Cypress/JavaScript code
- How we can generate an automation script for:
  - UI automation using Cypress/JavaScript
  - API automation using Cypress/JavaScript
  - Cucumber feature file generation

What Is ChatGPT?
ChatGPT is a variant of the GPT (Generative Pre-trained Transformer) language model that is specifically designed for use in chatbots and conversational systems. Like other GPT models, ChatGPT is trained on a large dataset of human-generated text and uses this training to generate human-like responses to prompts. However, ChatGPT has been modified to better handle the specific requirements of chatbot systems, including the ability to handle short, informal prompts and responses and to maintain context and coherence over the course of a conversation.

What Is Cypress?
Cypress is a JavaScript-based test automation framework for web applications. It is designed to make it easier to write and run automated tests for web applications and includes features such as a command-line interface (CLI), real-time reloading, and the ability to run tests in parallel. Cypress is particularly useful for testing modern web applications that use complex client-side interactions and asynchronous behavior. It allows developers to write tests that closely mimic the way a user would interact with the application and provides detailed feedback on the results of those tests. Cypress is often used in conjunction with other tools, such as continuous integration (CI) platforms and testing frameworks, as part of a larger testing strategy. It is particularly popular among developers who are looking for a fast, reliable, and easy-to-use testing solution for their web applications.

Set Up ChatGPT
Steps:
1. Search for "ChatGPT" in Google.
2. Open the first link in the search results.
3. Click on "Try ChatGPT." It will first check for a secure connection and then present two options: "Log in" and "Sign up."
4. Click on Sign up and sign up with a new or existing email account.
5. Enter your mobile number when prompted; a code will be sent to it. After entering the code, you are signed up for ChatGPT. The main screen finally opens once you click through the Next buttons in the wizard.

Let's Ask ChatGPT To Generate Automation Scripts
In this section, I am going to cover some scenarios by asking ChatGPT to generate scripts for:
- UI automation using Cypress/JavaScript
- API automation using Cypress/JavaScript
- Cucumber .feature files for different scenarios

Generate UI Automation Scripts for Cypress/JavaScript
Below are some UI scenarios I covered using Cypress/JavaScript. I asked ChatGPT the following queries to generate the scripts.
Question 1: Write Cypress/JavaScript code for opening the URL of the site.

Below is the code generated by ChatGPT for this query. When I ran it, it worked fine without any issues.

JavaScript

describe('My test', () => {
  it('Visits the website', () => {
    cy.visit('http://qaautomationlabs.com');
  });
});

Question 2: Write Cypress/JavaScript code for opening the URL of the site and printing the title.

Below is the code generated by ChatGPT for this query. When I ran it, it worked fine without any issues, and you can see it print the title in the console.

JavaScript

describe("My Test Suite", function () {
  it("My Test Case", function () {
    cy.visit("http://qaautomationlabs.com/"); // open the URL
    cy.title().then(function (title) {
      console.log(title); // print the title
    });
  });
});

Question 3: Write Cypress/JavaScript code for opening the URL of the site and clicking on the link "Blogs."

When I ran the generated code, it did not work at first. I had to modify the script: once I added { force: true }, it started working fine.

JavaScript

describe("Navigate to QAAutomationLabs website and click on Blogs link", () => {
  it("Visits the website and clicks on the Blogs link", () => {
    cy.visit("http://qaautomationlabs.com");
    cy.get('a').contains('Blogs').click({ force: true });
  });
});

Question 4: Login scenario: Write Cypress/JavaScript code to open the URL, log in with username standard_user and password secret_sauce, and click on the login button.

Below is the code generated by ChatGPT for this question. When I ran it, it worked fine without any issues, and I was able to log in with the provided credentials.

JavaScript

const username = "standard_user";
const password = "secret_sauce";

describe("Login to Saucedemo", () => {
  it("should log in with valid credentials", () => {
    cy.visit("https://www.saucedemo.com/");
    cy.get("#user-name").type(username);
    cy.get("#password").type(password);
    cy.get("#login-button").click();
  });
});

Generate API Automation Scripts for Cypress/JavaScript
ChatGPT is very useful for generating API automation code. I asked it to generate code for a few API requests, and the generated code looked good to me. Below are some API scenarios I covered using Cypress/JavaScript.

Question 1: Generate Cypress API code for a GET request to a link and verify status code 200.

Below is the code generated by ChatGPT for this question. When I ran it, it worked fine, and I was able to verify the response with status code 200.

JavaScript

describe("Send GET request and verify status code", () => {
  it("should send a GET request and verify the status code", () => {
    cy.request("GET", "https://reqres.in/api/users?page=2").then((response) => {
      expect(response.status).to.eq(200);
    });
  });
});

Question 2: Generate Cypress API code for a POST request to a link, passing "name": "morpheus", "job": "leader" in the body, and verify status code 201.
Below is the code generated by ChatGPT for this question. When I ran it, it worked fine, and I was able to verify the response with status code 201.

JavaScript

const data = {
  name: 'morpheus',
  job: 'leader',
};

describe('Send POST request and verify status code', () => {
  it('should send a POST request and verify the status code', () => {
    cy.request('POST', 'https://reqres.in/api/users?page=2', data).then((response) => {
      expect(response.status).to.eq(201);
    });
  });
});

Question 3: Generate Cypress API code for a DELETE request to a URL and verify status code 204.

Below is the code generated by ChatGPT for this question. When I ran it, it worked fine, and I was able to verify the response with status code 204.

JavaScript

describe("Send DELETE request and verify status code", () => {
  it("should send a DELETE request and verify the status code", () => {
    cy.request({
      method: "DELETE",
      url: "https://reqres.in/api/users?page=2",
    }).then((response) => {
      expect(response.status).to.eq(204);
    });
  });
});

Generate Cucumber .Feature Files Using ChatGPT
ChatGPT is very useful for creating .feature files. I asked it to generate .feature files for some of the scenarios below, and the generated files looked good to me.

Question 1: Generate the Cucumber feature file to open the URL https://qaautomationlabs.com, click on the "Read more" button, and verify the text "Running End-to-End Cypress Test cases In Google Cloud Build Pipeline."

Gherkin

Feature: Navigate to qaautomationlabs.com, click a button, and verify text
  As a user
  I want to visit qaautomationlabs.com, click a button on the page, and verify that certain text is displayed
  So that I can see the content that the button reveals and confirm that the expected text is present

  Scenario: Click the button and verify text
    Given I am on the qaautomationlabs.com homepage
    When I click the "Read more" button
    Then I should see the content that the button reveals
    And I should see the text "Running End-to-End Cypress Test cases In Google Cloud Build Pipeline"

Question 2: Generate the Cucumber feature file to open the URL, log in with username standard_user and password secret_sauce, and click on the LOGIN button.

Below is the file generated by ChatGPT for this question:

Gherkin

Feature: Navigate to qaautomationlabs.com and click a button
  As a user
  I want to visit qaautomationlabs.com and click a button on the page
  So that I can see the content that the button reveals

  Scenario: Click the button
    Given I am on the qaautomationlabs.com homepage
    When I click the "Read more" button
    Then I should see the content that the button reveals

Question 3: Generate a Cucumber data table.

Below is the file generated by ChatGPT for this question:

Gherkin

Feature: Login feature
  As a user, I want to be able to log in to the website so that I can access my account.

  Scenario Outline: Successful login
    Given I am on the login page
    When I enter my email address "<email>" and password "<password>"
    And I click the login button
    Then I should be logged in

    Examples:
      | email                   | password     |
      | standard_user           | secret_sauce |
      | locked_out_user         | secret_sauce |
      | problem_user            | secret_sauce |
      | performance_glitch_user | secret_sauce |

Conclusion
The ChatGPT natural language model shows tremendous promise and is quite powerful. It has impressive capabilities and will most likely set the standard for low-code testing solutions to a wide range of issues.
We can use ChatGPT to generate the code, but we can't say that the generated code is perfectly fine; you have to make some modifications. The good thing is that, in most cases, the generated code is pretty close to runnable code.
As the digital age progresses, the need for efficient and secure data governance practices becomes more crucial than ever. This article delves into the concept of User Data Governance and its implementation using serverless streaming. We will explore the benefits of using serverless streaming for processing user data and how it can lead to improved data governance and increased privacy protection. Additionally, we will provide code snippets to illustrate the practical implementation of serverless streaming for user data governance.

Introduction
User Data Governance refers to the management of user data, including its collection, storage, processing, and protection. With the ever-increasing amount of data generated daily, organizations must develop robust and efficient data governance practices to ensure data privacy, security, and compliance with relevant regulations. In recent years, serverless computing has emerged as a promising solution to the challenges of data governance. This paradigm shift allows organizations to build and run applications without managing the underlying infrastructure, enabling them to focus on their core business logic. Serverless streaming, in particular, has shown great potential in processing large volumes of user data in real time, with minimal latency and scalable performance.

Serverless Streaming for User Data Processing
Serverless streaming is a cloud-based architecture that enables real-time data processing without the need to provision or manage servers. It provides on-demand scalability and cost-effectiveness, making it an ideal choice for processing large volumes of user data. This section examines the key components of serverless streaming for user data governance.

1.1. Event Sources
An event source is any system or application that generates data in real time. These sources can include user activity logs, IoT devices, social media feeds, and more. By leveraging serverless streaming, organizations can ingest data from these diverse sources without worrying about infrastructure management. For example, consider an AWS Kinesis data stream that ingests user activity logs:

Python

import boto3

kinesis_client = boto3.client('kinesis', region_name='us-west-2')

response = kinesis_client.create_stream(
    StreamName='UserActivityStream',
    ShardCount=1
)

1.2. Stream Processing
Stream processing involves the real-time analysis of data as it is generated by event sources. Serverless platforms, such as AWS Lambda, Google Cloud Functions, and Azure Functions, enable developers to create functions that process data streams without managing the underlying infrastructure. These functions can be triggered by specific events, allowing for the real-time processing of user data. For instance, an AWS Lambda function that processes user activity logs from the Kinesis data stream (note that Kinesis record data arrives base64-encoded in the Lambda event, so it must be decoded first):

Python

import base64
import json

def lambda_handler(event, context):
    for record in event['Records']:
        # Kinesis delivers record data base64-encoded.
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        process_user_activity(payload)

def process_user_activity(activity):
    # Process user activity data here
    pass

1.3. Data Storage
The processed data must be stored securely to ensure proper data governance. Serverless storage solutions, such as Amazon S3, Google Cloud Storage, and Azure Blob Storage, offer scalable and secure storage options that automatically scale with the size of the data.
For example, storing processed user activity data in an Amazon S3 bucket:

Python

import json

import boto3

s3_client = boto3.client('s3')

def store_processed_data(data, key):
    s3_client.put_object(
        Bucket='my-processed-data-bucket',
        Key=key,
        Body=json.dumps(data)
    )

Benefits of Serverless Streaming for User Data Governance
The serverless streaming architecture offers several benefits for user data governance, including:

2.1. Scalability
One of the main advantages of serverless streaming is its ability to scale automatically based on the volume of incoming data. This ensures that organizations can handle fluctuating workloads, such as seasonal trends or unexpected surges in user activity, without the need to over-provision resources.

2.2. Cost-Effectiveness
Serverless streaming follows a pay-as-you-go pricing model, meaning organizations only pay for the resources they actually consume. This eliminates the need for upfront investments in infrastructure and reduces overall operational costs.

2.3. Flexibility
Serverless streaming allows organizations to process data from multiple event sources and quickly adapt their data processing pipelines to changing business requirements. This flexibility enables them to stay agile and responsive to evolving user data governance needs.

2.4. Security
With serverless streaming, organizations can implement various security measures, such as encryption, data masking, and access control, to protect user data at rest and in transit. Additionally, serverless platforms typically offer built-in security features, such as automatic patching and monitoring, to ensure the highest level of data protection.

Compliance and Privacy in Serverless Streaming
As organizations adopt serverless streaming for user data governance, they must address several privacy and compliance concerns, including:

3.1. Data Sovereignty
Data sovereignty refers to the concept that data should be stored and processed within the borders of the country where it was generated. Serverless streaming platforms must support multi-region deployment to comply with data sovereignty requirements and ensure proper user data governance.

3.2. GDPR and Other Data Protection Regulations
Organizations must adhere to the General Data Protection Regulation (GDPR) and other data protection laws when processing user data. Serverless streaming platforms should provide features to facilitate compliance, such as data anonymization, deletion, and consent management.

3.3. Privacy by Design
Privacy by Design is a proactive approach to data privacy that embeds privacy considerations into the design and architecture of systems and processes. Serverless streaming platforms should support Privacy by Design principles, enabling organizations to implement privacy-enhancing techniques and best practices.

Best Practices for Implementing User Data Governance With Serverless Streaming
To ensure robust user data governance using serverless streaming, organizations should follow these best practices:

4.1. Assess Data Sensitivity
Before processing user data, organizations should evaluate the sensitivity of the data and apply appropriate security measures based on the data classification.

4.2. Encrypt Data at Rest and in Transit
Data should be encrypted both at rest (when stored) and in transit (during processing and transmission) to protect against unauthorized access.

4.3. Implement Access Control
Organizations should implement strict access control policies to limit who can access and process user data.
4.4. Monitor and Audit

Continuous monitoring and auditing of serverless streaming platforms are essential to ensure data governance, detect security incidents, and maintain compliance with relevant regulations.

4.5. Leverage Data Retention Policies

Organizations should implement data retention policies to ensure that user data is stored only for the duration necessary and is deleted when no longer needed.

Conclusion

User Data Governance is an essential aspect of modern digital businesses, and serverless streaming offers a promising approach to addressing its challenges. By leveraging the scalability, cost-effectiveness, and flexibility of serverless streaming, organizations can process and manage large volumes of user data more efficiently and securely. By adhering to best practices and regulatory requirements, they can achieve robust user data governance and privacy protection.
As data becomes increasingly important for businesses, the need for scalable, efficient, and cost-effective data storage and processing solutions is more critical than ever. Data Lakehouses have emerged as a powerful tool to help organizations harness the benefits of both Data Lakes and Data Warehouses. In the first article, we highlighted the key benefits of Data Lakehouses for businesses, while the second article delved into the architectural details. In this article, we will focus on three popular Data Lakehouse solutions: Delta Lake, Apache Hudi, and Apache Iceberg. We will explore the key features, strengths, and weaknesses of each solution to help you make an informed decision about the best fit for your organization's data management needs.

Data Lakehouse Innovations: Exploring the Genesis and Features of Delta Lake, Apache Hudi, and Apache Iceberg

The three Data Lakehouse solutions we will discuss in this article — Delta Lake, Apache Hudi, and Apache Iceberg — have all emerged to address the challenges of managing massive amounts of data and providing efficient query performance for big data workloads. Although they share some common goals and characteristics, each solution has its unique features, strengths, and weaknesses.

Delta Lake was created by Databricks and is built on top of Apache Spark, a popular distributed computing system for big data processing. It was designed to bring ACID transactions, scalable metadata handling, and unification of batch and streaming data processing to Data Lakes. Delta Lake has quickly gained traction in the big data community due to its compatibility with a wide range of data platforms and tools, as well as its seamless integration with the Apache Spark ecosystem.

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source project developed by Uber to efficiently manage large-scale analytical datasets on Hadoop-compatible distributed storage systems. Hudi provides upserts and incremental processing capabilities to handle real-time data ingestion, allowing for faster data processing and improved query performance. With its flexible storage and indexing mechanisms, Hudi supports a wide range of analytical workloads and data processing pipelines.

Apache Iceberg is an open table format for large-scale, high-performance data management, initially developed by Netflix. Iceberg aims to provide a more robust and efficient foundation for data lake storage, addressing the limitations of earlier approaches such as Apache Hive tables stored in file formats like Apache Parquet. One of its most significant innovations is a flexible and powerful schema evolution mechanism, which allows users to evolve a table schema without rewriting existing data. Iceberg also focuses on improving metadata management, making it scalable and efficient for very large datasets.

Each of these solutions has evolved in response to specific needs and challenges in the big data landscape, and they all bring valuable innovations to the Data Lakehouse concept. In the following sections, we will delve into the technical aspects of each solution, examining their data storage and file formats, data versioning and history, data processing capabilities, query performance optimizations, and the technologies and infrastructure required for their deployment.

Navigating Delta Lake: Key Aspects of Data Storage, Processing, and Access

Delta Lake employs the open-source Parquet file format, a columnar storage format optimized for analytical workloads.
It enhances the format by introducing an ACID transaction log, which maintains a record of all operations performed on the dataset. This transaction log, combined with the file storage structure, ensures reliability and consistency in the data.

Data versioning and history are essential aspects of Delta Lake, enabling users to track changes and roll back to previous versions if necessary. The transaction log records every operation, thus providing a historical view of the data and allowing for time-travel queries.

Delta Lake ensures efficient query performance by implementing various optimization techniques. One such technique is data compaction, which combines small files into larger ones to improve read performance. Furthermore, it employs a mechanism called Z-Ordering to optimize the organization of data on disk, which reduces the amount of data read during queries.

For data access, Delta Lake provides a simple and unified API to read and query data from its tables. You can use time-travel queries to access historical versions of your data or perform complex analytical operations using the supported query engines.

To store data in Delta Lake format, data must first be processed and saved in the appropriate file format. Here's an example code snippet for writing data to a Delta Lake table using Apache Spark, followed by reading it back:

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Delta Lake Write and Read Example") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0") \
    .getOrCreate()

# Read data from a source, e.g., a CSV file
data = spark.read.format("csv").load("/path/to/csv-file")

# Write data to a Delta Lake table
data.write.format("delta") \
    .mode("overwrite") \
    .save("/path/to/delta-lake-table")

# Read data from the Delta Lake table
delta_data = spark.read.format("delta") \
    .load("/path/to/delta-lake-table")

# Perform some transformations and actions on the data
result = delta_data.filter("some_condition").groupBy("some_column").count()
result.show()

In the code snippet above, we use the "delta" format to write and read data to and from a Delta Lake table. The Delta Lake library is included in the Spark session by adding the "io.delta:delta-core_2.12:2.3.0" package to the "spark.jars.packages" configuration.

Delta Lake supports a wide range of query engines, including Apache Spark, Databricks Runtime, and Presto. It also provides APIs for programming languages such as Scala, Python, SQL, and Java, enabling seamless integration with existing data processing pipelines. Delta Lake integrates with various data platforms and tools, such as Apache Hive, Apache Flink, and Apache Kafka. In terms of deployment, it can be used in on-premises environments as well as on cloud platforms like AWS, Azure, and GCP. For storage, Delta Lake can work with distributed file systems like HDFS or cloud-based storage services such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.
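Since time-travel queries came up above, here is a minimal sketch of reading an older snapshot of the same table, reusing the spark session and table path from the snippet above. The version number and timestamp are illustrative assumptions that depend on the table's actual history:

Python

# Read the table as it existed at a specific version (time travel)
old_version = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/path/to/delta-lake-table")

# Alternatively, read the table as of a point in time
as_of_time = spark.read.format("delta") \
    .option("timestampAsOf", "2023-01-01 00:00:00") \
    .load("/path/to/delta-lake-table")

Time travel is backed by the transaction log described earlier, so older snapshots remain queryable until their files are cleaned up (e.g., by a VACUUM operation).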
Data Management in Apache Hudi: Exploring Its Core Components

Apache Hudi is another powerful Data Lakehouse solution that provides efficient data storage and querying capabilities. Like Delta Lake, it uses Parquet as its underlying file format and adds a transaction log for ACID compliance. Hudi's storage management system enables upserts, incremental processing, and rollback support, allowing for efficient data ingestion and access.

One of the key aspects of Apache Hudi is its built-in support for data partitioning, which helps optimize query performance by reducing the amount of data scanned during query execution. Hudi also provides a mechanism called "indexing" to enable fast record-level lookups, updates, and deletes.

Hudi supports various query engines, including Apache Spark, Apache Hive, and Presto, and offers APIs for languages like Scala, Python, SQL, and Java. This flexibility ensures seamless integration with your existing data processing infrastructure.

To write and read data using Apache Hudi, you can use the following code snippet. The get_hudi_options() helper returns a minimal, illustrative set of Hudi options; the table name, record key, and precombine field are placeholders you must adapt to your schema:

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Apache Hudi Write and Read Example") \
    .config("spark.jars", "path/to/hudi-spark-bundle.jar") \
    .getOrCreate()

def get_hudi_options():
    # Minimal illustrative options; adapt these to your actual schema.
    return {
        "hoodie.table.name": "hudi_table",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": "upsert",
    }

# Read data from a source, e.g., a CSV file
data = spark.read.format("csv").load("/path/to/csv-file")

# Write data to an Apache Hudi table
data.write.format("org.apache.hudi") \
    .options(**get_hudi_options()) \
    .mode("overwrite") \
    .save("/path/to/hudi-table")

# Read data from the Apache Hudi table
hudi_data = spark.read.format("org.apache.hudi") \
    .load("/path/to/hudi-table/*")

# Perform some transformations and actions on the data
result = hudi_data.filter("some_condition").groupBy("some_column").count()
result.show()

In the example above, the "org.apache.hudi" format is specified for writing data to an Apache Hudi table. The required Hudi library is added to the Spark session by specifying the "hudi-spark-bundle.jar" path in the "spark.jars" configuration.

Apache Iceberg Basics: A Journey Through Data Management Fundamentals

Apache Iceberg is a relatively new addition to the Data Lakehouse landscape. It is an open table format that provides strong consistency, snapshot isolation, and efficient query performance. Like Delta Lake and Apache Hudi, Iceberg uses Parquet as its underlying file format and builds additional features on top of it.

Iceberg's schema evolution mechanism is one of its most significant innovations. It allows users to evolve a table schema without rewriting existing data, making it possible to add, delete, or update columns in a table while preserving the existing data layout.

Another key aspect of Iceberg is its scalable and efficient metadata management system. It uses a combination of manifest files and metadata tables to store information about table data, making it easier to manage large datasets. Iceberg optimizes query performance by employing techniques like predicate pushdown, which reduces the amount of data read during query execution.

Iceberg supports a variety of query engines, including Apache Spark, Apache Flink, and Trino (formerly known as PrestoSQL). It also provides APIs for programming languages such as Scala, Python, SQL, and Java, ensuring seamless integration with your existing data processing infrastructure.
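To make the schema evolution point concrete, here is a hedged sketch using Spark SQL against an Iceberg table. It assumes a Spark session configured with an Iceberg catalog named iceberg_catalog and the Iceberg Spark SQL extensions enabled; the namespace, table, and column names are placeholders:

Python

# Evolve an Iceberg table's schema in place; existing data files are not rewritten.
spark.sql("ALTER TABLE iceberg_catalog.db.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE iceberg_catalog.db.events RENAME COLUMN country TO region")
spark.sql("ALTER TABLE iceberg_catalog.db.events DROP COLUMN region")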
To write and read data using Apache Iceberg, you can use the following code snippet:

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Apache Iceberg Write and Read Demonstration") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark3-runtime:0.13.2") \
    .getOrCreate()

# Load data from a source, such as a CSV file
data = spark.read.format("csv").load("/path/to/csv-file")

# Write data to an Apache Iceberg table
data.write.format("iceberg") \
    .mode("overwrite") \
    .save("iceberg_catalog_namespace.table_name")

# Load data from the Apache Iceberg table
iceberg_data = spark.read.format("iceberg") \
    .load("iceberg_catalog_namespace.table_name")

# Apply transformations and actions to the data
result = iceberg_data.filter("some_condition").groupBy("some_column").count()
result.show()

In the example above, the "iceberg" format is specified for writing data to an Apache Iceberg table. The Iceberg library is included in the Spark session by adding the "org.apache.iceberg:iceberg-spark3-runtime:0.13.2" package to the "spark.jars.packages" configuration.

Iceberg can be deployed in on-premises environments or on cloud platforms like AWS, Azure, and GCP. It supports various storage systems, including distributed file systems like HDFS and cloud-based storage services such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.

Weighing the Pros and Cons: Analyzing Delta Lake, Apache Hudi, and Apache Iceberg

To help you make an informed decision about which Data Lakehouse solution is best for your organization, we have compared the features of Delta Lake, Apache Hudi, and Apache Iceberg across a set of factors. In the table below, each factor is evaluated as supported (+), unsupported (-), or partly supported (±).

| Criterion | Delta Lake | Apache Hudi | Apache Iceberg |
| --- | --- | --- | --- |
| ACID Transactions | + | + | + |
| Schema Evolution | + | + | + |
| Time Travel (Data Versioning) | + | + | + |
| Data Partitioning | + | + | + |
| Upserts and Deletes | + | + | + |
| Incremental Processing | + | + | ± |
| Data Deduplication | ± | + | ± |
| Metadata Scalability | + | ± | + |
| Compaction Management | ± | + | + |
| Merge-on-Read and Copy-on-Write Storage | - | + | - |
| Query Optimization Techniques | + | ± | + |
| Support for Multiple Query Engines | + | + | + |
| Integration with Other Data Platforms and Tools | + | + | + |
| Cloud-Native Storage Compatibility | + | + | + |
| Ease of Deployment and Management | + | ± | + |

This table provides a high-level comparison of the features supported by Delta Lake, Apache Hudi, and Apache Iceberg. It is important to note that each solution has its unique strengths and trade-offs, and the best choice for a specific use case depends on the organization's requirements, existing infrastructure, and familiarity with the technologies involved.

Summing Up the Data Lakehouse Landscape: Key Insights and Analysis

In conclusion, the Data Lakehouse concept has emerged as a promising solution to address the challenges of traditional data warehouses and data lakes, providing a unified platform for scalable, reliable, and efficient data management. As organizations strive to harness the power of their data, selecting the right Data Lakehouse solution becomes crucial for optimizing performance and adaptability. Throughout this comparison, we have examined the key aspects of three prominent Data Lakehouse solutions: Delta Lake, Apache Hudi, and Apache Iceberg. Each of these solutions has its unique strengths and trade-offs, catering to a variety of use cases and requirements.
By assessing their data storage, processing, and access capabilities, as well as their integration with existing technologies and infrastructure, organizations can make informed decisions about which solution best aligns with their needs.

While the comparison table highlights the high-level differences between Delta Lake, Apache Hudi, and Apache Iceberg, it is essential to consider the specific requirements and constraints of each organization. Factors such as ease of deployment, compatibility with current infrastructure, and familiarity with the underlying technologies can significantly impact the success of a Data Lakehouse implementation.

In our next article, we will delve deeper into the technologies used for implementing Data Lakehouses, exploring the underlying mechanisms, tools, and best practices that can help organizations optimize their data management strategies.

Ultimately, the choice between Delta Lake, Apache Hudi, and Apache Iceberg will depend on a careful evaluation of their respective features, trade-offs, and alignment with the organization's objectives. By thoroughly understanding the capabilities of each solution, organizations can build a future-proof Data Lakehouse infrastructure that facilitates data-driven decision-making and unlocks new insights to drive business growth.
Redpanda is an event streaming platform that is free and open source, similar to MariaDB and CockroachDB. It is compatible with the Kafka APIs and is used by many as an alternative to Apache Kafka due to its performance and lightweight design. Kubernetes (K8s) is the de facto platform for cloud-native environments, so it's not surprising that many developers choose it to manage their Redpanda clusters. But when things go wrong, it's not as simple as "kill it, dump it, and rebuild" — much like with other data-intensive software: databases, messaging systems, and even Apache Kafka® itself. This is especially true when you're streaming vast amounts of data with high throughput.

This blog post uncovers all the nuts and bolts of running a Redpanda cluster in K8s. It covers install-time configuration such as resource allocation, network settings, and the location of the persistent store, as well as what to do for day-two operations. Finally, I've included a hands-on sandbox for you to play with.

The Components

Redpanda and Kubernetes use similar terms to describe their structure, which makes it all simpler to wrap your head around. Here's a brief view of each component:

Redpanda broker: The single binary instance of Redpanda, with built-in schema registry, HTTP proxy, and message broker capabilities.
Redpanda cluster: One or more Redpanda broker instances, each aware of all members in the cluster. Provides scale, reliability, and coordination using the Raft consensus algorithm.
K8s worker node: A physical or virtual machine that runs the containers and does any work assigned to them by the K8s control plane.
K8s cluster: A group of K8s worker nodes and control plane nodes that orchestrate the containers running on top, with defined CPU, memory, network, and storage resources.
Pod: A runtime deployment of the container that encapsulates a Redpanda broker — ephemeral by nature, sharing storage and network resources in the same K8s cluster.

The Architecture of Redpanda in Kubernetes

I currently recommend using the Redpanda Helm chart to deploy a Redpanda cluster in Kubernetes. The Helm chart makes it straightforward to deploy a multi-broker Redpanda cluster and provides the Redpanda Console for easy administration. Regardless of how you deploy (Helm or Operator), the fundamental components in Kubernetes won't change. Here's a diagram of what your K8s cluster will look like after deployment:

Let's focus on the Redpanda broker pods for a minute. You can break them down into three logical sections:

Redpanda broker: The actual engine of the streaming platform. It's in charge of streaming events to and from producers and consumers, writing data into storage tiers, and replicating partitions.
The networks: These handle communication between the broker and clients outside of the K8s cluster.
Persistent storage: For storing events and the shadow index cache, if Tiered Storage is turned on.

You can change the Redpanda configuration by customizing settings in the Helm chart; I recommend defining configuration settings in your own values.yaml file when installing the chart. There's a lot more to learn about these three sections, so let's dig a little deeper into each one.

Redpanda Brokers in StatefulSet

Redpanda relies on a StatefulSet to guarantee the uniqueness of each broker. Partitions are replicated and distributed evenly across brokers, so every broker needs to keep its state (and remember who it was) even after a restart, in order to pick up where it left off.
The chart also applies K8s anti-affinity rules so that broker pods are not scheduled together on the same K8s node. This ensures that:

- Redpanda brokers don't compete for the same CPU cores and memory resources on a machine.
- Multiple brokers won't go down at once if the underlying K8s worker node goes offline.

For the best performance, allocate at least 80% of each K8s node's memory and CPU cores to the container, leaving the remainder for utilities and K8s processes.

The way Redpanda handles cluster- and topic-level configuration is no different from other ways of deployment. Settings are replicated across the Redpanda cluster and stored internally inside it. The best way to configure them is through the use of rpk — Redpanda's command line interface (CLI) utility. The broker-specific configurations are mounted from a ConfigMap into local storage under the folder /etc/redpanda.

If TLS is enabled, Redpanda takes advantage of the cert-manager operator to generate the certificates (which helps with automatic renewal). By default, we set Let's Encrypt as the certificate issuer; you can overwrite it with your preferred certificate issuer. The issued certificate is then stored in a secret, mounted into the broker pod, and used by the broker for encryption.

The Networks

K8s distributes network traffic among multiple pods of the same service, but for solutions like Kafka or Redpanda, that just won't work: clients have to reach the broker that hosts the leader of the partition directly. The headless service gives each pod running a Redpanda broker a unique ID. Intra-broker communication, such as leader election and partition replication, as well as clients living inside the same K8s cluster, use the pod's internal K8s address/IP to communicate.

For external communication, Redpanda uses a NodePort service by default. This exposes the listeners on a static port on the K8s node IP. It's best to customize a domain name and set up DNS to mask the IPs. Alternatively, you can set up a LoadBalancer service for each broker, where the network traffic will be routed via the internal K8s controller. You will also have to configure the right advertised Kafka address so that clients can locate the brokers correctly. For a deep dive into how it works, check out our blog on "What is advertised Kafka address?". Pay special attention to the section about using the Kafka address and advertised Kafka address in K8s.
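As a quick sanity check that the advertised addresses are set up correctly, a client outside the cluster can try producing and consuming a message. This is a minimal sketch using the confluent-kafka Python client; the node address, port, and topic name are assumptions that depend on your NodePort or LoadBalancer setup:

Python

from confluent_kafka import Producer, Consumer

# Address of a K8s node plus the NodePort exposing the Kafka listener (illustrative)
bootstrap = "k8s-node-1.example.com:31092"

producer = Producer({"bootstrap.servers": bootstrap})
producer.produce("test-topic", value=b"hello from outside the cluster")
producer.flush()

consumer = Consumer({
    "bootstrap.servers": bootstrap,
    "group.id": "connectivity-check",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["test-topic"])
msg = consumer.poll(10.0)  # wait up to 10 seconds for the message
print(msg.value() if msg else "no message received; check advertised addresses")
consumer.close()

If the producer connects but the consumer times out, the usual culprit is an advertised address that resolves inside the cluster but not outside it.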
Persistent Storage

Storage plays the most important role. It is defined as Persistent Volumes (PVs) claimed via a Persistent Volume Claim (PVC) for each pod running a Redpanda broker. The PVs can be predefined or dynamically provisioned, and they determine the size of your storage as well as its type (SSD, NVMe SSD). Storage placement heavily impacts IOPS capacity and the location of your data. Here's a good rule of thumb to follow when choosing between local and remote storage: the further away from the K8s cluster, the higher the latency.

You can choose to use local, ephemeral storage on the K8s worker node, but there's a risk of data loss in the case of node failure. If this happens, Redpanda will automatically attempt to replicate the data from other brokers to minimize that risk.

Always consider enabling Tiered Storage to leverage its benefits, including reduced storage cost and improved recoverability. Tiered Storage can help avoid "disk full" problems by automatically offloading old data into a cloud object store, which in turn reduces the local storage requirements of the Redpanda cluster.

One thing to keep in mind with Tiered Storage is the partition cache: a portion of the disk space dedicated to temporarily holding data fetched from the object store. Make sure you account for it while planning. If you absolutely need to expand the size of your storage, note that not all storage types support volume expansion. Volume types that allow resizing include, for example, gcePersistentDisk, awsElasticBlockStore, and Azure Disk.

Maintaining, Monitoring, and Optimizing Redpanda in Kubernetes

Deploying your Redpanda cluster in K8s doesn't mean it's time to put your feet up. To keep the cluster running smoothly, you want to continuously maintain, monitor, and optimize the system — known as "day two operations". Day two activities typically include:

- Scaling the cluster
- Upgrading to access new features and bug fixes
- Changing replication or partition settings
- Modifying the segment size
- Setting back pressure
- Tweaking data retention policies

One important day-two operation is the rolling upgrade, which lets you safely move the software release forward while minimizing client impact. Let's take a closer look at how to manage rolling upgrades in K8s.

Rolling Upgrades With Kubernetes

K8s adds a layer of complexity when doing upgrades. Here's what you need to do:

1. Turn off some of K8s' self-healing features so they won't interfere with the broker's plans for partition relocation.
2. Make sure you have the latest Helm chart from Redpanda, then delete the StatefulSet with cascade set to orphan so the old brokers keep running.
3. Use Helm to upgrade; it will redeploy a new StatefulSet with the latest version of Redpanda configured (with the OnDelete strategy, so it doesn't immediately restart the brokers).
4. Once that's set, carry on as you would normally upgrade a broker: put one broker into maintenance mode, wait for it to drain, let the cluster rebalance, and delete the pod. The new StatefulSet will schedule a new pod running the latest version of the broker.

With the PV being a separate entity, the data remains and can be reused. This allows Redpanda to recover faster and spring back to 100% capacity.
This is part 2 of a multi-part sample application that I've been building using Kafka and Apache Pinot. In the first part, we built an NFC Badge Reader to record when badges are swiped. In this part, we will begin ingesting some environmental data from a sensor into the mix. I'm not going to spoil the fun by telling you where this is ultimately headed, but feel free to reach out to me on Twitter and see if you can guess! As always, all of the code is available on GitHub.

Hardware

Hardware is part of this demo because I like building hardware, and it's always more fun to have something you can actually interact with. The hardware we'll be building is a CO2 sensor that reads the CO2 concentration in the air, the temperature, and the relative humidity, and sends the readings to a Kafka topic.

The sensor is based on the SCD-30, a non-dispersive infrared (NDIR) sensor that measures the CO2 concentration in the air. It's a bit expensive, but it is also an industrial-quality sensor that can be relied on to give accurate readings. You will see a lot of projects based on cheaper "eCO2" sensors, but they are not as accurate as the SCD-30. I have done extensive testing of a variety of CO2 sensors, and the eCO2 sensors can be off by as much as 200%, which is not something you want to rely on if accuracy matters.

The sensor is connected to a SparkFun ESP32 Thing Plus - Qwiic, which reads the sensor data and sends it to a Kafka topic. I used these two components specifically because they both implement the Qwiic connector, which makes it easy to connect them together. The Qwiic connector is a standard 4-pin connector that uses I2C and is used by a number of different sensors and boards. It eliminates the need for soldering, since you can just plug the sensor into the board and start coding.

The ESP32 Thing Plus is what's referred to as a "Feather" because of its form factor. Really, you can use any ESP32 board for this, but the Feather is a great board for this project because it has built-in WiFi and Bluetooth, which makes it easy to connect to the Internet. It also has a built-in battery charger, which makes it easy to power the sensor with a battery if you want to place it somewhere out of reach. I programmed it using the Arduino IDE, which is a great tool for prototyping.

C++

#include <ArduinoJson.h>                     // Serialize and deserialize JSON
#include "EspMQTTClient.h"                   // Handle MQTT
#include <Wire.h>                            // I2C protocol
#include "SparkFun_SCD30_Arduino_Library.h"  // The CO2 sensor

EspMQTTClient client(
  "SSID",        // Your SSID
  "PASSWD",      // Your WiFi password
  "mqtt-broker", // MQTT broker server IP or hostname
  "PinotClient", // Client name that uniquely identifies your device
  8883           // The MQTT port (defaults to 1883; this line can be omitted)
);

SCD30 airSensor; // Our sensor

void setup() {
  Serial.begin(115200);

  // Optional functionality of EspMQTTClient
  client.enableDebuggingMessages(); // Enable debugging messages sent to serial output

  Wire.begin();
  if (airSensor.begin() == false) {
    Serial.println("Air sensor not detected. Please check wiring. Freezing...");
    while (1) // infinite loop of nothingness
      ;
  }
}

// This function is called once everything is connected (WiFi and MQTT)
// WARNING: YOU MUST IMPLEMENT IT IF YOU USE EspMQTTClient
void onConnectionEstablished() {
  while (true) { // do this forever
    while (airSensor.dataAvailable()) {
      int co2 = airSensor.getCO2();
      float temp = airSensor.getTemperature();
      float hum = airSensor.getHumidity();

      StaticJsonDocument<96> doc;
      doc["sensor"] = "SCD-30";
      doc["co2"] = co2;
      doc["temp"] = temp;
      doc["humid"] = hum;

      char buff[128];
      serializeJson(doc, buff);
      serializeJson(doc, Serial); // this is for debugging
      Serial.println();
      client.publish("co2", buff); // publish the data to the co2 topic
    }
  }
}

void loop() {
  client.loop();
}
Freezing..."); while (1) // infinite loop of nothingness ; } } // This function is called once everything is connected (Wifi and MQTT) // WARNING : YOU MUST IMPLEMENT IT IF YOU USE EspMQTTClient void onConnectionEstablished() { while (true) { // do this forever while (airSensor.dataAvailable()) { int co2 = airSensor.getCO2(); float temp = airSensor.getTemperature(); float hum = airSensor.getHumidity(); StaticJsonDocument<96> doc; doc["sensor"] = "SCD-30"; doc["co2"] = co2; doc["temp"] = temp; doc["humid"] = hum; char buff[128]; serializeJson(doc, buff); serializeJson(doc, Serial); // this is for debugging Serial.println(); client.publish("co2", buff); // publish the data to the co2 topic } } } void loop() { client.loop(); } That’s the entire Arduino sketch. It connects to WiFi, connects to an MQTT broker, and then reads the sensor data and publishes it to the co2 topic. You can find the code on GitHub. The SCD-30 sensor can really only provide about one reading/second, so there’s no need to do anything fancy to make it faster. Getting the Readings Into Kafka As it turns out there isn’t a good way to get data straight from an Arduino device into Kafka which is why we sent it to the MQTT broker above. Now that it’s in the MQTT broker we have to get it out and feed it into our Kafka topic. One of the other drawbacks of using an Arduino device is that they are not very good at keeping time. It is possible to use Network Time Protocol (NTP) to keep the time, but it’s not very reliable. To get around these two problems, I wrote a small program in Go that reads the data from the MQTT broker, gives it an accurate timestamp, and then publishes it to a Kafka topic. To make things easier, we will reuse some code from our Badge Reader project. Go package main import ( "bufio" "crypto/tls" "encoding/json" "fmt" "os" "strings" "time" "github.com/confluentinc/confluent-kafka-go/kafka" MQTT "github.com/eclipse/paho.mqtt.golang" ) /* Struct to hold the data from the MQTT message */ type env struct { Sensor string `json:"sensor"` CO2 int32 `json:"co2"` Temp float32 `json:"temp"` Humid float32 `json:"humid"` Time int32 `json:"time"` } That struct should look familiar as it’s almost identical to the one we used in the Arduino code. The only difference is the Time field, which we will use to store the timestamp. We will also re-use all the code from the Badge Reader project to connect to Kafka and publish the data to the topic, so I won’t reproduce it here. 
Go

/* Read MQTT messages from the specified broker and topic and send them to the Kafka broker */
func ReadMQTTMessages(brokerUri string, username string, password string, topic string, qos byte) error {
	// Create an MQTT client options object
	opts := MQTT.NewClientOptions()

	// Set the broker URI, username, and password
	opts.AddBroker(brokerUri)
	opts.SetUsername(username)
	opts.SetPassword(password)
	opts.SetTLSConfig(&tls.Config{InsecureSkipVerify: true})

	// Create an MQTT client
	client := MQTT.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		return token.Error()
	}

	// Subscribe to the specified topic with the specified QoS level
	if token := client.Subscribe(topic, qos, func(client MQTT.Client, message MQTT.Message) {
		envData := env{}
		json.Unmarshal(message.Payload(), &envData)
		envData.Time = time.Now().UnixMilli() // add an accurate timestamp
		mess, _ := json.Marshal(envData)
		fmt.Printf("Received message: %s\n", mess)
		sendToKafka("co_2", string(mess))
	}); token.Wait() && token.Error() != nil {
		return token.Error()
	}

	// Block forever, waiting for messages to arrive
	for {
		select {}
	}
}

func main() {
	err := ReadMQTTMessages("tcp://broker-address:8883", "", "", "co2", 0)
	if err != nil {
		fmt.Printf("Error reading MQTT messages: %v\n", err)
	}
}

The main() function calls the ReadMQTTMessages() function, which connects to the MQTT broker and subscribes to the co2 topic. When a message is received, it is parsed into the env struct, the timestamp is added, and the result is published to the Kafka topic co_2.

Consuming the Data

Now that we have the data in Kafka, we can begin consuming it into StarTree Cloud. Since this project builds on the previous Badge Reader project, I'm going to reuse that StarTree Cloud instance and add a new table to it. Since I've already been sending data to the Kafka topic, once I add the source (with the right credentials), StarTree Cloud should show me some data that it has read from the topic. After we click through to the table, we can see the data that has been read from the Kafka topic. From there, we can go to the Pinot Query Editor and start writing queries.

Conclusion

In this post, we looked at how to use an Arduino device to read data from a sensor and send it to an MQTT broker. We then used a Go program to read the data from the MQTT broker, add a timestamp, and send it to a Kafka topic. Finally, we used StarTree Cloud to read the data from the Kafka topic and make it available for querying in Apache Pinot.