In this tutorial, we will explore the exciting world of MicroStream, a powerful open-source platform that enables ultrafast data processing and storage. Specifically, we will explore how MicroStream can leverage the new Jakarta Data and NoSQL specifications, which offer cutting-edge solutions for handling data in modern applications. With MicroStream, you can use these advanced features to supercharge your data processing capabilities while enjoying a simple and intuitive development experience. So whether you're a seasoned developer looking to expand your skill set or just starting in the field, this tutorial will provide you with a comprehensive guide to using MicroStream to explore the latest in data and NoSQL technology.

MicroStream is a high-performance, in-memory, NoSQL database platform for ultrafast data processing and storage. One of the critical benefits of MicroStream is its ability to achieve lightning-fast data access times, thanks to its unique architecture that eliminates the need for disk-based storage and minimizes overhead. With MicroStream, you can easily store and retrieve large amounts of data in real time, making it an ideal choice for applications that require rapid data processing and analysis, such as financial trading systems, gaming platforms, and real-time analytics engines. MicroStream provides a simple and intuitive programming model, making it easy to integrate into your existing applications and workflows.

One of the main differences between MicroStream and other databases is its focus on in-memory storage. While traditional databases rely on disk-based storage, which can lead to slower performance due to disk access times, MicroStream keeps all data in memory, allowing for much faster access. Additionally, MicroStream's unique architecture allows it to achieve excellent compression rates, further reducing the memory footprint and making it possible to store even more data in a given amount of memory. Finally, MicroStream is designed for simplicity and ease of use. It provides a developer-friendly interface and minimal dependencies, making it easy to integrate into your existing development workflow.

MicroStream Eliminates the Object-Relational Impedance Mismatch

Object-relational impedance mismatch refers to the challenge of mapping data between object-oriented programming languages and relational databases. Object-oriented languages like Java or Python represent data using objects and classes, whereas relational databases store data in tables, rows, and columns. This fundamental difference in data representation can lead to challenges in designing and implementing database systems that work well with object-oriented languages.

One of the trade-offs of the object-relational impedance mismatch is that it can be challenging to maintain consistency between the object model and the relational database. For example, if an object in an object-oriented system has attributes that are related to one another, mapping those relationships to a relational database schema may be difficult. Additionally, object-oriented systems often support inheritance, which can be tough to represent in a relational schema. While various techniques and patterns can be used to address the object-relational impedance mismatch, such as object-relational mapping (ORM) tools or database design patterns, these solutions come with their own trade-offs and may introduce additional complexity to the system.
Ultimately, achieving a balance between object-oriented programming and relational database design requires careful consideration of the specific needs and constraints of the application at hand.

MicroStream can help reduce the object-relational impedance mismatch by eliminating the need for a conversion layer between object-oriented programming languages and relational databases. Since MicroStream is an in-memory, NoSQL database platform that stores data as objects, it is a natural fit for object-oriented programming languages, eliminating the need to map between object-oriented data structures and relational database tables. With MicroStream, developers can work directly with objects in their code without worrying about the complexities of mapping data to a relational database schema (a minimal storage-level sketch is included at the end of this article). This can result in increased productivity and improved performance, as there is no additional conversion layer to introduce overhead and complexity.

Moreover, MicroStream's in-memory storage model ensures fast and efficient data access without expensive disk I/O operations. Data can be stored and retrieved quickly, allowing for rapid processing and analysis of large amounts of data. Overall, by eliminating the object-relational impedance mismatch and providing a simple, efficient, and performant way to store and access data, MicroStream lets developers focus on building great applications rather than worrying about database architecture and design. MicroStream can also deliver better performance by reducing conversions to and from objects. As the next step on our journey, let's create a MicroProfile application.

MicroStream Faces Jakarta Specifications

Now that we have covered MicroStream, let's create our microservices application using Eclipse MicroProfile. The first step is to go to the Eclipse MicroProfile Starter, where you can define the configuration of your initial project. Our application will be a simple library service using Open Liberty, running with Java 17 and MicroStream. With the project downloaded, we must add the dependency that integrates MicroProfile with MicroStream. This project dependency will later move into MicroStream itself, so its current coordinates are a temporary home for the integration:

```xml
<dependency>
    <groupId>expert.os.integration</groupId>
    <artifactId>microstream-jakarta-data</artifactId>
    <version>${microstream.data.version}</version>
</dependency>
```

The beauty of this integration is that it works with any vendor that supports MicroProfile 5 or higher; here, we're using Open Liberty. The integration enables both Jakarta persistence specifications: NoSQL and Data. Jakarta NoSQL and Jakarta Data are two related specifications developed by the Jakarta EE Working Group, aimed at providing standard APIs for working with NoSQL databases and managing data in Java-based applications.

With the project defined, let's create a Book entity. The code below shows the annotations; we currently use the Jakarta NoSQL annotations.

```java
@Entity
public class Book {

    @Id
    private String isbn;

    @Column("title")
    private String title;

    @Column("year")
    private int year;

    @JsonbCreator
    public Book(@JsonbProperty("isbn") String isbn,
                @JsonbProperty("title") String title,
                @JsonbProperty("year") int year) {
        this.isbn = isbn;
        this.title = title;
        this.year = year;
    }
}
```

The next step is the Jakarta Data part, where a single interface gives you several capabilities against this database.
```java
@Repository
public interface Library extends CrudRepository<Book, String> {
}
```

The last step is the resource, where the service is made available:

```java
@Path("/library")
@ApplicationScoped
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public class LibraryResource {

    private final Library library;

    @Inject
    public LibraryResource(Library library) {
        this.library = library;
    }

    @GET
    public List<Book> allBooks() {
        return this.library.findAll().collect(Collectors.toUnmodifiableList());
    }

    @GET
    @Path("{id}")
    public Book findById(@PathParam("id") String id) {
        return this.library.findById(id)
                .orElseThrow(() -> new WebApplicationException(Response.Status.NOT_FOUND));
    }

    @PUT
    public Book save(Book book) {
        return this.library.save(book);
    }

    @DELETE
    @Path("{id}")
    public void deleteBy(@PathParam("id") String id) {
        this.library.deleteById(id);
    }
}
```

Conclusion

Jakarta NoSQL and Jakarta Data are critical specifications that provide a standard set of APIs and tools for managing data in Java-based applications. Jakarta NoSQL enables developers to interact with various NoSQL databases using a familiar interface, while Jakarta Data provides APIs for working with data in multiple formats. These specifications help reduce the complexity and cost of application development and maintenance, enabling developers to achieve greater interoperability and portability across different NoSQL databases and data formats.

Furthermore, MicroStream provides a high-performance, in-memory NoSQL database platform that eliminates the need for a conversion layer between object-oriented programming languages and relational databases, reducing the object-relational impedance mismatch and increasing productivity and performance. By combining the power of MicroStream with the standard APIs provided by Jakarta NoSQL and Jakarta Data, developers can create robust and scalable applications that easily handle large amounts of data. The combination of MicroStream, Jakarta NoSQL, and Jakarta Data offers robust tools and specifications for managing data in modern Java-based applications. These technologies help streamline the development process and let developers focus on building great applications that meet the needs of their users. Find the source code on GitHub.
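As a final aside, and to make the earlier point about "working directly with objects" concrete, below is a minimal sketch that uses MicroStream's embedded storage engine directly, outside of the Jakarta integration shown above. The root class, storage directory, and the exact EmbeddedStorage overloads are illustrative assumptions and may need adjusting to your MicroStream version.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import one.microstream.storage.embedded.types.EmbeddedStorage;
import one.microstream.storage.embedded.types.EmbeddedStorageManager;

public class DirectObjectStorage {

    // Hypothetical root of the object graph; any plain Java object can serve as the root.
    public static class BookShelf {
        public final List<Book> books = new ArrayList<>();
    }

    public static void main(String[] args) {
        // Start the storage engine against a local directory (directory name is an assumption).
        EmbeddedStorageManager storage =
                EmbeddedStorage.start(new BookShelf(), Paths.get("book-data"));

        BookShelf root = (BookShelf) storage.root();
        root.books.add(new Book("1234567890", "Effective Java", 2018));

        // Persist the changed part of the object graph; no mapping layer or schema is involved.
        storage.store(root.books);
        storage.shutdown();
    }
}
```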
In the previous article, we discussed the emergence of Data Lakehouses as the next-generation data management solution designed to address the limitations of traditional data warehouses and Data Lakes. Data Lakehouses combine the strengths of both approaches, providing a unified platform for storing, processing, and analyzing diverse data types. This innovative approach offers the flexibility, scalability, and advanced analytics capabilities that are essential for businesses to remain competitive in today's data-driven landscape. In this article, we will delve deeper into the architecture and components of Data Lakehouses, exploring the interconnected technologies that power this groundbreaking solution.

The Pillars of Data Lakehouse Architecture

A Data Lakehouse is a comprehensive data management solution that combines the best aspects of data warehouses and Data Lakes, offering a unified platform for storing, processing, and analyzing diverse data types. The Data Lakehouse architecture is built upon a system of interconnected components that work together seamlessly to provide a robust and flexible data management solution. In this section, we discuss the fundamental components of the Data Lakehouse architecture and how they come together to create an effective and convenient solution for the end user.

At the core of the Data Lakehouse lies unified data storage. This element is designed to handle various data types and formats, including structured, semi-structured, and unstructured data. The storage layer's flexibility is enabled through storage formats such as Apache Parquet, ORC, and Delta Lake, which are compatible with distributed computing frameworks and cloud-based object storage services. By unifying data storage, Data Lakehouses allow organizations to easily ingest and analyze diverse data sources without extensive data transformation or schema modifications.

Another essential aspect of the Data Lakehouse architecture is data integration and transformation. Data Lakehouses excel at handling data ingestion and transformation from various sources by incorporating built-in connectors and support for a wide array of data integration tools, such as Apache NiFi, Kafka, or Flink. These technologies enable organizations to collect, transform, and enrich data from disparate sources, including streaming data, providing real-time insights and decision-making capabilities. By offering seamless data integration, Data Lakehouses help reduce the complexity and cost associated with traditional data integration processes.

Metadata management is a critical component of a Data Lakehouse, facilitating data discovery, understanding, and governance. Data cataloging tools like Apache Hive, Apache Atlas, or AWS Glue allow organizations to create a centralized repository of metadata about their data assets. The comprehensive view of data lineage, schema, relationships, and usage patterns provided by metadata management tools enhances data accessibility, ensures data quality, and enables better compliance with data governance policies.

Data processing and analytics capabilities are also integral to the Data Lakehouse architecture. Unified query engines like Apache Spark, Presto, or Dremio provide a single interface for querying data using SQL or other query languages, integrating batch and real-time processing for both historical and live data.
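To illustrate what such a unified query interface can look like in practice, here is a small, hedged sketch using Spark's Java API. The bucket path, view name, and columns are illustrative assumptions rather than part of any specific lakehouse product.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UnifiedQueryExample {

    public static void main(String[] args) {
        // One engine, one SQL surface over files that live in object storage.
        SparkSession spark = SparkSession.builder()
                .appName("lakehouse-unified-query")
                .getOrCreate();

        // Register a Parquet dataset as a queryable view (path and schema are illustrative).
        Dataset<Row> orders = spark.read().parquet("s3a://acme-lakehouse/orders/");
        orders.createOrReplaceTempView("orders");

        // The same SQL works whether the underlying data arrived via batch or streaming ingestion.
        spark.sql("SELECT customer_id, SUM(amount) AS total "
                + "FROM orders GROUP BY customer_id ORDER BY total DESC")
             .show(20);

        spark.stop();
    }
}
```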
Moreover, Data Lakehouses often support advanced analytics and machine learning capabilities, making it easier for organizations to derive valuable insights from their data and build data-driven applications. Finally, data governance and security are crucial in any data-driven organization. Data Lakehouses address these concerns by providing robust data quality management features like data validation, data lineage tracking, and schema enforcement. Additionally, Data Lakehouses support role-based access control, which enables organizations to define granular access permissions to different data assets, ensuring that sensitive information remains secure and compliant with regulatory requirements.

Optimizing Storage Formats for Data Lakehouses

In a Data Lakehouse architecture, the storage layer is crucial for delivering high performance, efficiency, and scalability while handling diverse data types. This section will focus on the storage formats and technologies used in Data Lakehouses and their significance in optimizing storage for better performance and cost-effectiveness.

Columnar storage formats such as Apache Parquet and ORC are key components of Data Lakehouses. By storing data column-wise, these formats offer improved query performance, enhanced data compression, and support for complex data types. This enables Data Lakehouses to handle diverse data types efficiently without requiring extensive data transformation.

Several storage solutions have been developed to cater to the unique requirements of Data Lakehouses. Delta Lake, Apache Hudi, and Apache Iceberg are three notable examples. Each of these technologies has its own set of advantages and use cases, making them essential components of modern Data Lakehouse architectures.

Delta Lake is a storage layer project explicitly designed for Data Lakehouses. Built on top of Apache Spark, it integrates seamlessly with columnar storage formats like Parquet. Delta Lake provides ACID transaction support, schema enforcement and evolution, and time travel features, which enhance reliability and consistency in data storage. Apache Hudi is another storage solution that brings real-time data processing capabilities to Data Lakehouses. Hudi offers features such as incremental data processing, upsert support, and point-in-time querying, which help organizations manage large-scale datasets and handle real-time data efficiently. Apache Iceberg is a table format for large, slow-moving datasets in Data Lakehouses. Iceberg focuses on providing better performance, atomic commits, and schema evolution capabilities. It achieves this through a novel table layout that uses metadata more effectively, allowing for faster queries and improved data management.

The intricacies of Delta Lake, Apache Hudi, and Apache Iceberg, as well as their unique advantages, are fascinating topics on their own. In one of our upcoming articles, we will delve deeper into these technologies, providing a comprehensive understanding of their role in Data Lakehouse architecture. Optimizing storage formats for Data Lakehouses involves leveraging columnar formats and adopting storage solutions like Delta Lake, Apache Hudi, and Apache Iceberg. These technologies work together to create an efficient and high-performance storage layer that can handle diverse data types and accommodate the growing data needs of modern organizations.
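As a concrete illustration of how such a storage layer is used from a processing engine, here is a hedged sketch that writes and reads a Delta table through Spark's Java API. It assumes the delta-spark dependency is on the classpath; the paths, column names, and configuration values are illustrative.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DeltaTableExample {

    public static void main(String[] args) {
        // Delta Lake plugs into Spark via its SQL extensions and catalog implementation.
        SparkSession spark = SparkSession.builder()
                .appName("lakehouse-delta-write")
                .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                .config("spark.sql.catalog.spark_catalog",
                        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                .getOrCreate();

        // Ingest raw JSON events and land them as an ACID Delta table,
        // partitioned by date for cheaper scans (paths and columns are illustrative).
        Dataset<Row> events = spark.read().json("s3a://acme-lakehouse/raw/events/");
        events.write()
              .format("delta")
              .mode(SaveMode.Append)
              .partitionBy("event_date")
              .save("s3a://acme-lakehouse/tables/events");

        // Time travel: read an earlier version of the same table.
        Dataset<Row> firstVersion = spark.read()
                .format("delta")
                .option("versionAsOf", 0)
                .load("s3a://acme-lakehouse/tables/events");
        firstVersion.show(10);

        spark.stop();
    }
}
```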
Embracing Scalable and Distributed Processing in Data Lakehouses

Data Lakehouse architecture is designed to address modern organizations' growing data processing needs. By leveraging distributed processing frameworks and techniques, Data Lakehouses can ensure optimal performance, scalability, and cost-effectiveness.

Apache Spark, a powerful open-source distributed computing framework, is a foundational technology in Data Lakehouses. Spark efficiently processes large volumes of data and offers built-in support for advanced analytics and machine learning workloads. By supporting various programming languages, Spark serves as a versatile choice for organizations implementing distributed processing. Distributed processing frameworks like Spark enable parallel execution of tasks, which is essential for handling massive datasets and complex analytics workloads. Data partitioning strategies divide data into logical partitions, optimizing query performance and reducing the amount of data read during processing.

Resource management and scheduling are crucial for distributed processing in Data Lakehouses. Tools like Apache Mesos, Kubernetes, and Hadoop YARN orchestrate and manage resources across a distributed processing environment, ensuring that tasks are executed efficiently and resources are allocated optimally. In-memory processing techniques significantly improve the performance of analytics and machine learning tasks by caching data in memory instead of reading it from disk. This reduces latency and results in faster query execution and better overall performance.

Data Lakehouses embrace scalable and distributed processing technologies like Apache Spark, partitioning strategies, resource management tools, and in-memory processing techniques. These components work together to ensure Data Lakehouses can handle the ever-growing data processing demands of modern organizations.

Harnessing Advanced Analytics and Machine Learning in Data Lakehouses

Data Lakehouse architectures facilitate advanced analytics and machine learning capabilities, enabling organizations to derive deeper insights and drive data-driven decision-making. This section discusses the various components and techniques employed by Data Lakehouses to support these essential capabilities.

First, the seamless integration of diverse data types in Data Lakehouses allows analysts and data scientists to perform complex analytics on a wide range of structured and unstructured data. This integration empowers organizations to uncover hidden patterns and trends that would otherwise be difficult to discern using traditional data management systems. Second, the use of distributed processing frameworks such as Apache Spark, which is equipped with built-in libraries for machine learning and graph processing, enables Data Lakehouses to support advanced analytics workloads. By leveraging these powerful tools, Data Lakehouses allow data scientists and analysts to build and deploy machine learning models and perform sophisticated analyses on large datasets. Additionally, Data Lakehouses can be integrated with various specialized analytics tools and platforms. For example, integrating Jupyter Notebooks and other interactive environments provides a convenient way for data scientists and analysts to explore data, develop models, and share their findings with other stakeholders.
To further enhance the capabilities of Data Lakehouses, machine learning platforms like TensorFlow, PyTorch, and H2O.ai can be integrated to support the development and deployment of custom machine learning models. These platforms provide advanced functionality and flexibility, enabling organizations to tailor their analytics and machine learning efforts to their specific needs. Lastly, real-time analytics and stream processing play an important role in Data Lakehouses. Technologies like Apache Kafka and Apache Flink enable organizations to ingest and process real-time data streams, allowing them to respond more quickly to market changes, customer needs, and other emerging trends.

Ensuring Robust Data Governance and Security in Data Lakehouses

Data Lakehouses prioritize data governance and security, addressing the concerns of organizations regarding data privacy, regulatory compliance, and data quality. This section delves into the various components and techniques that facilitate robust data governance and security in Data Lakehouses.

Data cataloging and metadata management tools play a crucial role in establishing effective data governance within a Data Lakehouse. Tools such as Apache Atlas, AWS Glue, and Apache Hive provide centralized repositories for metadata, enabling organizations to track data lineage, discover data assets, and enforce data governance policies.

Fine-grained access control is essential for maintaining data privacy and security in Data Lakehouses. Role-based access control (RBAC) and attribute-based access control (ABAC) mechanisms allow organizations to define and enforce user access permissions, ensuring that sensitive data remains secure and available only to authorized users.

Data encryption is another key component of Data Lakehouse security. By encrypting data both at rest and in transit, Data Lakehouses ensure that sensitive information remains protected against unauthorized access and potential breaches. Integration with key management systems like AWS Key Management Service (KMS) or Azure Key Vault further enhances security by providing centralized management of encryption keys.

Data Lakehouses also incorporate data quality and validation mechanisms to maintain the integrity and reliability of the data. Data validation tools like Great Expectations, data profiling techniques, and automated data quality checks help identify and address data inconsistencies, inaccuracies, and other issues that may impact the overall trustworthiness of the data.

Auditing and monitoring are essential for ensuring compliance with data protection regulations and maintaining visibility into Data Lakehouse operations. Data Lakehouses can be integrated with logging and monitoring solutions like Elasticsearch, Logstash, and Kibana (the ELK Stack) or AWS CloudTrail, providing organizations with a comprehensive view of their data management activities and facilitating effective incident response. By prioritizing data privacy, regulatory compliance, and data quality, Data Lakehouses enable organizations to confidently manage their data assets and drive data-driven decision-making in a secure and compliant manner.

Embracing the Data Lakehouse Revolution

The Data Lakehouse architecture is a game-changing approach to data management, offering organizations the scalability, flexibility, and advanced analytics capabilities necessary to thrive in the era of big data.
By combining the strengths of traditional data warehouses and Data Lakes, Data Lakehouses empower businesses to harness the full potential of their data, driving innovation and informed decision-making.

In this article, we have explored the key components and technologies that underpin the Data Lakehouse architecture, from data ingestion and storage to processing, analytics, and data governance. By understanding the various elements of a Data Lakehouse and how they work together, organizations can better appreciate the value that this innovative approach brings to their data management and analytics initiatives.

As we continue our series on Data Lakehouses, we will delve deeper into various aspects of this revolutionary data management solution. In upcoming articles, we will cover topics such as the comparison of Delta Lake, Apache Hudi, and Apache Iceberg – three storage solutions that are integral to Data Lakehouse implementations – as well as best practices for Data Lakehouse design, implementation, and operation. Additionally, we will discuss the technologies and tools that underpin Data Lakehouse architecture, examine real-world use cases that showcase the transformative power of Data Lakehouses, and explore the intricacies and potential of this groundbreaking approach. Stay tuned for more insights and discoveries as we navigate the exciting journey of Data Lakehouse architectures together!
Monitoring data stream applications is a critical component of enterprise operations, as it allows organizations to ensure that their applications are functioning optimally and delivering value to their customers. In this article, we will discuss in detail the importance of monitoring data stream applications and why it is critical for enterprises.

Data stream applications are those that handle large volumes of data in real time, such as those used in financial trading, social media analytics, or IoT (Internet of Things) devices. These applications are critical to the success of many businesses, as they allow organizations to make quick decisions based on real-time data. However, these applications can be complex, and any issues or downtime can have significant consequences.

By monitoring data stream applications, enterprises can proactively identify and address issues before they impact the business. This includes identifying performance issues, detecting errors and anomalies, and ensuring that the application is meeting its service level agreements (SLAs). Monitoring also allows organizations to track key metrics, such as data throughput, latency, and error rates, and to make adjustments to optimize the application's performance. For a reference data stream system, see "Unlocking the Potential of IoT Applications."

In addition to these benefits, monitoring data stream applications is critical for ensuring regulatory compliance. Many industries, such as finance and healthcare, have strict regulations governing data privacy and security. By monitoring these applications, organizations can ensure that they are meeting these regulatory requirements and avoid costly fines and legal penalties. Another key benefit of monitoring data stream applications is that it allows organizations to optimize their infrastructure and resource usage. By monitoring resource utilization, enterprises can identify areas of inefficiency, such as overprovisioned resources or bottlenecks, and make adjustments to improve performance and reduce costs.

Several tools are commonly used to monitor data stream applications:

Prometheus: Prometheus is an open-source monitoring system designed for collecting and querying time-series data. It can be used to monitor metrics from a variety of sources, including data stream applications. Prometheus provides a range of tools for data visualization and alerting and integrates with a variety of popular tools and platforms.

Splunk: Splunk is a popular data analytics and monitoring platform that can be used to monitor data stream applications. It provides real-time monitoring and alerting and can be used to track metrics such as data volume, latency, and error rates. Splunk also includes a range of machine learning and data analysis tools that can be used to identify anomalies and optimize performance.

Amazon CloudWatch: Amazon CloudWatch is a monitoring and management service offered by Amazon Web Services (AWS). It can be used to monitor a variety of AWS resources, including data stream applications running on AWS. CloudWatch provides a range of metrics, logs, and alerts and can be integrated with other AWS tools, such as AWS Lambda. If your data streams run on AWS, CloudWatch is the best option.

DataDog: DataDog is a cloud-based monitoring and analytics platform that can be used to monitor data stream applications. It provides real-time monitoring and alerting and can be used to track a wide range of metrics, including data volume, latency, and error rates.
DataDog also includes a range of visualization and collaboration tools that can be used to improve communication and collaboration across teams.

Finally, monitoring data stream applications is critical for maintaining customer satisfaction. In today's fast-paced, digital world, customers expect instant responses and seamless experiences. Any issues or downtime can have a significant impact on customer satisfaction and brand reputation. By proactively monitoring these applications, organizations can ensure that their customers are receiving the expected level of service and address any issues quickly and efficiently.

In conclusion, monitoring data stream applications is critical for enterprise success. It allows organizations to proactively identify and address issues, ensure regulatory compliance, optimize resource utilization, and maintain customer satisfaction. By investing in monitoring tools and processes, enterprises can ensure that their applications are delivering value to their customers and stay ahead of the competition in today's fast-paced digital landscape.
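To make the earlier discussion of key metrics (throughput, latency, and error rates) concrete, here is a minimal sketch that instruments a stream-processing loop with the Prometheus Java simpleclient. The metric names, port, and processing step are illustrative assumptions, not part of any particular product's API.

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

public class StreamMetrics {

    // Throughput: total records processed by the stream application.
    static final Counter RECORDS = Counter.build()
            .name("stream_records_processed_total")
            .help("Total records processed.")
            .register();

    // Error rate: total records that failed processing.
    static final Counter ERRORS = Counter.build()
            .name("stream_records_failed_total")
            .help("Total records that failed processing.")
            .register();

    // Latency: per-record processing time in seconds.
    static final Histogram LATENCY = Histogram.build()
            .name("stream_record_processing_seconds")
            .help("Per-record processing latency.")
            .register();

    public static void main(String[] args) throws Exception {
        // Expose /metrics for Prometheus to scrape (port is illustrative).
        HTTPServer server = new HTTPServer(9400);

        while (true) {
            Histogram.Timer timer = LATENCY.startTimer();
            try {
                processNextRecord();   // hypothetical stream-processing step
                RECORDS.inc();
            } catch (Exception e) {
                ERRORS.inc();
            } finally {
                timer.observeDuration();
            }
        }
    }

    private static void processNextRecord() {
        // Placeholder for the actual stream-processing logic.
    }
}
```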
Today I have one of those moments where I am absolutely sure that if I do not write this down, I will forget how to do it next time. For one of the projects I am working on, we need to do SHACL validation of RDF data that will be stored in Ontotext GraphDB. Here are the 10 things I needed to learn in doing this. Some of these are rather obvious, but some were less than obvious to me.

Number 1: To be able to do SHACL validation, your repository needs to be configured for SHACL when you create the repository. This cannot be done after the fact.

Number 2: It seems to be better to import your ontology (or ontologies) and data into different graphs. This is useful when you want to re-import your ontology (or ontologies) or your data, because then you can replace a specific named graph completely. This was very useful for me while prototyping (screenshot omitted here).

Number 3: SHACL shapes are imported into the named graph http://rdf4j.org/schema/rdf4j#SHACLShapeGraph by default. At configuration time, you can provide a different named graph or graphs for your SHACL shapes.

Number 4: To find the named graphs in your repository, you can run the following SPARQL query:

```sparql
select distinct ?g where {
  graph ?g { ?s ?p ?o }
}
```

You can then query a specific named graph as follows:

```sparql
select * from <myNamedGraph> where {
  ?s ?p ?o .
}
```

Number 5: However, getting the named graphs does not return the SHACL named graph. On Stack Overflow, someone suggested that the SHACL shapes can be retrieved using:

http://address:7200/repositories/myRepo/rdf-graphs/service?graph=http://rdf4j.org/schema/rdf4j#SHACLShapeGraph

However, this did not work for me. Instead, the following code worked reliably:

```java
import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.model.impl.LinkedHashModel;
import org.eclipse.rdf4j.model.vocabulary.RDF4J;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.Rio;
import org.eclipse.rdf4j.rio.WriterConfig;
import org.eclipse.rdf4j.rio.helpers.BasicWriterSettings;

import java.util.stream.Collectors;

public class RetrieveShaclShapes {
    public static void main(String[] args) {
        String address = args[0];        /* i.e. http://localhost/ */
        String repositoryName = args[1]; /* i.e. myRepo */
        HTTPRepository repository = new HTTPRepository(address, repositoryName);
        try (RepositoryConnection connection = repository.getConnection()) {
            Model statementsCollector = new LinkedHashModel(
                connection.getStatements(null, null, null, RDF4J.SHACL_SHAPE_GRAPH)
                    .stream()
                    .collect(Collectors.toList()));
            Rio.write(statementsCollector, System.out, RDFFormat.TURTLE,
                new WriterConfig().set(BasicWriterSettings.INLINE_BLANK_NODES, true));
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }
}
```

...using the following dependency in the pom.xml:

```xml
<dependency>
    <groupId>org.eclipse.rdf4j</groupId>
    <artifactId>rdf4j-client</artifactId>
    <version>4.2.3</version>
    <type>pom</type>
</dependency>
```

Number 6: Getting the above code to run was not obvious, since I opted to use a fat jar. I encountered an "org.eclipse.rdf4j.rio.UnsupportedRDFormatException: Did not recognise RDF format object" error. RDF4J uses the Java Service Provider Interface (SPI), which relies on a file in the META-INF/services directory of the jar to register parser implementations. The maven-assembly-plugin I used to generate the fat jar causes different jars to overwrite META-INF/services, thereby losing registration information.
The solution is to use the maven-shade-plugin, which merges META-INF/services files rather than overwriting them. In your pom, you need to add the following to your plugins configuration:

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.4.1</version>
    <executions>
        <execution>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>
```

You can avoid this problem altogether by using the separate jars rather than a single fat jar.

Number 7: Importing a new shape into the SHACL shape graph will cause the new shape information to be appended. It will not replace the existing graph, even when you have both the "Enable replacement of existing data" and "I understand that data in the replaced graphs will be cleared before importing new data." options enabled (screenshot omitted here). To replace the SHACL named graph, you need to clear it explicitly by running the following SPARQL command:

```sparql
clear graph <http://rdf4j.org/schema/rdf4j#SHACLShapeGraph>
```

I found it easier to update the SHACL shapes programmatically. Note that I made use of the default SHACL named graph:

```java
import org.eclipse.rdf4j.model.vocabulary.RDF4J;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;
import org.eclipse.rdf4j.rio.RDFFormat;

import java.io.File;

public class UpdateShacl {
    public static void main(String[] args) {
        String address = args[0];        /* i.e. http://localhost/ */
        String repositoryName = args[1]; /* i.e. myRepo */
        String shacl = args[2];
        File shaclFile = new File(shacl);
        HTTPRepository repository = new HTTPRepository(address, repositoryName);
        try (RepositoryConnection connection = repository.getConnection()) {
            connection.begin();
            connection.clear(RDF4J.SHACL_SHAPE_GRAPH);
            connection.add(shaclFile, RDFFormat.TURTLE, RDF4J.SHACL_SHAPE_GRAPH);
            connection.commit();
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }
}
```

Number 8: Programmatically, you can delete a named graph using this code and the same Maven dependency as above:

```java
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

public class ClearGraph {
    public static void main(String[] args) {
        String address = args[0];        /* i.e. http://localhost/ */
        String repositoryName = args[1]; /* i.e. myRepo */
        String graph = args[2];          /* i.e. http://rdf4j.org/schema/rdf4j#SHACLShapeGraph */
        ValueFactory valueFactory = SimpleValueFactory.getInstance();
        IRI graphIRI = valueFactory.createIRI(graph);
        HTTPRepository repository = new HTTPRepository(address, repositoryName);
        try (RepositoryConnection connection = repository.getConnection()) {
            connection.begin();
            connection.clear(graphIRI);
            connection.commit();
        }
    }
}
```

Number 9: If you update the shape graph with constraints that are violated by your existing data, you will need to fix your data first before you can upload your new shape definition.

Number 10: When uploading SHACL shapes, unsupported features fail silently. I had the idea to add human-readable information to the shape definition to make it easier for users to understand validation errors. Unfortunately, "sh:name" and "sh:description" are not supported by GraphDB versions 10.0.2 and 10.2.0.
Moreover, it fails silently. In the Workbench, it shows that the shapes loaded successfully (screenshot omitted here). In the logs, however, I noticed warnings about the unsupported properties. Since these are logged merely as warnings, I expected my shape to have loaded fine, except that triples pertaining to "sh:name" and "sh:description" would be skipped. However, my shape did not load at all. You can find the list of supported SHACL features here.

Conclusion

This post may come across as being critical of GraphDB. However, that is not the intention. I think it is rather a case of growing pains that are still being experienced around SHACL (and ShEx, I suspect) adoption. Resources that have been helpful for me in resolving issues are the GraphDB documentation and RDF4J, on which GraphDB is built. Finally, a short code sketch showing how a validation failure can surface when committing data programmatically is included below.
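Below is the sketch mentioned in the conclusion: a minimal example of committing instance data against a SHACL-enabled GraphDB repository so that a constraint violation surfaces at commit time. It reuses the HTTPRepository pattern from the snippets above; how the validation report is exposed (plain error message versus an attached report) varies between GraphDB versions, so the exception handling here is illustrative.

```java
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.RepositoryException;
import org.eclipse.rdf4j.repository.http.HTTPRepository;
import org.eclipse.rdf4j.rio.RDFFormat;

import java.io.File;

public class ValidateOnCommit {
    public static void main(String[] args) {
        String address = args[0];        /* i.e. http://localhost/ */
        String repositoryName = args[1]; /* i.e. myRepo */
        File dataFile = new File(args[2]);   /* Turtle file with instance data */

        HTTPRepository repository = new HTTPRepository(address, repositoryName);
        try (RepositoryConnection connection = repository.getConnection()) {
            connection.begin();
            // Data goes into the default data graph; shapes already live in the SHACL shape graph.
            connection.add(dataFile, RDFFormat.TURTLE);
            // If the data violates the loaded shapes, the commit is rejected by the server.
            connection.commit();
            System.out.println("Data committed: no SHACL violations reported.");
        } catch (RepositoryException e) {
            // The constraint violation is reported here; the exact payload depends on the GraphDB version.
            System.err.println("Commit failed, likely a SHACL violation: " + e.getMessage());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```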
First, the JPA Criteria backend is shown: how to select the results of a request with logically nested conditions, such as a request with these parameters: "(quarters='Q1' or quarters='Q2') and (concept like '%Income%' or (concept like '%revenue%' and value > 1000000))". Then the result is displayed in an Angular frontend with an Angular component tree that shows a table component in each leaf, and we look at how to make such a tree perform well.

Flexible Queries in the Backend

The FinancialDataController receives the post request. It is mapped into a SymbolFinancialsQueryParamsDto and sent to the FinancialDataService, which provides the transaction wrapper. The SymbolFinancialsRepositoryBean creates the query in the 'findSymbolFinancials(…)' method:

```java
@Override
public List<SymbolFinancials> findSymbolFinancials(SymbolFinancialsQueryParamsDto symbolFinancialsQueryParams) {
    List<SymbolFinancials> result = List.of();
    // ... return the result if only financial elements are queried ...
    final CriteriaQuery<SymbolFinancials> createQuery = this.entityManager.getCriteriaBuilder()
            .createQuery(SymbolFinancials.class);
    final Root<SymbolFinancials> root = createQuery.from(SymbolFinancials.class);
    final List<Predicate> predicates = createSymbolFinancialsPredicates(symbolFinancialsQueryParams, root);
    root.fetch(FINANCIAL_ELEMENTS);
    Path<FinancialElement> fePath = root.get(FINANCIAL_ELEMENTS);
    this.createFinancialElementClauses(symbolFinancialsQueryParams.getFinancialElementParams(), fePath, predicates);
    if (!predicates.isEmpty()) {
        createQuery.where(predicates.toArray(new Predicate[0])).distinct(true)
                .orderBy(this.entityManager.getCriteriaBuilder().asc(root.get(SYMBOL)));
    } else {
        return new LinkedList<>();
    }
    LocalTime start1 = LocalTime.now();
    result = this.entityManager.createQuery(createQuery)
            .getResultStream().map(mySymbolFinancials -> removeDublicates(mySymbolFinancials))
            .limit(100).collect(Collectors.toList());
    LOGGER.info("Query1: {} ms", Duration.between(start1, LocalTime.now()).toMillis());
    return result;
}

private List<Predicate> createSymbolFinancialsPredicates(
        SymbolFinancialsQueryParamsDto symbolFinancialsQueryParams, final Root<SymbolFinancials> root) {
    final List<Predicate> predicates = new ArrayList<>();
    if (symbolFinancialsQueryParams.getSymbol() != null && !symbolFinancialsQueryParams.getSymbol().isBlank()) {
        predicates.add(this.entityManager.getCriteriaBuilder().equal(
                this.entityManager.getCriteriaBuilder().lower(root.get(SYMBOL)),
                symbolFinancialsQueryParams.getSymbol().trim().toLowerCase()));
    }
    if (symbolFinancialsQueryParams.getQuarters() != null
            && !symbolFinancialsQueryParams.getQuarters().isEmpty()) {
        predicates.add(this.entityManager.getCriteriaBuilder().in(root.get(QUARTER))
                .value(symbolFinancialsQueryParams.getQuarters()));
    }
    if (symbolFinancialsQueryParams.getYearFilter() != null
            && symbolFinancialsQueryParams.getYearFilter().getValue() != null
            && 0 >= BigDecimal.valueOf(1800).compareTo(symbolFinancialsQueryParams.getYearFilter().getValue())
            && symbolFinancialsQueryParams.getYearFilter().getOperation() != null) {
        switch (symbolFinancialsQueryParams.getYearFilter().getOperation()) {
        case SmallerEqual -> predicates.add(this.entityManager.getCriteriaBuilder()
                .lessThanOrEqualTo(root.get(FISCAL_YEAR), symbolFinancialsQueryParams.getYearFilter().getValue()));
        case LargerEqual -> predicates.add(this.entityManager.getCriteriaBuilder()
                .greaterThanOrEqualTo(root.get(FISCAL_YEAR), symbolFinancialsQueryParams.getYearFilter().getValue()));
        case Equal -> predicates.add(this.entityManager.getCriteriaBuilder()
                .equal(root.get(FISCAL_YEAR), symbolFinancialsQueryParams.getYearFilter().getValue()));
        }
    }
    return predicates;
}
```

The method gets the SymbolFinancialsQueryParamsDto with the query parameters. It creates a CriteriaQuery and a Root object for the SymbolFinancials entity. Then the method 'createSymbolFinancialsPredicates(…)' is used to create the query predicates for the SymbolFinancials entity. First, the symbol parameter is checked, and a Predicate is created that adds a where clause for an equal lowercase match. Then the quarter parameters are checked, and a Predicate is created that adds a where clause with the 'in' operator to match one of the quarters. After the check of the 'YearFilter' with a check of operator and value (including a range check), the matching predicate is created; the supported operator values are implemented in the switch over the Operation enum, and for a matching enum value the corresponding predicate is created and added. The created predicates are then returned to the 'findSymbolFinancials(…)' method.

Then the FinancialElement entities of the SymbolFinancials entity are joined and fetched, and a path for the FinancialElement entities is created. The method 'createFinancialElementClauses(…)' creates the predicates for the nested FinancialElement entities and is described below. Then it is checked whether the predicates are empty; if not, a where clause is created from the 'predicates' list, and a 'distinct' and an 'order by' on the symbol property are added. Then the query is executed and timed, with a limit on the result stream to protect the database resources. The stream is filtered for duplicate FinancialElement entries: some only look like duplicates because they are attached to several different 'FinancialElementType' entities, while others are duplicates in the imported data file.

Nested Conditions in the Query

To support nested conditions for the FinancialElement entities in the backend, the method 'createFinancialElementClauses(…)' is used:

```java
private <T> void createFinancialElementClauses(List<FinancialElementParamDto> financialElementParamDtos,
        final Path<FinancialElement> fePath, final List<Predicate> predicates) {
    record SubTerm(DataHelper.Operation operation, Collection<Predicate> subTerms) {}
    final LinkedBlockingQueue<SubTerm> subTermCollection = new LinkedBlockingQueue<>();
    final Collection<Predicate> result = new LinkedList<>();
    if (financialElementParamDtos != null) {
        financialElementParamDtos.forEach(myDto -> {
            switch (myDto.getTermType()) {
            case TermStart -> {
                try {
                    subTermCollection.put(new SubTerm(myDto.getOperation(), new ArrayList<>()));
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            }
            case Query -> {
                Collection<Predicate> localResult = subTermCollection.isEmpty()
                        ? result : subTermCollection.peek().subTerms();
                Optional<Predicate> conceptClauseOpt = financialElementConceptClause(fePath, myDto);
                Optional<Predicate> valueClauseOpt = financialElementValueClause(fePath, myDto);
                List<Predicate> myPredicates = List.of(conceptClauseOpt, valueClauseOpt).stream()
                        .filter(Optional::isPresent).map(Optional::get).toList();
                if (myPredicates.size() > 1) {
                    localResult.add(
                            this.entityManager.getCriteriaBuilder().and(myPredicates.toArray(new Predicate[0])));
                } else {
                    localResult.addAll(myPredicates);
                }
            }
            case TermEnd -> {
                if (subTermCollection.isEmpty()) {
                    throw new RuntimeException(String.format("subPredicates: %d", subTermCollection.size()));
                }
                SubTerm subTermColl = subTermCollection.poll();
                Collection<Predicate> myPredicates = subTermColl.subTerms();
                Collection<Predicate> baseTermCollection = subTermCollection.peek() == null
                        ? result : subTermCollection.peek().subTerms();
                DataHelper.Operation operation = subTermColl.operation();
                Collection<Predicate> resultPredicates = operation == null ? myPredicates : switch (operation) {
                    case And -> List.of(this.entityManager.getCriteriaBuilder()
                            .and(myPredicates.toArray(new Predicate[0])));
                    case AndNot -> List.of(this.entityManager.getCriteriaBuilder()
                            .not(this.entityManager.getCriteriaBuilder().and(myPredicates.toArray(new Predicate[0]))));
                    case Or -> List.of(this.entityManager.getCriteriaBuilder()
                            .or(myPredicates.toArray(new Predicate[0])));
                    case OrNot -> List.of(this.entityManager.getCriteriaBuilder()
                            .not(this.entityManager.getCriteriaBuilder().or(myPredicates.toArray(new Predicate[0]))));
                };
                baseTermCollection.addAll(resultPredicates);
            }
            }
        });
    }
    // validate terms
    if (!subTermCollection.isEmpty()) {
        throw new RuntimeException(String.format("subPredicates: %d", subTermCollection.size()));
    }
    predicates.addAll(result);
}
```

First, the SubTerm record is declared, and the FIFO queue that holds the nested terms is created. Then the 'financialElementParamDtos' are checked and iterated. A switch on the TermType enum handles 'TermStart,' 'Query,' and 'TermEnd.' The TermStart case inserts a new SubTerm record into the FIFO queue with the term operation ('and,' 'or,' …); the InterruptedException is not relevant in this use case. The 'Query' case checks whether there are entries in the 'subTermCollection' and uses the current one or the result list. Repository helper methods create the optional predicates for the concept and value clauses, and a list with the values of the Optionals is created. If that list has more than one entry, an 'and' predicate is created from its values and added to the local result; otherwise, the single-element list is added to the local result. The 'TermEnd' case checks whether the 'subTermCollection' is empty to verify the term structure. Then the SubTerm is polled from the FIFO queue, and its predicates are read. The base term collection is set from the subterm collection or the result list, and the operation is read from the SubTerm record. A switch then handles the operation enum entry of the matching 'TermStart' SubTerm record; its cases create the matching logical predicates to which the SubTerm predicates/clauses are added, and the resulting predicates are added to the base term collection. At the end, the subterm queue is checked again to validate the term structure, and the result predicates are added to the predicate list.

Conclusion Backend

Code that has to support flexible queries will not look trivial.
The support of the JPA Criteria API is good, but studying the documentation in depth first would have saved some time. The predicates are wrapped differently (wrap them in 'and,' 'or,' … predicates) compared to a JPQL query. The Criteria API works well and is very flexible once it is understood.

Angular Frontend

The frontend shows a tree of symbols, then years, and then a table of concepts with values and more. The template for the 'result-tree' component looks like this:

```html
<mat-tree [dataSource]="dataSource" [treeControl]="treeControl" class="example-tree">
  <!-- This is the tree node template for leaf nodes -->
  <!-- There is inline padding applied to this node using styles.
       This padding value depends on the mat-icon-button width. -->
  <mat-tree-node *matTreeNodeDef="let node" matTreeNodeToggle>
    <div *ngIf="!node?.finanicalElementExts">{{ node.name }}</div>
    <mat-table *ngIf="node.isOpen" [dataSource]="node.finanicalElementExts" class="mytable">
      <ng-container matColumnDef="concept">
        <mat-header-cell *matHeaderCellDef i18n="@@queryResultsConcept">Concept</mat-header-cell>
        <mat-cell *matCellDef="let element" matTooltip="{{ element.concept }}">
          {{ element.concept }}
        </mat-cell>
      </ng-container>
      <ng-container matColumnDef="quarter">
        <mat-header-cell *matHeaderCellDef i18n="@@queryResultsQuarter">Quarter</mat-header-cell>
        <mat-cell *matCellDef="let element"> {{ element.quarter }} </mat-cell>
      </ng-container>
      <ng-container matColumnDef="currency">
        <mat-header-cell *matHeaderCellDef i18n="@@queryResultsCurrency">Currency</mat-header-cell>
        <mat-cell *matCellDef="let element"> {{ element.currency }} </mat-cell>
      </ng-container>
      <ng-container matColumnDef="value">
        <mat-header-cell *matHeaderCellDef i18n="@@queryResultsValue">Value</mat-header-cell>
        <mat-cell *matCellDef="let element"> {{ element.value }} </mat-cell>
      </ng-container>
      <mat-header-row *matHeaderRowDef="displayedColumns"></mat-header-row>
      <mat-row *matRowDef="let row; columns: displayedColumns"></mat-row>
    </mat-table>
  </mat-tree-node>
  <!-- This is the tree node template for expandable nodes -->
  <mat-nested-tree-node *matTreeNodeDef="let node; when: hasChild">
    <div class="mat-tree-node">
      <button mat-icon-button (click)="toggleNode(node)" [attr.aria-label]="'Toggle ' + node.name">
        <mat-icon class="mat-icon-rtl-mirror">
          {{ treeControl.isExpanded(node) ? "expand_more" : "chevron_right" }}
        </mat-icon>
      </button>
      {{ node.name }}
    </div>
    <!-- There is inline padding applied to this div using styles.
         This padding value depends on the mat-icon-button width. -->
    <div [class.example-tree-invisible]="!treeControl.isExpanded(node)" role="group">
      <ng-container matTreeNodeOutlet></ng-container>
    </div>
  </mat-nested-tree-node>
</mat-tree>
```

First, a 'mat-tree' of Angular Material components is constructed using the 'dataSource' and 'treeControl.' The 'mat-tree-node' is used as a tree leaf and shows the Angular Material table if the flag 'node.isOpen' is set, with the node values as the table's data source. That means the table is only created once the year node is opened, which is what keeps the tree component performant in the browser. The columns are created with 'mat-header-cell' for the header and 'mat-cell' for the content. The 'mat-row' and 'mat-header-row' tags define the displayed columns of the table. The 'mat-nested-tree-node' is used to open/display the tree branches, such as the symbols and years, and is shown if the node is not a tree leaf.
The button displays the chevrons and triggers the opening and closing of the children with the method 'toggleNode(…)'. The child nodes are displayed in the 'ng-container matTreeNodeOutlet.' The visibility is controlled with the 'example-tree-invisible' CSS class. The result-tree component looks like this:

```typescript
@Component({
  selector: "app-result-tree",
  templateUrl: "./result-tree.component.html",
  styleUrls: ["./result-tree.component.scss"],
})
export class ResultTreeComponent {
  private _symbolFinancials: SymbolFinancials[] = [];
  protected treeControl = new NestedTreeControl<ElementNode>((node) => node.children);
  protected dataSource = new MatTreeNestedDataSource<ElementNode>();
  protected displayedColumns: string[] = ["concept", "value", "currency", "quarter"];
  protected hasChild = (_: number, node: ElementNode) =>
    !!node.children && node.children.length > 0;

  toggleNode(node: ElementNode): void {
    this.treeControl.toggle(node);
    node?.children?.forEach((childNode) => {
      if (!childNode || !childNode?.children?.length) {
        const myByElements = childNode as ByElements;
        myByElements.isOpen = this.treeControl.isExpanded(node);
      }
    });
  }

  get symbolFinancials(): SymbolFinancials[] {
    return this._symbolFinancials;
  }

  @Input()
  set symbolFinancials(symbolFinancials: SymbolFinancials[]) {
    this._symbolFinancials = symbolFinancials;
    //console.log(symbolFinancials);
    this.dataSource.data = this.createElementNodeTree(symbolFinancials);
  }

  private createElementNodeTree(symbolFinancials: SymbolFinancials[]): BySymbolElements[] {
    const bySymbolElementExtsMap = FinancialsDataUtils.groupByKey<FinancialElementExt, string>(
      FinancialsDataUtils.toFinancialElementsExt(symbolFinancials),
      "symbol"
    );
    //console.log(bySymbolElementExtsMap);
    const myBySymbolElements: BySymbolElements[] = [];
    bySymbolElementExtsMap.forEach((value, key) => {
      const byYearElementsMap = FinancialsDataUtils.groupByKey<FinancialElementExt, number>(
        value,
        "year"
      );
      const byYearElements: ByYearElements[] = [];
      byYearElementsMap.forEach((value, key) => {
        const myByElements = {
          name: "Elements",
          isOpen: false,
          finanicalElementExts: value,
        } as ByElements;
        const myByYearElement = {
          year: key,
          name: key.toString(),
          children: [myByElements],
          byElements: [myByElements],
        } as ByYearElements;
        byYearElements.push(myByYearElement);
      });
      const myBySymbolElement = {
        name: key,
        symbol: key,
        byYearElements: byYearElements,
        children: byYearElements,
      } as BySymbolElements;
      myBySymbolElements.push(myBySymbolElement);
    });
    console.log(myBySymbolElements);
    return myBySymbolElements;
  }
}
```

The 'ResultTreeComponent' has the treeControl function to toggle the tree children. The 'displayedColumns' array contains the names of the columns of the Angular Material table. The 'hasChild' function is used to check for child nodes. The 'toggleNode(…)' method uses the toggle function to toggle the child nodes. Then the child nodes are checked to see whether the tree leaf nodes are opened, and the 'isOpen' property of the leaf nodes is set accordingly; that triggers the rendering of the tables in the opened tree leaf nodes. The parent component's symbolFinancials are injected via the 'set symbolFinancials(…)' setter. They are stored in the '_symbolFinancials' property, and the 'createElementNodeTree(…)' method is called to create the value for the 'MatTreeNestedDataSource' of the tree component. The method 'createElementNodeTree(…)' returns a 'BySymbolElements' array that contains the tree structure for the Angular Material tree data source.
The method contains nested iterations for the symbols, years, and elements. First, the 'bySymbolElementExtsMap' Map is created, with the 'SymbolFinancials' objects grouped by the symbol key. Then the entries of the 'bySymbolElementExtsMap' Map are iterated, and the 'byYearElementsMap' is created, where the entries are grouped by the year key. Then the entries of the 'byYearElementsMap' Map are iterated, and the tree leaf objects 'myByElements' are created with the 'isOpen' property set to false; their 'finanicalElementExts' property contains the elements of the table. Then the 'myByYearElement' objects are created, which contain the 'name' property with the year as a key string, and the 'myByElements' object is added as a child. The 'myBySymbolElement' objects are created, which contain the 'name' property with the symbol as a key string, and the 'byYearElements' array is added as children. This tree data structure array is then returned to be used in the tree data source.

Conclusion Frontend

The Tree/Table Angular components are very good and easy to use. Using the tree component was surprisingly easy, and combining it with the table component worked well. Performance has to be considered, because the tree branches and leaves are rendered and hidden when the component is created. To get good performance, the table components need to be created only after a tree leaf has been opened.
In this article, we will discuss some of the most popular algorithm problems using arrays and hashing approaches. Some of these problems I received during interviews. Let's start with a problem.

Contains Duplicate

Description: Given an integer array nums, return true if any value appears at least twice in the array, and return false if every element is distinct.

Solution: What if we add an additional data structure like a HashSet and put the elements inside? If the element is already in the Set before we insert it, we return true, and that is it. So simple, isn't it?

```java
public boolean containsDuplicate(int[] nums) {
    Set<Integer> set = new HashSet<>();
    for (int n : nums) {
        if (set.contains(n)) {
            return true;
        } else {
            set.add(n);
        }
    }
    return false;
}
```

Moving on to our next task:

Valid Anagram

Description: Given two strings s and t, return true if t is an anagram of s, and false otherwise. An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once.

Example 1: Input: s = "anagram", t = "nagaram" Output: true
Example 2: Input: s = "rat", t = "car" Output: false

Solution: First of all, we should understand what an anagram is. Two words are anagrams only if they consist of the same characters, so we should compare characters. The characters can be in a different order. We can use a few approaches to handle this. In the first variant, we can sort the characters in each word and then compare them. Or we can create a HashMap and, for one word, add character counts, and for the other, subtract them (a sketch of this counting variant appears at the end of this article). Below is the variant with the sorting algorithm.

```java
public boolean isAnagram(String s, String t) {
    if (s == null && t == null) {
        return true;
    } else if (s == null || t == null) {
        return false;
    }
    if (s.length() != t.length()) {
        return false;
    }
    char[] sCh = s.toCharArray();
    char[] tCh = t.toCharArray();
    Arrays.sort(sCh);
    Arrays.sort(tCh);
    for (int i = 0; i < s.length(); i++) {
        if (sCh[i] != tCh[i]) {
            return false;
        }
    }
    return true;
}
```

Is it clear? Please let me know in the comments. Our next problem:

Two Sum

Description: Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order.

Example 1: Input: nums = [2,7,11,15], target = 9 Output: [0,1] Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].
Example 2: Input: nums = [3,2,4], target = 6 Output: [1,2]
Example 3: Input: nums = [3,3], target = 6 Output: [0,1]

Solution: This is one of the basic hashing problems. Let's find a brute-force solution first. We can write two nested loops, iterate over the elements, and compare their sums against the target. It works, but the time complexity is O(N^2), and it could be very, very slow. But what if, instead of the second loop, we save all previously seen elements into a HashMap and check it against the current element? For example, we have the array [3,3] and target = 6. In the first iteration, we put 3 into the map as the key and 0 (its index) as the value. Then, on the next iteration, we check the map for target - current, which in our case is 6 - 3 = 3. The map already contains 3, so we take its index from the map and return the answer.
Let's take a look at the code: Java public int[] twoSum(int[] nums, int target) { int[] rez = new int[2]; Map<Integer, Integer> map = new HashMap<>(); for (int i = 0; i < nums.length; i++){ int rest = target - nums[i]; if(map.containsKey(rest)){ rez[0] = map.get(rest); rez[1] = i; return rez; } else { map.put(nums[i], i); } } return rez; } For some of you, these problems may look easy, but not for me. I spent a lot of time trying to find a correct solution. Now we will look at the hardest problem in this article: Group Anagrams Description: Given an array of strings strs, group the anagrams together. You can return the answer in any order. An Anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once. Example 1: Input: strs = ["eat","tea","tan","ate","nat","bat"] Output: [["bat"],["nat","tan"],["ate","eat","tea"]] Example 2: Input: strs = [""] Output: [[""]] Example 3: Input: strs = ["a"] Output: [["a"]] Solution: Do you remember the previous problem with Anagrams? I want to use the same approach. We remember that anagrams are words with the same characters and the same character counts. What if we sort the characters in each word and create a string from the result? For example, we have ["nat", "tan"]. We sort "nat" and receive "ant." We sort "tan" and again receive "ant." So we can sort each word and put the words into a map, where the key is the sorted string and the value is the list of original words that produce it. Smart, isn't it? Time to look at the code: Java public List<List<String>> groupAnagrams(String[] strs) { Map<String, List<String>> map = new HashMap<>(); for (String s : strs) { char[] chars = s.toCharArray(); Arrays.sort(chars); String sorted = String.valueOf(chars); if (map.containsKey(sorted)) { map.get(sorted).add(s); } else { List<String> list = new ArrayList<>(); list.add(s); map.put(sorted, list); } } return new ArrayList<>(map.values()); } I hope you are enjoying this topic. Next time, I'm going to solve more complicated problems. Feel free to add your thoughts in the comments. I really appreciate your time and want to hear your feedback.
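As promised in the Valid Anagram solution, here is a rough sketch of the counting variant, using a fixed-size count array instead of a HashMap and assuming the strings contain only lowercase English letters (the method name is just illustrative):
Java
public boolean isAnagramByCounting(String s, String t) {
    if (s.length() != t.length()) {
        return false;
    }
    // One counter per letter: increment for s, decrement for t.
    int[] counts = new int[26];
    for (int i = 0; i < s.length(); i++) {
        counts[s.charAt(i) - 'a']++;
        counts[t.charAt(i) - 'a']--;
    }
    // If every counter is back to zero, the two strings are anagrams.
    for (int count : counts) {
        if (count != 0) {
            return false;
        }
    }
    return true;
}
This avoids sorting, so it runs in O(N) time with a constant amount of extra memory.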
Sorting can take a lot of time when dealing with large amounts of data. It would be great if, instead of sorting the data every time, we could directly write it into memory in the correct position, where it would already be sorted. This would allow us to always know in advance where to search, for example, starting from the center. We would know exactly where to go: to the left, discarding half of the data on the right, or to the right, discarding half of the data on the left. This means that the number of elements on each search operation would be halved, which gives us fast logarithmic complexity. But if we really want to insert data into the correct position of an array, it will work quite slowly. After all, we first need to allocate new space in memory for the required number of elements, then copy a part of the old data there, then put in the new data, and then copy all remaining elements to the new locations. In other words, we get a fast search but slow insertion. In such a situation, a linked list is more suitable. The insertion in it really happens instantly in O(1) time without any copying. However, before we can do this, we will have to find the place for insertion by going through most of the nodes of the linked list, which will again lead us to the linear complexity of O(N). To make both of these operations fast, the binary search tree was invented. It is a hierarchical data structure consisting of nodes, each of which stores data (a key-value pair) and can have a maximum of two children. In a simple binary tree, data can be stored in any order. In a binary search tree, however, data can only be added according to special rules that allow the operations mentioned above to work quickly. To understand these rules, let's first create a foundation for this tree and its nodes. C# public class BinarySearchTree<T> where T : IComparable<T> { private Node? _head; private class Node { public T Value; private Node? _left; private Node? _right; public Node(T value) { Value = value; } } } Insertion of a New Value So, let's implement the insertion of a new value. First of all, if the root does not exist initially, that is, the tree is completely empty, we simply add a new node, which becomes the root. Next, if the new element being added is less than the current node we are standing on, we recursively move to the left. If it is greater than or equal to the current node, we recursively move to the right. When we find a place where the parent's child reference is NULL, that is, the branch points to NULL, we perform the insertion. C# public class BinarySearchTree<T> where T : IComparable<T> { ... public void Add(T value) { if (_head is null) { _head = new Node(value); } else { _head.Insert(value); } } ... } private class Node { ... public void Insert(T value) { ref var branch = ref value.CompareTo(Value) < 0 ? ref _left : ref _right; if (branch is null) { branch = new Node(value); } else { branch.Insert(value); } } } Thanks to recursion, all the elements that we traverse while searching for the correct location are stored on the stack. Once a new node is added and the recursion begins to unwind, we will ascend through each ancestor node until we reach the root. This feature greatly simplifies the code later on, so keep it in mind. If we continue to insert new elements in this way, we will notice that they all end up sorted: smaller nodes will always be on the left, while larger nodes will always be on the right.
Thanks to this, we always quickly find the correct path, both for inserting new elements and for searching, without affecting the other nodes in the tree. So now we have search and insertion happening in O(log(N)) time. This also allows us to quickly find the minimum and maximum elements of the tree. The minimum will always be the lowest element on the left, while the maximum will be the lowest element on the right. Therefore, in the first case, we simply always recursively descend to the left until we hit NULL. In the second case, it is similar, but we descend to the right. It is important to understand that such nodes cannot have more than one child. Otherwise, they would not be the minimum or maximum. C# public class BinarySearchTree<T> where T : IComparable<T> { ... public T Min() { if (_head is null) { throw new InvalidOperationException("The tree is empty."); } return _head.Min().Value; } public T Max() { if (_head is null) { throw new InvalidOperationException("The tree is empty."); } return _head.Max().Value; } ... } private class Node { ... public Node Min() { return _left is null ? this : _left.Min(); } public Node Max() { return _right is null ? this : _right.Max(); } } Item Search We will create a simple search function to find a specific node in the tree based on its value. Due to the structure of the tree, the search process is relatively straightforward. If the current node contains the given value, we will return it. If the value being searched for is less than the value of the current node, we will recursively search in the left subtree. Conversely, if the search value is greater than the current node's value, we will look in the right subtree. If we reach an empty subtree, i.e., it points to NULL, then the node with the searched value does not exist. The algorithm is similar to the insertion algorithm, and we will implement the Contains method for the tree and the Find method for the nodes. C# public class BinarySearchTree<T> where T : IComparable<T> { ... public bool Contains(T value) { return _head?.Find(value) is not null; } ... } private class Node { ... public Node? Find(T value) { var comparison = value.CompareTo(Value); if (comparison == 0) { return this; } return comparison < 0 ? _left?.Find(value) : _right?.Find(value); } } Removing Values Now, so that our tree does not break when we want to delete some element from it, we need special deletion rules. Firstly, if the element being deleted is a leaf node, then we simply replace it with NULL. If this element has one child, then we replace the deleted node not with NULL but with that child. So, in these two cases, deletion simply boils down to replacing the deleted node with its child, which can be either a regular existing node or NULL. Therefore, we simply check that the node being deleted does not have two children and overwrite it with one of its children, which, again, can turn out to be either an existing node or NULL. The last case is when the deleted node has two children. In this case, the node that will come in place of the deleted node must be greater than all nodes in the left subtree of the deleted node and smaller than all nodes in the right subtree of the deleted node. Therefore, first, we need to find either the largest element in the left subtree or the smallest element in the right subtree. After we have found it, we overwrite the deleted node's data with that node's value and then recursively delete that node from the subtree it came from.
It will be deleted according to the same rules: it will either be replaced with NULL or its only child. C# public class BinarySearchTree<T> where T : IComparable<T> { ... public bool Remove(T value) { return _head?.Remove(value, out _head) ?? false; } ... } private class Node { ... public bool Remove(T value, out Node? root) { var comparison = value.CompareTo(Value); if (comparison < 0) { root = this; return _left?.Remove(value, out _left) ?? false; } if (comparison > 0) { root = this; return _right?.Remove(value, out _right) ?? false; } if (_left is null || _right is null) { root = _left ?? _right; return true; } var leftMax = _left.Max(); _left.Remove(leftMax.Value, out _left); Value = leftMax.Value; root = this; return true; } } Tree Traversal To check that the deletion was successful and the order of all nodes has been preserved, there is a special tree traversal method that allows us to output all nodes in ascending order. This method is called an in-order traversal. It involves recursively outputting first the left child, then the parent, and then the right child. Let's convert the tree to a regular list using an in-order traversal. C# public class BinarySearchTree<T> where T : IComparable<T> { ... public List<T> ToList() { var list = new List<T>(); _head?.AddTo(list); return list; } ... } private class Node { ... public void AddTo(ICollection<T> list) { _left?.AddTo(list); list.Add(Value); _right?.AddTo(list); } } Now we have a simple way to output the tree to the console. Let's do that and make sure that the deletion works correctly. C# var tree = new BinarySearchTree<int>(); tree.Add(50); tree.Add(40); tree.Add(30); tree.Add(45); tree.Add(35); Print(tree.ToList()); tree.Remove(35); tree.Remove(40); Print(tree.ToList()); void Print(List<int> list) { foreach (var value in list) { Console.Write(value); Console.Write(" "); } Console.WriteLine(); } Another type of tree traversal is called a pre-order traversal. It involves outputting first the parent, then the left child, and then the right child. This can be useful, for example, when copying a tree in memory, because we traverse the nodes in the exact same order as they were placed in the tree from top to bottom. There are other types of binary search tree traversals, but their implementation differs little. Conclusion Finally, we have a data structure that can do everything quickly. Let's take a moment to think and create a tree from elements 1 to 5. If each subsequent node is always greater or always less than the previous node, then we again get an ordinary linked list with operations of complexity O(N). Therefore, our tree turns out to be completely useless. Fortunately, people quickly realized this and came up with a more advanced tree called AVL with self-balancing, which will again allow us to achieve logarithmic complexity regardless of the incoming data. But we will cover this type of tree and its balancing implementation in the next article Understanding AVL Trees in C#: A Guide to Self-Balancing Binary Search Trees.
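To make the degenerate case from the conclusion concrete, here is a tiny sketch that uses the BinarySearchTree built above and inserts the values 1 through 5 in ascending order; every new value is greater than the previous one, so each node only ever gets a right child and the tree effectively becomes a linked list:
C#
var degenerate = new BinarySearchTree<int>();

// Each value is larger than the previous one, so every Insert call
// walks down the right branch only: 1 -> 2 -> 3 -> 4 -> 5.
for (var value = 1; value <= 5; value++)
{
    degenerate.Add(value);
}

// The in-order traversal still prints "1 2 3 4 5", but Add, Contains,
// and Remove now have to visit every existing node: O(N) per operation.
Console.WriteLine(string.Join(" ", degenerate.ToList()));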
This blog post is for folks interested in learning how to use Golang and AWS Lambda to build a serverless solution. You will be using the aws-lambda-go library along with the AWS Go SDK v2 for an application that will process records from an Amazon Kinesis data stream and store them in a DynamoDB table. But that's not all! You will also use Go bindings for AWS CDK to implement "Infrastructure-as-code" for the entire solution and deploy it with the AWS CDK CLI. Introduction Amazon Kinesis is a platform for real-time data processing, ingestion, and analysis. Kinesis Data Streams is a serverless streaming data service (part of the Kinesis streaming data platform, along with Kinesis Data Firehose, Kinesis Video Streams, and Kinesis Data Analytics) that enables developers to collect, process, and analyze large amounts of data in real-time from various sources such as social media, IoT devices, logs, and more. AWS Lambda, on the other hand, is a serverless compute service that allows developers to run their code without having to manage the underlying infrastructure. The integration of Amazon Kinesis with AWS Lambda provides an efficient way to process and analyze large data streams in real time. A Kinesis data stream is a set of shards and each shard contains a sequence of data records. A Lambda function can act as a consumer application and process data from a Kinesis data stream. You can map a Lambda function to a shared-throughput consumer (standard iterator), or to a dedicated-throughput consumer with enhanced fan-out. For standard iterators, Lambda polls each shard in your Kinesis stream for records using HTTP protocol. The event source mapping shares read throughput with other consumers of the shard. Amazon Kinesis and AWS Lambda can be used together to build many solutions including real-time analytics (allowing businesses to make informed decisions), log processing (use logs to proactively identify and address issues in server/applications, etc. before they become critical), IoT data processing (analyze device data in real-time and trigger actions based on the results), clickstream analysis (provide insights into user behavior), fraud detection (detect and prevent fraudulent card transactions) and more. As always, the code is available on GitHub. Prerequisites Before you proceed, make sure you have the Go programming language (v1.18 or higher) and AWS CDK installed. Clone the GitHub repository and change to the right directory: git clone https://github.com/abhirockzz/kinesis-lambda-events-golang cd kinesis-lambda-events-golang Use AWS CDK To Deploy the Solution To start the deployment, simply invoke cdk deploy and wait for a bit. You will see a list of resources that will be created and will need to provide your confirmation to proceed. cd cdk cdk deploy # output Bundling asset KinesisLambdaGolangStack/kinesis-function/Code/Stage... ✨ Synthesis time: 5.94s This deployment will make potentially sensitive changes according to your current security approval level (--require-approval broadening). Please confirm you intend to make the following modifications: //.... omitted Do you wish to deploy these changes (y/n)? y This will start creating the AWS resources required for our application. If you want to see the AWS CloudFormation template which will be used behind the scenes, run cdk synth and check the cdk.out folder. You can keep track of the progress in the terminal or navigate to the AWS console: CloudFormation > Stacks > KinesisLambdaGolangStack. 
Once all the resources are created, you can try out the application. You should have: A Lambda function A Kinesis stream A DynamoDB table Along with a few other components (like IAM roles, etc.) Verify the Solution You can check the table and Kinesis stream info in the stack output (in the terminal or the Outputs tab in the AWS CloudFormation console for your Stack): Publish a few messages to the Kinesis stream. For the purposes of this demo, you can use the AWS CLI: export KINESIS_STREAM=<enter the Kinesis stream name from cloudformation output> aws kinesis put-record --stream-name $KINESIS_STREAM --partition-key user1@foo.com --data $(echo -n '{"name":"user1", "city":"seattle"}' | base64) aws kinesis put-record --stream-name $KINESIS_STREAM --partition-key user2@foo.com --data $(echo -n '{"name":"user2", "city":"new delhi"}' | base64) aws kinesis put-record --stream-name $KINESIS_STREAM --partition-key user3@foo.com --data $(echo -n '{"name":"user3", "city":"new york"}' | base64) Check the DynamoDB table to confirm that the records have been stored. You can use the AWS console or the AWS CLI aws dynamodb scan --table-name <enter the table name from cloudformation output>. Don’t Forget To Clean Up Once you're done, to delete all the services, simply use: cdk destroy #output prompt (choose 'y' to continue) Are you sure you want to delete: KinesisLambdaGolangStack (y/n)? You were able to set up and try the complete solution. Before we wrap up, let's quickly walk through some of the important parts of the code to get a better understanding of what's going on behind the scenes. Code Walkthrough Some of the code (error handling, logging, etc.) has been omitted for brevity since we only want to focus on the important parts. AWS CDK You can refer to the CDK code here. We start by creating the DynamoDB table: table := awsdynamodb.NewTable(stack, jsii.String("dynamodb-table"), &awsdynamodb.TableProps{ PartitionKey: &awsdynamodb.Attribute{ Name: jsii.String("email"), Type: awsdynamodb.AttributeType_STRING}, }) table.ApplyRemovalPolicy(awscdk.RemovalPolicy_DESTROY) We create the Lambda function (CDK will take care of building and deploying the function) and make sure we provide it with appropriate permissions to write to the DynamoDB table. function := awscdklambdagoalpha.NewGoFunction(stack, jsii.String("kinesis-function"), &awscdklambdagoalpha.GoFunctionProps{ Runtime: awslambda.Runtime_GO_1_X(), Environment: &map[string]*string{"TABLE_NAME": table.TableName()}, Entry: jsii.String(functionDir), }) table.GrantWriteData(function) Then, we create the Kinesis stream and add that as an event source to the Lambda function. kinesisStream := awskinesis.NewStream(stack, jsii.String("lambda-test-stream"), nil) function.AddEventSource(awslambdaeventsources.NewKinesisEventSource(kinesisStream, &awslambdaeventsources.KinesisEventSourceProps{ StartingPosition: awslambda.StartingPosition_LATEST, })) Finally, we export the Kinesis stream and DynamoDB table name as CloudFormation outputs. awscdk.NewCfnOutput(stack, jsii.String("kinesis-stream-name"), &awscdk.CfnOutputProps{ ExportName: jsii.String("kinesis-stream-name"), Value: kinesisStream.StreamName()}) awscdk.NewCfnOutput(stack, jsii.String("dynamodb-table-name"), &awscdk.CfnOutputProps{ ExportName: jsii.String("dynamodb-table-name"), Value: table.TableName()}) Lambda Function You can refer to the Lambda Function code here.
The Lambda function handler iterates over each record in the Kinesis stream, and for each of them: Unmarshals the JSON payload in the Kinesis stream into a Go struct Stores the stream data partition key as the primary key attribute (email) of the DynamoDB table The rest of the information is picked up from the stream data and also stored in the table. (The package-level client and table variables used by the handler are initialized outside of it; a sketch of that initialization follows at the end of this section.) func handler(ctx context.Context, kinesisEvent events.KinesisEvent) error { for _, record := range kinesisEvent.Records { data := record.Kinesis.Data var user CreateUserInfo err := json.Unmarshal(data, &user) item, err := attributevalue.MarshalMap(user) if err != nil { return err } item["email"] = &types.AttributeValueMemberS{Value: record.Kinesis.PartitionKey} _, err = client.PutItem(context.Background(), &dynamodb.PutItemInput{ TableName: aws.String(table), Item: item, }) } return nil } type CreateUserInfo struct { Name string `json:"name"` City string `json:"city"` } Wrap Up In this blog, you saw an example of how to use Lambda to process messages in a Kinesis stream and store them in DynamoDB, thanks to the Kinesis and Lambda integration. The entire infrastructure life-cycle was automated using AWS CDK. All this was done using the Go programming language, which is well supported across DynamoDB, AWS Lambda, and AWS CDK. Happy building!
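Here is the initialization sketch mentioned above: a package-level DynamoDB client and table name, plus the Lambda entry point. This is a minimal sketch based on the usual aws-lambda-go and AWS SDK for Go v2 conventions; the exact code in the repository may differ.
Go
package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
)

var (
	client *dynamodb.Client
	table  string
)

func init() {
	// TABLE_NAME is injected into the function's environment by the CDK stack.
	table = os.Getenv("TABLE_NAME")

	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal("failed to load AWS config: ", err)
	}
	client = dynamodb.NewFromConfig(cfg)
}

func main() {
	lambda.Start(handler)
}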
This is the first blog in a series that will focus on Snowflake, where we’ll cover best practices for using Snowflake, explore various Snowflake functionalities, discuss how to maximize the benefits of Snowflake, and address the challenges that come with its implementation or migration. In this blog, we’ll start by discussing setting up a Snowflake account, especially for those new to the Snowflake ecosystem. With a Snowflake account readily available to use and a limited understanding of its system-defined roles, it usually becomes a challenge for a team lead or an admin to set up the environments with proper access controls for its developers or users. To start with the account setup, you will first need a user that has ACCOUNTADMIN role access for the Snowflake account. This can be provided by a user who has ORGADMIN Snowflake account access. This is illustrated by the example below: An organization has one Snowflake organization-wide account that is managed by ORGADMIN. ORGADMIN can create multiple accounts under the same organization in Snowflake, which can be separately managed by different teams within the organization. Before starting to create users, roles, warehouses, databases, etc., you need to first understand the system-defined roles in Snowflake below and what Snowflake recommends as best practices while setting up the account. System-Defined Roles USERADMIN: The initial part of the account creation process is creating users and roles within an account. The USERADMIN role's purpose is user and role creation. This role is granted the CREATE USER and CREATE ROLE security privileges. SECURITYADMIN: A role is incomplete without any grants on it, and the SECURITYADMIN role is solely used for granting. Anything relating to grants in Snowflake is completely managed by the SECURITYADMIN role. Once users and roles are created by USERADMIN, you can use SECURITYADMIN to grant the users the appropriate roles. You can grant warehouses, databases, schemas, integration objects, and access to create tables, stages, views, etc., to a role using the SECURITYADMIN role. The SECURITYADMIN role inherits the privileges of the USERADMIN role via the system role hierarchy. Note that Snowflake doesn't have the concept of user groups. Instead, users are created, and the necessary roles are granted to each user. SYSADMIN: SYSADMIN creates objects like databases, warehouses, schemas, etc., in an account. Although it creates these objects, it doesn't grant access to them to the roles; that is done by SECURITYADMIN. ACCOUNTADMIN: The ACCOUNTADMIN role encapsulates the SYSADMIN and SECURITYADMIN system-defined roles. It is the top-level role in the system and should be granted only to a limited/controlled number of users in your account. In addition, ACCOUNTADMIN is by default the only role with access to create integration objects in Snowflake. As a best practice, users with the ACCOUNTADMIN role should have MFA enabled. ORGADMIN: This role is mainly used to create accounts within an organization. Each account acts as a separate entity and will have its own databases, warehouses, and other objects. PUBLIC: As the name suggests, this role can be accessed by every other user in an account. Objects created as part of the PUBLIC role can be accessed by anyone in the account and are used when there is no need for access controls over the objects. It is generally not recommended to use this role for production purposes.
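One quick way to see this role hierarchy in your own account (assuming you are logged in with a sufficiently privileged role) is to inspect the grants on the system-defined roles:
SQL
-- SECURITYADMIN should list USERADMIN among its granted roles,
-- and ACCOUNTADMIN should list both SYSADMIN and SECURITYADMIN.
show grants to role SECURITYADMIN;
show grants to role ACCOUNTADMIN;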
Setting up an Account With an Example Now that it's clear what every system-defined role is meant to do in Snowflake, let's look at some basic examples of setting up an account using them. Assuming that you have logged in as a user with ACCOUNTADMIN access, consider the use case below: There are four users named meghna, adnan, kaushik, and shushant. meghna and adnan are from an analytics team who build reports using reporting tools. Hence, they only need read access to the objects created. kaushik and shushant are from the data engineering team and build pipelines to load the data into the Snowflake databases. Since it's a development environment, they will have read-and-write access to the objects created. So, let's use their first names as the usernames: meghna, adnan, kaushik, shushant. Since they are working on an analytics project in a dev environment, we can create two roles: one for read access named ROLE_DEV_ANALYTICS_RO and one for read/write access named ROLE_DEV_ANALYTICS_RW. Steps: First, as discussed, let's create the users using the USERADMIN role. SQL use role USERADMIN; -- Create the roles create role ROLE_DEV_ANALYTICS_RO; create role ROLE_DEV_ANALYTICS_RW; -- Create the users create user meghna password='abc123' default_role = ROLE_DEV_ANALYTICS_RO default_secondary_roles = ('ALL') must_change_password = true; create user adnan password='abc123' default_role = ROLE_DEV_ANALYTICS_RO default_secondary_roles = ('ALL') must_change_password = true; create user kaushik password='abc123' default_role = ROLE_DEV_ANALYTICS_RW default_secondary_roles = ('ALL') must_change_password = true; create user shushant password='abc123' default_role = ROLE_DEV_ANALYTICS_RW default_secondary_roles = ('ALL') must_change_password = true; Note that all four users are created with the same password and the argument must_change_password = true, which will force them to change their passwords upon first login. Use SECURITYADMIN to grant users their respective roles: SQL use role SECURITYADMIN; -- Grant the created roles to SYSADMIN grant role ROLE_DEV_ANALYTICS_RO to role SYSADMIN; grant role ROLE_DEV_ANALYTICS_RW to role SYSADMIN; This is done so that objects like tables, stages, views, etc., created using these roles are accessible by SYSADMIN as well. If this is not granted, SYSADMIN wouldn’t be able to access or manage the objects created by these roles. SQL use role SECURITYADMIN; -- Grant the roles to the users grant role ROLE_DEV_ANALYTICS_RO to user meghna; grant role ROLE_DEV_ANALYTICS_RO to user adnan; grant role ROLE_DEV_ANALYTICS_RW to user kaushik; grant role ROLE_DEV_ANALYTICS_RW to user shushant; Now let's use SYSADMIN to create the warehouse, databases, schemas, etc. SQL use role SYSADMIN; -- Create database and schemas create database analytics_dev; create schema analytics_dev.analytics_master; create schema analytics_dev.analytics_summary; -- Create warehouse create warehouse analytics_small with warehouse_size = 'SMALL' warehouse_type = 'STANDARD' auto_suspend = 60 auto_resume = TRUE ; The above SQL creates a small warehouse that suspends after 60 seconds of inactivity and auto-resumes whenever queries are triggered. Now, since the database, schemas, and warehouse are ready, it is time to grant the roles the necessary access using SECURITYADMIN. Let's assume that only tables and views are used for this project.
SQL use role SECURITYADMIN; -- Grant read access to ROLE_DEV_ANALYTICS_RO grant usage on database analytics_dev to role ROLE_DEV_ANALYTICS_RO; grant usage on all schemas in database analytics_dev to role ROLE_DEV_ANALYTICS_RO; grant select on future tables in database analytics_dev to role ROLE_DEV_ANALYTICS_RO; grant select on all tables in database analytics_dev to role ROLE_DEV_ANALYTICS_RO; grant select on future views in database analytics_dev to role ROLE_DEV_ANALYTICS_RO; grant select on all views in database analytics_dev to role ROLE_DEV_ANALYTICS_RO; -- Grant read and write access to ROLE_DEV_ANALYTICS_RW grant usage on database analytics_dev to role ROLE_DEV_ANALYTICS_RW; grant usage on all schemas in database analytics_dev to role ROLE_DEV_ANALYTICS_RW; grant select on future tables in database analytics_dev to role ROLE_DEV_ANALYTICS_RW; grant select on all tables in database analytics_dev to role ROLE_DEV_ANALYTICS_RW; grant select on future views in database analytics_dev to role ROLE_DEV_ANALYTICS_RW; grant select on all views in database analytics_dev to role ROLE_DEV_ANALYTICS_RW; grant create table on schema analytics_dev.analytics_master to role ROLE_DEV_ANALYTICS_RW; grant create view on schema analytics_dev.analytics_master to role ROLE_DEV_ANALYTICS_RW; grant create table on schema analytics_dev.analytics_summary to role ROLE_DEV_ANALYTICS_RW; grant create view on schema analytics_dev.analytics_summary to role ROLE_DEV_ANALYTICS_RW; As seen above, ROLE_DEV_ANALYTICS_RO has been granted read-only access, and ROLE_DEV_ANALYTICS_RW is granted both read and write access. Finally, let's grant the warehouse to the roles. SQL use role SECURITYADMIN; grant USAGE, OPERATE on warehouse analytics_small to role ROLE_DEV_ANALYTICS_RO; grant USAGE, OPERATE on warehouse analytics_small to role ROLE_DEV_ANALYTICS_RW; The users should now be able to log in to Snowflake and use only the roles they have been granted; a quick verification sketch is shown below. Thank you for reading through the entire article. In the next installment of this series, we will delve into some of the most effective practices for loading files into Snowflake.
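Here is the verification sketch referred to above: a hypothetical first session for the user meghna, based entirely on the objects created in this example. The read-only role should be able to browse and query, while an attempt to create a table should fail with an insufficient-privileges error.
SQL
-- Log in as meghna; the default role is ROLE_DEV_ANALYTICS_RO
use role ROLE_DEV_ANALYTICS_RO;
use warehouse analytics_small;
use database analytics_dev;

-- Confirm the current context and see what the role is allowed to do
select current_role(), current_warehouse();
show grants to role ROLE_DEV_ANALYTICS_RO;

-- This should fail for the read-only role with an insufficient-privileges error
create table analytics_dev.analytics_master.demo_table (id int);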
What We Use ClickHouse For The music library of Tencent Music contains data of all forms and types: recorded music, live music, audio, videos, etc. As data platform engineers, our job is to distill information from the data, based on which our teammates can make better decisions to support our users and musical partners. Specifically, we do an all-round analysis of the songs, lyrics, melodies, albums, and artists, turn all this information into data assets, and pass them to our internal data users for inventory counting, user profiling, metrics analysis, and group targeting. We stored and processed most of our data in Tencent Data Warehouse (TDW), an offline data platform where we put the data into various tag and metric systems and then created flat tables centering each object (songs, artists, etc.). Then we imported the flat tables into ClickHouse for analysis and Elasticsearch for data searching and group targeting. After that, our data analysts used the data under the tags and metrics they needed to form datasets for different usage scenarios, during which they could create their own tags and metrics. The data processing pipeline looked like this: Why ClickHouse Is Not a Good Fit When working with the above pipeline, we encountered a few difficulties: Partial Update: Partial update of columns was not supported. Therefore, any latency from any one of the data sources could delay the creation of flat tables and thus undermine data timeliness. High storage cost: Data under different tags and metrics was updated at different frequencies. As much as ClickHouse excelled in dealing with flat tables, it was a huge waste of storage resources to just pour all data into a flat table and partition it by day, not to mention the maintenance cost coming with it. High maintenance cost: Architecturally speaking, ClickHouse was characterized by the strong coupling of storage nodes and compute nodes. Its components were heavily interdependent, adding to the risks of cluster instability. Plus, for federated queries across ClickHouse and Elasticsearch, we had to take care of a huge amount of connection issues. That was just tedious. Transition to Apache Doris Apache Doris, a real-time analytical database, boasts a few features that are exactly what we needed to solve our problems: Partial update: Doris supports a wide variety of data models, among which the Aggregate Model supports the real-time partial update of columns. Building on this, we can directly ingest raw data into Doris and create flat tables there. The ingestion goes like this: Firstly, we use Spark to load data into Kafka; then, any incremental data will be updated to Doris and Elasticsearch via Flink. Meanwhile, Flink will pre-aggregate the data so as to release the burden on Doris and Elasticsearch. Storage cost: Doris supports multi-table join queries and federated queries across Hive, Iceberg, Hudi, MySQL, and Elasticsearch. This allows us to split the large flat tables into smaller ones and partition them by update frequency. The benefits of doing so include relief of storage burden and an increase in query throughput. Maintenance cost: Doris is of simple architecture and is compatible with MySQL protocol. Deploying Doris only involves two processes (FE and BE) with no dependency on other systems, making it easy to operate and maintain. Also, Doris supports querying external ES data tables. 
It can easily interface with the metadata in ES and automatically map the table schema from ES so we can conduct queries on Elasticsearch data via Doris without grappling with complex connections. What's more, Doris supports multiple data ingestion methods, including batch import from remote storage such as HDFS and S3, data reads from MySQL binlog and Kafka, and real-time data synchronization or batch import from MySQL, Oracle, and PostgreSQL. It ensures service availability and data reliability through a consistency protocol and is capable of auto-debugging. This is great news for our operators and maintainers. Statistically speaking, these features have cut our storage cost by 42% and development cost by 40%. During our usage of Doris, we have received lots of support from the open-source Apache Doris community and timely help from the SelectDB team, which is now running a commercial version of Apache Doris. Further Improvements To Serve Our Needs Introduce a Semantic Layer Speaking of the datasets, on the bright side, our data analysts are given the liberty of redefining and combining the tags and metrics at their convenience. But on the dark side, high heterogeneity of the tag and metric systems leads to more difficulty in their usage and management. Our solution is to introduce a semantic layer in our data processing pipeline. The semantic layer is where all the technical terms are translated into more comprehensible concepts for our internal data users. In other words, we are turning the tags and metrics into first-class citizens for data definement and management. Why Would This Help? For data analysts, all tags and metrics will be created and shared at the semantic layer so there will be less confusion and higher efficiency. For data users, they no longer need to create their own datasets or figure out which one is applicable for each scenario but can simply conduct queries on their specified tagset and metricset. Upgrade the Semantic Layer Explicitly defining the tags and metrics at the semantic layer was not enough. In order to build a standardized data processing system, our next goal was to ensure consistent definition of tags and metrics throughout the whole data processing pipeline. For this sake, we made the semantic layer the heart of our data management system: How Does It Work? All computing logics in TDW will be defined at the semantic layer in the form of a single tag or metric. The semantic layer receives logic queries from the application side, selects an engine accordingly, and generates SQL. Then it sends the SQL command to TDW for execution. Meanwhile, it might also send configuration and data ingestion tasks to Doris and decide which metrics and tags should be accelerated. In this way, we have made the tags and metrics more manageable. A fly in the ointment is that since each tag and metric is individually defined, we are struggling with automating the generation of a valid SQL statement for the queries. If you have any idea about this, you are more than welcome to talk to us. Give Full Play to Apache Doris As you can see, Apache Doris has played a pivotal role in our solution. Optimizing the usage of Doris can largely improve our overall data processing efficiency. So, in this part, we are going to share with you what we do with Doris to accelerate data ingestion and queries and reduce costs. What We Want Currently, we have 800+ tags and 1300+ metrics derived from the 80+ source tables in TDW. 
When importing data from TDW to Doris, we hope to achieve: Real-time availability: In addition to the traditional T+1 offline data ingestion, we require real-time tagging. Partial update: Each source table generates data through its own ETL task at various paces and involves only part of the tags and metrics, so we require support for partial update of columns. High performance: We need a response time of only a few seconds in group targeting, analysis, and reporting scenarios. Low costs: We hope to reduce costs as much as possible. What We Do Generate Flat Tables in Flink Instead of TDW Generating flat tables in TDW has a few downsides: High storage cost: TDW has to maintain an extra flat table apart from the discrete 80+ source tables. That's huge redundancy. Low real-timeliness: Any delay in the source tables will be amplified and slow down the whole data link. High development cost: Achieving real-timeliness would require extra development efforts and resources. On the contrary, generating flat tables in Doris is much easier and less expensive. The process is as follows: Use Spark to import new data into Kafka in an offline manner. Use Flink to consume Kafka data. Create a flat table via the primary key ID. Import the flat table into Doris. As is shown below, Flink has aggregated the five rows of data sharing "ID"=1 into one row in Doris, reducing the data writing pressure on Doris. This can largely reduce storage costs since TDW no longer has to maintain two copies of data, and Kafka only needs to store the new data pending ingestion. What's more, we can add whatever ETL logic we want into Flink and reuse lots of development logic for offline and real-time data ingestion. Name the Columns Smartly As we mentioned, the Aggregate Model of Doris allows for partial update of columns. Here we provide a simple introduction to the other data models in Doris for your reference: Unique Model: This is applicable for scenarios requiring primary key uniqueness. It only keeps the latest data of the same primary key ID. (As far as we know, the Apache Doris community is planning to include partial update of columns in the Unique Model, too.) Duplicate Model: This model stores all original data exactly as it is without any pre-aggregation or deduplication. After determining the data model, we had to think about how to name the columns. Using the tags or metrics as column names was not an option because: Our internal data users might need to rename the metrics or tags, but Doris 1.1.3 does not support the modification of column names. Tags might be taken online and offline frequently. If that involves the adding and dropping of columns, it will be not only time-consuming but also detrimental to query performance. Instead, we do the following: For flexible renaming of tags and metrics, we use MySQL tables to store the metadata (name, globally unique ID, status, etc.). Any change to the names will only happen in the metadata and will not affect the table schema in Doris. For example, if song_name is given an ID of 4, it will be stored with the column name of a4 in Doris. Then, if song_name is involved in a query, it will be converted to a4 in SQL. For the onlining and offlining of tags, we sort out the tags based on how frequently they are being used. The least used ones will be given an offline mark in their metadata. No new data will be put under the offline tags, but the existing data under those tags will still be available.
For real-time availability of newly added tags and metrics, we prebuild a few ID columns in Doris tables based on the mapping of name IDs. These reserved ID columns will be allocated to the newly added tags and metrics. Thus, we can avoid table schema changes and the consequent overheads. Our experience shows that only 10 minutes after the tags and metrics are added, the data under them can be available. Noteworthily, the recently released Doris 1.2.0 supports Light Schema Change, which means that to add or remove columns, you only need to modify the metadata in FE. Also, you can rename the columns in data tables as long as you have enabled Light Schema Change for the tables. This is a big trouble saver for us. Optimize Data Writing Here are a few practices that have reduced our daily offline data ingestion time by 75% and our CUMU compaction score from 600+ to 100. Flink pre-aggregation: as mentioned above. Auto-sizing of writing batch: To reduce Flink resource usage, we enable the data in one Kafka Topic to be written into various Doris tables and automatically adjust the batch size based on the data amount. Optimization of Doris data writing: fine-tune the sizes of tablets and buckets as well as the compaction parameters for each scenario: max_XXXX_compaction_thread max_cumulative_compaction_num_singleton_deltas Optimization of the BE commit logic: conduct regular caching of BE lists, commit them to the BE nodes batch by batch, and use finer load balancing granularity. Use Doris-on-ES in Queries About 60% of our data queries involve group targeting. Group targeting is finding our target data by using a set of tags as filters. It poses a few requirements for our data processing architecture: Group targeting related to app users can involve very complicated logic. That means the system must support hundreds of tags as filters simultaneously. Most group targeting scenarios only require the latest tag data. However, metric queries need to support historical data. Data users might need to perform further aggregated analysis of metric data after group targeting. Data users might also need to perform detailed queries on tags and metrics after group targeting. After consideration, we decided to adopt Doris-on-ES. Doris is where we store the metric data for each scenario as a partition table, while Elasticsearch stores all tag data. The Doris-on-ES solution combines the distributed query planning capability of Doris and the full-text search capability of Elasticsearch. The query pattern is as follows: SELECT tag, agg(metric) FROM Doris WHERE id in (select id from Es where tagFilter) GROUP BY tag As is shown, the ID data located in Elasticsearch will be used in the sub-query in Doris for metric analysis. In practice, we find that the query response time is related to the size of the target group. If the target group contains over one million objects, the query will take up to 60 seconds. If it is even larger, a timeout error might occur. After investigation, we identified our two biggest time wasters: When Doris BE pulls data from Elasticsearch (1024 rows at a time by default) for a target group of over one million objects, the network I/O overhead can be huge. After the data pulling, Doris BE needs to conduct Join operations with local metric tables via SHUFFLE/BROADCAST, which can cost a lot. Thus, we make the following optimizations: Add a query session variable es_optimize that specifies whether to enable optimization.
In data writing into ES, add a BK column to store the bucket number after the primary key ID is hashed. The algorithm is the same as the bucketing algorithm in Doris (CRC32). Use Doris BE to generate a Bucket Join execution plan, dispatch the bucket number to BE ScanNode, and push it down to ES. Use ES to compress the queried data, turn multiple data fetch into one and reduce network I/O overhead. Make sure that Doris BE only pulls the data of buckets related to the local metric tables and conducts local Join operations directly to avoid data shuffling between Doris BEs. As a result, we reduce the query response time for large group targeting from 60 seconds to a surprising 3.7 seconds. Community information shows that Doris is going to support inverted indexing since version 2.0.0, which is soon to be released. With this new version, we will be able to conduct a full-text search on text types, equivalence or range filtering of texts, numbers, and datetime, and conveniently combine AND, OR, NOT logic in filtering since the inverted indexing supports array types. This new feature of Doris is expected to deliver 3~5 times better performance than Elasticsearch on the same task. Refine the Management of Data Doris' capability of cold and hot data separation provides the foundation of our cost reduction strategies in data processing. Based on the TTL mechanism of Doris, we only store data of the current year in Doris and put the historical data before that in TDW for lower storage cost. We vary the number of copies for different data partitions. For example, we set three copies for data from the recent three months, which is used frequently, one copy for data older than six months, and two copies for data in between. Doris supports turning hot data into cold data, so we only store data of the past seven days in SSD and transfer data older than that to HDD for less expensive storage. Conclusion Thank you for scrolling all the way down here and finishing this long read. We've shared our cheers and tears, lessons learned, and a few practices that might be of some value to you during our transition from ClickHouse to Doris. We really appreciate the help from the Apache Doris community and the SelectDB team, but we might still be chasing them around for a while since we attempt to realize auto-identification of cold and hot data, pre-computation of frequently used tags/metrics, simplification of code logic using Materialized Views, and so on and so forth. (This article is co-written by me and my colleague Kai Dai. We are both data platform engineers at Tencent Music (NYSE: TME), a music streaming service provider with a whopping 800 million monthly active users. To drop the number here is not to brag but to give a hint of the sea of data that my poor coworkers and I have to deal with every day.)