Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation: terms. These terms are used to determine which documents match a query during searching.
An analyzer tokenizes text by performing any number of operations on it, which could include extracting words, discarding punctuation, removing accents from characters, lowercasing (also called normalizing), removing common words, reducing words to a root form (stemming), or changing words into the basic form (lemmatization).
This process is also called tokenization, and the chunks of text pulled from a stream of text are called tokens. Tokens, combined with their associated field name, are terms.
In Lucene, an analyzer is a Java class that implements a specific analysis. Analysis occurs any time text needs to be converted into terms, which in Lucene's core happens at two points: during indexing and when searching.
An analyzer chain starts with a Tokenizer, to produce initial tokens from the characters read from a Reader, then modifies the tokens with any number of chained TokenFilters.
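To make the chain concrete, here is a minimal plain-Java sketch of the idea. This is not Lucene's actual API (Lucene streams tokens through Tokenizer and TokenFilter subclasses rather than building lists); it only illustrates the stages of tokenize, lowercase, and stop-word removal:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of an analyzer chain: a tokenizer stage followed
// by chained token-filter stages, illustrated with lists instead of
// Lucene's streaming TokenStream API.
public class AnalyzerChainSketch {

    // Tokenizer stage: split the input on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // TokenFilter stage 1: lowercase (normalize) each token.
    static List<String> lowerCaseFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase());
        return out;
    }

    // TokenFilter stage 2: drop common (stop) words.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and");

    static List<String> stopFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) out.add(t);
        }
        return out;
    }

    // The full chain: tokenizer first, then the filters in order.
    static List<String> analyze(String text) {
        return stopFilter(lowerCaseFilter(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Brown Fox"));
    }
}
```

Swapping, adding, or reordering the filter stages is exactly how different Lucene analyzers (Whitespace, Simple, Stop, Standard) differ from one another.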
Let’s look at some important built-in analyzers available in the Lucene bundle:
WhitespaceAnalyzer, as the name implies, splits text into tokens on whitespace characters and makes no other effort to normalize the tokens. It doesn’t lowercase each token.
SimpleAnalyzer first splits tokens at nonletter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters but keeps all other characters.
StopAnalyzer is the same as SimpleAnalyzer, except it removes common words. By default, it removes common words specific to the English language (the, a, etc.), though you can pass in your own set.
KeywordAnalyzer treats entire text as a single token.
StandardAnalyzer is Lucene’s most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names, email addresses, and hostnames. It also lowercases each token and removes stop words and punctuation.
The built-in analyzers can be used directly by specifying the class name, or an analyzer can be composed from a tokenizer and a series of filters. There are multiple tokenizers (e.g. Standard, Keyword, CharTokenizer) and token filters (e.g. Stop, LowerCase, Standard).
Custom analyzers, tokenizers, and token filters can also be created if required.
Analyzers are configured via the analyzers node (of type nt:unstructured) inside the oak:index definition.
The default analyzer for an index is configured in the default child node (of type nt:unstructured) of the analyzers node.
Create a child node named "tokenizer" of type nt:unstructured under the default node and add the property "name" with the value "Standard".
Create a child node named "filters" of type nt:unstructured under the default node.
Create a child node named "LowerCase" of type nt:unstructured under the filters node.
With this configuration, the data is lowercased at index time, and query terms are lowercased at search time, so matches are case-insensitive.
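Putting the steps above together, the resulting analyzer configuration under the index definition would look roughly like this node tree (a sketch; node and property names follow the steps above):

```
/oak:index/<your-index>
  + analyzers (nt:unstructured)
    + default (nt:unstructured)
      + tokenizer (nt:unstructured)
        - name = "Standard"
      + filters (nt:unstructured)
        + LowerCase (nt:unstructured)
```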
The configuration is ready; let us now index the data. Change the value of the reindex property under the custom index to true. This initiates the re-indexing; the property value changes back to false once the re-indexing is initiated.
Re-executing the query now returns both nodes, the one with the uppercase value (TEST) and the one with the lowercase value (test).
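For illustration, with the lowercase analyzer in place, a full-text query such as the following matches both TEST and test (a sketch; the path and property name are assumptions, adjust them to your content):

```sql
SELECT * FROM [nt:unstructured] AS n
WHERE ISDESCENDANTNODE(n, '/content/sampledata')
AND CONTAINS(n.[id], 'test')
```

Note that analyzers apply to full-text (CONTAINS) constraints; plain property equality comparisons are not affected by the analyzer configuration.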
AEM Core components deep dive | How to extend AEM core components | Proxy Components in AEM
This tutorial explains the details on AEM core components
In AEM, we use either custom or out-of-the-box (OOTB) components to build websites; these components are called WCM (Web Content Management) components.
Core components
were introduced in AEM 6.2 and are strongly recommended from 6.3 onward
provide robust, extensible base components
are built on the latest technology and follow best practices
adhere to accessibility guidelines and are compliant with the WCAG 2.0 AA standard
make page authoring more flexible and customizable
are simple to extend to offer custom functionality
Core components - Features
Core Components - Architecture
The design dialog defines what authors can or cannot do in the edit dialog.
The edit dialog shows authors only the options they are allowed to use.
The Sling model verifies and prepares the content for the view (template).
The result of the Sling model can be serialized to JSON for SPA use cases.
HTL renders the HTML server-side for traditional server-side rendering
The HTML output is semantic, accessible, search-engine optimized and easy to style.
The Core Components follow the MVC (Model-View-Controller) design pattern.
The controller (content) refers to the proxy component (view), which extends the WCM core components. Proxy components are site-specific components inherited from the core components; they are called proxy components because no implementation is required: they simply reference the core component as the resourceSuperType and can be extended as required.
The core component refers to the Sling model (model) to retrieve the data required for rendering.
The controller (content) also references the templates and configuration policies.
The delegate pattern can be used to extend the model class of a core component: the responsibility of the model class is delegated to an implementation chosen by resource type.
The core components use model interfaces, and the call is delegated to the right model implementation at runtime based on the resourceType.
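The delegation idea can be sketched in plain Java as follows. This is a conceptual sketch only: the real AEM implementation uses Sling Model annotations (e.g. @Model, @Self, @Via) and the core component's published model interfaces, and the Title interface here is a stand-in, not the actual core component API:

```java
// Plain-Java sketch of the delegate pattern used to extend a core
// component's Sling model: the custom model implements the same
// interface and forwards everything it does not override to the
// core component's implementation.
public class DelegatePatternSketch {

    // Stand-in for a core component model interface, e.g. Title.
    interface Title {
        String getText();
        String getType();
    }

    // Stand-in for the core component's own model implementation.
    static class CoreTitle implements Title {
        public String getText() { return "Hello"; }
        public String getType() { return "h1"; }
    }

    // Custom model: overrides getText(), delegates getType().
    static class CustomTitle implements Title {
        private final Title delegate;

        CustomTitle(Title delegate) { this.delegate = delegate; }

        @Override
        public String getText() {
            // Custom behavior layered on top of the core model.
            return delegate.getText().toUpperCase();
        }

        @Override
        public String getType() {
            // Everything else is forwarded to the core implementation.
            return delegate.getType();
        }
    }

    static Title customTitle() {
        return new CustomTitle(new CoreTitle());
    }

    public static void main(String[] args) {
        Title t = customTitle();
        System.out.println(t.getText() + " " + t.getType());
    }
}
```

The benefit of the pattern is that the custom model only implements the methods it changes; all other behavior continues to come from the core component's implementation, so it keeps working across core component upgrades.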
Oak Lucene Index - Improve the query performance in AEM(Adobe Experience Manager) | Configure Oak Lucene Index in AEM
This tutorial explains the details on enabling Oak Lucene Index to improve the query performance in AEM(Adobe Experience Manager)
Oak Lucene Index
For queries to perform well, Oak supports indexing of content stored in the repository. When a JCR query is executed, it usually consults an index first. If there is no index, the query traverses the entire content, which is time-consuming and an overhead for AEM. A query can be executed without an index, but for large data sets it will run very slowly, or even abort.
There are three indexing modes available, which define how indexing is performed and when the index content gets updated:
Synchronous Indexing - With synchronous indexing, the index content is updated as part of the commit itself. Changes to both the main content and the index content are done atomically in a single commit, so new content is searchable as soon as it is available.
Asynchronous Indexing - Asynchronous indexing (also called async indexing) is performed by periodic scheduled jobs. As part of the setup, Oak schedules jobs that diff the repository content and update the index content based on that. This provides better write performance, but new content is not immediately visible in the index.
Near Real Time (NRT) Indexing - NRT indexing sits between the two: recent changes are indexed locally on each cluster node and become searchable quickly, while the durable index is still updated by the async indexer.
Indexing uses commit editors. Some editors are of type IndexEditor and are responsible for updating index content based on changes in the main content. Currently, Oak has the following built-in editors:
PropertyIndexEditor
ReferenceEditor
LuceneIndexEditor
SolrIndexEditor
There are 3 main types of indexes available in AEM :
Lucene – asynchronous (full text and property) - Recommended
Property – synchronous [ Prefer only when you need synchronous results ]
Solr – asynchronous
Configure Lucene Index in AEM
Oak supports Lucene-based indexes for both property constraints and full-text constraints. Depending on the configuration, a Lucene index can be used to evaluate property constraints, full-text constraints, path restrictions, and sorting.
If multiple indexers are available for a query, each available indexer estimates the cost of executing the query. Oak then chooses the indexer with the lowest estimated cost.
I have a large data set (about 12k nodes) under "/content/sampledata" with an id property; the id property value of every node starts with '1111'.
Let me now execute a query to fetch all the nodes under "/content/sampledata" whose id property value contains '1111':
select * from [nt:unstructured] where [jcr:path] like '/content/sampledata/%' and id LIKE '%1111%'
The query execution failed with the following exception: "The query read or traversed more than 100000 nodes. To avoid affecting other tasks, processing was stopped."
The limit of 100,000 is the queryLimitReads value. The value can be increased, but the query will fail again once it reaches the new limit, and raising it impacts overall system performance.
The queryLimitReads value can be changed through the following OSGI configuration -http://localhost:4502/system/console/configMgr/org.apache.jackrabbit.oak.query.QueryEngineSettingsService
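Equivalently, the setting can be deployed as an OSGi configuration file (a sketch; the file would be named after the PID above, e.g. org.apache.jackrabbit.oak.query.QueryEngineSettingsService.cfg.json, and the value 200000 is only an illustration):

```json
{
  "queryLimitReads": 200000
}
```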
The query execution behavior can be reviewed through the Query Performance Tool.
It displays slow queries and popular queries; the Explain Query feature shows the query execution details.
Since no index is defined for this query, it was executed as a full traversal; the traversal exceeded 100,000 nodes and was aborted to avoid impacting other activities.
Let us see now how to define a Lucene index to improve the query performance
Create a node with name "testindex" under oak:index with the following properties
On save, a node named "indexRules" is created.
By default, a node named nt:base is created under indexRules. Rename the node to the primaryType of the nodes to be indexed, in our case "nt:unstructured".
A default node named prop0 is created under properties; rename prop0 to the property to be indexed, in our case "id", and set the properties below:
id:
propertyIndex - true
ordered - true
name - id
isRegexp - false
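Putting the steps together, the resulting index definition looks roughly like the following node tree. This is a sketch: the type, async, and compatVersion values shown are the usual ones for an async Lucene index, so verify them against your AEM version:

```
/oak:index/testindex
  - jcr:primaryType = "oak:QueryIndexDefinition"
  - type = "lucene"
  - async = "async"
  - compatVersion = 2
  + indexRules (nt:unstructured)
    + nt:unstructured (nt:unstructured)
      + properties (nt:unstructured)
        + id (nt:unstructured)
          - propertyIndex = true
          - ordered = true
          - name = "id"
          - isRegexp = false
```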
Let us now reindex the data. Change the reindex property to true to initiate the asynchronous indexing.
The reindex property value changes back to false after the indexing is initiated; wait some time for the indexing to complete.
Re-execute the query. It now works without any issues and with better performance, since it is executed against the defined Lucene index.
It is always best practice to review slow-running custom queries and configure the required indexes to improve performance. If multiple indexes are defined, Oak chooses the best index to execute the query.
This tutorial explains the approach to configure efficient error handling in AEM(Adobe Experience Manager).
By default, AEM uses Sling's error handler (/apps/sling/servlet/errorhandler) to handle error scenarios; the publisher returns the error pages for not-found resources.
The publisher sends the default error page with a 404 status to the Dispatcher, and the Dispatcher sends the error response with the 404 status code directly back to the user without caching it (it spools error responses directly to the client).
With the default Sling Error handler, all the websites display the same default error page to the users - minimal standard content.
The default Sling error handler can be overlaid and customized to enable site-specific, content-rich, and localized error pages.
However, this can cause performance problems: 404 pages are not cached in the Dispatcher, so the publisher has to process and render an error page for every not-found resource. A high volume of requests for error pages can degrade publisher performance, and attackers can exploit this as a denial-of-service vector by sending many bad URLs, since the publisher must process a 404 page for every request, making the platform unresponsive for intended users.
This behavior can be changed by allowing the webserver(Apache) to handle the error scenarios.
The publisher sends the default error page with a 404 status to the Dispatcher for every not-found resource, and the Dispatcher hands over the error (404) processing to the web server (Apache). The web server checks whether the error page is in the cache; if it is not available, it requests the custom error page from the publisher based on the ErrorDocument configurations and returns the error response to the user with the required 404 status code.
The error page response is cached by the web server, which returns the error page from the cache for subsequent not-found (404) requests.
This improves overall performance, as the web server returns the cached 404 page for not-found resources; only the first request is sent to the publisher.
Let us enable the required configurations to handle the error scenarios from WebServer(Apache)
As a first step, create a site-specific error page named 404-page (use the same name across all websites) under the language node, e.g. /content/we-retail/us/en, with the required components and content.
As a next step, create an error handler, errorhandler.conf, with the required configurations and place it under /etc/httpd/conf/errorhandlers; you can create multiple error handlers as required.
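As a hypothetical sketch of what errorhandler.conf could contain (the site path, page name, and regex are assumptions based on the example above; adjust them to your content structure):

```apache
# Hypothetical errorhandler.conf: serve the site-specific 404 page
# based on the incoming (shortened) URL.
<If "%{REQUEST_URI} =~ m#^/us/en#">
    ErrorDocument 404 /us/en/404-page.html
</If>
<Else>
    ErrorDocument 404 /404-page.html
</Else>
```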
The incoming content path in the above configuration is assumed to be a shortened URL, e.g. /us/en, hiding /content/we-retail (Apache always receives the page requests without /content/we-retail); change the regex configuration based on your content structure.
You can define multiple <If> conditions if the URL pattern differs completely between sites; the <If>/<Else> directives are supported from Apache 2.4.
The error handler can be modified to handle other error codes such as 500 and 403.
Now include the error handler (errorhandler.conf) in the individual virtual hosts.
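For example (a hypothetical virtual host; the server name and paths are assumptions):

```apache
<VirtualHost *:80>
    ServerName we-retail.example.com
    # existing rewrite/dispatcher configuration goes here
    Include /etc/httpd/conf/errorhandlers/errorhandler.conf
</VirtualHost>
```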
Let us now enable a Dispatcher configuration to hand over the error-handling process to the web server (Apache); by default, the Dispatcher handles the error codes and sends the response back to the users. This configuration lets an Apache error handler process the error scenarios based on the ErrorDocument configurations.
Change the value of DispatcherPassError to 1 in the Dispatcher module configuration; in most cases this configuration is placed in the httpd.conf file.
<IfModule disp_apache2.c>
    # location of the configuration file. eg: 'conf/dispatcher.any'
    DispatcherConfig conf/dispatcher.any
    # location of the dispatcher log file. eg: 'logs/dispatcher.log'
    DispatcherLog logs/dispatcher.log
    # log level for the dispatcher log
    # 0 Errors
    # 1 Warnings
    # 2 Infos
    # 3 Debug
    DispatcherLogLevel 3
    # if turned to 1, the dispatcher looks like a normal module
    DispatcherNoServerHeader 0
    # if turned to 1, request to / are not handled by the dispatcher
    # use the mod_alias then for the correct mapping
    DispatcherDeclineRoot 0
    # if turned to 1, the dispatcher uses the URL already processed
    # by handlers preceding the dispatcher (i.e. mod_rewrite)
    # instead of the original one passed to the web server.
    DispatcherUseProcessedURL 1
    # Defines how to support 40x error codes for ErrorDocument handling:
    # 0 - the dispatcher spools all error responses to the client.
    # 1 - the dispatcher does not spool an error response to the client
    #     (where the status code is greater or equal than 400), but passes
    #     the status code to Apache, which e.g. allows an ErrorDocument
    #     directive to process such a status code.
    DispatcherPassError 1
</IfModule>
Now you should receive the site-specific error pages. Accessing http://localhost/us/en/a.html responds with the 404 page specific to the /us/en website; in the same way, other sites return their corresponding error pages based on the configuration.
The error pages are now cached in the Dispatcher.
This improves the overall platform performance and also mitigates some security issues, such as denial-of-service attacks, by caching the 404 error pages in the Dispatcher.