Searches are integral parts of any application. Performing searches on terabytes and petabytes of data can be challenging when speed, performance, and high availability are core requirements. This blog post will pit Solr vs Elasticsearch, two of the most popular open source search engines whose fortunes over the years have gone in different directions.
Both of them are built on top of Apache Lucene, so the features they support are very similar. However, they differ significantly in terms of deployment, scalability, query language, and many other functionalities.
About Apache Solr
Apache Solr is an open-source search server built on top of Lucene that provides all of Lucene’s search capabilities through HTTP requests. It has been around for almost a decade and a half, making it a mature product with a broad user community.
Solr offers powerful features such as distributed full-text search, faceting, near real-time indexing, high availability, NoSQL features, integrations with big data tools such as Hadoop, and the ability to handle rich-text documents such as Word and PDF.
Elasticsearch is also an open-source search engine built on top of Apache Lucene, and it is part of the ELK Stack alongside Logstash and Kibana. It extends Lucene’s powerful indexing and search functionalities using RESTful APIs, and it achieves the distribution of data across multiple servers using the concepts of indexes and shards. Elasticsearch is completely based on JSON and is suitable for time series and NoSQL data.
This tool is much younger than Solr, but it has gained a lot of popularity because of its feature-rich use cases. Some of its primary features include distributed full-text search, high availability, a powerful query DSL, multitenancy, Geo Search, and horizontal scaling.
According to DB-Engines, which ranks database management systems and search engines according to their popularity, Elasticsearch is ranked number one, and Solr is ranked number three.
Solr gained popularity in the first ten years of its existence, but Elasticsearch has been the more popular of the two since 2016.
Figure 1: DB-Engines Ranking—Elasticsearch vs. Solr Popularity (Source: DB-Engines)
Installation and Configuration
Java is the primary prerequisite for installing both of these engines. The default Elasticsearch configuration requires 1GB of heap memory, which can be changed in the jvm.options file inside the config directory.
By default, Solr needs at least 512MB of heap memory for each instance. This setting can be changed in either the solr script file or the solr.in.cmd file. Both files are located inside the bin directory of the Solr installation.
Elasticsearch is easy to install and configure, but it’s quite a bit heavier than Solr. The latest version of Elasticsearch (version 7.7.1, released in June 2020) has a compressed size of 314.5MB, whereas Solr (version 8.5.2, released in May 2020) ships at 191.7MB.
Configuration files in Elasticsearch are written in YAML, while Solr uses XML-based configuration files.
Indexing and Searching
Both Solr and Elasticsearch write indexes in Lucene. But, since differences exist in sharding and replication (among other features), there are also differences in their files and architectures. Additionally, Elasticsearch has native DSL support while Solr has a robust Standard Query Parser that aligns to Lucene syntax.
Both tools support a wide range of data sources.
Solr uses request handlers to ingest data from XML files, CSV files, databases, Microsoft Word documents, and PDFs. With native support for the Apache Tika library, it supports extraction and indexing from over one thousand file types. Solr also ships with a simple command-line tool, post. To ingest CSV-based data into a collection named testcollection, for example, you just need to use the following command:
bin/post -c testcollection *.csv
Elasticsearch, on the other hand, is completely JSON-based. It supports data ingestion from multiple sources using the Beats family (lightweight data shippers available in the ELK Stack) and Logstash.
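As a rough counterpart to the Solr command above, the sketch below shows what "completely JSON-based" means for ingestion: CSV rows are turned into the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint accepts. The sample rows and the `build_bulk_payload` helper are assumptions made for illustration, and nothing is sent to a server here.

```python
import csv
import io
import json


def build_bulk_payload(csv_text, index_name):
    """Convert CSV rows into an Elasticsearch _bulk request body (NDJSON)."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Action line: index the document on the next line into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself, one JSON object per line.
        lines.append(json.dumps(row))
    # A bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample data; the column names are made up for this example.
sample_csv = "id,title\n1,Solr in Action\n2,Elasticsearch Guide\n"
payload = build_bulk_payload(sample_csv, "testcollection")
print(payload, end="")
```

In practice a body like this would be sent with a POST to `/_bulk` using the `application/x-ndjson` content type, or the same work would be delegated to Beats or Logstash.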
While both products are document-oriented search engines, Solr has always been more focused on enterprise-directed text searches with advanced information retrieval (IR). Consequently, it’s more suited for search applications that use massive amounts of static data. Solr fits better into enterprise applications that already implement big data ecosystem tools, such as Hadoop and Spark. Additionally, Solr stands out in handling rich-text documents such as Word and PDF files. To compete with Elasticsearch, recent Solr releases have offered new features such as Parallel SQL Interface and streaming expressions.
Elasticsearch is focused more on scaling, data analytics, and processing time series data to obtain meaningful insights and patterns. Its large-scale log analytics performance makes it quite popular. Elasticsearch is more suited to modern web applications where data is carried in and out in JSON format. A lot of development effort has also gone into making the tool more resilient, which makes it a candidate for use as a primary data store.
Both Solr and Elasticsearch support NRT (near real-time) searches and take advantage of all of Lucene’s search capabilities. They both offer additional search-related feature sets, described below, and both now support a JSON-based Query DSL.
Earlier Solr versions had to rely on the Standard Query Parser, but Solr now also supports a JSON-based Query DSL. While Solr’s Standard Query Parser allows users to create a variety of structured queries, the chances of making syntax errors while writing these queries are much higher than with a JSON-based DSL. Nevertheless, you can write very complex search queries in Solr that are unavailable in Elasticsearch. Solr includes a sample search UI, called Velocity Search, that offers powerful features such as searching, faceting, highlighting, autocomplete, and Geo Search.
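To make the two Solr query styles concrete, here is a minimal sketch that expresses one search both as Standard Query Parser URL parameters and as a JSON Request API body. The field names `title` and `year` and both helper functions are assumptions for illustration; the code only builds the requests and does not contact a Solr server.

```python
from urllib.parse import urlencode

# "title" and "year" are hypothetical field names used only for illustration.


def solr_standard_params(term, min_year, rows=10):
    """URL parameters for a Standard Query Parser search (Lucene syntax)."""
    return urlencode({
        "q": f"title:{term}",             # main query
        "fq": f"year:[{min_year} TO *]",  # filter query: open-ended range
        "rows": rows,                     # number of results to return
    })


def solr_json_request(term, min_year, rows=10):
    """The same search expressed as a JSON Request API body."""
    return {
        "query": f"title:{term}",
        "filter": [f"year:[{min_year} TO *]"],
        "limit": rows,
    }


params = solr_standard_params("lucene", 2015)
body = solr_json_request("lucene", 2015)
print(params)
print(body)
```

The range syntax inside the strings is still Lucene syntax in both cases; the JSON form mainly gives the request a structure that is harder to get wrong than a long, hand-escaped query string.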
Elasticsearch’s DSL is native. Its aggregation framework is powerful: aggregation queries are part of the search APIs and benefit from better caching. The more recent releases of the tool also manage their memory footprint better.
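As an illustration of the native DSL and the aggregation framework, the sketch below builds a single search body that combines a full-text match, a range filter, and a terms aggregation. The field names `title` and `year`, the aggregation name `per_year`, and the helper function are assumptions; the body is only constructed and printed, not sent.

```python
import json

# "title" and "year" are hypothetical field names used only for illustration.


def elasticsearch_search_body(term, min_year, size=10):
    """Query DSL body: scored full-text match, unscored filter, and a terms aggregation."""
    return {
        "size": size,
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"title": term}}],
                # "filter" clauses only include or exclude documents.
                "filter": [{"range": {"year": {"gte": min_year}}}],
            }
        },
        # Bucket the matching documents by year and count them.
        "aggs": {"per_year": {"terms": {"field": "year"}}},
    }


search_body = elasticsearch_search_body("lucene", 2015)
print(json.dumps(search_body, indent=2))
```

A body like this would typically be sent with a POST to `/<index>/_search`. Keeping the range condition in the filter clause rather than the scored part of the query is a common choice because filters do not affect scoring.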
Because Elasticsearch is schemaless, it is easy to index unstructured data and dynamic fields without defining the schema of the index in advance. Earlier Solr versions required a defined schema before indexing data. However, Solr now supports a schemaless mode.
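Even with schemaless indexing available, many teams still define fields explicitly. The following sketch, under stated assumptions, builds an explicit Elasticsearch mapping and the equivalent Solr Schema API `add-field` commands from one field list. The field names are hypothetical, and the Solr type names (`text_general`, `pint`, `string`) assume the field types from Solr's default configset are present.

```python
import json

# Hypothetical field definitions used only for illustration.
FIELDS = {"title": "text", "year": "integer", "category": "keyword"}

# Assumed translation to field types from Solr's default configset.
SOLR_TYPES = {"text": "text_general", "integer": "pint", "keyword": "string"}


def elasticsearch_mapping(fields):
    """Body for creating an index with an explicit mapping."""
    return {
        "mappings": {
            "properties": {name: {"type": kind} for name, kind in fields.items()}
        }
    }


def solr_add_field_commands(fields):
    """One Schema API add-field command per field."""
    return [
        {"add-field": {"name": name, "type": SOLR_TYPES[kind], "stored": True}}
        for name, kind in fields.items()
    ]


mapping = elasticsearch_mapping(FIELDS)
commands = solr_add_field_commands(FIELDS)
print(json.dumps(mapping, indent=2))
print(json.dumps(commands, indent=2))
```

The mapping body would accompany index creation in Elasticsearch, while each Solr command would be posted to the collection's `/schema` endpoint.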
Both search engines support custom analyzers, synonym-based indexing, stemming, and various tokenization options.
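As one example of such a custom analysis chain, the sketch below builds Elasticsearch index settings for an analyzer that tokenizes, lowercases, expands synonyms, and stems. The names `example_synonyms` and `example_analyzer`, the synonym list, and the helper function are assumptions for illustration. Solr expresses the same idea through field type definitions in its schema.

```python
import json


def analysis_settings(synonyms):
    """Index settings defining a custom analyzer: tokenize, lowercase, synonyms, stem."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Synonym token filter fed from an inline list.
                    "example_synonyms": {"type": "synonym", "synonyms": synonyms}
                },
                "analyzer": {
                    "example_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Token filters are applied in the order listed.
                        "filter": ["lowercase", "example_synonyms", "porter_stem"],
                    }
                },
            }
        }
    }


settings = analysis_settings(["laptop, notebook"])
print(json.dumps(settings, indent=2))
```

The order of the filters matters: lowercasing runs before synonym expansion here so that the synonym list only needs lowercase entries.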
Scalability and Distribution
Search engines have to quickly process large amounts of data and complex queries on sets of hundreds of millions of records. Sometimes these queries can be so resource-intensive that they can take the whole system down—especially if you haven’t planned for the load in advance and can’t scale quickly. For this reason, a search engine must be scalable and fault-tolerant in nature.
Clusters, Sharding, and Rebalancing
Both Elasticsearch and SolrCloud provide support for sharding. But, since Elasticsearch was designed with horizontal scaling in mind, it has better support for scaling and cluster management. Its disadvantage is that the number of shards in an index cannot be increased once the index has been created, although you can use the shrink API to reduce the number of shards. SolrCloud supports further splitting of an existing shard but not the shrinking of shards.
Elasticsearch’s built-in Zen Discovery module handles cluster coordination. SolrCloud requires Apache ZooKeeper, an additional service.
In case of a shard or node failure, Elasticsearch rebalances the cluster itself and rarely requires manual intervention. In SolrCloud, rebalancing is complex and hard to manage.
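One way to see why a shard count is hard to change after the fact is to look at hash-based document routing. The sketch below is a simplification: it uses `zlib.crc32` as a stand-in hash (not the hash function either engine actually uses) and a plain modulo over the shard count, and the document IDs are made up.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a shard by hashing the routing value (here, the document ID)."""
    # zlib.crc32 is a stand-in for this sketch, not the engines' real hash.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


# Hypothetical document IDs.
doc_ids = [f"doc-{i}" for i in range(1000)]

# Placement with 5 shards versus 6 shards.
before = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
after = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}

moved = sum(1 for doc_id in doc_ids if before[doc_id] != after[doc_id])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because the shard number depends on the shard count, changing that count silently changes where most documents are expected to live, which is why resizing generally means copying data into a new layout rather than editing a setting.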
Solr has a broad open source community. Anyone can contribute to Solr, and new Solr developers or code committers are elected based on merit only. Elasticsearch is technically open source, but not fully. All contributors have access to the source code, and users can make changes and contribute them, but final changes must be confirmed by employees of Elastic (the company that runs Elasticsearch and other software). Therefore, Elasticsearch is driven more by a single company than by a whole community. This is not to mention the number of non-open, premium features that Elasticsearch (and the Elastic/ELK Stack in general) offers.
Going back to the mid-2010s, Solr contributors and committers spanned multiple organizations while Elasticsearch committers came from Elastic only. Solr’s strong community had a healthy project pipeline, and many well-known companies took part. These members also invested in the platform by contributing throughout the entire development and engineering process.
This has changed drastically in the last five years. Elasticsearch’s community of contributors and its user base have grown immensely. It is by far the most popular open source time-series database and search engine in DevOps at the beginning of the 2020s.
Historically, both have had great user bases as well as rich developer communities, but Elasticsearch has overtaken Solr. Solr has been around for much longer, but its ecosystem has stagnated despite being well developed and having a large user base.
On documentation, Elasticsearch wins. Not only does Elasticsearch’s official website offer well-organized, high-quality documentation, but the internet is also flush with books and guides, thanks to the tool’s popularity. Over the last four years, Elasticsearch has improved its documentation beyond organization alone, adding good examples and clear configuration instructions.
In comparison, Solr documentation is lacking. The overall coverage of Solr’s APIs is minimal, and it’s hard to find good technical examples and tutorials. It used to be the other way around: Solr was a very well-documented product with clear examples and contexts for API use cases. However, its documentation maintenance has fallen behind, with gaps noted by many users.
Summary: Solr vs Elasticsearch
Selecting a clear winner between these two technologies requires a complete understanding of the use cases they support, their feature sets, the scaling options they offer, and their ease of maintenance.
Here’s a summary of each tool’s attributes:
| Attribute | Solr | Elasticsearch |
|---|---|---|
| Installation and Configuration | Easy to get up and running, with very supportive documentation | Easy to get up and running, with very supportive documentation. Several packages are available for various platforms. |
| Searching and Indexing | Optimal for text search and enterprise applications close to the big data ecosystem | Useful as both a text search and an analytical engine because of its powerful aggregation module |
| Scalability and Clustering | Provided by SolrCloud, which depends on Apache ZooKeeper for cluster coordination | Better inherent scalability; design optimal for cloud deployments |
| Community | A historically large ecosystem | A thriving ecosystem for the FOSS version of Elasticsearch and the ELK Stack |
Both of these technologies are quite easy to begin working with. Solr offers great functionalities in the field of information retrieval, but Elasticsearch is much easier to take into production and scale. When choosing your tool, make sure to look at your requirements and make the best selection for your specific use case.