What is data ingestion?
Data ingestion refers to collecting and importing data from multiple sources and moving it to a destination to be stored, processed, and analyzed. Typically, the initial destination of ingested data is a database, data warehouse, or data lake.
If the data is needed immediately for a business purpose, as with real-time and near real-time analytics, it won't remain in this initial location for long. Otherwise, it stays there until it is time for it to be processed, cleansed, analyzed, and routed to applications and systems. In either case, ingestion also involves adapting the data to a format appropriate for its storage medium or use case.
The data ingestion layer—and all of the processes associated with it—marks the beginning of any data pipeline. As such, efficient and well-managed ingestion is integral to virtually all other data-based operations within an enterprise, including visualization, integration, analytics, sharing, and more.
Data ingestion vs. ETL
Because these terms are sometimes used interchangeably, it's important to point out how they're different.
- Extract, transform, and load (ETL) is a specific method of data ingestion. The ETL process begins with data extraction from various sources. Afterward, the data is transformed into a format appropriate for either a data warehouse or data lake—which is where the data is loaded, often in conjunction with analytics operations.
- Data ingestion, by contrast, is an umbrella term, referring to any process in which data from various sources is collected and then transformed or restructured as needed. Non-ETL methods include extract, load, and transform (ELT), which is often used with data lakes to ingest data in its raw form, and real-time ingestion, which is explored in detail below.
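The difference between ETL and ELT comes down to the order of the steps. Here is a minimal Python sketch of that ordering; the function names, the field names (`name`, `amt`), and the in-memory lists standing in for a warehouse and a lake are all illustrative assumptions, not part of any particular product.

```python
import json

def extract(raw_lines):
    """Parse raw JSON lines from a (hypothetical) source."""
    return [json.loads(line) for line in raw_lines]

def transform(records):
    """Normalize field names and types for the destination."""
    return [
        {"customer": r["name"].strip().title(), "amount": float(r["amt"])}
        for r in records
    ]

def etl(raw_lines, warehouse):
    # ETL: transform before loading, so only cleaned rows reach the warehouse.
    warehouse.extend(transform(extract(raw_lines)))

def elt(raw_lines, lake):
    # ELT: load the raw records first, then transform them later on read.
    lake.extend(extract(raw_lines))
    return transform(lake)

raw = [
    '{"name": " ada lovelace ", "amt": "19.90"}',
    '{"name": "alan turing", "amt": "5"}',
]
warehouse, lake = [], []
etl(raw, warehouse)
view = elt(raw, lake)
```

Both paths end with the same cleaned view, but only the ELT path keeps the untouched source records in storage.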
How data ingestion works
Data ingestion typically takes place either in real time or as a series of carefully scheduled events.
Real-time data ingestion is tied to real-time data analytics. Real-time processing leaves virtually no time for grouping or categorization. As soon as the ingestion layer recognizes a stream of data en route from a real-time data source, the data is immediately collected, loaded, and processed so it can quickly reach its end user.
Sometimes ingestion occurs in near real time rather than almost instantaneously, but the difference is small: 10 or 15 seconds as opposed to two seconds or less.
Any situation in which data must be processed as fast as possible can benefit from data ingestion via real-time processing. Notable use cases include medical diagnostics via wearable devices, advanced driver assistance systems (ADAS), fraud detection platforms, and real-time personalization in e-commerce platforms.
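As a rough illustration of per-event handling, the sketch below processes each record the moment it arrives instead of waiting to group it. The event fields, the `threshold` value, and the finite `event_stream` generator are assumptions for demonstration; a real source would be an unbounded stream.

```python
from typing import Iterable, Iterator

def event_stream() -> Iterator[dict]:
    """Stand-in for a real-time source; a real one would not end."""
    yield {"card": "A", "amount": 25.0}
    yield {"card": "B", "amount": 9800.0}
    yield {"card": "A", "amount": 12.5}

def ingest_stream(events: Iterable[dict], threshold: float = 5000.0) -> Iterator[dict]:
    # Each event is handled as soon as it arrives; nothing is held back for grouping.
    for event in events:
        yield {**event, "flagged": event["amount"] > threshold}

results = list(ingest_stream(event_stream()))
```

Because `ingest_stream` is a generator, a downstream consumer such as a fraud alert receives each result without waiting for the rest of the stream.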
Batch-style data ingestion involves "batches" of ingested data being processed at regular intervals, which are typically scheduled well in advance of the ingestion. The intervals can be hours, days, or even weeks apart, depending on the intended purpose of the batch-processed data, but daily batch ingestion is very common.
Data engineers can also set up batches to be ingested when certain "triggers" occur, such as when a certain number of unprocessed records have been received. For example, to ensure batches of highly detailed records from a customer relationship management (CRM) platform don't get too large, the trigger might be set at 200 records. Less complex data can have much higher trigger thresholds.
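A count-based trigger like the one described above might look roughly like this. The `TriggeredBatcher` class and its `process` callback are hypothetical names for the sketch; the threshold of 200 is the figure from the CRM example.

```python
class TriggeredBatcher:
    """Buffer records and hand them off once a count threshold is reached."""

    def __init__(self, threshold, process):
        self.threshold = threshold
        self.process = process  # callback that ingests one batch
        self.pending = []

    def receive(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self):
        # Hand off whatever is buffered, then start a fresh batch.
        if self.pending:
            batch, self.pending = self.pending, []
            self.process(batch)

batch_sizes = []
batcher = TriggeredBatcher(threshold=200, process=lambda b: batch_sizes.append(len(b)))
for i in range(450):
    batcher.receive({"id": i})
```

After 450 records, two full batches of 200 have been processed and 50 records are still pending. A scheduled call to `flush()` would pick up that remainder.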
The length of the intervals means that batch ingestion will always be slower than its real-time counterpart. Also, because everything in a batch is processed at once, batch ingestion becomes cumbersome if the data set is particularly large or in a complex format. That said, batch ingestion is very useful for automating tasks that are essential but not time-sensitive, like daily sales reporting, payroll, and billing. It's also notably less expensive than real-time ingestion because it doesn't require as much compute power and other resources.
Some tools and systems ingest data in different ways. For example, certain open-source applications for streaming analytics use a variation of batch processing known as micro batching: Streaming data is organized into small batches before being almost immediately ingested.
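The idea of micro batching can be sketched in a few lines of Python. This version groups a stream by record count, which is a simplification; streaming engines commonly cut micro-batches by short time windows instead. The function name and the batch size are illustrative assumptions.

```python
from itertools import islice

def micro_batches(stream, size):
    """Yield lists of up to `size` items from an iterable stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

batches = list(micro_batches(range(7), size=3))
```

Each small batch can be ingested as soon as it fills, so latency stays low while the downstream logic remains batch-oriented.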
Alternatively, the data processing framework known as lambda architecture (not to be confused with the serverless computing platform AWS Lambda) has layers for both real-time streaming ingestion and batch ingestion.
Micro-batch ingestion is most often seen in organizations with data pipelines that can't accommodate streaming architecture. Lambda architecture, meanwhile, offers very low-latency ingestion but is complex to implement.
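To make the layered idea behind lambda architecture more concrete, here is a deliberately tiny sketch, assuming a page-view counting use case. The names `batch_view`, `SpeedLayer`, and `serve` are invented for illustration, and a real deployment would use distributed storage and stream processors rather than in-memory counters.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute totals from the full historical dataset."""
    totals = Counter()
    for event in master_dataset:
        totals[event["page"]] += 1
    return totals

class SpeedLayer:
    """Speed layer: incrementally count events not yet in the batch view."""

    def __init__(self):
        self.recent = Counter()

    def ingest(self, event):
        self.recent[event["page"]] += 1

def serve(batch_totals, speed_layer, page):
    # Serving layer: merge the batch view with the real-time view.
    return batch_totals[page] + speed_layer.recent[page]

history = [{"page": "home"}, {"page": "home"}, {"page": "pricing"}]
batch_totals = batch_view(history)

speed = SpeedLayer()
speed.ingest({"page": "home"})  # arrived after the last batch run

home_views = serve(batch_totals, speed, "home")
```

Much of the implementation complexity comes from keeping the two code paths consistent and resetting the speed layer whenever the batch view is recomputed.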
Why data ingestion matters
As the first stage of any data pipeline, the data ingestion process is extremely important. These are the main reasons why:
Setting the stage for critical data operations
Once data is ingested, it can be cleansed, processed, deduplicated, virtualized, or propagated, based on the needs of a given data operation. These steps are necessary for proper data storage, warehousing, analytics, or application use. An effective data ingestion tool can be configured so that it prioritizes the intake of data from the most business-critical sources, helping this data be processed as efficiently as possible.
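One simple way to prioritize intake from business-critical sources is a priority queue. The sketch below uses Python's standard `heapq` module; the source names and their priority ranks in `SOURCE_PRIORITY` are hypothetical.

```python
import heapq
from itertools import count

# Hypothetical ranking: lower number means more business-critical.
SOURCE_PRIORITY = {"payments": 0, "crm": 1, "web_logs": 2}

class PriorityIntake:
    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps arrival order within a source

    def submit(self, source, record):
        entry = (SOURCE_PRIORITY[source], next(self._order), source, record)
        heapq.heappush(self._heap, entry)

    def drain(self):
        # Pop records in priority order, most critical source first.
        while self._heap:
            _, _, source, record = heapq.heappop(self._heap)
            yield source, record

intake = PriorityIntake()
intake.submit("web_logs", {"path": "/"})
intake.submit("crm", {"lead": 17})
intake.submit("payments", {"txn": 1})
intake.submit("payments", {"txn": 2})

order = [source for source, _ in intake.drain()]
```

Even though the web log record arrived first, both payment records are drained ahead of it.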
Facilitating data integration
Data ingestion also kick-starts the process of data integration—bringing together data from many sources, converting it to a uniform format if necessary, and presenting it as a comprehensive unified view. Ingesting data into a single platform that can be used by all departments also helps to limit the formation of data silos—which continue to be a common problem.
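Converting data from different sources to a uniform format can be illustrated with two small adapters. The source layouts, field names, and sample values below are assumptions chosen for the example.

```python
import csv
import io
import json

def from_csv(text):
    """Adapter for a (hypothetical) sales export in CSV."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"email": row["Email"].lower(), "region": row["Region"]} for row in reader]

def from_json(text):
    """Adapter for a (hypothetical) support system that emits nested JSON."""
    return [
        {"email": item["contact"]["email"].lower(), "region": item["region"]}
        for item in json.loads(text)
    ]

sales_csv = "Email,Region\nAda@example.com,EMEA\n"
support_json = '[{"contact": {"email": "alan@example.com"}, "region": "AMER"}]'

# Both sources end up in one list with the same schema.
unified = from_csv(sales_csv) + from_json(support_json)
```

Once every source has an adapter that emits the same schema, downstream consumers can query `unified` without knowing where each record came from.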
Streamlining data engineering operations
Many aspects of modern data ingestion are automated. Once they've been set up, these multi-step processes will run with little to no human intervention—unlike in years past, when data engineers sometimes had to wrangle data manually. Automation gives these professionals the freedom to address more mission-critical tasks, and also accelerates the overall data engineering process.
Improving analytics—and decision-making
For any data analytics project to be successful, it's critical that data is consistently and readily available to analysts. The data ingestion layer directs this information to whatever storage medium is most appropriate for on-demand access, be that a data warehouse or a more specialized destination like a data mart. Ingesting data in the manner most appropriate for specific analyses (e.g., batch processing for daily expense reporting, or real-time ingestion for a vehicle's ADAS data) is also an essential foundation for effective analytics.
Data ingestion on a single platform helps ensure that all business users can access and analyze high-quality data, which is vital for decision-making in the enterprise. The speed of real-time ingestion makes it a particularly strong foundation for analytics, leading to more valuable insights and better decisions.
Overcoming potential challenges of data ingestion
Certain complications can arise as part of the data ingestion process, and problems here quickly turn into problems further down the pipeline. It's important to know the most common issues.
Volume and complexity of data: Most of today's enterprises generate and ingest a massive volume of data. Data teams must be prepared for that volume to slow ingestion down and create inefficiencies as a result.
This will be a particularly big challenge for enterprises that rely on many different data sources, like manufacturers with complex Internet of Things (IoT) device deployments. In such cases, ingestion might get expensive—or, at least, be highly time- and/or resource-intensive. Data engineers and analysts will likely need to update ingestion processes periodically to account for emerging data sources and formats.
Security: Data in transit is often considered less secure than data at rest. Ingestion relies on data in transit, first from source to ingestion layer and then through an enterprise's data pipeline.
Batch ingestion carries particularly significant risk because a skilled hacker might be able to predict the processing intervals and use those times to strike. Enterprise data teams must work with their cybersecurity co-workers to implement encryption, firewalls, and other protective measures to minimize ingestion-related security risks.
Connectivity: If the connection between a data source and the ingestion layer fails, even briefly, data can be lost or corrupted.
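A common mitigation for brief connection failures is to retry with exponential backoff. The sketch below is one possible shape for that logic; `fetch_with_retry`, its parameters, and the simulated `flaky_source` are assumptions for illustration. The `sleep` argument is injectable so the example runs without real delays.

```python
import time

def fetch_with_retry(fetch, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(); on ConnectionError wait base_delay * 2**n, then retry."""
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up and surface the failure instead of dropping data silently
            sleep(base_delay * 2 ** attempt)

calls = {"n": 0}

def flaky_source():
    """Simulated source that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("link dropped")
    return ["record-1", "record-2"]

delays = []  # records the waits instead of actually sleeping
records = fetch_with_retry(flaky_source, sleep=delays.append)
```

Here the third attempt succeeds after simulated waits of 0.5 and 1.0 seconds. Retrying only protects against loss if the source can resend the same data, so it works best with replayable sources.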
Compliance: Data teams should keep data privacy and sovereignty standards in mind when ingesting data, and be sure not to ingest anything that could put them in noncompliance—such as personal information that individuals haven't authorized for sale or tracking.
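One way to keep unauthorized personal information out of the pipeline is to filter records at the point of ingestion. The following sketch assumes each record carries a boolean `consent` flag and that `PERSONAL_FIELDS` lists the sensitive fields; both are hypothetical simplifications of what a real privacy policy would require.

```python
# Hypothetical list of fields treated as personal information.
PERSONAL_FIELDS = {"email", "phone"}

def compliant(record):
    """Return a copy of the record that is safe to ingest."""
    if record.get("consent"):
        return dict(record)
    # No consent on file: drop personal fields before ingesting.
    return {k: v for k, v in record.items() if k not in PERSONAL_FIELDS}

incoming = [
    {"id": 1, "email": "ada@example.com", "consent": True},
    {"id": 2, "email": "alan@example.com", "phone": "555-0100", "consent": False},
]
ingested = [compliant(r) for r in incoming]
```

Filtering before storage means the personal fields never reach the destination system.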
Improved ingestion with the cloud
As the volume of enterprise data continues to grow, the importance of effective data management at scale grows with it. Using the elastic resources and cost-effective solutions available through the cloud, organizations can establish data ingestion as a firm foundation for data analytics.
Teradata Vantage, the cloud-ready data warehousing and analytics platform, is designed to ingest data in all major enterprise formats—from JSON and XML to Parquet and CSV—regardless of source. The solution ingests and integrates data from across the entire enterprise data ecosystem, creating a single source of truth from which to draw valuable and actionable insights.
Learn more about Vantage and its ingestion capabilities today.