Monday, December 29, 2025

How Does Data Actually Get Into Fabric?

In Part 1 we answered 'What is Fabric?' and in Part 2 we covered how Fabric organizes data with OneLake, Lakehouses, and Warehouses.

Now the practical question: How does your data actually get into Fabric?

This matters because Fabric only delivers value once data is there, and the ingestion path you choose determines cost, latency, and how much of your existing environment you have to rebuild.

THE INGESTION OPTIONS (with the fine print)

1. Data Factory Pipelines

If you've used Azure Data Factory or SSIS, this will feel familiar. Scheduled pipelines, copy activities, connectors for SQL Server, Oracle, flat files, APIs.

What works: Broad connector support, batch loads -- the mental model you already have.

What they don't make obvious: Fabric Data Factory and Azure Data Factory are separate products with separate roadmaps. Fabric DF is SaaS; ADF is PaaS. Some ADF features — tumbling window triggers, certain orchestration patterns — aren't available in Fabric. Microsoft maintains both without plans to deprecate ADF.

If you're migrating from ADF, don't assume your pipelines will lift-and-shift cleanly.

2. Dataflows Gen2

Power Query at scale. Low-code, browser-based, aimed at analysts who want to shape data without writing SQL or Spark.

What works: Business users can own parts of the pipeline. Supports on-prem data sources via the on-premises data gateway. You choose where the data goes — a Lakehouse, Warehouse, or other Fabric destination — and it's saved as Delta tables.

What else to consider: They handle moderate complexity well but aren't designed for the most demanding transformation scenarios; heavy lifting belongs in pipelines or notebooks.

3. Mirroring

This is where it gets interesting for SQL Server shops. Mirroring continuously replicates data from supported sources into OneLake — no pipelines, no scheduling, no manual refresh.

What works:
  • Near real-time sync using Change Data Capture (SQL Server 2016–2022) or Change Feed (SQL Server 2025)
  • Zero-ETL model: data shows up in Fabric without you building anything
  • Supports on-premises SQL Server, Azure SQL, Cosmos DB, and Snowflake

What is not emphasized:
  • Requires an on-premises data gateway (or virtual network gateway) for non-Azure sources
  • SQL Server 2025 also requires Azure Arc
  • CDC must be enabled on your source tables, which adds overhead to your production system

For organizations running SQL Server -- on-prem or in Azure -- Mirroring is the fastest path to a unified analytics layer, but you must test it under realistic load before promising anyone real-time analytics.
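To make the CDC requirement concrete, here is a minimal sketch of enabling it on a source table before configuring Mirroring. It uses pyodbc to run the standard sys.sp_cdc_enable_db and sys.sp_cdc_enable_table procedures; the server, database, and Orders table names are placeholders, and you could just as easily run the same T-SQL from SSMS.

    import pyodbc

    # Placeholder connection details -- point this at the source SQL Server you plan to mirror.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=sql01;DATABASE=SalesDB;"
        "Trusted_Connection=yes;TrustServerCertificate=yes;",
        autocommit=True,
    )
    cur = conn.cursor()

    # Enable CDC at the database level if it isn't already on.
    cur.execute(
        "IF (SELECT is_cdc_enabled FROM sys.databases WHERE name = DB_NAME()) = 0 "
        "EXEC sys.sp_cdc_enable_db;"
    )

    # Enable CDC on the table you intend to mirror. This adds change-tracking
    # overhead to the production workload -- the cost the bullet above refers to.
    cur.execute("""
        EXEC sys.sp_cdc_enable_table
            @source_schema = N'dbo',
            @source_name   = N'Orders',
            @role_name     = NULL;
    """)
    conn.close()

Keep in mind that CDC relies on SQL Server Agent for its capture and cleanup jobs, which is part of the overhead you're signing up for on the source system.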

4. Shortcuts

Shortcuts don't move data at all. They create a pointer from OneLake to external storage.

What works: No data duplication. Supports ADLS Gen2, Amazon S3, S3-compatible storage, Google Cloud Storage, Azure Blob Storage, Dataverse, Iceberg tables, OneDrive/SharePoint, and on-premises sources via gateway.

What not to forget: Performance depends entirely on the source. A well-tuned ADLS Gen2 container will perform very differently from an unoptimized S3 bucket. Governance also gets more complicated, as you are managing permissions across multiple storage systems while presenting everything through OneLake.

Shortcuts are useful for pilots or bridging legacy systems. They're less suitable as a long-term architectural foundation.
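As a quick illustration of the "pointer, not copy" model, here's a hedged sketch of reading through a shortcut from a Fabric notebook. It assumes a hypothetical shortcut named s3_landing created under the Lakehouse Files area, pointing at Parquet files in an S3 bucket, and a default Lakehouse attached to the notebook so the relative Files/ path resolves.

    from pyspark.sql import SparkSession

    # In a Fabric notebook the Spark session already exists; getOrCreate() just picks it up.
    spark = SparkSession.builder.getOrCreate()

    # The shortcut shows up as an ordinary folder under Files/, but the bytes are
    # fetched from the external store (here, S3) at read time.
    events = spark.read.parquet("Files/s3_landing/events/")

    events.groupBy("event_type").count().show()

The query runs fine either way; what changes is that every read is paying the remote source's performance bill, which is exactly why shortcuts behave only as well as the storage they point at.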

5. Notebooks and Spark

Maximum flexibility. Write Spark code, transform data however you want, and land it in OneLake.

What works: If you have data engineers who know Spark, this is powerful. Complex transformations, streaming, custom logic -- all possible.

What they don't make obvious: Notebooks do not support on-premises data gateway connections. If your source data is on-prem, you cannot use notebooks to pull it directly. You must use pipelines or Dataflows to land the data in a Lakehouse first, then use notebooks for transformation.

This is documented, but easy to miss when designing architecture. Notebooks are for transformation, not ingestion from on-prem sources.
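So the workable pattern for on-prem data is: a pipeline or Dataflow lands the raw table in a Lakehouse, and the notebook only ever touches what is already in OneLake. A minimal sketch, assuming a hypothetical orders_raw Delta table (with made-up column names) has already been landed in the attached Lakehouse:

    from pyspark.sql import SparkSession, functions as F

    # The Fabric notebook provides a session; getOrCreate() reuses it.
    spark = SparkSession.builder.getOrCreate()

    # Read the Delta table the pipeline landed -- no gateway involved at this point.
    orders_raw = spark.read.table("orders_raw")

    orders_clean = (
        orders_raw
        .filter(F.col("order_status") != "cancelled")      # drop rows we don't report on
        .withColumn("order_date", F.to_date("order_ts"))   # derive a date for partitioning
    )

    # Write the result back to the Lakehouse as a managed Delta table.
    orders_clean.write.mode("overwrite").format("delta").saveAsTable("orders_clean")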

WHICH PATH SHOULD YOU TAKE?

Most organizations won't pick just one.

Scenario | Recommended Path | Key Consideration
Nightly batch loads from SQL Server | Data Factory pipelines | Separate product from ADF -- test your patterns
Real-time sync from SQL Server | Mirroring | CDC overhead, gateway requirements
Analyst-driven data prep (including on-prem) | Dataflows Gen2 | Moderate complexity; supports gateway
Existing data in ADLS, S3, GCS | Shortcuts (short-term), pipelines (long-term) | Performance depends on source optimization
Complex transformations after data is in Fabric | Notebooks / Spark | No on-prem gateway support -- land data first


PRACTICAL ADVICE

1. Start with one reporting workload, not your whole environment. Find a report that already depends on multiple data sources. Land copies of those sources into a Lakehouse using pipelines or Mirroring. Build the report. Measure whether anything actually improves.

2. Test Mirroring under realistic load. Enable CDC on a representative production table and observe the impact on your transaction log. Measure replication latency during peak hours. The documentation says 'near real-time' -- you should verify what that means for your workload. A monitoring sketch follows this list.

3. Understand the gateway requirements. On-prem SQL Server mirroring, Dataflows to on-prem sources, and Shortcuts to on-prem storage all require the on-premises data gateway. Notebooks do not support gateway connections at all. Plan accordingly.

4. Don't migrate everything on day one. Fabric is an analytics destination, not a mandate to rip out everything you already have running. Identify the smallest useful workload that benefits from OneLake, prove it out, and expand from there.
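On point 2, here is a minimal monitoring sketch for the source side, assuming the same placeholder server and database as the earlier CDC example. It checks log-space pressure and how far behind the CDC log scan is; end-to-end latency into OneLake still has to be verified separately, for example by comparing a row's update time against when it shows up in the mirrored table.

    import pyodbc

    # Placeholder connection -- the source SQL Server being mirrored.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=sql01;DATABASE=SalesDB;"
        "Trusted_Connection=yes;TrustServerCertificate=yes;"
    )
    cur = conn.cursor()

    # Transaction log pressure: CDC keeps log records active until the capture job has read them.
    cur.execute("SELECT used_log_space_in_percent FROM sys.dm_db_log_space_usage;")
    print("Log space used (%):", cur.fetchone()[0])

    # Recent CDC log-scan sessions: duration and latency are reported in seconds.
    cur.execute("""
        SELECT TOP (5) start_time, end_time, duration, latency
        FROM sys.dm_cdc_log_scan_sessions
        ORDER BY start_time DESC;
    """)
    for row in cur.fetchall():
        print(row)

    conn.close()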

COMING NEXT

In Part 4, we'll cover SQL Database in Microsoft Fabric — an actual transactional database running inside Fabric. What it is, what it isn't, and where it might actually make sense.
