The sample operator in APL pseudo-randomly selects rows from the input dataset at a rate specified by a parameter. This operator is useful when you want to analyze a subset of data, reduce the dataset size for testing, or quickly explore patterns without processing the entire dataset. The sampling algorithm isn’t statistically rigorous, but it gives you a quick way to explore and understand a dataset. For statistically rigorous analysis, use summarize instead.

The sample operator is particularly useful with large datasets, where processing every row is resource-intensive or unnecessary. It’s well suited to scenarios such as log analysis, performance monitoring, and data quality checks.

For users of other query languages

If you come from other query languages, this section explains how to adjust your existing queries to achieve the same results in APL.

In Splunk SPL, the sample command works similarly, returning a random subset of rows. The APL sample operator has a simpler syntax: it takes a single proportion, with no additional arguments for biasing the randomness.

Splunk example

```sql
| sample 10
```

APL equivalent

['sample-http-logs']
| sample 0.1

In ANSI SQL, there is no direct equivalent to the sample operator, but you can achieve similar results using the TABLESAMPLE clause. In APL, sample operates independently and is more flexible, as it’s not tied to a table scan.

SQL example

```sql
SELECT * FROM table TABLESAMPLE (10 ROWS);
```

APL equivalent

['sample-http-logs']
| sample 0.1

Usage

Syntax

| sample ProportionOfRows

Parameters

  • ProportionOfRows: A float greater than 0 and less than 1 that specifies the proportion of rows to return from the dataset. Rows are selected pseudo-randomly.

Returns

The operator returns a table containing approximately the specified proportion of rows, selected pseudo-randomly from the input dataset.
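
Because the parameter is a proportion rather than a row count, the size of the result varies with the size of the input. The Python sketch below models this with a simple assumption: each row is kept independently with probability equal to the proportion. This is an illustration only. The function name `sample_rows` and its `seed` argument are hypothetical, and the source doesn't describe how APL implements sampling internally.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_rows(rows: Iterable[T], proportion: float, seed: Optional[int] = None) -> List[T]:
    """Keep each row independently with probability `proportion`.

    Assumption: this per-row model is illustrative and is not taken from
    the APL implementation.
    """
    # Mirror the documented parameter range: greater than 0 and less than 1.
    if not 0 < proportion < 1:
        raise ValueError("proportion must be greater than 0 and less than 1")
    rng = random.Random(seed)  # local generator; a fixed seed makes runs repeatable
    return [row for row in rows if rng.random() < proportion]


if __name__ == "__main__":
    rows = list(range(100_000))
    for run in range(3):
        kept = sample_rows(rows, 0.05)
        # The count stays close to 5% of the input but usually differs per run.
        print(f"run {run}: kept {len(kept)} of {len(rows)} rows")
```

Under this model, the number of returned rows is close to, but rarely exactly, the proportion multiplied by the input size.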

Use case examples

In this use case, you sample a small proportion of rows from your HTTP logs to quickly analyze trends without working through the entire dataset.

Query

['sample-http-logs']
| sample 0.05

Output

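| _time | req_duration_ms | id | status | uri | method | geo.city | geo.country |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-10-16 12:45:00 | 234 | user1 | 200 | /index | GET | New York | US |
| 2023-10-16 12:47:00 | 120 | user2 | 404 | /login | POST | Paris | FR |
| 2023-10-16 12:48:00 | 543 | user3 | 500 | /checkout | POST | Tokyo | JP |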

This query returns a random subset of approximately 5% of the rows in the HTTP logs, helping you quickly identify potential issues or patterns without analyzing the entire dataset.

In this use case, you sample traces to investigate performance metrics for a particular service across different spans.

Query

['otel-demo-traces']
| where ['service.name'] == 'checkoutservice'
| sample 0.05

Output

| _time | duration | span_id | trace_id | service.name | kind | status_code |
| --- | --- | --- | --- | --- | --- | --- |
| 2023-10-16 14:05:00 | 1.34s | span5678 | trace123 | checkoutservice | client | 200 |
| 2023-10-16 14:06:00 | 0.89s | span3456 | trace456 | checkoutservice | server | 500 |

This query returns approximately 5% of the spans recorded for the checkoutservice, which you can use to identify potential performance bottlenecks.

In this use case, you sample security log data to spot irregular activity in requests, such as HTTP 500 responses.

Query

['sample-http-logs']
| where status == '500'
| sample 0.03

Output

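| _time | req_duration_ms | id | status | uri | method | geo.city | geo.country |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-10-16 14:30:00 | 543 | user4 | 500 | /payment | POST | Berlin | DE |
| 2023-10-16 14:32:00 | 876 | user5 | 500 | /order | POST | London | GB |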

This query returns approximately 3% of the failed requests (HTTP 500 responses), so you can quickly spot them and investigate potential causes of these errors.

List of related operators

  • take: Use take when you want to return the first N rows in the dataset rather than a random subset.
  • where: Use where to filter rows based on conditions rather than sampling randomly.
  • top: Use top to return the highest N rows based on a sorting criterion.
