Sample Operator — axiomhq/docs

The sample operator in APL pseudo-randomly selects rows from the input dataset at a rate specified by a parameter. This operator is useful when you want to analyze a subset of data, reduce the dataset size for testing, or quickly explore patterns without processing the entire dataset. The sampling algorithm isn’t statistically rigorous, but it provides a quick way to explore and understand a dataset. For statistically rigorous analysis, use summarize instead.

The sample operator is most useful with large datasets, where processing every row is resource-intensive or unnecessary. Typical scenarios include log analysis, performance monitoring, and data quality checks.

For users of other query languages

If you come from other query languages, this section explains how to adjust your existing queries to achieve the same results in APL.

In Splunk SPL, the sample command works similarly and returns a random subset of rows. However, the APL sample operator uses a simpler syntax, with no additional arguments for biasing the randomness.

```sql Splunk example
| sample 10
```

['sample-http-logs'] 
| sample 0.1

In ANSI SQL, there is no direct equivalent to the sample operator, but you can achieve similar results using the TABLESAMPLE clause. In APL, sample operates independently and is more flexible, as it’s not tied to a table scan.

```sql SQL example
SELECT * FROM table TABLESAMPLE (10 ROWS);
```

['sample-http-logs'] 
| sample 0.1

Usage

Syntax

| sample ProportionOfRows

Parameters

ProportionOfRows: A float greater than 0 and less than 1 which specifies the proportion of rows to return from the dataset. The rows are selected randomly.

Returns

The operator returns a table containing the specified proportion of rows, selected pseudo-randomly from the input dataset.
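The relationship between ProportionOfRows and the size of the result can be illustrated with a minimal Python sketch. It assumes each row is kept independently with probability equal to the rate. That is an assumption made for illustration, not a description of how APL implements sample, and the helper name `sample_rows` is hypothetical.

```python
import random

def sample_rows(rows, proportion, seed=None):
    """Keep each row independently with probability `proportion`.

    Hypothetical helper: models a rate-based sampler, not APL's actual algorithm.
    """
    if not 0 < proportion < 1:
        raise ValueError("proportion must be greater than 0 and less than 1")
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < proportion]

rows = list(range(10_000))           # stand-in for 10,000 input rows
subset = sample_rows(rows, 0.05, seed=42)

# The result size is close to 5% of the input, but not an exact row count.
print(len(subset), "of", len(rows), "rows kept")
```

Under this model the parameter controls a rate rather than a fixed number of rows, so the output size varies from run to run around proportion × input size.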

Use case examples

In this use case, you sample a small proportion of rows from your HTTP logs to quickly analyze trends without working through the entire dataset.

Query

['sample-http-logs']
| sample 0.05

Output

_time	req_duration_ms	id	status	uri	method	geo.city	geo.country
2023-10-16 12:45:00	234	user1	200	/index	GET	New York	US
2023-10-16 12:47:00	120	user2	404	/login	POST	Paris	FR
2023-10-16 12:48:00	543	user3	500	/checkout	POST	Tokyo	JP

This query returns a random subset of about 5% of the rows in the HTTP logs, helping you quickly identify potential issues or patterns without analyzing the entire dataset.

In this use case, you sample traces to investigate performance metrics for a particular service across different spans.

Query

['otel-demo-traces']
| where ['service.name'] == 'checkoutservice'
| sample 0.05

Output

_time	duration	span_id	trace_id	service.name	kind	status_code
2023-10-16 14:05:00	1.34s	span5678	trace123	checkoutservice	client	200
2023-10-16 14:06:00	0.89s	span3456	trace456	checkoutservice	server	500

This query returns about 5% of the rows recorded for the checkoutservice, which you can use to identify potential performance bottlenecks.

In this use case, you sample security log data to spot irregular activity in requests, such as 500-level HTTP responses.

Query

['sample-http-logs']
| where status == '500'
| sample 0.03

Output

_time	req_duration_ms	id	status	uri	method	geo.city	geo.country
2023-10-16 14:30:00	543	user4	500	/payment	POST	Berlin	DE
2023-10-16 14:32:00	876	user5	500	/order	POST	London	GB

This query helps you quickly spot failed requests (HTTP 500 responses) and investigate any potential causes of these errors.
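In the queries above, where runs before sample, so the sampling rate applies to the rows that pass the filter rather than to the whole dataset. A rough Python sketch of that ordering follows. The log rows are made up, the helper `sample_rows` is hypothetical, and independent per-row selection is an assumption rather than a statement about APL's implementation.

```python
import random

def sample_rows(rows, proportion, rng):
    # Hypothetical model of a rate-based sampler; not APL's implementation.
    return [row for row in rows if rng.random() < proportion]

rng = random.Random(7)

# Made-up log rows: one in ten has status "500".
logs = [{"id": i, "status": "500" if i % 10 == 0 else "200"} for i in range(20_000)]

# Rough equivalent of: | where status == '500' | sample 0.03
errors = [row for row in logs if row["status"] == "500"]
sampled_errors = sample_rows(errors, 0.03, rng)

# The sample is drawn from the 2,000 error rows, not from all 20,000 rows.
print(len(errors), "error rows,", len(sampled_errors), "sampled")
```

Filtering first keeps every sampled row relevant. Sampling first and filtering afterwards would draw from all rows and could leave very few matching rows when the condition is rare.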

List of related operators

take: Use take when you want to return the first N rows in the dataset rather than a random subset.
where: Use where to filter rows based on conditions rather than sampling randomly.
top: Use top to return the highest N rows based on a sorting criterion.
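
The difference between these operators and sample can be sketched in Python on a small list. The values are made up, the APL operators named in the comments are only approximate analogies, and the sampling line again assumes independent per-row selection.

```python
import random

# Made-up req_duration_ms values, in arrival order.
durations = [234, 120, 543, 876, 95, 410, 662, 301]

first_three = durations[:3]                        # like take: the first N rows
top_three = sorted(durations, reverse=True)[:3]    # like top: the highest N rows by a sort key
slow = [d for d in durations if d > 500]           # like where: rows that match a condition

rng = random.Random(1)
sampled = [d for d in durations if rng.random() < 0.5]  # like sample: a pseudo-random subset

print(first_three, top_three, slow, sampled)
```

Only the last line depends on randomness. The other three return the same rows every time for the same input.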
