Machine Learning with ClickHouse

    Pandas: fetching query results over the HTTP interface:

        import io
        import requests
        import pandas as pd

        url = 'http://127.0.0.1:8123?query='
        query = 'SELECT * FROM trips LIMIT 1000 FORMAT TSVWithNames'
        resp = requests.get(url, data=query)
        string_io = io.StringIO(resp.text)
        df = pd.read_csv(string_io, sep='\t')

    How to sample data: SAMPLE x OFFSET y (together with a WHERE condition), or LIMIT N:

        SELECT min(pickup_date), max(pickup_date)
        FROM (SELECT pickup_date FROM trips_mergetree_third LIMIT 1000)

    or WHERE rand() % N < M:

        SELECT trip_id FROM trips
        WHERE (rand() % 1000) = 0
        LIMIT 1
        SETTINGS max_threads = 1

        ┌───trip_id─┐
        │ 960186089 │
        └───────────┘

    (64 pages | 1.38 MB | 1 year ago)
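The pandas fetch in the excerpt above needs a running ClickHouse server at 127.0.0.1:8123, but the parsing step it relies on can be sketched offline. Below is a minimal sketch that parses a canned response in ClickHouse's TSVWithNames format (header line, then tab-separated rows); the column names and values are invented for illustration:

```python
import csv
import io

# A canned response in TSVWithNames format: the first line holds column
# names, the remaining lines are tab-separated values (hypothetical data).
payload = "trip_id\tpickup_date\n960186089\t2015-07-01\n960186090\t2015-07-02\n"

# csv.DictReader with a tab delimiter maps each data row to a dict
# keyed by the header names, just as pandas would label columns.
reader = csv.DictReader(io.StringIO(payload), delimiter="\t")
rows = list(reader)

print(len(rows))           # number of data rows
print(rows[0]["trip_id"])  # first trip id
```

With a live server, `pd.read_csv(io.StringIO(resp.text), sep='\t')` performs the same parse directly into a DataFrame.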
ClickHouse in Production

    Introspection in custom headers:

        WITH histogram(5)(rand() % 100) AS hist
        SELECT arrayJoin(hist).3 AS h, bar(h, 0, 6, 5) AS b
        FROM (SELECT * FROM system.numbers LIMIT 20)

        ┌─────h─┬─b────┐
        │  4.25 │ ███▎ │ …

    A Parquet query (truncated in the excerpt):

        … 'Parquet')

        Ok. 0 rows in set. Elapsed: 0.004 sec.

    In ClickHouse: most clicked banner:

        SELECT countIf(CounterType = 'Show') AS SumShows,
               countIf(CounterType = 'Click') AS SumClicks,
               BannerID
        FROM EventLogHDFS
        GROUP BY BannerID
        ORDER BY SumClicks DESC
        LIMIT 3;

    (100 pages | 6.86 MB | 1 year ago)
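The `countIf` aggregation in the "most clicked banner" query above is a conditional count per group followed by a top-N sort. A minimal offline sketch of the same computation (the event rows here are invented):

```python
from collections import defaultdict

# (BannerID, CounterType) event rows, hypothetical data.
events = [
    (1, "Show"), (1, "Click"), (1, "Show"),
    (2, "Show"), (2, "Show"), (2, "Click"), (2, "Click"),
    (3, "Show"),
]

# GROUP BY BannerID with countIf(CounterType = ...) per group.
stats = defaultdict(lambda: {"SumShows": 0, "SumClicks": 0})
for banner_id, counter_type in events:
    if counter_type == "Show":
        stats[banner_id]["SumShows"] += 1
    elif counter_type == "Click":
        stats[banner_id]["SumClicks"] += 1

# ORDER BY SumClicks DESC LIMIT 3.
top = sorted(stats.items(), key=lambda kv: kv[1]["SumClicks"], reverse=True)[:3]
print(top[0])
```

ClickHouse evaluates the same thing in one streaming pass per group, which is why `countIf` is preferred over two separate filtered subqueries.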
Testing the ClickHouse We Deserve (Тестирование ClickHouse, которого мы заслуживаем)

    Fuzzing under sanitizers; example generated queries:

        SELECT metroHash64(uniqUpTo('\0', '2[Vu'), 'Y&d');
        SELECT joinGet(toDateTimeOrNull((CAST(([885455.14523]) AS String))));
        SELECT (SELECT 1) FROM remote('127.0.0.{1,2}', system…

    Integration tests, example: node1:9018 (192.168.2.1) and node2:9018 (192.168.2.2), an hdfs1 instance, and zoo1/zoo2/zoo3 behind a network blockade; the client runs

        INSERT INTO tt SELECT * FROM hdfs('hdfs://hdfs1:9000/tt', 'TSV')

    and then checks

        SELECT COUNT() FROM tt

    (84 pages | 9.60 MB | 1 year ago)
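The sanitizer fuzzing described above feeds randomly nested function calls to the server and watches for crashes. A toy query generator in the same spirit, purely as a sketch: the function list, literal pool, and depth limit are arbitrary choices, not what the talk's fuzzer actually uses:

```python
import random

# Hypothetical pools of function names and literals to combine.
FUNCS = ["metroHash64", "toDateTimeOrNull", "joinGet", "uniqUpTo"]
LITERALS = ["'2[Vu'", "1", "[885455.14523]", "NULL"]

def random_expr(rng, depth=0):
    # Nest random functions around random literals, with bounded depth
    # so recursion always terminates.
    if depth >= 3 or rng.random() < 0.3:
        return rng.choice(LITERALS)
    func = rng.choice(FUNCS)
    args = ", ".join(random_expr(rng, depth + 1) for _ in range(rng.randint(1, 2)))
    return f"{func}({args})"

rng = random.Random(42)  # fixed seed for reproducible output
queries = [f"SELECT {random_expr(rng)};" for _ in range(3)]
for q in queries:
    print(q)
```

A real fuzzer would submit each query to the server and treat any sanitizer report or crash, rather than a query error, as a finding.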
Continue to Use ClickHouse as TSDB

    [slide figures: sample rows of a wide time-series table (Time, Name, Age, HeartRate, longitude, latitude, …), with the cells needed by the query highlighted in red]

        SELECT HeartRate FROM ... WHERE Time BETWEEN ... AND ... AND Name = 'Tom'

    (42 pages | 911.10 KB | 1 year ago)
UDF in ClickHouse

    Modules: Module = Input + Task + Output; Task = a query or an external program; Query = "CREATE TABLE ... AS SELECT ...". A database system and an ML pipeline in one.

    Why ClickHouse: limited self-joining on time series; ease of use and maintainability. Skewness, three ways:

        SELECT skewPop(x) FROM data

        SELECT centralMoment(3)(x) / pow(stddevPop(x), 3) FROM data

        SELECT (sum(pow(x, 3)) / count()
                - 3 * sum(pow(x, 2)) * sum(x) …

    Windowed aggregate functions (their UDFs):

        SELECT windowRefer(30)(date, value) FROM data
        SELECT windowReferEx(30, 'quantile(0.2)')(date, value) FROM data

    (29 pages | 1.54 MB | 1 year ago)
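The three skewness queries above compute the same statistic: the third central moment divided by the cube of the population standard deviation, which expands into raw moments as E[x³] − 3·mean·E[x²] + 2·mean³ over stddev³. The equivalence can be checked numerically (sample data made up):

```python
import math

x = [1.0, 2.0, 2.0, 3.0, 7.0]
n = len(x)
mean = sum(x) / n

# Population variance / stddev, what stddevPop computes.
var = sum((v - mean) ** 2 for v in x) / n
std = math.sqrt(var)

# Form 1: third central moment over stddev^3,
# i.e. centralMoment(3)(x) / pow(stddevPop(x), 3).
m3 = sum((v - mean) ** 3 for v in x) / n
skew_central = m3 / std ** 3

# Form 2: the same expression expanded in raw moments,
# matching the sum(pow(x, 3)) / count() - ... query.
ex3 = sum(v ** 3 for v in x) / n
ex2 = sum(v ** 2 for v in x) / n
skew_raw = (ex3 - 3 * ex2 * mean + 2 * mean ** 3) / std ** 3

assert abs(skew_central - skew_raw) < 1e-12
print(skew_central)
```

This is also what `skewPop(x)` returns, so all three queries should agree up to floating-point error.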
ClickHouse at Ximalaya (Shanghai Meetup 2019)

    User-path / funnel analysis with arrays:

        SELECT user,
               groupArray(page) AS pages,
               groupArray(timestamp) AS timestamps,
               arrayEnumerate(pages) AS index
        FROM (SELECT * FROM client_log_all ORDER BY timestamp)
        GROUP BY user

        SELECT user,
               groupArray(page) AS pages,
               groupArray(timestamp) AS timestamps,
               arrayEnumerate(pages) AS index,
               arrayFilter((i, p) -> (pages[i] = 'HomePage' … level_3
        FROM (SELECT * FROM client_log_all ORDER BY timestamp)
        GROUP BY user

    (28 pages | 6.87 MB | 1 year ago)
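The `groupArray`/`arrayFilter` pattern above collects each user's pages in timestamp order and then picks out positions matching a funnel step. A plain-Python sketch of the same idea (the log rows are invented):

```python
from collections import defaultdict

# (user, timestamp, page) rows, a hypothetical client log.
log = [
    ("u1", 3, "Player"),
    ("u1", 1, "HomePage"),
    ("u1", 2, "Search"),
    ("u2", 1, "Search"),
    ("u2", 2, "Player"),
]

# groupArray(page) per user over the log ordered by timestamp.
pages = defaultdict(list)
for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
    pages[user].append(page)

# arrayFilter analogue: indices (0-based here; ClickHouse arrays are
# 1-based) where the page equals the funnel entry point.
entry = {u: [i for i, p in enumerate(ps) if p == "HomePage"]
         for u, ps in pages.items()}
print(pages["u1"], entry["u1"], entry["u2"])
```

In the real query, later funnel levels are derived by filtering for pages that follow these entry positions within a time window.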
ClickHouse: Present and Future (ClickHouse: настоящее и будущее)

    … several tables and views: for atomic inserts across a cluster; for running multiple SELECTs against a single snapshot. In development, planned for Q2 2022.

    Insufficient SQL compatibility. The JSON data type:

        CREATE TABLE games (data String) ENGINE = MergeTree ORDER BY tuple();
        SELECT JSONExtractString(data, 'teams', 1, 'name') FROM games;  -- 0.520 sec.

        CREATE TABLE games (data JSON) ENGINE = MergeTree;
        SELECT data.teams.name[1] FROM games;

    (32 pages | 2.62 MB | 1 year ago)
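The `JSONExtractString(data, 'teams', 1, 'name')` call above walks a JSON path with 1-based array indexing. The same lookup in plain Python, for reference (the document contents are invented):

```python
import json

# A hypothetical games row stored as a JSON string, as in the
# "data String" table variant.
data = '{"teams": [{"name": "Alpha"}, {"name": "Beta"}]}'

doc = json.loads(data)
# ClickHouse JSON path indices are 1-based; Python lists are 0-based,
# so index 1 in the query maps to [0] here.
name = doc["teams"][1 - 1]["name"]
print(name)
```

The point of the slide is that the `data JSON` column type lets the second query (`data.teams.name[1]`) read only the needed subcolumn instead of parsing the whole string per row.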
ClickHouse on Hundreds of Billions of Rows per Day at Qutoutiao (Clickhouse玩转每天千亿数据-趣头条)

    Business context:
    1. Event data reported by Qutoutiao and Midu is distinguished by event type (eventType).
    2. The metrics system has both per-interval and cumulative metrics.
    3. Metrics are generally broken down by eventType.

        select count(1) from table where dt='' and timestamp>='' and timestamp<='' and eventType=''

    Not enough thought went into table design at creation time …

    Analysis:
    1. max_memory_usage caps the maximum memory a single SQL query may use on one machine.
    2. Aside from that, simple SQL has O(1) space complexity, e.g.:

        select count(1) from table where column=value
        select column1, column2 from table where column=value

    Anything involving GROUP BY, ORDER BY, DISTINCT …

    (14 pages | 1.10 MB | 1 year ago)
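The O(1)-space point above is that a filtered count keeps only a running counter, never materializing rows, whereas GROUP BY/ORDER BY/DISTINCT must hold per-group or per-row state. A sketch of the streaming-count case (the event stream is generated on the fly and entirely hypothetical):

```python
# Hypothetical event stream: (dt, timestamp, eventType) tuples.
def events():
    for i in range(10_000):
        yield ("2023-01-01", i, "click" if i % 4 == 0 else "view")

# select count(1) ... where eventType='click'
#   and timestamp between 100 and 199:
# one running counter, O(1) memory regardless of stream length.
count = sum(
    1
    for dt, ts, et in events()
    if et == "click" and 100 <= ts <= 199
)
print(count)
```

A GROUP BY over the same stream would instead need a hash table with one entry per distinct key, which is what `max_memory_usage` ends up limiting.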
16 results in total.













