ClickHouse in Production
  Excerpt (Yandex.Metrika architecture diagram, repeated on slides 10-12 of 97; labels only): Site visitor, Message Broker, Daemons Pipeline (mt-nano, mt-log, mt-giga), Key-Value Store, Dictionary Data, DB, Java API, BI Tool, Site Owner, Data analyst.
  100 pages | 6.86 MB | 1 year ago
1. Machine Learning with ClickHouse
  Excerpt: pulling query results into pandas over the HTTP interface (url and query are defined elsewhere in the deck):

    import io
    import pandas as pd
    import requests

    # the query text travels in the request body; results come back as TSV
    resp = requests.get(url, data=query)
    string_io = io.StringIO(resp.text)
    table = pd.read_csv(string_io, sep="\t")

  How to sample data -- you already know it: LIMIT N, or a WHERE predicate; SAMPLE x OFFSET y gives a fixed sample per query, but only for MergeTree tables:

    CREATE TABLE trips_sample_time
    (
        pickup_datetime DateTime
    )
    ENGINE = MergeTree
    ORDER BY sipHash64(pickup_datetime) ...

  How to store a trained model: you can store the model as an aggregate function state in a separate table:

    CREATE TABLE models ENGINE = MergeTree ORDER BY tuple()
    AS SELECT stochasticLinearRegressionState(total_amount ...

  64 pages | 1.38 MB | 1 year ago
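  Both DDL fragments above are cut off mid-statement. A minimal sketch of how they typically complete, following standard ClickHouse idioms rather than the deck itself (the SAMPLE BY clause, the feature columns, and the evalMLMethod step are assumptions):

    CREATE TABLE trips_sample_time
    (
        pickup_datetime DateTime,
        total_amount Float32,       -- assumed feature column
        trip_distance Float32       -- assumed feature column
    )
    ENGINE = MergeTree
    ORDER BY sipHash64(pickup_datetime)    -- the sampling key must be part of the primary key
    SAMPLE BY sipHash64(pickup_datetime);

    -- deterministic ~10% sample, skipping the first half of the key space
    SELECT count() FROM trips_sample_time SAMPLE 1/10 OFFSET 1/2;

    -- train once, persist the state, apply it later with evalMLMethod
    CREATE TABLE models ENGINE = MergeTree ORDER BY tuple()
    AS SELECT stochasticLinearRegressionState(total_amount, trip_distance) AS state
    FROM trips_sample_time;

    WITH (SELECT state FROM models) AS model
    SELECT evalMLMethod(model, trip_distance) AS predicted_amount
    FROM trips_sample_time;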
0. Machine Learning with ClickHouse
  (Same deck as the previous entry; the excerpt is identical.)
  64 pages | 1.38 MB | 1 year ago
7. UDF in ClickHouse
  Excerpt: Pipeline = Directed Acyclic Graph (DAG) of modules; Module = Input + Task + Output; Task = query or external program; Query = "CREATE TABLE ... AS SELECT ..." (each computing task writes a result table). Why ClickHouse: a database with few dependencies; customization thanks to the straightforward code structure and the well-designed API; we maintain a custom build. The UDF magic: ... provided by the user. UDFs in ClickHouse: scalar functions; aggregate functions & combinators; table functions & storage engines. Usage examples in our ML systems: data preprocessing, filling invalid ...
  29 pages | 1.54 MB | 1 year ago
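  A minimal sketch of the module pattern the excerpt names -- one query materializing one result table (table and column names are hypothetical, not from the deck):

    -- Module: input = raw_events, task = the query below, output = daily_stats
    CREATE TABLE daily_stats
    ENGINE = MergeTree ORDER BY day
    AS SELECT
        toDate(event_time) AS day,
        count() AS events
    FROM raw_events
    GROUP BY day;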
2. 腾讯 clickhouse实践 _2019丁晓坤&熊峰 (Tencent ClickHouse in practice, 2019, Ding Xiaokun & Xiong Feng)
  Excerpt: high memory, cheap storage; per-node configuration: 128 GB RAM, 24 CPU cores, 20 TB SATA in RAID 5, 10 GbE NIC. Deployment and monitoring: the production layout is a Distributed table over Shard01/Shard02/Shard03, each shard carrying replicas, behind load balancing. A bitmap-based distributed computing engine: API Server, Scheduler, SQL-Parser, Query Optimizer; DataNodes hold columns (Column1 ... ColumnN) and partitions (Partition0 ... PartitionM) for apps app-2 ... app-n, reached over RPC.
  26 pages | 3.58 MB | 1 year ago
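  A hedged sketch of the sharded-plus-replicated layout the excerpt describes (cluster, database, table, and column names are hypothetical):

    CREATE TABLE events_local ON CLUSTER my_cluster
    (
        event_time DateTime,
        uid UInt64
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
    ORDER BY (uid, event_time);

    -- one logical table fanning out over Shard01..Shard03
    CREATE TABLE events_all ON CLUSTER my_cluster AS events_local
    ENGINE = Distributed(my_cluster, default, events_local, rand());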
3. Sync Clickhouse with MySQL_MongoDB
  Excerpt: possible solutions and why they fall short: (1) you can't update/delete a table frequently in ClickHouse; (2) the MySQL engine is not suitable for big tables, nor for MongoDB; (3) re-initializing the whole table every day ... PTS key features: only one config file needed for a new ClickHouse table; init and keep syncing data in one app per table; sync multiple data sources to ClickHouse in minutes. PTS components include Provider and Transform (providers: mongodb, redis, ...); config fragment:

    Listen: binlog,                                   // binlog, kafka
    DataSource: user:pass@tcp(example.com:3306)/user,
    Table: user,
    QueryKeys: [ id ],                                // usually primary key
    Pairs: {                                          // field mapping
      id: id,
      name: name ...

  38 pages | 7.13 MB | 1 year ago
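  For context on point (1): ClickHouse row changes are asynchronous mutations that rewrite whole data parts, which is why frequent per-row updates are ruled out. A minimal illustration (table and values hypothetical):

    -- each statement schedules a mutation that rewrites the affected parts
    ALTER TABLE user UPDATE name = 'alice' WHERE id = 42;
    ALTER TABLE user DELETE WHERE id = 42;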
8. Continue to use ClickHouse as TSDB
  Excerpt: two candidate models: (1) a column-oriented model, (2) a time-series-oriented model. The column-oriented model puts each metric in its own column:

    CREATE TABLE demonstration.insert_view
    (
        `Time` DateTime,
        `Name` String,
        `Age` UInt8,
        ...,
        `HeartRate` ...
    )
    PARTITION BY toYYYYMM(Time)
    ORDER BY (Name, Time, Age, ...);

  A second variant declares `Name` as LowCardinality(String). Throughput figures from the deck: 5.19 GB processed (168.64 million rows/s., 6.07 GB/s.). The time-series-oriented model keys rows by metric name instead:

    CREATE TABLE demonstration.test
    (
        `time_series_interval` DateTime,
        `metric_name` String,
        `Name` ...

  42 pages | 911.10 KB | 1 year ago
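  A sketch of how the time-series-oriented table typically continues -- one row per (metric, interval) with an explicit value column (everything past the columns shown in the excerpt is an assumption):

    CREATE TABLE demonstration.test
    (
        `time_series_interval` DateTime,
        `metric_name` String,
        `Name` LowCardinality(String),
        `value` Float64               -- assumed measurement column
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(time_series_interval)
    ORDER BY (metric_name, Name, time_series_interval);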
ClickHouse: настоящее и будущее (ClickHouse: Present and Future)
  Excerpt: graph processing; batch jobs; Data Hub. Support for semistructured data -- the JSON data type:

    CREATE TABLE games (data JSON) ENGINE = MergeTree;

  You can insert arbitrary nested JSONs; types are automatically inferred. On the games dataset:

    CREATE TABLE games (data String) ENGINE = MergeTree ORDER BY tuple();
    SELECT JSONExtractString(data, 'teams', 1, 'name') FROM games;   -- 0.520 sec.

    CREATE TABLE games (data JSON) ENGINE = ...
    SELECT data.teams.name[1] FROM games;                            -- 0.015 sec.

  The inferred type:

    DESCRIBE TABLE games SETTINGS describe_extend_object_types = 1
    name: data
    type: Tuple(`_id.$oid` String, `date ...

  32 pages | 2.62 MB | 1 year ago
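  A minimal sketch of feeding such a table, assuming the JSONAsObject input format (the sample document is invented; on versions where the type was still experimental, the first setting is required):

    SET allow_experimental_object_type = 1;

    CREATE TABLE games (data JSON) ENGINE = MergeTree ORDER BY tuple();

    -- whole documents go in as-is
    INSERT INTO games FORMAT JSONAsObject {"teams": [{"name": "Team A"}, {"name": "Team B"}]}

    -- dotted paths address the inferred subcolumns directly
    SELECT data.teams.name[1] FROM games;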
ClickHouse: настоящее и будущее (ClickHouse: Present and Future)
  (Same deck as the previous entry; the excerpt is identical.)
  32 pages | 776.70 KB | 1 year ago
2. Clickhouse玩转每天千亿数据-趣头条 (ClickHouse handling hundreds of billions of rows a day at Qutoutiao)
  Excerpt: (1) event data reported by Qutoutiao and Midu is distinguished by event type (eventType); (2) the metrics system has both time-sliced and cumulative metrics; (3) metrics are generally broken down by eventType:

    select count(1) from table
    where dt='' and timestamp>='' and timestamp<='' and eventType=''

  Table design lacked deep thought: given the nature of the time-sliced metrics, our table is ORDER ... On memory: (1) max_memory_usage caps the memory a single SQL query may use on that machine; (2) apart from simple SQL, space complexity is O(1), e.g.:

    select count(1) from table where column=value
    select column1, column2 from table where column=value

  Any SQL involving GROUP BY, ORDER BY, DISTINCT, or JOIN is no longer O(1) in memory.
  14 pages | 1.10 MB | 1 year ago
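  A small sketch of the memory cap and the O(1) distinction the excerpt describes (the limit value is illustrative):

    -- per-query memory ceiling on this server (~10 GB here)
    SET max_memory_usage = 10000000000;

    -- O(1) memory: a streaming scan and count
    SELECT count(1) FROM table WHERE column = value;

    -- no longer O(1): the hash table grows with key cardinality
    SELECT column, count(1) FROM table GROUP BY column;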
共 15 条
- 1
- 2
