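Parquet is a storage format. It is independent of any language or platform and does not need to be tied to any particular data processing framework. Still, an open-source technology needs a suitable ecosystem behind it to grow, and Spark is one of Parquet's main supporters. As an in-memory parallel computing engine, Spark is widely used for stream processing, offline processing, and similar scenarios. It has supported Parquet since 1.0.0, which makes the data convenient to work with. Apache Arrow is a new open-source project under the Apache Software Foundation, and also a top-level project. Its goal is to serve as a cross-platform data layer that speeds up big-data analytics projects.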
In the data mining team the working language is Python, so pyarrow is the natural choice for writing Parquet. So:
[Flowchart. Nodes: model data, parquet-hive, batch processing, Impala query, Hive query, downstream processing. Edge labels: arrow, spark, SQL, SQL, spark etc.]
This is the basic structure, but an incompatibility has now shown up. For one batch of data, the Impala query:
The Hive query: the Spark query also returns a bunch of NULLs.
Reference: https://zhuanlan.zhihu.com/p/113213420 Inspecting the schema of the data file shows:
{"index_columns": [], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "platform_type", "field_name": "platform_type", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "account_id", "field_name": "account_id", "pandas_type": "int32", "numpy_type": "int64", "metadata": null}, {"name": "identify_id", "field_name": "identify_id", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "N1", "field_name": "N1", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "N2", "field_name": "N2", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "N3", "field_name": "N3", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "N4", "field_name": "N4", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "N5", "field_name": "N5", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "ID1", "field_name": "ID1", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "ID2", "field_name": "ID2", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "ID3", "field_name": "ID3", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "ID4", "field_name": "ID4", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "ID5", "field_name": "ID5", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "IDA", "field_name": "IDA", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "NA", "field_name": "NA", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "update_time", "field_name": "update_time", "pandas_type": "int32", "numpy_type": "int32", "metadata": null}], "creator": {"library": "pyarrow", "version": "0.17.1"}, "pandas_version": "0.25.3"}
Then I looked at the pyarrow write code, whereas the correct version should be: This shows that Impala does not read the schema strictly. It works the same way as with text files: as long as the column order lines up, the query is fine.
Fix: make the two consistent (in the data and in the table).
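To make the difference between matching columns by position and matching them by name concrete, here is a minimal toy model in plain Python. It is not the actual Impala, Hive, or Spark implementation. The table definition and the row values are hypothetical, and the post does not say which columns actually differed, so the mismatched names below are placeholders.

```python
# Toy model of two ways a query engine can map table columns onto the
# columns stored in a Parquet file. Illustration only; this is not the
# actual Impala, Hive, or Spark implementation.


def resolve_by_position(table_columns, row):
    """Give the i-th table column the i-th stored value, ignoring names."""
    return {
        col: (row[i] if i < len(row) else None)
        for i, col in enumerate(table_columns)
    }


def resolve_by_name(table_columns, file_columns, row):
    """Give each table column the value stored under exactly the same name."""
    index = {name: i for i, name in enumerate(file_columns)}
    return {
        col: (row[index[col]] if col in index else None)
        for col in table_columns
    }


# A few column names as written in the file (from the metadata shown earlier).
file_columns = ["platform_type", "account_id", "N1", "ID1"]
# Hypothetical table definition whose last two names differ from the file.
table_columns = ["platform_type", "account_id", "name_1", "id_1"]
# Hypothetical row values.
row = ["web", 42, "alice", "x-001"]

by_position = resolve_by_position(table_columns, row)
by_name = resolve_by_name(table_columns, file_columns, row)
print(by_position)  # every column gets a value
print(by_name)      # the two mismatched columns come back as None

# Once the names are consistent (here the table side is renamed; renaming
# the file side behaves the same in this model), both strategies agree.
fixed_table = list(file_columns)
after_fix_position = resolve_by_position(fixed_table, row)
after_fix_name = resolve_by_name(fixed_table, file_columns, row)
print(after_fix_position == after_fix_name)
```

In this model the position-based reader never notices a naming mismatch, while the name-based reader returns None for every column it cannot find. That matches the symptom described above, where one engine returned data and the others returned NULLs.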

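One possible way to check that the data and the table agree is to compare the column names recorded in the file's pandas metadata with the table's column list, position by position. The sketch below uses only the standard library. The metadata string is a shortened fragment of the JSON shown earlier, the table column list is hypothetical, and both helper functions are illustrative names rather than part of any library.

```python
import json


def file_column_names(metadata_json):
    """Return the column names, in order, from a pandas metadata JSON string."""
    meta = json.loads(metadata_json)
    return [column["name"] for column in meta["columns"]]


def compare_column_names(file_columns, table_columns):
    """Describe every position where the file and the table disagree."""
    problems = []
    if len(file_columns) != len(table_columns):
        problems.append(
            f"column count differs: file={len(file_columns)}, "
            f"table={len(table_columns)}"
        )
    for pos, (f_name, t_name) in enumerate(zip(file_columns, table_columns)):
        if f_name != t_name:
            problems.append(
                f"position {pos}: {f_name!r} in file vs {t_name!r} in table"
            )
    return problems


# Shortened fragment of the metadata shown earlier (first three columns kept).
metadata_json = """
{"columns": [
  {"name": "platform_type", "field_name": "platform_type",
   "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "account_id", "field_name": "account_id",
   "pandas_type": "int32", "numpy_type": "int64", "metadata": null},
  {"name": "N1", "field_name": "N1",
   "pandas_type": "unicode", "numpy_type": "object", "metadata": null}
]}
"""

# Hypothetical table definition; replace with the real column list.
table_columns = ["platform_type", "account_id", "name_1"]

file_columns = file_column_names(metadata_json)
problems = compare_column_names(file_columns, table_columns)
for problem in problems:
    print(problem)
```

An empty result means names and order match. The check covers names only; comparing types would additionally need a mapping between pandas types and the table's types, which is left out here.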